Computational detection of natural selection in gene family expansion and contraction


orrelated with the
number of expansions and contractions.
Identification of unusually evolving gene families in Saccharomyces
As explained above, the PGM also allows us to compute p-values to identify gene families
that are highly unlikely under the random BD process. Of the 1254 gene families that differed
in number between genomes, 58 had p-values less than 0.01 (about 35 are expected by chance among all 3517 families). The unlikely
families are summarized in Table 2, along with the specific branch that is responsible for the
violation (when such a branch could be identified). The two methods that we used to identify
the offending branch agreed in most cases (see Table 2).
For the first four families identified in Table 2 the observed gene family sizes are so unlikely
that it is hard to determine where any one unlikely event occurred. Two of these gene families
are of unknown function, and the other two are transposable elements (TEs). While it is
interesting to see these large changes, transposable elements violate the assumptions of the
BD model in a number of ways and it can therefore be seen as a validation of our approach
that they are identified as unlikely (see Discussion).
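A minimal sketch of how a single-branch transition probability under the BD null could be computed, assuming equal per-gene birth and death rates λ and the standard closed-form result for the linear birth-death process (as found in texts such as Bailey 1964). The function names, the size cap max_size, and the tail-style branch p-value are illustrative assumptions, not the PGM procedure used in the paper; the ancestral size of 2 in the usage line is hypothetical.

```python
from math import comb

def bd_transition_prob(s: int, c: int, lam: float, t: float) -> float:
    """P(family size c after time t | size s at time 0), equal birth/death rate lam."""
    if s == 0:
        return 1.0 if c == 0 else 0.0  # zero genes is an absorbing state
    alpha = lam * t / (1.0 + lam * t)
    return sum(
        comb(s, j) * comb(s + c - j - 1, s - 1)
        * alpha ** (s + c - 2 * j) * (1.0 - 2.0 * alpha) ** j
        for j in range(min(s, c) + 1)
    )

def branch_p_value(s: int, c: int, lam: float, t: float, max_size: int = 200) -> float:
    """Total probability of outcomes no more likely than the observed size c.

    Assumption: sizes above max_size carry negligible probability.
    """
    p_obs = bd_transition_prob(s, c, lam, t)
    total = 0.0
    for k in range(max_size + 1):
        p_k = bd_transition_prob(s, k, lam, t)
        if p_k <= p_obs * (1.0 + 1e-12):  # small tolerance for float ties
            total += p_k
    return total

# Hypothetical example: ancestral size 2 growing to 34 on a 12-million-year branch
p = branch_p_value(2, 34, lam=0.002, t=12)
print(f"p = {p:.3g}")
```

The sketch is only meant for small values of λt, as in this data set; for λt above 1 the alternating terms in the sum can lose numerical precision.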
9. DISCUSSION
In this paper we have presented and evaluated a method for studying the evolution of
gene families over a phylogeny. Based on data from multiple whole genomes, the method can
be used to examine the rates and direction of change in gene family size among taxa. Our
method also allows for hypothesis testing: we have shown how we can identify gene families
that have had unlikely histories given a model of random gene birth and death. Importantly,
the PGM methodology used here scales linearly with the number of new genomes added; the
most challenging aspect of future analyses may simply be getting reliable phylogenetic trees for
the species considered. This PGM approach is conceptually similar to the maximum-likelihood
approach taken by others to study the evolution of phenotypic quantitative characters (e.g.
Pagel 1999).
Our analyses have revealed a large number of changes in gene family size across the
Saccharomyces tree: 1254 of 3517 families changed in size. Every branch of the phylogeny was
inferred to have changes along it, with longer branches having commensurately more changes
(Table 1). One concern we had prior to our analysis was that the uneven sequence coverage
of these five genomes would affect our results; this did not appear to be the case. S. cerevisiae
is in fact the only eukaryote with a fully sequenced genome; all of the other yeast genomes
are covered to differing extents. S. paradoxus was sequenced to 7X coverage (i.e., shotgun
sequencing was done equivalent to seven times the length of the genome), while S. bayanus,
S. kudriavzevii, and S. mikatae were sequenced to 2-3X (Cliften et al. 2003). Despite this
unevenness among taxa, our results do not seem to have been affected: S. kudriavzevii and S.
mikatae were predicted to have both the largest number of genes and the largest number of
gene family expansions. If the lack of sequence coverage had been a problem we would have
expected these genomes to show fewer genes and smaller gene family sizes on average.
As described above, the null BD model can be used to test whether gene families are on
average diffusing evenly along the tree. This model can be violated when processes such as
natural selection give a direction to the expected random walk, causing extreme expansions or
contractions to gene family size. We were able to detect such changes on almost every branch
of the tree, and on every external branch leading to an extant species. In cases where we
did not reject the null hypothesis it does not mean that natural selection is not acting on
members of a gene family, only that we cannot detect its role in affecting the differences in
size of the family. Natural selection may have played a role in the fixation of a small number
of duplicates within a family, but, much like other statistical tests in molecular evolution, we
only have the power to detect the repeated occurrence of events.
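To make the random-walk picture concrete, the sketch below simulates the null process for one family along one branch. Under equal birth and death rates the expected size stays at its starting value, so a consistent shift in one direction is what a departure from the null would look like. The function name, the seed, the starting size of 10, and the number of replicates are illustrative assumptions; λ = 0.002 and t = 32 are the values quoted in the text.

```python
import random

def simulate_family_size(s: int, lam: float, t: float, rng: random.Random) -> int:
    """Gillespie-style simulation of one family under equal birth and death rates."""
    size, clock = s, 0.0
    while size > 0:
        # Each gene duplicates at rate lam and is lost at rate lam,
        # so the waiting time to the next event has rate 2 * lam * size.
        clock += rng.expovariate(2.0 * lam * size)
        if clock > t:
            break
        size += 1 if rng.random() < 0.5 else -1  # birth or death, equally likely
    return size

rng = random.Random(0)  # fixed seed so the run is reproducible
sizes = [simulate_family_size(10, 0.002, 32, rng) for _ in range(20000)]
mean_size = sum(sizes) / len(sizes)
print(f"mean final size = {mean_size:.3f} (started at 10)")
```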
One of the most extreme examples that we found was in the helicase family, where S.
cerevisiae has 34 members of this family while none of the other species have more than 3. We
were also able to identify a significant expansion of the flocculin gene family in S. cerevisiae,
a change that is unsurprising considering the fact that flocculation has been selected for in
the domestication of this brewer’s yeast (Jin and Speers 1998). As with other genes that have
undergone artificial selection during domestication (e.g. Wang et al. 1999), the flocculins carry
the signature of adaptive natural selection. This is the first example to our
knowledge, however, of selection on gene family size being implicated in domestication.
Any inference of natural selection with our method comes with a number of caveats that
must be mentioned. One caveat is that we have implicitly assumed that there is no relationship
between family size and duplication and deletion rates. It may be, for instance, that large
gene families are more likely to undergo non-homologous pairing, unequal crossing over, and
therefore more duplication and eventual fixation due to drift (Li 1997). A homogeneous birth
and death model may also not be absolutely correct for small gene families, as under the
BD model families will always eventually reach the absorbing state of zero genes. Because
many genes appear to be conserved over very long periods of time (e.g. Theissen et al. 2003),
there may be a decreased loss rate in small families in order to prevent extinction of required
10 CHI NGUYEN, NELLO CRISTIANINI
gene functions. The possibility of non-homogeneities in very large or very small gene families
suggests that models incorporating these processes be studied. Karev et al. (2002) found
that a random BD model with added parameters for birth and death rates for the largest and
smallest families fit the distribution of gene families in a single genome slightly better than a
completely homogeneous model. The improved fit to the data, however, was not shown to be
significantly better than models without the two extra parameters. The framework we have
provided here should allow for the testing of models that include heterogeneous gain and loss
rates across gene families. Although large families are expected to show greater change in
number between species simply because there are more chances for gain and loss (and the
opposite is true for small families), we will in the future be able to test whether the observed
changes are more or less than are expected.
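Two points in this paragraph can be made quantitative with standard results for the linear birth-death process with equal rates: a family of s genes is extinct after time t with probability (λt / (1 + λt))^s, and the variance of its size after time t is 2sλt. The sketch below evaluates both, under the assumption of a single homogeneous λ; the function names and the example family sizes are illustrative, while λ = 0.002 and t = 32 are values quoted in the text.

```python
def extinction_probability(s: int, lam: float, t: float) -> float:
    """Probability that a family of s genes has no members left after time t."""
    alpha = lam * t / (1.0 + lam * t)
    return alpha ** s

def size_variance(s: int, lam: float, t: float) -> float:
    """Variance of family size after time t, starting from s genes."""
    return 2.0 * s * lam * t

lam, t = 0.002, 32.0  # rate estimate and longest branch length from the text
for s in (1, 5, 30):  # illustrative family sizes
    print(s, extinction_probability(s, lam, t), size_variance(s, lam, t))
```

Under this model the variance grows linearly with family size, which is the baseline expectation against which size-dependent rates would have to be tested.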
The issue of gene families having intrinsically different birth and death rates extends
beyond the consideration of family size. For example, one family of genes that does not
follow this assumption is transposable elements (TEs): they can multiply in number in a
non-Mendelian manner, and are often selected against by the organisms they inhabit. Because the
parameters for gain and loss of TEs can be quite different from those for other gene families
(see, e.g., Kidwell 2002; Li 1997), the disparity in TE number between genomes can be due to
processes unique to this family. So our finding that TEs are at the top of our list of unusual
gene families is not surprising. Analyses of transposable element families or other genomic
parasites using the BD model, therefore, should not be parameterized with gain and loss rates
inferred from the majority of protein-coding genes.
In addition to the assumption of equivalent birth and death mechanisms among families,
one other very important aspect of any random point process is the assumption of independence
among individual genes. The BD model assumes that each gene in a family has an
independent probability of being duplicated or deleted; in reality, any large-scale chromosomal
duplication, deletion, or polyploidization may act on multiple members of a family at once. This is
potentially a common violation of the model in light of the frequency of larger scale duplications
and deletions that include gene duplicates (Friedman and Hughes 2001). As a result, we
cannot compare taxa that are separated by a whole genome duplication in the same manner as
has been presented here. This also means that any unusual gene family should be examined in
more detail to determine the nature of the changes in gene family size; obvious duplications of
large regions containing multiple members of a family, for example, may moderate conclusions
about natural selection.
Our hypothesis-testing framework requires an estimate of λ, the birth and death parameter
determining the rate of evolution. In the above sections we show how we can estimate the
value of λ that makes the entire dataset maximally likely (using Expectation Maximization);
reassuringly, the resulting value we obtained (0.002 per million years) is within a factor of two of the
previous estimate of λ found using data from only S. cerevisiae (0.004 per million years;
Lynch and Conery 2003). In the future we hope to extend the model by making it possible
to allow λ to vary along branches of a phylogenetic tree or by allowing the birth and death
rates to be unequal on any branch. We can also analyze the data under a range of values for
the branch lengths, t, as the analyses presented here assume that the estimates are accurate.
These refinements may then provide a clearer picture of the evolution of gene family size.
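As a rough illustration of what "maximally likely" means for λ, the sketch below performs a plain grid search over λ for a set of branches whose ancestral and descendant sizes are both taken as known. This is a deliberately simplified stand-in, not the Expectation Maximization over the whole tree described in the paper (which has to handle unobserved ancestral sizes). The transition probability is the standard closed form for equal birth and death rates; the function names, the grid, and the toy observations are all hypothetical.

```python
from math import comb, log

def bd_transition_prob(s: int, c: int, lam: float, t: float) -> float:
    """P(size c after time t | size s >= 1 at time 0), equal birth/death rate lam."""
    alpha = lam * t / (1.0 + lam * t)
    return sum(
        comb(s, j) * comb(s + c - j - 1, s - 1)
        * alpha ** (s + c - 2 * j) * (1.0 - 2.0 * alpha) ** j
        for j in range(min(s, c) + 1)
    )

def log_likelihood(lam: float, observations) -> float:
    """Sum of log transition probabilities over independent branches."""
    return sum(log(bd_transition_prob(s, c, lam, t)) for s, c, t in observations)

def grid_search_lambda(observations, grid):
    """Return the grid value of lam with the highest log-likelihood."""
    return max(grid, key=lambda lam: log_likelihood(lam, observations))

# Hypothetical (ancestral size, descendant size, branch length in My) triples;
# ancestral sizes must be at least 1 for this sketch.
toy = [(1, 1, 12), (2, 2, 12), (3, 4, 12), (5, 5, 32),
       (2, 1, 32), (10, 11, 22), (4, 4, 22), (1, 2, 27)]
grid = [k * 0.0005 for k in range(1, 41)]  # 0.0005 to 0.02 per My
lam_hat = grid_search_lambda(toy, grid)
print(f"grid-search estimate of lambda: {lam_hat:.4f} per My")
```

The same likelihood function could be re-evaluated under alternative branch lengths t to gauge how sensitive an estimate is to errors in the tree.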
Table 1. The number of gene families that showed an expansion, no change, or a contraction
along the 8 branches, according to the most likely assignments of the gene family sizes of the
ancestors. The first column contains the branch number, along with the length of the branch,
t, in millions of years. The last column shows the average gene family expansion among all
families along each branch, where a contraction is counted as a negative expansion.
Branch # (t, My)   Expansions   No change   Contractions   Average expansion
1 (t = 32)                 97        3181            239              -0.050
2 (t = 27)                383        3032            102               0.095
3 (t = 22)                509        2922             86               0.147
4 (t = 12)                 96        3383             38               0.019
5 (t = 12)                 44        3426             47               0.021
6 (t = 5)                   3        3491             23              -0.005
7 (t = 10)                 10        3313            194              -0.052
8 (t = 5)                   2        3515              0               0.001
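The counts in Table 1 can be checked for internal consistency: on every branch, expansions, unchanged families, and contractions should add up to the 3517 families analysed. The short sketch below performs that check and also reports the number of changed families per million years on each branch. The variable names are illustrative; the numbers are copied from Table 1.

```python
TOTAL_FAMILIES = 3517

# (branch, t in My, expansions, no change, contractions), copied from Table 1
table1 = [
    (1, 32, 97, 3181, 239),
    (2, 27, 383, 3032, 102),
    (3, 22, 509, 2922, 86),
    (4, 12, 96, 3383, 38),
    (5, 12, 44, 3426, 47),
    (6, 5, 3, 3491, 23),
    (7, 10, 10, 3313, 194),
    (8, 5, 2, 3515, 0),
]

for branch, t, up, same, down in table1:
    # Every family falls into exactly one of the three categories
    assert up + same + down == TOTAL_FAMILIES, f"branch {branch} does not sum"
    changed = up + down
    print(f"branch {branch}: {changed} changed, "
          f"{changed / t:.1f} per My, net {up - down:+d}")
```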
Table 2 shows the gene families identified as unlikely under the BD model. The first
column gives the gene family name; the second column describes the gene family size among
the five Saccharomyces species in Newick notation. The third column gives the branch that is
predicted to be responsible for the overall low p-value of the family; two numbers are provided,
the first one from the branch deletion method (method 1), the second one from the transition
probabilities along each branch (method 2). In most cases both methods give the same answer.
Newick numbers in bold indicate the branch identified by method 1. The fourth column gives
the resulting p-value after deleting the responsible branch as identified by method 1, and the
last column gives the p-value of the least likely branch transition as computed in method 2.
Note that for the first four gene families neither method was able to identify one single branch
that violates the BD model, and only method 2 was able to identify a branch for the fifth and
sixth families listed. The four gene families that were missed by the approximate sampling
method are marked with an asterisk in the first column.
Table 2
Family name                             Family sizes            Pred.   Method 1  Method 2
                                        (Newick notation)       branch  p-value   p-value
Transposon                              (2 (8 (15 (34 83))))    ?/?     <0.01
Unknown                                 (7 (16 (7 (20 17))))    ?/?     <0.01
Transposon                              (17 (14 (15 (1 5))))    ?/?     <0.01
Unknown                                 (5 (11 (14 (4 2))))     ?/?     <0.01
Stress response                         (15 (33 (24 (30 31))))  ?/1     <0.01     0.000
Flocculation                            (10 (6 (8 (11 14))))    ?/2     <0.01     0.002
Amino acid biosynthesis                 (3 (8 (6 (6 5))))       1/1     0.137     0.001
*PGM/PMM                                (1 (3 (3 (2 1))))       1/3     0.045     0.007
*Ribosomal L1                           (1 (4 (1 (1 1))))       2/2     0.661     0.000
Elongation factor                       (1 (4 (2 (1 1))))       2/2     0.197     0.003
Chaperone                               (1 (4 (2 (2 1))))       2/2     0.112     0.003
Phosphatidylinositol 4-kinase           (2 (9 (4 (2 2))))       2/2     0.064     0.000
Carbamoyl-phosphate synthase            (2 (6 (5 (3 3))))       2/1     0.048     0.003
Alpha/beta hydrolase                    (2 (2 (6 (2 2))))       3/3     0.777     0.000
Dihydrouridine synthase                 (1 (1 (6 (1 1))))       3/3     0.657     0.000
Type I phosphodiesterase                (1 (1 (4 (1 1))))       3/3     0.657     0.000
Guanine nucleotide exchange factor      (2 (2 (5 (2 3))))       3/3     0.243     0.006
DNA binding domain                      (2 (2 (5 (2 1))))       3/3     0.199     0.000
Ankyrin repeat                          (1 (2 (7 (1 1))))       3/3     0.195     0.000
-Unknown -Unknown                       (1 (2 (4 (1 1))))       3/3     0.195     0.002
Acetate transporter                     (2 (4 (5 (2 2))))       3/3     0.118     0.006
*TruD                                   (1 (1 (3 (1 2))))       3/3     0.115     0.000
*Unknown                                (1 (1 (3 (2 1))))       3/3     0.115     0.000
Flavodoxin                              (2 (3 (5 (1 1))))       3/7     0.110     0.000
Swi2/Snf2 ATPase                        (17 (20 (25 (18 15))))  3/3     0.061     0.000
GTPase-activating protein               (2 (4 (6 (3 2))))       3/1     0.047     0.004
Maltose transport                       (4 (7 (8 (5 4))))       3/1     0.043     0.010
Trichothecene pump                      (5 (5 (7 (10 6))))      4/4     0.331     0.000
RNA polymerase Rpb1                     (4 (3 (5 (7 4))))       4/4     0.252     0.000
ATPase                                  (1 (1 (2 (3 1))))       4/4     0.122     0.000
MAL transcription factor                (2 (5 (4 (7 4))))       4/4     0.086     0.000
Hydroxymethylpyrimidine synthesis       (3 (5 (2 (7 4))))       4/4     0.015     0.000
Ribosomal protein (60S)                 (2 (1 (1 (1 3))))       5/5     0.305     0.000
eIF4E-associated protein                (1 (2 (1 (1 3))))       5/5     0.228     0.000
Hydrolase                               (8 (11 (12 (11 7))))    5/5     0.161     0.000
Metal-dependent phosphohydrolases       (1 (1 (2 (1 5))))       5/5     0.122     0.000
Sortilin                                (5 (4 (7 (4 8))))       5/5     0.045     0.000
Helicase                                (1 (3 (3 (2 34))))      5/5     0.038     0.000
NAD kinase                              (3 (1 (1 (2 4))))       5/5     0.038     0.001
Hydroxyisocaproate dehydrogenases       (3 (1 (2 (1 3))))       5/5     0.038     0.002
ABC transporter                         (15 (18 (17 (12 8))))   5/5     0.013     0.000
Thiol oxidase                           (1 (1 (4 (2 3))))       6/3     0.212     0.002
Leucine rich repeat                     (4 (3 (1 (2 1))))       6/1     0.076     0.027
HSP70 Chaperone                         (13 (17 (18 (12 13))))  7/7     0.141     0.006
-Transcription factor                   (1 (3 (3 (1 1))))       7/3     0.124     0.007
  -PolIII transcription factor
  -Cytoplasmic protein that binds Tor2p
  -Ribosomal SSU (40S)
  -Adenylate cyclase activity, G-protein signaling
  -RRM1
Myosin                                  (5 (9 (9 (5 5))))       7/7     0.068     0.001
Cation transport enzymes                (8 (10 (13 (6 5))))     7/7     0.048     0.000
S-methyltransferase                     (2 (5 (5 (1 1))))       7/7     0.037     0.000
-PDRE transcription factor              (1 (4 (4 (1 1))))       7/3     0.024     0.002
  -Component of peripheral vacuolar membrane protein complex
1,3-beta-D-glucan synthase              (3 (8 (7 (3 3))))       7/7     0.015     0.000
10. CONCLUSION
This paper has attempted to provide the model needed to study gene family evolution
among multiple whole genomes. The methodology can be used for parameter estimation,
inferences on the direction and magnitude of evolutionary change, and hypothesis-testing. As
more genome sequences become available, we hope that this framework makes it possible to
identify the genetic changes that are responsible for the phenotypic diversity found in nature.
Correlated changes between families or with environmental conditions can then tell us about
the mechanisms and modes of natural selection.
REFERENCES
[1] N. T. J. Bailey, The elements of stochastic processes, John Wiley & Sons, Inc., New York, 1964.
[2] R. R. Copley, L. Goodstadt, and C. Ponting, Eukaryotic domain evolution inferred from
genome comparisons, Current Opinion in Genetics & Development 13 (2003) 623–628.
[3] P. Cliften, P. Sudarsanam, A. Desikan, L. Fulton, B. Fulton, J. Majors, R. Waterston,
B. A. Cohen, and M. Johnston, Finding functional features in Saccharomyces genomes by
phylogenetic footprinting, Science 301 (2003) 71–76.
[4] J. H. Darwin, The behaviour of an estimator for a simple birth and death process,
Biometrika 43 (1956) 23–31.
[5] R. Friedman and A. L. Hughes, Gene duplication and the structure of eukaryotic
genomes, Genome Research 11 (2001) 373–381.
[6] M. A. Huynen and E. Van Nimwegen, The frequency distribution of gene family sizes in
complete genomes, Molecular Biology and Evolution 15 (1998) 583–589.
[7] Y. L. Jin and R. A. Speers, Flocculation of Saccharomyces cerevisiae, Food Res. Int.
31 (1998) 421–440.
[8] M. I. Jordan, Graphical models, to appear in Statistical Science 2004 (Special issue on
Bayes Statistics).
[9] S. Karlin and H. M. Taylor, A first course in stochastic processes, Academic Press, New
York, 1975.
[10] G. P. Karev, Y. I. Wolf, A. Y. Rzhetsky, F. S. Berezovskaya, and E. V. Koonin, Birth
and death of protein domains: A simple model of evolution explains power law behavior,
BMC Evolutionary Biology 2 (2) (2002).
[11] M. Kellis, N. Patterson, M. Endrizzi, B. Birren, and E. Lander, Sequencing and
comparison of yeast species to identify genes and regulatory elements, Nature 423 (2003)
241–254.
[12] M. G. Kidwell, Transposable elements and the evolution of genome size in eukaryotes,
Genetica 115 (2002) 49–63.
[13] E. S. Lander, L. M. Linton, B. Birren, C. Nusbaum, M. C. Zody et al., Initial sequencing
and analysis of the human genome, Nature 409 (2001) 860–921.
[14] O. Lespinet, Y. I. Wolf, E. V. Koonin, and L. Aravind, The role of lineage-specific gene
family expansion in the evolution of eukaryotes, Genome Research 12 (2002) 1048–1059.
[15] W. H. Li, Molecular evolution, Sinauer Associates, Sunderland, Mass., 1997.
[16] M. Lynch and J. S. Conery, The evolutionary fate and consequences of duplicate genes,
Science 290 (2000) 1151–1155.
[17] M. Lynch and J. S. Conery, The evolutionary demography of duplicate genes, Journal
of Structural and Functional Genomics 3 (2003) 35–44.
[18] M. Nei, X. Gu, and T. Sitnikova, Evolution by the birth-and-death process in multigene
families of the vertebrate immune system, PNAS 94 (1997) 7799–7806.
[19] J. G. Oakeshott, C. Claudianos, R. J. Russell, and G. C. Robin, Carboxyl/cholinesterases:
a case study of the evolution of a successful multigene family, BioEssays 21 (1999)
1031–1042.
[20] M. D. Pagel, The maximum likelihood approach to reconstructing ancestral character
states of discrete characters on phylogenies, Syst. Biol. 48 (1999) 612–622.
[21] J. Qian, N. M. Luscombe, and M. Gerstein, Protein family and fold occurrence in genomes:
Power-law behaviour and evolutionary model, Journal of Molecular Biology 313 (2001)
673–681.
[22] W. J. Reed and B. D. Hughes, A model explaining the size distribution of gene and
protein families, Mathematical Biosciences 189 (2004) 97–102.
[23] A. Rokas, B. L. Williams, N. King, and S. B. Carroll, Genome-scale approaches to
resolving incongruence in molecular phylogenies, Nature 425 (2003) 798–804.
[24] H. J. Sims and K. J. McConway, Nonstochastic variation of species-level diversification
rates within angiosperms, Evolution 57 (2003) 460–479.
[25] B. Snel, P. Bork, and M. A. Huynen, Genomes in flux: The evolution of archaeal and
proteobacterial gene content, Genome Research 12 (2002) 17–25.
[26] R. L. Tatusov, E. V. Koonin, and D. J. Lipman, A genomic perspective on protein families,
Science 278 (1997) 631–637.
[27] U. Theissen, M. Hoffmeister, M. Grieshaber, and W. Martin, Single eubacterial origin
of eukaryotic sulfide:quinone oxidoreductase, a mitochondrial enzyme conserved from the
early evolution of eukaryotes during anoxic and sulfidic times, Molecular Biology and
Evolution 20 (2003) 1564–1574.
[28] R. L. Wang, A. Stec, J. Hey, L. Lukens, and J. Doebley, The limits of selection during
maize domestication, Nature 398 (1999) 236–239.
[29] Z. H. Yang and J. P. Bielawski, Statistical methods for detecting molecular evolution,
Trends in Ecology and Evolution 15 (2000) 496–503.
Received on March 7, 2005
Revised on October 15, 2006
