Computational detection of natural selection in gene family expansion and contraction
Summary of Computational detection of natural selection in gene family expansion and contraction: ...actical implementation of the algorithm, we need to make the assumption that the maximal gene family size is limited. However, since the conditional probability distribution associated with the BD model drops off quickly for large values, this assumption is very reasonable for a large enough upp...are indicated in Figure 1 as time, t, in million years. We estimated the evolutionary rate parameter λ as 0.002 per million years. In the 32 million years since the most recent common ancestor of the five species, 1254 of the 3517 gene families shared among them have changed in size; the remainin...ls that include heterogeneous gain and loss rates across gene families. Although large families are expected to show greater change in number between species simply because there are more chances for gain and loss (and the opposite is true for small families), we will in the future be able to test w...
...correlated with the number of expansions and contractions.

Identification of unusually evolving gene families in Saccharomyces

As explained above, the PGM also allows us to compute p-values to identify gene families that are highly unlikely under the random BD process. Of the 1254 gene families that differed in number between genomes, 58 had p-values less than 0.01 (35 are expected). The unlikely families are summarized in Table 2, along with the specific branch that is responsible for the violation (when such a branch could be identified). The two methods that we used to identify the offending branch agreed in most cases (see Table 2). For the first four families identified in Table 2, the observed gene family sizes are so unlikely that it is hard to determine where any one unlikely event occurred. Two of these gene families are of unknown function, and the other two are transposable elements (TEs). While it is interesting to see these large changes, transposable elements violate the assumptions of the BD model in a number of ways, and it can therefore be seen as a validation of our approach that they are identified as unlikely (see Discussion).

9. DISCUSSION

In this paper we have presented and evaluated a method for studying the evolution of gene families over a phylogeny. Based on data from multiple whole genomes, the method can be used to examine the rates and direction of change in gene family size among taxa. Our method also allows for hypothesis testing: we have shown how we can identify gene families that have had unlikely histories given a model of random gene birth and death. Importantly, the PGM methodology used here scales linearly with the number of new genomes added; the most challenging aspect of future analyses may simply be getting reliable phylogenetic trees for the species considered.
This PGM approach is conceptually similar to the maximum-likelihood approach taken by others to study the evolution of phenotypic quantitative characters (e.g. Pagel 1999).

Our analyses have revealed a large number of changes in gene family size across the Saccharomyces tree: 1254 of 3517 families changed in size. Every branch of the phylogeny was inferred to have changes along it, with longer branches having commensurately more changes (Table 1). One concern we had prior to our analysis was that the uneven sequence coverage of these five genomes would affect our results; this did not appear to be the case. S. cerevisiae is in fact the only eukaryote with a fully sequenced genome; all of the other yeast genomes are covered to differing extents. S. paradoxus was sequenced to 7X coverage (i.e., shotgun sequencing was done equivalent to seven times the length of the genome), while S. bayanus, S. kudriavzevii, and S. mikatae were sequenced to 2-3X (Cliften et al. 2003). Despite this unevenness among taxa, our results do not seem to have been affected: S. kudriavzevii and S. mikatae were predicted to have both the largest number of genes and the largest number of gene family expansions. If the lack of sequence coverage had been a problem, we would have expected these genomes to show fewer genes and smaller gene family sizes on average.

As described above, the null BD model can be used to test whether gene families are on average diffusing evenly along the tree. This model can be violated when processes such as natural selection give a direction to the expected random walk, causing extreme expansions or contractions in gene family size. We were able to detect such changes on almost every branch of the tree, and on every external branch leading to an extant species.
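
To make the null model concrete, the following Python sketch evaluates how probable a given change in family size is along a single branch under a BD process with equal birth and death rates. It assumes the standard closed-form transition probability of the linear birth-death process, a result of the kind covered in stochastic-process textbooks, which this part of the paper does not reproduce. The function names, the truncation at max_size, and the example family sizes are hypothetical and chosen only for illustration; the rate of 0.002 per million years and the branch length of 12 million years are taken from the text and Table 1.

```python
from math import comb


def bd_transition(s, c, lam, t):
    """P(family has c genes after time t | it started with s genes).

    Assumes a linear birth-death process whose per-gene birth and death
    rates are both equal to lam (an assumption of this sketch).
    """
    if s == 0:
        # Zero genes is an absorbing state.
        return 1.0 if c == 0 else 0.0
    a = lam * t / (1.0 + lam * t)
    return sum(
        comb(s, j) * comb(s + c - j - 1, s - 1)
        * a ** (s + c - 2 * j) * (1.0 - 2.0 * a) ** j
        for j in range(min(s, c) + 1)
    )


def upper_tail(s, c, lam, t, max_size=200):
    """P(size >= c after time t | size s), truncated at max_size genes."""
    return sum(bd_transition(s, k, lam, t) for k in range(c, max_size + 1))


lam, t = 0.002, 12.0  # rate from the text; a 12-million-year branch from Table 1
# Hypothetical (ancestor, descendant) sizes, for illustration only.
for start, end in [(2, 3), (2, 6), (3, 34)]:
    print(f"{start} -> at least {end}: {upper_tail(start, end, lam, t):.3e}")
```

The tail probability shrinks rapidly as the descendant size moves away from the ancestral size, which is the sense in which a large change on a single branch is unlikely under the random BD process.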
In cases where we did not reject the null hypothesis, it does not mean that natural selection is not acting on members of a gene family, only that we cannot detect its role in affecting the differences in size of the family. Natural selection may have played a role in the fixation of a small number of duplicates within a family, but, much like other statistical tests in molecular evolution, we only have the power to detect the repeated occurrence of events.

One of the most extreme examples that we found was in the helicase family, where S. cerevisiae has 34 members of this family while none of the other species have more than 3. We were also able to identify a significant expansion of the flocculin gene family in S. cerevisiae, a change that is unsurprising considering the fact that flocculation has been selected for in the domestication of this brewer’s yeast (Jin and Speers 1998). As with other genes that have undergone artificial selection during domestication (e.g. Wang et al. 1999), we detected the signature of adaptive natural selection on the flocculins. This is the first example to our knowledge, however, of selection on gene family size being implicated in domestication.

Any inference of natural selection with our method comes with a number of caveats that must be mentioned. One caveat is that we have implicitly assumed that there is no relationship between family size and duplication and deletion rates. It may be, for instance, that large gene families are more likely to undergo non-homologous pairing, unequal crossing over, and therefore more duplication and eventual fixation due to drift (Li 1997). A homogeneous birth and death model may also not be absolutely correct for small gene families, as under the BD model families will always eventually reach the absorbing state of zero genes. Because many genes appear to be conserved over very long periods of time (e.g. Theissen et al.
2003), there may be a decreased loss rate in small families in order to prevent extinction of required gene functions. The possibility of non-homogeneities in very large or very small gene families suggests that models incorporating these processes be studied. Karev et al. (2002) found that a random BD model with added parameters for birth and death rates for the largest and smallest families fit the distribution of gene families in a single genome slightly better than a completely homogeneous model. The improved fit to the data, however, was not shown to be significantly better than models without the two extra parameters. The framework we have provided here should allow for the testing of models that include heterogeneous gain and loss rates across gene families. Although large families are expected to show greater change in number between species simply because there are more chances for gain and loss (and the opposite is true for small families), we will in the future be able to test whether the observed changes are more or less than are expected.

CHI NGUYEN, NELLO CRISTIANINI

The issue of gene families having intrinsically different birth and death rates extends beyond the consideration of family size. For example, one family of genes that does not follow this assumption is transposable elements (TEs): they can multiply in number in a non-Mendelian manner, and are often selected against by the organisms they inhabit. Because the parameters for gain and loss of TEs can be quite different from those for other gene families (see, e.g. Kidwell 2002; Li 1997), the disparity in TE number between genomes can be due to processes unique to this family. Our finding that TEs are at the top of our list of unusual gene families is therefore not surprising. Analyses of transposable element families or other genomic parasites using the BD model should therefore not be parameterized with gain and loss rates inferred from the majority of protein coding genes.
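
Two standard properties of the equal-rate BD model make the points about family size concrete. Starting from s genes, the probability that the family has reached the absorbing state of zero genes by time t is α^s, with α = λt/(1 + λt), and the variance of the family size at time t is 2sλt. The short Python sketch below simply evaluates both quantities; the function names are illustrative, and λ = 0.002 per million years and t = 32 million years are the values reported in the paper.

```python
def alpha(lam, t):
    """Probability that a single gene lineage is extinct after time t."""
    return lam * t / (1.0 + lam * t)


def extinction_probability(s, lam, t):
    """P(family of s genes has zero genes after time t), equal-rate BD model."""
    return alpha(lam, t) ** s


def size_variance(s, lam, t):
    """Variance of family size after time t when starting from s genes."""
    return 2.0 * s * lam * t


lam, t = 0.002, 32.0  # values reported in the paper
for s in (1, 2, 5, 20):
    print(
        f"s={s:2d}  P(extinct)={extinction_probability(s, lam, t):.3e}  "
        f"sd of size={size_variance(s, lam, t) ** 0.5:.3f}"
    )
```

Under these assumptions the spread in size grows with the square root of the starting size, while the chance of losing the whole family is appreciable only for the smallest families, which is where a reduced loss rate would matter most.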
In addition to the assumptions of equivalent birth and death mechanisms among families, one other very important aspect of any random point process is the assumption of independence among individual genes. The BD model assumes that each gene in a family has an independent probability of being duplicated or deleted; however, any large-scale chromosomal duplication, deletion, or polyploidization may act on multiple members of a family at once. This is potentially a common violation of the model in light of the frequency of larger scale duplications and deletions that include gene duplicates (Friedman and Hughes 2001). As a result, we cannot compare taxa that are separated by a whole genome duplication in the same manner as has been presented here. This also means that any unusual gene family should be examined in more detail to determine the nature of the changes in gene family size; obvious duplications of large regions containing multiple members of a family, for example, may moderate conclusions about natural selection.

Our hypothesis-testing framework requires an estimate of λ, the birth and death parameter determining the rate of evolution. In the above sections we show how we can estimate the value of λ that makes the entire dataset maximally likely (using Expectation Maximization); reassuringly, the resulting value we obtained (0.002 per million years) is within a factor of two of the previous estimate of λ found using data from only S. cerevisiae (0.004 per million years; Lynch and Conery 2003). In the future we hope to extend the model by making it possible to allow λ to vary along branches of a phylogenetic tree or by allowing the birth and death rates to be unequal on any branch. We can also analyze the data under a range of values for the branch lengths, t, as the analyses presented here assume that the estimates are accurate. These refinements may then provide a clearer picture of the evolution of gene family size.
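
The paper estimates λ with Expectation Maximization over the full PGM. As a simpler stand-in that targets the same objective, the Python sketch below computes the likelihood of the observed leaf counts with a pruning (sum-product) pass over the tree and picks the λ on a grid that maximizes it. Everything specific here is an assumption made for illustration: the tree is a ladder-shaped topology suggested by the Newick nesting in Table 2, with branch lengths read from Table 1; the prior over root sizes is uniform; family sizes are truncated at MAX_SIZE; and the five toy families are invented rather than taken from the data.

```python
from functools import lru_cache
from math import comb, log

MAX_SIZE = 20  # assumed cap on family size; observed counts must not exceed it


def bd_transition(s, c, lam, t):
    """P(c genes after time t | s genes), equal birth and death rate lam."""
    if s == 0:
        return 1.0 if c == 0 else 0.0
    a = lam * t / (1.0 + lam * t)
    return sum(
        comb(s, j) * comb(s + c - j - 1, s - 1)
        * a ** (s + c - 2 * j) * (1.0 - 2.0 * a) ** j
        for j in range(min(s, c) + 1)
    )


@lru_cache(maxsize=None)
def transition_matrix(lam, t):
    """Truncated transition matrix, cached per (lam, t) pair."""
    n = MAX_SIZE + 1
    return [[bd_transition(s, c, lam, t) for c in range(n)] for s in range(n)]


# Assumed ladder-shaped tree. A leaf is an index into the tuple of counts;
# an internal node is a pair of (child, branch_length) entries.
INNER3 = ((3, 12.0), (4, 12.0))
INNER2 = ((2, 22.0), (INNER3, 10.0))
INNER1 = ((1, 27.0), (INNER2, 5.0))
ROOT = ((0, 32.0), (INNER1, 5.0))


def partial_likelihood(node, counts, lam):
    """P(leaf counts below node | node has s genes), for s = 0..MAX_SIZE."""
    n = MAX_SIZE + 1
    if isinstance(node, int):
        return [1.0 if c == counts[node] else 0.0 for c in range(n)]
    like = [1.0] * n
    for child, t in node:
        matrix = transition_matrix(lam, t)
        below = partial_likelihood(child, counts, lam)
        for s in range(n):
            like[s] *= sum(matrix[s][c] * below[c] for c in range(n))
    return like


def family_log_likelihood(counts, lam):
    """Log-likelihood of one family, uniform prior on root sizes 1..MAX_SIZE."""
    root = partial_likelihood(ROOT, counts, lam)
    return log(sum(root[1:]) / MAX_SIZE)


def estimate_lambda(families, grid):
    """Grid value of lam that maximizes the summed log-likelihood."""
    return max(
        grid, key=lambda lam: sum(family_log_likelihood(f, lam) for f in families)
    )


# Invented toy families (one count per leaf), for illustration only.
FAMILIES = [(1, 1, 1, 1, 1), (2, 2, 3, 2, 2), (1, 2, 1, 1, 1),
            (3, 3, 3, 3, 4), (2, 2, 2, 2, 2)]
GRID = [0.0005 * k for k in range(1, 21)]
lam_hat = estimate_lambda(FAMILIES, GRID)
print("grid estimate of lambda:", lam_hat)
```

A grid search is used here only because it is short and easy to check; it is not the EM procedure described in the paper, and it would need a finer grid or a proper optimizer to give a usable estimate on real data.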
Table 1. The number of gene families that showed an expansion, no change, or a contraction along the 8 branches, according to the most likely assignments of the gene family sizes of the ancestors. The first column contains the branch number, along with the length of the branch, t, in millions of years. The last column shows the average gene family expansion among all families along each branch, where a contraction is counted as a negative expansion.

Branch #      Expansions   No change   Contractions   Average expansion
1 (t = 32)        97          3181         239             -0.050
2 (t = 27)       383          3032         102              0.095
3 (t = 22)       509          2922          86              0.147
4 (t = 12)        96          3383          38              0.019
5 (t = 12)        44          3426          47              0.021
6 (t = 5)          3          3491          23             -0.005
7 (t = 10)        10          3313         194             -0.052
8 (t = 5)          2          3515           0              0.001

Table 2 shows the gene families identified as unlikely under the BD model. The first column gives the gene family name; the second column describes the gene family size among the five Saccharomyces species in Newick notation. The third column gives the branch that is predicted to be responsible for the overall low p-value of the family; two numbers are provided, the first one from the branch deletion method (method 1), the second one from the transition probabilities along each branch (method 2). In most cases both methods give the same answer. Newick numbers in bold indicate the branch identified by method 1. The fourth column gives the resulting p-value after deleting the responsible branch as identified by method 1, and the last column gives the p-value of the least likely branch transition as computed in method 2. Note that for the first four gene families neither method was able to identify one single branch that violates the BD model, and only method 2 was able to identify a branch for the fifth and sixth families listed. The four gene families that were missed by the approximate sampling method are marked with an asterisk in the first column.
Table 2

Family name                                       Family sizes (Newick)     Pred. branch  Method 1  Method 2
Transposon                                        (2 (8 (15 (34 83))))      ?/?           <0.01
Unknown                                           (7 (16 (7 (20 17))))      ?/?           <0.01
Transposon                                        (17 (14 (15 (1 5))))      ?/?           <0.01
Unknown                                           (5 (11 (14 (4 2))))       ?/?           <0.01
Stress response                                   (15 (33 (24 (30 31))))    ?/1           <0.01     0.000
Flocculation                                      (10 (6 (8 (11 14))))      ?/2           <0.01     0.002
Amino acid biosynthesis                           (3 (8 (6 (6 5))))         1/1           0.137     0.001
*PGM/PMM                                          (1 (3 (3 (2 1))))         1/3           0.045     0.007
*Ribosomal L1                                     (1 (4 (1 (1 1))))         2/2           0.661     0.000
Elongation factor                                 (1 (4 (2 (1 1))))         2/2           0.197     0.003
Chaperone                                         (1 (4 (2 (2 1))))         2/2           0.112     0.003
Phosphatidylinositol 4-kinase                     (2 (9 (4 (2 2))))         2/2           0.064     0.000
Carbamoyl-phosphate synthase                      (2 (6 (5 (3 3))))         2/1           0.048     0.003
Alpha/beta hydrolase                              (2 (2 (6 (2 2))))         3/3           0.777     0.000
Dihydrouridine synthase                           (1 (1 (6 (1 1))))         3/3           0.657     0.000
Type I phosphodiesterase                          (1 (1 (4 (1 1))))         3/3           0.657     0.000
Guanine nucleotide exchange factor                (2 (2 (5 (2 3))))         3/3           0.243     0.006
DNA binding domain                                (2 (2 (5 (2 1))))         3/3           0.199     0.000
Ankyrin repeat                                    (1 (2 (7 (1 1))))         3/3           0.195     0.000
-Unknown                                          (1 (2 (4 (1 1))))         3/3           0.195     0.002
-Unknown
Acetate transporter                               (2 (4 (5 (2 2))))         3/3           0.118     0.006
*TruD                                             (1 (1 (3 (1 2))))         3/3           0.115     0.000
*Unknown                                          (1 (1 (3 (2 1))))         3/3           0.115     0.000
Flavodoxin                                        (2 (3 (5 (1 1))))         3/7           0.110     0.000
Swi2/Snf2 ATPase                                  (17 (20 (25 (18 15))))    3/3           0.061     0.000
GTPase-activating protein                         (2 (4 (6 (3 2))))         3/1           0.047     0.004
Maltose transport                                 (4 (7 (8 (5 4))))         3/1           0.043     0.010
Trichothecene pump                                (5 (5 (7 (10 6))))        4/4           0.331     0.000
RNA polymerase Rpb1                               (4 (3 (5 (7 4))))         4/4           0.252     0.000
ATPase                                            (1 (1 (2 (3 1))))         4/4           0.122     0.000
MAL transcription factor                          (2 (5 (4 (7 4))))         4/4           0.086     0.000
Hydroxymethylpyrimidine synthesis                 (3 (5 (2 (7 4))))         4/4           0.015     0.000
Ribosomal protein (60S)                           (2 (1 (1 (1 3))))         5/5           0.305     0.000
eIF4E-associated protein                          (1 (2 (1 (1 3))))         5/5           0.228     0.000
Hydrolase                                         (8 (11 (12 (11 7))))      5/5           0.161     0.000
Metal-dependent phosphohydrolases                 (1 (1 (2 (1 5))))         5/5           0.122     0.000
Sortilin                                          (5 (4 (7 (4 8))))         5/5           0.045     0.000
Helicase                                          (1 (3 (3 (2 34))))        5/5           0.038     0.000
NAD kinase                                        (3 (1 (1 (2 4))))         5/5           0.038     0.001
Hydroxyisocaproate dehydrogenases                 (3 (1 (2 (1 3))))         5/5           0.038     0.002
ABC transporter                                   (15 (18 (17 (12 8))))     5/5           0.013     0.000
Thiol oxidase                                     (1 (1 (4 (2 3))))         6/3           0.212     0.002
Leucine rich repeat                               (4 (3 (1 (2 1))))         6/1           0.076     0.027
HSP70 Chaperone                                   (13 (17 (18 (12 13))))    7/7           0.141     0.006
-Transcription factor                             (1 (3 (3 (1 1))))         7/3           0.124     0.007
-PolIII transcription factor
-Cytoplasmic protein that binds Tor2p
-Ribosomal SSU (40S)
-Adenylate cyclase activity, G-protein signaling
-RRM1
Myosin                                            (5 (9 (9 (5 5))))         7/7           0.068     0.001
Cation transport enzymes                          (8 (10 (13 (6 5))))       7/7           0.048     0.000
S-methyltransferase                               (2 (5 (5 (1 1))))         7/7           0.037     0.000
-PDRE transcription factor                        (1 (4 (4 (1 1))))         7/3           0.024     0.002
-Component of peripheral vacuolar membrane
 protein complex
1,3-beta-D-glucan synthase                        (3 (8 (7 (3 3))))         7/7           0.015     0.000

10. CONCLUSION

This paper has attempted to provide the model needed to study gene family evolution among multiple whole genomes. The methodology can be used for parameter estimation, inferences on the direction and magnitude of evolutionary change, and hypothesis testing. As more genome sequences become available, we hope that this framework makes it possible to identify the genetic changes that are responsible for the phenotypic diversity found in nature. Correlated changes between families or with environmental conditions can then tell us about the mechanisms and modes of natural selection.

REFERENCES

[1] N. T. J. Bailey, The elements of stochastic processes, John Wiley & Sons, Inc., New York, 1964.
[2] R. R. Copley, L. Goodstadt, and C. Ponting, Eukaryotic domain evolution inferred from genome comparisons, Current Opinion in Genetics & Development 13 (2003) 623–628.
[3] P. Cliften, P. Sudarsanam, A. Desikan, L. Fulton, B. Fulton, J. Majors, R. Waterston, B. A. Cohen, and M. Johnston, Finding functional features in Saccharomyces genomes by phylogenetic footprinting, Science 301 (2003) 71–76.
[4] J. H. Darwin, The behaviour of an estimator for a simple birth and death process, Biometrika 43 (1956) 23–31.
[5] N. R. Friedman and A. L. Hughes, Gene duplication and the structure of eukaryotic genomes, Genome Research 11 (2001) 373–381.
[6] M. A. Huynen and E. Van Nimwegen, The frequency distribution of gene family sizes in complete genomes, Molecular Biology and Evolution 15 (1998) 583–589.
[7] Y. L. Jin and R. A. Speers, Flocculation of Saccharomyces cerevisiae, Food Res. Int. 31 (1998) 421–440.
[8] M. I. Jordan, Graphical models, to appear in Statistical Science 2004 (Special issue on Bayes Statistics).
[9] S. Karlin and H. M. Taylor, A first course in stochastic processes, Academic Press, New York, 1975.
[10] G. P. Karev, Y. I. Wolf, A. Y. Rzhetsky, F. S. Berezovskaya, and E. V. Koonin, Birth and death of protein domains: A simple model of evolution explains power law behavior, BMC Evolutionary Biology 2 (2) (2002).
[11] M. Kellis, N. Patterson, M. Endrizzi, B. Birren, and E. Lander, Sequencing and comparison of yeast species to identify genes and regulatory elements, Nature 423 (2003) 241–254.
[12] M. G. Kidwell, Transposable elements and the evolution of genome size in eukaryotes, Genetica 115 (2002) 49–63.
[13] E. S. Lander, L. M. Linton, B. Birren, C. Nusbaum, M. C. Zody et al., Initial sequencing and analysis of the human genome, Nature 409 (2001) 860–921.
[14] O. Lespinet, Y. I. Wolf, E. V. Koonin, and L. Aravind, The role of lineage-specific gene family expansion in the evolution of eukaryotes, Genome Research 12 (2002) 1048–1059.
[15] W. H. Li, Molecular evolution, Sinauer Associates, Sunderland, Mass., 1997.
[16] M. Lynch and J. S. Conery, The evolutionary fate and consequences of duplicate genes, Science 290 (2000) 1151–1155.
[17] M. Lynch and J. S. Conery, The evolutionary demography of duplicate genes, Journal of Structural and Functional Genomics 3 (2003) 35–44.
[18] M. Nei, X. Gu, and T. Sitnikova, Evolution by the birth-and-death process in multigene families of the vertebrate immune system, PNAS 94 (1997) 7799–7806.
[19] J. G. Oakeshott, C. Claudianos, R. J. Russell, and G. C. Robin, Carboxyl/cholinesterases: a case study of the evolution of a successful multigene family, BioEssays 21 (1999) 1031–1042.
[20] M. D. Pagel, The maximum likelihood approach to reconstructing ancestral character states of discrete characters on phylogenies, Syst. Biol. 48 (1999) 612–622.
[21] J. Qian, N. M. Luscombe, and M. Gerstein, Protein family and fold occurrence in genomes: Power-law behaviour and evolutionary model, Journal of Molecular Biology 313 (2001) 673–681.
[22] W. J. Reed and B. D. Hughes, A model explaining the size distribution of gene and protein families, Mathematical Biosciences 189 (2004) 97–102.
[23] A. Rokas, B. L. Williams, N. King, and S. B. Carroll, Genome-scale approaches to resolving incongruence in molecular phylogenies, Nature 425 (2003) 798–804.
[24] H. J. Sims and K. J. McConway, Nonstochastic variation of species-level diversification rates within angiosperms, Evolution 57 (2003) 460–479.
[25] B. Snel, P. Bork, and M. A. Huynen, Genomes in flux: The evolution of archaeal and proteobacterial gene content, Genome Research 12 (2002) 17–25.
[26] R. L. Tatusov, E. V. Koonin, and D. J. Lipman, A genomic perspective on protein families, Science 278 (1997) 631–637.
[27] U. Theissen, M. Hoffmeister, M. Grieshaber, and W. Martin, Single eubacterial origin of eukaryotic sulfide:quinone oxidoreductase, a mitochondrial enzyme conserved from the early evolution of eukaryotes during anoxic and sulfidic times, Molecular Biology and Evolution 20 (2003) 1564–1574.
[28] R. L. Wang, A. Stec, J. Hey, L. Lukens, and J. Doebley, The limits of selection during maize domestication, Nature 398 (1999) 236–239.
[29] Z. H. Yang and J. P. Bielawski, Statistical methods for detecting molecular adaptation, Trends in Ecology and Evolution 15 (2000) 496–503.
[30] Link

Received on March 7, 2005. Revised on October 15, 2006.