Frontiers in Microbiology 2013, Vol. 4. DOI: 10.3389/fmicb.2013.00269

Conservation vs. variation of dinucleotide frequencies across bacterial and archaeal genomes: evolutionary implications

During the long history of biological evolution, genome structures have undergone enormous changes. Nevertheless, some traits or vestiges of the primordial genome (defined in this paper as the most primitive nucleic acid genome for life on Earth) may remain in modern genetic systems. Finding these traits or vestiges is of great importance for the study of the origin and evolution of genomes. The shorter a sequence is, the less likely it is to be modified during genome evolution, and if mutated, the more easily it can reappear at the same site or at another site. Consequently, the genomic frequencies of very short nucleotide sequences, such as dinucleotides, would have a considerable chance of being conserved over billions of years of evolution. Prokaryotic genomes are very diverse and span a wide range of GC content. Therefore, to find traits or vestiges of the primordial genome remaining in modern genetic systems, we studied the characteristics of dinucleotide frequencies across bacterial and archaeal genomes. We analyzed the dinucleotide frequency patterns of the whole-genome sequences from more than 1300 prokaryotic species (bacterial and archaeal genomes available as of December 2012). The results show that the frequencies of the dinucleotides AC, AG, CA, CT, GA, GT, TC, and TG are well conserved across genomes, while the frequencies of the other dinucleotides vary considerably among species. The dinucleotide frequency conservation/variation pattern seems to correlate with the distributions of dinucleotides throughout a genome and across genomes. Further analysis indicates that the phenomenon would be determined by strand symmetry of genomic sequences (the second parity rule) and GC content variations among genomes. We discuss some possible origins of strand symmetry and propose that the frequency conservation of some dinucleotides may provide insights into the genomic composition of the primordial genetic system.
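
The quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the overlapping-window counting, and the handling of ambiguous bases are assumptions made here for illustration. The sketch computes the 16 dinucleotide frequencies of a sequence, measures how far each dinucleotide is from its reverse complement (strand symmetry), and lists the dinucleotides made of one weak (A/T) and one strong (G/C) base, which are exactly the eight reported as conserved. The last function adds a toy independence model, also an assumption not taken from the source, in which A = T and G = C on a single strand; under it a mixed dinucleotide has expected frequency g(1 - g)/4 for GC content g, which changes little around g = 0.5, whereas AA-type and GG-type dinucleotides scale as ((1 - g)/2)^2 and (g/2)^2.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"
DINUCLEOTIDES = ["".join(p) for p in product(BASES, repeat=2)]
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase A/C/G/T string."""
    return seq.translate(_COMPLEMENT)[::-1]


def dinucleotide_frequencies(seq):
    """Frequencies of the 16 dinucleotides over overlapping windows.

    Windows containing anything other than A, C, G, T (for example N)
    are ignored; this is an illustrative choice, not the paper's rule.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = sum(counts[d] for d in DINUCLEOTIDES)
    if total == 0:
        raise ValueError("sequence contains no valid dinucleotide")
    return {d: counts[d] / total for d in DINUCLEOTIDES}


def symmetry_gaps(freqs):
    """Absolute frequency difference between each dinucleotide and its
    reverse complement; values near 0 indicate strand symmetry."""
    return {d: abs(freqs[d] - freqs[reverse_complement(d)]) for d in freqs}


# Dinucleotides with one weak (A/T) and one strong (G/C) base.
MIXED = [d for d in DINUCLEOTIDES if (d[0] in "AT") != (d[1] in "AT")]


def expected_frequency(dinucleotide, gc):
    """Toy expectation assuming independent bases with A = T and G = C.

    `gc` is the GC content as a fraction between 0 and 1.
    """
    p = {"A": (1 - gc) / 2, "T": (1 - gc) / 2, "G": gc / 2, "C": gc / 2}
    return p[dinucleotide[0]] * p[dinucleotide[1]]
```

Calling `dinucleotide_frequencies` on a genome string and passing the result to `symmetry_gaps` gives one row of a per-genome table; comparing `expected_frequency("AC", g)` with `expected_frequency("AA", g)` over a range of `g` shows how differently the two classes respond to GC content in the toy model.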

Figures
Figure 1: Distribution of 1442 archaeal and bacterial genomes in terms of GC content.
Figure 2: Dinucleotide frequency distribution patterns of 133 archaeal genomes and 1309 bacterial genomes. Each genome is represented by a dash (black dash, archaeal genome; red dash, bacterial genome).
References
  1. G. Albrecht-Buehler Asymptotically increasing compliance of genomes with Chargaff's second parity rules through inversions and inverted transpositions Proc. Natl. Acad. Sci. U.S.A 103, 17828-17833 (2006) .
    • . . . The current hypotheses about the origins of strand symmetry can actually be classified into four categories: (1) selection of stem-loop structures (Forsdyke, 1995a,b); (2) no strand biases for mutation and selection (Lobry and Lobry, 1999); (3) strand inversion/inverted transposition (Fickett et al., 1992; Albrecht-Buehler, 2006); and (4) original trait of the primordial genome (Zhang and Huang, 2008, 2010) . . .
    • . . . The mechanisms for maintaining strand symmetry (see Nussinov, 1982; Lobry and Lobry, 1999; Sanchez and Jose, 2002; Albrecht-Buehler, 2006) would also help maintain the frequency conservation pattern . . .
  2. P. F. Baisnée; S. Hampson; P. Baldi Why are complementary DNA strands symmetric? Bioinformatics 18, 1021-1033 (2002) .
    • . . . We propose that the frequency conservation patterns would be vestiges of the primordial genome, considering that the phenomenon would depend on strand symmetry of genomic sequences (also called the second parity rule, which is the marked similarity of the frequencies of nucleotides and oligonucleotides to those of their respective reverse complements within single strands of sufficiently long genomic sequences; see Fickett et al., 1992; Prabhu, 1993; Forsdyke and Mortimer, 2000; Qi and Cuticchia, 2001; Baisnée et al., 2002; Zhang and Huang, 2010) and on GC content variations among genomes. . . .
    • . . . Even the counts of a dinucleotide and those of its reverse complement across genomes are significantly different in the χ² test (see also Baisnée et al., 2002). . . .
  3. D. Birnbaum; F. Coulier; M. J. Pebusque; P. Pontarotti “Paleogenomics”: looking in the past to the future J. Exp. Zool 288, 21-22 (2000) .
    • . . . Indeed, the only way to reconstruct ancient genetic systems in the absence of fossil DNA may be the deduction from the comparative analysis of the structures of present-day genomes (see also Birnbaum et al., 2000). . . .
  4. C. Burge; A. M. Campbell; S. Karlin Over- and under-representation of short oligonucleotides in DNA sequences Proc. Natl. Acad. Sci. U.S.A 89, 1358-1362 (1992) .
    • . . . However, as assumed above, what we need for the purpose of our study is the occurrence frequencies, which are generally not congruent with the relative abundances (Burge et al., 1992) . . .
  5. J. W. Fickett; D. C. Torney; D. R. Wolf Base compositional structure of genomes Genomics 13, 1056-1064 (1992) .
    • . . . We propose that the frequency conservation patterns would be vestiges of the primordial genome, considering that the phenomenon would depend on strand symmetry of genomic sequences (also called the second parity rule, which is the marked similarity of the frequencies of nucleotides and oligonucleotides to those of their respective reverse complements within single strands of sufficiently long genomic sequences; see Fickett et al., 1992; Prabhu, 1993; Forsdyke and Mortimer, 2000; Qi and Cuticchia, 2001; Baisnée et al., 2002; Zhang and Huang, 2010) and on GC content variations among genomes. . . .
    • . . . The current hypotheses about the origins of strand symmetry can actually be classified into four categories: (1) selection of stem-loop structures (Forsdyke, 1995a,b); (2) no strand biases for mutation and selection (Lobry and Lobry, 1999); (3) strand inversion/inverted transposition (Fickett et al., 1992; Albrecht-Buehler, 2006); and (4) original trait of the primordial genome (Zhang and Huang, 2008, 2010) . . .
  6. D. R. Forsdyke A stem-loop “kissing” model for the initiation of recombination and the origin of introns Mol. Biol. Evol 12, 949-958 (1995a) .
    • . . . The current hypotheses about the origins of strand symmetry can actually be classified into four categories: (1) selection of stem-loop structures (Forsdyke, 1995a,b); (2) no strand biases for mutation and selection (Lobry and Lobry, 1999); (3) strand inversion/inverted transposition (Fickett et al., 1992; Albrecht-Buehler, 2006); and (4) original trait of the primordial genome (Zhang and Huang, 2008, 2010) . . .
  7. D. R. Forsdyke Relative roles of primary sequence and (G + C)% in determining the hierarchy of frequencies of complementary trinucleotide pairs in DNAs of different species J. Mol. Evol 41, 573-581 (1995b) .
    • . . . The current hypotheses about the origins of strand symmetry can actually be classified into four categories: (1) selection of stem-loop structures (Forsdyke, 1995a,b); (2) no strand biases for mutation and selection (Lobry and Lobry, 1999); (3) strand inversion/inverted transposition (Fickett et al., 1992; Albrecht-Buehler, 2006); and (4) original trait of the primordial genome (Zhang and Huang, 2008, 2010) . . .
  8. D. R. Forsdyke; J. R. Mortimer Chargaff's legacy Gene 261, 127-137 (2000) .
    • . . . We propose that the frequency conservation patterns would be vestiges of the primordial genome, considering that the phenomenon would depend on strand symmetry of genomic sequences (also called the second parity rule, which is the marked similarity of the frequencies of nucleotides and oligonucleotides to those of their respective reverse complements within single strands of sufficiently long genomic sequences; see Fickett et al., 1992; Prabhu, 1993; Forsdyke and Mortimer, 2000; Qi and Cuticchia, 2001; Baisnée et al., 2002; Zhang and Huang, 2010) and on GC content variations among genomes. . . .
  9. D. Häring; J. Kypr Variations of the mononucleotide and short oligonucleotide distributions in the genomes of various organisms J. Theor. Biol 201, 141-156 (1999) .
    • . . . Also, the distributions of oligonucleotides containing similar and especially the same numbers of the strong and weak nucleotides, but no CG or TA dinucleotide, are the most uniform in six representative genomes (yet the authors considered their distributions not informative; Häring and Kypr, 1999) . . .
  10. S. Karlin; C. Burge Dinucleotide relative abundance extremes: a genomic signature Trends Genet 11, 283-290 (1995) .
    • . . . With more sequences available, one of the most studied aspects in this field is the characteristics of dinucleotide relative abundances, which assess contrasts between the observed dinucleotide frequencies and those expected from the component nucleotide frequencies (Karlin and Burge, 1995) . . .
    • . . . They are not the same as the expected frequencies for the calculation of relative abundances (Karlin and Burge, 1995) that are species-specific or taxon-specific (see also Introduction section). . . .
  11. S. Karlin; I. Ladunga; B. E. Blaisdell Heterogeneity of genomes: measures and values Proc. Natl. Acad. Sci. U.S.A 91, 12837-12841 (1994) .
    • . . . The profiles of relative abundances of dinucleotides in genomic sequences are rather species-specific or taxon-specific (Karlin et al., 1994, 1997) . . .
  12. S. Karlin; J. Mrazek; A. M. Campbell Compositional biases of bacterial genomes and evolutionary implications J. Bacteriol 179, 3899-3913 (1997) .
    • . . . The profiles of relative abundances of dinucleotides in genomic sequences are rather species-specific or taxon-specific (Karlin et al., 1994, 1997) . . .
  13. C. G. Kozhukhin; P. A. Pevzner Genome inhomogeneity is determined mainly by WW and SS dinucleotides Comput. Appl. Biosci 7, 39-49 (1991) .
    • . . . It has been shown that genome inhomogeneity is determined mainly by AA, TT, GG, CC, AT, TA, GC, and CG dinucleotides (consisting of two strong nucleotides or two weak nucleotides), which are closely associated with polyW and polyS tracts (W and S stand for weak nucleotides and strong nucleotides, respectively; Kozhukhin and Pevzner, 1991) . . .
  14. J. R. Lobry; C. Lobry Evolution of DNA base composition under no-strand-bias conditions when the substitution rates are not constant Mol. Biol. Evol 16, 719-723 (1999) .
    • . . . The current hypotheses about the origins of strand symmetry can actually be classified into four categories: (1) selection of stem-loop structures (Forsdyke, 1995a,b); (2) no strand biases for mutation and selection (Lobry and Lobry, 1999); (3) strand inversion/inverted transposition (Fickett et al., 1992; Albrecht-Buehler, 2006); and (4) original trait of the primordial genome (Zhang and Huang, 2008, 2010) . . .
    • . . . The mechanisms for maintaining strand symmetry (see Nussinov, 1982; Lobry and Lobry, 1999; Sanchez and Jose, 2002; Albrecht-Buehler, 2006) would also help maintain the frequency conservation pattern . . .
  15. H. Nishida Genome DNA sequence variation, evolution, and function in bacteria and archaea Curr. Issues Mol. Biol 15, 19-24 (2013) .
    • . . . Also, it is comparable to the GC content variation pattern of another more recent study (see Nishida, 2013; the difference, especially for the distribution of genomes with GC content of 50%, may be due to different sampling strategies) . . .
  16. R. Nussinov Some rules in the ordering of nucleotides in the DNA Nucleic Acids Res 8, 4545-4562 (1980) .
    • . . . Much research has been done in the field of dinucleotide frequencies even when sequence data were limited (e.g., Nussinov, 1980, 1981, 1984), revealing hierarchies in the frequencies (preferences) of different dinucleotides in natural nucleic acid sequences . . .
  17. R. Nussinov Nearest neighbor nucleotide patterns: structural and biological implications J. Biol. Chem 256, 8458-8462 (1981) .
    • . . . Much research has been done in the field of dinucleotide frequencies even when sequence data were limited (e.g., Nussinov, 1980, 1981, 1984), revealing hierarchies in the frequencies (preferences) of different dinucleotides in natural nucleic acid sequences . . .
  18. R. Nussinov Some indications for inverse DNA duplication J. Theor. Biol 95, 783-791 (1982) .
    • . . . The mechanisms for maintaining strand symmetry (see Nussinov, 1982; Lobry and Lobry, 1999; Sanchez and Jose, 2002; Albrecht-Buehler, 2006) would also help maintain the frequency conservation pattern . . .
  19. R. Nussinov Doublet frequencies in evolutionary distinct groups Nucleic Acids Res 12, 1749-1763 (1984) .
    • . . . Much research has been done in the field of dinucleotide frequencies even when sequence data were limited (e.g., Nussinov, 1980, 1981, 1984), revealing hierarchies in the frequencies (preferences) of different dinucleotides in natural nucleic acid sequences . . .
  20. V. V. Prabhu Symmetry observations in long nucleotide sequences Nucleic Acids Res 21, 2797-2800 (1993) .
    • . . . We propose that the frequency conservation patterns would be vestiges of the primordial genome, considering that the phenomenon would depend on strand symmetry of genomic sequences (also called the second parity rule, which is the marked similarity of the frequencies of nucleotides and oligonucleotides to those of their respective reverse complements within single strands of sufficiently long genomic sequences; see Fickett et al., 1992; Prabhu, 1993; Forsdyke and Mortimer, 2000; Qi and Cuticchia, 2001; Baisnée et al., 2002; Zhang and Huang, 2010) and on GC content variations among genomes. . . .
    • . . . The correlation/regression analysis has been employed to measure the similarity/difference between the frequencies (counts) of an oligonucleotide and its reverse complement (see for example, Prabhu, 1993; Qi and Cuticchia, 2001), and between those of different oligonucleotides (Zhang and Huang, 2010) . . .
  21. D. Qi; A. J. Cuticchia Compositional symmetries in complete genomes Bioinformatics 17, 557-559 (2001) .
    • . . . We propose that the frequency conservation patterns would be vestiges of the primordial genome, considering that the phenomenon would depend on strand symmetry of genomic sequences (also called the second parity rule, which is the marked similarity of the frequencies of nucleotides and oligonucleotides to those of their respective reverse complements within single strands of sufficiently long genomic sequences; see Fickett et al., 1992; Prabhu, 1993; Forsdyke and Mortimer, 2000; Qi and Cuticchia, 2001; Baisnée et al., 2002; Zhang and Huang, 2010) and on GC content variations among genomes. . . .
    • . . . The correlation/regression analysis has been employed to measure the similarity/difference between the frequencies (counts) of an oligonucleotide and its reverse complement (see for example, Prabhu, 1993; Qi and Cuticchia, 2001), and between those of different oligonucleotides (Zhang and Huang, 2010) . . .
  22. A. C. Rogerson There appear to be conserved constraints on the distribution of nucleotide sequences in cellular genomes J. Mol. Evol 32, 24-30 (1991) .
    • . . . An early study indicates that there are significant correlations between genomic libraries in terms of tetranucleotide frequency distribution, suggesting an overall correlation of frequency profiles of short nucleotides among genomes (Rogerson, 1991) . . .
  23. J. Sanchez; M. V. Jose Analysis of bilateral inverse symmetry in whole bacterial chromosomes Biochem. Biophys. Res. Commun 299, 126-134 (2002) .
    • . . . The mechanisms for maintaining strand symmetry (see Nussinov, 1982; Lobry and Lobry, 1999; Sanchez and Jose, 2002; Albrecht-Buehler, 2006) would also help maintain the frequency conservation pattern . . .
  24. N. Sueoka On the genetic basis of variation and heterogeneity of DNA base composition Proc. Natl. Acad. Sci. U.S.A 48, 582-592 (1962) .
    • . . . For mononucleotides, it has been known that their frequencies vary greatly among species, especially in prokaryotes (Sueoka, 1962) . . .
  25. S.-H. Zhang On the origin and evolution of organic genomes Zhongshan Da Xue Xue Bao Zi Ran Ke Xue Ban 35, 96-101 (1996) .
    • . . . Finding these traits or vestiges is very important for the study of the origin and evolution of genomes (Zhang, 1996) . . .
  26. S.-H. Zhang The origin and evolution of repeated sequences and introns Speculations Sci. Technol 21, 7-16 (1998) .
    • . . . In fact, it has been proposed that the biological diversity in the primordial biosphere (the number of species of DNA or RNA macromolecules capable of self-replicating on a large scale) would be very low due to competitive exclusion, and that repeats of a certain species of self-replicating macromolecules made up the most primitive genomes (Zhang, 1998) . . .
  27. S.-H. Zhang; Y.-Z. Huang Characteristics of oligonucleotide frequencies across genomes: conservation versus variation, strand symmetry, and evolutionary implications Nature Precedings (2008) .
    • . . . Though our results concern only prokaryotic genomes, actually they apply also to eukaryotic genomes (for a preliminary analysis, see Zhang and Huang, 2008) . . .
    • . . . The current hypotheses about the origins of strand symmetry can actually be classified into four categories: (1) selection of stem-loop structures (Forsdyke, 1995a,b); (2) no strand biases for mutation and selection (Lobry and Lobry, 1999); (3) strand inversion/inverted transposition (Fickett et al., 1992; Albrecht-Buehler, 2006); and (4) original trait of the primordial genome (Zhang and Huang, 2008, 2010) . . .
    • . . . Alternatively, we have suggested that strand symmetry would probably exist from the very beginning of genome evolution (Zhang and Huang, 2008, 2010) . . .
  28. S.-H. Zhang; Y.-Z. Huang Limited contribution of stem-loop potential to symmetry of single-stranded genomic DNA Bioinformatics 26, 478-485 (2010) .
    • . . . We propose that the frequency conservation patterns would be vestiges of the primordial genome, considering that the phenomenon would depend on strand symmetry of genomic sequences (also called the second parity rule, which is the marked similarity of the frequencies of nucleotides and oligonucleotides to those of their respective reverse complements within single strands of sufficiently long genomic sequences; see Fickett et al., 1992; Prabhu, 1993; Forsdyke and Mortimer, 2000; Qi and Cuticchia, 2001; Baisnée et al., 2002; Zhang and Huang, 2010) and on GC content variations among genomes. . . .
    • . . . The correlation/regression analysis has been employed to measure the similarity/difference between the frequencies (counts) of an oligonucleotide and its reverse complement (see for example, Prabhu, 1993; Qi and Cuticchia, 2001), and between those of different oligonucleotides (Zhang and Huang, 2010) . . .
    • . . . We converted the correlation coefficients, the slopes and the intercepts to absolute differences from 1, 1 and 0, respectively (the transformed correlation coefficients, the transformed slopes and the transformed intercepts, respectively; see also Zhang and Huang, 2010), so as to measure the levels of frequency conservation/variation for the dinucleotides . . .
    • . . . The current hypotheses about the origins of strand symmetry can actually be classified into four categories: (1) selection of stem-loop structures (Forsdyke, 1995a,b); (2) no strand biases for mutation and selection (Lobry and Lobry, 1999); (3) strand inversion/inverted transposition (Fickett et al., 1992; Albrecht-Buehler, 2006); and (4) original trait of the primordial genome (Zhang and Huang, 2008, 2010) . . .
    • . . . However, the contribution of stem-loop potential of single-stranded DNA to strand symmetry would be very limited (Zhang and Huang, 2010) . . .
    • . . . Alternatively, we have suggested that strand symmetry would probably exist from the very beginning of genome evolution (Zhang and Huang, 2008, 2010) . . .
  29. S.-H. Zhang; L. Wang A novel common triplet profile for GC-rich prokaryotic genomes Genomics 97, 330-331 (2011) .
    • . . . This variation of GC content among genomes is on the whole similar to the distribution of genomic GC content for prokaryotic species with whole-genome sequences available as of December 2008 (Zhang and Wang, 2011) . . .
  30. S.-H. Zhang; L. Wang Two common profiles exist for genomic oligonucleotide frequencies BMC Res. Notes 5, 639 (2012) .
    • . . . If there is a difference of more than 1% in genomic GC content between individual strains or subspecies of a species, we also selected the strains or subspecies whose GC content differs from that of the others by at least 1% (see also Zhang and Wang, 2012) . . .
  31. S.-H. Zhang; J.-H. Yang Conservation versus variation of dinucleotide frequencies across genomes: evolutionary implications Genome Biol 6, P12 (2005) .
    • . . . In this paper we analyzed, following our previous preliminary work (Zhang and Yang, 2005), the dinucleotide frequency patterns of the whole-genome sequences from over 1300 prokaryotic species . . .
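
The excerpts under reference 28 describe converting correlation coefficients, slopes, and intercepts into absolute differences from 1, 1, and 0. A minimal sketch of that transformation follows, assuming an ordinary least-squares fit of one frequency vector on another; the function name and the choice of which vectors to compare are hypothetical and are not taken from the source.

```python
import math


def transformed_regression(x, y):
    """Return (|1 - r|, |1 - slope|, |intercept|) for a least-squares
    fit of y on x. All three values are 0 when y equals x exactly.

    x and y are equal-length sequences of numbers, for example the
    frequencies of one dinucleotide and of another across genomes
    (an assumed use, labeled here for illustration only).
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each vary")
    r = sxy / math.sqrt(sxx * syy)       # Pearson correlation coefficient
    slope = sxy / sxx                    # least-squares slope of y on x
    intercept = mean_y - slope * mean_x  # least-squares intercept
    return abs(1 - r), abs(1 - slope), abs(intercept)
```

Smaller values on all three measures mean the two frequency vectors are closer to identical, which is how the excerpt uses them to grade conservation versus variation.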