Nature 2011 Vol: 477(7365):419-423. DOI: 10.1038/nature10414

Multiple reference genomes and transcriptomes for Arabidopsis thaliana

Genetic differences between Arabidopsis thaliana accessions underlie the plant’s extensive phenotypic variation, and until now these have been interpreted largely in the context of the annotated reference accession Col-0. Here we report the sequencing, assembly and annotation of the genomes of 18 natural A. thaliana accessions, and their transcriptomes. When assessed on the basis of the reference annotation, one-third of protein-coding genes are predicted to be disrupted in at least one accession. However, re-annotation of each genome revealed that alternative gene models often restore coding potential. Gene expression in seedlings differed for nearly half of expressed genes and was frequently associated with cis variants within 5 kilobases, as were intron retention alternative splicing events. Sequence and expression variation is most pronounced in genes that respond to the biotic environment. Our data further promote evolutionary and functional studies in A. thaliana, especially the MAGIC genetic reference population descended from these accessions.

Figures
Figure 1: Assembly and variation of 18 genomes of A. thaliana. a, Classification of sequence, SNPs and indels based on the Col-0 genome. b, Assembly accuracy (y axis; base substitution errors per 10 kb) measured relative to four validation data sets at each of eight stages in the IMR/DENOM assembly pipeline (x axis). Bur-0 survey (blue line): 1,442 survey sequences (about 417 bp each) in predominantly genic regions (ref. 19); Bur-0 divergent (red line): 188 sequences (each about 254 bp) highly divergent from Col-0 (ref. 3); Ler-0 nonrepetitive (orange line): a predominantly single-copy 175-kb Ler-0 sequence on chromosome 5; Ler-0 repetitive (purple line): a highly repetitive 339-kb Ler-0 locus on chromosome 3 (ref. 18; Supplementary Information section 4). Iter, iteration. c, Genome-wide distribution of the minimum clade size for all pairs of accessions (excluding Po-0). Each pair is represented by a grey line, the mean over all pairs by the black line and the random distribution by the green line. d, Decay in linkage disequilibrium with distance (Po-0 excluded). The black line shows r² between SNPs; the red line shows phylogenetic r² (Supplementary Information section 6).

Figure 2: Transcript and protein variation. a, Example of a splice site change between two haplotypes for the gene AT1G64970. Haplotype I (Col-0) is spliced with an intron 6 bp (two amino acids) shorter than haplotype II (Ler-0); Po-0 (heterozygous) shows allele-specific expression of both. b, Re-annotation of the FRIGIDA locus showing annotations for accessions Sf-2 (functional), and Col-0 (truncated by a premature stop) and Ler-0 (non-functional) (Supplementary Figs 18 and 42). Right: the 19 accessions are shown clustered on the basis of the AA distance between their FRIGIDA amino-acid sequences. Common isoform clusters (at distance 2% or less; red line) are shown, leading to three clusters with three, seven and nine accessions. c, Proteome diversity for coding genes, pseudogenes and A. lyrata genes (top) and for genes with disruptions (bottom). Reported is the fraction of genes with relative AA distance to other accessions (average over pairs) in the given colour-coded interval (Supplementary Information section 10.7). d, Frequency of isoforms of coding genes and pseudogenes (top), and those associated with different disruptions (bottom).

Figure 3: Quantitative variation of coding gene expression. a, The overlap between heritable (more than 30%) and differentially expressed (FDR 5%) genes, and genes with a cis-eQTL (FDR 5%). b, Differentially expressed genes and genes with cis-eQTLs (FDR 5%) categorized by fold change. Nucleotide variants (orange bars; 647 cis-eQTLs) are SNPs and single-base indels; copy-number variants (green bars; 42 cis-eQTLs) are regions with elevated coverage in aligned genomic reads in at least one accession; gene structural variants (black bars; 227 cis-eQTLs) are accession-specific deletions, insertions or changes to the gene model. c, The spatial distribution of nucleotide-variant eQTLs relative to the start of protein-coding genes (FDR 5%, overlapping genes removed; n = 647). The line shows density of gene length. d, Frequencies of nucleotide-variant eQTLs in protein-coding genes, classified by component (bar widths are proportional to the components’ average physical lengths): red bars, upstream; yellow bars, 5′ untranslated region; green bars, coding sequence exons; blue bars, introns; cyan bars, 3′ untranslated region; grey bars, downstream.

Figure 4: Protein diversity and gene expression vary by gene category or family. The numbers next to each row are gene counts. The gene families were selected from Supplementary Figs 26 and 39–41 to represent the breadth of observed variation. a, Distribution of average AA distances to other accessions (compare with c). b, Fraction of unexpressed, expressed and differentially expressed genes (expressed is a superset of differentially expressed). c, Distribution of genes categorized by fold change (between lowest and highest across 19 accessions). d, Distribution of the numbers of accessions contributing to differential expression. TF, transcription factor; CC, coiled-coil; TIR, Toll interleukin-1 receptor; NB-LRR, nucleotide-binding leucine-rich repeat.
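The r² statistic between SNPs in Fig. 1d can be illustrated with a minimal sketch. It assumes the conventional definition of linkage disequilibrium r² for two biallelic SNPs, with each homozygous accession coded 0 (reference allele) or 1 (alternative allele). The function name `ld_r2` and the toy genotype vectors are hypothetical and are not taken from the paper, and the sketch does not cover the phylogenetic r² described in Supplementary Information section 6.

```python
from typing import Sequence


def ld_r2(snp_a: Sequence[int], snp_b: Sequence[int]) -> float:
    """Conventional LD r^2 between two biallelic SNPs.

    Each input holds one 0/1 allele code per accession (hypothetical
    encoding; accessions are treated as homozygous, so one allele each).
    """
    if len(snp_a) != len(snp_b) or len(snp_a) == 0:
        raise ValueError("SNP vectors must be non-empty and of equal length")

    n = len(snp_a)
    p_a = sum(snp_a) / n  # frequency of allele 1 at the first SNP
    p_b = sum(snp_b) / n  # frequency of allele 1 at the second SNP
    # Frequency of the haplotype carrying allele 1 at both SNPs.
    p_ab = sum(1 for a, b in zip(snp_a, snp_b) if a == 1 and b == 1) / n

    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when either SNP is monomorphic")

    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    return d * d / denom


if __name__ == "__main__":
    # Toy data for four accessions (illustrative only).
    print(ld_r2([0, 0, 1, 1], [0, 0, 1, 1]))  # perfectly correlated SNPs
    print(ld_r2([0, 0, 1, 1], [0, 1, 0, 1]))  # statistically independent SNPs
```

Averaging such pairwise values within bins of physical distance between SNPs would give a decay curve of the kind the caption describes; the binning scheme here would be a further assumption.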
References
  1. Johanson, U. Molecular analysis of FRIGIDA, a major determinant of natural variation in Arabidopsis flowering time Science 290, 344-347 (2000) .
    • . . . However, this may cause bias, because genes may be inactive in the reference but expressed in the population1, suggesting that sequencing and re-annotating individual genomes is necessary . . .
  2. Bentley, D. R. Accurate whole human genome sequencing using reversible terminator chemistry Nature 456, 53-59 (2008) .
    • . . . Advances in sequencing2 make this tractable for Arabidopsis thaliana3, 4, 5, whose natural accessions (strains) are typically homozygous . . .
    • . . . Accessions were sequenced with Illumina paired-end reads2 (Supplementary Table 1), generally with two libraries with 200-bp and 400-bp inserts and reads of 36 and 51 bp, respectively, to between 27-fold and 60-fold coverage . . .
  3. Ossowski, S. Sequencing of natural strains of Arabidopsis thaliana with short reads Genome Res. 18, 2024-2033 (2008) .
    • . . . Bur-0 survey (blue line): 1,442 survey sequences (about 417 bp each) in predominantly genic regions19; Bur-0 divergent (red line): 188 sequences (each about 254 bp) highly divergent from Col-0 (ref. 3); Ler-0 nonrepetitive (orange line): a predominantly single-copy 175-kb Ler-0 sequence on chromosome 5; Ler-0 repetitive (purple line): a highly repetitive 339-kb Ler-0 locus on chromosome 3 (ref. 18; Supplementary Information section 4) . . .
    • . . . Advances in sequencing2 make this tractable for Arabidopsis thaliana3, 4, 5, whose natural accessions (strains) are typically homozygous . . .
    • . . . Relative to the 119-megabase (Mb) high-quality reference sequence from Col-0 (ref. 6), diverse accessions harbour a single nucleotide polymorphism (SNP) about every 200 base pairs (bp) (ref. 3), and indel variation is pervasive3, 7, 8 . . .
    • . . . The assembled genomes also contribute to the A. thaliana 1001 Genomes Project3, 4, 5, 13. . . .
    • . . . At unique loci, polymorphic regions probably reflect complex polymorphisms3, 8 . . .
    • . . . As assessed with about 1.2 Mb of genomic dideoxy data3, 18, 19 (Supplementary Information section 4), the substitution error rate was about 1 per 10 kb in single-copy regions, and about tenfold higher in transposable-element-rich regions . . .
    • . . . As expected3, 7, disease resistance genes of the coiled-coil and Toll interleukin 1 receptor subfamilies of the Nucleotide-Binding Leucine Rich Repeat (NB-LRR) gene family were predicted to encode the most variable proteins (Fig. 4a and Supplementary Fig. 26) . . .
  4. Schneeberger, K. Reference-guided assembly of four diverse Arabidopsis thaliana genomes Proc. Natl Acad. Sci. USA 108, 10249-10254 (2011) .
    • . . . Advances in sequencing2 make this tractable for Arabidopsis thaliana3, 4, 5, whose natural accessions (strains) are typically homozygous . . .
    • . . . The assembled genomes also contribute to the A. thaliana 1001 Genomes Project3, 4, 5, 13. . . .
    • . . . The substitution error rate for our assemblies was comparable to that reported for four other A. thaliana genome assemblies4. . . .
    • . . . Our study goes beyond cataloguing polymorphisms7, 17 to provide genome sequences for a moderately sized population sample (see also refs 4, 16) . . .
  5. Weigel, D.; Mott, R. The 1001 genomes project for Arabidopsis thaliana Genome Biol. 10, 107 (2009) .
    • . . . Advances in sequencing2 make this tractable for Arabidopsis thaliana3, 4, 5, whose natural accessions (strains) are typically homozygous . . .
    • . . . The assembled genomes also contribute to the A. thaliana 1001 Genomes Project3, 4, 5, 13. . . .
    • . . . The methods we developed are of immediate relevance to the broader A. thaliana 1001 Genomes Project5 and to other organisms, and highlight the importance of RNA-seq data for annotation. . . .
  6. The Arabidopsis Genome Initiative Analysis of the genome sequence of the flowering plant Arabidopsis thaliana Nature 408, 796-815 (2000) .
    • . . . Relative to the 119-megabase (Mb) high-quality reference sequence from Col-0 (ref. 6), diverse accessions harbour a single nucleotide polymorphism (SNP) about every 200 base pairs (bp) (ref. 3), and indel variation is pervasive3, 7, 8 . . .
  7. Clark, R. M. Common sequence polymorphisms shaping genetic diversity in Arabidopsis thaliana Science 317, 338-342 (2007) .
    • . . . Relative to the 119-megabase (Mb) high-quality reference sequence from Col-0 (ref. 6), diverse accessions harbour a single nucleotide polymorphism (SNP) about every 200 base pairs (bp) (ref. 3), and indel variation is pervasive3, 7, 8 . . .
    • . . . The probability of recent co-ancestry is slightly higher than expected for a few pairs of accessions, with extended haplotype sharing at a minority of loci (Supplementary Figs 11–15), perhaps reflecting selective sweeps7 . . .
    • . . . Variation among the 18 accessions is similar to a diverse global A. thaliana sample7, 8 in nucleotide diversity (Supplementary Figs 11–15), correlation with genomic features (Supplementary Tables 9–12) and structural variants (Supplementary Fig. 17). . . .
    • . . . As expected3, 7, disease resistance genes of the coiled-coil and Toll interleukin 1 receptor subfamilies of the Nucleotide-Binding Leucine Rich Repeat (NB-LRR) gene family were predicted to encode the most variable proteins (Fig. 4a and Supplementary Fig. 26) . . .
    • . . . Our data suggest that high turnover for some F-box families in the A. thaliana lineage7 extends to gene expression as well. . . .
    • . . . Our study goes beyond cataloguing polymorphisms7, 17 to provide genome sequences for a moderately sized population sample (see also refs 4, 16) . . .
  8. Zeller, G. Detecting polymorphic regions in Arabidopsis thaliana with resequencing microarrays Genome Res. 18, 918-929 (2008) .
    • . . . Relative to the 119-megabase (Mb) high-quality reference sequence from Col-0 (ref. 6), diverse accessions harbour a single nucleotide polymorphism (SNP) about every 200 base pairs (bp) (ref. 3), and indel variation is pervasive3, 7, 8 . . .
    • . . . We aligned reads to the final assemblies to detect polymorphic regions8 lacking read coverage (2.1–3.7 Mb per accession; Supplementary Table 3 and Supplementary Fig. 2) . . .
    • . . . At unique loci, polymorphic regions probably reflect complex polymorphisms3, 8 . . .
    • . . . Variation among the 18 accessions is similar to a diverse global A. thaliana sample7, 8 in nucleotide diversity (Supplementary Figs 11–15), correlation with genomic features (Supplementary Tables 9–12) and structural variants (Supplementary Fig. 17). . . .
  9. Kover, P. X. A multiparent advanced generation inter-cross to fine-map quantitative traits in Arabidopsis thaliana PLoS Genet. 5, e1000551 (2009) .
    • . . . Characterizing this variation is crucial for dissecting the genetic architecture of traits by quantitative trait locus mapping in recombinant inbred lines (see, for example, ref. 9) or genome-wide association in natural accessions10. . . .
    • . . . Here we have sequenced and accurately assembled the single-copy genomes of 18 accessions that, with Col-0, are the parents of more than 700 Multiparent Advanced Generation Inter-Cross (MAGIC) lines9, similar to the maize Nested Association Mapping (NAM)11 population and the murine Collaborative Cross12 . . .
    • . . . These accessions comprise a geographically and phenotypically diverse sample across the species9 . . .
    • . . . Our findings indicate that the MAGIC lines, for which population structure is largely mitigated9, will be an important and complementary resource to genome-wide association studies in A. thaliana populations10. . . .
  10. Atwell, S. Genome-wide association study of 107 phenotypes in Arabidopsis thaliana inbred lines Nature 465, 627-631 (2010) .
    • . . . Characterizing this variation is crucial for dissecting the genetic architecture of traits by quantitative trait locus mapping in recombinant inbred lines (see, for example, ref. 9) or genome-wide association in natural accessions10. . . .
    • . . . Our findings indicate that the MAGIC lines, for which population structure is largely mitigated9, will be an important and complementary resource to genome-wide association studies in A. thaliana populations10. . . .
  11. McMullen, M. D. Genetic properties of the maize nested association mapping population Science 325, 737-740 (2009) .
    • . . . Here we have sequenced and accurately assembled the single-copy genomes of 18 accessions that, with Col-0, are the parents of more than 700 Multiparent Advanced Generation Inter-Cross (MAGIC) lines9, similar to the maize Nested Association Mapping (NAM)11 population and the murine Collaborative Cross12 . . .
  12. Collaborative cross mice and their power to map host susceptibility to Aspergillus fumigatus infection Genome Res. 21, 1239-1248 (2011) .
    • . . . Here we have sequenced and accurately assembled the single-copy genomes of 18 accessions that, with Col-0, are the parents of more than 700 Multiparent Advanced Generation Inter-Cross (MAGIC) lines9, similar to the maize Nested Association Mapping (NAM)11 population and the murine Collaborative Cross12 . . .
  13. Cao, J. Whole-genome sequencing of multiple Arabidopsis thaliana populations Nature Genet. (28 August 2011) .
    • . . . The assembled genomes also contribute to the A. thaliana 1001 Genomes Project3, 4, 5, 13. . . .
  14. Lunter, G.; Goodson, M. Stampy: a statistical algorithm for sensitive and fast mapping of Illumina sequence reads Genome Res. 21, 936-939 (2011) .
    • . . . Each genome was assembled by using five cycles of iterative read mapping14 combined with de novo assembly15 (Supplementary Information sections 2 and 3, and Supplementary Tables 1 and 2) . . .
  15. Li, R. De novo assembly of human genomes with massively parallel short read sequencing Genome Res. 20, 265-272 (2010) .
    • . . . Each genome was assembled by using five cycles of iterative read mapping14 combined with de novo assembly15 (Supplementary Information sections 2 and 3, and Supplementary Tables 1 and 2) . . .
  16. Keane, T. M. Mouse genomic variation and its effect on phenotypes and gene regulation Nature (in the press) .
    • . . . The density of sequence differences is greater than between classical inbred strains of mice16, but less than between lines of maize17. . . .
    • . . . Our study goes beyond cataloguing polymorphisms7, 17 to provide genome sequences for a moderately sized population sample (see also refs 4, 16) . . .
  17. Gore, M. A. A first-generation haplotype map of maize Science 326, 1115-1117 (2009) .
    • . . . The density of sequence differences is greater than between classical inbred strains of mice16, but less than between lines of maize17. . . .
    • . . . Our study goes beyond cataloguing polymorphisms7, 17 to provide genome sequences for a moderately sized population sample (see also refs 4, 16) . . .
  18. Lai, A. G.; Denton-Giles, M.; Mueller-Roeber, B.; Schippers, J. H.; Dijkwel, P. P. Positional information resolves structural variations and uncovers an evolutionarily divergent genetic locus in accessions of Arabidopsis thaliana Genome Biol. Evol. (27 May 2011) .
    • . . . Bur-0 survey (blue line): 1,442 survey sequences (about 417 bp each) in predominantly genic regions19; Bur-0 divergent (red line): 188 sequences (each about 254 bp) highly divergent from Col-0 (ref. 3); Ler-0 nonrepetitive (orange line): a predominantly single-copy 175-kb Ler-0 sequence on chromosome 5; Ler-0 repetitive (purple line): a highly repetitive 339-kb Ler-0 locus on chromosome 3 (ref. 18; Supplementary Information section 4) . . .
    • . . . As assessed with about 1.2 Mb of genomic dideoxy data3, 18, 19 (Supplementary Information section 4), the substitution error rate was about 1 per 10 kb in single-copy regions, and about tenfold higher in transposable-element-rich regions . . .
  19. Nordborg, M. The pattern of polymorphism in Arabidopsis thaliana PLoS Biol. 3, e196 (2005) .
    • . . . Bur-0 survey (blue line): 1,442 survey sequences (about 417 bp each) in predominantly genic regions19; Bur-0 divergent (red line): 188 sequences (each about 254 bp) highly divergent from Col-0 (ref. 3); Ler-0 nonrepetitive (orange line): a predominantly single-copy 175-kb Ler-0 sequence on chromosome 5; Ler-0 repetitive (purple line): a highly repetitive 339-kb Ler-0 locus on chromosome 3 (ref. 18; Supplementary Information section 4) . . .
    • . . . As assessed with about 1.2 Mb of genomic dideoxy data3, 18, 19 (Supplementary Information section 4), the substitution error rate was about 1 per 10 kb in single-copy regions, and about tenfold higher in transposable-element-rich regions . . .
  20. Song, Y. S.; Hein, J. Constructing minimal ancestral recombination graphs J. Comput. Biol. 12, 147-169 (2005) .
    • . . . We computed phylogenies20 across 1.25 million biallelic, non-private SNPs (Supplementary Information section 6) . . .
  21. Jean, G.; Kahles, A.; Sreedharan, V. T.; De Bona, F.; Ratsch, G. Current Protocols in Bioinformatics (2010) .
    • . . . We integrated read alignments21 with sequence-based gene predictions22 by using mGene.ngs (Supplementary Information sections 9–10.3, and Supplementary Fig. 19) . . .
  22. Schweikert, G. mGene: accurate SVM-based gene finding with an application to nematode genomes Genome Res. 19, 2133-2143 (2009) .
    • . . . We integrated read alignments21 with sequence-based gene predictions22 by using mGene.ngs (Supplementary Information sections 9–10.3, and Supplementary Fig. 19) . . .
    • . . . Comparison of Col-0 de novo predictions with TAIR10 annotations (Supplementary Table 16) showed that these predictions are more accurate (transcript F-score 65.2%) than using the genome sequence (mGene22, 59.6%) or RNA-seq alignments alone (Cufflinks23, 37.5%; Supplementary Table 17) . . .
  23. Trapnell, C. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation Nature Biotechnol. 28, 511-515 (2010) .
    • . . . Comparison of Col-0 de novo predictions with TAIR10 annotations (Supplementary Table 16) showed that these predictions are more accurate (transcript F-score 65.2%) than using the genome sequence (mGene22, 59.6%) or RNA-seq alignments alone (Cufflinks23, 37.5%; Supplementary Table 17) . . .
  24. Hu, T. T. The Arabidopsis lyrata genome sequence and the basis of rapid genome size change Nature Genet. 43, 476-481 (2011) .
    • . . . As expected, variation between A. thaliana and its congener A. lyrata24 exceeds that observed among A. thaliana accessions (Fig. 2c and Supplementary Fig. 23) . . .
  25. Silverstein, K. A.; Graham, M. A.; Paape, T. D.; VandenBosch, K. A. Genome organization of more than 300 defensin-like genes in Arabidopsis Plant Physiol. 138, 600-610 (2005) .
    • . . . F-box and defensin-like genes implicated in diverse processes including defence25, 26 were also highly variable . . .
    • . . . F-box and defensin-like genes were exceptional in that expression was restricted in a minority of genes (41% and 12%, respectively; Fig. 4b), perhaps reflecting tissue-specific or environment-specific expression25, 37 . . .
  26. Gagne, J. M.; Downes, B. P.; Shiu, S. H.; Durski, A. M.; Vierstra, R. D. The F-box subunit of the SCF E3 complex is encoded by a diverse superfamily of genes in Arabidopsis Proc. Natl Acad. Sci. USA 99, 11519-11524 (2002) .
    • . . . F-box and defensin-like genes implicated in diverse processes including defence25, 26 were also highly variable . . .
  27. Anders, S.; Huber, W. Differential expression analysis for sequence count data Genome Biol. 11, R106 (2010) .
    • . . . In total, 75% (20,550) of protein-coding genes (and 21% of non-coding RNAs and 21% of pseudogenes) were expressed in at least one accession (false discovery rate (FDR) 5%), and 46% (9,360) of expressed protein-coding genes were differentially expressed between at least one pair of accessions27 (Fig. 3a; FDR 5%, Supplementary Information section 11) . . .
  28. Keurentjes, J. J. Regulatory network construction in Arabidopsis by using genome-wide gene expression quantitative trait loci Proc. Natl Acad. Sci. USA 104, 1708-1713 (2007) .
    • . . . Our results corroborate the general findings28, 29, 30, 31 of extensive cis regulation of gene expression in A. thaliana . . .
  29. Plantegenet, S. Comprehensive analysis of Arabidopsis expression level polymorphisms with simple inheritance Mol. Syst. Biol. 5, 242 (2009) .
    • . . . Our results corroborate the general findings28, 29, 30, 31 of extensive cis regulation of gene expression in A. thaliana . . .
    • . . . Copy-number and structural variants were associated with expression in 3% (240) of differentially expressed genes, including 45% (64 out of 142) of genes with more than 100-fold differences (Fig. 3b), consistent with array studies29. . . .
  30. West, M. A. Global eQTL mapping reveals the complex genetic architecture of transcript-level variation in Arabidopsis Genetics 175, 1441-1450 (2007) .
    • . . . Our results corroborate the general findings28, 29, 30, 31 of extensive cis regulation of gene expression in A. thaliana . . .
  31. Zhang, X.; Cal, A. J.; Borevitz, J. O. Genetic architecture of regulatory variation in Arabidopsis thaliana Genome Res. 21, 725-733 (2011) .
    • . . . Our results corroborate the general findings28, 29, 30, 31 of extensive cis regulation of gene expression in A. thaliana . . .
  32. Howe, G. A.; Jander, G. Plant immunity to insect herbivores Annu. Rev. Plant Biol. 59, 41-66 (2008) .
    • . . . Seventeen of the 18 GO classifications that were enriched for differential expression (P < 10−3) concerned response to the biotic environment, including pathogen defence and the production of glucosinolates32 to deter herbivores (Supplementary Table 24) . . .
  33. Kaufmann, K.; Melzer, R.; Theissen, G. MIKC-type MADS-domain proteins: structural modularity, protein interactions and network evolution in land plants Gene 347, 183-198 (2005) .
    • . . . The type II MADS box transcription factor family33 showed striking expression polymorphisms (Fig. 4b–d), including for the FLOWERING LOCUS C (FLC)34 and MADS AFFECTING FLOWERING (MAF) genes35 . . .
  34. Sheldon, C. C. The FLF MADS box gene: a repressor of flowering in Arabidopsis regulated by vernalization and methylation Plant Cell 11, 445-458 (1999) .
    • . . . The type II MADS box transcription factor family33 showed striking expression polymorphisms (Fig. 4b–d), including for the FLOWERING LOCUS C (FLC)34 and MADS AFFECTING FLOWERING (MAF) genes35 . . .
  35. Ratcliffe, O. J.; Kumimoto, R. W.; Wong, B. J.; Riechmann, J. L. Analysis of the Arabidopsis MADS AFFECTING FLOWERING gene family: MAF2 prevents vernalization by short periods of cold Plant Cell 15, 1159-1169 (2003) .
    • . . . The type II MADS box transcription factor family33 showed striking expression polymorphisms (Fig. 4b–d), including for the FLOWERING LOCUS C (FLC)34 and MADS AFFECTING FLOWERING (MAF) genes35 . . .
  36. Lempe, J. Diversity of flowering responses in wild Arabidopsis thaliana strains PLoS Genet. 1, 109-118 (2005) .
    • . . . FLC, a floral inhibitor expressed highly in accessions that require prolonged cold (vernalization) to flower36, varied more than 400-fold (Supplementary Fig. 42) . . .
  37. Schmid, M. A gene expression map of Arabidopsis thaliana development Nature Genet. 37, 501-506 (2005) .
    • . . . F-box and defensin-like genes were exceptional in that expression was restricted in a minority of genes (41% and 12%, respectively; Fig. 4b), perhaps reflecting tissue-specific or environment-specific expression25, 37 . . .