INTRODUCTION
Plants existed long before humans and animals appeared. Through intelligence and trial-and-error learning, humans gradually developed tools for survival, from the crudest technology of early human existence to the ever-evolving sophistication of the post-modern era. Alongside this technological evolution came agricultural evolution, which is arguably of greater significance: to maintain life (i.e., society), food is essential, and it comes mainly from plants [1]. Humans learned the value of plants and, through experience, selected those that are beneficial in a process called domestication [2]. Plants whose economic importance leads to their cultivation are generally called crops [3]. Crops are valued for their relevance to food, medicine, materials, industry, landscape, etc. To enhance the quality and yield of crops, breeders and scientists have worked hard to produce methods that address these
1
Laboratory of Functional Crop Genomics and Biotechnology, Department of Plant Science, College of Agriculture and Life Sciences, Seoul National University, 151742 Seoul, Korea Plant Biotechnology Institute, Department of Life Science, Sahmyook University, 139-742 Seoul, Korea Email: newaminal@snu.ac.kr
2
objectives. These are logically achieved by properly understanding the anatomy, physiology, genetics, and ultimately all aspects of the plant mechanisms responsible for growth and development, as well as the environmental factors that affect them. In short, the more we know about a plant and the intricate interplay of the biological factors that limit or promote its optimal performance, the easier it is to manipulate particular parameters to obtain the desired phenotype. We have come a long way in understanding the biological factors and mechanisms that contribute to plant growth and development, from identifying the hereditary molecules to the isolation of the first gene [4], to the recent studies of the genome, the transcriptome, the proteome, and the metabolome. Despite these astonishing advances in the study of plants, we still have a long way to go in using this vast amount of information to improve human health and lifestyle and to address recent international concerns about global warming.
The concerted efforts of scientists from various fields have contributed enormously to the understanding of the plant genome, its structure, and its function. Unlocking the genomic DNA sequence and understanding the interplay of DNA and other biomolecules have a profound impact on downstream research and applications of this information. The downstream applications of a genome sequence are so wide-ranging that attempting to enumerate them risks limiting its possibilities. Nevertheless, some of the apparent direct and indirect contributions of the genome sequence to the scientific community include (i) access to the relatively complete gene catalogue of a species, the regulatory elements that control gene function, and a foundation for understanding genome variation; (ii) understanding the structure, function, and evolution of organisms; (iii) understanding biochemical pathways; (iv) development of molecular markers to speed up genetic analysis, gene discovery, and breeding programs for crop improvement; and (v) a framework for further structural and functional genomics studies of model plants, essential food crops, animal feed, and energy crops. To date, about 26 plant genomes have been sequenced [5]. Which plant genome is sequenced first is influenced by several factors, such as sequencing cost, genome size, and genome complexity. These factors weigh more heavily in the decision than the direct economic significance of the species, as exemplified by Zea mays L., which could have been sequenced right after Arabidopsis and rice [6] but was instead sequenced later, after Vitis
vinifera [7] and Populus trichocarpa [8], due to the huge amount of repetitive elements in
its genome [9]. These highly repetitive elements make genome assembly difficult by challenging computational accuracy [9], especially with next-generation sequencing (NGS) technologies that produce short reads [10]. However, scientists have developed approaches that combine long reads (Sanger sequencing) with NGS reads to produce more reliable results [11]. To date, several important crop species have been sequenced using whole genome shotgun (WGS) or BAC-by-BAC sequencing approaches (Table 1), and this number is increasing dramatically as more sophisticated sequencing technologies and bioinformatics tools are refined.
With the increasing knowledge of the plant genome, which was greatly spurred by the completion of the genome sequence of Arabidopsis thaliana in 2000 [5], has come the incremental evolution of DNA sequencing technologies. The Sanger method dominated the DNA sequencing industry for nearly two decades and contributed greatly to the sequencing of many genomes, including the monumental completion of the human
Table 1. Overview of plant genomes that have been sequenced (Adapted from Trends in Plant Science February 2011, Vol. 16, No. 2 and List of sequenced eukaryotic genomes. (2012, April 20). In Wikipedia, April 29, 2012)
Organism* Dicots Relevance Genome (Mb) 207 Chrom. no. (n) 8 Predicted Genes 32,670 Sequencing strategy WGS Organization Year of completion 2011
Arabidopsis lyrata
Model plant
Arabidopsis thaliana
Model plant
119
BAC-by-BAC
2000
Brassica rapa Cannabis sativa Cucumis sativus Fragaria vesca Glycine max Jatropha curcas Lotus japonicus Malus domestica Medicago truncatula Populus trichocarpa Ricinus communis Solanum tuberosum Thellungiella parvula Theobroma cacao
Crop and model organism Hemp and marijuana production Vegetable crop Fruit crop Protein and oil crop Biodiesel crop Model legume Fruit tree Model organism for legume biology Carbon sequestration, model tree, timber Oilseed crop Crop plant Arabidopsis relative with high salt tolerance Flavoring crop
284 534 367 280 1,100 410 417 927 375 550 320 844 140 430
10 10 7 7 20 11
WGS WGS WGS WGS WGS BAC-by-BAC WGS BAC-by-BAC BAC-by-BAC BAC-by-BAC WGS WGS WGS WGS WGS
multicenter collaboration multicenter collaboration Chinese Academy of Agricultural Sciences, Beijing multicenter collaboration Purdue University Kazusa DNA Research Institute multicenter collaboration International consortium multicenter collaboration The International Poplar Genome Consortium multicenter collaboration multicenter collaboration multicenter collaboration CIRAD, multiple institutions (separate project, Mars Inc., USDA)
2011 2011 2009 2011 2010 2010 2008 2010 2011 2006 2010 2011 2011 2010
19
45,555 31,237
12
39,031 28,901
10
28,798
Vitis vinifera
Fruit crop
490
30,434
WGS
2007
Monocots
272 420
5 12
26,500 32-50,000
WGS WGS
The International Brachypodium Initiative Beijing Genomics Institute, Zhejiang University and the Chinese Academy of Sciences
2010 2002
Crop and model organism West African species of cultivated rice that was domesticated independently of Asian rice.
466 316
12 12
58,000 ND
2002 2010
36 10 10
ND: no data *species in bold font are discussed in detail in this review
genome [12]. However, the limitations of this technology (mainly low throughput and high cost) have fueled the need for more advanced sequencing technologies that can produce enormous amounts of sequence data in less time and at lower cost. The result is a shift from the traditional first-generation technology of automated Sanger sequencing to the more advanced next-generation sequencing [12]. Recent development and refinement of these technologies have initiated third-generation sequencing, with high accuracy, longer read lengths, very high coverage, and fast data acquisition [13].
Here, I review the genome sequencing results of three recently sequenced crops, Cucumis sativus, Jatropha curcas, and Theobroma cacao, covering the sequencing strategies used, genomic structure and arrangement, novel biosynthetic pathways, species-specific genes, implications for species evolution, and other areas of functional genomics.
REVIEWED GENOMES
Theobroma cacao L. (cocoa tree), the Criollo variety, is an important crop for producing chocolate products. However, fine-flavor cocoa accounts for less than 5% of global cocoa production, mainly because fine-flavor cocoa varieties are susceptible to fungal, oomycete, and viral diseases and to insect pests. Breeding of improved Criollo varieties is needed for sustained production of fine-flavor cocoa.
Despite the great economic significance of these crops, as partially mentioned above, their genomic resources are limited or very limited, which hinders research toward their respective objectives. Unlocking their genomic sequences will undoubtedly open new frontiers for understanding their structure, function, and the control of desired traits. Consequently, independent organizations have initiated and completed the sequencing of their genomes (Table 1).
STRUCTURAL GENOMICS
The whole genome shotgun (WGS) sequencing approach was used to sequence the three genomes, with different platforms and methods used to improve the quality of the assembled sequences. WGS combining Sanger and NGS (GA by Illumina) sequencing was used for the cucumber genome. This combination produced a longer N50 for both contigs and scaffolds than assemblies of the reads from either sequencing strategy alone (Table 2). For J. curcas, a combination of BAC-end sequencing and shotgun sequencing was employed, using the conventional Sanger method and NGS (GS-FLX by Roche/454 and GA by Illumina). For T. cacao, the strategy was WGS incorporating Sanger and NGS platforms (GS-FLX by Roche/454 and GA by Illumina). Different software was used to assemble the genomic sequences of the three species (Table 3).

Table 2. Genome assembly statistics of C. sativus
Assembly | Contig N50 (kb) | Contig total (Mb) | Scaffold N50 (kb) | Scaffold total (Mb)
Sanger | 2.6 | 204 | 19 | 238
Illumina GA | 12.5 | 190 | 172 | 200
Sanger + Illumina GA | 19.8 | 226.5 | 1,140 | 243.5

REPORT | CROP GENOME ANALYSIS | VOLUME 01 | APRIL 2012

Table 3. Summary of information about the genome, sequencing strategy and genome assembly statistics of the three reviewed species
Feature | Jatropha curcas | Theobroma cacao | Cucumis sativus
Genome size (Mb) | ~410 | ~430 | ~367
Ploidy, chrom. no. | 2n=2x=22 | 2n=2x=20 | 2n=2x=14
Date of completion | 2010 | 2010 | 2009
Date of publication, journal | 2011, DNA Research | 2011, Nature Genetics | 2009, Nature Genetics
Sequencing/funding institution | Kazusa DNA Research Institute Foundation, Japan | The International Cocoa Genome Sequencing Consortium (ICGS), coordinated by CIRAD | Chinese Academy of Agricultural Sciences, Beijing, China
Sequencing strategy | BAC-by-BAC and WGS | WGS | WGS
Sequencing method, Sanger | for shotgun libraries and BAC ends | for BAC ends only | for BAC, plasmid, and fosmid sequencing
Sequencing method, NGS | GS-FLX (Roche, USA), GA II (Illumina, USA) | GS-FLX, GA II | GA II
Assembly program | PCAP.rep and MIRA | Newbler version 2.3, SOAP | RePS2
Total length of assembled genome | 285.9 Mb | 326.9 Mb | 243.5 Mb
Percent of genome covered by the assembly | ~70% (if based on ~410 Mb genome), ~75% (if based on ~380 Mb genome) | ~76% (based on genotype B97-61/B2, 430 Mb) | ~66% (based on Chinese long inbred line 9930)
Coverage depth of raw data | ND | 61.1x | Total: 72.2x
Gene space covered | 95% | 97.8% | 96.8%
Sequence anchored to chromosomes | ND | 67% | 72.8%
Contig: total number | 120,586 | 25,912 | 62,410
Contig: total length (Mb) | 276.7 | 291.4 | 226.4
Contig: average length (kb) | 2.3 | 11.2 | ND
Contig: longest (kb) | 29.7 | 190 | ND
Contig: N50 (kb) | 3.8 | 19.8 | 19.8
Scaffold: total number | 15,300 | 4,792 | 47,837
Scaffold: total length (Mb) | 129.3 | 326.9 | 243.5
Scaffold: average length (kb) | 8.4 | 68.2 | ND
Scaffold: longest (kb) | 56 | 3,145 | ND
Scaffold: N50 (kb) | ND | 473.8 | 1,144
Reference | [15] | [18] | [11]
ND: no data

The combination of conventional Sanger and NGS strategies proved superior to using either technology alone: each compensates for the shortcomings of the other, allowing high-quality sequences to be obtained at lower cost and in a shorter time.
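The N50 values used to compare assemblies in Table 2 can be illustrated with a short calculation. The sketch below uses the common definition of N50 (the length of the contig or scaffold at which the running total of lengths, sorted from longest to shortest, first reaches half of the assembly). The function name and the toy lengths are hypothetical and are not taken from any of the reviewed assemblies.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)   # longest first
    half_total = sum(ordered) / 2             # half of the assembled length
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical contig lengths in kb, for illustration only.
toy_contigs_kb = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs_kb))  # 70: 80 + 70 = 150 reaches half of the 300 kb total
```

A longer N50 therefore means that half of the assembled sequence sits in larger pieces, which is why the combined Sanger and Illumina GA assembly in Table 2 is regarded as the more contiguous one.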
Linkage analysis
The consensus genetic map of T. cacao was created using two mapping populations, while 77 recombinant inbred lines from an inter-subspecific cross between Gy14 and PI183967 were used for cucumber. No data were provided on the linkage analysis of J. curcas. About the same percentage of the molecular markers was aligned to the newly assembled genomic sequences of C. sativus and T. cacao (Table 4). Interestingly, in cucumber, comparison of the genetic and physical maps revealed regions of recombination suppression: two 10-Mb regions at either end of chromosome 4, a 20-Mb region on chromosome 5, and an 8-Mb region on chromosome 7 (Fig. 1a). Further FISH analysis revealed a segmental inversion on chromosome 5 between Gy14 and PI183967 (Fig. 1b). This chromosomal inversion helps explain the recombination suppression in these regions and adds insight into cucumber evolution during domestication.
Gy14 is a North American processing market-type cucumber cultivar. PI183967 is an accession of C. sativus var. hardwickii originating from India.
FISH: Fluorescence in situ hybridization is a cytogenetic technique used to detect and localize the presence or absence of specific DNA sequences on chromosomes.
Table 4. Summary of the linkage analysis
Species | Mapping population | Total length (cM) | No. mol. markers | Aligned markers | Anchored sequences to chromosomes (%)
T. cacao | 2 | 750.6 | 1,259 | 1,192 (94%) | 67
C. sativus | 77 | 581 | 1,885 | 1,763 (93.5%) | 72.8
Figure 1. The integrated genetic and physical maps of cucumber. (a) Genetic distance vs. physical distance of the seven cucumber chromosomes. The brackets denote the regions of recombination suppression. (b) Detection of segmental inversion on chromosome 5 between Gy14 and PI183967 through FISH (12-7 and 12-2 are fosmid clones used as probes). Bar = 5 µm.
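Recombination suppression of the kind plotted in Fig. 1a shows up as a stretch where genetic distance barely increases while physical distance keeps growing. The sketch below is one simple way to flag such stretches, assuming a list of markers with known physical (Mb) and genetic (cM) positions. The marker coordinates, the function names, and the 0.5 cM/Mb threshold are all hypothetical, and this is not the procedure used in the cucumber study.

```python
def local_recombination_rates(markers):
    """Compute cM/Mb between adjacent markers.

    markers: list of (physical_mb, genetic_cm) tuples sorted by physical position.
    Returns a list of (start_mb, end_mb, rate_cm_per_mb).
    """
    rates = []
    for (p1, g1), (p2, g2) in zip(markers, markers[1:]):
        span = p2 - p1
        if span <= 0:
            continue  # skip markers that share a physical position
        rates.append((p1, p2, (g2 - g1) / span))
    return rates


def suppressed_intervals(markers, threshold_cm_per_mb):
    """Return the physical intervals whose local rate falls below the threshold."""
    return [
        (start, end)
        for start, end, rate in local_recombination_rates(markers)
        if rate < threshold_cm_per_mb
    ]


# Hypothetical markers on one chromosome: (physical position in Mb, genetic position in cM).
toy_markers = [(0, 0.0), (2, 8.0), (4, 15.0), (14, 16.0), (16, 24.0)]
print(suppressed_intervals(toy_markers, threshold_cm_per_mb=0.5))
```

In this toy example the 10-Mb interval between 4 Mb and 14 Mb gains only 1 cM, so it is the only interval reported.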
Repetitive sequences
In proportion to genome size, J. curcas has the highest share of transposable elements (36.6% of its 410 Mb genome), followed by C. sativus (24% of its 367 Mb genome) and T. cacao (24% of its 430 Mb genome) (Table 5). In all three species, Class I transposable elements (retrotransposons) represent the majority of the repeat sequences in the genome. In J. curcas,
Table 5. Summary of the transposable elements identified in the three species
C. sativus
Number of elements Class I LTR: copia LTR: gypsy LTR: other Others Class II Others TOTAL 20,119 16,972 135,464 266,232 1.75 1.24 11.64 24.01 119,339 (91,109)* Fraction of the genome (%) 12.16 (10.43) Number of elements 113,047 31,740 67,658 13,454 195 25,977 28,069 152,805
J. curcas
Fraction of the genome (%) 29.91 8.03 19.6 2.23 0.05 2.04 5.22 36.6 19260 21,882 ND 67,575 Number of elements 49,942 18,060 12,622
T. cacao
Fraction of the genome (%) ND
ND ND ~24
there are more gypsy-type than copia-type retrotransposons, the opposite of the pattern in T. cacao. In T. cacao, a copia-like LTR retrotransposon named Gaucho, 11,297 bp long and repeated approximately 1,100 times, was identified and hybridized through FISH; it was found to occupy most of the interstitial regions (regions between centromeres and telomeres) of the chromosome arms (Fig. 2b). Additionally, a 212-bp repeat named ThCen was confirmed by FISH analysis to be a centromere-specific repeat (Fig. 2a) that may have contributed to the genome size variation of T. cacao.
Figure 2. FISH analysis of T. cacao repetitive sequences. (a) T. cacao chromosomes counterstained with DAPI (blue) with ThCen (red) used as probe. (b) hybridization of ThCen (red) and Gaucho LTR retrotransposon probes (green).
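The repeat figures quoted above can be turned into approximate sequence amounts with simple arithmetic. The sketch below converts the reported transposable-element percentages into megabases and estimates the share of the T. cacao genome occupied by Gaucho. It assumes that every Gaucho copy is full-length, which the source does not state, so the result is only an upper-bound illustration. The function and variable names are hypothetical.

```python
def repeat_family_fraction(element_bp, copy_number, genome_mb):
    """Fraction of the genome occupied by a repeat family, assuming full-length copies."""
    total_bp = element_bp * copy_number
    return total_bp / (genome_mb * 1_000_000)


# Genome sizes (Mb) and transposable-element percentages as quoted in the text.
genomes = {
    "J. curcas": (410, 36.6),
    "C. sativus": (367, 24.0),
    "T. cacao": (430, 24.0),
}
for species, (size_mb, te_percent) in genomes.items():
    te_mb = size_mb * te_percent / 100
    print(f"{species}: about {te_mb:.0f} Mb of transposable elements")

# Gaucho: 11,297 bp, repeated approximately 1,100 times, in a 430 Mb genome.
gaucho = repeat_family_fraction(11_297, 1_100, 430)
print(f"Gaucho: about {gaucho:.1%} of the T. cacao genome")  # about 2.9%
```

Under the full-length assumption, Gaucho alone would account for roughly 12.4 Mb of sequence.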
Gene content
Not all RNA-encoding genes were reported for the three species (Table 6). Due to the limitations of the sequencing methods, the number of genes, especially ribosomal RNA-encoding genes, may be largely underestimated. In T. cacao, only six fragments of rRNA genes were recovered, far fewer than the average number of repeats found in most eukaryotes, as can be observed through FISH analysis using rDNA as probes [16]. MicroRNAs (miRNAs) are short non-coding RNAs that regulate gene expression transcriptionally or post-transcriptionally. Many miRNAs have roles in plant development and stress response. In T. cacao, most of the predicted miRNAs have homologous transcription factor sequences, suggesting that miRNAs are major regulators of gene expression in T. cacao.
Three gene-prediction methods were used to identify protein-coding genes in the three species: ab initio prediction, cDNA-EST evidence, and homology searches, using gene-finder software and public databases (Table 7). Comparison of the gene families with other sequenced genomes yielded 682 T. cacao-specific and 4,362 C. sativus-specific gene families, while 1,529 genes were found to be specific to the family Euphorbiaceae, to which J. curcas belongs.

Table 6. Summary of RNA-coding genes in the three species
Species | rRNA | tRNA | miRNA | snoRNA | snRNA
C. sativus | 292 | 699 | 171 | 238 | 192
J. curcas | ND | 597 | ND | ND | 65
T. cacao | 6 | 473 | 83 | ND | ND
Table 7. Summary of the gene-prediction analysis of the three sequenced genomes
Species | Gene-prediction methods | No. of predicted genes | Mean coding sequence size (bp) | Mean exon size (bp) | Mean intron size (bp) | Mean exons per gene
C. sativus | ab initio, homology search, cDNA-EST | 26,682 | 1,046 | 238 | 483 | 4.39
J. curcas | ab initio, homology search, cDNA-EST | 40,929 | 3,064 | 227 | 356 | ND
T. cacao | ab initio, homology search, cDNA-EST | 28,798 | 3,346 | 231 | 6,319 | 5.03
Protein-coding region search programs: GlimmerHMM, Genscan, Augustus, BGF, SNAP, GeneMark.hmm, EUGene, SpliceMachine
Similarity search databases: Arabidopsis, Papaya, Poplar, Grapevine, Rice, UniRef, TrEMBL
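The idea behind counting species-specific gene families can be pictured as a set difference: a family is specific to one species if no other compared genome contains a member of it. The sketch below illustrates this with made-up family identifiers. The function name, the species keys, and the family IDs are hypothetical, and the clustering step that assigns genes to families in the original studies is not reproduced here.

```python
def species_specific_families(families_by_species, target):
    """Return the families found in `target` and in no other species."""
    others = set()
    for species, families in families_by_species.items():
        if species != target:
            others |= set(families)
    return set(families_by_species[target]) - others


# Hypothetical gene-family identifiers, for illustration only.
toy_families = {
    "T. cacao": {"FAM1", "FAM2", "FAM7"},
    "C. sativus": {"FAM1", "FAM3", "FAM8", "FAM9"},
    "A. thaliana": {"FAM1", "FAM2", "FAM3"},
}
print(sorted(species_specific_families(toy_families, "T. cacao")))
print(sorted(species_specific_families(toy_families, "C. sativus")))
```

The counts obtained this way depend on which genomes are included in the comparison, so figures such as 682 and 4,362 are specific to the reference sets used in each study.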
FUNCTIONAL GENOMICS
by the expansion of its lipoxygenase (LOX) pathways, which produce short-chain aldehydes and alcohols involved in plant defense mechanisms. The eukaryotic translation initiation factors confer recessive resistance to plant viral infections. Three EIF4E and EIF4G genes, which encode the eIF4E and eIF4G proteins, respectively, have been identified in C.
cacao genome underwent 11 major chromosome fusions from the 21 chromosomes of the paleo-hexaploid ancestor to produce the present 10 chromosomes (Fig. 3). On the other hand, collinear gene-order analysis of C. sativus revealed no recent whole-genome duplication (WGD), only some segmental duplication events. Additionally, comparative genomics between C. sativus and its close relative C. melo (melon) suggests that five of the seven present chromosomes of C. sativus (chromosomes 1, 2, 3, 5, and 6) each arose from a fusion of two of ten ancestral chromosomes (Fig. 4). In T. cacao, seven blocks of duplicated genes were characterized after alignment of its gene models onto its genome (Fig. 5).
Figure 3. Evolutionary model of T. cacao. The eudicot ancestor chromosomes are presented in seven colors. The several lineage-specific shuffling events have shaped the present eudicot genomes. R: rounds of WGD, F: chromosomal fusions.
Although attributing genomic evolutionary history to common ancestry may be enticing, perhaps because chromosomal segments are rearranged after breeding, as in the case of C. sativus, I suggest that an alternative approach be considered. Given that nobody has lived a thousand years (let alone a million years) and that homologous segments do not always mean common ancestry but may instead mean common design and function, I recommend further unbiased research on genomic history to uncover the mechanisms that underlie the control of genomic fusion, inversion, translocation, etc., and to test how far these events limit the sustenance of life. Do these similarities really mean common ancestry, or common functions and/or regulatory mechanisms?
Figure 4. Comparative genomics between melon and cucumber, showing chromosomes 1, 2, 3, 5, and 6 of cucumber largely syntenic to two chromosomes of melon.
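A pattern like the one in Figure 4, where one cucumber chromosome is largely syntenic to two melon chromosomes, can be summarized by counting where the orthologs of each chromosome's genes fall in the other genome. The sketch below is a minimal illustration of that counting step. The chromosome labels, the ortholog pairs, and the 25% cutoff are hypothetical, and real synteny analyses also check gene order, which is omitted here.

```python
from collections import Counter, defaultdict


def syntenic_partners(ortholog_pairs, min_share=0.25):
    """Map each query chromosome to the reference chromosomes holding
    at least `min_share` of its orthologs.

    ortholog_pairs: iterable of (query_chromosome, reference_chromosome),
    one entry per orthologous gene pair.
    """
    counts = defaultdict(Counter)
    for query_chr, ref_chr in ortholog_pairs:
        counts[query_chr][ref_chr] += 1

    partners = {}
    for query_chr, ref_counts in counts.items():
        total = sum(ref_counts.values())
        partners[query_chr] = sorted(
            ref for ref, n in ref_counts.items() if n / total >= min_share
        )
    return partners


# Hypothetical ortholog pairs (cucumber chromosome, melon chromosome).
toy_pairs = (
    [("cuc1", "mel2")] * 6
    + [("cuc1", "mel12")] * 5
    + [("cuc1", "mel7")] * 1
    + [("cuc4", "mel7")] * 8
)
print(syntenic_partners(toy_pairs))
```

In this toy data set, "cuc1" is matched to two melon chromosomes, the signature expected of a fused chromosome, while "cuc4" is matched to only one.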
also help in the advancement of phylogenetic relationship studies. Collectively, syntenic relationships among dicots, and among plants in a broader sense, will help in gene prediction and eventually aid in understanding the relationship between sequence similarity and function, and the limitations of the theory of ancestry as the sole explanation for sequence similarity.
CONCLUSION
Recent advances in sequencing technologies have revolutionized our experimental approaches to the study of plants. They have also shifted major scientific questions from "How do we sequence a genome?" to "Which platform is best for sequencing a particular genome of interest?" They have allowed scientists to study crops holistically at the genomic, transcriptomic, proteomic, and metabolomic levels, and they will aid crop improvement and the understanding of phylogenies and metabolic pathways, among others. The direct and indirect consequences of genome sequencing ultimately boil down to the economic and lifestyle improvement of people, which would hopefully be well distributed globally. The doors these advanced technologies open to science are limitless. The main goal should be to use these technologies for the benefit of both humans and Mother Earth, not solely for humans.
REFERENCES (Those in bold font refers to the main articles for the three species reviewed here.) 1. 2. 3. 4. 5. 6. Yang TS. 2012. Plant and CultureAnother Interpretation of Human History. Journal of Jishou