Professional Documents
Culture Documents
10.1128/AEM.02772-10.
2011, 77(11):3846. DOI: Appl. Environ. Microbiol.
Stearns, Gabriel Moreno-Hagelsieb and Josh D. Neufeld
Andrea K. Bartram, Michael D. J. Lynch, Jennifer C.
Andrea K. Bartram,
1
Michael D. J. Lynch,
2
Jennifer C. Stearns,
1
Gabriel Moreno-Hagelsieb,
2
and Josh D. Neufeld
1
*
Department of Biology, University of Waterloo, Waterloo, Ontario, Canada N2L 3G1,
1
and Department of Biology,
Wilfrid Laurier University, Waterloo, Ontario, Canada N2L 3C5
2
Received 25 November 2010/Accepted 24 March 2011
Microbial communities host unparalleled taxonomic diversity. Adequate characterization of environmental
and host-associated samples remains a challenge for microbiologists, despite the advent of 16S rRNA gene
sequencing. In order to increase the depth of sampling for diverse bacterial communities, we developed a
method for sequencing and assembling millions of paired-end reads from the 16S rRNA gene (spanning the V3
region; 200 nucleotides) by using an Illumina genome analyzer. To conrm reproducibility and to identify a
suitable computational pipeline for data analysis, sequence libraries were prepared in duplicate for both a
dened mixture of DNAs from known cultured bacterial isolates (>1 million postassembly sequences) and an
Arctic tundra soil sample (>6 million postassembly sequences). The Illumina 16S rRNA gene libraries
represent a substantial increase in number of sequences over all extant next-generation sequencing approaches
(e.g., 454 pyrosequencing), while the assembly of paired-end 125-base reads offers a methodological advantage
by incorporating an initial quality control step for each 16S rRNA gene sequence. This method incorporates
indexed primers to enable the characterization of multiple microbial communities in a single ow cell lane,
may be modied readily to target other variable regions or genes, and demonstrates unprecedented and
economical access to DNAs from organisms that exist at low relative abundances.
The composition, organization, and spatial distribution of
environmental microbial communities are still poorly under-
stood. Enormous progress in method development has begun
to enable the study of alpha, beta, and gamma diversity, but a
substantial limitation remains: the coverage of most sequenc-
ing methods remains insufcient to analyze single samples
comprehensively or to conduct eld-scale comparisons of the
microbial diversity in most environments. Methodology is still
required to provide (i) high sample throughput, (ii) informa-
tion on the microbial species (or phylotypes) present at both
high and low relative abundances, and (iii) affordability for the
average research laboratory. Although comprehensive meta-
genomic analysis could eventually be used for microbial com-
munity proling (sampling both abundant and rare popula-
tions), this is not yet feasible for most environmental samples
due to enormous computational and sequencing limitations.
Instead, an alternative community proling approach involves
surveying distributions of the small subunit rRNA gene due to
its ubiquity across all domains of life (16S rRNA in the Bacteria
and Archaea and 18S rRNA in the Eukarya) (27). Additionally,
the 16S rRNA gene provides valuable phylogenetic informa-
tion (6) for comparison to database collections. For many
years, the use of Sanger sequencing to sequence collected 16S
rRNA genes from environmental samples has revealed that
sample sizes, and thus coverage, afforded by Sanger sequenc-
ing have been insufcient to adequately describe and compare
microbial communities (6, 24). The advent of serial analysis of
ribosomal sequence tags (SARST) (17, 26, 33) and 454 pyro-
sequencing provided a major advance by enabling the collec-
tion of thousands of sequences from multiple samples. These
approaches have provided a new window into the diversity and
composition of microbial communities (11, 25, 31), increased
sample throughput by use of indexing (1, 10, 19), and sparked
interest in elucidating the members of the rare biosphere,
which are microorganisms that exist at low relative abundances
(23, 28, 31). To further reduce the costs of sequencing, the
Illumina platform was recently used to generate data sets of
unprecedented size (3, 8, 20) that surpass 454 pyrosequencing
data sets by over an order of magnitude in number of se-
quences per unit cost (30). Initial Illumina-based methods for
sequencing 16S rRNA genes have been limited by 101-base
sequence reads (3, 8, 12, 20, 34) and/or an inability to leverage
the paired-end approach that would allow for assembly of
reads and reduced sequencing errors (3, 4, 20).
Here we present a novel application-ready method for gen-
erating multimillion-sequence data sets at a fraction of the cost
of Sanger or 454 pyrosequencing. Without factoring sample
preparation costs, Illumina is currently 50- and 12,000-fold
less expensive per sequenced megabase than pyrosequencing
(i.e., 454 pyrosequencing) and Sanger sequencing, respectively
(S. Pereira, The Centre for Applied Genomics, Toronto, On-
tario, Canada, personal communication). This method lever-
ages the paired-end Illumina sequencing platforms (i.e., GAIIx
genome analyzer, Hiseq 2000 platform, and MiSeq genome
analyzer) to assemble 200-base hypervariable region 3 (V3)
* Corresponding author. Mailing address: Department of Biology,
University of Waterloo, Waterloo, Ontario, Canada N2L 3G1. Phone:
(519) 888-4567. Fax: (519) 746-0614. E-mail: jneufeld@uwaterloo.ca.
Supplemental material for this article may be found at http://aem
.asm.org/.