DNA Barcoding: Bioinformatics Workflows for Beginners

John-James Wilson, University of South Wales, Pontypridd, United Kingdom and Naresuan University, Phitsanulok, Thailand
Kong-Wah Sing, Kunming Institute of Zoology, Chinese Academy of Sciences, Yunnan, P.R. China
Narong Jaturas, Naresuan University, Phitsanulok, Thailand
© 2018 Elsevier Inc. All rights reserved.

Introduction

Just as species show differences in their morphology, ecology, and behavior, they also show differences in their DNA sequences
(Wilson et al., 2017). “DNA barcoding,” used in a broad sense, refers to the use of short, standardized DNA sequences as markers
for the recognition of species. When used more precisely, “DNA barcoding” refers to the technique of sequencing a short fragment
of the DNA sequence of the mitochondrial cytochrome c oxidase subunit I (COI) gene, the animal “DNA barcode,” from a
taxonomically unknown specimen and performing comparisons with a library of DNA barcodes from taxonomically known
specimens to establish a taxonomic identification.
DNA barcoding requires basic molecular biology methods (which pre-date the term DNA barcoding) to extract and amplify the
DNA barcode sequence fragment from the unknown specimen (Fig. 1). Generally, this sample of amplified DNA is then passed to
commercial companies for inexpensive Sanger sequencing, or, particularly in the case of mixed, bulk samples, for high-throughput
(next generation) sequencing. These molecular biology methods are not covered in this article, which focuses on bioinformatics
workflows following the receipt of digital DNA sequences from a sequencer (Fig. 1). For a step-by-step guide to the molecular
biology methods used for standard (animal) DNA barcoding see Wilson (2012) and for an approximation of costs in developing
countries see Sing et al. (2016). Brandon-Mong et al. (2015) provide an example of a method (bulk extraction and bulk PCR) for
use prior to high-throughput sequencing, i.e., DNA metabarcoding.

Fig. 1 The DNA barcoding workflow.

Encyclopedia of Bioinformatics and Computational Biology doi:10.1016/B978-0-12-809633-8.20468-8


The ultimate aim of a DNA barcoding bioinformatics workflow is to (1) produce a “clean” or “reliable” digital representation of
the DNA barcodes, and (2) use these DNA barcodes to obtain information about the taxonomy of the unknown specimen through
algorithms enabling DNA sequence comparisons, in conjunction with DNA sequence libraries (Fig. 1).

Background/Fundamentals

It is important to realize that full exploitation of DNA barcodes for species recognition will only be possible after assembling a
comprehensive library linking organisms (and Linnaean taxonomy) with their DNA barcodes (see Wilson et al., 2017 for an
assessment of DNA barcode library coverage for insects). With this in mind, in an effort to promote DNA barcoding research
and DNA barcode library building (e.g., Wilson et al., 2013), we have been organizing and facilitating DNA barcoding workshops
in Southeast Asia, a megadiverse region with relatively low DNA barcode coverage in public libraries (Wilson et al., 2016). We
have created a website, DNA Barcoding Workshops (DBW), as a companion resource to our workshops and much of the material
in this article derives from our experience running these workshops.
This article is intended as a guide for absolute beginners to DNA barcoding, and provides step-by-step instructions for basic
bioinformatics workflows following receipt of DNA sequences from a DNA sequencing facility.

Application

Getting Started With DNA Barcoding: The Barcode of Life Datasystems


The Barcode of Life Datasystems (BOLD) (Ratnasingham and Hebert, 2013) is a cloud-based data storage and analysis platform
developed and maintained by the Centre for Biodiversity Genomics (CBG) in Canada. It consists of four main modules, a data
portal, an educational portal, a registry of Barcode Index Numbers (BINs) (putative species), and a data collection and analysis
workbench. BOLD is the recommended place to manage your DNA barcoding research. Please see the BOLD Handbook available
from the BOLD website (under “Resources”) and the DNA Barcoding Workshops (DBW) website for step-by-step instructions for:
(i) creating an account on BOLD; (ii) preparing specimen data for BOLD; (iii) creating a project on BOLD; and (iv) adding records
and images to BOLD. You will need to have completed these steps in order to upload your DNA barcode sequences to BOLD and
to use the BINs system. Note that BOLD keeps all of your work (specimen records, sequences, etc.) private, accessible only to
yourself and your coworkers, until you choose to make the work public.

Editing Sanger Sequences


To learn more about the process of Sanger sequencing see the resources on the DNA Learning Center website from the Cold Spring
Harbor Laboratory. The outputs of Sanger sequencers are trace files (also known as chromatograms and electropherograms), digital
representations of DNA sequences. For a number of reasons, a trace may be “messy” rather than “clean”, so the first step is to
“edit” the sequence, a process during which “messy” parts of the sequence are removed and only “reliable” sequence is retained.

a) Usually your sequences will come back from the sequencing company by email as a zip file. An example zip file
containing 12 trace files (forward and reverse traces for each of 6 unknown specimens), representing the kind of zip file
you could receive from a sequencing company, is provided for download as Supplemental File 1.
b) Unpack (extract) the zip file to your desktop. This should create a regular folder on your desktop, which you can name Traces.
There are two sets of files for each sequence (e.g., NUMBEROFSAMPLE_F.ab1 and NUMBEROFSAMPLE_F.txt). The files you
are interested in have an extension .ab1 (e.g., NUMBEROFSAMPLE_F.ab1 and NUMBEROFSAMPLE_R.ab1). Delete the other
files in the folder.
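
For readers who prefer to script this housekeeping, the unpacking step can be sketched in Python. This is a minimal sketch, assuming the zip file follows the naming pattern described above (NUMBEROFSAMPLE_F.ab1, NUMBEROFSAMPLE_R.ab1); the function names and the Traces folder name are illustrative choices, not part of any sequencing company's tooling.

```python
import zipfile
from pathlib import Path


def extract_traces(zip_path, out_dir="Traces"):
    """Copy only the .ab1 trace files from the zip into out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    kept = []
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            name = Path(info.filename).name  # drop any folder prefix
            if name.lower().endswith(".ab1"):
                (out / name).write_bytes(zf.read(info))
                kept.append(name)
    return sorted(kept)


def group_by_direction(names):
    """Split names such as '001_F.ab1' into {direction: [sample numbers]}."""
    groups = {}
    for name in names:
        sample, _, direction = Path(name).stem.rpartition("_")
        groups.setdefault(direction, []).append(sample)
    return groups
```

Calling extract_traces on the downloaded zip file leaves only the .ab1 files in the Traces folder, and group_by_direction gives a quick check that every specimen has both a forward and a reverse trace.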

For sequence editing we recommend the program CODONCODE ALIGNER which is used at the Canadian Centre for DNA
Barcoding (CCDB). Information on CODONCODE ALIGNER, including a free trial version, can be found at the CodonCode
Corporation website. The following steps describe the process of Sanger sequence editing for multiple specimens whose DNA
barcodes were amplified using the primers LCO1490 and HCO2198 (see Wilson, 2012; Brandon-Mong et al., 2015). For practice
you can go through these steps using the example files in Supplemental File 1. For other primer sets adapt accordingly.
a) Open CodonCode Aligner and choose Create a new project and press OK.
b) Go to File > Import > Add Folder..., navigate to the desktop and select the folder of traces [which should be named Traces if you
followed the suggestion above]. Click Open > Import.
c) To see the files you just imported press ► beside the Unassembled Samples folder.
d) The .ab1 files should be of the form NUMBEROFSAMPLE_F.ab1 where the second part “F” refers to the direction, i.e. Forward.
e) Sort the files by quality by double-clicking on Quality. Any sequences that are of very poor quality (look for a big difference
between the sequence length and the quality score; a higher quality score is better) can be deleted by highlighting the
sequence and clicking Edit > Move to Trash.
f) Next we will group our sequences by direction for easy editing.
g) Make sure the Unassembled Samples folder is highlighted. Select the Contig menu and move the cursor over Advanced Assembly.
From the options that appear select Assemble in Groups.
h) A new window will appear. Click the button Define name parts...
i) There are two name parts to our file names (see above). The first part of our file names refers to the number of the sample and
for our purposes the option in the Meaning menu (first row) can be left as Clone. Since the sample number is followed by an
underscore, choose _ (underscore) in the Delimiter menu next to Clone (if it is not already selected).
j) For the second row choose Direction in the Meaning menu. We can ignore the Delimiter menu for the Direction part because
there is nothing following the direction in our file names.
k) Delete all additional name parts that may appear in the window (if any), and next click Preview... to check how CodonCode
Aligner is interpreting the sample names.
l) Click Close to exit the preview. Click OK to return to the Assemble in Groups window.
m) We first want to assemble our samples according to direction. Choose Direction in the Name part: dropdown menu. Then click
Assemble. You should now have two folders, one called F with the forward sequences and one called R with the reverse
sequences. [Note: if you only sent your PCR products for sequencing in one direction, i.e., with one primer, then you will only
have one folder.].
n) We will deal with the reverse sequences first. The first step is to reverse complement the sequences. Highlight the R folder,
select Edit > Reverse complement.
o) Next we need to cut the primer from the sequence. Double click the R folder to open it. For the reverse sequences, you need to
find the forward primer motif and delete it from the beginning of the consensus sequence at the bottom of the window. You
will find the primer around 30 nucleotides from the end of the raw sequence. For example, you would need to delete the
section of the sequence marked below in bold and everything to the left of it. Highlight it on the consensus sequence at the
bottom of the window and press the Backspace key on the keyboard.
←AAAGATATTGGAACATTATATTTTATTTTT...
p) Next go to the opposite end of the consensus, the far right. Delete the consensus sequence from the point where the sequence gets
messy. This will be apparent due to lots of green highlight. For example [it will not look exactly like this], delete the section marked
in bold and everything to the right of it. Highlight it on the consensus sequence at the bottom of the window and press the Delete
key on the keyboard. Close the window.
... TCTTTTTTTGACCCTGCTGGTGGAGGGTTTGGTAGGAGGATG-
q) Double click the F folder to open it. Go to the far right of the consensus sequence and find the reverse complement of the
reverse primer motif at the very end. This should be around 650 bp on the raw sequence. For example, you would delete the section
marked in bold and everything to the right of it. Highlight it on the consensus sequence at the bottom of the window and
press the Delete key on the keyboard.
... CAACATTTATTTTGATTTTTTGG-
r) Next go to the opposite end of the consensus, the far left, and delete the consensus sequence from the point where
the sequences get messy. This will be apparent due to lots of green highlight. For example [it will not look exactly
like this], you would delete the sequence in bold and everything to the left of it. Highlight the region on the consensus
sequence at the bottom of the window and press the Backspace key on the keyboard. Close the window.
←ATGCTTTTTTTTTKGGTGTTTAATCAGGACTAATTGGAACTTC
s) Dissolve both the F and R folders by highlighting them and clicking the button marked with a red X.
t) Now we are going to combine the forward and reverse sequence from each specimen into a contig. Highlight the Unassembled
Samples folder and open the Contig menu. Move the cursor over Advanced Assembly. From the options that appear select
Assemble in Groups. This time choose Clone in the Name part: menu, then click Assemble. [Note: if you only sent your PCR
products for sequencing in one direction (with one primer) then you will need to check each sequence individually rather
than checking a consensus (contig).][Note: specimens which only sequenced successfully in one direction will have files
which remain in the Unassembled Samples folder.]
u) The contigs are likely to be in reverse complement orientation. Highlight every folder (contig) and select Edit > Reverse complement.
v) Open each folder (contig) in turn by double-clicking. Correct ambiguous positions (shown in red, in green highlight, and/or
as N) and gaps (“—”) in the consensus sequence by checking the original traces. This is done by double-clicking on the
consensus sequence at the bottom of the window. Always check both trace files (forward and reverse) and compare them.
[Note: the corrected consensus sequence should have NO gaps.]
w) Generally if traces conflict (i.e., different colored peaks appear in the same location on the forward and reverse
chromatograms) you can decide which is more reliable based on sequence quality (e.g., less background noise, taller peaks).
x) Check the contigs first, then check the individual single sequences in the Unassembled Samples folder, if any.
y) To export the consensus sequences, highlight all the folders, go to File > Export > Consensus Sequences..., and choose Current selection.
Open the Options and select Include gaps in FASTA but deselect all other options. Press Export. Save the file to the desktop as
sequences.fasta.
z) If necessary, to export single direction sequences, go to File > Export > Samples... and choose Current selection. Press Export. Save the file
to the desktop as sequences_single.fasta.
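
The reverse-complement and primer-trimming operations in steps n) to r) can also be sketched in Python. This is a minimal illustration on plain text strings, not a substitute for inspecting traces in CODONCODE ALIGNER; the function names are hypothetical and the motifs in the usage lines are taken from the example sequences shown above.

```python
# Complement table for unambiguous bases plus N (other IUPAC codes are left unchanged).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def trim_left_through(seq, motif):
    """Delete the motif and everything to the left of its first occurrence."""
    i = seq.find(motif)
    return seq if i == -1 else seq[i + len(motif):]


def trim_right_from(seq, motif):
    """Delete the motif (last occurrence) and everything to the right of it."""
    i = seq.rfind(motif)
    return seq if i == -1 else seq[:i]


# Usage with short made-up reads containing the motifs from the examples above.
left_trimmed = trim_left_through("GGTCAAAGATATTGGAACATT", "AAAGATATTGG")
right_trimmed = trim_right_from("AACATTTGATTTTTTGG", "TGATTTTTTGG")
```

If a motif is not found, the sequence is returned unchanged, which mirrors the manual workflow: a read without a recognizable primer motif needs to be inspected by eye rather than trimmed automatically.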

Processing High-Throughput Sequencing Reads


The files output by an Illumina MiSeq high-throughput sequencer are in FASTQ format. FASTQ is similar to FASTA but contains
additional data. Note that high-throughput sequencing reads are generally demultiplexed and adapter and primer-trimmed
onboard the sequencer (e.g., onboard the MiSeq using the MiSeq Reporter software). Because the FASTQ outputs are large files, the
sequencing company probably will not send the files by email but will email you a link to a website from where you can download
the files, usually (like Sanger sequences), packed into a zip file. Two FASTQ files (Paired-end files) are output from each sequencing
run, which you can think of as the Forward and Reverse sequences.
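
To illustrate the additional data, a FASTQ record is typically four lines: an identifier beginning with @, the sequence, a separator line beginning with +, and a string of quality characters (one per nucleotide). The sketch below assumes well-formed four-line records and the Phred+33 quality encoding (an assumption that should be checked against your sequencing facility's documentation); the function names are hypothetical.

```python
def read_fastq(lines):
    """Yield (read_id, sequence, quality_string) from four-line FASTQ records."""
    it = iter(lines)
    for header in it:
        header = header.strip()
        if not header:
            continue  # skip blank lines between records
        seq = next(it).strip()
        next(it)  # the '+' separator line carries no data we need
        qual = next(it).strip()
        yield header[1:], seq, qual  # drop the leading '@'


def mean_phred(qual, offset=33):
    """Mean quality score of a read, assuming Phred+offset encoding."""
    return sum(ord(c) - offset for c in qual) / len(qual)


# Usage with a single made-up record.
example = ["@read1", "ACGT", "+", "IIII"]
records = list(read_fastq(example))
```

The mean score computed this way corresponds to the kind of per-read quality threshold applied during filtering.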
The following workflow describes the steps taken for processing high-throughput reads for bulk arthropod samples whose DNA was
amplified using the primers mlCOIintF and HCO2198 (see Brandon-Mong et al., 2015). A zip file containing some example FASTQ
Paired-end files is available for download as Supplemental File 2. For practice you can go through these steps using these example
files. For convenience, save the files in a folder (called Reads) on your Desktop.
It is important to note that the steps provided below are crude methods for processing a very small number of high-throughput
sequencing reads. The field of DNA metabarcoding is a relatively new field and much work is being undertaken to develop
methods to reduce the number of “spurious” reads generated and retained by bioinformatics pipelines for high-throughput
sequencing reads (Brandon-Mong et al., 2015) (see Future Directions below). For FASTQ files which are larger than the example
provided as Supplemental File 2, CODONCODE ALIGNER is probably not a suitable program, and for beginners, it may be better
to register and use applications provided on the GALAXY webserver. Considering that DNA metabarcoding is the focus of another
article in this book, we do not provide additional details here.
a) Open the PRINSEQ webserver (Schmieder and Edwards, 2011), click Use PRINSEQ and click on Upload Data.
b) Your files are FASTQ Paired-end so choose that option.
c) Select the two FASTQ files you have saved on your Desktop (in the Reads folder).
d) Under Please select the statistics you want to generate, choose None for all options then click Continue.
e) Wait while PRINSEQ processes your data.
f) Once it is finished click Process Input Data.
g) Choose the options from Table 1.
h) Choose to Output the data as FASTA, Data passing all the filters (good).

Table 1 Parameters used for processing high-throughput sequencer reads

Program            Parameter                                Option

PRINSEQ            Trim #nucleotides from 5′-end            25
                   Trim #nucleotides from 3′-end            25
                   Trim ends by quality scores              While Mean of scores is < (less than) 10 (5′-end)
                                                            and 10 (3′-end) using window size of 5 with step size 5
                   Trim poly-A/T tails from 5′-end          That are equal to or longer than 5
                   Trim poly-A/T tails from 3′-end          That are equal to or longer than 5
                   Minimum sequence length in bp            200
                   Maximum sequence length in bp            300
                   Minimum mean quality score               32
                   Minimum GC content in %                  25
                   Maximum GC content in %                  35
                   Maximum allowed rate of Ns in %          0
                   Maximum number of allowed Ns             0
                   Remove sequences with characters         Yes
                   other than A, C, G, T or N
                   Low-complexity threshold                 80 (using Entropy)
                   Dereplicate data                         Remove exact sequence duplicates, Remove 5′
                                                            sequence duplicates, Remove 3′ sequence
                                                            duplicates, Remove reverse complement exact
                                                            sequence duplicates, Remove reverse complement
                                                            5′/3′ sequence duplicates
CODONCODE ALIGNER  Algorithm                                Local alignments
                   Min. percent identity                    97.5
                   Min overlap length                       20
                   Min score                                20
                   Max. unaligned end overlap               100.00
                   Bandwidth (max. gap size)                30
                   Word length                              10
                   Max successive failures                  50
                   Match score                              1
                   Mismatch penalty                         2
                   Gap penalty                              2
                   Additional first gap penalty             3

i) Click Generate Files.


j) Right-click each file you want to download (all of the FASTA files) and select “Save Link As” to save the file into a folder
on your Desktop (which you could name Filtered Reads).
k) Next we will be using CODONCODE ALIGNER (see above). Go to File > Import > Add Folder... Import the sequences in the
FASTA files in the Filtered Reads folder. Click Rename Duplicates.
l) Select all the sequences and click Assemble. Make sure the settings for Assembly found under Edit > Preferences are set to match
those in Table 1.
m) Select all the contigs, including the Unassembled Samples folder, and again click Assemble. Repeat until the number of contigs cannot
be reduced any further.
n) Delete (Move to Trash) the sequences still remaining in the Unassembled Samples folder.
o) Export all the remaining contigs as consensus sequences (FASTA file): highlight all the folders, go to File > Export > Consensus
Sequences..., and choose Current selection. Make sure none of the Options are selected.
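
To make the effect of some of the Table 1 filters concrete, the length, GC content and N filters can be sketched in Python. This is a simplified illustration, not a reimplementation of PRINSEQ: the function names are hypothetical, the default thresholds are the Table 1 values, and PRINSEQ's own handling of edge cases (for example, how Ns are counted in the GC calculation) may differ.

```python
def gc_percent(seq):
    """Percentage of G and C nucleotides in a sequence (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def passes_filters(seq, min_len=200, max_len=300, min_gc=25.0, max_gc=35.0, max_n=0):
    """Apply length, character, N-count and GC-content filters to one read."""
    seq = seq.upper()
    if not (min_len <= len(seq) <= max_len):
        return False
    if set(seq) - set("ACGTN"):  # characters other than A, C, G, T or N
        return False
    if seq.count("N") > max_n:
        return False
    return min_gc <= gc_percent(seq) <= max_gc


# Usage: keep only the reads that pass all filters (made-up reads).
reads = {"keep": "A" * 140 + "G" * 60, "too_short": "A" * 100}
good = {name: s for name, s in reads.items() if passes_filters(s)}
```

The narrow length and GC windows reflect the expected properties of the amplified COI fragment, so reads falling outside them are treated as likely spurious.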

Sequence Alignment
The alignment of DNA barcode sequences is a necessary step before two or more DNA barcode sequences can be compared with one
another. Sequence alignment is the process of lining up nucleotides which are assumed to have the same common ancestor
(i.e., thought to be homologous). BIOEDIT is the most commonly used program for small-scale sequence alignment, and is free for use
by any and all interested parties. BIOEDIT can be downloaded from the program website but is no longer being regularly maintained.

a) Open your FASTA file (e.g., sequences.fasta) in BIOEDIT.


b) If necessary, you can then use File > Import > Sequence alignment file to add additional sequences (e.g., sequences_single.fasta) to
the alignment.
c) Make sure Mode: is set to Edit using the dropdown menu.
d) Another dropdown menu will become visible to the right of the Edit dropdown. Make sure this is set to Insert.
e) Sequences that have ended up in the FASTA file in the wrong orientation (with standard primers, full length arthropod DNA
barcodes should typically start AAC or TAC) may be corrected by highlighting the sequence name by clicking the cursor on it,
clicking the Sequences menu at the top of the screen, moving the cursor down the dropdown to Nucleic Acid, and clicking Reverse
complement.
f) Sequences all need to be the same length (typically 658 bp for full length DNA barcodes; 300 bp for metabarcodes) and
aligned to each other (before proceeding for further analysis or before uploading to BOLD and GenBank). This can be done by
typing additional N(s) at the beginning and end of the sequences in the Edit mode. Be sure to check across the whole of the
alignment of the sequences that you have added the correct number of N(s).
g) In Fig. 2 featuring a 50-bp barcode for simplification, JKN001-17 is of full length. JKN002-17 needs 6 Ns adding to the left side
of the sequence to become aligned, while JKN003-17 needs 4 Ns adding to the right side of the sequence to be 50 bps long.
JKN005-17_F needs to be reverse complemented to be in the same orientation as the other sequences.
h) Sanger sequences which were not part of a consensus (i.e., when one direction failed but the single sequence is of sufficient
length and quality for submission to BOLD) may appear in the FASTA still tagged with the direction. This needs to be deleted,
e.g., the sequence named JKN005-17_F should be renamed as JKN005-17.
i) If you are having trouble with the alignment, a good quality (i.e., 658[0n]) sequence can be downloaded from BOLD and
imported into your alignment file as a guide, e.g., MHAHC824-05 (Fig. 2). Be sure to delete this sequence before saving the file.
j) Save the file (File > Save) (e.g., save as sequences_aligned.fasta).
Following these steps you can upload your FASTA file (i.e., sequences_aligned.fasta) (and Sanger sequence trace files) to your
project on BOLD. Step-by-step instruction can be found on the DBW website. Once sequences are registered in BOLD, they can
also be submitted to GenBank using the Submit to GenBank option under the Publication tab on the BOLD project console.
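
The padding and length check described in steps f) and g) can be sketched in Python for readers who want to verify a file before uploading. This is a minimal sketch: the function names are hypothetical, and it assumes you already know (from inspecting the alignment) how many Ns each sequence needs on each side.

```python
def read_fasta(text):
    """Parse FASTA text into a {name: sequence} dictionary."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def pad_with_n(seq, left=0, right=0):
    """Add the given number of Ns to the left and right ends of a sequence."""
    return "N" * left + seq + "N" * right


def all_same_length(records, expected=None):
    """True if every sequence has the same length (optionally a specific one)."""
    lengths = {len(s) for s in records.values()}
    if len(lengths) != 1:
        return False
    return expected is None or lengths == {expected}


# Usage with two made-up records: pad the short one, then check lengths.
records = read_fasta(">a\nACGT\nAC\n>b\nGT\n")
records["b"] = pad_with_n(records["b"], left=4)
```

For full length DNA barcodes, all_same_length(records, 658) would be the final check before saving sequences_aligned.fasta.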

Assigning Taxonomic Names to DNA Barcodes


For the assignment of taxonomic names to DNA barcodes (and by extension, checking sequences for contamination) we find it is
best to use online tools to ensure you are searching the most recent DNA barcode library available (the DNA barcode library is
constantly growing; Wilson et al., 2017). For checking DNA barcodes for contamination and/or establishing a taxonomic identity,
we would commonly “blast” our DNA barcodes against the BOLD and GenBank libraries. By statistically assessing how well library
and query (unknown) DNA barcodes match, we can infer homology and transfer information (such as putative species
membership) to the unknown specimen.
When there are no species-level matches in the libraries, some Linnaean taxonomic information may still be retrievable. Wilson
et al. (2011) performed an investigation of the possibilities of assigning DNA barcodes to higher taxonomic groups (genus, family)
when no species matches (i.e., with >98% similarity) are available in the library. Based on those results we suggest using a strict
tree-based approach. BOLD can provide a Neighbor-Joining tree containing the query DNA barcode and the top matches by
clicking Tree Based Identification on the Specimen Identification Request page (see below).
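
The decision rule described here can be sketched in Python as follows. This is an illustration only: the function name and the input format (a list of taxon name and percent similarity pairs copied from a search result) are assumptions, and the sketch does not implement the strict tree-based criterion, which still requires inspecting the tree.

```python
def species_level_match(hits, threshold=98.0):
    """Return the best-matching taxon name if its similarity exceeds the threshold.

    hits is a list of (taxon_name, percent_similarity) pairs.
    Returns None when no hit exceeds the threshold, signalling that a
    tree-based assignment to genus or family should be considered instead.
    """
    best = max(hits, key=lambda hit: hit[1], default=None)
    if best is not None and best[1] > threshold:
        return best[0]
    return None


# Usage with made-up hits for two hypothetical taxa.
hits = [("Genus speciesA", 99.5), ("Genus speciesB", 97.0)]
assignment = species_level_match(hits)
```

In practice more than one species may exceed the threshold, in which case the match list should be examined by eye rather than resolved automatically.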

Fig. 2 Examples of DNA sequences viewed in BIOEDIT and MS WORD.

BOLD identification engine


a) Open your aligned FASTA file (i.e., sequences_aligned.fasta) in MS WORD using right click Open with. Select All of the text and Copy.
b) Click IDENTIFICATION on the BOLD homepage, select All Barcode Records On BOLD, paste the text from your FASTA file (i.e.,
sequences_aligned.fasta) into the box Enter sequences in fasta format:.
c) BOLD was designed specifically for DNA barcoding, so the Specimen Identification Request page displays a list of library records
and their similarity to the query (i.e., the DNA barcode of the unknown specimen). A self-explanatory Identification Summary is
also provided. An example of a BOLD search result is shown in Fig. 3(a) where the sequence can be conclusively identified as
Amathuxidia amythaon.

GenBank BLAST
a) Click Nucleotide BLAST on the BLAST homepage, paste the text from your FASTA file into the box Enter accession number(s),
gi(s), or FASTA sequence(s), and make sure the Database selection is Others.
b) BLAST pre-dates DNA barcoding, and is used for a variety of purposes, so the output is a little more difficult to interpret. Like
BOLD, a list of library records is displayed, generally with the closest matching library sequence (i.e., the highest % Identity) at the
top. Four other statistics are supplied: Max score indicates the highest alignment score (bit-score) between the query DNA barcode
and the library sequence segment (the higher the better, 1000 is very good); Total score and Query coverage are generally not
applicable for protein-coding genes such as the animal DNA barcode; E-value is the most important statistic for DNA barcoding
and indicates the number of alignments expected by chance with a particular score or higher (the closer to 0 the better). An example
of a BLAST search result is shown in Fig. 3(b) where the sequence can be conclusively identified as Amathuxidia amythaon.
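
The link between the bit-score and the E-value can be made concrete with the standard BLAST statistics relation E = m × n × 2^(−S′), where S′ is the bit-score, m is the query length and n is the total length of the database. The Python sketch below uses raw lengths for simplicity, whereas BLAST applies effective (edge-corrected) lengths, so the values will not match BLAST output exactly; the function name and the numbers in the usage line are illustrative.

```python
def expected_hits(bit_score, query_len, db_len):
    """Approximate E-value: alignments expected by chance at this bit-score or higher."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Usage: a 658 bp query against a hypothetical database of one billion nucleotides.
e_low_score = expected_hits(40, 658, 10**9)
e_high_score = expected_hits(1000, 658, 10**9)
```

Because the bit-score sits in the exponent, a match with a bit-score near 1000 has an E-value that is effectively 0, which is why a full length DNA barcode matching a conspecific library record reports an E-value of 0.0.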

Fig. 3 Examples of the results of sequence identification requests using (a) BOLD identification engine and (b) GenBank BLAST.

Clustering DNA Barcodes Into Operational Taxonomic Units


Due to the incompleteness of DNA barcode reference libraries, and inconsistencies in the use of Linnaean names, besides assigning
Linnaean taxonomy to DNA barcodes, it is also valuable and common practice to cluster DNA barcodes into Operational
Taxonomic Units (OTU) (see Sukantamala et al., 2017; Casas et al., 2017). OTU are groupings of similar sequences which are objective,
operational, “species”-level units (Ratnasingham and Hebert, 2013). Several methods are available for clustering DNA barcodes
into OTU and we provide step-by-step instructions for the most commonly used and easily accessible programs in this article.
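
To give a feel for what grouping similar sequences means, the Python sketch below performs a plain single-linkage clustering at a fixed distance threshold. It is deliberately simplistic and is not the algorithm used by any of the programs described in this article; the function name, the input format (a dictionary of precomputed pairwise distances) and the 0.02 threshold are assumptions chosen for illustration.

```python
def single_linkage_clusters(names, distances, threshold=0.02):
    """Group names so that any pair at or below the threshold shares a cluster.

    distances maps (name_a, name_b) pairs to a pairwise distance.
    """
    parent = {n: n for n in names}

    def find(x):
        # Follow parent links to the cluster representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            d = distances.get((a, b), distances.get((b, a)))
            if d is not None and d <= threshold:
                parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return sorted(sorted(members) for members in clusters.values())


# Usage with made-up distances: s1, s2 and s3 chain together; s4 stays alone.
names = ["s1", "s2", "s3", "s4"]
distances = {("s1", "s2"): 0.01, ("s2", "s3"): 0.015, ("s1", "s3"): 0.025,
             ("s1", "s4"): 0.10, ("s2", "s4"): 0.11, ("s3", "s4"): 0.12}
otus = single_linkage_clusters(names, distances)
```

Note that s1 and s3 end up in the same OTU even though their direct distance exceeds the threshold, because each is close to s2. This chaining behaviour is one reason more refined methods are used in practice.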

BOLD Barcode Index Number system


DNA barcodes meeting minimum length (200 bp) and quality (maximum proportion of Ns) requirements are automatically clustered
into OTU known as Barcode Index Numbers (BINs) upon upload to BOLD. Such clusters, produced by Refined Single Linkage
analysis across the entire BOLD library, have been shown to closely correspond to species recognized through traditional
taxonomic approaches (i.e., morphology). Consequently, the alphanumeric identifiers can act as surrogate taxonomic names in the
absence of full Linnaean assignments (Wilson et al., 2016).

Automatic Barcode Gap Discovery


Another popular tool to cluster DNA barcodes into OTU is Automatic Barcode Gap Discovery (ABGD) (Puillandre et al., 2012).
ABGD uses an automatic recursive procedure to converge on the best patterns for the dataset and clusters DNA barcodes into
groups accordingly. The median number of ABGD groups can be used as the basis for OTU as this has produced good
correspondence with traditional species in empirical studies. ABGD is available as a web interface.
a) From the ABGD homepage click Take me to ABGD Web site.
b) Open your aligned FASTA file (i.e., sequences_aligned.fasta) in MS WORD (using right click Open with), then copy and paste the
DNA barcodes into the Or paste your data (FASTA alignement) here box.
c) Change the X (relative gap width): to 1, but keep all other settings as default.
d) Click Go and then click to see the results.
e) On the graph click the datapoint representing the median number of groups. If available choose the Recursive partition, rather
than the Initial partition.
f) The page that opens shows a list of groups and the DNA barcodes contained within them.

Neighbor-Joining in MEGA
The Molecular Evolutionary Genetics Analysis (MEGA) program is extremely popular for tree-building and is free to
download from the program website (Tamura et al., 2013). Once a Neighbor-Joining (NJ) tree has been built, DNA barcodes
can be sorted to OTU ad hoc based on the tree branching pattern (topology) and branch lengths (see the NJ tree and discussion in
Sukantamala et al., 2017). Note that NJ trees are not technically “phylogenetic” trees. Phylogenetic trees are constructed on the
basis of synapomorphies, whereas NJ trees are phenetic trees, constructed on the basis of sequence similarity. DNA barcoding is
concerned with relationships amongst sequences at the “species” boundary and not the reconstruction of phylogeny, so generally
NJ trees are sufficient for the analysis of DNA barcodes.

a) Start MEGA by double-clicking on the MEGA desktop icon.


b) You can then choose File > Open A File/Session… > Select your FASTA file (i.e., sequences_aligned.fasta) from your Desktop
> Open > Analyze > Nucleotide Sequences > OK > Yes > Invertebrate Mitochondrial > OK.
c) You can then choose Phylogeny > Construct/Test Neighbor-Joining Tree… > Yes.
d) In the Analysis Preferences window Options Summary tab choose the options: Substitutions Type: Nucleotide; Model/Method:
p-distance; Substitutions to Include: d. Transitions + Transversions; Gaps/Missing Data Treatment: Pairwise deletion.
e) Click Compute to accept the defaults for the rest of the options and begin the computations. A progress indicator will appear
briefly before the tree displays in the Tree Explorer.
f) The tree can then be exported as a Newick (text) file (e.g., tree.nwk) using File > Export Current Tree (Newick).
g) To select a branch, click on it with the left mouse button. If you click on a branch with the right mouse button, you will get a
small options menu that will let you flip the branch and perform various other operations on it. To edit the taxon labels,
double click on them.
h) Change the branch style by using View > Tree/Branch Style.
i) Select View > Topology Only to display the branching pattern (without actual branch lengths on the screen).
j) Edit your tree to create the image that you want to use in your publication or report. The tree can then be exported as an image
using Image > Save as PDF file or Image > Save as PNG file.
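
The distance that underlies the tree in step d), the p-distance with pairwise deletion, is simply the proportion of differing sites among the sites where both sequences have a nucleotide. The Python sketch below illustrates the calculation for two aligned sequences; the function name is hypothetical, and for simplicity every character other than A, C, G or T (including N, gaps and ambiguity codes) is treated as missing, which may not match MEGA's handling of ambiguity codes exactly.

```python
VALID = set("ACGT")


def p_distance(a, b):
    """Proportion of differing sites, skipping sites missing in either sequence."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in VALID and y in VALID:  # pairwise deletion of N, gaps, etc.
            compared += 1
            if x != y:
                differences += 1
    if compared == 0:
        raise ValueError("no sites available for comparison")
    return differences / compared


# Usage with short made-up sequences (real DNA barcodes would be 658 bp).
d_full = p_distance("ACGTACGT", "ACGTACGA")
d_padded = p_distance("NNGTACGT", "ACGTACGA")
```

The second call shows why padding with Ns does not distort the distances: the two padded sites are excluded from the comparison rather than counted as mismatches.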

Bayesian Poisson Tree Processes


A Bayesian implementation of the Poisson Tree Processes (bPTP) (Zhang et al., 2013) program for generating OTU is available
through a web interface. bPTP can be used to delimit phylogenetic species in a similar way to the popular and widely used General
Mixed Yule Coalescent (GMYC) approach (Pons et al., 2006), but without the requirement for an ultrametric tree.

a) From the bPTP homepage, click the Browse… button to locate and upload your NJ tree (select the Newick file, e.g., tree.nwk, not
the image file).
b) Leave all the settings as the defaults and enter your email address.
c) Click to refresh until the results appear. Two trees are displayed (a maximum likelihood solution and a Bayesian solution) but
they are likely to be the same topology.
d) Click Download delimitation results. The page that opens shows a list of groups (species) and the DNA barcodes contained
within them.

Illustrative Example(s) or Case Studies

We have used DNA barcoding in several biodiversity studies in Southeast Asia (Wilson et al., 2016) covering a wide range
of animal groups including butterflies (Jisming-See et al., 2016), bats (Syaripuddin et al., 2014), sandflies (Polseela et al., 2016;
Sukantamala et al., 2017) and dragonflies (Casas et al., 2017). Two illustrative examples are provided below.

Butterflies of Setiu Wetlands, Malaysia


As part of the Setiu Wetlands Scientific Expedition organized by WWF-Malaysia in 2016, we conducted a survey of butterflies at
Lata Changkah, Terengganu, Malaysia. All sampled butterflies were brought back to the laboratory and subjected to standard
molecular methods for DNA barcoding (Wilson, 2012; Jisming-See et al., 2016) and bioinformatics methods described above (see
Sections Editing Sanger Sequences and Sequence Alignment). The DNA barcodes were submitted to BOLD where they are publicly
available under the project code: SETIU. Species assignments were obtained on the basis of >98% sequence similarity with records in
BOLD (see Section BOLD identification engine), which was possible due to the existing DNA barcode reference library for the
butterflies of Peninsular Malaysia (Wilson et al., 2013). However, a strict tree-based criterion (Wilson et al., 2011) was used to
assign three DNA barcodes that did not share >98% similarity with any BOLD records to genera. Forty-nine species were recorded
suggesting Lata Changkah can currently support rare butterflies (Sing and Wilson, 2017).
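
The similarity criterion used above can be illustrated with a minimal Python sketch. It assumes the query and reference DNA barcodes are already aligned to the same length, and it counts only sites where both sequences have A, C, G, or T. The function names, the species names, and the toy sequences are hypothetical; the BOLD identification engine uses its own search algorithm, so this is an illustration of the threshold logic only.

```python
def percent_similarity(seq_a, seq_b):
    """Percent identical sites between two aligned sequences (gaps and Ns ignored)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    valid = set("ACGT")
    compared = 0
    matches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in valid and b in valid:
            compared += 1
            if a == b:
                matches += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return 100.0 * matches / compared


def assign_species(query, library, threshold=98.0):
    """Return (name, similarity) for the best match; name is None at or below the threshold."""
    best_name, best_sim = None, 0.0
    for name, reference in library.items():
        sim = percent_similarity(query, reference)
        if sim > best_sim:
            best_name, best_sim = name, sim
    if best_sim > threshold:
        return best_name, best_sim
    return None, best_sim


# Toy 100 bp example (real DNA barcodes are far longer; names are hypothetical).
query = "ACGT" * 25
library = {
    "Species A": "T" + query[1:],      # 1 mismatch in 100 sites
    "Species B": "TTTTT" + query[5:],  # 4 mismatches in 100 sites
}
print(assign_species(query, library))
```

A query that returns `None` as its name would then be a candidate for the tree-based assignment to genus described above.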

Sandflies of Northern Thailand


We collected sandflies using CDC light traps from five tourist caves in Northern Thailand. Specimens were brought back to the
laboratory and subjected to standard molecular methods for DNA barcoding (Wilson, 2012; Jisming-See et al., 2016) and
bioinformatics methods described above (see Sections Editing Sanger Sequences and Sequence Alignment). We combined our new
DNA barcodes with selected publicly available DNA barcodes in BOLD, which included species reported from Thailand (with
records from Thailand, India, and Sri Lanka). The combined dataset (394 DNA barcodes) is publicly available in BOLD under the
code: DS-SFNTH. Using a combination of methods (see Sections BOLD Barcode Index Number (BIN) system and Neighbor-
Joining (NJ) in MEGA), the specimens were clustered into 34 OTU. Several of the taxa thought to be present in multiple caves,
based on morphospecies sorting, split into cave-specific OTU which likely represented cryptic species. The resulting species
checklist and DNA barcode library contributed to a growing set of records for sandflies which is useful for monitoring and vector
control (Polseela et al., 2016; Sukantamala et al., 2017).
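
To give beginners a feel for what clustering DNA barcodes into OTU involves, the Python sketch below groups aligned sequences by single-linkage clustering on uncorrected pairwise distance. This is a deliberately simplified stand-in, not the BIN algorithm or the NJ analysis in MEGA used in the study; the 2% threshold, the function names, and the specimen labels are assumptions made for illustration.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences (A/C/G/T sites only)."""
    pairs = [
        (a, b)
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a in "ACGT" and b in "ACGT"
    ]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def cluster_otus(barcodes, threshold=0.02):
    """Single-linkage clustering: barcodes closer than `threshold` end up in one OTU."""
    parent = {name: name for name in barcodes}

    def find(name):
        # Follow parent links to the cluster representative, shortening the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(barcodes, 2):
        if p_distance(barcodes[a], barcodes[b]) < threshold:
            parent[find(a)] = find(b)

    otus = {}
    for name in barcodes:
        otus.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in otus.values())


# Toy 100 bp example with hypothetical specimen labels.
base = "ACGT" * 25
barcodes = {
    "cave1_a": base,
    "cave1_b": "T" + base[1:],            # 1% from cave1_a
    "cave2_a": "TTTTTTTTTT" + base[10:],  # 8% from cave1_a
}
print(cluster_otus(barcodes))
```

In this toy dataset the two cave 1 specimens fall into one OTU and the cave 2 specimen forms its own, which mirrors the cave-specific OTU pattern described above.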

Discussion

DNA barcoding is being used by researchers across an increasing number of biological fields reflecting the fact that DNA sequence
information can be cheap and easy to obtain and can enable assignment of taxonomic names to organisms without requiring
researchers to be familiar with intricate morphological features. Likewise, molecular OTU can be suitable surrogates for
partitioning diversity into interoperable units for biodiversity studies enabling researchers to obtain taxonomic data much faster than
possible with traditional morphological approaches, making studies scalable across much larger taxonomic groups and wider
geographical regions (Wilson et al., 2017). Yet, the prospect of DNA barcoding can be daunting for beginners (Wilson et al., 2016).
Mastering a basic bioinformatics workflow is essential to ensure the quality and reliability of data and to generate meaningful
results (Wilson and Sing, 2013).

Future Directions

High-throughput sequencers are replacing Sanger sequencers in most molecular applications. The development of the
sub-discipline of DNA metabarcoding grew directly from the major advantages offered by high-throughput sequencing through
circumventing the sorting and isolation of the thousands of individuals in bulk mixed samples of organisms (Brandon-Mong et al.,
2015). However, the short read lengths and high error rates limited the use of high-throughput sequencers for conventional,
individual specimen, DNA barcoding (Hebert et al., 2017), and in particular, for DNA barcode reference library construction (Liu
et al., 2017). The continued reliance on Sanger sequencing has constrained reductions in the cost of DNA barcoding and led to
uneven DNA barcoding efforts around the world (Liu et al., 2017). Recently developed approaches using the latest generation of
high-throughput sequencing platforms (Pacific Biosciences SEQUEL and Illumina HiSeq 4000) have produced full length DNA
barcodes of equivalent length and quality to those generated by Sanger sequencing, but at costs reduced by 10-fold
(Liu et al., 2017) to 40-fold (Hebert et al., 2017). Bioinformatics pipelines to complement these new approaches have developed
concurrently. The mBRAVE webserver, developed by the team behind BOLD and with direct links to the BOLD reference libraries
(Hebert et al., 2017), is an important development and likely represents a landmark shift in standard DNA barcoding protocols.

Closing Remarks

In this article, we provide beginners with step-by-step instructions for converting raw DNA sequences into clean DNA barcodes
(sequence editing, sequence alignment), for using common tools to assign taxonomic names to DNA barcodes, and for clustering
DNA barcodes into OTU. As more researchers become comfortable with such bioinformatics workflows, and the DNA barcoding
community continues to grow, essential questions for society, such as “What is this specimen on an agricultural shipment?”, “Who eats
whom in this whole food web?”, and even “How many species are there?”, become answerable (Adamowicz, 2015). It is promising
that capacity for DNA barcoding is growing in the parts of the world where it is needed most, particularly among the younger
generation of researchers who can easily connect with the barcoding analogy (Adamowicz, 2015; Wilson et al., 2016).

Acknowledgements

Kong-Wah Sing is supported by the Chinese Academy of Sciences President's International Fellowship Initiative. We thank the
BOLD team, especially Megan Milton, for their continuous support of our DNA barcoding workshops in Southeast Asia. We are
grateful to CodonCode Corporation for supplying teaching licenses during our workshops. We thank previous sponsors and hosts
of our DNA barcoding workshops: the Centre of Excellence in Fungal Research, the Department of Microbiology and Parasitology,
and the Faculty of Medical Science at Naresuan University, Phitsanulok, Thailand; the Zoological and Ecological Research
Network, and Museum of Zoology at the University of Malaya, Kuala Lumpur, Malaysia; the University of Nottingham Malaysia
Campus, Selangor, Malaysia; Tunku Abdul Rahman University College, Kuala Lumpur, Malaysia; and the Asia-Pacific Network for
Global Change Research. We also thank the scientists who have helped facilitate our workshops: Paul Hebert, Brandon Mong Guo
Jie, Lee Ping Shin, Evan Chin, Kharunnisa Syaripuddin, Jedsada Sukantamala, Cheah Men How, Siti Azizah M Nor, Noor Adelyna
M Akib, Mr Foo, Mr Chin, Elizabeth Clare.

Appendix A Supplementary Information

Supplementary data associated with this article can be found in the online version at 10.1016/B978-0-12-809633-8.20468-8.

References

Adamowicz, S.J., 2015. International barcode of life: Evolution of a global research community. Genome 58, 151–162.
Brandon-Mong, G.J., Gan, H.M., Sing, K.W., et al., 2015. DNA metabarcoding of insects and allies: An evaluation of primers and pipelines. Bulletin of Entomological Research
105, 717–727.
Casas, P.A.S., Sing, K.W., Lee, P.S., et al., 2017. DNA barcodes for dragonflies and damselflies (Odonata) of Mindanao, Philippines. Mitochondrial DNA Part A, Online Early.
doi:10.1080/24701394.2016.1267157.
Hebert, P.D.N., Braukmann, T.W.A., Prosser, S.W.J., et al., 2017. A sequel to Sanger: Amplicon sequencing that scales. bioRxiv. 191619. Available at: https://doi.org/10.1101/
191619.
Jisming-See, S.W., Sing, K.W., Wilson, J.J., 2016. DNA barcodes and citizen science provoke a diversity reappraisal for the “ring” butterflies of Peninsular Malaysia (Ypthima:
Satyrinae: Nymphalidae: Lepidoptera). Genome 59, 879–888.
Liu, S., Yang, C., Zhou, C., Zhou, X., 2017. Filling reference gaps via assembling DNA barcodes using high-throughput sequencing-moving toward barcoding the world.
Gigascience 6, 1–8.
Polseela, R., Jaturas, N., Thanwisai, A., Sing, K.W., Wilson, J.J., 2016. Towards monitoring the sandflies (Diptera: Psychodidae) of Thailand: DNA barcoding the sandflies of
Wihan Cave, Uttaradit. Mitochondrial DNA Part A 27, 3795–3801.
Pons, J., Barraclough, T.G., Gomez-Zurita, J., et al., 2006. Sequence-based species delimitation for the DNA taxonomy of undescribed insects. Systematic Biology 55, 595–609.
Puillandre, N., Lambert, A., Brouillet, S., Achaz, G., 2012. ABGD, Automatic barcode gap discovery for primary species delimitation. Molecular Ecology 21, 1864–1877.
Ratnasingham, S., Hebert, P.D.N., 2013. A DNA-based registry for all animal species: The Barcode Index Number (BIN) system. PLOS ONE 8, e66213.
Schmieder, R., Edwards, R., 2011. Quality control and preprocessing of metagenomic datasets. Bioinformatics 27, 863–864.
Sing, K.W., Syaripuddin, K., Wilson, J.J., 2016. How to rapidly accelerate biodiversity inventory in places where most of the species are unknown? Malayan Nature Journal 68,
131–134.
Sing, K.W., Wilson, J.J., 2017. Butterfly diversity at a recreation hotspot in Setiu Wetlands, Terengganu, Malaysia. Prosiding Seminar Ekspedisi Saintifik Tanah Bencah Setiu
2016. Selangor. WWF-Malaysia. pp. 86–96.
Sukantamala, J., Sing, K.W., Jaturas, N., Polseela, R., Wilson, J.J., 2017. Unexpected diversity of sandflies (Diptera: Psychodidae) in tourist caves in Northern Thailand.
Mitochondrial DNA Part A 28, 949–955.
Syaripuddin, K., Kumar, A., Sing, K.W., et al., 2014. Mercury accumulation in bats near hydroelectric reservoirs in Peninsular Malaysia. Ecotoxicology 23, 1164–1171.
Tamura, K., Stecher, G., Peterson, D., Filipski, A., Kumar, S., 2013. MEGA6: Molecular evolutionary genetics analysis Version 6.0. Molecular Biology and Evolution 30, 2725–2729.
Wilson, J.J., 2012. DNA barcodes for insects. In: Kress, W.J., Erickson, D.L. (Eds.), DNA Barcodes: Methods and Protocols. New York: Humana Press.
Wilson, J.J., Rougerie, R., Shonfeld, J., et al., 2011. When species matches are unavailable are DNA barcodes correctly assigned to higher taxa? An assessment using sphingid
moths. BMC Ecology 11, 18.
Wilson, J.J., Sing, K.W., 2013. DNA barcoding can successfully identify Penaeus monodon, associate life cycle stages, and generate hypotheses of unrecognised diversity.
Sains Malaysiana 42, 1827–1829.
Wilson, J.J., Sing, K.W., Floyd, R.M., Hebert, P.D.N., 2017. DNA barcodes and insect biodiversity. In: Foottit, R.G., Adler, P.H. (Eds.), Insect Biodiversity: Science and Society,
second ed. Oxford: Blackwell Publishing Ltd, pp. 575–592.
Wilson, J.J., Sing, K.W., Lee, P.S., Wee, A.K.S., 2016. Application of DNA barcodes in wildlife conservation in Tropical East Asia. Conservation Biology 30, 982–989.
Wilson, J.J., Sing, K.W., Sofian-Azirun, M., 2013. Building a DNA barcode reference library for the true butterflies (Lepidoptera) of Peninsula Malaysia: What about the
subspecies? PLOS ONE 8, e79969.
Zhang, J., Kapli, P., Pavlidis, P., Stamatakis, A., 2013. A general species delimitation method with applications to phylogenetic placements. Bioinformatics 29, 2869–2876.

Further Reading
Adamowicz, S.J., Chain, F.J.J., Clare, E.L., et al., 2016. From barcodes to biomes: Special issues from the 6th international barcode of life conference. Genome 59, v–ix.
Kress, W.J., Erickson, D.L. (Eds.), 2012. DNA Barcodes: Methods and Protocols. New York: Humana Press.

Relevant Websites

http://wwwabi.snv.jussieu.fr/public/abgd/
ABGD.
www.boldsystems.org
Barcode of Life Datasystems (BOLD).
www.mbio.ncsu.edu/bioedit/bioedit.html
BIOEDIT.
http://species.h-its.org/ptp/
bPTP.
www.codoncode.com
CODONCODE ALIGNER.
www.barcodingasia.weebly.com
DNA Barcoding Workshops (DBW).
www.usegalaxy.org
GALAXY.
https://blast.ncbi.nlm.nih.gov/Blast.cgi
GenBank BLAST.
http://www.megasoftware.net/
Molecular Evolutionary Genetics Analysis (MEGA).
www.prinseq.sourceforge.net
PRINSEQ.
