Chapter Y
Bioinformatics of Phylogeny
This chapter addresses the bioinformatics of phylogeny. Drawing on
the classical and modern views of evolution pertinent to the origin
and diversity of species, it presents phylogenetic concepts and
phylogenetic analysis, and discusses relevant software
classifications and available phylogenetic programs.
the life sciences what the long sought holy grail of the unified field
theory is to astrophysics”.
Every form of life descends from a common ancestral origin.
Evolution of species is not a static process; it changes dynamically
over time. Darwin’s description of this process amounts to
variation sorted out through drift and selection among lineages
that diverge. In short, evolution implies a “descent with
modification”, so that organisms bear a history; and the
modifications observed are stochastic epochs (of that history)
stemming from the statistical nature of mutational changes,
damaging or otherwise.
y.2 PHYLOGENY
Considering the diversity of life with the plethora of estimated 5 to
100 million species of organisms living on Earth, an evidential
implication of details gathered (from morphological, biochemical,
and gene sequence data) “suggests that all organisms on Earth are
genetically related, and the genealogical relationships of living things
can be represented by a vast evolutionary tree”…; and, this tree of
life depicts the phylogeny of organisms. (The Greek term “phylon”
denotes a race or a tribe; as such, phylogeny, or phylogenesis,
implies the race history of an animal or a plant type.)
In essence, phylogeny is a collection of information about the origin
and diversity of species. That is, considering the history of lineages
associated with organisms dynamically changing through time, it
implies that “different species arise from previous forms via descent,
and that all organisms, from the smallest microbe to the largest plants
and vertebrates, are connected by the passage of genes along the
branches of the phylogenetic tree that links all of life”. Thus the
evolutionary history depicts the history of development of biological
organisms, functions, molecules etc. through random mutations
under selective (invariably nonrandom) pressure. As such, the
evolutionary history presumes that an existing organism has
descended from ancestral organisms; and, inevitable mutations take
place at nucleotide level leading to the observed biological diversity.
[Hennig, W. 1965. Phylogenetic Systematics. Ann. Rev. Entomol.
10: 97-116; Hennig, W. 1966. Phylogenetic Systematics (tr. D. Davis
and R. Zangerl), Univ. of Illinois Press, Urbana (reprinted 1979);
Zuckerkandl, E. and Pauling, L. 1965. “Molecules as Documents of
Evolutionary History.” Journal of Theoretical Biology 8: 357-366].
[Figure: tree diagrams (a) and (b), labeled with: branch, branch
length, node, OTU, outer group, ancestral root, root node, internal
node, bifurcating node, terminal nodes]
Number of branches and nodes (M = number of OTUs):

Tree       Part       Number of branches   Number of nodes
Rooted     Interior   M - 2                M - 1
Rooted     Total      2M - 2               2M - 1
Unrooted   Interior   M - 3                M - 2
Unrooted   Total      2M - 3               2M - 2
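
The counts tabulated above can be reproduced with a few lines of code. The following is a minimal sketch, assuming a strictly bifurcating tree with at least three OTUs; the function name `tree_counts` and its dictionary keys are introduced here only for illustration.

```python
def tree_counts(m, rooted):
    """Branch and node counts of a strictly bifurcating tree with m OTUs."""
    if m < 3:
        raise ValueError("the formulas assume at least three OTUs")
    total_nodes = 2 * m - 1 if rooted else 2 * m - 2
    total_branches = total_nodes - 1        # a tree has one branch fewer than nodes
    interior_nodes = total_nodes - m        # every node that is not an OTU
    interior_branches = total_branches - m  # each OTU sits on one terminal branch
    return {
        "interior_branches": interior_branches,
        "total_branches": total_branches,
        "interior_nodes": interior_nodes,
        "total_nodes": total_nodes,
    }

print(tree_counts(5, rooted=True))
print(tree_counts(5, rooted=False))
```

For M = 5 the rooted case gives 3 interior and 8 total branches, and the unrooted case one fewer of each, in line with the table.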
Tree network: While trees signify only one path between any
pair of nodes, a tree network has more than one path between
any pair of nodes as shown in Figure y.r
Figure y.r: (a) A simple tree and (b) a tree network mesh
Figure y.3: A morphological tree of phylogeny (character labels:
furs and mammary glands; feathers; claws and nails; lungs; jaws)
[Figure: an ancestral gene α undergoes duplication into α and β;
subsequent speciation yields α1, β1 and α2, β2]
PERFORM TREE-BUILDING/RECONSTRUCTION
← This refers to making of the
required phylogenetic tree. This is
done with the set of multiple
sequences aligned and prepared
go to: SUBROUTINE III
next
END: ;
______________________________________________
nucleotides(such as
ribosomal RNA)
or else… Use the
chosen nucleotide
sequence
return Selected sequence
next
Step (i)-
Retrieve-
Given a sequence, its homolog can be retrieved
as follows:
Step (ii)-
Arrange/Prepare: This refers to placing the
chosen multiple sequences one below the other
forming a column of sequences as an alignment
procedure and preparing the multiple sequences
for tree construction. The preparation
involves “cleaning up” the chosen multiple
sequences for alignment by a set of procedures
AAGCA-AGGTAAATGCATGCATGGA- -AGTCCTGGAATGGTA
AGAT- - AGGTAAATGCAGCTAGCAT-AAGTCCTGGACCGGAT
GCAATTAGGTAAAACCAAGGTACCT- -AGTCCTGGAGAGATA
GTGATTAGGTAAAACCAACGCAACGCAGTCCTGGACGTAGG
End
_______________________________________________________
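
The “cleaning up” mentioned in Step (ii) can take several forms. The following is a minimal sketch of one of them, assuming that cleaning means discarding every alignment column in which at least one sequence carries a gap; the function name `drop_gapped_columns` and the three short sequences are hypothetical and serve only as an illustration.

```python
def drop_gapped_columns(alignment, gap="-"):
    """Keep only the alignment columns in which no sequence has a gap."""
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    # zip(*alignment) walks through the alignment column by column.
    kept = [column for column in zip(*alignment) if gap not in column]
    if not kept:
        return ["" for _ in alignment]
    # Transpose the surviving columns back into one string per sequence.
    return ["".join(chars) for chars in zip(*kept)]

toy_alignment = ["AAG-CA", "T-GTCA", "AAGTCC"]  # hypothetical aligned sequences
print(drop_gapped_columns(toy_alignment))
```

Columns 2 and 4 of the toy alignment contain a gap and are dropped, so every sequence is reduced to four gap-free positions.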
Figure y.y: Given the number of nodes (ν), realization of εR and εUR:
(a) With ν = 2 in a rooted tree and (b) with ν = 3 in an unrooted (star
topology) tree.
Morphometric analysis
that may appear across multiple sub-trees [Felsenstein, J., Inferring
Phylogenies, Sinauer Associates, Sunderland, MA, 2004].
The genetic distance concept is adopted as a data clustering
strategy in the neighbor-joining approach, which enables
reconstruction of unrooted trees. Neighbor-joining does not assume
a constant rate of evolution (that is, a molecular clock) across
lineages. In other words, because mutation rates are not constant,
as said before, evolutionary divergence times cannot be read
directly from the number of mutations. (In contrast, the UPGMA, to
be described later, reconstructs rooted trees using the
constant-rate assumption of an ultrametric tree in which, as said
earlier, the distances from the root to every branch tip are equal.)
Neighbor-joining is based on the minimum-evolution
criterion for phylogenetic trees, i.e. the topology that gives the least
total branch length is preferred at each step of the algorithm.
However, neighbor-joining may not find the true tree topology with
least total branch length because it is a greedy algorithm that
constructs the tree in a step-wise fashion. Even though it is sub-
optimal in this sense, it has been extensively tested and usually finds
a tree that is quite close to the optimal tree. Nevertheless, it has been
largely superseded in phylogenetics by methods that do not rely on
distance measures and offer superior accuracy under most conditions.
The main virtue of neighbor-joining relative to these other
methods is its computational efficiency. That is, neighbor-joining is a
polynomial-time algorithm. It can be used on very large data sets for
which other means of phylogenetic analysis (e.g. minimum
evolution, maximum parsimony and maximum likelihood) are
computationally prohibitive. Unlike the UPGMA algorithm for
phylogenetic tree reconstruction, neighbor-joining does not assume
that all lineages evolve at the same rate (molecular clock hypothesis)
and produces an unrooted tree. Rooted trees can be created by using
the outgroup and the root can then effectively be placed on the point
in the tree where the edge from the outgroup connects.
Furthermore, neighbor-joining is statistically consistent under
many models of evolution. Hence, given data of sufficient length,
neighbor-joining will reconstruct the true tree with high probability.
Atteson proved that if each entry in the distance matrix differs from
the true distance by less than half of the shortest branch length in the
tree, then neighbor joining will construct the correct tree.
on, until only two OTUs are left. The procedure is illustrated
in the following example.
Example x.1
Consider a set of six OTUs {α, β, χ, δ, ε, φ} whose sequences are
listed below:
α GAACGCTGCGTGGTGTAGTCGTCTGCGAGATATGGCTGG
β GAACGCTGCGTGGTGTAGTCGTCTGCGAGATATGGCTCT
χ GAACGCTGCGTCGTGTTGTCGTCTGTGAGATATGGCTCG
δ GAAGCCTGCGTGTGGTTGTCGTCTGCGAGATATGGCTCG
ε GAAGGTTGCGTGGTGTTGTCTGCTGCGAGATATGGCTCG
φ TCAGGCCGCGTGGTGTTGTCGTCTGCGAGAATTGGCTCG
Assuming that the above set of OTUs has the following common
ancestral root (R), the corresponding evolutionary distance (ED)
matrix can be constructed as illustrated in Figure x.
R TAAGGCTGCGTGGTGTTGTCGTCTGCGAGATATGGCTCG
Evolutionary distance (ED) matrix (EDM) of the taxa/OTUs:

     α   β   χ   δ   ε   φ
α    0
β    2   0
χ    4   4   0
δ    6   6   6   0
ε    6   6   6   4   0
φ    8   8   8   8   8   0
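
Entries of an EDM such as the one in Example x.1 can be obtained by counting the positions at which two aligned sequences differ. The following is a minimal sketch, assuming that the ED is this plain mismatch count between equal-length sequences; the function name `evolutionary_distance` is introduced here only for illustration, and only the sequences of α, β and χ are used.

```python
def evolutionary_distance(seq_a, seq_b):
    """Count the positions at which two aligned, equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(seq_a, seq_b) if x != y)

# Sequences of the OTUs alpha, beta and chi from Example x.1.
alpha = "GAACGCTGCGTGGTGTAGTCGTCTGCGAGATATGGCTGG"
beta = "GAACGCTGCGTGGTGTAGTCGTCTGCGAGATATGGCTCT"
chi = "GAACGCTGCGTCGTGTTGTCGTCTGTGAGATATGGCTCG"

print(evolutionary_distance(alpha, beta))  # 2
print(evolutionary_distance(alpha, chi))   # 4
print(evolutionary_distance(beta, chi))    # 4
```

These three counts agree with the corresponding EDM entries for the pairs (α, β), (α, χ) and (β, χ).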
41
R TAAGGCTGCGTGGTGTTGTCGTCTGCGAGATATGGCTCG
α GAACGCTGCGTGGTGTAGTCGTCTGCGAGATATGGCTGG
R TAAGGCTGCGTGGTGTTGTCGTCTGCGAGATATGGCTCG
β G A A C G C T G C G T G G T G T AG T C G T C T G C G A G A T A T G G C T C T
R TAAGGCTGCGTGGTGTTGTCGTCTGCGAGATATGGCTCG
χ GAACGCTGCGTCGTGTTGTCGTCTGTGAGATATGGCTCG
R TAAGGCTGCGTGGTGTTGTCGTCTGCGAGATATGGCTCG
δ GAAGCCTGCGTGTGGTTGTCGTCTGCGAGATATGGCTCG
R TAAGGCTGCGTGGTGTTGTCGTCTGCGAGATATGGCTCG
ε G A A G G T T G C G T G G T GT T G T C T G C T G C G A G A T A T G G C T C G
R TAAGGCTGCGTGGTGTTGTCGTCTGCGAGATATGGCTCG
φ TCAGGCCGCGTGGTGTTGTCGTCTGCGAGAATTGGCTCG
α G A A C G C T G C G T G G T G T A G T C G T C T G C G A G A T A T G G C TGG
β G A A C G C T G C G T G G T G T AG T C G T C T G C G A G A T A T G G C T CT
α G A A C G C T G C G T G G T G T A G T C G T C T G C G A G A T A T G G C TGG
χ G A A C G C T G C G T C G T G T T G T C G T C T G T G A G A T A T G G C T CG
α G A A C G C T G C G T G G T G T A G T C G T C T G C G A G A T A T G G C TGG
δ G A A G C C T G C G T G T G G T T G T C G T C T G C G A G A T A T G G C TCG
α G A A C G C T G C G T G G T G T A G T C GT C T G C G A G A T A T G G C TGG
42
ε G A A G G T T G C G T G G T GT T G T C TG C T G C G A G A T A T G G C T C G
α G A A C G C T G C G T G G T G T A G T C G T C T G C G A G A T A T G G C TGG
Φ T C A G G C C G C G T G G T G T T G T C G TC T G C G A G A A T T G G C TCG
β G A A C G C T G C G T G G T G T AG T C G T C T G C G A G A T A T G G C T C T
χ GAACGCTGCGTCGTGTTGTCGTCTGTGAGATATGGCTCG
β G A A C G C T G C G T G G T G T AG T C G T C T G C G A G A T A T G G C T C T
δ GAAGCCTGCGTGTGGTTGTCGTCTGCGAGATATGGCTCG
β G A A C G C T G C G T G G T G T AG T C G T C T G C G A G A T A T G G C T C T
ε G A A G G T T G C G T G G T GT T G T C T G C T G C G A G A T A T G G C T C G
β GAACGCTGCGTGGTGTAGTCGTCTGCGAGATATGGCT CT
φ T C A G G C C G C G T G G T G T T G T C G TC T G C G A G A A T T G G C T C G
χ GAACGCTGCGTCGTGTTGTCGTCTGTGAGATATGGCTCG
δ CAATGCTCCGTGGTGTTGTCGTCTGCGATATATGGCTCG
χ GAACGCTGCGTCGTGTTGTCGTCTGTGAGATATGGCTCG
ε G A A G G T T G C G T G G T GT T G T C T G C T G C G A G A T A T G G C T C G
χ GAACGCTGCGTCGTGTTGTCGTCTGTGAGATATGGCTCG
φ T C A G G C C G C G T G G T G T TG T C G TC T G C G A G A A T T G G C T C G
δ GAAGCCTGCGTGTGGTTGTCGTCTGCGAGATATGGCTCG
ε G A A G G T T G C G T G G T GT T G T C T G C T G C G A G A T A T G G C T C G

δ GAAGCCTGCGTGTGGTTGTCGTCTGCGAGATATGGCTCG
φ TCAGGCCGCGTGGTGTTGTCGTCTGCGAGAATTGGCTCG

ε GAAGGTTGCGTGGTGTTGTCTGCTGCGAGATATGGCTCG
φ TCAGGCCGCGTGGTGTTGTCGTCTGCGAGAATTGGCTCG

Step I: Given an EDM, the two OTUs that are most similar (bearing
the closest ED) are identified from among all the OTUs, and these
two OTUs are treated as a new single composite OTU. As shown in
Figure y.1, the OTU-pair α and β is chosen by virtue of its
smallest ED of 2 in the EDM; and the composite OTU is {α, β}.
Fig. y.1: Selection of two OTUs α and β that are most similar in the
given EDM (bearing the smallest ED equal to 2)
Step II: With the composite OTU {α, β} introduced in the EDM,
the new matrix is constructed by choosing again the most similar pair
and clustering them together as illustrated in Figure y.2. Relevant set
of EDs is elucidated as follows:
EDM with the composite OTU αβ:

      αβ   χ   δ   ε   φ
αβ    0
χ     4    0
δ     6    6   0
ε     6    6   4   0
φ     8    8   8   8   0

Fig. y.2: Construction of the new EDM with the composite OTU, {α
β}
Step III: As in the previous steps, among the new group of OTUs,
the pair with the highest similarity is identified and the
corresponding composite OTU is specified. This procedure is repeated
until only two OTUs are left. Corresponding EDM constructions are
shown in Figures y.3-y.x.
      αβ   χ   δε   φ
αβ    0
χ     4    0
δε    6    6   0
φ     8    8   8    0

Fig. y.3: Construction of the new EDM with the composite OTUs,
{αβ} and {δε}
       αβχ   δε   φ
αβχ    0
δε     6     0
φ      8     8    0

Fig. y.4: Construction of the new EDM with the composite OTUs,
{αβχ} and {δε}
Evolutionary distance (ED) matrix (EDM):

         αβχδε   φ
αβχδε    0
φ        8       0

Fig. y.4: Construction of the new EDM with the composite OTUs,
{αβχ δε}
The resulting ultrametric tree for the test EDM is illustrated in Figure
y.5.
[Figure y.5: ultrametric tree of the taxa/OTUs α to φ, drawn from
the root against the evolutionary distance (ED) axis]
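
Steps I to III can be traced in code. The following is a minimal sketch, assuming that the ED between two composite OTUs is the unweighted mean of the EDs between their members (the UPGMA rule); the function name `upgma_merges`, the spelled-out OTU names and the data layout are introduced here only for illustration. Ties between equally close pairs are broken by taking the first pair found, so the order in which equal-distance pairs are merged may differ from the order shown in the figures.

```python
from itertools import combinations

def upgma_merges(labels, dist):
    """Repeatedly merge the two closest clusters; return (cluster, cluster, ED) triples.

    dist maps frozenset({a, b}) to the ED between the single OTUs a and b.
    """
    def cluster_distance(c1, c2):
        # Unweighted mean of all pairwise EDs between members of the two clusters.
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    clusters = [(name,) for name in labels]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_distance(*pair))
        merges.append((c1, c2, cluster_distance(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges

labels = ["alpha", "beta", "chi", "delta", "epsilon", "phi"]
# Lower triangle of the EDM of Example x.1, row by row.
lower_triangle = {
    "beta": [2],
    "chi": [4, 4],
    "delta": [6, 6, 6],
    "epsilon": [6, 6, 6, 4],
    "phi": [8, 8, 8, 8, 8],
}
dist = {
    frozenset((row, labels[col])): value
    for row, values in lower_triangle.items()
    for col, value in enumerate(values)
}

merges = upgma_merges(labels, dist)
for c1, c2, ed in merges:
    print(c1, "+", c2, "at ED", ed)
```

The merges occur at EDs of 2, 4, 4, 6 and 8, with α and β joined first and φ joined last, which matches the sequence of reduced matrices in the example.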
Problem y.x
The EDM with metrics of distances depicting a hypothetical
ultrametric phylogenetic tree is shown below. Suppose the following
sequence depicts the common ancestral root (R) corresponding to the
given EDM. Determine the sequences that can be specified for the
OTU-set {a, b, c, d, e, f, g} of the tree.
R: GAATGTTGCGTGGTGTTGTGGTCTGCGAGATATAACTCGAATGCCT
Evolutionary distance (ED) matrix (EDM) of the taxa/OTUs:

     a   b   c   d   e   f   g
a    0
b    2   0
c    4   4   0
d    6   6   6   0
e    6   6   6   4   0
f    6   6   6   4   2   0
g    8   8   8   8   8   8   0
Problem y.x
Consider an EDM with metrics of distances as shown below. For the
ED values of the matrix indicated, relevant topology is as shown.
Evolutionary distance (ED) matrix (EDM) of the taxa/OTUs:

     α   β    χ   δ   ε   φ
α    0
β    5   0
χ    4   7    0
δ    7   10   7   0
ε    6   9    6   5   0
φ    8   11   8   9   8   0
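[Figure: rooted topology for the taxa/OTUs α, χ, β, δ, ε, φ (from top
to bottom), with branch lengths printed, in reading order, as 2.0,
1.0, 2.0, 0.5, 3.0, 0.5, 2.5, 1.5, 2.5 and 4.5]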
Example y.y
The EDM below is a real data matrix for five ribosomal RNA
sequences. Each value denotes the estimated number of nucleotide
residue substitutions per position separating the corresponding
pair of presently existing sequences.
Node        1       2       3       4       5
     Taxa   Bsu     Bst     Lvi     Amo     Mlu
1    Bsu    0       0.172   0.215   0.309   0.233
2    Bst            0       0.299   0.340   0.206
3    Lvi                    0       0.280   0.394
4    Amo                            0       0.429
5    Mlu                                    0
Solution:
[Figure: tree relating the taxa Bsu, Bst, Mlu, Lvi and Amo]
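[Figure: four-OTU tree of α, β, χ and δ with interior nodes X and Y
and branch lengths la, lb, lc and ld]
[Figure y.t: (a) star-like tree of OTUs 1 to 8 centered at X; (b) tree
in which OTUs 1 and 2 are joined at X and the remaining OTUs at Y]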
Let the distance between the OTU-pair i and j be dij, and let lab be
the distance between nodes a and b. The sum of the branch lengths for
the nonhierarchical star-like tree centered at X in Figure y.t(a) is
then given by:

\[ S_0 = \sum_{i=1}^{N} l_{iX} = \frac{1}{N-1} \sum_{i<j} d_{ij} \qquad \text{(y.c)} \]
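The length of the interior branch XY in Figure y.t(b) is

\[ l_{XY} = \frac{1}{2(N-2)} \left[ \sum_{k=3}^{N} (d_{1k} + d_{2k}) - (N-2)(l_{1X} + l_{2X}) - 2 \sum_{i=3}^{N} l_{iY} \right] \]

where the term $\sum_{k=3}^{N} (d_{1k} + d_{2k})$ specifies the sum of
all distances that include lXY, and $(N-2)(l_{1X} + l_{2X})$ and
$2 \sum_{i=3}^{N} l_{iY}$ are terms that depict irrelevant entities
being excluded in computing lXY. Further, if the interior branch
(X to Y) in Figure y.t(b) is removed, there will be two independent
star-like trees, one for the OTU-pair {1 and 2} and the other for the
remaining set of (N – 2) OTUs. Corresponding branch lengths,
$(l_{1X} + l_{2X})$ and $\sum_{i=3}^{N} l_{iY}$, can be deduced by
applying equation (y.c) to each of these two star-like trees. That is,

\[ l_{1X} + l_{2X} = d_{12}, \qquad \sum_{i=3}^{N} l_{iY} = \frac{1}{N-3} \sum_{3 \le i < j} d_{ij} \]
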
Adding these branch lengths, the sum of all branch lengths of the tree
configuration of Figure y.t(b), namely S12, is obtained as follows:

\[
\begin{aligned}
S_{12} &= l_{XY} + (l_{1X} + l_{2X}) + \sum_{i=3}^{N} l_{iY} \\
       &= \frac{1}{2(N-2)} \sum_{k=3}^{N} (d_{1k} + d_{2k}) + \frac{1}{2} d_{12} + \frac{1}{N-2} \sum_{3 \le i < j} d_{ij}
\end{aligned}
\qquad \text{(y.zz)}
\]
where the three terms on the right-hand side denote, explicitly, the
following:

$\frac{1}{2(N-2)} \sum_{k=3}^{N} (d_{1k} + d_{2k})$ = mean distance from OTU1 and
OTU2 to the rest of the OTUs 3, 4, …, N;

$\frac{1}{2} d_{12}$ = half the pairwise distance from OTU1 to OTU2;

$\frac{1}{N-2} \sum_{3 \le i < j} d_{ij}$ = sum of all pairwise distances
between the OTUs 3 to N, scaled by 1/(N − 2).
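
Equation (y.zz) can be evaluated for any candidate pair of OTUs, not only for the pair labeled 1 and 2. The following is a minimal sketch under that assumption; the function name `branch_length_sum` and the four-OTU distance matrix are illustrative values chosen here, not data taken from the derivation.

```python
def branch_length_sum(d, i, j):
    """Sum of branch lengths S_ij when OTUs i and j are joined, per equation (y.zz).

    d is a symmetric distance matrix given as a list of lists.
    """
    n = len(d)
    rest = [k for k in range(n) if k not in (i, j)]
    # First term: mean distance from the joined pair to all remaining OTUs.
    to_rest = sum(d[i][k] + d[j][k] for k in rest) / (2 * (n - 2))
    # Third term: pairwise distances among the remaining OTUs, scaled by 1/(N - 2).
    among_rest = sum(
        d[a][b] for idx, a in enumerate(rest) for b in rest[idx + 1:]
    ) / (n - 2)
    return to_rest + d[i][j] / 2 + among_rest

# Illustrative symmetric distance matrix for four OTUs (indices 0 to 3).
d = [
    [0, 7, 11, 14],
    [7, 0, 6, 9],
    [11, 6, 0, 7],
    [14, 9, 7, 0],
]
print(branch_length_sum(d, 0, 1))  # 17.0
print(branch_length_sum(d, 0, 2))  # 18.5
```

Joining OTUs 0 and 1 gives the smaller sum, so under the minimum-evolution criterion that pair would be preferred over the pair 0 and 2 for this matrix.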
Further, d1Z and d2Z are distances between (1 and Z) and (2 and Z)
respectively. Explicitly, they are determined by [Nei 1987]:

\[ d_{1Z} = \frac{1}{N-2} \sum_{i=3}^{N} d_{1i} \qquad \text{(y.uc)} \]

\[ d_{2Z} = \frac{1}{N-2} \sum_{i=3}^{N} d_{2i} \qquad \text{(y.ud)} \]
And, l1X and l2X denote least-squares estimates for the tree of the
illustration in Figure y.t(b).
The neighbor-joining method is, in essence, an iterative
algorithm that assumes an additive tree (a summing procedure).
Each iteration is indicated in the following pseudocode:
__________________________________________________
Phylogenetic tree construction: Pseudocode on
NJ method
Input distance matrix
← Given a set of multiple sequences data
← Call subroutine for sequence alignment:
Global multiple alignment and selecting
reliable data of aligned sequences
devoid of any gaps; that is, any gap
Distance matrix with entries dij (i, j = 1, …, N):

d11   d12   . . .   d1N
d21         . . .   d2N
 .          dij      .
dN1   dN2   . . .   dNN

Node        1       2       3       4       5
     Taxa   Bsu     Bst     Lvi     Amo     Mlu
1    Bsu    0       0.172   0.215   0.309   0.233
2    Bst            0       0.299   0.340   0.206
3    Lvi                    0       0.280   0.394
4    Amo                            0       0.429
5    Mlu                                    0
Initialize
Step 1
Start off with a star tree (Figure y.y1)
1 5
2 3
Figure y.y1: Star-like tree initialized
Step 2
Define/identify the neighbors
← two nodes with the lowest value in the
matrix are chosen and they are defined
Step 3
Compute the branch lengths between this
composite node (1-2) and the rest of the
nodes
← This computation involves taking the
unweighted arithmetic mean of all the
pairwise distances of the new OTU (1-2)
with respect to all other OTUs
(namely, 3, 4, 5). Hence,
→ the distance between (1-2) and 3
is: d(1-2)3 = (d13 + d23)/2
→ the distance between (1-2) and
4 is: d(1-2)4 = (d14 + d24)/2
→ the distance between (1-2) and
5 is: d(1-2)5 = (d15 + d25)/2
Step 4
Construct the resulting table as shown below:

Node        1       2       3       4       5
     Taxa   Bsu     Bst     Lvi     Amo     Mlu
1    Bsu    0       0.172   0.215   0.309   0.233
2    Bst            0       0.299   0.340   0.206
3    Lvi                    0       0.280   0.394
4    Amo                            0       0.429
5    Mlu                                    0

Node            1-2       3       4       5
      Taxa      Bsu-Bst   Lvi     Amo     Mlu
1-2   Bsu-Bst   0         0.257   0.325   0.220
3     Lvi                 0       0.280   0.394
4     Amo                         0       0.429
5     Mlu                                 0
Step 5
Repeat the process from Step 2, again looking
for the 2 nearest nodes, and construct the
resultant matrix
← Now OTU 5 (Mlu) and the composite OTU
(1-2) (Bsu/Bst) constitute the nearest
neighbors
← Construct the new matrix with:
d(1-2)5 = (d15 + d25)/2 = (d1x + d2x)
d[(1-2)5]3 = (d1x + d13)/2 + (d2x + d23)/2
d[(1-2)5]4 = (d1x + d14)/2 + (d2x + d24)/2

Node                    (1-2)5         3       4
         Taxa           (Bsu/Bst)Mlu   Lvi     Amo
(1-2)5   (Bsu/Bst)Mlu   0              0.282   0.435
3        Lvi                           0       0.280
4        Amo                                   0
Step 6
Repeat the process from Step 2 again looking
for the 2 nearest nodes, and constructing the
resultant matrix
← Now composite OTU(1-2)5(Bsu-Bst)Mlu and
the composite OTU(3-4)(Lvi-Amo) form
nearest neighbors
← Construct the new matrix with:
d(1-2)5 =(d1x + d2x)
______________________________________
_______________________________________________________
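
The averaging of Step 3 and the reduced table of Step 4 can be reproduced in code. The following is a minimal sketch, assuming the distance between a composite OTU and any other OTU is the arithmetic mean of the two member distances, as stated in Step 3; the function name `merge_pair` and the dictionary layout are introduced here only for illustration.

```python
def merge_pair(dist, a, b):
    """Collapse OTUs a and b into one composite OTU using the mean of Step 3."""
    merged = f"{a}-{b}"
    new_dist = {}
    for pair, value in dist.items():
        if pair == frozenset((a, b)):
            continue  # the joined pair itself disappears from the matrix
        if a in pair or b in pair:
            (other,) = pair - {a, b}
            mean = (dist[frozenset((a, other))] + dist[frozenset((b, other))]) / 2
            new_dist[frozenset((merged, other))] = mean
        else:
            new_dist[pair] = value  # distances between untouched OTUs are kept
    return new_dist

taxa = ["Bsu", "Bst", "Lvi", "Amo", "Mlu"]
# Upper triangle of the rRNA distance matrix, row by row.
upper = [
    [0.172, 0.215, 0.309, 0.233],
    [0.299, 0.340, 0.206],
    [0.280, 0.394],
    [0.429],
]
dist = {
    frozenset((taxa[i], taxa[i + 1 + k])): value
    for i, row in enumerate(upper)
    for k, value in enumerate(row)
}

reduced = merge_pair(dist, "Bsu", "Bst")
for other in ("Lvi", "Amo", "Mlu"):
    print("Bsu-Bst", other, reduced[frozenset(("Bsu-Bst", other))])
```

Up to rounding to three decimals, the printed values correspond to the first row of the reduced table of Step 4.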
← Each I involves…
Calculate: Q-matrix based on the current
distance matrix.
← For a given distance d(i, j) between
taxa i and j and using the corresponding
distance matrix relating r taxa, Q-
matrix is specified as follows:
← $Q_{i,j} = (r - 2)\, d(i, j) - \sum_{k=1}^{r} d(i, k) - \sum_{k=1}^{r} d(j, k)$   (y.1)
Determine: The pair of taxa with the
lowest value in Q
Create: A node on the tree that joins this
pair of taxa.
← the closest neighbors are joined
     β    γ
α    25   30
β         35

[Figure: three-branch tree with branches a, b and c leading to α, β
and γ]

Solution:
From the distance matrix given:
α to β: a + b = 25
α to γ: a + c = 30
β to γ: b + c = 35
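
The three equations above form a linear system with a unique solution. The following is a minimal sketch of that solution, obtained by adding two of the equations and subtracting the third; the function name `three_point_branches` is introduced here only for illustration.

```python
def three_point_branches(d_ab, d_ac, d_bc):
    """Solve a + b = d_ab, a + c = d_ac, b + c = d_bc for the branch lengths."""
    a = (d_ab + d_ac - d_bc) / 2
    b = (d_ab + d_bc - d_ac) / 2
    c = (d_ac + d_bc - d_ab) / 2
    return a, b, c

print(three_point_branches(25, 30, 35))  # (10.0, 15.0, 20.0)
```

With the distances given, the branches leading to α, β and γ have lengths 10, 15 and 20, and each pairwise sum reproduces the corresponding matrix entry.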
Example y.z
A hypothetical set of multiple sequences is given below:
Solution
The distances depict the number of changes seen in the nucleotides
(or amino acids) between any two sequences under comparison. Hence,
the required matrix is:

     α   β   δ   ε
α        3   7   8
β            6   7
δ                3
ε

[Figure: tree of α, β, δ and ε with terminal branch lengths α: 2,
β: 1, δ: 1 and ε: 2]
     α    β    δ    ε
α
β    7
δ    11   6
ε    14   9    7

Solution:
Q-matrix:

     α      β      δ      ε
α
β    −40
δ    −34    −34
ε    −34    −34    −40
In the example above, two pairs of taxa share the lowest value,
namely −40: (α, β) and (δ, ε). Either of them can be selected for
the second step of the algorithm. The example proceeds assuming that
taxa α and β are joined together, which involves adding a new node
to the tree. Adding more branches is iterative: if more sequences
are included, corresponding branches are introduced in the tree, and
the number of joining steps grows linearly with the number of
sequences.
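
The Q-matrix of this example can be recomputed directly from equation (y.1). The following is a minimal sketch; the function name `q_matrix`, the index order (0 = α, 1 = β, 2 = δ, 3 = ε) and the choice of `None` for the unused diagonal are assumptions made here for illustration.

```python
def q_matrix(d):
    """Q(i, j) = (r - 2) d(i, j) - sum_k d(i, k) - sum_k d(j, k) for i != j."""
    r = len(d)
    row_sums = [sum(row) for row in d]
    return [
        [
            None if i == j else (r - 2) * d[i][j] - row_sums[i] - row_sums[j]
            for j in range(r)
        ]
        for i in range(r)
    ]

# Distance matrix of the example; indices 0..3 stand for alpha, beta, delta, epsilon.
d = [
    [0, 7, 11, 14],
    [7, 0, 6, 9],
    [11, 6, 0, 7],
    [14, 9, 7, 0],
]
q = q_matrix(d)

# Pair with the lowest Q value; with ties, the first pair found is taken.
pairs = [(i, j) for i in range(len(d)) for j in range(i + 1, len(d))]
best = min(pairs, key=lambda p: q[p[0]][p[1]])
print(q[0][1], q[0][2], q[2][3], best)  # -40 -34 -40 (0, 1)
```

The tie between the two pairs with Q = −40 is resolved here in favor of the first pair, which is the same choice as the one followed in the example.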
For each taxon not considered above, the distance to the new node is
obtained from:

\[ d(\omega, k) = \frac{1}{2} \left[ d(\theta, k) - d(\theta, \omega) \right] + \frac{1}{2} \left[ d(\phi, k) - d(\phi, \omega) \right] \qquad \text{(y.3)} \]

where, as said before, ω is the new node, θ and φ are the members
of the pair just joined, and k is the node for which the distance is
required.
Problem y.1:
(a) In the Example y.2 determine: (i) the distance between α and
the new node; and (ii) the distance between β and the new
node. Use equation (y.2). (Answers: (i) 6; (ii) 1)
(b) In the Example y.1 determine: (i) the distance between δ and
the new node; and (ii) the distance between ε and the new
node. Use equation (y.3). (Answers: (i) 5; (ii) 8)
      αβ   δ   ε
αβ
δ     5
ε     8    7
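
The answers quoted in Problem y.1 and the reduced matrix above can be checked numerically. The following is a minimal sketch, assuming that the distance from each joined taxon to the new node follows the standard neighbor-joining formula (taken here to be what equation (y.2) states) and that the remaining distances follow equation (y.3); the function names and the index order (0 = α, 1 = β, 2 = δ, 3 = ε) are introduced only for illustration.

```python
def limb_lengths(d, f, g):
    """Distances from the joined taxa f and g to the new node (standard NJ formula)."""
    r = len(d)
    d_f = d[f][g] / 2 + (sum(d[f]) - sum(d[g])) / (2 * (r - 2))
    return d_f, d[f][g] - d_f

def distance_to_new_node(d, f, g, k, d_f, d_g):
    """Equation (y.3): distance from the new node to a remaining taxon k."""
    return (d[f][k] - d_f) / 2 + (d[g][k] - d_g) / 2

# Distance matrix of the example; indices 0..3 stand for alpha, beta, delta, epsilon.
d = [
    [0, 7, 11, 14],
    [7, 0, 6, 9],
    [11, 6, 0, 7],
    [14, 9, 7, 0],
]
d_alpha, d_beta = limb_lengths(d, 0, 1)
d_delta = distance_to_new_node(d, 0, 1, 2, d_alpha, d_beta)
d_epsilon = distance_to_new_node(d, 0, 1, 3, d_alpha, d_beta)
print(d_alpha, d_beta, d_delta, d_epsilon)  # 6.0 1.0 5.0 8.0
```

The four values agree with the answers stated in Problem y.1 and with the δ and ε entries of the reduced matrix.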
but often it finds a tree close to optimal topology. Yet there are other
methods that do not rely on distance measures and give results with
better accuracy.
The NJ method is computationally efficient, being a
polynomial-time algorithm, so very large data sets can be handled
by the NJ approach. (Other phylogenetic analyses, like minimum
evolution, maximum parsimony and maximum likelihood, are
computationally intensive.) Though NJ computation leads to an
unrooted topology, rooted trees can be obtained by using an
outgroup; and the root can be appropriately inserted at the point in
the tree where the edge from the outgroup connects. That is,
inasmuch as phylogenetic inference via the NJ method yields an
unrooted tree, it is not possible to tell from this result alone
which of the OTUs branched off before all the others. So, in order
to root a tree, one should add an “outgroup” to the data set. An
outgroup is an OTU for which external information (for example,
paleontological information) indicates that it branched off before
all other taxa (see Figure y.1). In practice, distance-matrix
methods call for the inclusion of at least one outgroup sequence
(which is known to be only distantly related to the sequences of
interest in the query set). This procedure is akin to an
experimental control (contrasting an experimental group with a
control group). When the chosen outgroup is appropriate, it will
show a much greater genetic distance (and hence will possess a
longer branch) than any other sequence in the set; as such, it will
appear near the root of a rooted tree, or indicate the nearest node
as the possible root node in an unrooted topology (Figure y.1).
Further, choosing an appropriate outgroup requires the selection
of a sequence that is moderately related to the sequences of
interest. If an outgroup is too closely related, it loses its
significance as an outgroup. Likewise, when an outgroup is too
distantly related, it simply adds noise to the analysis. In
prescribing an outgroup, care must be exercised so as to omit cases
where the species from which the sequences were taken are distantly
related, but the gene encoded by the sequences is highly conserved
across lineages. Another confounding aspect of outgroup selection
and usage is horizontal gene transfer (between otherwise divergent
bacteria).
Fitch-Margoliash method
This method uses the genetic distance and follows a weighted
least-squares method for clustering. That is, more weight in the
tree construction process is given to closely related sequences.
This approach reduces the inaccuracy associated with the distance
measurement on distantly related sequences. In the relevant
computation, the distances used as input to the algorithm are first
normalized so as to prevent large deviational artifacts in
(computing) relationships between closely related versus distantly
related groups.
Further, the distances adopted for this method must be linear. This
linearity criterion for distances requires that the expected values of
the branch lengths for two individual branches must equal the
expected value of the sum of the two branch distances. This property
applies to biological sequences only when they have been corrected
for the possibility of back mutations at individual sites. (This
correction can be done via substitution matrix such as that derived
from the Jukes-Cantor model of DNA evolution. The distance
correction is, however, necessary in practice only when the evolution
rates differ among branches).
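
The correction for multiple and back mutations under the Jukes-Cantor model has a closed form. The following is a minimal sketch, assuming that the input is the observed proportion p of differing sites between two aligned DNA sequences; the function name `jukes_cantor_distance` is introduced here only for illustration.

```python
import math

def jukes_cantor_distance(p):
    """Convert an observed proportion p of differing sites into a Jukes-Cantor distance."""
    if not 0 <= p < 0.75:
        raise ValueError("the correction is defined only for 0 <= p < 0.75")
    # d = -(3/4) ln(1 - (4/3) p): expected substitutions per site.
    return -0.75 * math.log(1 - 4 * p / 3)

for p in (0.05, 0.1, 0.3):
    print(p, round(jukes_cantor_distance(p), 4))
```

The corrected distance always exceeds the observed proportion, and the gap widens as p grows, which is why uncorrected distances between distantly related sequences underestimate the amount of change.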
Though the least-squares criterion makes the distances in the FM
approach more accurate, it burdens the computational effort more
heavily than the NJ method. Further, in addition to the
least-squares criterion, another improvement that
recovered tree is the one that best fits the data. A more appropriate
analytical procedure would be to use NJ to produce a starting tree,
then employ a tree search using an optimality criterion, to ensure that
the best tree is recovered.
Subroutine III
← Making of the required phylogenetic tree:

Program: ClustalW
   Simple or sophisticated? Hassle-free
   Remarks: Essentially, it is a multiple sequence alignment (MSA)
   tool that also assembles a tree; but this is not a phylogenetic
   tree. It is a guide tree that ClustalW uses to make the MSA. Only
   with EBI ClustalW is a phylogenetic tree enabled.

Program: Phylip
   Simple or sophisticated? Sophisticated, and involves running
   several programs
   Remarks: Contains many programs. Both distance- and
   character-based analyses are feasible. Bootstrapping is possible.
   No graphical interface.

Program: PhyloWin
   Simple or sophisticated? Fairly simple
   Free or commercial? Free
   Remarks: It enables maximum likelihood on nucleotide sequences.
   Allows maximum parsimony computation.

Program: TreeView
   Remarks: Does visualization of trees. Its link: Wiki

Further "Free or commercial?" entries, in the order printed: Free;
Commercial; Free package
______________________________________________
used to infer where the root of the tree is located. One approach is
outgroup rooting (Felsenstein, 2004). Whenever three monophyletic
groups of organisms are compared, and two of them are more closely
related to each other than either is to the third, the third group is
known as the outgroup. Generally this technique is used to determine
the locations of the root of the tree as the outgroup is assumed a
priori to have diverged earlier from the common ancestor than the
other groups in the tree. Another method is to assume a molecular
clock when constructing the tree. By assuming a molecular clock, we
are saying that evolution on all lineages of the tree happened at the
same rate (homogeneous rate of evolution). This makes the distance
from root to tip equal for all paths emanating from the root.
Therefore, we can find the root of the tree by locating the place in the
tree where the distance along all lineages to all tips is approximately
equal (Felsenstein, 2004).
Figure 1.3: A rooted binary tree of animals (a) and its unrooted
equivalent (b).