Taxonomies
Gabriel Valiente
Phylum Streptophyta
Class Streptophytina
Order Solanales
Family Solanaceae
Genus Solanum
Phylum Chordata
Class Mammalia
Order Primates
Family Hominidae
[Figure labels: mapping; matching statistics; taxonomic reference; non-taxonomic vs. taxonomic assignment; non-taxonomic vs. taxonomic classification]
Classification of whole genomes
• k-mer searching
• Top hit Closest sequence to the sequence read
• Best stratum Sequences at the same distance as the top hit
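The two notions above can be illustrated with a minimal sketch. It assumes that each read comes with a table of mismatch counts against reference sequences; the function name, the reference identifiers, and the alphabetical tie-break for the top hit are illustrative assumptions, not part of the slides.

```python
def top_hit_and_best_stratum(distances):
    """Return (top hit, best stratum) for one read.

    distances: dict mapping a reference id to its number of mismatches
    against the read (hypothetical input format).
    """
    best = min(distances.values())
    # Best stratum: every reference at the same distance as the top hit.
    stratum = sorted(ref for ref, d in distances.items() if d == best)
    # Top hit: one closest reference; ties are broken alphabetically here.
    return stratum[0], stratum


# Hypothetical mismatch counts for a single read.
hits = {"refA": 2, "refB": 1, "refC": 1, "refD": 3}
top, stratum = top_hit_and_best_stratum(hits)
print(top, stratum)
```

With these counts the top hit is one reference, while the best stratum keeps both references that tie at the minimum distance.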
\[ P = \frac{TP}{TP + FP} \]
\[ R = \frac{TP}{TP + FN} \]
\[ F = \frac{2}{\frac{1}{P} + \frac{1}{R}} = \frac{2PR}{P + R} \]
Given a reference taxonomy T, a set R of sequence reads, and a
threshold value k of sequence similarity,
• Let Ri be the ith read
• Let Mi be the leaves of T matching Ri with up to k mismatches
• Let Ti be the subtree of T rooted at the lowest common ancestor
of Mi
• Let Ni be the leaves of Ti not matching Ri with up to k
mismatches
For the ith read, the leaves of Ti can be partitioned in the following four
subsets:
• TP i = Mi (true positives)
• FP i = Ni (false positives)
• TN i = ∅ (true negatives)
• FN i = ∅ (false negatives)
[Figure: the subtree Ti, with its leaves split into Ni (FPi) and Mi (TPi)]
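The definitions of Mi, Ti and Ni can be sketched as follows. This is a minimal illustration, assuming the taxonomy is stored as a child-to-parent dictionary; the toy tree, the node names, and the helper functions are hypothetical and are not taken from TANGO.

```python
def path_to_root(parent, node):
    """List of nodes from `node` up to the root, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def lca(parent, nodes):
    """Lowest common ancestor of a non-empty collection of nodes."""
    nodes = list(nodes)
    common = path_to_root(parent, nodes[0])
    for n in nodes[1:]:
        ancestors = set(path_to_root(parent, n))
        common = [a for a in common if a in ancestors]
    return common[0]  # the deepest node shared by all paths


def leaves_under(parent, root):
    """Set of leaves in the subtree rooted at `root`."""
    children = {}
    for child, par in parent.items():
        children.setdefault(par, []).append(child)
    stack, leaves = [root], set()
    while stack:
        n = stack.pop()
        if n in children:
            stack.extend(children[n])
        else:
            leaves.add(n)
    return leaves


# Toy taxonomy: root -> A, B; A -> a1, a2; B -> b1, b2, b3.
parent = {"A": "root", "B": "root", "a1": "A", "a2": "A",
          "b1": "B", "b2": "B", "b3": "B"}
m_i = {"b1", "b2"}                           # leaves matching read i
lca_node = lca(parent, m_i)                  # root of the subtree Ti
n_i = leaves_under(parent, lca_node) - m_i   # leaves of Ti not matching
print(lca_node, sorted(n_i))
```

At the LCA itself, TP i is `m_i` and FP i is `n_i`, while TN i and FN i are empty, as in the list above.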
Given a reference taxonomy T, a set R of sequence reads, and a
threshold value k of sequence similarity,
• Let Tij be the subtree of T rooted at the jth node of Ti
• Let Mij be the leaves of Tij matching Ri with up to k mismatches
• Let Nij be the leaves of Tij not matching Ri with up to k
mismatches
For the ith read and the jth node of Ti, the leaves of Ti can be
partitioned in the following four subsets:
• TP ij = Mij (true positives)
• FP ij = Nij (false positives)
• TN ij = Ni \ Nij (true negatives)
• FN ij = Mi \ Mij (false negatives)
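[Figure: the subtree Tij nested inside Ti; the leaves Nij and Mij lie inside Tij, the remaining leaves of Ni and Mi lie outside it]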
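The four-way partition for a candidate node j reduces to a few set operations. The sketch below assumes that Mi, Ni and the leaves of Tij are already available as Python sets; the function name and the toy leaf names are illustrative.

```python
def partition(m_i, n_i, leaves_ij):
    """Split the leaves of Ti into (TP, FP, TN, FN) for the node j
    whose subtree Tij has the leaf set `leaves_ij`."""
    tp = m_i & leaves_ij   # Mij: matching leaves inside Tij
    fp = n_i & leaves_ij   # Nij: non-matching leaves inside Tij
    tn = n_i - fp          # Ni \ Nij: non-matching leaves outside Tij
    fn = m_i - tp          # Mi \ Mij: matching leaves outside Tij
    return tp, fp, tn, fn


# Toy data: Ti has leaves a1, a2, b1, b2, b3; node j covers b1, b2, b3.
m_i = {"a1", "b1", "b2"}
n_i = {"a2", "b3"}
tp, fp, tn, fn = partition(m_i, n_i, {"b1", "b2", "b3"})
print(sorted(tp), sorted(fp), sorted(tn), sorted(fn))
```

The four sets are pairwise disjoint and their union is the full leaf set of Ti, which is what makes them a partition.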
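\[ P_{ij} = \frac{|TP_{ij}|}{|TP_{ij}| + |FP_{ij}|} \]
\[ R_{ij} = \frac{|TP_{ij}|}{|TP_{ij}| + |FN_{ij}|} \]
\[ F_{ij} = \frac{2}{\frac{1}{P_{ij}} + \frac{1}{R_{ij}}} = \frac{2 P_{ij} R_{ij}}{P_{ij} + R_{ij}} \]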
Bacteria
Aquificae
Aquificae
Aquificales
Aquificaceae
Aquifex
Aquifex pyrophilus
Hydrogenobaculum
Hydrogenobaculum acidophilum
Hydrogenobacter
Hydrogenobacter subterraneus
Hydrogenobacter thermophilus
Hydrogenobacter hydrogenophilus
Persephonella
Persephonella hydrogeniphila
Persephonella marina
Persephonella guaymasensis
Sulfurihydrogenibium
Sulfurihydrogenibium subterraneum
Sulfurihydrogenibium azorense
Sulfurihydrogenibium yellowstonense
Thermocrinis
Thermocrinis albus
Thermocrinis ruber
Hydrogenivirga
Hydrogenivirga caldilitoris
Worked examples:
• P = 6/(6 + 8) = 43%, R = 6/(6 + 0) = 100%, F = 60%
• P = 3/(3 + 0) = 100%, R = 3/(3 + 3) = 50%, F = 67%
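The percentages in this example can be rechecked from the raw counts. The sketch below takes the (TP, FP, FN) counts that appear in the two calculations, (6, 8, 0) and (3, 0, 3), and uses exact fractions so that rounding happens only when printing; the function name is an illustrative choice.

```python
from fractions import Fraction


def prf(tp, fp, fn):
    """Exact precision, recall and F-measure from leaf counts (tp > 0)."""
    p = Fraction(tp, tp + fp)
    r = Fraction(tp, tp + fn)
    return p, r, 2 * p * r / (p + r)


lines = []
for tp, fp, fn in [(6, 8, 0), (3, 0, 3)]:
    p, r, f = prf(tp, fp, fn)
    lines.append(f"P = {float(p):.0%}, R = {float(r):.0%}, F = {float(f):.0%}")
print("\n".join(lines))
```

The first case gives F = 3/5 and the second F = 2/3, so the node with fewer but purer leaves scores higher.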
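F-measure The combined F-measure of precision and recall is
\[ F_{ij} = \frac{2 P_{ij} R_{ij}}{P_{ij} + R_{ij}} \]
Penalty score The penalty score with weight q is
\[ PS_{ij} = q \, \frac{|FN_{ij}|}{|TP_{ij}|} + (1 - q) \, \frac{|FP_{ij}|}{|TP_{ij}|} \]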
For q = 1/2, the node that minimizes the penalty score is the same node that
maximizes the F-measure, because in that case F = 1/(1 + PS)
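The relation between the penalty score and the F-measure can be checked numerically. The sketch below assumes q = 1/2, for which F = 2|TP| / (2|TP| + |FP| + |FN|) = 1 / (1 + PS); for other values of q the two criteria need not pick the same node. The node labels are hypothetical, and the counts are the (TP, FP, FN) triples from the Aquificae example.

```python
def penalty_score(tp, fp, fn, q=0.5):
    """PS = q * FN/TP + (1 - q) * FP/TP, defined for tp > 0."""
    return q * fn / tp + (1 - q) * fp / tp


def f_measure(tp, fp, fn):
    """F written directly in terms of the counts."""
    return 2 * tp / (2 * tp + fp + fn)


# Hypothetical node labels with (TP, FP, FN) counts.
candidates = {"wide_node": (6, 8, 0), "narrow_node": (3, 0, 3)}

best_by_ps = min(candidates, key=lambda n: penalty_score(*candidates[n]))
best_by_f = max(candidates, key=lambda n: f_measure(*candidates[n]))
print(best_by_ps, best_by_f)
```

Both criteria select the same candidate here, as expected for q = 1/2.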
Theorem
Given a set Mi ⊆ L of hits and the subtree Ti of T rooted at the LCA of
Mi, the penalty scores PSi,j for every node j in Ti can be obtained in
O(|Ti|) total time
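One way to reach a linear bound is a single bottom-up pass that counts leaves and hits per node, after which each score follows in constant time. The sketch below is an illustration under that assumption and is not taken from the TANGO implementation; the `children` dictionary format, the default q = 0.5, and the use of infinity for nodes without hits are choices made here.

```python
import math


def penalty_scores(children, root, hits, q=0.5):
    """Penalty score of every node in the subtree rooted at `root`.

    children: dict mapping a node to the list of its children.
    hits: set of matching leaves, all assumed to lie under `root`.
    """
    # Iterative preorder: every node appears after its parent.
    order, stack = [], [root]
    while stack:
        n = stack.pop()
        order.append(n)
        stack.extend(children.get(n, []))

    n_leaves, n_hits = {}, {}
    for n in reversed(order):  # children are processed before parents
        kids = children.get(n, [])
        if not kids:
            n_leaves[n] = 1
            n_hits[n] = 1 if n in hits else 0
        else:
            n_leaves[n] = sum(n_leaves[c] for c in kids)
            n_hits[n] = sum(n_hits[c] for c in kids)

    total = len(hits)
    scores = {}
    for n in order:
        tp = n_hits[n]
        fp = n_leaves[n] - tp
        fn = total - tp
        # A node with no hits below it has no defined score; use infinity.
        scores[n] = math.inf if tp == 0 else q * fn / tp + (1 - q) * fp / tp
    return scores


# Toy subtree: root -> A, B; A -> a1, a2; B -> b1, b2, b3.
children = {"root": ["A", "B"], "A": ["a1", "a2"], "B": ["b1", "b2", "b3"]}
scores = penalty_scores(children, "root", {"a1", "b1", "b2"})
best = min(scores, key=scores.get)
print(best, scores[best])
```

Each node is visited a constant number of times and each edge is summed once, so the work is proportional to the size of the subtree.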
Theorem
Given a set Mi ⊆ L of hits and the subtree Ti of T rooted at the LCA of
Mi, the penalty scores PSi,j for every node j in Ti can be obtained in
O(|Mi|) total time after O(|T|) time preprocessing
Definition
Any node j in Ti is called relevant if it is a leaf in Mi or the LCA of two
or more leaves in Mi
Lemma
For each node j in Ti there exists a relevant node j′ such that
PSi,j′ ≤ PSi,j
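The definition of a relevant node can be turned into a simple test: an internal node is the LCA of two or more hits exactly when at least two of its children have hits below them. The sketch below applies that test in one traversal of Ti. It is only an illustration of the definition, assuming a `children` dictionary as input, and it does not reproduce the O(|Mi|) method of the theorem.

```python
def relevant_nodes(children, root, hits):
    """Nodes that are hit leaves or the LCA of two or more hit leaves.

    children: dict mapping a node to the list of its children.
    hits: set of matching leaves under `root`.
    """
    relevant = set()

    def count_hits(n):
        kids = children.get(n, [])
        if not kids:
            if n in hits:
                relevant.add(n)
                return 1
            return 0
        counts = [count_hits(c) for c in kids]
        # Hits under at least two different children: n is their LCA.
        if sum(1 for c in counts if c > 0) >= 2:
            relevant.add(n)
        return sum(counts)

    count_hits(root)
    return relevant


# Toy subtree: root -> A, B; A -> a1, a2; B -> b1, b2, b3.
children = {"root": ["A", "B"], "A": ["a1", "a2"], "B": ["b1", "b2", "b3"]}
relevant = relevant_nodes(children, "root", {"a1", "b1", "b2"})
print(sorted(relevant))
```

In this toy tree the node A is not relevant, since its only hit is a1; the relevant leaf a1 has the same true positives as A and fewer false positives, which is the situation the lemma describes.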
• TANGO http://www.lsi.upc.edu/~valiente/tango/
• TANGO http://tango.lsi.upc.edu/tango.php
[Figure: Input Samples 1 to 3, Output Samples 1 to 3, and the Output Taxonomy Table]
[Figure: TANGO workflow. Inputs: Reference Taxonomies, Initial Mappings, Reference Sequences, Sequence Reads. Steps: contraction, mapping, equalizing, relabeling, assignment. Intermediate and final results: Contracted Taxonomies, Sequence Matches, Taxonomy Correspondences, Relabeled Matches, Assigned Reads]
Reference taxonomies are contracted to the seven taxonomic ranks
usually used to classify organisms
• Kingdom
• Phylum
• Class
• Order
• Family
• Genus
• Species
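Contraction to these seven ranks can be sketched as a filter over a lineage. The sketch assumes that a lineage is a list of (rank, name) pairs ordered from the root downwards; that input format, the function name, and the sample lineage are illustrative and do not reflect TANGO's file formats.

```python
MAIN_RANKS = ("kingdom", "phylum", "class", "order",
              "family", "genus", "species")


def contract(lineage):
    """Keep only the nodes whose rank is one of the seven main ranks."""
    return [(rank, name) for rank, name in lineage
            if rank.lower() in MAIN_RANKS]


# Illustrative lineage containing two intermediate ranks.
lineage = [
    ("phylum", "Chordata"),
    ("class", "Mammalia"),
    ("order", "Primates"),
    ("suborder", "Haplorrhini"),
    ("superfamily", "Hominoidea"),
    ("family", "Hominidae"),
]
contracted = contract(lineage)
print(contracted)
```

Nodes at intermediate ranks are dropped, so each remaining node becomes a child of its nearest retained ancestor.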