Pres Suffix Trees

What about the trees of
the Mississippi?
Suffix Trees explained in an algorithm
for indexing large biological sequences
Jacob Kleerekoper & Marjolijn Elsinga
Overview
Suffix
Suffix array
Suffix tree
Suffix links tree
Demo
Suffix
Suffices of mississippi:
1 mississippi
11
i
2 ississippi 8
ippi
3 ssissippi 5
issippi
4
sissippi
2
ississippi
5 issippi
sort alphabetically 1
6
ssippi 10
pi
7 sippi
9
ppi
8 ippi 7
sippi
9 ppi 4
sissippi
10
pi
6
ssippi
11
i
3
ssissippi
mississippi
Suffix array
mississippi
Suffix
11
10
Index
10
Search in suffix array

Idea: two binary searches
- search for leftmost position of X
- search for rightmost position of X
In between are all suffices that begin with X

Search for leftmost occurrence of is
mississippi
Found leftmost
ippi
issippi
pi
Suffix
11
10
Index
10
more occurrences of is left of this one possible!

Search for rightmost occurrence of is
mississippi
issippi
Found rightmost
ississippi mississippipi
Suffix
11
10
Index
10
more occurrences of is right of this one possible!
Result search in suffix array

Leftmost occurrence of is: 5 at index 2
Rightmost occurrence of is: 2 at index 3
is can be found at [2..3] in the suffix array
Tree & Trie

Suffix tree is a compressed digital (suffix)
trie
Suffix tree definition

A suffix tree is a rooted directed tree with m
leaves, where m is the length S (the database
string)
For any leaf i, the concatenation of the edgelabels on the path from the root to leaf i
exactly spells out the suffix of S that starts at
position i
Suffix tree building
i
p
p
i
i
p
i
p
p
i
i
p
Suffices of mississippi:
m
1
mississippi
i
2
ississippi
s
p
3
ssissippi
s
i
4
sissippi
i
5
issippi
s
6
ssippi
s
7
sippi
i
8
ippi
p
9
ppi
p
10 pi
i
11 i
root
p
i
Result suffix tree building

p
root
i
p
p
i
si
s
i
10
ssi
11
s
s
Questions
Adriano: How is the tree created for ANA$?
Answer: Tree creation ANA$

1 root
2 ANA$
3 NA$
4 A$
5$
root
ANA$
A
NA$
2
NA$
2
$
3
$
4
Implicit vs. explicit

Trees in which a special end symbol is used are
called explicit
Searching in this trees can only be stopped at this
end symbol, which is always in a leaf
A search in a implicit tree can stop at any internal
or external node, at the last matching symbol
Question
Peter: How does this method search for
homologous sequences as is done in
BLAST and CAFE?
Searching in a suffix tree

p
root
i
p
10
i
si
p
p
p
i
9
ssi
11
s
i
issi
2 ississippi
s
5 issippi i
Time analysis of suffix tree

Building a suffix tree can be done in O(k) where k
is the length of the database string
Searching a suffix tree can be done in O(n) where
n is the length of the query string
(Note: only in Ukkonens implementation)
Question
Laurence: Can you explain the suffix links
tree?
Suffix links
A necessary implementation trick to achieve
a linear time and space bound during
building the tree
A suffix link is: a pointer from an internal
node xS to another internal node S where
x is a arbitrary character and S is a
possibly empty substring
xS
Suffix linked tree

root
$
9
AC
7
AC
AC
$
$
5
AC
AC
$
$
AC$
4
AC$
1
Question
Ingmar: Why is the memory bottleneck a
problem, and how is it solved with the use
of suffix links?
Answer: we interpreted the article in such
way that the suffix links cause the memory
bottleneck and not the other way around
Question
Lee: How can suffix links cause the memory
bottleneck and why is its reliance on virtual memory
impractical?
Answer: Suffix links are designed to take you from
one region of the tree to another. It could be
possible, because of the size of the tree, that the
region pointed to is not in memory available. The
same holds for virtual memory.
Question
Bram: Why do we need random access of
the memory?
Answer: a tree is based on pointers, these
are not sequentially inserted into the
memory, so random access is necessary
Question
Bogdan: How does this index cope with partial
matches, gapped alignments and so forth, or
is it just used for exact matches, which usually
dont help a lot?
Answer: Your intuition is correct here. Suffix
trees as described in the article can only be
used for exact (local) matches
Question
Lee: Can this method be used for protein data as
well / can this method also be used for similar
matches?
Answer: Suffix trees probably can be used for
protein data, but it is not possible to implement
wildcards or the fact that amino acids are
evolutionary related, but do not match exactly in
some cases.
Question
Peter: Why is it a problem that DNA cannot
be broken into words, and why doesnt it
use the overlapping intervals as in CAFE?
Answer: the begin and end of a base string
cannot be determined. Suffices are a
special kind of overlapping intervals.
Question
Bogdan: Why do we have to change the index for
each search instead of building the index once
and update it when the database is changed?
Answer: the index mentioned is the BLAST index
and in BLAST the index has to be updated for
every search. It has not much to do with suffix
trees.
Question
Adriano: What is the meaning of "cold store" and
"warm store"?
Answer: We think that cold store means that not
the entire database is available in the memory
and in the case of warm store the used part of
the database is in the physical memory. This can
be concluded from the fact that in warm store
only short queries are run.
Question
Bogdan: What is the checkpointing which is done?
Answer: Checkpointing is the process of associating a
resource with one or more registry keys so that when the
resource is moved to a new node, the required keys are
propagated to the local registry on the new node.
We think that the checkpointing is used to first build a
portion of the tree in the memory and then put the
finished (checkpointed) portion onto the disk
Demo
Ukkonens linear time suffix tree algorithm
(on-line available at:
http://www.i.kyushu-u.ac.jp/~takeda/PM/SuffixTree/STreeDemo.html)

Pres Suffix Trees

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Pres Suffix Trees

Uploaded by

Copyright:

Available Formats

What about the trees of

Jacob Kleerekoper & Marjolijn Elsinga

Search in suffix array

Search in suffix array

more occurrences of is left of this one possible!

Search in suffix array

more occurrences of is right of this one possible!

Result search in suffix array

Tree & Trie

Suffix tree definition

Suffix tree building

Result suffix tree building

Answer: Tree creation ANA$

Implicit vs. explicit

Searching in a suffix tree

Time analysis of suffix tree

Suffix linked tree

You might also like