
Bioinformatics Report

John Erol Evangelista

Pattern Matching
Problem: Given a string of n characters called the text and a string of m characters (m ≤ n) called the pattern, find a substring of the text that matches the pattern.

Brute Force

A brute-force algorithm for the string-matching problem is quite obvious: align the pattern against the first m characters of the text and start matching the corresponding pairs of characters from left to right until either all m pairs of characters match (then the algorithm can stop) or a mismatching pair is encountered. In the latter case, shift the pattern one position to the right and resume the character comparisons, starting again with the first character of the pattern and its counterpart in the text. If matches other than the first one need to be found, the algorithm can simply continue working until the entire text is exhausted. Note that the last position in the text that can still be a beginning of a matching substring is n - m (provided the text's positions are indexed from 0 to n - 1). Beyond that position, there are not enough characters to match the entire pattern; hence, the algorithm need not make any comparisons there.

ALGORITHM BruteForceStringMatch(T[0..n-1], P[0..m-1])
//Implements brute-force string matching
//Input: An array T[0..n-1] of n characters representing a text and
//       an array P[0..m-1] of m characters representing a pattern
//Output: The index of the first character in the text that starts a
//        matching substring, or -1 if the search is unsuccessful
for i ← 0 to n - m do
    j ← 0
    while j < m and P[j] = T[i + j] do
        j ← j + 1
    if j = m return i
return -1

An operation of the algorithm is illustrated in Figure 3.3. Note that for this example, the algorithm shifts the pattern almost always after a single character comparison. However, the worst case is much worse: the algorithm may have to make all m comparisons before shifting the pattern, and this can happen for each of the n - m + 1 tries.

Brute Force
FIGURE 3.3 Example of brute-force string matching for the text NOBODY NOTICED HIM and the pattern NOT. (The pattern's characters that are compared with their text counterparts are in bold type.)


Analysis

Worst case: O(nm)
Average case: O(n + m) = O(n)
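To make the analysis concrete, here is a minimal Python sketch of the pseudocode above; the function name and the test strings are illustrative, not from the original.

def brute_force_string_match(text, pattern):
    # Return the index of the first occurrence of pattern in text, or -1.
    n, m = len(text), len(pattern)
    for i in range(n - m + 1):              # n - m is the last feasible start
        j = 0
        while j < m and pattern[j] == text[i + j]:
            j += 1
        if j == m:                          # all m characters matched
            return i
    return -1

print(brute_force_string_match("NOBODY NOTICED HIM", "NOT"))  # 7
print(brute_force_string_match("A" * 20 + "B", "AAAB"))       # 17

On inputs like the second one, almost every alignment makes close to m comparisons before a mismatch, which is where the O(nm) worst case comes from.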

Combinatorial Pattern Matching

Finds the exact or approximate occurrences of a given pattern in a long text. The key idea is to preprocess the text into efficient data structures.

Suffix Trees
A suffix tree for text t = t1t2t3...tn is a labeled tree with n leaves (numbered 1 to n) satisfying the following conditions:
- Each edge is labeled with a substring of the text.
- Each internal vertex (except possibly the root) has at least two children.
- Any two edges out of the same vertex start with a different letter.
- Every suffix of text t is spelled out in a path from the root to a leaf.
- No suffix is a prefix of another suffix.

Keyword Tree vs. Suffix Tree


Figure 9.5 The difference between a keyword tree and a suffix tree for the string ATCATG. The suffix starting at position i corresponds to the leaf labeled by i.

Building suffix trees

Naive construction can take up to quadratic time. Weiner's algorithm achieves O(n) running time.

Pattern Matching on Suffix Trees

Problem: Given a pattern p, find the exact occurrences of it in a long text t.

Why suffix trees?

Suffix trees allow one to preprocess a text in such a way that, given any pattern of length m, one can answer whether or not it occurs in the text in O(m) time, regardless of how long the text is.

SuffixTreePatternMatching

SuffixTreePatternMatching(p, t)
    Build the suffix tree for text t
    Thread pattern p through the suffix tree
    if threading is complete
        output positions of every p-matching leaf in the tree
    else
        output "pattern does not appear anywhere in the text"

Threading
Figure 9.6 Threading the pattern ATG through the suffix tree for the text ATGCATACATGG. The suffixes ATGCATACATGG and ATGG both match, as noted by the gray vertices in the tree (the p-matching leaves). Each p-matching leaf corresponds to a position in the text where p occurs.

Analysis
Threading takes O(m) time, where m is the length of the pattern. Combined with the construction of the suffix tree, the total running time is O(n + m).
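The sketch below illustrates the idea in Python. For clarity it builds an uncompressed suffix trie (one character per edge) with the naive quadratic construction, rather than a true linear-space suffix tree via Weiner's algorithm; all function names are illustrative.

def build_suffix_trie(text):
    # Naive O(n^2) construction: insert every suffix, character by character.
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
        node.setdefault("$", []).append(i)   # "$" marks the end of suffix i
    return root

def collect_leaves(node):
    # Start positions of all suffixes in this subtree (the p-matching leaves).
    positions = list(node.get("$", []))
    for ch, child in node.items():
        if ch != "$":
            positions.extend(collect_leaves(child))
    return positions

def thread(pattern, text):
    # Thread the pattern from the root; report every position where it occurs.
    node = build_suffix_trie(text)
    for ch in pattern:
        if ch not in node:
            return []                        # threading fails: no occurrence
        node = node[ch]
    return sorted(collect_leaves(node))

print(thread("ATG", "ATGCATACATGG"))  # [0, 8], as in Figure 9.6

Threading itself touches at most m nodes; the extra work here is collecting the leaves below the last node, which a compressed suffix tree keeps proportional to the number of occurrences.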

Hashing
Hashing is based on the idea of distributing keys among a one-dimensional array H[0..m-1] called a hash table. A hash function assigns to each key an integer between 0 and m-1, called the hash address.

Hash Functions
Must satisfy two requirements:
- A hash function needs to distribute keys among the cells of a hash table as evenly as possible.
- A hash function must be easy to compute.

Collisions
The phenomenon of two (or more) keys being hashed to the same cell of a hash table. Worst case: all keys are hashed to the same hash address.

Open Hashing (Separate Chaining)

Every hashing scheme must have a collision resolution mechanism. This mechanism is different in the two principal versions of hashing: open hashing (also called separate chaining) and closed hashing (also called open addressing).

In open hashing, keys are stored in linked lists attached to cells of a hash table. Each list contains all the keys hashed to its cell. Consider, as an example, the following list of words: A, FOOL, AND, HIS, MONEY, ARE, SOON, PARTED.

As a hash function, we will use the simple function for strings mentioned above: we add the positions of a word's letters in the alphabet and compute the sum's remainder after division by 13. We start with the empty table. The first key is the word A; its hash value is h(A) = 1 mod 13 = 1. The second key, the word FOOL, is installed in the ninth cell (since (6 + 15 + 15 + 12) mod 13 = 9), and so on. The final result of this process is shown in Figure 7.5; note the collision of the keys ARE and SOON (because h(ARE) = (1 + 18 + 5) mod 13 = 11 and h(SOON) = (19 + 15 + 15 + 14) mod 13 = 11).
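A compact Python sketch of this construction follows; the hash function and word list are the ones from the example, while function and variable names are illustrative.

def h(word, m=13):
    # Sum of the letters' positions in the alphabet, modulo the table size.
    return sum(ord(c) - ord('A') + 1 for c in word) % m

table = [[] for _ in range(13)]              # one (initially empty) chain per cell
for key in ["A", "FOOL", "AND", "HIS", "MONEY", "ARE", "SOON", "PARTED"]:
    table[h(key)].append(key)                # insertion at the end of the chain

def search(key):
    return key in table[h(key)]              # traverse only the chain at h(key)

print(table[11])      # ['ARE', 'SOON'] -- the collision noted above
print(search("KID"))  # False: KID is compared with ARE, then SOON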

FIGURE 7.5 Example of a hash table construction with separate chaining: A is in cell 1, AND in cell 6, MONEY in cell 7, FOOL in cell 9, HIS in cell 10, ARE and SOON chained in cell 11, and PARTED in cell 12.

How do we search in a dictionary implemented as such a table of linked lists? We simply apply to the search key the same procedure that was used for creating the table. To illustrate, if we want to search for the key KID in the hash table of Figure 7.5, we first compute the value of the same hash function for the key: h(KID) = 11. Since the list attached to cell 11 is not empty, its linked list may contain the search key. But because of possible collisions, we cannot tell whether this is the case until we traverse this linked list. After comparing the string KID first with the string ARE and then with the string SOON, we end up with an unsuccessful search.

Analysis

Efficiency depends on the lengths of the linked lists, which, in turn, depend on the dictionary and table sizes, as well as the quality of the hash function. If the hash function distributes n keys among m cells of the hash table about evenly, each list will be about n/m keys long. The ratio α = n/m, called the load factor of the hash table, plays a crucial role in the efficiency of hashing. In particular, the average number of pointers (chain links) inspected in a successful search S and an unsuccessful search U are, respectively (under the standard assumption of searching for a randomly selected element):

S ≈ 1 + α/2   and   U = α.   (7.4)

Insertion and Deletion


Insertion: normally done at the end of the list.
Deletion: search for the key to be deleted and delete it from its list.

Closed Hashing

All keys are stored in the hash table itself, without the use of linked lists. The simplest collision resolution strategy is linear probing.

Linear Probing
If a collision occurs, check the cell next to the cell where the collision occurred. If it is empty, the new key is placed there; otherwise, the next cell is checked, and so on. If the end of the hash table is reached, the search wraps to the beginning; that is, the table is treated as a circular array.

This method is illustrated in Figure 7.6 by applying it to the same list of words used above to illustrate separate chaining (with the same hash function). To search for a given key K, we start by computing h(K), where h is the hash function used in the table's construction. If cell h(K) is empty, the search is unsuccessful. If the cell is not empty, we must compare K with the cell's occupant: if they are equal, we have found a matching key; if they are not, we compare K with the key in the next cell and continue in this manner until we encounter either a matching key (a successful search) or an empty cell (an unsuccessful search).


FIGURE 7.6 Example of a hash table construction with linear probing. In the final state, A occupies cell 1, AND cell 6, MONEY cell 7, FOOL cell 9, HIS cell 10, ARE cell 11, SOON cell 12, and PARTED wraps around to cell 0.

Analysis
Insertion and search are pretty much straightforward. For example, if we search for the word LIT in the table of Figure 7.6, we get h(LIT) = (12 + 9 + 20) mod 13 = 2 and, since cell 2 is empty, we can stop immediately. However, if we search for KID with h(KID) = (11 + 9 + 4) mod 13 = 11, we have to compare KID with ARE, SOON, PARTED, and A before we can declare the search unsuccessful.

Deletion can be tricky. For example, if we simply delete the key ARE from the last state of the hash table in Figure 7.6, we will be unable to find the key SOON afterward: after computing h(SOON) = 11, the algorithm would find this location empty and report an unsuccessful search. A simple solution is lazy deletion, that is, marking previously occupied locations with a special symbol to distinguish them from locations that have never been occupied.

The mathematical analysis of linear probing is a much more difficult problem than that of separate chaining. The simplified versions of these results state that the average number of times the algorithm must access a hash table with load factor α in successful and unsuccessful searches is, respectively:
S ≈ (1/2)(1 + 1/(1 - α))   and   U ≈ (1/2)(1 + 1/(1 - α)²)   (7.5)

(The accuracy of these approximations increases with larger sizes of the hash table.)
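Below is a Python sketch of linear probing with lazy deletion, reusing the letter-position hash function from the separate-chaining example; the DELETED sentinel and all names are illustrative assumptions. The insert loop assumes the table is not full.

EMPTY, DELETED = None, "<deleted>"           # lazy-deletion marker (illustrative)
m = 13
table = [EMPTY] * m

def h(word):
    return sum(ord(c) - ord('A') + 1 for c in word) % m

def insert(key):
    i = h(key)
    while table[i] not in (EMPTY, DELETED):  # probe successive cells,
        i = (i + 1) % m                      # treating the table as circular
    table[i] = key

def search(key):
    i = h(key)
    while table[i] is not EMPTY:             # a DELETED marker must not stop the scan
        if table[i] == key:
            return i
        i = (i + 1) % m
    return -1                                # reached a never-occupied cell

for w in ["A", "FOOL", "AND", "HIS", "MONEY", "ARE", "SOON", "PARTED"]:
    insert(w)
print(search("KID"))    # -1, after comparing with ARE, SOON, PARTED, and A
table[search("ARE")] = DELETED               # lazily delete ARE from cell 11
print(search("SOON"))   # 12: still found, because the scan skips the marker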

Clustering

A cluster in linear probing is a sequence of contiguously occupied cells.

Double Hashing

Clusters make linear probing degrade: as a cluster grows, the probability that a new element is attached to the cluster increases; in addition, large clusters increase the probability that two clusters will coalesce after a new key's insertion, causing even more clustering. Several other collision resolution strategies have been suggested to alleviate this problem. One of the most important is double hashing. Under this scheme, we use another hash function, s(K), to determine a fixed increment for the probing sequence to be used after a collision at location l = h(K):

(l + s(K)) mod m, (l + 2s(K)) mod m, ...   (7.6)

For every location in the table to be probed with sequence (7.6), the increment s(K) and the table size m must be relatively prime, i.e., their only common divisor must be 1. (This condition is satisfied automatically if m itself is prime.) One function recommended in the literature is s(k) = m - 2 - k mod (m - 2).
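A Python sketch of probe sequence (7.6) follows; integer keys and the specific functions are illustrative assumptions, with the secondary function taking the s(k) = m - 2 - k mod (m - 2) form mentioned above.

m = 13                                   # a prime table size makes gcd(s(K), m) = 1 automatic

def h(k):
    return k % m                         # primary hash: home location l

def s(k):
    return m - 2 - k % (m - 2)           # secondary hash: fixed probe increment, never 0

def probe_sequence(k):
    # Cells examined for key k: l, (l + s(K)) mod m, (l + 2s(K)) mod m, ...
    l, step = h(k), s(k)
    return [(l + j * step) % m for j in range(m)]

print(probe_sequence(26))  # visits all 13 cells, each exactly once

Because the increment is relatively prime to m, the sequence visits every cell before repeating, so an empty cell is always found if one exists.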

References
Jones, Neil C., and Pavel A. Pevzner. An Introduction to Bioinformatics Algorithms. MIT Press, 2004.
Levitin, Anany. Introduction to the Design and Analysis of Algorithms. 2nd ed. Addison-Wesley, 2007.
