Professional Documents
Culture Documents
Algorithms
F O U R T H
E D I T I O N
Knuth-Morris-Pratt
Boyer-Moore
Rabin-Karp
Algorithms
R OBERT S EDGEWICK | K EVIN W AYNE
http://algs4.cs.princeton.edu
Knuth-Morris-Pratt
Boyer-Moore
Rabin-Karp
Substring search
Goal. Find pattern of length M in a text of length N.
typically N >> M
pattern
text
match
Substring search
pattern
text
match
Substring search
pattern
text
match
Substring search
http://citp.princeton.edu/memory
5
pattern
text
match
Substring search
PROFITS
L0SE WE1GHT
herbal Viagra
There is no catch.
This is a one-time mailing.
This message is sent in compliance
ATTACK AT DAWN
substring search
machine
found
7
http://finance.yahoo.com/q?s=goog
...
<tr>
<td class= "yfnc_tablehead1"
width= "48%">
Last Trade:
</td>
<td class= "yfnc_tabledata1">
<big><b>452.92</b></big>
</td></tr>
<td class= "yfnc_tablehead1"
width= "48%">
Trade Time:
</td>
<td class= "yfnc_tabledata1">
...
Algorithms
R OBERT S EDGEWICK | K EVIN W AYNE
http://algs4.cs.princeton.edu
Knuth-Morris-Pratt
Boyer-Moore
Rabin-Karp
i+j
txt
0
1
2
2
0
1
2
1
3
3
4
5
0
1
0
3
5
5
9 10
B
A
R
B
A
A
R
B
A
entries in black
match the text
4 10
return i when j is M
pat
A
R
B
A
A
R
B
A
match
Brute-force substring search
11
i+j
10
i
0
1
2
3
4
5
ij
0
4
1
4
2
3
44
54
65
j i+j 0 0 1 1 22 33 44
i+j
5
5
6
6
77 88 9910
D A
txttxt
A A A B AA AC AA A
A BA RA AB C
pat
2
2
A B R A
pat
4
A A A A B
entries in red are
0
1
A B R A
5
A A A A B
mismatches
1
3
A B R A
6
A A A A B entries in gray are
0
3
A B R A
for reference only
7
A
A
A
A
1
5
A B R AB
entries in black
8
A RA AB
0
5
A B
match the text A A
10
A AA BA RA AB
4 10
return i when j is M
Brute-force substring search (worst case)
match
Brute-force substring search
Backup
In many applications, we want to avoid backup in text stream.
ATTACK AT DAWN
substring search
machine
found
mismatch
backup
10
explicit backup
15
Now is the time for all people to come to the aid of their party. Now is the time for all good people to
come to the aid of their party. Now is the time for many good people to come to the aid of their party.
Now is the time for all good people to come to the aid of their party. Now is the time for a lot of good
people to come to the aid of their party. Now is the time for all of the good people to come to the aid of
their party. Now is the time for all good people to come to the aid of their party. Now is the time for
each good person to come to the aid of their party. Now is the time for all good people to come to the aid
of their party. Now is the time for all good Republicans to come to the aid of their party. Now is the
time for all good people to come to the aid of their party. Now is the time for many or all good people to
come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now
is the time for all good Democrats to come to the aid of their party. Now is the time for all people to
come to the aid of their party. Now is the time for all good people to come to the aid of their party. Now
is the time for many good people to come to the aid of their party. Now is the time for all good people to
come to the aid of their party. Now is the time for a lot of good people to come to the aid of their
party. Now is the time for all of the good people to come to the aid of their party. Now is the time for
all good people to come to the aid of their attack at dawn party. Now is the time for each person to come
to the aid of their party. Now is the time for all good people to come to the aid of their party. Now is
the time for all good Republicans to come to the aid of their party. Now is the time for all good people
to come to the aid of their party. Now is the time for many or all good people to come to the aid of their
party. Now is the time for all good people to come to the aid of their party. Now is the time for all good
Democrats to come to the aid of their party.
16
Algorithms
R OBERT S EDGEWICK | K EVIN W AYNE
http://algs4.cs.princeton.edu
Knuth-Morris-Pratt
Boyer-Moore
Rabin-Karp
assuming { A, B } alphabet
text
after mismatch
on sixth char
brute-force backs
up to try this
A
B
A
A
B
A
A
A
B
A
A
A
A
B
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
and this
and this
and this
and this
but no backup
is needed
A A
pattern
A
A
A
A
A
j
j
0
1
pat.charAt(j)
pat.charAt(j) A
B A
A
1
1 B
dfa[][j]
dfa[][j] B
0
2
C
0
0
C
0
2A
A1
30
0
0
0
3
B
1
4
0
1
B
1
2
0
4
A
5
0
0
2
A5
3C
01
4
0
3
B
1
4
0
4
5
A
C
5 If in1state j reading char c:
0
4if j is 6 halt and accept
else move to state dfa[c][j]
0
6
graphical representation
A
B,C
B,C
A
C
01
21
B,C
C
A
A
B
C
B,C
A
B,C
B
A
A6
B,C
19
A A B A C A A B A B A C A A
pat.charAt(j)
dfa[][j]
A
B
C
0
A
1
0
0
1
B
1
2
0
2
A
3
0
0
3
B
1
4
0
4
A
5
0
0
5
C
1
4
6
B, C
A
C
B
A
B, C
C
B, C
20
A A B A C A A B A B A C A A
pat.charAt(j)
dfa[][j]
A
B
C
0
A
1
0
0
1
B
1
2
0
2
A
3
0
0
3
B
1
4
0
4
A
5
0
0
5
C
1
4
6
B, C
A
C
B, C
C
B, C
B
A
6
substring found
21
0
B
txt
1
C
2
B
3
A
4
A
5
B
6
A
7
C
8
A
0
A
pat
suffix of txt[0..6]
1
B
2
A
3
B
4
A
5
C
prefix of pat[]
B, C
A
C
B
A
B, C
C
B, C
22
no backup
Running time.
no backup
X
j
pat.charAt(j)
A
dfa[][j] B
C
1
B
1
2
0
B,C
0
A
1
0
0
A
C
2
A
3
0
0
3
B
1
4
0
4
A
5
0
0
5
C
1
4
6
A
B
2
B,C
3
C
B,C
24
pat.charAt(j)
dfa[][j]
0
A
1
B
2
A
3
B
4
A
5
C
A
B
C
25
pat.charAt(j)
dfa[][j]
A
B
C
0
A
1
0
0
1
B
1
2
0
2
A
3
0
0
3
B
1
4
0
4
A
5
0
0
5
C
1
4
6
B, C
A
C
B
A
B, C
C
B, C
26
pat.charAt(j)
dfa[][j]
0
A
1
B
2
A
3
B
4
A
5
C
A
B
C
27
pat.charAt(j)
dfa[][j]
A
B
C
0
A
1
1
B
2
A
3
3
B
4
A
5
5
C
4
6
28
dfa['B'][5] = 4
simulate BABA;
simulate BABA;
= dfa['A'][3]
j 0
pat.charAt(j) A
= dfa['B'][3]
1
B
2
A
3
B
4
A
5
C
B, C
A
C
simulation
of BABA
j
B
B, C
C
B, C
29
dfa['B'][5] = 4
X' = 0
from state X,
from state X,
from state X,
= dfa['A'][X]
= dfa['B'][X]
= dfa['C'][X]
0
A
1
B
2
A
3
B
4
A
5
C
B, C
A
C
j
B
B, C
C
B, C
30
pat.charAt(j)
dfa[][j]
0
A
1
B
2
A
3
B
4
A
5
C
A
B
C
31
pat.charAt(j)
dfa[][j]
A
B
C
0
A
1
0
0
1
B
1
2
0
2
A
3
0
0
3
B
1
4
0
4
A
5
0
0
5
C
1
4
6
B, C
A
C
B
A
B, C
C
B, C
32
0
1
2
3
4
5
A
B
Ain time
B
A and
C
dfa[][]
0
0
0
0
0
3
space proportional to R M.
space proportional to M.
graphical representation
34
Key words, pattern, string, text-editing, pattern-matching, trie memory, searching, period of a
35
Algorithms
R OBERT S EDGEWICK | K EVIN W AYNE
Knuth-Morris-Pratt
Boyer-Moore
Rabin-Karp
http://algs4.cs.princeton.edu
Robert Boyer
J. Strother Moore
j
text
5
11
15
5
4
0
return i
0
F
N
1
I
E
2
N
E
3
D
D
4
I
L
5
N
E
6
A
7 8
H A
pattern
E
9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
Y S T A C K N E E D L E I N A
L
E
N
L
N
E
E
= 15
37
before
txt
pat
after
txt
pat
mismatch character 'T' not in pattern: increment i one character beyond 'T'
38
before
txt
pat
after
txt
pat
mismatch character 'N' in pattern: align text 'N' with rightmost pattern 'N'
39
before
txt
pat
mismatch character 'E' in pattern: align text 'E' with rightmost pattern 'E' ?
40
before
txt
pat
after
txt
pat
-1
N
0
-1
B
C
D
E
...
L
M
N
...
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
1
-1
-1
-1
2
-1
-1
3
2
-1
-1
3
2
-1
-1
3
5
-1
-1
-1
-1
-1
0
-1
-1
0
-1
-1
0
-1
-1
0
4
-1
0
4
-1
0
E
1
-1
E
2
-1
D
3
-1
L
4
-1
E
5
-1
right[c]
-1
-1
-1
3
5
-1
4
-1
0
-1
42
public
{
int
int
int
for
{
}
return N;
}
43
Boyer-Moore: analysis
Property. Substring search with the Boyer-Moore mismatched character
heuristic takes about ~ N / M character compares to search for a pattern of
length M in a text of length N.
sublinear!
1
2
3
4
5
1
1
1
1
1
B
A
B
B
A
B
B
B
A
B
B
B
B
B
pat
B
B
B
B
A
B
B
B
B
Algorithms
R OBERT S EDGEWICK | K EVIN W AYNE
Knuth-Morris-Pratt
Boyer-Moore
Rabin-Karp
http://algs4.cs.princeton.edu
Michael Rabin
Dick Karp
pat.charAt(i)
0 1 2 3 4
2
% 997 = 613
0
3
1
1
2
4
3
1
4
5
txt.charAt(i)
5 6 7 8 9 10 11 12 13 14 15
9 2 6 5 3 5 8 9 7 9 3
% 997 = 508
% 997 = 201
% 997 = 715
% 997 = 971
% 997 = 442
% 997 = 929
1
2
3
4
5
6
return i
= 6
match
% 997 = 613
Modular arithmetic
Math trick. To keep numbers small, take intermediate results modulo Q.
Ex.
(mod 997)
= (30 + 535) * 3
(mod 997)
= 1695
(mod 997)
= 698
(mod 997)
pat.charAt()
0 1 2 3 4
2 6 5 3 5
% 997 = 2
Computing the hash value for the pattern with Horners method
48
+ ti+M
current
subtract
multiply
value leading digit by radix
i
...
current value 1
new value
add new
trailing digit
...
4
4
1
0
1
5
0
5
*
9
9
0
9
1
2
+
2
2
0
2
0
0
6
6
1
1
5
5
text
current value
subtract leading digit
multiply by radix
add new trailing digit
new value
49
0
3
1
1
% 997 = 3
match
% 997 = ((442 + 5*(997 - 30))*10 + 3) % 997 = 929
5
6
7
2
4
3
1
4
5
9
10
return i-M+1
= 6
5
9
6
2
7
6
8
5
9 10 11 12 13 14 15
3 5 8 9 7 9 3
Q
Horner's
rule
R
rolling
hash
//
//
//
//
//
RabinKarp(String pat) {
pat.length();
256;
longRandomPrime();
RM1 = 1;
for (int i = 1; i <= M-1; i++)
RM1 = (R * RM1) % Q;
patHash = hash(pat, M);
a large prime
(but avoid overflow)
precompute RM 1 (mod Q)
}
private long hash(String key, int M)
{ /* as before */ }
public int search(String txt)
{ /* see next slide */ }
}
51
Rabin-Karp analysis
Theory. If Q is a sufficiently large random prime (about M N 2),
then the probability of a false collision is about 1 / N.
Practice. Choose Q to be a large prime (but not so large to cause overflow).
Under reasonable assumptions, probability of a collision is about 1 / Q.
Monte Carlo version.
53
Extends to 2d patterns.
Extends to finding multiple patterns.
Disadvantages.
54
fingerprints
efficiently
and(Javas
compared.
to
implementcan
andbeworks
well incomputed
typical cases
indexOf() method in String uses
brute-force search); Knuth-Morris-Pratt is guaranteed linear-time with no backup in
Substring
search
cost
summary
Summary
The
table
the bottom
the page
summarizes
algorithms
we
the
input;
Boyer-Moore
isatsublinear
(by of
a factor
of M)
in typical the
situations;
and that
Rabinhave is
discussed
for substring
search. As is
often the might
case when
we time
have proportional
several algoKarp
linear. Each
also has drawbacks:
brute-force
require
rithms
the same task, eachand
of them
has attractive
features.
Brute
search ishas
easy
to
MN; for
Knuth-Morris-Pratt
Boyer-Moore
use extra
space;
and force
Rabin-Karp
a
Cost of searching
for
an
pattern
in as
anopposed
N-character
text.
to implement
well
in typical
cases
(Javas indexOf()
method
in String
uses
relatively
long and
innerworks
loopM-character
(several
arithmetic
operations,
to character
combrute-force
search);
Knuth-Morris-Pratt
is guaranteed
linear-timeinwith
no backup
pares
in the other
methods.
These characteristics
are summarized
the table
below. in
the input; Boyer-Moore is sublinear (by a factor of M) in typical situations; and RabinKarp is linear. Each also has drawbacks: brute-force
might require time proportional
operation count
backup
extra
algorithm
version
correct?
to MN;
Knuth-Morris-Pratt
and Boyer-Moore use extra space;
and
Rabin-Karp
has a
in input?
space
guarantee
typical
relatively long inner loop (several arithmetic operations, as opposed to character comparesbrute
in theforce
other methods. These
characteristics
summarized
table below.
full DFA
version5.6 )
(Algorithm
brute force
mismatch
only
transitions
Knuth-Morris-Pratt
DFA
fullfull
algorithm
(Algorithm 5.6 )
Knuth-Morris-Pratt
Boyer-Moore
mismatched char
mismatch
heuristic only
transitions only
(Algorithm 5.7 )
Boyer-Moore
Rabin-Karp
Rabin-Karp
full
algorithm
Monte
Carlo
(Algorithm 5.8 )
mismatched char
heuristic
only
Las Vegas
(Algorithm 5.7 )
count
2operation
N
1.1 N
guarantee
typical
backup
no
in input?
yes
correct?
3 NN
M
1.1
1.1 N
N
no
yes
yes
yes
M
1
32N
N
N1.1
/M
N
yes
no
yes
yes
R
MR
3N
M
N
N1.1
/M
no
yes
yes
yes
M
R
3N
7N
N/M
7N
yes
no
yes
yes
7MNN
N7 /NM
yes
no
yes
yes
1R
extra
MR
space
Monte Carlo
no
7N
7N
Cost
summary 5.8
for )substring-search implementations
(Algorithm
55