Boyer Moore Algorithm: Idan Szpektor

Boyer Moore Algorithm
Idan Szpektor
Boyer and Moore
What It’s About
 A String Matching Algorithm
 Preprocess a Pattern P (|P| = n)
 For a text T (| T| = m), find all of the

occurrences of P in T
 Time complexity: O(n + m), but usually sub-

linear
Right to Left (like in Hebrew)
 Matching the pattern from right to left
 For a pattern abc:

↓
T: bbacdcbaabcddcdaddaaabcbcb
P: abc
 Worst case is still O(n m)

The Bad Character Rule (BCR)
 On a mismatch between the pattern and the

text, we can shift the pattern by more than
one place.
Sublinearity!
ddbbacdcbaabcddcdaddaaabcbcb
acabc
↑
BCR Preprocessing
 A table, for each position in the pattern and a

character, the size of the shift. O(n |Σ|) space. O(1)
access time.
1 2 3 4 5
a b a c b:
a 1 1 3 3 3
1 2 3 4 5
b 2 2 2 5
c 4 4
 A list of positions for each character. O(n + |Σ|)

space. O(n) access time, But in total O(m).
BCR - Summary
 On a mismatch, shift the pattern to the right

until the first occurrence of the mismatched
char in P.
 Still O(n m) worst case running time:
T: aaaaaaaaaaaaaaaaaaaaaaaaa
P: abaaaa
The Good Suffix Rule (GSR)
 We want to use the knowledge of the

matched characters in the pattern’s suffix.
 If we matched S characters in T, what is (if

exists) the smallest shift in P that will align a
sub-string of P of the same S characters ?
GSR (Cont…)
 Example 1 – how much to move:
↓
P: cabbabdbab
cabbabdbab
GSR (Cont…)
 Example 2 – what if there is no alignment:
↓
T: bbacdcbaabcbbabdbabcaabcbcb
P: bcbbabdbabc
bcbbabdbabc
GSR - Detailed
 We mark the matched sub-string in T with t
and the mismatched char with x
1. In case of a mismatch: shift right until the

first occurrence of t in P such that the next
char y in P holds y≠x
2. Otherwise, shift right to the largest prefix of

P that aligns with a suffix of t.
Boyer Moore Algorithm
 Preprocess(P)
 k := n
while (k ≤ m) do
 Match P and T from right to left starting at k
 If a mismatch occurs: shift P right (advance k)

by max(good suffix rule, bad char rule).
 else, print the occurrence and shift P right

(advance k) by the good suffix rule.
Algorithm Correctness
 The bad character rule shift never misses a

match
 The good suffix rule shift never misses a

match
Preprocessing the GSR – L(i)
 L(i) – The biggest index j, such that j < n and

prefix P[1..j] contains suffix P[i..n] as a suffix
but not suffix P[i-1..n]
1 2 3 4 5 6 7 8 9 10 11 12 13
P: b b a b b a a b b c a b b
L: 0 0 0 0 0 0 0 0 0 5 9 0 12
Preprocessing the GSR – l(i)
 l(i) – The length of the longest suffix of P[i..n]

that is also a prefix of P
P: b b a b b a a b b c a b b
l: 2 2 2 2 2 2 2 2 2 2 2 1
Using L(i) and l(i) in GSR
 If mismatch occurs at position n, shift P by 1
 If a mismatch occurs at position i-1 in P:

 If L(i) > 0, shift P by n – L(i)
 else shift P by n – l(i)
 If P was found, shift P by n – l(2)

Building L(i) and l(i) – the Z
 For a string s, Z(i) is the length of the longest
sub-string of s starting at i that matches a
prefix of s.
s: b b a c d c b b a a b b c d d
Z: 1 0 0 0 0 3 1 0 0 2 1 0 0 0
 Naively, we can build Z in O(n^2)

From Z to N
 N(i) is the longest suffix of P[1..i] that is also a

suffix of P.
 N(i) is Z(i), built over P reversed.
s: d d c b b a a b b c d c a b b
N: 0 0 0 1 2 0 0 1 3 0 0 0 0 1
Building L(i) in O(n)
 L(i) – The biggest index j < n, such that prefix
P[1..j] contains suffix P[i..n] as a suffix but not
suffix P[i-1..n]
 L(i) – The biggest index j < n such that:

N(j) == | P[i..n] | == n – i + 1
 for i := 1 to n, L(i) := 0
 for j := 1 to n-1
 i := n – N(j) + 1
 L(i) := j
Building l(i) in O(n)
 l(i) – The length of the longest suffix of P[i..n]
that is also a prefix of P
 l(i) – The biggest j <= | P[i..n] | == n – i + 1

such that N(j) == j
 k := 0
 for j := 1 to n-1
 If(N(j) == j), k := j
 l(n – j + 1) := k
Building Z in O(n)
 For calculating Z(i), we want to use the

previously calculated Z(1)…Z(i-1)
 For each I we remember the right most Z(j):

j, such that j < i and j + Z(j) >= k + Z(k), for all
k<i
Building Z in O(n) (Cont…)
↑ ↑ ↑ ↑
 S i’ j i
 If i < j + Z(j), s[i … j + Z(j) - 1] appeared previously,

starting at i’ = i – j + 1.
 Z(i’) < Z(j) – (i - j) ?
Building Z in O(n) (Cont…)
 For Z(2) calculate explicitly
 j := 2, i := 3
 While i <= |s|:
 if i >= j + Z(j), calculate Z(i) explicitly
 else
 Z(i) := Z(i’)
 If Z(i’) >= Z(j) – (i - j), calculate Z(i) tail
explicitly
 If j + Z(j) < i + Z(i), j := i
Building Z in O(n) - Analysis
 The algorithm builds Z correctly
 The algorithm executes in O(n)

 A new character is matched only once
 All other operations are in O(1)
Boyer Moore Worst Case Analysis
 Assume P consists of n copies of a single
char and T consists of m copies of the same
char:
T: aaaaaaaaaaaaaaaaaaaaaaaaa
P: aaaaaa
 Boyer Moore Algorithm runs in Θ(m n) when

finding all the matches
The Galil Rule
 In a specific matching phase, We mark with k
the position in T of the right end of P. We
mark with s the position of last matched char
in this phase.
s k k’
P: abaab
abaab
The Galil Rule (Cont…)
 All the chars in position s < j ≤ k are known to
be matching. The algorithm doesn’t need to
check them.
 An extended Boyer Moore algorithm with the

Galil rule runs in O(m + n) worst case (even
without the bad-character rule).
Don’t Sleep Yet…
O(n + m) proof - Outline
 Preprocess in O(n) – already proved
1. Properties of strings
2. Proof of search in O(m) if P is not in T, using
only the good suffix rule.
3. Proof of search in O(m) even if P is in T,
adding the Galil rule.
Properties of Strings
 If for two strings δ, γ: δγ = γδ then there is a
string β such that δ = βi and γ = βj, i, j > 0
- Proof by induction
 Definition: A string s is semiperiodic with

period β if s consists of a non-empty suffix of
β (possibly the entire β) followed by one or
more complete copies of β.
β’ β β β
Properties of Strings (Cont…)
 A string is prefix semiperiodic if it contains

one or more complete copies of β followed by
a non-empty prefix of β.
 A string is prefix semiperiodic iff it is

semiperiodic with the same length period
Lemma 1
 Suppose P occurs in T starting at position p
and also at position q, q > p. If q – p ≤  n/2 
then P is semiperiodic with period
α = P[n-(q-p)+1…n]
p
α’ α α α
q
α’ α α α
Proof - when P is Not Found in T
 We have R rounds during the search.
 After each round the good suffix rule decides

on a right shift of si chars.
 Σsi ≤ m
 We shall use Σsi as an upper bound.

Proof (Cont…)
 For each round we count the matched chars
by:
 fi – the number of chars matched for the first
time
 gi –the number of chars already matched in
previous rounds.
 Σfi = m
 We want to prove that gi ≤ 3si ( Σgi ≤ 3m).
Proof (Cont…)
 Each round don’t find P  it matched a
substring ti and one bad char xi in T (xiti  T)
T: bbacdcbaabcbbabdbabcaabcbcb
P: bdbabc
 |ti|+1 ≤ 3si  gi ≤ 3si (because gi + fi = |ti|+1)

 For the rest of the proof we assume that for
the specific round i: |ti| + 1 > 3si
Lemma 2 (|ti| + 1 > 3si)
 In round i we look at the matched suffix of P,
marked P*. P* = yi ti, yi≠ xi.
 Both P* and ti are semiperiodic with period α

of length si and hence with minimal length
period β, α = βk.
 Proof: by Lemma 1.
Lemma 3 (|ti| + 1 > 3si)
 Suppose P overlapped ti during round i. We
shall examine in what ways could P overlap ti
in previous rounds.
 In any round h < i, the right end of P could not

have been aligned with the right end of any
full copy of β in ti.
- proof:
 Both round h and i fail at char xi
 two cases of possible shift after round h are invalid
Lemma 4 (|ti| + 1 > 3si)
 In round h < i, P can correctly match at most |
β|-1 chars in ti.

 By Lemma 3, P is not aligned with a right end of ti in phase h.
 Thus if it matched |β| chars or more there is a suffix γ of β
followed by a prefix δ of β such that δ γ = γ δ.
 By the string properties there is a substring μ such that β = μk,
k>1.
 This contradicts the minimal period size property of β.
Lemma 5 (|ti| + 1 > 3si)
 If in round h < i the right end of P is aligned

with a char in ti, it can only be aligned with
one of the following:
 One of the left-most |β|-1 chars of ti
 One of the right-most |β| chars of ti

-proof:
 If not, By Lemma 3,4, max |β|-1 chars are matched and only
from the middle of a β copy, while there are at least |β|
 A shift cannot pass the right end of that β copy
Proof (Cont…)
 If |ti| + 1 > 3si then gi ≤ 3si


 Using Lemma 5, in previous rounds we could match only the
bad char xi, the last |β|-1 chars in ti or start from the first |β| right
chars in ti.
 In the last case, using Lemma 4, we can only match up to |β|-1
chars
 in total we could previously match:
gi = 1 + |β|-1 + (|β| + |β|-1) ≤ 3|β| ≤ 3si
Proof - Final
 Number of matches = ∑(fi + gi) =

∑fi + ∑gi ≤ m + ∑3si ≤ m + 3m = 4m
Proof - when P is Found in T
 Split the rounds to two groups:

 “match” rounds –an occurrence of P in T was
found.
 “mismatch” rounds –P was not found in T.
 we have proved O(m) for “mismatch” rounds.

Proof (Cont…)
 After P was found in T, P will be shifted by a
constant length s. (s = n – l(2)).
 |n| + 1 ≤ 3s 
∑ matches in round i ≤ ∑3s ≤ m
 For the rest of the proof we assume that:

|n| + 1 > 3s
Proof (|n| + 1 > 3s)
 By Lemma 1, P is semiperiodic with minimal
length period β, |β| = s.
 If round i+1 is also a “match” round then, by

the Galil rule, only the new |β| chars are
compared.
 A contiguous series of “match” rounds, i…i+k

is called a “run”.
Proof (|n| + 1 > 3s)
 ∑ The length of a “run”, not including chars

that where already matched in previous “runs”
≤ m
 How many chars in a “run” where already

matched in previous “runs”?
Lemma (|n| + 1 > 3s)
 Suppose k-1 was a “match” round and k is a
“mismatch” round that ends the “run”.
 If k’ > k is the first “match” round then it
overlaps at most |β|-1 chars with the previous
“run” (ended by round k-1).

 The left end of P at round k’ cannot be aligned with the left end
of a full copy of |β| at round k-1.
 As a result, P cannot overlap |β| chars or more with round k-1.
Proof (|n| + 1 > 3s)
 By the Lemma and because the shift after
every “match” round is of |β|, only the first
round of a “run” can overlap, and only with
the last previous “run”.

 ∑ The length of the chars that where already
matched in previous “runs” ≤ m
Proof (|n| + 1 > 3s) - Final
 ∑ The length of a “run” =
∑ The length of a “run”, not including chars
that where already matched in previous “runs”
+
∑ The length of the chars that where already
matched in previous “runs” ≤
m+m

Boyer Moore Algorithm: Idan Szpektor

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Boyer Moore Algorithm: Idan Szpektor

Uploaded by

Copyright:

Available Formats

Boyer Moore Algorithm

 Preprocess a Pattern P (|P| = n)

 For a text T (| T| = m), find all of the

 Time complexity: O(n + m), but usually sub-

 For a pattern abc:

 Worst case is still O(n m)

 On a mismatch between the pattern and the

 A table, for each position in the pattern and a

 A list of positions for each character. O(n + |Σ|)

 On a mismatch, shift the pattern to the right

 Still O(n m) worst case running time:

 We want to use the knowledge of the

 If we matched S characters in T, what is (if

1. In case of a mismatch: shift right until the

2. Otherwise, shift right to the largest prefix of

 If a mismatch occurs: shift P right (advance k)

 else, print the occurrence and shift P right

 The bad character rule shift never misses a

 The good suffix rule shift never misses a

 L(i) – The biggest index j, such that j < n and

 l(i) – The length of the longest suffix of P[i..n]

 If mismatch occurs at position n, shift P by 1

 If a mismatch occurs at position i-1 in P:

 If P was found, shift P by n – l(2)

 Naively, we can build Z in O(n^2)

 N(i) is the longest suffix of P[1..i] that is also a

 L(i) – The biggest index j < n such that:

 l(i) – The biggest j <= | P[i..n] | == n – i + 1

 For calculating Z(i), we want to use the

 For each I we remember the right most Z(j):

 If i < j + Z(j), s[i … j + Z(j) - 1] appeared previously,

 If Z(i’) >= Z(j) – (i - j), calculate Z(i) tail

 The algorithm builds Z correctly

 The algorithm executes in O(n)

 Boyer Moore Algorithm runs in Θ(m n) when

 An extended Boyer Moore algorithm with the

 Definition: A string s is semiperiodic with

 A string is prefix semiperiodic if it contains

 A string is prefix semiperiodic iff it is

 We have R rounds during the search.

 After each round the good suffix rule decides

 We shall use Σsi as an upper bound.

 |ti|+1 ≤ 3si  gi ≤ 3si (because gi + fi = |ti|+1)

 Both P* and ti are semiperiodic with period α

 In any round h < i, the right end of P could not

 If in round h < i the right end of P is aligned

 One of the right-most |β| chars of ti

 If |ti| + 1 > 3si then gi ≤ 3si

 Number of matches = ∑(fi + gi) =

 Split the rounds to two groups:

 we have proved O(m) for “mismatch” rounds.

 For the rest of the proof we assume that:

 If round i+1 is also a “match” round then, by

 A contiguous series of “match” rounds, i…i+k

 ∑ The length of a “run”, not including chars

 How many chars in a “run” where already

You might also like