{sada,imai}@is.s.u-tokyo.ac.jp
sequences, therefore it is important to store them in memory for quick queries. The suffix array [10] is a compact data structure for on-line string searches. For such purposes, the suffix tree is used because it enables us to find the longest substring of a large text that matches the query string in linear time. However, the suffix tree requires huge memory and therefore it is impossible to use it for large texts.

Though the suffix tree can be constructed in linear time [11, 14, 9], the time complexity depends on the alphabet size. Searching time also depends on the alphabet size, and it becomes slow if the alphabet size is large. On the other hand, construction and searching time of the suffix array does not depend on the size of the alphabet. Therefore searching time of the suffix array may be comparable with that of the suffix tree though it has superlinear time complexity of $O(P \log n)$, where $P$ is the length of a query string and $n$ is the length of a large text.

1.2 Block Sorting data compression
The Block Sorting is a lossless data compression scheme [4]. Though its compression ratio is compa- [...]

[...] text of length $n$ by using a clever ordering of sorting and an integer encoding. Note that we assume the size of integers is 32 bits. We define a measure of difficulty of sorting suffixes: average match length (AML). Our algorithm requires much more memory than the Bentley-Sedgewick, which requires $5n$ bytes, but our algorithm is faster than other sorting algorithms when AML is large. It is faster than the suffix tree construction algorithm of Larsson [9] if the alphabet size is large, and it requires less than half of the memory for the suffix tree. When a text becomes long, many pairs of words appear repeatedly and the AML becomes large, therefore our algorithm is practical for large texts.

1.4 Definitions
Assume that we make a suffix array of a text $X = x[0..n-1] = x_0 x_1 \ldots x_{n-1}$. The size of alphabet $\Sigma$ is constant. We add a special symbol `\$' at the tail of the text $X$, that is, $x_n = \$$. It is not in the alphabet and is smaller than any other symbol in the alphabet. Suffixes of the text are represented by $S_i = x[i..n]$. The suffix array $I[0..n]$ is an array of indexes of suffixes $S_i$. All suffixes in the array are sorted in lexicographic
order. A suffix $S_i$ is lexicographically less than a suffix $S_j$ if $\exists\, l \ge 0$ such that $x[i..i+l-1] = x[j..j+l-1]$ and $x[i+l] < x[j+l]$.

We use another ordering $<_k$ defined by comparison of prefixes of length $k$ as follows.

$$S_i <_k S_j \iff \exists\, l,\ 0 \le l < k:\ x[i..i+l-1] = x[j..j+l-1] \text{ and } x[i+l] < x[j+l]$$
$$S_i =_k S_j \iff x[i..i+k-1] = x[j..j+k-1]$$
$$S_i >_k S_j \iff \exists\, l,\ 0 \le l < k:\ x[i..i+l-1] = x[j..j+l-1] \text{ and } x[i+l] > x[j+l]$$

2.1 Bentley-Sedgewick algorithm
This is a practical algorithm for sorting strings [3]. It is similar to the quick sort, but it recursively partitions pointers of strings into three parts. The pivot used to partition is the first symbol in a string, and all strings are partitioned into less than, equal to and greater than the symbol according to their first symbols. A key idea of this algorithm is to use the equal part. Strings in it have the same first symbol, therefore we can sort them without comparing the first symbols again.

The Bentley-Sedgewick algorithm is used in the free software bzip2 [13], an implementation of the Block Sorting.

2.2 Andersson-Nilsson algorithm (Forward Radix Sort)
This is a general algorithm for sorting strings using radix sort [1]. First all strings are sorted and split into groups according to their first symbols by using radix sort. Next all strings are moved to buckets according to their second symbols. The buckets are traversed in alphabetical order and the strings in them are returned to their own groups. Now all strings are sorted according to their first two symbols, therefore we split the groups and iterate these operations until all strings are sorted. This algorithm is simple, but it theoretically has good time complexity.

2.3 Karp-Miller-Rosenberg (KMR) algorithm
This is an algorithm for finding repeated patterns in a string [8] and it can be used for sorting suffixes. [...] according to their first symbols and given numbers [...] groups have different numbers. Next, suffixes of each group are split into subgroups according to the $V[i]$ of suffixes. As a result, all suffixes are split according to their first two symbols. This operation continues until all suffixes belong to groups of size one. If groups are split and ordered in alphabetical order, all suffixes can be sorted in lexicographic order.

The key of the algorithm is the so-called doubling technique. Because all suffixes in a group have the same first $k$ symbols $x[i..i+k-1]$ after an iteration, we can split the groups according to $x[i+k..n]$. Since the numbers $V[i+k]$ are already calculated in the [...]

[...] groups by the second symbols. If $I[0] = i$, $S_{i-1}$ is the smallest suffix among the group $V[i-1]$. We can move suffixes to their correct position according to their first two symbols. The number of iterations is less than $\lceil \log n \rceil$ and each iteration can be done in $O(n)$ time, therefore this algorithm works in $O(n \log n)$ time.

2.5 Larsson's suffix tree construction algorithm
This algorithm can maintain a sliding window of a text in linear time and it can be used for PPM-style data compression [9]. Though the suffix tree requires more memory than the suffix array, it enables us to search substrings in linear time.

3 Proposed algorithm
3.1 Idea
We propose an algorithm for sorting suffixes using the KMR algorithm and the Bentley-Sedgewick algorithm. Though the original KMR algorithm uses bucket sort for each group, it is not practical: as the iterations of the KMR algorithm proceed, the number of different values of $V_i$ becomes large and the number of unsorted elements becomes small, so the cost of initializing the buckets becomes large. The algorithm of Manber and Myers solved this problem of the KMR algorithm, but it traverses all suffixes in each iteration even if almost all suffixes were already sorted. We use a comparison-based algorithm for sorting each group and sort only unsorted strings in each iteration.

Our algorithm maintains the $<_k$ order of all suffixes in an array $V[0..n]$ and iterates until all $V[i]$ have different values. After iteration $i$, all suffixes are sorted [...] first $k$ symbols are equal form a group. Groups are [...]
[...] $V[i]$ and $I[i]$ for only unsorted groups, therefore $V[i]$ [...] is updated. Next we update the array $V$ for having [...] $S_i < S_j \rightarrow V[i] \le V[j]$, [...]

[...] next group begins at index $g = size$, therefore we sort [...] order if [...]

If the size of a group $g$ becomes one in iteration $k$, [...]

[...] initialize $V[i]$ $(1 \le i \le n)$, let $k = 1$ [...]

[...] ($I[1..2]$), 3 ($I[3..4]$), 6 ($I[6..9]$) and 11 ($I[11..13]$) sepa- [...]

4. combine consecutive sorted groups into one sorted [...] and then update $V[i]$.

[...] suffixes $s \in I[0..size-1]$ according to $V[s+k]$. The [...]
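The idea of Section 3.1 (prefix doubling in the style of KMR, with a comparison sort applied only to groups that are still unsorted) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `suffix_array_doubling` is hypothetical, Python's built-in `sorted` stands in for the Bentley-Sedgewick ternary partitioning, `V[i]` is assumed to hold the starting position in `I` of the group that contains suffix `i`, and the array of group sizes is replaced by an explicit list of unsorted ranges.

```python
def suffix_array_doubling(text: str) -> list:
    """Return the suffix array I[0..n] of text followed by a unique smallest sentinel."""
    n = len(text)
    x = [ord(c) + 1 for c in text] + [0]   # x[n] = 0 plays the role of '$'
    m = n + 1

    # Initial pass (k = 1): order suffixes by their first symbol.
    I = sorted(range(m), key=lambda i: x[i])
    V = [0] * m          # assumption: V[i] = start in I of the group holding suffix i
    unsorted = []        # half-open ranges [lo, hi) of groups with more than one suffix
    lo = 0
    for p in range(1, m + 1):
        if p == m or x[I[p]] != x[I[lo]]:
            for q in range(lo, p):
                V[I[q]] = lo
            if p - lo > 1:
                unsorted.append((lo, p))
            lo = p

    k = 1
    while unsorted:
        # Sorting pass: reorder each unsorted group by the group number of the
        # suffix that starts k symbols later (the doubling step). Sorted groups
        # are never touched again.
        split = []
        for lo, hi in unsorted:
            I[lo:hi] = sorted(I[lo:hi], key=lambda s: V[s + k])
            start = lo
            for p in range(lo + 1, hi + 1):
                if p == hi or V[I[p] + k] != V[I[start] + k]:
                    split.append((start, p))
                    start = p
        # Update pass: renumber only the suffixes of the groups just split,
        # after all groups were sorted with the old numbers.
        unsorted = []
        for lo, hi in split:
            for q in range(lo, hi):
                V[I[q]] = lo
            if hi - lo > 1:
                unsorted.append((lo, hi))
        k *= 2
    return I
```

For example, `suffix_array_doubling("banana")` returns `[6, 5, 3, 1, 0, 4, 2]`, where index 6 is the sentinel suffix. The sketch keeps the sorting pass and the update pass separate, so every key `V[s + k]` read during sorting comes from the previous iteration.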
3.3 Further improvements
3.3.1 One pass implementation
The above implementation requires three passes: sorting, updating $S$, and updating $V$. Note that combining groups can be done in the sorting pass of the next iteration. However, we can change the algorithm to work in one pass. To update $V$ in the sorting pass, we must guarantee that the values of $V$ are consistent. If an element $V[i]$ is updated, the value is always incremented and it may become larger than other values which are greater than $S_i$ in $<_k$ order. However, if we update the array $V$ right to left, it is always consistent. Therefore we use quick sort or the Bentley-Sedgewick algorithm and sort recursively from the greater part to the lesser part.

3.3.2 Initial radix sort
Because we use comparison-based sorting algorithms for each group, the first iteration requires $O(n \log n)$ time. We can accelerate the iteration using bucket sort by the first two symbols of suffixes. We use an array of size $|\Sigma|^2$ and count the number of all patterns of two symbols. All suffixes are sorted according to their first two symbols in $O(n)$ time.

We can also use 2-pass radix sort. First all suffixes are sorted by their third and fourth symbols and then sorted by their first two symbols. We do not need to recalculate the count of two-symbol patterns in each pass because the first two symbols of a suffix are the third and fourth symbols of another suffix, so the count does not change between passes. Note that the input string is wrapped around for the Block sorting. The count [...]

3. for $j = k-2, \ldots, 0$
   sort suffixes by $x[i+2j]\, x[i+2j+1]$
   split groups by $x[i+2k-2]\, x[i+2k-1]$ and calculate $S[]$
4. calculate $V[]$

[...] range of the values varies from 1 to $n$. However, only one byte is enough for each element. If the size of a group $g$ is $s$, $S[g+1..g+s-1]$ are not used. Therefore we can store the size of the group in the unused part of the array if the size cannot be represented by one byte.

4 Experimental results
We have experimented on the sorting time of various algorithms. We used a SPARCstation-5 with 32MB memory and a Sun Ultra 1 with 256MB memory. The algorithms we tested are Bentley-Sedgewick, Manber-Myers, Larsson, and our algorithm. All algorithms except Larsson use the initial 2-pass radix sorting. Larsson is not a sorting algorithm but a suffix tree construction algorithm. Children of nodes in a suffix tree are represented by a linked list, and the elements of the list are rearranged by the move-to-front rule. Our algorithm uses the doubling technique of the KMR algorithm and the ternary partitioning of the Bentley-Sedgewick. Ours and Bentley-Sedgewick use the bubble sort for groups of size less than 6.

Table 1 shows memory requirements of the algorithms. Bentley-Sedgewick uses only one array $I[0..n]$ and a text buffer. Our algorithm uses two arrays $I$ and $V$ of $4n$ bytes and an array $S$ of $n$ bytes. Both Bentley-Sedgewick and our algorithm use a stack. However, the depth of the stack of our algorithm is smaller than that of Bentley-Sedgewick because Bentley-Sedgewick is a depth-first algorithm and ours is a breadth-first algorithm. Though Larsson's suffix tree requires $22n$ bytes in the worst case, we found by experiments that generally the number of internal nodes is about half of the leaves and the tree requires about $15n$ bytes. Though Manber-Myers can be implemented in $8n$ bytes, it becomes very slow and therefore we use the $13n$ version. Note that the initial 2-pass radix sorting requires memory of size $8n$ and therefore Bentley-Sedgewick with the radix sorting requires $9n$ bytes.

[...] Manber-Myers $13n$ [...]

Table 2 shows the time for sorting suffixes or making the suffix tree. We use files in the Calgary corpus [5] and the Canterbury corpus [2] for benchmark. In the table, the first, second and third columns show filename, size and average match length (AML) of the files respectively.

[...] $\sum_{i=1}^{n}$ [...]

where lcp is the length of the longest common prefix of two strings. Note that the AML is the average number of symbols per suffix that must be examined to verify that all suffixes are sorted correctly. The fourth to seventh columns show the sorting time of Bentley-Sedgewick, Manber-Myers, Larsson, and our algorithm respectively. Files are sorted [...]

Table 2: sorting time and average match length

files                              sorting time (s)
name          size (byte)  AML     BS      MM      Lar     ours
geo           102400       3.54    0.44    0.97    5.74    0.44
book1         768771       7.32    6.54    12.23   13.50   8.35
progc         39611        8.27    0.29    0.47    0.51    0.28
book2         610856       9.60    5.75    9.90    9.25    6.20
bible.txt     4047392      14.0    22.0    57.2    30.6    26.5
E.coli        4638690      17.3    32.2    74.7    39.88   37.20
news          377109       18.2    4.69    7.56    6.96    3.34
world192.txt  2473400      23.0    19.75   34.30   18.01   14.87
progl         71646        24.64   1.12    1.25    0.78    0.60

Table 3: sorting time and average match length

files                              sorting time (s)
name          size (byte)  AML     BS       MM       Lar     ours
1 txt         12602680     12.91   52.47    119.23   82.20   56.43
KIJI80M       81920000     15.46   394.77   -        -       436.54
KIJI50M       51200000     19.38   237.12   1143.84  -       237.49
KIJI30M       29282148     26.33   171.80   451.39   170.36  135.16
yeast         8720211      30.98   55.65    117.24   55.65   48.33
mail          8173683      50.63   63.77    109.93   70.51   32.24
elispman      7884800      60.22   59.47    97.05    50.74   27.68
gcc2723.tar   28231680     76.07   407.58   476.54   237.74  125.77
aac src       9597511      223.75  2187.10  141.06   26.45   42.85

[...] average match length of suffixes is large. Its speed [...]

[...] $\lceil \log_2 n \rceil$. However, practically our algorithm is faster [...]
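The AML column of Tables 2 and 3 can be illustrated with a small computation. The sketch below assumes that AML is the mean lcp of suffixes that are adjacent in sorted order, summed over $i = 1, \ldots, n$ and divided by $n$; the exact formula is not legible in this copy, so that definition is an assumption, as are the function names `lcp` and `average_match_length`. It sorts suffixes naively and is only meant for short strings.

```python
def lcp(a: str, b: str) -> int:
    """Length of the longest common prefix of two strings."""
    length = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        length += 1
    return length


def average_match_length(text: str) -> float:
    """Mean lcp of adjacent suffixes in sorted order (assumed definition of AML)."""
    n = len(text)
    if n == 0:
        return 0.0
    # The empty suffix text[n:] stands in for the sentinel suffix: Python orders
    # a proper prefix before any longer string, like a smallest '$' would.
    order = sorted(range(n + 1), key=lambda i: text[i:])
    total = sum(lcp(text[order[i - 1]:], text[order[i]:]) for i in range(1, n + 1))
    return total / n
```

Under this assumed definition, `average_match_length("abab")` is 0.75 and `average_match_length("aaaa")` is 1.5: highly repetitive input forces a sorter to look deeper into each suffix, which is the situation in which the text reports the proposed algorithm to be fast.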
Figure 3: Sorting speed for large files (panel title: Large texts; vertical axis: bytes / s x 10^3; series: Ours, Lar, MM)

[...] algorithm is practically fast, the worst-case time complexity is $O(n \log^2 n)$ and it is worse than the $O(n \log n)$ of Manber-Myers because our algorithm uses comparison-based sorting. We will therefore try to remove the comparison-based sorting, like the algorithm of Manber and Myers. To do so, we consider changing the order of sorting groups and utilizing a parent-child relationship of groups. (2) We will be able to select Bentley-Sedgewick or ours by estimating AML from the initial radix sort. We will also estimate the best number of passes of the radix sort. (3) If the input text is too long to be stored in main memory, we have to use external sorting. To reduce the number of disk I/Os, we can use the lcp of suffixes and the parent-child relationship of groups. (4) The Block Sorting compression uses sorting of suffixes. However, sorting is not exactly a necessary condition. For example, when all preceding symbols $x_{i-1}$ of suffixes $S_i$ in a group are the same, sorting of the group is not required. We may accelerate the compression speed of the Block Sorting by using this property.

Acknowledgment
The work of the authors was supported in part by the Grant-in-Aid on Priority Areas, `Advanced Databases,' of Ministry of Education, Science, Sports and Culture of Japan.

References
[1] A. Andersson and S. Nilsson. A New Efficient Radix Sort. In 35th Symp. on Foundations of Computer Science, pages 714-721, 1994.
[2] R. Arnold and T. Bell. A Corpus for the Evalua- [...]
[...] for sorting and searching strings. In Proceedings of the 8th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 360-369, 1997. http://www.cs.princeton.edu/~rs/strings/.
[...] using adaptive coding and partial string matching. IEEE Trans. on Commun., COM-32(4):396-402, April 1984.
[...] Japanese).
[8] R. M. Karp, R. E. Miller, and A. L. Rosenberg. Rapid identification of repeated patterns in strings, arrays and trees. In 4th ACM Symposium on Theory of Computing, pages 125-136, 1972.
[9] N. J. Larsson. Extended application of suffix trees to data compression. In Data Compression Conference, pages 190-199, April 1996.
[10] U. Manber and G. Myers. Suffix arrays: A new method for on-line string searches. In Proceedings of the 1st Annual ACM-SIAM Symposium on Discrete Algorithms, pages 319-327, 1990.
[11] E. M. McCreight. A space-economical suffix tree construction algorithm. Journal of the ACM, 23(12):262-272, 1976.
[12] K. Sadakane. A Fast Algorithm for Making Suffix Arrays and for Burrows-Wheeler Transformation. In Data Compression Conference (DCC'98), to appear.
[13] J. Seward. bzip2, 1996. http://www.muraroa.demon.co.uk/.
[14] E. Ukkonen. On-line construction of suffix trees. Algorithmica, 14(3):249-260, September 1995.