
Constructing Suffix Arrays of Large Texts
K. Sadakane and H. Imai
Department of Information Science, University of Tokyo
7-3-1 Hongo, Bunkyo-ku, Tokyo 113-0033, JAPAN
{sada,imai}@is.s.u-tokyo.ac.jp

Abstract

Recently, Sadakane [12] proposed a new fast and memory-efficient algorithm for sorting the suffixes of a text in lexicographic order. Sorting suffixes is important because the sorted array of suffix indexes, called the suffix array, is a memory-efficient alternative to the suffix tree. Sorting suffixes is also used for the Burrows-Wheeler transformation in Block Sorting text compression, so fast sorting algorithms are desired. In full-text databases the texts are quite large, and this algorithm makes it possible to use the suffix array data structure and the compression scheme for such large texts.

In this paper, we compare the algorithms of Bentley-Sedgewick, Andersson-Nilsson and Karp-Miller-Rosenberg for making suffix arrays, and the algorithm of Larsson for making suffix trees, in terms of speed and required memory. We then compare them with our new algorithm, which is fast and memory efficient because it combines them.

1 Introduction

1.1 Suffix array

Large databases, such as the full text of newspapers or Web pages and genome sequences, are now available, so it is important to store them in memory for quick queries. The suffix array [10] is a compact data structure for on-line string searches. The suffix tree is used for such purposes because it enables us to find the longest substring of a large text that matches a query string in linear time. However, the suffix tree requires a huge amount of memory, so it is impossible to use it for large texts.

Though the suffix tree can be constructed in linear time [11, 14, 9], the time complexity depends on the alphabet size. Searching time also depends on the alphabet size, and it becomes slow if the alphabet size is large. On the other hand, the construction and searching time of the suffix array does not depend on the size of the alphabet. Therefore the searching time of the suffix array may be comparable with that of the suffix tree, though it has a superlinear time complexity of O(P log n), where P is the length of a query string and n is the length of a large text.

1.2 Block Sorting data compression

Block Sorting is a lossless data compression scheme [4]. Though its compression ratio is comparable with variants of PPM [6], its compression speed is faster than theirs and its decompression speed is much faster than its compression speed. Moreover, the required memory is smaller than for PPM, so Block Sorting is a promising scheme.

The Burrows-Wheeler transformation used in Block Sorting is a permutation of the text to be compressed, and it requires sorting all suffixes of the text. Though Block Sorting is faster than PPM, it is slower than gzip because a long text must be sorted. Block Sorting therefore generally divides a long text into many blocks and applies the Burrows-Wheeler transformation to each block, which makes the compression ratio worse. Speeding up the sorting of suffixes for large texts is therefore important.

1.3 Our results

We propose a fast and memory-efficient algorithm for sorting all suffixes of a text. We use the doubling technique of Karp-Miller-Rosenberg [8], the ternary partitioning of Bentley-Sedgewick [3], and the numbering method of Andersson-Nilsson [1] and Manber-Myers [10]. It requires only 9n bytes for sorting a text of length n by using a clever ordering of sorting and an integer encoding. Note that we assume the size of integers is 32 bits. We define a measure of the difficulty of sorting suffixes: average match length (AML). Our algorithm requires much more memory than Bentley-Sedgewick, which requires 5n bytes, but our algorithm is faster than other sorting algorithms when the AML is large. It is faster than the suffix tree construction algorithm of Larsson [9] if the alphabet size is large, and it requires less than half of the memory of the suffix tree. When a text becomes long, many pairs of words appear repeatedly and the AML becomes large, so our algorithm is practical for large texts.

1.4 Definitions

Assume that we make a suffix array of a text X = x[0..n-1] = x_0 x_1 ... x_{n-1}. The size of the alphabet Σ is constant. We add a special symbol `$' at the tail of the text X, that is, x_n = $. It is not in the alphabet and is smaller than any other symbol in the alphabet. Suffixes of the text are represented by S_i = x[i..n]. The suffix array I[0..n] is an array of indexes of suffixes S_i. All suffixes in the array are sorted in lexicographic
order. A suffix S_i is lexicographically less than a suffix S_j if there exists l ≥ 0 such that x[i..i+l-1] = x[j..j+l-1] and x[i+l] < x[j+l].

We use another ordering <_k defined by comparison of prefixes of length k as follows.

\[ S_i <_k S_j \iff \exists\, l,\ 0 \le l < k:\ x[i..i+l-1] = x[j..j+l-1] \text{ and } x[i+l] < x[j+l] \]
\[ S_i =_k S_j \iff x[i..i+k-1] = x[j..j+k-1] \]
\[ S_i >_k S_j \iff \exists\, l,\ 0 \le l < k:\ x[i..i+l-1] = x[j..j+l-1] \text{ and } x[i+l] > x[j+l] \]

Due to the space limitation, we skip describing related algorithms in this version.

2 Related works

It is necessary to sort all suffixes of a string in lexicographic order to make a suffix array. Several algorithms, listed below, are available for sorting suffixes. The first two are general algorithms for sorting strings and the last two are for sorting the suffixes of a string.

2.1 Bentley-Sedgewick algorithm

This is a practical algorithm for sorting strings [3]. It is similar to quick sort, but it recursively partitions pointers to strings into three parts. The pivot used for partitioning is the first symbol of a string, and all strings are partitioned into less than, equal to and greater than the pivot according to their first symbols. The key idea of this algorithm is to use the equal part. Strings in it have the same first symbol, so we can sort them without comparing the first symbols again.

The Bentley-Sedgewick algorithm is used in the free software bzip2 [13], an implementation of Block Sorting.

2.2 Andersson-Nilsson algorithm (Forward Radix Sort)

This is a general algorithm for sorting strings using radix sort [1]. First, all strings are sorted and split into groups according to their first symbols by using radix sort. Next, all strings are moved to buckets according to their second symbols. The buckets are traversed in alphabetical order and the strings in them are returned to their own groups. Now all strings are sorted according to their first two symbols, so we split the groups and iterate these operations until all strings are sorted. This algorithm is simple, yet it has good theoretical time complexity.

2.3 Karp-Miller-Rosenberg (KMR) algorithm

This is an algorithm for finding repeated patterns in a string [8], and it can be used for sorting suffixes. First, all suffixes S_i of a string are split into groups according to their first symbols and given numbers V[i]. All suffixes in a group have the same v = V[i], and the group is represented by the number v. All groups have different numbers. Next, the suffixes of each group are split into subgroups according to the numbers V[i] of the suffixes that start one symbol later (S_{i+1} for S_i). As a result, all suffixes are split according to their first two symbols. This operation continues until all suffixes belong to groups of size one. If groups are split and ordered in alphabetical order, all suffixes can be sorted in lexicographic order.

The key to the algorithm is the so-called doubling technique. Because all suffixes in a group have the same first k symbols x[i..i+k-1] after an iteration, we can split the groups according to x[i+k..n]. Since the numbers V[i+k] were already calculated in the last iteration, we can split the groups according to x[i+k..i+2k-1] in the next iteration. Consequently, all suffixes can be sorted in lexicographic order within ⌈log n⌉ iterations.

2.4 Manber-Myers algorithm

This algorithm [10] uses the doubling technique of the KMR algorithm. First we sort and group all suffixes by their first symbols using bucket sort. Suffixes are given V[i] as in the KMR algorithm. Next, all groups are traversed in order of V[i] to sort the suffixes in the groups by their second symbols. If I[0] = i, S_{i-1} is the smallest suffix in the group V[i-1]. We can move suffixes to their correct positions according to their first two symbols. The number of iterations is less than ⌈log n⌉ and each iteration can be done in O(n) time, so this algorithm works in O(n log n) time.

2.5 Larsson's suffix tree construction algorithm

This algorithm can maintain a sliding window of a text in linear time, and it can be used for PPM-style data compression [9]. Though the suffix tree requires more memory than the suffix array, it enables us to search for substrings in linear time.

3 Proposed algorithm

3.1 Idea

We propose an algorithm for sorting suffixes that uses the KMR algorithm and the Bentley-Sedgewick algorithm. Though the original KMR algorithm uses bucket sort for each group, this is not practical: as the iterations of the KMR algorithm proceed, the number of different values of V[i] becomes large and the number of unsorted elements becomes small, so the cost of initializing the buckets becomes large. The algorithm of Manber and Myers solved this problem of the KMR algorithm, but it traverses all suffixes in each iteration even if almost all suffixes are already sorted. We use a comparison-based algorithm for sorting each group and sort only unsorted strings in each iteration.

Our algorithm maintains the <_k order of all suffixes in an array V[0..n] and iterates until all V[i] have different values. After iteration i, all suffixes are sorted according to their first k = 2^i symbols. Suffixes whose first k symbols are equal form a group. Groups are called unsorted if their size is more than one, and they are called sorted if their size is one. We want to update
V[i] and I[i] only for unsorted groups, so the V[i] and I[i] of sorted groups must be consistent in all iterations; that is, if a suffix S_i is in a sorted group, V[i] and I[j] = i must be equal to the values they have when all suffixes are sorted.

Definition 1 The array V[0..n] is consistent with <_k order if

\[ S_i <_k S_j \rightarrow V[i] \le V[j], \]
\[ S_i >_k S_j \rightarrow V[i] \ge V[j], \text{ and} \]
\[ S_i \ne_k S_j \rightarrow V[i] \ne V[j]. \]

Note that we can assign different values to V[i] and V[j] even if S_i =_k S_j. This enables us to sort suffixes without temporary arrays.

Our algorithm keeps V[0..n] consistent at all times. If V[i] represents the number of suffixes which are less than the suffix S_i according to <_k order, the value is consistent, and V[I[i]] = i if the suffix S_{I[i]} is in a sorted group. This numbering comes from [10, 1].

Our algorithm proceeds as follows:

1. sort S_i by their first symbols using bucket sort, initialize V[i] (1 ≤ i ≤ n), let k = 1
2. sort unsorted groups according to <_k order using a comparison-based sorting algorithm
3. split groups and update V
4. combine consecutive sorted groups into one sorted group
5. if the number of groups is n, exit
6. k = 2 × k and goto 2.

Sorting and splitting a group is done as follows. Assume that the suffixes in a group G = I[s..e] are equal according to <_k order and that we now calculate <_2k order and V[i] for i ∈ I[s..e]. First we sort the suffixes according to V[I[i]+k] (s ≤ i ≤ e) by using a comparison-based algorithm, then we split the group and update V[i]. The group is split between i and i+1 if V[I[i]+k] ≠ V[I[i+1]+k].

3.2 Implementation

Groups are represented by three arrays: I[0..n], V[0..n] and S[0..n]. After iteration k, they represent

I[i]: the index of the i-th suffix in <_k order,
V[s]: the group of suffix S_s,
S[g]: the size of the group g.

To implement the algorithm without temporary arrays, we must update groups carefully. All groups are traversed in <_k order in iteration k. The first group begins at index 0 and its size is S[0]. We sort the suffixes S_s ∈ I[0..size-1] according to V[s+k]. The next group begins at index g = size, so we sort S_s ∈ I[g..g+S[g]-1]. After all groups are traversed, they are split to form <_2k order. If consecutive suffixes I[i] and I[i+1] in group g have different group numbers V[I[i]+k] and V[I[i+1]+k], the group is split between index i and i+1, that is, the size of the group g is updated. Next we update the array V so that it has different values according to <_2k order.

If the size of a group g becomes one in iteration k, the group is sorted and it is unnecessary to sort the group after the iteration. We therefore want to skip such groups. To do so, we combine consecutive sorted groups into one group. We use the sign of the group size as a flag to indicate whether a group is sorted or not. The array S[g] is a memory-efficient alternative to a linked list which contains unsorted groups.

Figure 1 shows an example of the proposed algorithm for sorting the suffixes of "tobeornottobe$." In the figure, the second row shows the example string, and the third, fourth and fifth rows show the result of sorting suffixes by their first symbols. Negative numbers in S[0], S[5] and S[10] show that suffixes S_{I[0]}, S_{I[5]} and S_{I[10]} are already sorted. In the first iteration (k = 1), we sort the suffixes in groups 1 (I[1..2]), 3 (I[3..4]), 6 (I[6..9]) and 11 (I[11..13]) separately by their second symbols. The second symbol of a suffix S_s is the first symbol of the suffix S_{s+1}, which is the key idea of the KMR algorithm, and it is represented by V[s+1], so we sort I[i] by V[I[i]+1]. After the sorting, we split the groups, that is, update S[i], and then update V[i].

At the beginning of the second iteration (k = 2), all suffixes have been sorted in <_2 order. Therefore we sort the suffixes in a group by their third and later symbols. The element V[s+2] represents the <_2 order of x_{s+2} x_{s+3}, so after sorting suffixes by V[I[i]+2], we obtain the <_4 order of suffixes.

Figure 1: An example of our algorithm (a dot marks a cell that is blank in the figure)

  k  row          0   1   2   3   4   5   6   7   8   9  10  11  12  13
     x_i          t   o   b   e   o   r   n   o   t   t   o   b   e   $
     I[i]        13   2  11   3  12   6   1   4   7  10   5   0   8   9
     S[i]        -1   2   .   2   .  -1   4   .   .   .  -1   3   .   .
     V[I[i]]      0   1   1   3   3   5   6   6   6   6  10  11  11  11
  1  V[I[i]+k]    .   3   3   6   0   .   1  10  11   1   .   6  11   6
     I[i]         .   2  11  12   3   .   1  10   4   7   .   0   9   8
     S[i]        -1   2   .  -3   .   .   2   .  -3   .   .   2   .  -1
     V[I[i]]      .   1   1   3   4   .   6   6   8   9   .  11  11  13
  2  V[I[i]+k]    .   8   0   .   .   .   4   3   .   .   .   1   1   .
     I[i]         .  11   2   .   .   .  10   1   .   .   .   0   9   .
     S[i]       -11   .   .   .   .   .   .   .   .   .   .   2   .  -1
     V[I[i]]      .   1   2   .   .   .   6   7   .   .   .  11  11   .
  4  V[I[i]+k]    .   .   .   .   .   .   .   .   .   .   .   8   0   .
     I[i]         .   .   .   .   .   .   .   .   .   .   .   9   0   .
     S[i]       -14   .   .   .   .   .   .   .   .   .   .   .   .   .
     V[I[i]]      .   .   .   .   .   .   .   .   .   .   .  11  12   .
     I[i]        13  11   2  12   3   6  10   1   4   7   5   9   0   8
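
The steps of Sections 3.1 and 3.2 can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: the function name is hypothetical, Python's built-in sorted() stands in for both the bucket sort and the comparison-based group sort, the group-size array S is omitted (groups are found by rescanning I), and V is refreshed from a temporary copy in each round rather than updated in place. Group numbers follow the numbering of Figure 1, where V[s] is the index in I of the first suffix of the group of S_s.

```python
def suffix_array_by_doubling(text):
    """Return the suffix array of `text` by prefix doubling.

    Simplified sketch: groups are found by rescanning I, and V is refreshed
    from a copy each round, instead of using the S array and in-place updates.
    """
    n = len(text)
    # Step 1: order suffixes by first symbol (sorted() stands in for bucket sort).
    I = sorted(range(n), key=lambda s: text[s])
    # V[s] = index in I of the first suffix of the group containing s.
    V = [0] * n
    start = 0
    for i in range(n):
        if i > 0 and text[I[i]] != text[I[i - 1]]:
            start = i
        V[I[i]] = start

    k = 1
    while True:
        def key(s, V=V, k=k):
            # Group number of the suffix k symbols later; -1 past the end of the text.
            return V[s + k] if s + k < n else -1

        new_V = V[:]
        unsorted_left = False
        i = 0
        while i < n:
            # Find the end j of the group that starts at index i.
            j = i + 1
            while j < n and V[I[j]] == V[I[i]]:
                j += 1
            if j - i > 1:
                I[i:j] = sorted(I[i:j], key=key)      # step 2: sort the group
                start = i
                for p in range(i, j):                 # step 3: split and renumber
                    if p > i and key(I[p]) != key(I[p - 1]):
                        start = p
                    elif p > i:
                        unsorted_left = True          # two neighbours still tie
                    new_V[I[p]] = start
            i = j
        V = new_V
        if not unsorted_left:                         # step 5: every group has size one
            return I
        k *= 2                                        # step 6


print(suffix_array_by_doubling("tobeornottobe$"))
# [13, 11, 2, 12, 3, 6, 10, 1, 4, 7, 5, 9, 0, 8]
```

The printed array is the last row of Figure 1. Unlike the scheme of Section 3.2, this sketch visits every position of I in every round, so it shows the order of operations but not the saving obtained by skipping sorted groups.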
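
Section 3.1 sorts each unsorted group with a comparison-based method, and Section 2.1 describes ternary partitioning. A minimal sketch of ternary partitioning applied to the integer keys V[s+k] of one group is given below. The names ternary_sort and key are hypothetical, the pivot choice (the middle element) is an assumption, and the small-group cutoff used in the experiments is not included.

```python
def ternary_sort(I, lo, hi, key):
    """Sort I[lo:hi] in place by key(s), using three-way (ternary) partitioning."""
    while hi - lo > 1:
        pivot = key(I[(lo + hi) // 2])
        lt, i, gt = lo, lo, hi
        while i < gt:
            v = key(I[i])
            if v < pivot:
                I[lt], I[i] = I[i], I[lt]
                lt += 1
                i += 1
            elif v > pivot:
                gt -= 1
                I[gt], I[i] = I[i], I[gt]
            else:
                i += 1
        # Now I[lo:lt] < pivot, I[lt:gt] == pivot, I[gt:hi] > pivot.
        # The equal part needs no further work: its members stay in one group.
        ternary_sort(I, gt, hi, key)   # greater part first, by recursion
        hi = lt                        # then continue with the lesser part


# State after the initial sort in Figure 1; V is indexed by suffix number.
V = [11, 6, 1, 3, 6, 10, 5, 6, 11, 11, 6, 1, 3, 0]
I = [13, 2, 11, 3, 12, 6, 1, 4, 7, 10, 5, 0, 8, 9]
ternary_sort(I, 6, 10, key=lambda s: V[s + 1])   # group 6 = I[6..9], k = 1
print([V[s + 1] for s in I[6:10]])
```

After the call, the suffixes in I[6..9] are ordered by the keys 1, 1, 10, 11 that Figure 1 lists for this group, and the rest of I is untouched. The relative order of suffixes with equal keys is not fixed by this routine, which is acceptable here because they remain in one unsorted group.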
3.3 Further improvements

3.3.1 One pass implementation

The above implementation requires three passes: sorting, updating S, and updating V. Note that combining groups can be done in the sorting pass of the next iteration. However, we can change the algorithm to work in one pass. To update V in the sorting pass, we must guarantee that the values of V remain consistent. If an element V[i] is updated, the value is always incremented, and it may become larger than other values which are greater than S_i in <_k order. However, if we update the array V from right to left, it is always consistent. Therefore we use quick sort or the Bentley-Sedgewick algorithm and sort recursively from the greater part to the lesser part.

3.3.2 Initial radix sort

Because we use comparison-based sorting algorithms for each group, the first iteration requires O(n log n) time. We can accelerate this iteration by using bucket sort on the first two symbols of the suffixes. We use an array of size |Σ|^2 and count the number of occurrences of all patterns of two symbols. All suffixes are sorted according to their first two symbols in O(n) time.

We can also use a 2-pass radix sort. First all suffixes are sorted by their third and fourth symbols and then sorted by their first two symbols. We do not need to recalculate the count of two-symbol patterns in each pass, because the first two symbols of a suffix are the third and fourth symbols of another suffix and the count does not change between passes. Note that the input string is wrapped around for Block Sorting. The count can be updated in constant time even if the string is not wrapped around.

The algorithm can be extended to k passes as follows.

1. count all two symbol patterns
2. sort suffixes S_i = x[i..n] by x[i+2k-2] x[i+2k-1]
3. for j = k-2 .. 0
   sort suffixes by x[i+2j] x[i+2j+1]
   split groups by x[i+2k-2] x[i+2k-1] and calculate S[]
4. calculate V[]

3.3.3 Word size of the array S

Because the elements of the array S are the sizes of groups, the range of the values varies from 1 to n. However, only one byte is enough for each element.

If the size of a group g is s, the elements S[g+1..g+s-1] are not used. Therefore we can store the size of the group in the unused part of the array if the size cannot be represented by one byte.

4 Experimental results

We measured the sorting time of various algorithms. We used a SPARCstation-5 with 32MB memory and a Sun Ultra 1 with 256MB memory. The algorithms we tested are Bentley-Sedgewick, Manber-Myers, Larsson, and our algorithm. All algorithms except Larsson use an initial 2-pass radix sort. Larsson is not a sorting algorithm but a suffix tree construction algorithm. Children of nodes in the suffix tree are represented by a linked list, and the elements of the list are rearranged by the move-to-front rule. Our algorithm uses the doubling technique of the KMR algorithm and the ternary partitioning of Bentley-Sedgewick. Ours and Bentley-Sedgewick use bubble sort for groups of size less than 6.

Table 1 shows the memory requirements of the algorithms. Bentley-Sedgewick uses only one array I[0..n] and a text buffer. Larsson uses five arrays of size n and a text buffer. Our algorithm uses two arrays I and V of 4n bytes each and an array S of n bytes. Both Bentley-Sedgewick and our algorithm use a stack. However, the depth of the stack of our algorithm is smaller than that of Bentley-Sedgewick because Bentley-Sedgewick is a depth-first algorithm and ours is a breadth-first algorithm. Though Larsson's suffix tree requires 22n bytes in the worst case, we found by experiment that the number of internal nodes is generally about half the number of leaves and that the tree requires about 15n bytes. Though Manber-Myers can be implemented in 8n bytes, it then becomes very slow, so we use the 13n version. Note that the initial 2-pass radix sort requires memory of size 8n, and therefore Bentley-Sedgewick with the radix sort requires 9n bytes.

Table 1: memory requirements

  algorithm           memory
  Bentley-Sedgewick   5n + stack
  Manber-Myers        13n
  Larsson             22n
  our algorithm       9n + stack

Table 2 shows the time for sorting suffixes or making a suffix tree. We use files in the Calgary corpus [5] and the Canterbury corpus [2] as benchmarks. In the table, the first, second and third columns show the filename, size and average match length (AML) of the files, respectively. The AML is defined as follows.

\[ \mathrm{AML} = \frac{1}{n-1} \sum_{i=1}^{n-1} \mathrm{lcp}(S_{I[i]}, S_{I[i+1]}) \]

where lcp is the length of the longest common prefix of two strings. Note that the AML is the average number of symbols per suffix needed to verify that all suffixes are sorted correctly. The fourth to seventh columns show the sorting time of Bentley-Sedgewick (BS), Manber-Myers (MM), Larsson (Lar), and our algorithm, respectively. Files are sorted in order of AML. Bentley-Sedgewick is fast if the AML is relatively small and our algorithm is fast if the AML is large. Larsson is slow for files with a small AML, and its speed is similar to that of our algorithm for E.coli, because E.coli is a DNA sequence of a colon bacillus and the linked lists in the tree are efficient for the small alphabet (ATCG). Our algorithm is fast if the AML is large because it uses the doubling technique of the KMR algorithm. Roughly speaking, its sorting time is proportional to the logarithm of the AML. On the other hand, the sorting time of Bentley-Sedgewick is proportional to the AML.

Table 2: sorting time and average match length

  name          size (byte)   AML      sorting time (s)
                                       BS      MM      Lar     ours
  geo               102400    3.54     0.44    0.97    5.74    0.44
  book1             768771    7.32     6.54   12.23   13.50    8.35
  progc              39611    8.27     0.29    0.47    0.51    0.28
  book2             610856    9.60     5.75    9.90    9.25    6.20
  bible.txt        4047392   14.0     22.0    57.2    30.6    26.5
  E.coli           4638690   17.3     32.2    74.7    39.88   37.20
  news              377109   18.2      4.69    7.56    6.96    3.34
  world192.txt     2473400   23.0     19.75   34.30   18.01   14.87
  progl              71646   24.64     1.12    1.25    0.78    0.60

Table 3 shows experimental results for very large files. The results are also shown in Figures 2-4. We used a Sun Ultra30 with 1024MB memory. In the table, 1 txt is an English text, KIJI30M is a part of the text of Nihon Keizai Shimbun, a Japanese newspaper, including headlines and bodies, KIJI50M and KIJI80M are bodies of the newspaper, yeast is a DNA sequence, elispman is the texinfo file of EmacsLisp, mail is a Japanese mailing list file, gcc2723.tar is an archive of the gcc sources, and aac src is the C source code of an audio codec. Our algorithm is always faster than Manber-Myers, and it is faster than Bentley-Sedgewick if the AML is more than 20. If the AML is less than 20, Bentley-Sedgewick is faster than ours. However, the difference is about 10%. Larsson is fast for aac src because the file has a large AML and the number of nodes will be small. If we use binary trees or hash tables for finding children of nodes in the suffix tree, Larsson will become faster for other files. However, it then requires more memory. Our algorithm works for very large files of up to 80M bytes. Manber-Myers and Larsson cannot be run on them due to memory limitations.

Table 3: sorting time and average match length (a hyphen marks a run with no result)

  name          size (byte)   AML      sorting time (s)
                                       BS        MM        Lar      ours
  1 txt           12602680   12.91     52.47    119.23    82.20    56.43
  KIJI80M         81920000   15.46    394.77       -         -    436.54
  KIJI50M         51200000   19.38    237.12   1143.84       -    237.49
  KIJI30M         29282148   26.33    171.80    451.39   170.36   135.16
  yeast            8720211   30.98     55.65    117.24    55.65    48.33
  mail             8173683   50.63     63.77    109.93    70.51    32.24
  elispman         7884800   60.22     59.47     97.05    50.74    27.68
  gcc2723.tar     28231680   76.07    407.58    476.54   237.74   125.77
  aac src          9597511  223.75   2187.10    141.06    26.45    42.85

Figure 2: Sorting speed for small files [plot "Small files": sorting speed in 10^3 bytes/s against average match length, for Ours, Lar, MM and BS]

5 Concluding remarks

We examined the speed and required memory of several algorithms for sorting suffixes and making a suffix tree, and we proposed a practical algorithm for sorting suffixes that combines other algorithms. Our algorithm is faster than other sorting algorithms when the average match length of suffixes is large. Its speed is comparable with that of a suffix tree construction algorithm, and it requires less than half of the memory of the suffix tree. The average match length becomes large if a text is long, so our algorithm is practical for large texts.

We measured the sorting time for many files and found that the average match length (AML) of the Japanese newspaper is not too large. The text consists of all bodies of the newspaper over two to six months, and it contains various articles. If articles on similar topics are gathered, the AML becomes large and our algorithm works well.

Bentley-Sedgewick is faster if the AML is not large. However, it becomes very slow if the AML is large, and we do not know the value before sorting. Bentley-Sedgewick is therefore not stable with respect to the AML. The problem may be relieved by run-length encoding of the input text; however, the encoding does not affect files which have many repeated substrings. On the other hand, our algorithm is stable. Though Larsson's suffix tree construction algorithm is also stable, it cannot be accelerated by an initial radix sort.

Very recently, Hongo and Yokoo [7] proposed an algorithm for the Burrows-Wheeler transformation. Their algorithm uses the KMR algorithm and sorts only unsorted groups, like ours. For sorting a group, it compares two numbers at a time. Thus the number of iterations is less than ⌈log_3 n⌉ rather than ⌈log_2 n⌉. However, in practice our algorithm is faster than theirs.

Some work remains. (1) Though our
algorithm is fast in practice, its worst-case time complexity is O(n log^2 n), which is worse than the O(n log n) of Manber-Myers, because our algorithm uses comparison-based sorting. We will therefore try to remove the comparison-based sorting, as in the algorithm of Manber and Myers. To do so, we consider changing the order in which groups are sorted and utilizing a parent-child relationship of groups. (2) We will be able to select Bentley-Sedgewick or ours by estimating the AML from the initial radix sort. We will also estimate the best number of passes of the radix sort. (3) If the input text is too long to be stored in main memory, we have to use external sorting. To reduce the number of disk I/O operations, we can use the lcp of suffixes and the parent-child relationship of groups. (4) Block Sorting compression uses sorting of suffixes. However, sorting is not exactly a necessary condition. For example, when all preceding symbols x_{i-1} of the suffixes S_i in a group are the same, sorting the group is not required. We may accelerate the compression speed of Block Sorting by using this property.

Figure 3: Sorting speed for large files [plot "Large texts": sorting speed in 10^3 bytes/s against average match length, for Ours, Lar, MM and BS]

Figure 4: Sorting speed for very large files [plot "Very large texts": sorting speed in 10^3 bytes/s against average match length, for Ours, Lar, MM and BS]

Acknowledgment

The work of the authors was supported in part by the Grant-in-Aid on Priority Areas, `Advanced Databases,' of Ministry of Education, Science, Sports and Culture of Japan.

References

[1] A. Andersson and S. Nilsson. A New Efficient Radix Sort. In 35th Symp. on Foundations of Computer Science, pages 714-721, 1994.
[2] R. Arnold and T. Bell. A Corpus for the Evaluation of Lossless Compression Algorithms. In Data Compression Conference, pages 201-210, March 1997. http://corpus.canterbury.ac.nz/.
[3] J. L. Bentley and R. Sedgewick. Fast algorithms for sorting and searching strings. In Proceedings of the 8th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 360-369, 1997. http://www.cs.princeton.edu/~rs/strings/.
[4] M. Burrows and D. J. Wheeler. A block-sorting lossless data compression algorithms. Technical Report 124, Digital SRC Research Report, 1994.
[5] Calgary Text Compression Corpus. ftp://ftp.cpsc.ucalgary.ca/pub/projects/text.compression.corpus/.
[6] J. G. Cleary and I. H. Witten. Data compression using adaptive coding and partial string matching. IEEE Trans. on Commun., COM-32(4):396-402, April 1984.
[7] F. Hongo and H. Yokoo. Block-Sorting Data Compression and KMR Algorithm. In 20th Symposium on Information Theory and Its Applications, pages 673-676. SITA, December 1997. (in Japanese).
[8] R. M. Karp, R. E. Miller, and A. L. Rosenberg. Rapid identification of repeated patterns in strings, arrays and trees. In 4th ACM Symposium on Theory of Computing, pages 125-136, 1972.
[9] N. J. Larsson. Extended application of suffix trees to data compression. In Data Compression Conference, pages 190-199, April 1996.
[10] U. Manber and G. Myers. Suffix arrays: A new method for on-line string searches. In Proceedings of the 1st Annual ACM-SIAM Symposium on Discrete Algorithms, pages 319-327, 1990.
[11] E. M. McCreight. A space-economical suffix tree construction algorithm. Journal of the ACM, 23(12):262-272, 1976.
[12] K. Sadakane. A Fast Algorithm for Making Suffix Arrays and for Burrows-Wheeler Transformation. In Data Compression Conference (DCC'98), to appear.
[13] J. Seward. bzip2, 1996. http://www.muraroa.demon.co.uk/.
[14] E. Ukkonen. On-line construction of suffix trees. Algorithmica, 14(3):249-260, September 1995.
