
Efficient Similarity Join over Multiple Stream Time Series
Xiang Lian, Student Member, IEEE, and Lei Chen, Member, IEEE
Abstract: Similarity join (SJ) in time-series databases has a wide spectrum of applications such as data cleaning and mining.
Specifically, an SJ query retrieves all pairs of (sub)sequences from two time-series databases that ε-match with each other, where ε is
the matching threshold. Previous work on this problem usually considers static time-series databases, where queries are performed
either on disk-based multidimensional indexes built on static data or by nested loop join (NLJ) without indexes. SJ over multiple stream
time series, which continuously outputs pairs of similar subsequences from stream time series, strongly requires low memory
consumption, low processing cost, and query procedures that are themselves adaptive to time-varying stream data. These
requirements invalidate the existing approaches in static databases. In this paper, we propose an efficient and effective approach to
perform SJ among multiple stream time series incrementally. In particular, we present a novel method, Adaptive Radius-based Search
(ARES), which can answer the similarity search without false dismissals and is seamlessly integrated into SJ processing. Most
importantly, we provide a formal cost model for ARES, based on which ARES adapts to data characteristics and achieves the
minimum number of refined candidate pairs, making it suitable for stream processing. Furthermore, in light of the cost model, we utilize
space-efficient synopses that are constructed for stream time series to further reduce the candidate set. Extensive experiments
demonstrate the efficiency and effectiveness of our proposed approach.
Index Terms: Stream time series, ARES, similarity join, synopsis.

1 INTRODUCTION
In the context of stream time series, similarity join (SJ) has
many important applications such as data cleaning and
data mining [35], [26]. For example, SJ queries can be used
to help clean sensor data collected from various sources that
might contain inconsistency [26]. As another example [35],
in the stock market, it is crucial to find correlations among
stocks so as to make trading decisions timely. In this case,
we can also perform SJ over price curves of stocks in order
to obtain their common patterns, rules, or trends. In
addition to the above two concrete examples, SJ can be
applied to a wide spectrum of stream time series applica-
tions including Internet traffic analysis [12], sensor network
monitoring [36], and so on. Formally, given two time-series
databases R and S containing (sub)sequences of length n,
an SJ query outputs all pairs ⟨r, s⟩ such that dist(r, s) ≤ ε,
where r ∈ R, s ∈ S, dist(·, ·) is a distance function between
two series, and ε a similarity threshold.
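As a baseline for the discussion that follows, the ε-match join defined above can be sketched as a naive nested loop under Euclidean distance. This is a minimal illustration of the join predicate only, not the streaming method proposed in the paper; all names and the toy data are assumptions of this sketch.

```python
import math

def dist(x, y):
    # Euclidean (L2) distance between two equal-length sequences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def similarity_join(R, S, eps):
    # Naive nested loop join: report every pair <r, s> with dist(r, s) <= eps
    return [(i, j) for i, r in enumerate(R)
                   for j, s in enumerate(S) if dist(r, s) <= eps]

R = [[1.0, 2.0, 3.0], [10.0, 10.0, 10.0]]
S = [[1.1, 2.1, 2.9], [50.0, 0.0, 0.0]]
print(similarity_join(R, S, eps=0.5))  # -> [(0, 0)]
```

The quadratic cost of this nested loop is exactly what the indexing and pruning techniques discussed below aim to avoid.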
To the best of our knowledge, there is no previous work
on SJ in the scenario of multiple stream time series, where new
data arrive continuously over time. Although some propo-
sals in the literature studied the join over data streams [28],
[22], [27], their focus is on load-shedding stream data to
disk, in the case of memory overflow, such that the joined
pairs (involving only a few join attributes) are output
smoothly. In contrast, the SJ problem over stream time series
is entirely processed in memory, where the join predicate
considers subsequences with extremely high dimensionality
(join attributes), which raises indexing and query
efficiency issues that cannot be solved by simple joins.
Furthermore, previous methods on SJ problem in time-
series databases mainly focus on static data, which can be
classified into two categories, SJ with and without indexes.
The most related work to the first category is the spatial join
[6], [33], [20], which considers each time series of length n as
an n-dimensional data point in the spatial database. In
particular, this approach builds multidimensional indexes
such as R-tree [14] or spatial hash [20] on (reduced) time
series and executes the join operator with the help of
indexes. In contrast to the first category, the second one
does not rely on any index. Instead, a nested loop join (NLJ) is
invoked to exhaustively compare each possible pair. These
approaches in static databases, however, cannot be applied
to SJ over stream time series either, due to the unique
requirements of stream time-series processing such as low
memory consumption, low processing cost, and adaptivity
to data characteristics. Specifically, it is not efficient to build
an R-tree [14] for each stream time series, since the required
memory size is large, and index maintenance and query
costs are high. Moreover, the work of the spatial hash join
[20] is designed only for static data, and thus, not adaptive
to the change of stream data. Finally, NLJ incurs high (i.e.,
quadratic) computation cost, which is not tuned to stream
time-series processing.
Motivated by this, in this paper, we propose an efficient
and effective approach that incrementally performs SJ over
multiple stream time series with low space and computation
cost, yet adaptive to the change of data characteristics.
Specifically, we construct space-efficient synopses for stream
1544 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 21, NO. 11, NOVEMBER 2009
. The authors are with the Department of Computer Science and
Engineering, Hong Kong University of Science and Technology, Clear
Water Bay, Kowloon, Hong Kong, China.
E-mail: {xlian, leichen}@cse.ust.hk.
Manuscript received 9 Oct. 2007; revised 7 July 2008; accepted 4 Nov. 2008;
published online 8 Jan. 2009.
Recommended for acceptance by K. Shim.
For information on obtaining reprints of this article, please send e-mail to:
tkde@computer.org, and reference IEEECS Log Number TKDE-2007-10-0492.
Digital Object Identifier no. 10.1109/TKDE.2009.27.
1041-4347/09/$25.00 2009 IEEE Published by the IEEE Computer Society
time series, which are used to facilitate pruning candidate
pairs, and thus, reduce the computation cost during SJ
processing. Furthermore, we propose a formal cost model
for the entire SJ procedure, in light of which each
incremental step is adaptive to the change of stream data,
such that the total number of refined candidates at each step
is minimized.
In particular, we make the following contributions:
1. We propose in Section 5.1 an Adaptive Radius-basEd
Search (ARES) approach to answer SJ on stream time
series, which does not introduce false dismissals.
2. We provide in Section 5.2 a formal cost model for
ARES, based on which ARES can achieve the
minimum number of refined candidates and be
seamlessly integrated into the SJ procedure.
3. We use space-efficient synopses over stream time
series proposed in Section 4 to further prune the
candidate set in light of the cost model in Section 5.3,
and thus, reduce the computation cost of SJ.
4. We also discuss the batch processing and load
shedding for the similarity join over stream time
series in Sections 5.5 and 5.6, respectively.
In addition, Section 2 briefly overviews the previous
work on SJ in static databases as well as the related similarity
search problem. Section 3 formally defines our SJ problem
over multiple stream time series. Section 4 presents the data
structures of the synopses to summarize stream time series
and help efficient SJ processing. Section 6 illustrates through
extensive experiments the query performance of our
proposed approach. Finally, Section 7 concludes the paper.
2 RELATED WORK
2.1 Similarity Join
As indicated before, existing works on SJ in static time-series
databases can be classified into two categories: SJ with and
without indexes. Specifically, the most related work [6], [33],
[20] in the first category is the spatial join, which constructs
spatial indexes, for example, R-tree [14] or spatial hash [20],
on (sub)sequences in time-series databases, and performs
the join operation on indexes. Specifically, Brinkhoff et al. [6]
build an R-tree on each database, traverse both R-trees in a
depth-first manner, and finally, obtain similar pairs in leaf
nodes. Huang et al. [33] improve the performance of the
spatial join by traversing R-trees in a breadth-first manner.
Lo and Ravishankar [20] spatially hash data into buckets for
both databases and retrieve as candidates those pairs that
fall into the same buckets. In order to answer queries with
range predicates, they map each data item from one of the two
databases into multiple buckets.
The second category, SJ without indexes, performs the
well-known NLJ, which exhaustively computes the distance
for each pair of data from two databases and reports the
result if they are similar. In particular, each time, NLJ loads
one disk page from a database and joins this page with
every page in the other database. Note that NLJ is a generic
brute-force approach, which can be applied to join with any
predicate. Clearly, these methods for SJ over static databases
usually incur large memory consumption for indexes, high
processing cost, or an inability to adapt to changing data,
and thus cannot be applied directly to SJ processing in the
stream scenario.
To the best of our knowledge, there is no previous work
on the SJ problem over multiple stream time series, which
involves the unique characteristics of both time series and
stream processing. In particular, a time series is a sequence of
ordered data values, usually of great length (i.e., high
dimensionality, for example, 128), which makes indexing
and querying inefficient due to the dimensionality curse
[32], [4], [3]. Moreover, in the stream scenario, SJ query
processing has its own constraints, such as limited memory,
fast arrival rates, and time-varying data. It is therefore
desirable to design SJ techniques that achieve small memory
consumption, low processing cost, and high query accuracy,
while remaining adaptive to the stream time-series data. All
these requirements are challenging problems that we need
to solve in order to perform efficient SJ over multiple
stream time series.
In the data stream literature, there are some proposals to
output the result of the equality join between two data
streams (e.g., XJoin [28], hash merge join (HMJ) [22], and rate-
based progressive join (RPJ) [27]). Specifically, they use a hash
function to map join attributes of each stream data into
buckets and perform the hash join on data from pairwise
buckets of two streams. In these approaches, the focus is on
load shedding buckets to disk when memory is full and on
outputting the join result as early as possible. Furthermore,
they consider only a small number (e.g., 1 or 2) of join
attributes (i.e., with low dimensionality). In contrast, our SJ
processing over multiple stream time series is accomplished
entirely in memory, and the dimensionality of each
subsequence is usually very high, which raises indexing
and query efficiency issues that cannot be handled by
stream joins.
2.2 Similarity Search
In this section, we briefly review previous work on
similarity search, a research problem very close to SJ. The
similarity search problem is one of the most fundamental
problems in time-series databases and is involved in many
applications, such as multimedia retrieval, data mining, and
Web search. In particular, a similarity query retrieves all the
(sub)sequences in the database that are similar to a
user-specified query time series, where the similarity between
two series is defined by a distance function, for example,
Lp-norm [34], Dynamic Time Warping (DTW) [5], Longest
Common Subsequence (LCSS) [29], and Edit distance with Real
Penalty (ERP) [10]. In this paper, we consider the Euclidean
distance (i.e., L2-norm) as our similarity measure, which has
been widely used in many applications [1], [13].
Agrawal et al. [1] first proposed the whole matching in the
static sequence database, where all the data sequences are of
the same length. Specifically, they reduce the dimensionality
of the entire data sequences from n to d (d ≪ n) by
applying a dimensionality reduction technique, Discrete
Fourier Transform (DFT), and insert the reduced data into a
d-dimensional R-tree [14]. Given any query time series Q of
the same length n and similarity threshold ε, we transform
Q into a d-dimensional query point q using DFT similarly,
issue a range query centered at q with radius ε on the R-tree
index, and finally, refine the candidates returned from the
range query by checking their real distances to Q. It has been
proved that the candidate set resulting from the DFT
reduction does not introduce any false dismissals (actual
answers to the query that are, however, absent from the
candidate set).
Faloutsos et al. [13] later proposed the subsequence
matching problem in the time-series database, where data
and query time series can have different lengths. Given
a query time series Q of arbitrary length n, in order to
retrieve subsequences of length n that ε-match Q, the
proposed method, namely FRM for brevity, preprocesses
each time series in the database by extracting sliding
windows of size w (assuming that n = s·w, for a positive
integer s), where w is the minimum possible value of n.
Then, FRM reduces the dimensionality of each sliding
window from w to f using DFT (f ≪ w) and indexes the
reduced series in an R-tree [14]. To answer the range query
with query series Q, FRM partitions Q into s disjoint
windows of equal length w (since n = s·w), converts them
into s f-dimensional query points q_1, q_2, ..., and q_s, respectively,
using DFT, and issues a range query on the R-tree
centered at each q_i with a smaller radius ε/√s, for each 1 ≤ i ≤ s,
whose results are finally refined by checking their real
Euclidean distances to Q. According to the lower bounding
lemma [13], since the distance between any two reduced
data using any dimensionality reduction technique (e.g.,
DFT) is never greater than that between the original data, it
is guaranteed that no false dismissals are introduced in the
query results.
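The FRM search strategy just described can be sketched as follows, using PAA with f = 1 (the reduction this paper adopts later) instead of DFT so the example stays self-contained; the √w factor in the reduced distance is the standard PAA lower bound, and the window size, threshold, and toy series are assumptions of this sketch.

```python
import math

def euclid(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def paa1(window):
    # PAA with f = 1: a window of size w is reduced to its mean value
    return sum(window) / len(window)

def frm_candidates(T, Q, w, eps):
    # FRM: split Q into s disjoint windows, reduce each to a 1-d point, and
    # range-search the reduced sliding windows of T with radius eps/sqrt(s).
    s = len(Q) // w
    n = s * w
    qs = [paa1(Q[i * w:(i + 1) * w]) for i in range(s)]
    pts = [paa1(T[j:j + w]) for j in range(len(T) - w + 1)]
    cand = set()
    for i, qi in enumerate(qs):
        for j, pj in enumerate(pts):
            # sqrt(w)*|qi - pj| lower-bounds the distance of the two windows
            if math.sqrt(w) * abs(qi - pj) <= eps / math.sqrt(s):
                o = j - i * w      # start offset of the full candidate subsequence
                if 0 <= o <= len(T) - n:
                    cand.add(o)
    return cand, n

# Toy check: every true eps-match must appear among the candidates.
T = [math.sin(0.3 * t) for t in range(60)]
Q = [math.sin(0.3 * t) + 0.01 for t in range(8)]
cand, n = frm_candidates(T, Q, w=4, eps=0.5)
answers = {o for o in range(len(T) - n + 1) if euclid(T[o:o + n], Q) <= 0.5}
assert answers <= cand   # no false dismissals
```

The final assertion is exactly the lower-bounding guarantee: if a subsequence ε-matches Q, at least one of its aligned windows must fall within ε/√s of the corresponding query window, so the union of the s range queries never misses an answer.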
In contrast to FRM, the duality-based method (Dual) [24]
extracts disjoint windows of size w from the data time series
and sliding windows of size w from the query time series. Moon
et al. [23] have integrated both FRM and Dual methods into
the framework for the general match, by introducing the
concept of J-disjoint windows and J-sliding windows. Note,
however, that FRM can generate much fewer candidates
than Dual and the general match [23]. Moreover, Faloutsos et al.
[13] use MBRs to group data converted from consecutive
sliding windows in order to reduce the I/O cost. However,
since our SJ processing is performed in memory and no I/O
is involved, throughout this paper, FRM always refers to FRM
without MBRs.
In the literature of stream time series, Liu and Ferhatosmanoglu
[19] build VA-Stream or VA+-Stream to improve
the query performance of a linear scan. Kontaki and
Papadopoulos [16] construct an R-tree on the reduced series
with a deferred update policy to facilitate the similarity search.
Bulut and Singh [7] use multilevel DWT to represent stream
data and monitor pattern queries. Lian et al. [18] propose a
multiscale segment mean (MSM) representation for subse-
quences to detect static patterns over stream time series.
However, applying these methods directly to process
stream SJ queries faces a scalability problem, in terms of
either time or space. Sakurai et al. [25] propose the SPRING
approach to monitor stream time-series data and find
subsequences that are similar to a given query sequence
under the DTW measure. In contrast, our work studies the
stream SJ problem under the Euclidean distance.
3 PROBLEM DEFINITION
In this section, we formally define the problem of SJ over
multiple stream time series and illustrate its general framework.
Assume that we have m stream time series T_1, T_2, ...,
and T_m, which are synchronized in the sense that new data
items of all series arrive at the same time stamp. For
each series T_i, we keep the most recent W data items
T_i[t−W+1 : t] in memory, where t is the current time stamp.
When a new data item, say T_i[t+1], arrives at the next time
stamp t+1, the oldest item T_i[t−W+1] expires and is thus
evicted from memory. With this sliding window model
for each stream time series, an SJ query continuously outputs
similar pairs of subsequences ⟨S_i, S_j⟩ of length n from any
two series T_i and T_j (for i, j ∈ [1, m]), respectively, such that
dist(S_i, S_j) ≤ ε, where dist(·, ·) is a distance function
between two series and ε the similarity threshold. As
mentioned before, there are many distance functions to
measure the similarity between two series, such as Lp-norm
[34], DTW [5], and ERP [10]. In this paper, we use the
Euclidean distance (i.e., L2-norm), which has been widely
used in many applications including financial, marketing,
or production data analysis, and scientific databases (e.g.,
with sensor data) [1], [13], and we leave the discussion of
other measures as future work.
Fig. 1 illustrates our stream SJ scenario. In particular,
each stream time series T_i maintains a space-efficient
synopsis Syn_i that can facilitate a fast SJ. Whenever a series
T_i receives an insertion or expunges an old data item, the
synopsis Syn_i of T_i is incrementally updated. In the case
where T_i obtains a new subsequence, say S_new, we join S_new
with every other stream time series T_j with the help of Syn_j,
where 1 ≤ j ≤ m, and report the matching pairs as the join result.
Fig. 2 illustrates the general framework of the
SJ algorithm, which incrementally outputs the SJ result upon
an insertion T_i[t+1] to series T_i. Specifically, due to the new
insertion, we first obtain a new subsequence S_new in T_i and
update the synopsis Syn_i of T_i accordingly (lines 1 and 2).
Then, for each stream time series T_j (1 ≤ j ≤ m), we output all
subsequences in T_j that ε-match S_new (lines 3-6). In
Fig. 1. Illustration of SJ over stream time series.
Fig. 2. The general framework for SJ.
particular, we utilize a novel approach, ARES, to obtain a
candidate set cand containing subsequences in T_j that are
similar to S_new (line 4). Since procedure ARES is based on a
formal cost model that we develop, the number of
candidates returned is minimal, compared with the FRM
approach [13] used in static databases. Next, in procedure
Synopsis_Pruning, we further prune candidates in cand
with the help of synopsis Syn_j, resulting in a new candidate
set cand′′ (line 5). Finally, candidates in cand′′ are refined by
checking their real Euclidean distances to S_new and output
(line 6). In the following sections, we first illustrate the data
structures of the synopsis and then discuss the detailed
procedures ARES and Synopsis_Pruning. Fig. 3 summarizes
the commonly used symbols in this paper.
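The per-insertion framework of Fig. 2 can be mirrored in the following skeleton. The procedures `ares` and `synopsis_pruning` are passed in as parameters and are given trivial placeholder implementations here (return everything, prune nothing); the real versions are developed in Section 5, and all names and the toy streams are assumptions of this sketch.

```python
import math

def euclid(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def sj_on_insertion(streams, i, n, eps,
                    ares, synopsis_pruning, update_synopsis):
    # Sketch of Fig. 2: a new item has just been appended to streams[i];
    # emit all eps-matching pairs involving the newest subsequence.
    s_new = streams[i][-n:]                            # line 1: new subsequence
    update_synopsis(i, streams[i])                     # line 2: maintain Syn_i
    result = []
    for j in range(len(streams)):                      # lines 3-6
        if j == i:
            continue
        cand = ares(j, s_new, eps)                     # line 4: candidate offsets
        cand2 = synopsis_pruning(j, cand, s_new, eps)  # line 5: prune further
        for o in cand2:                                # line 6: refine and output
            sub = streams[j][o:o + n]
            if len(sub) == n and euclid(sub, s_new) <= eps:
                result.append((j, o))
    return result

# Placeholders: ARES returns all offsets, pruning is the identity,
# and no synopsis is actually maintained.
streams = [[0.0, 1.0, 2.0, 3.0], [0.9, 2.1, 3.0, 4.0]]
out = sj_on_insertion(streams, 0, n=3, eps=0.5,
                      ares=lambda j, q, e: range(len(streams[j]) - 2),
                      synopsis_pruning=lambda j, c, q, e: c,
                      update_synopsis=lambda i, t: None)
print(out)  # -> [(1, 0)]
```

Only line 6 is exact refinement; everything before it may overestimate the result but must never miss a true match.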
4 DATA STRUCTURES OF SYNOPSIS
In this section, we discuss the synopsis Syn_i that we
maintain for each stream time series T_i. Fig. 4 illustrates the
data structure of Syn_i for the stream time series T_i, which is
similar to HistogramBlooms [21], consisting of two parts: an
equi-width histogram and bloom filters. In particular, the
histogram of Syn_i has h cells that are used to summarize the
f-dimensional data converted from sliding windows of size w in
series T_i (f ≪ w), using any dimensionality reduction
technique. In contrast to a normal histogram, however,
for the kth cell (1 ≤ k ≤ h), we keep not only the frequency
fre_ik of data in the cell but also a bloom filter BF_ik with a bits.
Bloom filters have been widely used in many applications
[21] to check the existence of values. Specifically, a bloom
filter BF_ik is a bit vector initialized with values 0. A
position in BF_ik is set to value 1 only if there exists at least
one reduced data item in the kth cell such that the start offset of
its sliding window is hashed into this position by a
hash function H. Furthermore, we also store the start offsets of
sliding windows in Syn_i, which are pointed to by the positions
they are hashed into.
As an example, in Fig. 4, assume that we have a sliding
window T_i[st : st+w−1] of size w from T_i, which is
summarized in synopsis Syn_i as follows: First, we transform
it into an f-dimensional point using any dimensionality
reduction method. Then, we find the cell of the histogram
(e.g., the first one) into which this point falls, and increase
its frequency (i.e., fre_i1) by 1. Next, we update the bloom
filter BF_i1 by setting its third position to 1 (assuming that
H(st) = 3) and storing the start offset st of the sliding
window T_i[st : st+w−1] in the cell, pointed to by
the third position in BF_i1.
Note that there are many dimensionality reduction
techniques, such as Singular Value Decomposition (SVD)
[17], DFT [1], Discrete Wavelet Transform (DWT) [9], Piecewise
Aggregate Approximation (PAA) [34], Adaptive Piecewise
Constant Approximation (APCA) [15], Chebyshev Polynomials
(CP) [8], and Piecewise Linear Approximation (PLA) [11]. Since
our SJ processing requires small memory consumption and
low processing cost, any reduction approach that satisfies
these two requirements can be applied in our synopsis. In
this paper, we simply use PAA as our dimensionality
reduction method, since it can be computed incrementally in
the stream environment without consuming extra space. In
particular, PAA takes the mean of the values within each
sliding window of size w, reducing the dimensionality
from w to 1 (i.e., f = 1). Thus, throughout this paper, we
assume that synopsis Syn_i always contains a 1-dimensional
histogram (i.e., f = 1). Note that, like many other
dimensionality reduction techniques, the query efficiency
on the PAA-reduced data can be improved by increasing
the reduced dimensionality. In the case of PAA, this can be
achieved by specifying a smaller sliding window size w
(note: not by increasing f). For other reduction methods, a larger
f (> 1) value can be used, and our proposed approaches
can be applied on the data structure with an arbitrary f value,
as will be discussed later in Section 5.2.
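PAA with f = 1 and its O(1) incremental maintenance can be sketched as follows. The √w factor in the lower-bounding check is the standard PAA bound (the paper does not spell it out here), and the toy series is an assumption of this sketch.

```python
import math

def paa1(window):
    # PAA with f = 1: a window of size w is reduced to its mean
    return sum(window) / len(window)

def paa1_shift(mean, oldest, newest, w):
    # Incremental PAA: O(1) update when the window slides by one item
    return mean + (newest - oldest) / w

# Sliding a window over a toy series: incremental PAA equals recomputation
series = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
w = 4
mean = paa1(series[0:w])
for t in range(1, len(series) - w + 1):
    mean = paa1_shift(mean, series[t - 1], series[t + w - 1], w)
    assert abs(mean - paa1(series[t:t + w])) < 1e-9

# Lower-bounding property: sqrt(w)*|mean(x) - mean(y)| <= euclid(x, y)
x, y = series[0:w], series[4:8]
lb = math.sqrt(w) * abs(paa1(x) - paa1(y))
d = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
assert lb <= d
```

The constant-time shift update is what makes PAA attractive for the synopsis: each arriving item costs one subtraction, one addition, and one division, regardless of w.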
Memory size. The total memory consumption of the m
synopses in our SJ scenario is m · (h · ⌈log₂(W−w+1)⌉ +
h · ⌈log₂ a⌉ + (W−w+1) · ⌈log₂(W−w+1)⌉) bytes, where m is
the number of stream time series, h · ⌈log₂(W−w+1)⌉ is the
space for the frequencies in each histogram, h · ⌈log₂ a⌉ is the
space for the bloom filters in each histogram, and
⌈log₂(W−w+1)⌉ is the space for the start offset of each
window of size w in a series. As an example, assume that
a = 2⁸, h = 2⁶, W = 2¹⁰, and w = 2⁸. With a total of
16 MB (2²⁴ bytes) of available memory, our SJ procedure
can retain synopses for about 2,000 stream time series of
length 1,024, which is very space efficient. We later discuss
load shedding, which further reduces the required memory
size by discarding start offsets.
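Plugging the example numbers into the formula above confirms the claimed capacity; this sketch simply evaluates the expression as stated.

```python
import math

def synopsis_bytes(h, a, W, w):
    # Per-stream synopsis size from the formula above:
    # frequencies + bloom filters + start offsets
    off = math.ceil(math.log2(W - w + 1))
    return h * off + h * math.ceil(math.log2(a)) + (W - w + 1) * off

per_stream = synopsis_bytes(h=2**6, a=2**8, W=2**10, w=2**8)
streams = (16 * 2**20) // per_stream
print(per_stream, streams)  # -> 8842 1897, i.e., about 2,000 streams in 16 MB
```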
Incremental updates of the synopsis. Since we use an
incremental dimensionality reduction technique, that is,
incremental PAA, the update of the synopsis is very efficient.
In particular, when a new data item T_i[t+1] arrives, we
convert the most recent sliding window T_i[t−w+2 : t+1] of
size w into a 1-dimensional point using incremental PAA.
Then, we insert this point into a cell (e.g., the kth cell) of the
histogram; that is, we increase its frequency fre_ik by 1, set
the H(t−w+2)th position in BF_ik to 1 with the hash function
H, and store the start offset t−w+2 of the sliding window
T_i[t−w+2 : t+1], which is pointed to by this position.
For an expired sliding window, say T_i[t−W+1 : t−W+w],
we access the H(t−W+1)th positions of the h bloom
filters until the start offset t−W+1 is found and removed.
Fig. 3. Meanings of symbols used.
Fig. 4. Synopsis structure for stream time series.
The corresponding frequency is decreased by 1, and the
position in the bloom filter is set to 0 if no other start offsets
are mapped to it; otherwise, it remains 1.
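The synopsis maintenance just described can be sketched as follows. The cell boundaries, the hash function (Python's built-in `hash` modulo a), the per-position offset lists used to resolve collisions, and passing the PAA value to `expire` (rather than scanning all h filters) are simplifying assumptions of this sketch.

```python
class Synopsis:
    # Equi-width histogram over 1-d PAA values; each cell keeps a frequency
    # and an a-bit bloom filter whose positions point to lists of start offsets.
    def __init__(self, h, a, lo, hi):
        self.h, self.a, self.lo, self.hi = h, a, lo, hi
        self.fre = [0] * h
        self.bf = [[0] * a for _ in range(h)]             # bloom-filter bits
        self.offs = [[[] for _ in range(a)] for _ in range(h)]

    def _cell(self, v):
        k = int((v - self.lo) / (self.hi - self.lo) * self.h)
        return min(max(k, 0), self.h - 1)

    def insert(self, paa_value, start_offset):
        k = self._cell(paa_value)
        p = hash(start_offset) % self.a                   # hash function H
        self.fre[k] += 1
        self.bf[k][p] = 1
        self.offs[k][p].append(start_offset)

    def expire(self, paa_value, start_offset):
        k = self._cell(paa_value)
        p = hash(start_offset) % self.a
        self.offs[k][p].remove(start_offset)
        self.fre[k] -= 1
        if not self.offs[k][p]:            # no other start offsets map here
            self.bf[k][p] = 0

syn = Synopsis(h=8, a=16, lo=0.0, hi=1.0)
syn.insert(0.31, 5)
syn.insert(0.33, 9)     # falls into the same cell
syn.expire(0.31, 5)
print(syn.fre)  # -> [0, 0, 1, 0, 0, 0, 0, 0]
```

Insertion and expiry each touch one cell and one bloom-filter position, matching the constant-cost update described above.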
5 SJ OVER MULTIPLE STREAM TIME SERIES
In this section, we discuss SJ processing, which outputs all
similar pairs of subsequences from stream time series without
any false dismissals. As mentioned in Fig. 2, SJ is executed
incrementally as follows: Whenever a stream time series T_i
receives a new data item T_i[t+1], SJ obtains and then
outputs all subsequences from the stream time series T_j
(1 ≤ j ≤ m) that ε-match the new subsequence
S_new = T_i[t−n+2 : t+1] from T_i, where n = s·w.
Due to the high dimensionality n (e.g., n = 128) of the
new subsequence S_new, most indexes on high-dimensional
data fail to show good query performance, even compared
with a linear scan. Previous work on tackling this problem
usually reduces the dimensionality of series before indexing
them. In particular, Agrawal et al. [1] reduce the dimensionality
of the entire time series directly from n to d (d ≪ n),
whereas Faloutsos et al. [13] divide each subsequence of
length n into s disjoint windows (segments) of size w
(n = s·w) and reduce the dimensionality of each segment
from w to f (f ≪ w). Note that the former approach requires
a large value of d for long subsequences (i.e., with large n), so
as to obtain a candidate set containing a small number of false
positives (candidates that are not actual answers).
However, the resulting indexes, such as an R-tree [14] or grid,
even on the reduced d-dimensional data, are not efficient
(due to the dimensionality curse), in terms of
both memory consumption and query performance, making
them unsuitable for stream processing. On the other hand, if
we consider a fixed memory size available for the reduced
data, the latter approach ([13], FRM) has a much lower
dimensionality f, since d = f·s, indicating that f = d/s. Thus,
FRM can construct an f-dimensional index with s times
lower dimensionality than the former method. However,
FRM uses the same radius value for all range queries,
regardless of data distributions, and is thus not adaptive
to SJ over stream time series, where data distributions
continuously change.
Therefore, in the next section, we propose the ARES
approach, which is adaptive to the data distribution in
answering SJ over multiple stream time series. Most importantly,
Section 5.2 provides a cost model to formalize the
number of candidates obtained by ARES, based on which
an efficient and effective approach for SJ processing is
proposed to minimize the total number of refined candidates.
Section 5.3 utilizes the space-efficient synopses to prune
candidates of the SJ result such that the cost of SJ processing can
be further reduced. Section 5.4 illustrates parameter
tuning. Finally, Sections 5.5 and 5.6 discuss SJ batch
processing and load shedding, respectively.
5.1 Adaptive Radius-Based Search
Recall that, in FRM [13], we obtain s f-dimensional query
points q_1, q_2, ..., and q_s from the s disjoint windows of the query
series, respectively, and then issue s range queries centered
at each query point q_i, for 1 ≤ i ≤ s, all with the same query
radius ε/√s. Fig. 5a illustrates a simple example of FRM, where
s = 2 and f = 1. In particular, the query series is divided into
two disjoint windows that are reduced to two 1-dimensional
query points q_1 and q_2, respectively. Let point q be the
2-dimensional point (q_1, q_2), as illustrated in Fig. 5a. Here, the
actual data points we want to obtain are those candidates
within distance ε from point q (i.e., within the circle). FRM
performs the search as follows: For each query point q_i,
where i = 1 or 2, FRM issues a range query centered at q_i with
radius ε/√2, and obtains all candidates within the range (i.e.,
between the two vertical lines l_1 and l_2 for q_1, or between the
horizontal lines l_3 and l_4 for q_2), in the shaded region of
Fig. 5a. It has been proved [13] that FRM does not introduce
any false dismissals. However, in the case where all the points
fall into the region between lines l_1 and l_2 (i.e., the answer to
the range query of q_1), FRM has to access all these data, which is
inefficient and not acceptable for SJ stream processing.
Therefore, we propose a novel ARES approach, whose
intuition is illustrated in Fig. 5b, where different radii r_1 and
r_2 are used for query points q_1 and q_2, respectively. With too
many candidates close to the query point q_1, ARES
uses a smaller radius r_1 such that the total number of
candidates to be refined is reduced. Furthermore, in order
to guarantee no false dismissals, a larger radius r_2 for query
point q_2 is applied. Thus, ARES always chooses appropriate radii
for the range queries to minimize the computation cost (i.e., the
number of candidates) based on the data distribution, and is
efficient for SJ processing.
Before we discuss how to choose the adaptive radii, let
us first illustrate the outline of our ARES approach in Fig. 6.
Specifically, ARES first chooses radii r_i, for 1 ≤ i ≤ s,
adaptive to the data distribution (in contrast to the equal radii
in FRM; line 1) such that no false dismissals are introduced.
Details of choosing the radii will be described later in this
section. Next, it issues s range queries centered at q_i with
radii r_i, for all 1 ≤ i ≤ s (lines 2-4). Finally, the candidates
are refined and the actual answers are returned (line 5).
Fig. 5. An example of FRM and ARES. (a) FRM. (b) ARES.
Fig. 6. The procedure of ARES.
In the sequel, we illustrate how to choose the query radii
r_i (1 ≤ i ≤ s) without introducing false dismissals. First, we
give a theorem showing that, as long as ARES selects s radii
satisfying the condition Σ_{i=1}^{s} r_i² ≥ ε², no false dismissals
will be introduced.
Theorem 5.1. Given s query points q_1, ..., q_s reduced from the s
disjoint windows of size w in the query time series, and data
points p_1, ..., reduced from sliding windows of size w in a
data time series T, ARES guarantees no false dismissals in the
candidate set cand if the chosen query radii r_1, r_2, ..., and r_s
satisfy: Σ_{i=1}^{s} r_i² ≥ ε².
Proof. As illustrated in Fig. 7, assume that we have s
ordered lists L_1, ..., L_s, which, from the bottom up, contain
data points in ascending order of their distances from the s
query points q_1, ..., q_s, respectively. For each list L_i, we
retrieve all data points within radius r_i in the list
and add them to the candidate set cand. Now we prove,
by contradiction, that as long as Σ_{i=1}^{s} r_i² ≥ ε², there are
no false dismissals in cand. Assume that
there exists one subsequence S of T that does not belong
to the candidate set cand but is an actual answer
(i.e., S ε-matches the query series).
Without loss of generality, assume that subsequence S
contains s disjoint windows of size w, which are reduced to
data points p_1, ..., p_s, respectively. Since S is not in the
candidate set cand, each p_i must have a distance from q_i
greater than r_i, that is, dist(q_i, p_i) > r_i, for all 1 ≤ i ≤ s
(otherwise, subsequence S would be checked, and thus
included in the candidate set). Therefore, we have
Σ_{i=1}^{s} dist²(q_i, p_i) > Σ_{i=1}^{s} r_i². Furthermore, since it holds
that Σ_{i=1}^{s} r_i² ≥ ε², we have Σ_{i=1}^{s} dist²(q_i, p_i) > ε². By the
lower bounding lemma, the squared distance between S and the
query series is no less than Σ_{i=1}^{s} dist²(q_i, p_i) > ε², indicating
that subsequence S is not an actual answer, which
contradicts our assumption. Thus, the theorem
holds. □
Theorem 5.1 indicates that ARES can always give the exact answer and obtain all subsequences similar to a query series if it holds that $\sum_{i=1}^{s} r_i^2 \geq \varepsilon^2$. Note that there might exist many possible radius combinations satisfying the no-false-dismissal condition mentioned in Theorem 5.1. For example, FRM [13] is a special case of ARES, where $r_1 = r_2 = \cdots = r_s = \varepsilon / \sqrt{s}$. However, different selections of radii might result in different computation costs. In the next section, we provide a cost model to decide how to select a good radius combination such that the cost of retrieving candidates is minimized.
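The no-false-dismissal condition of Theorem 5.1 and FRM's special-case radii can be sanity-checked with a short sketch (a minimal illustration; the function names are ours, not from the paper):

```python
import math

def no_false_dismissals(radii, eps):
    # Theorem 5.1: the radii admit no false dismissals if
    # the sum of squared radii is at least eps^2.
    return sum(r * r for r in radii) >= eps * eps

def frm_radii(s, eps):
    # FRM as a special case of ARES: r_1 = ... = r_s = eps / sqrt(s).
    return [eps / math.sqrt(s)] * s

eps, s = 2.0, 4
print(no_false_dismissals(frm_radii(s, eps), eps))     # True: FRM satisfies the condition
print(no_false_dismissals([eps, 0.0, 0.0, 0.0], eps))  # True: all budget on one radius also works
print(no_false_dismissals([0.5] * s, eps))             # False: arbitrary small radii need not
```

Any combination passing this check is safe; the next section decides which of the many safe combinations is cheapest.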
5.2 Cost Model for ARES
As a second step, we propose a formal cost model for ARES, in terms of the number of refined candidates, based on which an efficient and effective approach is presented to achieve the near optimal (i.e., minimum) number of candidates. Specifically to our SJ problem shown in Fig. 2, we assume that each sliding window of size $w$ extracted from any stream time series $T_j$ $(1 \le j \le m)$ is converted into a one-dimensional point using the incremental PAA and inserted into its synopsis (consisting of a histogram and bloom filters). The query time series $S_{new}$ of length $l = s \cdot w$ in $T_i$ is partitioned into $s$ disjoint windows of size $w$ and then reduced to $s$ one-dimensional query points $q_1, q_2, \ldots,$ and $q_s$, respectively. Next, our goal is to select the appropriate radius $r_i$ in ARES (line 1 of Fig. 6) for each query point $q_i$ $(1 \le i \le s)$, under the constraint $\sum_{i=1}^{s} r_i^2 \geq \varepsilon^2$ of Theorem 5.1, such that the total number of candidates is minimized. Note that the case of reducing the dimensionality from $w$ to arbitrary $f$ can easily be extended, which will be discussed at the end of this subsection. In the sequel, we first consider the case where $f = 1$.
Let $g_i(r_i)$ be the number of data points (candidates) that are within $r_i$ distance from the query point $q_i$ in the one-dimensional space, and let $pdf_j(x)$ be the density function of data points converted from stream time series $T_j$, where $x$ is within the domain of the reduced data. Note that the density function $pdf_j(x)$ can be obtained from the synopsis (i.e., histogram) of $T_j$. We have the following equation with respect to $g_i(r_i)$ and $pdf_j(x)$:

$$g_i(r_i) = \int_{q_i - r_i}^{q_i + r_i} pdf_j(x)\, dx. \qquad (1)$$
Furthermore, since we have

$$|cand| = \sum_{i=1}^{s} g_i(r_i), \qquad (2)$$

where $|cand|$ is the total number of candidates, our goal is to minimize $|cand|$ under the constraint

$$\sum_{i=1}^{s} r_i^2 \geq \varepsilon^2. \qquad (3)$$
In the sequel, we first assume that data points are uniformly distributed within the search radius for each query point $q_i$, for each $1 \le i \le s$. We later extend it to the nonuniform case. Without loss of generality, in the uniform case, let $d_i$ be the density of data points within distance $\varepsilon$ from the query point $q_i$. Equation (1) is rewritten as

$$g_i(r_i) = \int_{q_i - r_i}^{q_i + r_i} pdf_j(x)\, dx = 2 d_i r_i, \qquad (4)$$

where $d_i$ is the density within distance $\varepsilon$ from the query point $q_i$.

Similarly, (2) can be rewritten as

$$|cand| = 2 \sum_{i=1}^{s} d_i r_i, \qquad (5)$$

where (3) holds.
Thus, we want to select appropriate search radii $r_1, \ldots, r_s$ in order to minimize the total number $|cand|$ of candidates in (5). Fig. 8 illustrates a simple example of
LIAN AND CHEN: EFFICIENT SIMILARITY JOIN OVER MULTIPLE STREAM TIME SERIES 1549
Fig. 7. Illustration for Theorem 5.1.
selecting query radii, where $s = 2$. Assuming that $d_1 = 1$ and $d_2 = 0.5$, we have $|cand| = 2r_1 + r_2$ by (5), which corresponds to the dash-dot line $2r_1 + r_2 - c = 0$ with intercept $c$ in the $r_1$-$r_2$ space. Furthermore, the circle in Fig. 8 centered at the origin with radius $\varepsilon$ in the first quadrant is the border of the shaded region satisfying the constraint $r_1^2 + r_2^2 \geq \varepsilon^2$. Intuitively, when we move the line $2r_1 + r_2 - c = 0$ upward in parallel by increasing parameter $c$ from zero, at some point the line will intersect with the shaded region for the first time. In other words, there exists one $\langle r_1, r_2 \rangle$-pair (i.e., $\langle 0, \varepsilon \rangle$) that satisfies the constraint $r_1^2 + r_2^2 \geq \varepsilon^2$. Note that, at that point, the number of candidates $|cand|$ is minimized. In general, in order to obtain the minimum number of candidates, we should set the radius of the query point that has the minimum density (i.e., $q_2$ in the example, since $d_2 < d_1$) to $\varepsilon$, and those of all other query points (i.e., $q_1$) to 0.
We summarize the above example in the following theorem.

Theorem 5.2. Assume that data points are uniformly distributed within distance $\varepsilon$ from each query point $q_i$ with density $d_i$, where $1 \le i \le s$. In order to obtain the minimum number of candidates, we always find a query point $q_j$ with the minimum density $d_j$ and set $r_i$ to 0 if $i \neq j$, and to $\varepsilon$ otherwise.
Proof. Without loss of generality, assume that

$$d_1 \leq d_2 \leq \cdots \leq d_s. \qquad (6)$$

We want to prove that the total number $|cand|$ of candidates is minimized when $r_1 = \varepsilon$ and $r_i = 0$ for all $2 \le i \le s$. In other words, we only need to prove the following inequality:

$$2 \sum_{i=1}^{s} d_i r_i \geq 2 d_1 \varepsilon, \qquad (7)$$

where $\sum_{i=1}^{s} r_i^2 \geq \varepsilon^2$.

We prove (7) as follows: Based on (6), it holds that

$$2 \sum_{i=1}^{s} d_i r_i \geq 2 d_1 \sum_{i=1}^{s} r_i. \qquad (8)$$

Furthermore, since we have

$$\sum_{i=1}^{s} r_i = \sqrt{\left( \sum_{i=1}^{s} r_i \right)^2} \geq \sqrt{\sum_{i=1}^{s} r_i^2}, \qquad (9)$$

by combining (8), (9), and (3), we exactly have (7), which completes the proof. □
Therefore, Theorem 5.2 shows that, under the assumption of uniform data distribution within the distance $\varepsilon$ from each query point, the minimum number of candidates is achieved when we select a query point $q_j$ with the lowest density among all query points and issue one range query centered at $q_j$ with a radius $\varepsilon$.
Now we compare FRM [13] with our theoretical solution under the uniform distribution assumption. Since FRM issues $s$ range queries with radii $\varepsilon / \sqrt{s}$ centered at $s$ different query points $q_i$ $(1 \le i \le s)$, the expected number of candidates for each range query with query point $q_i$ is given by $\frac{2\varepsilon}{\sqrt{s}} d_i$. Note that although there are duplicates among candidates retrieved from different range queries, FRM still needs to retrieve these duplicates. Thus, FRM expects to obtain $\frac{2\varepsilon}{\sqrt{s}} \sum_{i=1}^{s} d_i$ candidates in total for $s$ range queries, whose retrieval cost is given by $O(\frac{2\varepsilon}{\sqrt{s}} \sum_{i=1}^{s} d_i)$. In contrast, assuming that $d_1 \leq d_j$ for $j \neq 1$, our ARES approach only needs to retrieve $2\varepsilon d_1$ candidates with time complexity $O(2\varepsilon d_1)$. Therefore, even if all $d_i$ are equal to $d$, ARES can save the computation cost of retrieving as many as $2 d \varepsilon (\sqrt{s} - 1) = \frac{2\varepsilon}{\sqrt{s}} \sum_{i=1}^{s} d - 2\varepsilon d$ candidates, compared to FRM.
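Under the uniform-density model of (5), the expected candidate counts of FRM and ARES can be compared directly (a sketch under the uniform assumption; the densities used here are illustrative values, not from the paper):

```python
import math

def frm_candidates(densities, eps):
    # FRM: s range queries, each with radius eps / sqrt(s).
    s = len(densities)
    return (2 * eps / math.sqrt(s)) * sum(densities)

def ares_candidates(densities, eps):
    # ARES (Theorem 5.2): a single range query of radius eps
    # centered at the minimum-density query point.
    return 2 * eps * min(densities)

densities, eps = [1.0, 0.5, 0.8, 0.3], 2.0
print(frm_candidates(densities, eps))   # ≈ 5.2 expected candidates
print(ares_candidates(densities, eps))  # ≈ 1.2 expected candidates
```

The gap grows with $\sqrt{s}$, which is exactly the saving derived above for the equal-density case.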
Discussions on nonuniform data distribution. In the case where data are not uniformly distributed within the distance $\varepsilon$ from each query point $q_i$, the optimal solution might be as follows: First, for each possible value combination of $r_1, \ldots,$ and $r_s$ satisfying the constraint $\sum_{i=1}^{s} r_i^2 = \varepsilon^2$, we obtain the total number of candidates falling into these ranges and select the one combination that results in the smallest candidate set. This solution is globally optimal; however, its computation cost is rather high. Thus, we seek locally optimal solutions instead.

Specifically, we divide the problem of retrieving candidates that are within distance $\varepsilon$ from the query series into subproblems of finding candidates that are within intervals $[0, c], [c, 2c], \ldots,$ and $[\varepsilon - c, \varepsilon]$ distances from the query series, respectively, where $c \ll \varepsilon$. Here, $c$ is a small value such that data points within each interval can be assumed to be uniformly distributed (e.g., $c$ can be considered as the size of cells). Therefore, for each subproblem, we apply ARES as discussed above, which incurs much fewer candidates than FRM.

In particular, we initially compute the densities $d_1, d_2, \ldots,$ and $d_s$ of $s$ ranges centered at query points $q_1, q_2, \ldots,$ and $q_s$, respectively, with the same radius $c$; obtain a query point (e.g., $q_1$) with the lowest density (e.g., $d_1$); and issue a range query centered at $q_1$ with radius $c$. As a second step, we obtain those candidates that have distances from the query series within $[c, 2c]$. In particular, with the help of the histogram, we calculate the increase in the number of candidates incurred by enlarging the search radius for each query point. Note that, in our example, the increased number of candidates for query point $q_1$ is the number of candidates within $[c, 2c]$ distance from $q_1$, whereas that for query point $q_i$ $(i \neq 1)$ is the number of candidates within $\sqrt{(2c)^2 - c^2}$
1550 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 21, NO. 11, NOVEMBER 2009
Fig. 8. Choosing optimal search radii.
distance from $q_i$. Then, we select the query point with the minimum increase in candidates and perform the search with radius in $[c, 2c]$ for $q_1$, or up to $\sqrt{(2c)^2 - c^2}$ for any other $q_i$, where $i \neq 1$. This procedure repeats until the total search radius $\varepsilon$ is finally reached (i.e., (3) holds).
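The locally optimal procedure above amounts to a greedy loop: at each step, estimate the number of newly covered candidates each query point would incur by enlarging its radius to absorb the next interval of width $c$, and enlarge only the cheapest one, keeping the sum of squared radii on schedule. A minimal sketch follows; `histogram_count` stands in for the synopsis histogram lookup and is an assumption of this sketch:

```python
import math

def greedy_radii(query_points, eps, c, histogram_count):
    # histogram_count(q, lo, hi): number of data points whose distance
    # from query point q lies in (lo, hi] (estimated from the histogram).
    radii = [0.0] * len(query_points)
    total = c
    while total <= eps + 1e-12:
        sq = sum(r * r for r in radii)
        best_i, best_cost, best_r = None, None, None
        for i, q in enumerate(query_points):
            # Enlarge only r_i while keeping sum of squares equal to total^2;
            # for i != previously chosen point this yields sqrt((2c)^2 - c^2) etc.
            new_r = math.sqrt(max(total * total - (sq - radii[i] ** 2), 0.0))
            cost = histogram_count(q, radii[i], new_r)  # newly covered points
            if best_cost is None or cost < best_cost:
                best_i, best_cost, best_r = i, cost, new_r
        radii[best_i] = best_r
        total += c
    return radii

data = [0.1 * k for k in range(100)]  # toy reduced data points
def histogram_count(q, lo, hi):
    return sum(1 for p in data if lo < abs(p - q) <= hi)

radii = greedy_radii([2.0, 7.5], eps=1.0, c=0.5, histogram_count=histogram_count)
assert sum(r * r for r in radii) >= 1.0 - 1e-9  # constraint (3) is met
```

On this toy data, the budget ends up concentrated on the sparser query point, mirroring the density-driven choice of Theorem 5.2 within each interval.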
Discussions on the general case with arbitrary $f$. Up to now, we have assumed that the reduced dimensionality (via PAA) of each sliding window (extracted from stream time series) is equal to 1, that is, $f = 1$. Now we discuss the general case, where $f \geq 1$, using any dimensionality reduction method. In particular, for all the reduced $f$-dimensional data, we can construct a generalized synopsis on which our proposed ARES methodology discussed above can be applied with small modification. Specifically, the generalized synopsis structure is similar to that defined in Section 4. The only difference is that the histogram has $f$ dimensions, for $f > 1$ (rather than the 1D histogram in the case where $f = 1$). The bloom filters (with start offsets of sliding windows) corresponding to cells of the histogram are defined in the same way, as given in Section 4.

Given a query sequence $S_{new}$ of length $l$, we divide it into $s$ disjoint windows of equal size $w$ $(l = w \cdot s)$ and then reduce these disjoint windows to $s$ $f$-dimensional query points $q_1, q_2, \ldots,$ and $q_s$, respectively, where $f \geq 1$. Our ARES approach also needs to estimate the query radii $r_i$ for range queries with query points $q_i$ $(1 \le i \le s)$. In particular, we can obtain the density $d_i$ around each query point $q_i$ from the $f$-dimensional histogram in the constructed synopsis and decide the query radii similarly to Theorem 5.1. Note that the space cost of our constructed synopsis for the general case $(f \geq 1)$ is proportional to the number of cells in the synopsis (i.e., $\propto e^f$ cells with frequency information and bloom filters, where $e$ is the number of intervals that divide the data space on each dimension). Thus, compared to the case where $f = 1$, the number of cells (or bloom filters) increases for $f > 1$, requiring higher space cost. Moreover, the distance computation between two $f$-dimensional points requires higher time cost, proportional to $f$. On the other hand, the size of the resulting candidate set for a large $f$ value is expected to be small, due to the high pruning power obtained by using more reduced dimensions to prune. Thus, there is a trade-off between the space (constrained by the available memory size) and the pruning power in the stream environment.
5.3 Pruning with Synopsis
Although ARES can minimize the total number of candidates obtained from $s$ range queries without introducing false dismissals, there are still many false positives in the candidate set. Fig. 9 illustrates an example of such false positives. Suppose that we have a query subsequence $Q$ of length $2w$, which is divided into two disjoint windows and transformed to two one-dimensional query points $q_1$ and $q_2$, respectively, using PAA. Consider one subsequence $T_i[os_1 : os_1 + 2w - 1]$ of length $2w$ from $T_i$, which similarly has two converted one-dimensional points $p_1$ and $p_2$. Without loss of generality, assume that $p_1$ is within $r_1$ $(\le \varepsilon)$ distance from $q_1$, that is, subsequences $T_i[os_1 : os_1 + 2w - 1]$ and $Q$ form a candidate pair. According to ARES, we need to compute the euclidean distance between subsequences $T_i[os_1 : os_1 + 2w - 1]$ and $Q$ from scratch. However, we can save this computation cost if $p_2$ has a distance from $q_2$ greater than $\sqrt{\varepsilon^2 - dist^2(q_1, p_1)}$ (denoted simply by $\varepsilon'$ in the sequel), where $dist(q_1, p_1)$ is the euclidean distance between the two one-dimensional points $q_1$ and $p_1$. In other words, if it holds that $dist(q_2, p_2) > \varepsilon'$, then we have $dist^2(q_1, p_1) + dist^2(q_2, p_2) > \varepsilon^2$, which indicates that this candidate pair is a false positive, and thus, can safely be pruned. This motivates us to further refine the candidate set after ARES with the help of the synopsis (including histograms and bloom filters).
Recall that, during SJ processing, we hash the start offset of each sliding window in $T_i$ into a position of a bloom filter in $Syn_i$ using a hash function $H(\cdot)$ and set this position to value 1. As in Fig. 9, windows $T_i[os_1 : os_1 + w - 1]$ and $T_i[os_1 + w : os_1 + 2w - 1]$ are mapped to positions $H(os_1)$ and $H(os_1 + w)$ of bloom filters $BF$ and $BF'$, respectively. Since $dist(q_1, p_1) \le r_1$, the range query centered at $q_1$ with a radius $r_1$ would cover the cell containing $BF$. Thus, the pair $\langle T_i[os_1 : os_1 + 2w - 1], Q \rangle$ is a candidate. In order to further refine this pair (i.e., prune it if possible), we retrieve all bloom filters $BF'_1, BF'_2, \ldots$ in those cells that are within $\varepsilon'$ distance from query point $q_2$ and check whether or not their $H(os_1 + w)$th positions contain value 1. If all these positions are 0, then data point $p_2$ does not fall into any of these cells, which indicates that $dist(q_2, p_2) > \varepsilon'$, and the candidate pair is a false positive that can safely be removed; otherwise, it remains a candidate.

We observe that it is quite inefficient to perform bit checking in bloom filters one by one for each candidate pair. In order to speed up this procedure, we use a special family of hash functions $H(\cdot)$ for bloom filters, which satisfy the condition $H(os) = H(os + w) = \cdots = H(os + (s-1) \cdot w)$, where $os$ is a positive integer representing the start offset of sliding windows and $w$ the window size. One instance $H(\cdot)$ of this hash family is defined as $H(x) = (x + \beta) \bmod a$, where $\beta$ is a random number within $[1, a-1]$, $w = c \cdot a$, and $c$ is a positive integer.
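One member of this hash family can be sketched as follows ($\beta$ and the sizes below are illustrative; the construction only has to guarantee $H(os) = H(os + w)$ whenever the window size $w$ is a multiple of the bloom-filter size $a$):

```python
def make_hash(a, beta):
    # H(x) = (x + beta) mod a. If w is a multiple of a, then
    # H(os) = H(os + w) = ... = H(os + (s-1)*w) for every start offset os.
    return lambda x: (x + beta) % a

a, c = 8, 16
w = c * a            # window size chosen as a multiple of a
H = make_hash(a, beta=3)
os = 1000
print(all(H(os) == H(os + k * w) for k in range(4)))  # True
```

Because consecutive disjoint windows always land on the same bit position, the per-pair bit checks of the previous paragraph collapse into whole-vector bit operations, as shown next.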
Therefore, in Fig. 9, we have $H(os_1) = H(os_1 + w)$, that is, disjoint and consecutive windows (e.g., $T_i[os_1 : os_1 + w - 1]$ and $T_i[os_1 + w : os_1 + 2w - 1]$) have their start offsets mapped to the same position of (possibly different) bloom filters (i.e., $BF$ and $BF'$), which enables efficient bit operations among bit vectors. Without loss of generality, assume that $H(os_1) = H(os_1 + w) = 3$. As illustrated in Fig. 10a, the third ($H(os_1)$th) position in $BF$ is set to 1, where $BF$ corresponds to the cell that point $p_1$ from window $T_i[os_1 : os_1 + w - 1]$ falls into. Next, we retrieve
Fig. 9. Pruning heuristics.
those bloom filters $BF'_1, BF'_2, \ldots$ whose corresponding cells are within $\varepsilon'$ distance from query point $q_2$, and want to check whether or not the third ($H(os_1 + w)$th) bit in any of them is zero. Since the positions of all bit vectors that need to be verified are at the same place (i.e., the third position), instead of checking these positions one by one, we can perform bit operations efficiently. Specifically, we first use bit OR operations over all bloom filters $BF'_1, BF'_2, \ldots$ to obtain a vector $V_2 = \bigvee_k BF'_k$, and then calculate a vector $V$, which is the bit AND of $BF$ and $V_2$ (i.e., $V = BF \wedge V_2$). In Fig. 10a, since all bits at the third position of the $BF'_k$ are 0, the resulting vector $V$ also has 0 at this position, indicating that candidates pointed to by the same (third) location of $BF$ can successfully be pruned, since the start offset $os_1 + w$ is not mapped to any of those bloom filters $BF'_k$. On the other hand, however, as illustrated in Fig. 10b, if there exists a bloom filter, say $BF'_2$, having value 1 at the third position, then the third position of $V$ is also 1, implying that candidates pointed to by this (third) position of $BF$ cannot be pruned, since the start offset $os_1 + w$ may have been hashed into $BF'_2$.
Note that Fig. 10 only illustrates the example where $s = 2$. In the case where $s > 2$, when checking candidates in $BF$ obtained from the query of $q_1$, for each query point $q_j$ $(1 < j \le s)$, we need to retrieve those bloom filters in cells that are within $\varepsilon$ distance from $q_j$, and efficiently perform bit OR operations over all of them, resulting in a bit vector $V_j$. Then, we compute the bit vector $V$ using bit AND over $BF$ and all $V_j$, where $1 < j \le s$. That is, $V = BF \wedge V_2 \wedge V_3 \wedge \cdots \wedge V_s$. The meaning of bits in $V$ is similar to that in the case where $s = 2$.
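Representing each bloom filter as an integer makes the OR/AND combination above a handful of machine-word operations (a sketch; bit $i$ of the integer plays the role of position $i$ of the bit vector):

```python
from functools import reduce
from operator import or_

def or_of_filters(bloom_filters):
    # V_j = bitwise OR of all bloom filters within eps of q_j.
    return reduce(or_, bloom_filters, 0)

def surviving_positions(bf, vs):
    # V = BF AND V_2 AND ... AND V_s; a 1-bit in V marks a candidate
    # position that cannot be pruned.
    v = bf
    for vj in vs:
        v &= vj
    return v

BF = 1 << 3                                 # H(os_1) = 3: bit 3 set in BF
v2_a = or_of_filters([0b00001, 0b10000])    # no filter near q_2 has bit 3 set
v2_b = or_of_filters([0b00001, 1 << 3])     # BF'_2 has bit 3 set
print(surviving_positions(BF, [v2_a]))      # 0 -> candidate pruned (Fig. 10a)
print(surviving_positions(BF, [v2_b]))      # 8 -> cannot be pruned (Fig. 10b)
```

With $s > 2$, the list passed as `vs` simply contains one OR-combined vector per remaining query point.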
Fig. 11 illustrates the detailed procedure of pruning with the synopsis constructed for stream time series. In particular, we first compute the bit vector $V_j$, which is the bit OR of all bloom filters whose corresponding cells are within $\varepsilon$ distance from $q_j$, for each $1 \le j \le s$ (line 1). Then, we consider candidates from the range query of each query point $q_i$ separately (lines 2-8). In particular, for each $q_i$, we obtain the cells $e_i$ in the synopsis whose minimum distance from $q_i$ is within $r_i$, as well as the bloom filters in them (line 3). For each such bloom filter $BF$, we obtain a bit vector $V$, which is the bit AND of $BF$ with all $V_j$, where $1 \le j \le s$ and $j \neq i$ (line 4). Therefore, the resulting bit vector $V$ summarizes the remaining candidates after pruning, where value 1 in $V$ indicates that candidates pointed to by the same position in $BF$ cannot be pruned, and 0 implies that those candidates can be safely pruned. After pruning with bloom filters, for each remaining candidate with start offset $os_i$, we retrieve the $s$ cells $e_1, e_2, \ldots, e_i, \ldots,$ and $e_s$ that contain the $s$ start offsets $os_i - (i-1)w, os_i - (i-2)w, \ldots, os_i, \ldots,$ and $os_i + (s-i)w$ (corresponding to the consecutive and disjoint windows in the candidate subsequence), respectively (line 6). If any of these offsets (e.g., $os_i + (j-i)w$ in cell $e_j$) is absent (within the $\varepsilon$ range from $q_j$), or the summation of minimum (squared) distances $\sum_j mindist^2(q_j, e_j)$ is greater than $\varepsilon^2$, then the candidate is pruned; otherwise, it is inserted into the candidate set $cand''$ (lines 7 and 8).
5.4 Tuning Parameters
In this section, we discuss how to choose the parameters $a$ and $s$ at the beginning of SJ processing, which are the number of bits in each bloom filter and the number of disjoint windows in each subsequence of length $l$, respectively. Our goal is to achieve the lowest computation cost with appropriate values of $a$ and $s$.

Specifically, let $cand'$ be the candidate set after pruning with bloom filters returned from ARES (i.e., all candidates in line 5 of Fig. 11). As we know, each position in a bloom filter of $a$ bits is set to 1 with probability $1 - (1 - \frac{1}{a})^N$, where $N$ is the number of data items that have been hashed into the bloom filter so far. According to line 4 of Fig. 11, a position is set to 1 in bit vector $V$, which is the bit AND of $s$ vectors, with probability

$$\prod_{j=1, j \neq i}^{s} \left( 1 - \left(1 - \frac{1}{a}\right)^{g_j(\varepsilon)} \right) \cdot \left( 1 - \left(1 - \frac{1}{a}\right)^{g_i(r_i)} \right),$$

where $g_j(\varepsilon)$ (resp., $g_i(r_i)$) is the number of data points within $\varepsilon$ (resp., $r_i$) distance from query point $q_j$ (resp., $q_i$). Therefore, the number $|cand'|$ of candidates after pruning with the synopsis is expected to be

$$|cand'| = \sum_{i=1}^{s} g_i(r_i) \cdot \prod_{j=1, j \neq i}^{s} \left( 1 - \left(1 - \frac{1}{a}\right)^{g_j(\varepsilon)} \right) \cdot \left( 1 - \left(1 - \frac{1}{a}\right)^{g_i(r_i)} \right), \qquad (10)$$
which can be rewritten as

$$|cand'| = \sum_{i=1}^{s} g_i(r_i) \cdot \prod_{j=1}^{s} \left( 1 - \left(1 - \frac{1}{a}\right)^{g_j(\varepsilon)} \right) \cdot \frac{1 - \left(1 - \frac{1}{a}\right)^{g_i(r_i)}}{1 - \left(1 - \frac{1}{a}\right)^{g_i(\varepsilon)}}. \qquad (11)$$
We simplify (11) using Taylor's Theorem, $1 - (1 - \frac{1}{a})^x \approx \frac{x}{a}$ for $x \ll a$, and obtain
Fig. 10. Pruning with bit operations (candidate pair $\langle T_i[os_1 : os_1 + 2w - 1], Q \rangle$). (a) Successful case. (b) Unsuccessful case.
Fig. 11. Pruning procedure with synopsis.
$$|cand'| = \prod_{j=1}^{s} \left( 1 - \left(1 - \frac{1}{a}\right)^{g_j(\varepsilon)} \right) \cdot \sum_{i=1}^{s} \frac{g_i(r_i)^2}{g_i(\varepsilon)}, \qquad (12)$$
where (3) holds.
Therefore, (12) provides a formal equation to model the total number $|cand'|$ of candidates after pruning with bloom filters. At the beginning of SJ processing, we assume a uniform distribution of the underlying data, and thus, by (4), it holds that $g_i(r_i) = 2 d_i r_i$ and $g_i(\varepsilon) = 2 d_i \varepsilon$. Furthermore, assuming that $d_1$ is the lowest density among all $d_i$, we set $r_1 = \varepsilon$ and $r_i = 0$ for $i \neq 1$. We rewrite (12) as follows:

$$|cand'| = \prod_{j=1}^{s} \frac{2 d_j \varepsilon}{a} \cdot 2 d_1 \varepsilon, \qquad (13)$$
where (3) holds.
As in lines 6-8 of Fig. 11, for each candidate, the refinement cost is $O(s)$ in the worst case, since we need to find the $s - 1$ locations of start offsets and compute the summed distance over the $s$ windows. Thus, the total computation cost is $|cand'| \cdot s$, where $|cand'|$ is given in (13). We model the total cost, $Cost$, of lines 6-8 in Fig. 11 as

$$Cost = |cand'| \cdot s = 2 d_1 \varepsilon \cdot \prod_{j=1}^{s} \frac{2 d_j \varepsilon}{a} \cdot s, \qquad (14)$$
which can be rewritten as

$$\log Cost = \log(2 d_1 \varepsilon) + \sum_{j=1}^{s} \log(2 d_j \varepsilon) - s \log a + \log s. \qquad (15)$$
Without loss of generality, we ignore the first constant term. Let $\log(2 d_j \varepsilon)$ be a random number generated from a random variable $X$ with mean $\mu$ and variance $\sigma^2$, which can initially be obtained from the histograms in the synopsis. Therefore, (15) can be simplified as

$$Cost = \sum_{j=1}^{s} X_j - s \log a + \log s, \qquad (16)$$

where $X_j \sim \mathcal{N}(\mu, \sigma^2)$, and $Cost$ is now measured on the logarithmic scale.
Our goal is to minimize $Cost$ in (16) by selecting the best values of $s$ and $a$. Let $\sum_{j=1}^{s} X_j$ follow a random variable $Z$ with cumulative distribution function (CDF) $F(e)$. Assuming that the $X_j$ are random numbers independently drawn from a random variable $X$ with mean $\mu$ and variance $\sigma^2$, we apply the Central Limit Theorem (CLT) [31] and obtain $F(e) = Prob\{Z \leq e\} = \Phi\left(\frac{e - s\mu}{\sqrt{s}\,\sigma}\right)$, where $\Phi(x)$ is the CDF of the standard normal distribution.

We approximate $\Phi(x)$ with a linear function:

$$\Phi(x) \approx \frac{1}{2} \left( 1 + \sqrt{\frac{2}{\pi}}\, x \right). \qquad (17)$$
By combining (17), we have

$$F(e) = \frac{1}{2} \left( 1 + \sqrt{\frac{2}{\pi}} \cdot \frac{e - s\mu}{\sqrt{s}\,\sigma} \right). \qquad (18)$$
Therefore, the expected value $E(Z)$ of $Z$ is

$$E(Z) = \int_{e_{min}}^{e_{max}} e \cdot F'(e)\, de = \int_{e_{min}}^{e_{max}} \frac{e}{\sqrt{2\pi s}\,\sigma}\, de, \qquad (19)$$

where $F'(e)$ is the probability density function (PDF) of variable $Z$, and $e_{min}$ and $e_{max}$ are the minimum and maximum possible values of $Z$, respectively.
Equation (19) can be simplified as

$$E(Z) = \frac{1}{2\sqrt{2\pi s}\,\sigma} \left( e_{max}^2 - e_{min}^2 \right). \qquad (20)$$
Thus, substituting (20) into (16), we have

$$Cost = E(Z) - s \log a + \log s = \frac{1}{2\sqrt{2\pi s}\,\sigma} \left( e_{max}^2 - e_{min}^2 \right) - s \log a + \log s. \qquad (21)$$
In practice, $s$ should not be too large, since a large $s$ incurs a high cost of combining the results. Therefore, we choose a value of $s$ no greater than 8. After fixing $s$, the value of $a$ can easily be derived from (21) so as to achieve the minimum cost.
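With the quantities in (21) estimated from the histograms, candidate values of $s$ and $a$ can simply be swept (a sketch; $e_{min}$, $e_{max}$, and $\sigma$ are assumed inputs here, and in practice the feasible values of $a$ would be bounded by the available memory, since (21) alone always favors larger bloom filters):

```python
import math

def cost(s, a, sigma, e_min, e_max):
    # Eq. (21): E(Z) - s*log(a) + log(s), with E(Z) from Eq. (20).
    ez = (e_max ** 2 - e_min ** 2) / (2 * math.sqrt(2 * math.pi * s) * sigma)
    return ez - s * math.log(a) + math.log(s)

def tune(sigma, e_min, e_max, a_choices, s_max=8):
    # Sweep s in [1, s_max] and the memory-feasible bloom-filter sizes.
    return min(((s, a) for s in range(1, s_max + 1) for a in a_choices),
               key=lambda sa: cost(sa[0], sa[1], sigma, e_min, e_max))

s, a = tune(sigma=1.0, e_min=0.0, e_max=10.0, a_choices=[16, 32, 64])
```

For these illustrative inputs the sweep settles on the largest allowed $s$ and $a$, which is why the paper caps $s$ at 8 and lets memory cap $a$.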
5.5 Batch Processing
In this section, we discuss the batch processing of SJ over multiple stream time series. In reality, stream data may arrive at the system at different rates, or may be delayed and then suddenly arrive in a batch due to various reasons, such as network congestion. In such situations, we need to handle SJ queries over a number of (e.g., $t$) new data items (i.e., new subsequences $S^1_{new}, S^2_{new}, \ldots,$ and $S^t_{new}$) at the same time. One straightforward way to solve this problem is to invoke procedure SJ_Framework (described in Fig. 2) several times, considering the new subsequences separately as if they arrived one after another. This method requires invoking procedure ARES for each subsequence $S^j_{new}$ $(1 \le j \le t)$, which incurs a high search cost in finding similar pairs from stream time series. In contrast, batch processing can group consecutive new subsequences (e.g., from $S^1_{new}$ to $S^k_{new}$) that have temporal correlations (i.e., are close to each other), and handle SJ by invoking procedure ARES only once for each group.
Next, we illustrate the details of batch processing. Let $q^j_1, q^j_2, \ldots,$ and $q^j_s$ be the $s$ query points converted from the $s$ disjoint windows of subsequence $S^j_{new}$ $(1 \le j \le t)$, respectively. For simplicity, we only consider the assumption of uniform data distribution, that is, data are uniformly distributed within distance $\varepsilon$ from each query point $q^j_i$, for $1 \le i \le s$ and $1 \le j \le t$. Following the strategy of ARES, which selects a query point with the lowest density and issues a single range query with radius $\varepsilon$, we define a new term, the group density, with which we choose the $i$th query points $q^j_i$ of the subsequences in the group that have the lowest group density, and issue one range query for the entire group.
Specifically, assume that we have a group of $k$ subsequences $S^1_{new}, S^2_{new}, \ldots,$ and $S^k_{new}$. Let $g_i(q^1_i, \ldots, q^k_i, \varepsilon) = \sum_{j=1}^{k} g_i(q^j_i, \varepsilon)$, where $q^j_i$ is the $i$th query point of new subsequence $S^j_{new}$, and $g_i(q^j_i, \varepsilon)$ is the number of candidates obtained from a range query centered at $q^j_i$ with a radius $\varepsilon$. In other words, $g_i(q^1_i, \ldots, q^k_i, \varepsilon)$ is the total number of candidates
for the group, if we perform a single group range query using the $i$th query points. The group density $d_i(q^1_i, \ldots, q^k_i, \varepsilon)$ is defined as $\frac{g_i(q^1_i, \ldots, q^k_i, \varepsilon)}{k \cdot 2\varepsilon}$, that is, the number of group candidates divided by the total length of the intervals covered by the $k$ separate range queries (each with interval $2\varepsilon$). Let $d^{min}_i(k)$ be the lowest group density $d_i(q^1_i, \ldots, q^k_i, \varepsilon)$ in the group. We retrieve all data within the interval $[\min_{j=1}^{k}\{q^j_i\} - \varepsilon, \max_{j=1}^{k}\{q^j_i\} + \varepsilon]$ as group candidates and then refine them using the synopsis, similarly to Fig. 11, where the only required modification is to let $V_j$ be the bit OR of all bloom filters within the interval $[\min_{l=1}^{k}\{q^l_j\} - \varepsilon, \max_{l=1}^{k}\{q^l_j\} + \varepsilon]$ (in line 1 of Fig. 11). Finally, group candidates are checked by computing their real euclidean distances to $S^j_{new}$ $(1 \le j \le k)$.
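The group density and the single group range query above can be sketched as follows (a toy sketch; `count_in` stands in for the synopsis histogram, and the query-point values are illustrative):

```python
def group_density(points_i, eps, count_in):
    # d_i(q_i^1..q_i^k, eps) = (total candidates of k range queries) / (k * 2eps).
    g = sum(count_in(q - eps, q + eps) for q in points_i)
    return g / (len(points_i) * 2 * eps)

def group_interval(points_i, eps):
    # One range query covering all k individual eps-queries of the group.
    return min(points_i) - eps, max(points_i) + eps

data = sorted(0.37 * k % 10 for k in range(200))  # toy reduced data points
def count_in(lo, hi):
    return sum(1 for p in data if lo <= p <= hi)

# Pick the window index i whose group of query points has the lowest density.
groups = {1: [2.0, 2.2, 2.1], 2: [7.0, 7.4]}
eps = 0.5
best_i = min(groups, key=lambda i: group_density(groups[i], eps, count_in))
lo, hi = group_interval(groups[best_i], eps)
```

One scan of `[lo, hi]` then replaces the $k$ separate range queries for the chosen window index.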
The only issue that remains to be addressed is how to partition the new subsequences $S^1_{new}, S^2_{new}, \ldots,$ and $S^t_{new}$ into several groups. In particular, we need to decide the total number of groups as well as the group memberships. Our basic idea is to treat consecutive new subsequences as one group in an online fashion, since they tend to be close to each other. Without loss of generality, assume that we have already included $k$ consecutive subsequences $S^1_{new}, S^2_{new}, \ldots,$ and $S^k_{new}$ in a group. Then, we need to decide the membership of the $(k+1)$th subsequence $S^{k+1}_{new}$. In order to include the $(k+1)$th subsequence $S^{k+1}_{new}$ in the same group as the previous $k$ subsequences, we require that $d^{min}_i(k) \geq d^{min}_j(k+1)$. Note that, after including $S^{k+1}_{new}$, the lowest group density may be attained when we issue the query with the $j$th query points of the $(k+1)$ subsequences, instead of the $i$th ones of the $k$ subsequences. However, our grouping rationale is to add $S^{k+1}_{new}$ to the group only if the average number of candidates to be refined for each subsequence $S^j_{new}$ in the group does not increase. Otherwise, if $d^{min}_i(k) < d^{min}_j(k+1)$, we start a new group with $S^{k+1}_{new}$.
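The online grouping rule reduces to: keep appending the next subsequence while the lowest group density does not increase, otherwise open a new group. A minimal sketch (the `toy_density` function is a stand-in for the synopsis-based $d^{min}$ computation above):

```python
def partition_online(subsequences, min_group_density):
    # Greedy grouping: extend the current group while the lowest group
    # density does not increase; otherwise start a new group.
    groups, current = [], []
    for sub in subsequences:
        if not current:
            current = [sub]
            continue
        if min_group_density(current) >= min_group_density(current + [sub]):
            current.append(sub)
        else:
            groups.append(current)
            current = [sub]
    groups.append(current)
    return groups

# Toy density: groups of nearby values stay cheap; a far jump raises density.
def toy_density(group):
    return (max(group) - min(group) + 1.0) / len(group)

print(partition_online([1.0, 1.1, 1.2, 9.0, 9.1], toy_density))
# [[1.0, 1.1, 1.2], [9.0, 9.1]]
```

Temporally correlated subsequences thus end up in the same group and share one ARES invocation.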
5.6 Load Shedding
In previous sections, we have assumed that the memory is large enough to retain all synopses of the stream time series. Now we consider the case where such an assumption is violated. In particular, assume that the system requires $R$ extra memory. In other words, we have to load-shed (evict) data in the synopses with a total size of $R$ from the memory. Our basic idea of load shedding is to evict start offsets in the synopses, since they consume most of the memory. However, we need to cleverly discard these offsets in the synopses such that the accuracy of the SJ query remains high.

We propose three load shedding strategies for SJ over stream time series. First, for each synopsis $Syn_i$ of $T_i$ $(1 \le i \le m)$, we randomly evict $\frac{R}{m}$ start offsets. In particular, for the $k$th cell of the histogram in $Syn_i$ with frequency $fre_{ik}$, a total number of $\frac{R}{m} \cdot \frac{fre_{ik}}{\sum_j fre_{ij}}$ start offsets are randomly selected and removed. This method can immediately free the memory when the memory is full, however, without considering any query accuracy issues. The expected ratio of the number of retrieved answers to that of the actual answers is $1 - \frac{R/m}{W - w + 1}$, where $W - w + 1$ is the number of start offsets in $Syn_i$.
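The first strategy, random eviction proportional to cell frequencies, can be sketched as follows (hypothetical structures; a synopsis cell is represented simply as a list of start offsets here):

```python
import random

def shed_random(cells, budget, rng=random):
    # cells: one list of start offsets per histogram cell.
    # Evict `budget` offsets in total, proportionally to cell frequencies.
    total = sum(len(c) for c in cells)
    for cell in cells:
        k = round(budget * len(cell) / total)
        for _ in range(min(k, len(cell))):
            cell.pop(rng.randrange(len(cell)))  # uniform random eviction
    return cells

rng = random.Random(0)
cells = [list(range(0, 40)), list(range(40, 60)), list(range(60, 100))]
before = sum(len(c) for c in cells)
shed_random(cells, budget=10, rng=rng)
after = sum(len(c) for c in cells)  # 10 offsets evicted: 4 + 2 + 4
```

Eviction is immediate and cheap, but, as noted above, oblivious to query accuracy.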
The second approach always discards those start offsets that are expected to expire the earliest. In particular, we always evict offsets in the order $t - W + 1, t - W + 2, \ldots,$ and $t - W + \frac{R}{m}$. The rationale is that throwing away these start offsets would only cause inaccuracy of the SJ query within a short period in the near future. This method needs to find the start offsets one by one in the cells of the histogram, which incurs some processing cost before load shedding. However, the amortized SJ cost is the same as that without load shedding (since the removal of start offsets can be considered as expiration), and moreover, the query accuracy is high if load shedding occurs infrequently.
In contrast to the second strategy, the third one discards all start offsets in one cell at a time. In this case, we need to decide which cells to load-shed, in order to reduce the number of false dismissals, and thus, obtain SJ answers with high accuracy. Specifically, we formally model the score of a cell $e_k$ that can be shed using a probability $score(e_k)$ as

$$score(e_k) = Prob\left\{ \sum_{j=1}^{s} mindist^2(q_{ij}, e_k) > \varepsilon^2 \right\}, \qquad (22)$$

where $q_{ij}$ is the $j$th converted query point of the new subsequence from series $T_i$.
Intuitively, if the cell $e_k$ is far away from all query points $q_{ij}$, the probability that data in $e_k$ are not in the SJ result is high, that is, the score $score(e_k)$ is high. Without loss of generality, let $X_j$ $(1 \le j \le s)$ be a random variable following the distribution of the values $mindist^2(q_{ij}, e_k)$ in (22) over all series $T_i$ $(1 \le i \le m)$. Assume that $\mu_j$ and $\sigma^2_j$ are the mean and variance of variable $X_j$, respectively. According to the CLT, we have

$$score(e_k) = 1 - \Phi\left( \frac{\varepsilon^2 - \sum_{j=1}^{s} \mu_j}{\sqrt{\sum_{j=1}^{s} \sigma^2_j}} \right), \qquad (23)$$

where $\Phi(x)$ is the CDF of the standard normal distribution.
Note that, here, we use the statistics of query points $q_{ij}$ at the current time stamp as those in the near future. This is reasonable since query points at consecutive time stamps have close temporal correlations, that is, they tend to be close to each other. After estimating the scores with (23), our third strategy selects those cells that have the highest scores to load-shed, since they are very unlikely to affect the SJ result in the near future; moreover, they may already have expired even if a range query covers the discarded data.
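The score of (23) needs only the per-window means and variances of the $mindist^2$ values plus the normal CDF (a sketch; `statistics.NormalDist` supplies $\Phi$, and the inputs below are illustrative):

```python
from statistics import NormalDist

def shed_score(mu, var, eps):
    # Eq. (23): score(e_k) = 1 - Phi((eps^2 - sum mu_j) / sqrt(sum var_j)).
    z = (eps ** 2 - sum(mu)) / (sum(var) ** 0.5)
    return 1.0 - NormalDist().cdf(z)

# A cell far from the query points (large mindist^2 means) scores near 1 ...
far = shed_score(mu=[50.0, 60.0], var=[4.0, 4.0], eps=5.0)
# ... while a nearby cell scores near 0 and should be kept.
near = shed_score(mu=[1.0, 1.0], var=[1.0, 1.0], eps=5.0)
print(far > 0.99, near < 0.01)  # True True
```

Ranking cells by this score and shedding from the top implements the third strategy.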
6 EXPERIMENTAL EVALUATION
In this section, we illustrate through extensive experiments
the efficiency and effectiveness of our proposed approach
for the SJ query, using ARES and Synopsis Pruning
techniques (denoted by ARES+SP for brevity). Specifically,
in our experiments, we tested both real and synthetic data
sets, including sstock [30], randomwalk [15], sensor, and EEG
[2]. The sstock data set [30] contains the daily closing prices of 193 company stocks from late 1993 to early 1996; the
randomwalk data set [15] is synthetically generated using
the random walk model; the sensor data set contains
temperature time series collected from 54 sensors deployed
in the Intel Berkeley Research lab between 28th February
and 5th April 2004, which is available online at [http://
db.csail.mit.edu:80/labdata/labdata.html]; moreover, the
EEG data set includes intracranial electroencephalographic
(EEG) time series recorded from epilepsy patients during
the seizure-free interval from inside and outside the seizure
generating area, which can be found at [http://scitation.
aip.org/getpdf/servlet/GetPDFServlet?filetype=pdf&id=
PLEEE8000064000006061907000001&idtype=cvips]. In order
to simulate the SJ scenario, for each of the four data sets, we
concatenate time series of small length into longer ones of
length 204,800 for which data are assumed to continuously
arrive at the system in streams (i.e., stream time series). The
similarity threshold $\varepsilon$ in our stream SJ is set such that the query selectivity is around 0.1 percent (i.e., the size of
the result set divided by the total number of possible pairs).
Note that for a specific application, the choice of is
determined by experienced data analysts in that domain.
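The selectivity-driven choice of the similarity threshold can be sketched as follows: given a sample of pairwise distances, pick the smallest threshold whose selectivity reaches the target. This is a hypothetical calibration routine, not the authors' procedure:

```python
def calibrate_threshold(pair_distances, target_selectivity):
    """Return the smallest threshold eps such that the fraction of
    pairs with distance <= eps reaches the target selectivity."""
    ranked = sorted(pair_distances)
    k = max(1, round(target_selectivity * len(ranked)))
    return ranked[k - 1]

# With 10 sampled pair distances and a 10 percent target selectivity,
# the threshold is the smallest sampled distance.
dists = [0.2, 0.5, 0.9, 1.4, 2.0, 2.6, 3.1, 3.8, 4.4, 5.0]
print(calibrate_threshold(dists, 0.1))  # -> 0.2
```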
We measure the performance of SJ in terms of the total time,
which is the time the system takes to finish SJ processing
over new data at each time stamp. In particular, the total time
is defined as the summation of the filtering time (i.e., the time
to prune candidates) and the refinement time (i.e., (the number of
candidates) × (the unit time to refine one candidate pair)).
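The total-time measure just defined can be sketched as a simple formula; the function and cost values below are illustrative, not taken from the paper:

```python
def total_time(filtering_time, num_candidates, unit_refine_time):
    """Per-timestamp SJ cost: time spent pruning candidates plus the
    time to refine every candidate pair that survives filtering."""
    refinement_time = num_candidates * unit_refine_time
    return filtering_time + refinement_time

# A filter that admits fewer candidates shrinks the refinement term,
# which is why ARES tries to minimize the candidate count.
print(total_time(0.5, 1200, 0.001))  # -> 1.7
```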
In the sequel, we first evaluate the performance of SJ
with FRM and ARES+SP, which issue range queries with the
same and different radii, respectively. As a second step, we
demonstrate the efficiency and effectiveness of SJ with
synopsis pruning after ARES (i.e., ARES+SP), compared with
SJ processing with VA+-Stream [19]. Then, we present in
Section 6.3 the experimental results of SJ with batch
processing. Finally, we compare the query performance of
SJ applying three different load shedding techniques in
Section 6.4, in terms of the query accuracy. We conducted our
experiments on a 3.2-GHz Pentium 4 PC with 1 GB of
memory. All experimental results are averaged over 50 runs.
Fig. 12 summarizes the tested values of parameters in the
experiments as well as their default values in bold font.
6.1 Performance of FRM versus ARES+SP
In the first set of experiments, we compare the query
performance of the SJ using FRM with that using ARES+SP,
under different values of parameters w and n. In particular,
we evaluate the performance with the measure, total time,
which is the average time to process SJ over new data at
one time stamp. For example, assuming that there are
1,000 stream time series, at each time stamp, we obtain
1,000 new subsequences, each of which is joined with
1,000 series; the overall processing time is the total time.
Figs. 13a, 13b, 13c, and 13d illustrate the total time of SJ with
FRM and ARES+SP over both real and synthetic data sets,
including sstock, randomwalk, sensor, and EEG, respectively,
where w = 16, 32, 64, 128, and n, α, f, W, and m are set to
their default values (i.e., n = 256, α = 128, f = 64, W = 512,
and m = 1,000). When the window size w increases, the total
time of both approaches decreases. In the figures, we can see
that SJ with ARES+SP always outperforms that with FRM by
an order of magnitude. This is reasonable, since ARES is
based on the formal cost model that can minimize the
number of candidates, and moreover, the synopsis pruning
(SP) technique can utilize synopsis to further shrink the
candidate set from ARES. Therefore, ARES+SP has much
fewer candidates than FRM, which is confirmed by the
number on each column in Figs. 13a, 13b, 13c, and 13d.
LIAN AND CHEN: EFFICIENT SIMILARITY JOIN OVER MULTIPLE STREAM TIME SERIES 1555
Fig. 12. The parameter settings.
Fig. 13. Performance of FRM and ARES+SP (versus w). (a) sstock,
(b) randomwalk, (c) sensor, and (d) EEG.
Fig. 14. Performance of FRM and ARES+SP (versus n). (a) sstock,
(b) randomwalk, (c) sensor, and (d) EEG.
Figs. 14a, 14b, 14c, and 14d illustrate the performance of
SJ with FRM and that with ARES+SP over the four data sets,
by varying the length n of
subsequences from 64 to 512, where other parameters are
set to their default values. Note that for fair comparison,
here, we choose different values of the similarity threshold ε
with respect to n such that SJ queries have the same
selectivity (i.e., the output size divided by the maximum
possible join size). Similar to the previous experiment, the
total time of SJ with ARES+SP is less than that of SJ with
FRM by an order of magnitude, and SJ with ARES+SP has
much smaller candidate set than SJ with FRM. In a special
case where n = w = 64, that is, the entire query subsequence
is considered as one window, since ARES has the same
candidate set as FRM, the performance of ARES+SP is
similar to that of FRM. However, as indicated by the
number of candidates for ARES+SP, SP can still prune false
positives in the candidate set.
6.2 Performance of ARES+SP versus VA+-Stream
After confirming the efficiency and effectiveness of SJ with
ARES+SP compared with FRM, as a second step, we also
investigate the performance of SJ with ARES+SP and SJ
with VA+-Stream [19]. Specifically, we build an index
structure, VA+-Stream, for all subsequences of length n
extracted from each stream time series. When a new data
item arrives at a stream time series, we issue a similarity query
on the index to retrieve all series similar to the new
subsequence. Since each stream time series receives one new
subsequence at every time stamp, the total time is obtained
by summing up the time of m similarity searches in
VA+-Stream. As illustrated in Fig. 15, where m = 1,000
and n = 256, SJ processing with VA+-Stream requires as
many as 30 seconds to handle the new incoming data in
1,000 series at only one time stamp. In contrast, our
approach only needs a few seconds to process the pairwise
join among 1,000 stream time series. Specifically, Figs. 15a
and 15b compare our approach with SJ with VA+-Stream, by
varying parameter α (i.e., the number of bits in bloom filters)
from 32 to 512. Note that, since the experimental results of
the four tested data sets are similar, in this and subsequent
experiments, we will only present the results over two real/
synthetic data sets, sstock and randomwalk, due to the space
limit. When parameter α increases, the number of candidates
after synopsis pruning decreases. However, since a large α
results in longer bit vectors, it incurs more computation cost
to perform bit operations during SJ processing. Figs. 15c
and 15d illustrate the performance of SJ with ARES+SP and
VA+-Stream over sstock and randomwalk, respectively, using
different values of f. Since a large f indicates more accurate
range queries and more bloom filters, the number of
candidates decreases. On the other hand, however, the
total time increases because of the additional bit operations.
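The trade-off noted above (more bloom-filter bits prune more candidates, but each bitwise operation over the longer vector costs more) can be illustrated with a minimal bloom filter. This generic structure is only a sketch, not the paper's synopsis implementation:

```python
import hashlib

class BloomFilter:
    """Minimal bloom filter over a num_bits-wide bit vector."""
    def __init__(self, num_bits, num_hashes=3):
        self.m, self.k = num_bits, num_hashes
        self.bits = 0  # bit vector packed into a single integer

    def _positions(self, item):
        # derive k bit positions from independent salted hashes
        for i in range(self.k):
            h = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def might_contain(self, item):
        # no false negatives; false positives shrink as num_bits grows
        return all(self.bits >> p & 1 for p in self._positions(item))

bf = BloomFilter(num_bits=128)
bf.add("offset-42")
assert bf.might_contain("offset-42")
```

Growing `num_bits` lowers the false-positive rate for a fixed number of inserted items, but any AND/OR between two such filters must touch proportionally more bits, which matches the total-time behavior in Figs. 15a and 15b.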
Next, we test the scalability of SJ with ARES+SP,
compared with that of VA+-Stream. In particular, Figs. 16a,
16b and Figs. 16c, 16d illustrate the experimental results by
varying parameters W from 256 to 1,024 and m from 200 to
2,000, respectively. SJ with ARES+SP always outperforms
that with VA+-Stream by orders of magnitude, in terms of
the total time, which indicates the robustness of ARES+SP with
respect to these parameters.
6.3 Performance of Batch Processing
Next, we demonstrate the performance of the batch SJ
processing, compared with single SJ processing. Recall that,
for the batch SJ, instead of searching for similar subsequences
for each of the t new subsequences one by one, batch SJ processing
performs the searches for groups of subsequences, looking
up the synopsis only once for each group. Fig. 17 illustrates the
result of SJ with ARES+SP+batch and ARES+SP, where the
total time of SJ batch processing is defined as the amortized
processing time per time stamp, and other parameters are
set to the default values. Since ARES+SP+batch can save the
cost of pruning with synopsis for individual query
subsequences multiple times, it uses much less total time
than ARES+SP.
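The amortization behind batch processing can be sketched as a cost comparison: one synopsis lookup is shared by a group of t query subsequences instead of being repeated t times. The cost constants below are illustrative:

```python
def single_cost(t, lookup_cost, per_query_cost):
    # one synopsis lookup per query subsequence
    return t * (lookup_cost + per_query_cost)

def batch_cost(t, lookup_cost, per_query_cost):
    # one shared synopsis lookup for the whole group of t queries
    return lookup_cost + t * per_query_cost

# Amortized per-timestamp cost for a batch of t = 8 query subsequences:
t, lookup, per_query = 8, 1.0, 0.25
print(single_cost(t, lookup, per_query) / t)  # -> 1.25 per query
print(batch_cost(t, lookup, per_query) / t)   # -> 0.375 per query
```

The saving grows with t, since the fixed lookup cost is split across the whole group.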
6.4 Performance of Load Shedding
Finally, we illustrate the experimental result of SJ with load
shedding. Specifically, we test three methods proposed in
Fig. 15. Performance of ARES+SP versus VA+-Stream. (a) sstock,
(b) randomwalk, (c) sensor, and (d) EEG.
Fig. 16. Scalability of ARES+SP versus VA+-Stream. (a) sstock,
(b) randomwalk, (c) sensor, and (d) EEG.
Section 5.6. The first approach randomly selects a fraction of
the start offsets in each synopsis of stream time series T_i. The
second one evicts the same fraction of the earliest start offsets
in each series. The last method discards all start offsets in the
cells with the highest scores, which are unlikely to be in the SJ
result. Fig. 18a illustrates the accuracy of the SJ result within a
window of time stamps after load shedding over sstock. In
particular, we vary the load shedding
ratio from 10 to 50 percent, defined as the percentage of data
(start offsets) that are discarded in each stream time series, and
measure the accuracy of SJ by the percentage of false
dismissals in the final SJ result. We find that the query
accuracy of the first approach is very sensitive to the amount
of discarded data, whereas the second one is more robust and the
last one is always the best. Considering the shedding time (i.e.,
the time to discard data), however, as illustrated in Fig. 18b,
the first method requires the least time among the three. The
second one takes shedding time between those of the first and third
methods when less than 40 percent of the data are discarded. When
more than 40 percent of the data are evicted, the second method
requires the most time of all, since searching for specific offsets
is very costly. The results on randomwalk are similar and
omitted due to space limit.
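The three shedding strategies compared above can be sketched as follows. The score-based variant assumes each synopsis cell carries a precomputed "unlikely-to-join" score; all names and structures here are illustrative:

```python
import random

def shed_random(offsets, ratio, rng=random.Random(0)):
    """Randomly keep (1 - ratio) of the start offsets; fixed seed
    for reproducibility. Cheap, but accuracy degrades quickly."""
    keep = len(offsets) - int(ratio * len(offsets))
    return rng.sample(offsets, keep)

def shed_earliest(offsets, ratio):
    """Drop the earliest start offsets first."""
    k = int(ratio * len(offsets))
    return sorted(offsets)[k:]

def shed_by_score(cells, ratio):
    """Drop whole synopsis cells, highest 'unlikely-to-join' score
    first, until the shedding budget is exhausted."""
    total = sum(len(c["offsets"]) for c in cells)
    budget, kept = int(ratio * total), []
    for cell in sorted(cells, key=lambda c: c["score"], reverse=True):
        if budget >= len(cell["offsets"]):
            budget -= len(cell["offsets"])  # shed this entire cell
        else:
            kept.extend(cell["offsets"])
    return kept
```

This ordering mirrors the observed trade-off: random shedding is cheapest but least accurate, while the score-based method preserves accuracy best at a higher shedding cost.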
7 CONCLUSIONS
SJ in time-series databases plays an important role in many
applications. In this paper, we propose an efficient and
effective approach to incrementally perform SJ over multiple
stream time series. Specifically, we propose a novel ARES
approach, based on a formal cost model to minimize the
resulting number of SJ candidates, which, thus, adapts to
the stream data. Then, we integrate ARES into SJ processing
seamlessly and utilize space-efficient synopses constructed
over subsequences from stream time series to further prune
candidate pairs. The batch processing and load shedding
techniques are also discussed. Extensive experiments have
demonstrated the efficiency and effectiveness of our
proposed approach to answering SJ queries over multiple
stream time series.
ACKNOWLEDGMENTS
This work was supported by Hong Kong RGC Grants under
Project 611608, the National Grand Fundamental Research
973 Program of China under Grant 2006CB303000, the
NSFC Key Project Grant 60736013, and the NSFC Project
Grant 60763001.
REFERENCES
[1] R. Agrawal, C. Faloutsos, and A.N. Swami, Efficient Similarity
Search in Sequence Databases, Proc. Fourth Intl Conf. Foundations
of Data Organization and Algorithms (FODO), 1993.
[2] R.G. Andrzejak, K. Lehnertz, C. Rieke, F. Mormann, P. David, and
C.E. Elger, Indications of Nonlinear Deterministic and Finite
Dimensional Structures in Time Series of Brain Electrical Activity:
Dependence on Recording Region and Brain State, Physical Rev.,
vol. 64, no. 6, pp. 061907-1-061907-8, 2001.
[3] S. Berchtold, C. Böhm, D.A. Keim, and H.-P. Kriegel, A Cost
Model for Nearest Neighbor Search in High-Dimensional Data
Space, Proc. 16th ACM SIGACT-SIGMOD-SIGART Symp. Princi-
ples of Database Systems (PODS), 1997.
[4] S. Berchtold, D.A. Keim, and H.P. Kriegel, The X-Tree: An Index
Structure for High-Dimensional Data, Proc. 22nd Intl Conf. Very
Large Data Bases (VLDB), 1996.
[5] D.J. Berndt and J. Clifford, Finding Patterns in Time Series: A
Dynamic Programming Approach, Advances in Knowledge Dis-
covery and Data Mining, Am. Assoc. for Artificial Intelligence, 1996.
[6] T. Brinkhoff, H.-P. Kriegel, and B. Seeger, Efficient Processing of
Spatial Joins Using R-Trees, Proc. ACM SIGMOD, 1993.
[7] A. Bulut and A.K. Singh, A Unified Framework for Monitoring
Data Streams in Real Time, Proc. 21st Intl Conf. Data Eng. (ICDE),
2005.
[8] Y. Cai and R. Ng, Indexing Spatio-Temporal Trajectories with
Chebyshev Polynomials, Proc. ACM SIGMOD, 2004.
[9] K.P. Chan and A.W.-C. Fu, Efficient Time Series Matching by
Wavelets, Proc. 15th Intl Conf. Data Eng., 1999.
[10] L. Chen and R. Ng, On the Marriage of Lp-Norms and Edit
Distance, Proc. 30th Intl Conf. Very Large Data Bases (VLDB), 2004.
[11] Q. Chen, L. Chen, X. Lian, Y. Liu, and J.X. Yu, Indexable PLA for
Efficient Similarity Search, Proc. 33rd Intl Conf. Very Large Data
Bases (VLDB), 2007.
[12] C. Cranor, T. Johnson, and O. Spatscheck, Gigascope: A Stream
Database for Network Applications, Proc. ACM SIGMOD, 2003.
[13] C. Faloutsos, M. Ranganathan, and Y. Manolopoulos, Fast
Subsequence Matching in Time-Series Databases, Proc. ACM
SIGMOD, 1994.
[14] A. Guttman, R-trees: A Dynamic Index Structure for Spatial
Searching, Proc. ACM SIGMOD, 1984.
[15] E. Keogh, K. Chakrabarti, M. Pazzani, and S. Mehrotra, Locally
Adaptive Dimensionality Reduction for Indexing Large Time
Series Databases, Proc. ACM SIGMOD, 2001.
[16] M. Kontaki and A. Papadopoulos, Efficient Similarity Search in
Streaming Time Sequences, Proc. 16th IEEE Conf. Scientific and
Statistical Database Management (SSDBM), 2004.
[17] F. Korn, H. Jagadish, and C. Faloutsos, Efficiently Supporting Ad
Hoc Queries in Large Datasets of Time Sequences, Proc. ACM
SIGMOD, 1997.
[18] X. Lian, L. Chen, X. Yu, G.R. Wang, and G. Yu, Similarity Match
over High Speed Time-Series Streams, Proc. 23rd Intl Conf. Data
Eng. (ICDE), 2007.
[19] X. Liu and H. Ferhatosmanoglu, Efficient k-NN Search on
Streaming Data Series, Proc. Symp. Spatial and Temporal Databases
(SSTD), 2003.
[20] M.L. Lo and C.V. Ravishankar, Spatial Hash-Joins, Proc. ACM
SIGMOD, 1996.
[21] S. Michel, P. Triantafillou, and G. Weikum, Klee: A Framework
for Distributed Top-k Query Algorithms, Proc. 31st Intl Conf.
Very Large Data Bases (VLDB), 2005.
Fig. 18. Performance of load shedding techniques (sstock). (a) Query
accuracy. (b) Shedding time.
Fig. 17. Performance of ARES+SP+batch versus ARES+SP (versus
t). (a) sstock and (b) randomwalk.
[22] M.F. Mokbel, M. Lu, and W.G. Aref, Hash-Merge Join: A Non-
Blocking Join Algorithm for Producing Fast and Early Join
Results, Proc. 20th Intl Conf. Data Eng. (ICDE), 2004.
[23] Y.S. Moon, K.Y. Whang, and W.S. Han, Generalmatch: A
Subsequence Matching Method in Time-Series Databases Based
on Generalized Windows, Proc. ACM SIGMOD, 2002.
[24] Y.S. Moon, K.Y. Whang, and W.K. Loh, Duality-Based Subse-
quence Matching in Time-Series Databases, Proc. 17th Intl Conf.
Data Eng. (ICDE), 2001.
[25] Y. Sakurai, C. Faloutsos, and M. Yamamuro, Stream Monitoring
under the Time Warping Distance, Proc. 23rd Intl Conf. Data Eng.
(ICDE), 2007.
[26] S. Subramaniam, T. Palpanas, D. Papadopoulos, V. Kalogeraki,
and D. Gunopulos, Online Outlier Detection in Sensor Data
Using Non-Parametric Models, Proc. 32nd Intl Conf. Very Large
Data Bases (VLDB), 2006.
[27] Y.F. Tao, M.L. Yiu, D. Papadias, M. Hadjieleftheriou, and N.
Mamoulis, RPJ: Producing Fast Join Results on Streams through
Rate-Based Optimization, Proc. ACM SIGMOD, 2005.
[28] T. Urhan and M.J. Franklin, XJoin: A Reactively-Scheduled
Pipelined Join Operator, IEEE Data Eng. Bull., vol. 23, pp. 27-33,
2000.
[29] M. Vlachos, G. Kollios, and D. Gunopulos, Discovering Similar
Multidimensional Trajectories, Proc. 18th Intl Conf. Data Eng.
(ICDE), 2002.
[30] C.Z. Wang and X. Wang, Supporting Content-Based Searches on
Time Series via Approximation, Proc. 12th Intl Conf. Scientific and
Statistical Database Management (SSDBM), 2000.
[31] E.W. Weisstein, Central Limit Theorem, http://mathworld.
wolfram.com/CentralLimitTheorem.html. 2009.
[32] D.A. White and R. Jain, Similarity Indexing with the SS-Tree,
Proc. 12th Intl Conf. Data Eng. (ICDE), 1996.
[33] Y.W. Huang, N. Jing, and E.A. Rundensteiner, Spatial Joins
Using R-Trees: Breadth-First Traversal with Global Optimiza-
tions, Proc. 23rd Intl Conf. Very Large Data Bases (VLDB), 1997.
[34] B.-K. Yi and C. Faloutsos, Fast Time Sequence Indexing for
Arbitrary Lp Norms, Proc. 26th Intl Conf. Very Large Data Bases
(VLDB), 2000.
[35] Y. Zhu and D. Shasha, StatStream: Statistical Monitoring of
Thousands of Data Streams in Real Time, Proc. 28th Intl Conf.
Very Large Data Bases (VLDB), 2002.
[36] Y. Zhu and D. Shasha, Efficient Elastic Burst Detection in Data
Streams, Proc. ACM SIGKDD, 2003.
Xiang Lian received the BS degree from the
Department of Computer Science and Technol-
ogy, Nanjing University, in 2003. He is currently
working toward the PhD degree in the Depart-
ment of Computer Science and Engineering,
Hong Kong University of Science and Technol-
ogy. His research interests include query pro-
cessing over stream time series and uncertain
databases. He is a student member of the IEEE.
Lei Chen received the BS degree in computer
science and engineering from Tianjin University,
China, in 1994, the MA degree from the Asian
Institute of Technology, Thailand, in 1997, and
the PhD degree in computer science from the
University of Waterloo, Canada, in 2005. He is
now an assistant professor in the Department of
Computer Science and Engineering at Hong
Kong University of Science and Technology. His
research interests include uncertain databases,
graph databases, multimedia and time series databases, and sensor
and peer-to-peer databases. He is a member of the IEEE.