This as well as many of the following figures/examples are taken from the tutorial given by Eamonn Keogh at SBBD 2002 (XVII Brazilian Symposium on Databases)
www.cs.ucr.edu/~eamonn/
Sistemi Informativi M
First problem: large database size
Second problem: subjectivity of similarity evaluation
For problem 1, we need to define effective lower-bounding techniques that work in a (much) lower-dimensional space
For problem 2, we will introduce a new similarity criterion
Time series (1)
$$L_2(s, q) = \sqrt{\sum_{t=0}^{D-1} (s_t - q_t)^2}$$
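As a minimal sketch of the Euclidean distance between two series of the same length D, assuming the series are plain Python lists of numbers (the function name `euclidean_distance` is only illustrative):

```python
import math

def euclidean_distance(s, q):
    """L2 distance between two time series of equal length D."""
    if len(s) != len(q):
        raise ValueError("both series must have the same length D")
    # Sum the squared sample-by-sample differences, then take the square root.
    return math.sqrt(sum((st - qt) ** 2 for st, qt in zip(s, q)))

print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))  # 5.0
```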
Pre-processing
Directly comparing two time series (e.g., using the Euclidean distance) might lead to counter-intuitive results
This is because the Euclidean distance, like other measures, is very sensitive to some data features that might conflict with the subjective notion of similarity
Thus, a good idea is to pre-process time series in order to remove such features
Offset translation
Subtract from each sample the mean value of the series
[Figure: d(q,s) for two series of about 300 points]
Amplitude scaling
Normalize the amplitude (divide by the standard deviation of the series)
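The two steps above (offset translation and amplitude scaling) could be sketched as follows. This is only an illustration: the function names are hypothetical, and the use of the population standard deviation (dividing by D rather than D - 1) is an assumption, since the slides do not say which one is meant.

```python
import math

def offset_translation(s):
    """Subtract the mean of the series from each sample."""
    mean = sum(s) / len(s)
    return [x - mean for x in s]

def amplitude_scaling(s):
    """Divide each sample by the standard deviation of the series.

    Assumption: population standard deviation (divide by D).
    """
    mean = sum(s) / len(s)
    std = math.sqrt(sum((x - mean) ** 2 for x in s) / len(s))
    if std == 0:
        raise ValueError("a constant series cannot be amplitude-scaled")
    return [x / std for x in s]

def normalize(s):
    """Apply offset translation, then amplitude scaling."""
    return amplitude_scaling(offset_translation(s))
```

With this sketch, a series and a shifted, rescaled copy of it (for example `[3 * x + 7 for x in s]`) normalize to the same values up to rounding, so their Euclidean distance after pre-processing is close to zero.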
Noise removal
Average the values of each sample with those of its neighbors (smoothing)
q = smooth(q)
s = smooth(s)
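One possible reading of `smooth` is a simple moving average. The sketch below assumes a symmetric window of `k` neighbors on each side and a truncated window at the two ends of the series; both choices are assumptions, as the slides do not specify them.

```python
def smooth(s, k=1):
    """Moving average: each sample becomes the mean of itself and of up to
    k neighbors on each side (assumed window; truncated at the boundaries)."""
    out = []
    for t in range(len(s)):
        lo = max(0, t - k)            # first index of the window
        hi = min(len(s), t + k + 1)   # one past the last index of the window
        window = s[lo:hi]
        out.append(sum(window) / len(window))
    return out

print(smooth([0, 3, 0, 3, 0]))  # [1.5, 1.0, 2.0, 1.0, 1.5]
```

A larger `k` removes more noise but also flattens genuine short-lived features, so the window size is a trade-off.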
f = 0, ..., D-1
[Figure: the coefficient $S_f$ in the complex plane (axes Re, Img), with amplitude $|S_f|$]
where $|S_f| = |\mathrm{Re}(S_f) + j\,\mathrm{Img}(S_f)| = \sqrt{\mathrm{Re}(S_f)^2 + \mathrm{Img}(S_f)^2}$ is the absolute value of $S_f$ (its amplitude), which equals the Euclidean distance of $S_f$ from the origin
$$L_2(s, q)^2 = \sum_{f=0}^{D-1} |S_f - Q_f|^2 = L_2(S, Q)^2$$
where $|S_f - Q_f|$ is the Euclidean distance between $S_f$ and $Q_f$ in the complex plane
The above just says that DFT also preserves the Euclidean distance
[Figure: $S_f$ and $Q_f$ in the complex plane (axes Re, Img), with their distance $|S_f - Q_f|$]
Thus, by keeping only the first few coefficients ($D' \ll D$), an effective dimensionality reduction can be obtained
Note: this is the basic idea used by well-known compression standards, such as JPEG (which is based on the Discrete Cosine Transform)
From what we have seen, this projection technique satisfies the L-B (lower-bounding) lemma
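The distance-preservation and lower-bounding properties can be checked numerically. The sketch below assumes the orthonormal form of the DFT, $S_f = \frac{1}{\sqrt{D}} \sum_{t=0}^{D-1} s_t e^{-j 2\pi f t / D}$; this normalization is an assumption, chosen because it is the one under which distances are preserved exactly. The direct $O(D^2)$ computation is used only for readability, and the function names are illustrative.

```python
import cmath
import math

def dft(s):
    """Orthonormal DFT (assumed 1/sqrt(D) scaling), computed directly in O(D^2)."""
    D = len(s)
    scale = 1 / math.sqrt(D)
    return [scale * sum(s[t] * cmath.exp(-2j * cmath.pi * f * t / D)
                        for t in range(D))
            for f in range(D)]

def l2(a, b):
    """Euclidean distance; abs() also handles complex coefficients."""
    return math.sqrt(sum(abs(x - y) ** 2 for x, y in zip(a, b)))

def reduced_distance(s, q, d_prime):
    """Distance computed on the first d_prime DFT coefficients only."""
    return l2(dft(s)[:d_prime], dft(q)[:d_prime])

s = [3, 2, 4, 6, 5, 6, 2, 2]
q = [1, 1, 2, 5, 7, 6, 3, 0]
print(l2(s, q), l2(dft(s), dft(q)))   # equal up to rounding
print(reduced_distance(s, q, 2))      # never larger than l2(s, q)
```

Dropping coefficients only removes non-negative terms from the sum of squares, which is why the reduced distance can never exceed the true one.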
Energy spectrum
Another example
128 points
data values
0.4995 0.5264 0.5523 0.5761 0.5973 0.6153 0.6301 0.6420 0.6515 0.6596 0.6672 0.6751 0.6843 0.6954 0.7086 0.7240 0.7412 0.7595 0.7780 0.7956 0.8115 0.8247 0.8345 0.8407 0.8431 0.8423 0.8387
Fourier coefficients
1.5698 1.0485 0.7160 0.8406 0.3709 0.4670 0.2667 0.1928 0.1635 0.1602 0.0992 0.1282 0.1438 0.1416 0.1400 0.1412 0.1530 0.0795 0.1013 0.1150 0.1801 0.1082 0.0812 0.0347 0.0052 0.0017 0.0002 ...
Comments on DFT
Can be computed in O(D log D) time using FFT (provided D is a power of 2)
Difficult to use if one wants to deal with sequences of different lengths
Not really amenable to dealing with signals with spots (time-varying energy)
An alternative to DFT is to use wavelets, which take a different perspective:
  A signal can be represented as a sum of contributions, each at a different resolution level
  The Discrete Wavelet Transform (DWT) can be computed in O(D) time
  Also used in the JPEG2000 standard for compressing images
Experimental results however show that the superiority of DWT w.r.t. DFT is dependent on the specific dataset
DWT
[Figure: a time series (x-axis 0 to 140) and the Haar basis functions Haar 0 to Haar 7]
As with DFT, the time series is decomposed into a linear combination of base elements
The Haar DWT, applied to a series of length $D = 2^n$, pairs samples, stores their difference and passes their average to the next stage
This process is repeated recursively, which leads to $2^n - 1$ difference values and one final average (which is 0 for normalized series)
s = (3,2,4,6,5,6,2,2)
Differences      Averages
(1,-2,-1,0)      (2.5,5,5.5,2)
(-2.5,3.5)       (3.75,3.75)
(0)              (3.75)
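The pairing procedure can be sketched in a few lines. The code follows the convention of the worked example, storing the raw difference (first sample minus second) and passing on the plain average; it is therefore not the orthonormal Haar transform, which rescales both values. The function name `haar_decompose` is illustrative.

```python
def haar_decompose(s):
    """Return (difference levels, final average) for a series of length 2^n.

    Convention assumed from the worked example: difference = first - second,
    average = (first + second) / 2, with no further rescaling.
    """
    n = len(s)
    if n == 0 or n & (n - 1):
        raise ValueError("the series length must be a power of 2")
    levels = []
    current = list(s)
    while len(current) > 1:
        pairs = range(0, len(current), 2)
        levels.append([current[i] - current[i + 1] for i in pairs])
        current = [(current[i] + current[i + 1]) / 2 for i in pairs]
    return levels, current[0]

levels, average = haar_decompose([3, 2, 4, 6, 5, 6, 2, 2])
print(levels)   # [[1, -2, -1, 0], [-2.5, 3.5], [0.0]]
print(average)  # 3.75
```

Each pass halves the number of values, so the total work is D/2 + D/4 + ... + 1 pairings, which is consistent with the O(D) cost of the DWT.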
PAA
We take a window of size W and segment our time series into $D' = D/W$ pieces (sub-sequences), each of size W
For each piece, we compute the average of its values, i.e.
$$s'_i = \frac{1}{W} \sum_{t=(i-1)W}^{iW-1} s_t$$
Our approximation is therefore $s' = (s'_1, \ldots, s'_{D'})$
We have $\sqrt{W} \cdot L_2(s', q') \le L_2(s, q)$ (arguments generalize those used for the global average example)
The same can be generalized to work with arbitrary Lp-norms [YF00]
[Figure: a series s and its approximation s' with window size W]
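A minimal sketch of the piecewise averaging and of the resulting lower bound, assuming that W divides the series length D exactly (the function names are illustrative):

```python
import math

def paa(s, W):
    """Average each consecutive window of W samples (W must divide len(s))."""
    if W <= 0 or len(s) % W:
        raise ValueError("W must be a positive divisor of the series length")
    return [sum(s[i:i + W]) / W for i in range(0, len(s), W)]

def paa_lower_bound(s, q, W):
    """sqrt(W) times the L2 distance between the two reduced series."""
    sp, qp = paa(s, W), paa(q, W)
    return math.sqrt(W) * math.sqrt(sum((a - b) ** 2 for a, b in zip(sp, qp)))

s = [3, 2, 4, 6, 5, 6, 2, 2]
q = [1, 1, 2, 5, 7, 6, 3, 0]
print(paa(s, 2))                 # [2.5, 5.0, 5.5, 2.0]
print(paa_lower_bound(s, q, 2))  # about 3.39, below the true distance sqrt(19)
```

Setting W equal to the series length reduces each series to its global average, which is the special case the general argument starts from.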
On most datasets, the three methods yield very similar results
The actual response time may vary, even for a given method, depending on the implementation
[Figure: comparison of DFT, DWT and PAA; one axis is labeled "Seconds"]