
Time Series (1)

Information Systems M
Prof. Paolo Ciaccia
http://www-db.deis.unibo.it/courses/SI-M/

Time series are everywhere


Time series, that is, sequences of observations (samples) made through time, are present in everyday life:
- Temperature, rainfalls, seismic traces
- Weblogs
- Stock prices
- EEG, ECG, blood pressure
- Enrolled students at the Engineering Faculty

This as well as many of the following figures/examples are taken from the tutorial given by Eamonn Keogh at SBBD 2002 (XVII Brazilian Symposium on Databases)

www.cs.ucr.edu/~eamonn/

Time series (1)

Sistemi Informativi M

Why is similarity search so important?


Similarity search can help you in:
- Looking for occurrences of known patterns
- Discovering unknown patterns
- Putting things together (clustering)
- Classifying new data
- Predicting/extrapolating future behaviors


Why is similarity search difficult?


Consider a large time series DB, e.g.:
- 1 hour of ECG data: 1 GByte
- Typical Weblog: 5 GBytes per week
- Space Shuttle DB: 158 GBytes
- MACHO Astronomical DB: 2 TBytes, updated with 3 GBytes a day (20 million stars recorded nightly for 4 years) http://wwwmacho.anu.edu.au/

First problem: large database size
Second problem: subjectivity of similarity evaluation


How to measure similarity


Given two time series s and q of equal length D, the usual way to measure their (dis-)similarity is based on the Euclidean distance:

    L2(s, q) = ( Σ_{t=0..D-1} (s_t - q_t)^2 )^(1/2)

However, with Euclidean distance we have to face two basic problems:
1. High dimensionality: (very) large D values
2. Sensitivity to the alignment of values

For problem 1 we need to define effective lower-bounding techniques that work in a (much) lower-dimensional space
For problem 2 we will introduce a new similarity criterion

Pre-processing
Directly comparing two time series (e.g., using Euclidean distance) might lead to counter-intuitive results
This is because Euclidean distance, as well as other measures, is very sensitive to some data features that may conflict with the subjective notion of similarity
Thus, a good idea is to pre-process time series in order to remove such features


Offset translation
Subtract from each sample the mean value of the series
[Figure: series q and s, and the distance d(q, s) before and after offset translation]

q = q - mean(q)
s = s - mean(s)


Amplitude scaling
Normalize the amplitude (divide by the standard deviation of the series)

[Figure: series q and s before and after amplitude scaling]

q = (q - mean(q)) / std(q)
s = (s - mean(s)) / std(s)
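Offset translation and amplitude scaling together amount to z-normalization. A minimal sketch in plain Python (the helper names `znorm` and `euclidean` are ours):

```python
import math

def znorm(series):
    """Z-normalize: subtract the mean, then divide by the standard deviation."""
    d = len(series)
    mean = sum(series) / d
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / d)
    return [(x - mean) / std for x in series]

def euclidean(s, q):
    """Euclidean distance between two equal-length series."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(s, q)))

# Two series with the same shape but different offset and amplitude
q = [1.0, 2.0, 3.0, 2.0, 1.0]
s = [10.0, 12.0, 14.0, 12.0, 10.0]

print(euclidean(q, s))                # large raw distance
print(euclidean(znorm(q), znorm(s)))  # ~0 after normalization
```

After normalization the two series coincide, so the counter-intuitive raw distance disappears.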


Noise removal
Average the values of each sample with those of its neighbors (smoothing)

[Figure: a noisy series before and after smoothing]

q = smooth(q)
s = smooth(s)
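The smoothing step can be sketched as a centered moving average; the window size and the edge handling below are our choices, not fixed by the slide:

```python
def smooth(series, w=3):
    """Replace each sample with the average of itself and its neighbors
    (centered moving average; at the edges only available samples are used)."""
    d = len(series)
    out = []
    for i in range(d):
        lo = max(0, i - w // 2)
        hi = min(d, i + w // 2 + 1)
        window = series[lo:hi]
        out.append(sum(window) / len(window))
    return out

noisy = [0.0, 2.0, -1.0, 3.0, 0.0, 2.0]
print(smooth(noisy))  # same length, with spikes damped
```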

Dimensionality reduction: DFT (1)


The first method for reducing the dimensionality of time series, proposed in [AFS93], was based on the Discrete Fourier Transform (DFT)
Recall: given a time series s, the Fourier coefficients are complex numbers (amplitude, phase), defined as:

    S_f = (1/√D) Σ_{t=0..D-1} s_t exp(-j 2π f t / D),    f = 0, ..., D-1

Parseval's theorem: the DFT preserves the energy of the signal:

    E(s) = Σ_{t=0..D-1} (s_t)^2 = E(S) = Σ_{f=0..D-1} |S_f|^2

where |S_f| = |Re(S_f) + j Img(S_f)| = √(Re(S_f)^2 + Img(S_f)^2) is the absolute value of S_f (its amplitude), which equals the Euclidean distance of S_f from the origin

[Figure: S_f in the complex plane (Re, Img axes), with |S_f| as its distance from the origin]
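The definition and Parseval's theorem can be checked directly with a naive O(D^2) DFT, using the unitary 1/√D normalization from the formula above (function name `dft` is ours):

```python
import cmath
import math

def dft(s):
    """Unitary DFT: S_f = (1/sqrt(D)) * sum_t s_t * exp(-j*2*pi*f*t/D)."""
    d = len(s)
    return [sum(s[t] * cmath.exp(-2j * math.pi * f * t / d) for t in range(d))
            / math.sqrt(d)
            for f in range(d)]

s = [3.0, 2.0, 4.0, 6.0, 5.0, 6.0, 2.0, 2.0]
S = dft(s)
energy_time = sum(x ** 2 for x in s)           # E(s)
energy_freq = sum(abs(c) ** 2 for c in S)      # E(S)
print(energy_time, energy_freq)                # equal up to rounding
```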

Dimensionality reduction: DFT (2)


The key observation is that the DFT is a linear transformation
Thus, we have:

    L2(s, q)^2 = Σ_{t=0..D-1} (s_t - q_t)^2 = E(s - q) = E(S - Q) = Σ_{f=0..D-1} |S_f - Q_f|^2 = L2(S, Q)^2

where |S_f - Q_f| is the Euclidean distance between S_f and Q_f in the complex plane
The above just says that the DFT also preserves the Euclidean distance

What can we gain from such a transformation?

[Figure: S_f and Q_f in the complex plane, with |S_f - Q_f| as their distance]


Dimensionality reduction: DFT (3)


The key observation is that, by keeping only a small set of Fourier coefficients, we can obtain a good approximation of the original signal
This is because most of the energy of many real-world signals concentrates in the low frequencies ([AFS93])
More precisely, the energy spectrum (|S_f|^2 vs. f) behaves as O(f^-b), b > 0:
- b = 2 (random walk or brown noise): used to model the behavior of stock movements and currency exchange rates
- b > 2 (black noise): suitable to model slowly varying natural phenomena (e.g., water levels of rivers)
- b = 1 (pink noise): according to Birkhoff's theory, musical scores follow this energy pattern

Thus, by only keeping the first few coefficients (D' << D), an effective dimensionality reduction can be obtained
Note: this is the basic idea used by well-known compression standards, such as JPEG (which is based on the Discrete Cosine Transform)

For what we have seen, this projection technique satisfies the lower-bounding (L-B) lemma
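The lower-bounding property can be verified numerically: since L2(s, q)^2 equals the sum of |S_f - Q_f|^2 over all frequencies, truncating the sum to the first D' coefficients can only decrease it. A sketch (the names `dft` and `reduced_dist` are ours):

```python
import cmath
import math

def dft(s):
    # Unitary DFT (1/sqrt(D) normalization), naive O(D^2) version
    d = len(s)
    return [sum(s[t] * cmath.exp(-2j * math.pi * f * t / d) for t in range(d))
            / math.sqrt(d)
            for f in range(d)]

def reduced_dist(S, Q, k):
    # Euclidean distance computed on the first k Fourier coefficients only
    return math.sqrt(sum(abs(a - b) ** 2 for a, b in zip(S[:k], Q[:k])))

s = [3.0, 2.0, 4.0, 6.0, 5.0, 6.0, 2.0, 2.0]
q = [2.0, 3.0, 5.0, 5.0, 6.0, 5.0, 3.0, 1.0]
true_dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(s, q)))
S, Q = dft(s), dft(q)
for k in (2, 4, 8):
    # The reduced-space distance never overestimates the true distance
    assert reduced_dist(S, Q, k) <= true_dist + 1e-9
```

With k = D the two distances coincide, since the full DFT preserves the Euclidean distance.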

An example: EEG data


Sampling rate: 128 Hz

[Figure: the time series (4 secs, 512 points) and its energy spectrum]


Another example
128 points

data values:
0.4995 0.5264 0.5523 0.5761 0.5973 0.6153 0.6301 0.6420 0.6515 0.6596 0.6672 0.6751 0.6843 0.6954 0.7086 0.7240 0.7412 0.7595 0.7780 0.7956 0.8115 0.8247 0.8345 0.8407 0.8431 0.8423 0.8387

Fourier coefficients:
1.5698 1.0485 0.7160 0.8406 0.3709 0.4670 0.2667 0.1928 0.1635 0.1602 0.0992 0.1282 0.1438 0.1416 0.1400 0.1412 0.1530 0.0795 0.1013 0.1150 0.1801 0.1082 0.0812 0.0347 0.0052 0.0017 0.0002 ...

First 4 Fourier coefficients:
1.5698 1.0485 0.7160 0.8406 0.3709 0.4670 0.2667 0.1928

[Figure: the series s and its approximation s' obtained from the first 4 Fourier coefficients]


Comments on DFT

- Can be computed in O(D log D) time using the FFT (provided D is a power of 2)
- Difficult to use if one wants to deal with sequences of different lengths
- Not really amenable to deal with signals with spots (time-varying energy)
- An alternative to DFT is to use wavelets, which take a different perspective:
  - A signal can be represented as a sum of contributions, each at a different resolution level
  - The Discrete Wavelet Transform (DWT) can be computed in O(D) time
  - Wavelets are also used in the JPEG2000 standard for compressing images

Experimental results, however, show that the superiority of DWT w.r.t. DFT depends on the specific dataset

[Figure: two example series, one "good for wavelets, bad for Fourier" and one "good for Fourier, bad for wavelets"]

Haar Wavelets: the intuition


[Figure: a time series X, its DWT-based reconstruction X', and the Haar basis functions Haar 0, ..., Haar 7]

As with DFT, the time series is decomposed into a linear combination of base elements
The Haar DWT, applied to a series of length D = 2^n, pairs samples, stores their difference and passes their average to the next stage
This process is repeated recursively, which leads to 2^n - 1 difference values and one final average (which is 0 for normalized series)

    s = (3, 2, 4, 6, 5, 6, 2, 2)

    Differences        Averages
    (1, -2, -1, 0)     (2.5, 5, 5.5, 2)
    (-2.5, 3.5)        (3.75, 3.75)
    (0)                (3.75)
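The pairing scheme can be sketched as a short recursive function (the name `haar` is ours; we store the plain pairwise differences, so the output collects the overall average followed by the difference values, coarsest level first):

```python
def haar(series):
    """One-dimensional Haar decomposition: at each stage keep the pairwise
    differences and recurse on the pairwise averages (length must be 2^n)."""
    if len(series) == 1:
        return list(series)  # the final overall average
    avgs = [(series[i] + series[i + 1]) / 2 for i in range(0, len(series), 2)]
    diffs = [series[i] - series[i + 1] for i in range(0, len(series), 2)]
    return haar(avgs) + diffs  # coarse part first, then the details

s = [3, 2, 4, 6, 5, 6, 2, 2]
print(haar(s))
```

From the decomposition the original series can be rebuilt exactly, which is why only 2^n - 1 differences plus one average are needed.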


Dimensionality reduction: PAA


PAA (Piecewise Aggregate Approximation) [KCP+00, YF00] is a very simple, intuitive and fast, O(D), method to approximate time series
Its performance is comparable to that of DFT and DWT

We take a window of size W and segment our time series into D' = D/W pieces (sub-sequences), each of size W
For each piece, we compute the average of its values, i.e.:

    s'_i = (1/W) Σ_{t=(i-1)·W .. i·W-1} s_t,    i = 1, ..., D'

Our approximation is therefore s' = (s'_1, ..., s'_{D'})
We have √W · L2(s', q') ≤ L2(s, q) (arguments generalize those used for the global average example)

The same can be generalized to work with arbitrary Lp-norms [YF00]

[Figure: a series s and its PAA approximation s', with window size W]
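PAA fits in a few lines of code; the sketch below assumes W divides D (the name `paa` is ours) and checks the lower-bounding inequality on a small example:

```python
import math

def paa(series, w):
    """Piecewise Aggregate Approximation: average each window of size w
    (assumes w divides the series length)."""
    return [sum(series[i:i + w]) / w for i in range(0, len(series), w)]

s = [3.0, 2.0, 4.0, 6.0, 5.0, 6.0, 2.0, 2.0]
q = [2.0, 3.0, 5.0, 5.0, 6.0, 4.0, 3.0, 2.0]
w = 2
ps, pq = paa(s, w), paa(q, w)

# sqrt(W) * L2(s', q') lower-bounds the true Euclidean distance L2(s, q)
lower = math.sqrt(w) * math.sqrt(sum((a - b) ** 2 for a, b in zip(ps, pq)))
true = math.sqrt(sum((a - b) ** 2 for a, b in zip(s, q)))
assert lower <= true + 1e-9
```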



Some results (thanks again to Eamonn Keogh!)


The graphs show the fraction of data (indexed by an R-tree) that must be retrieved from disk to answer a 1-NN query (mixed dataset)

[Figure: fraction of retrieved data vs. number of retained dimensions (1024 down to 16), one graph each for DFT, DWT, and PAA]

On most datasets, the three methods yield very similar results
The actual response time may vary, even for a given method, depending on the implementation

Example: computing the Euclidean distance


Consider a sequential 1-NN algorithm based on Euclidean distance, and let r be the lowest distance found so far
Two basic optimization techniques are:
1. Avoid taking the square root, i.e., use the squared Euclidean distance
   This does not influence the result
2. Early-terminate the evaluation of the distance on a series s if the so-far accumulated distance for s is ≥ r

[Figure: response time (seconds) of Euclid, Opt1, and Opt2 vs. number of objects]

Clearly, these optimizations are not peculiar to time series
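The two optimizations combine naturally if r is tracked as a squared distance. A sketch (function names are ours):

```python
def squared_dist_early_abandon(s, q, best_sq):
    """Squared Euclidean distance, abandoned as soon as the running sum
    reaches the best (squared) distance found so far."""
    acc = 0.0
    for a, b in zip(s, q):
        acc += (a - b) ** 2
        if acc >= best_sq:
            return None  # s cannot be the nearest neighbor
    return acc

def nn_search(query, dataset):
    """Sequential 1-NN scan using both optimizations (no sqrt, early abandon)."""
    best_sq, best = float("inf"), None
    for s in dataset:
        d2 = squared_dist_early_abandon(s, query, best_sq)
        if d2 is not None:
            best_sq, best = d2, s
    return best, best_sq

data = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [0.0, 1.0, 0.0]]
print(nn_search([0.0, 0.9, 0.1], data))
```

Since the accumulated sum only grows, abandoning at the threshold can never discard the true nearest neighbor.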

