
An introduction to time series approaches in biosurveillance

Andrew W. Moore

Professor, The Auton Lab, School of Computer Science, Carnegie Mellon University
http://www.autonlab.org

Associate Member, The RODS Lab, University of Pittsburgh, Carnegie Mellon University
http://rods.health.pitt.edu

awm@cs.cmu.edu
412-268-7599

Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.

Univariate Time Series

[Figure: Signal plotted against Time]

Example Signals:

Number of ED visits today
Number of ED visits this hour
Number of Respiratory Cases Today
School absenteeism today
Nyquil Sales today

(When) is there an anomaly?

This is a time series of counts of primary-physician visits in data from Norfolk in December 2001. I added a fake outbreak, starting at a certain date. Can you guess when?

Here (much too high for a Friday).

(Ramp outbreak)

An easy case: Control Charts

[Figure: Signal plotted against Time, with the Mean and the Upper Safe Range marked]

Dealt with by Statistical Quality Control. Record the mean and standard deviation up to the current time. Signal an alarm if we go outside 3 sigmas.
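The control chart rule above can be sketched in a few lines. This is a minimal illustration, not the implementation behind the slides: the function name `control_chart_alarms`, the `min_history` parameter, the choice to use only days strictly before the current one, and the example counts are all assumptions made here.

```python
from statistics import mean, stdev

def control_chart_alarms(signal, n_sigmas=3.0, min_history=2):
    """Return the indices of days that fall outside mean +/- n_sigmas * sd,
    where mean and sd are computed from all days before that day."""
    alarms = []
    for t in range(min_history, len(signal)):
        history = signal[:t]          # assumption: today is excluded
        mu = mean(history)
        sd = stdev(history)
        if sd > 0 and abs(signal[t] - mu) > n_sigmas * sd:
            alarms.append(t)
    return alarms

# Made-up daily counts with one obviously unusual final day.
counts = [50, 52, 49, 51, 50, 48, 52, 50, 51, 49, 80]
print(control_chart_alarms(counts))
```

Because the mean and standard deviation use the whole history, a chart like this adapts slowly when the baseline drifts.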

Control Charts on the Norfolk Data

[Figure: the Norfolk time series with the control chart Alarm Level marked]

Looking at changes from yesterday

[Figure: changes from yesterday, with the Alarm Level marked]

We need a happy medium:


Control Chart: Too insensitive to recent changes
Change from yesterday: Too sensitive to recent changes

Moving Average

Looks better. But how can we be quantitative about this?
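One way to read the moving average detector is as a family indexed by window length: a window of 1 is the "change from yesterday" detector, and a very long window approaches the control chart. The sketch below assumes the alarm score is today's count minus the mean of the previous `window` days; the slides do not spell out the exact score, so treat `moving_average_scores` and the example counts as hypothetical.

```python
def moving_average_scores(signal, window):
    """Alarm score per day: today's count minus the mean of the previous
    `window` days. Days without a full window of history get None."""
    scores = []
    for t in range(len(signal)):
        if t < window:
            scores.append(None)
        else:
            recent = signal[t - window:t]
            scores.append(signal[t] - sum(recent) / window)
    return scores

# Made-up counts with a jump on the last day.
counts = [10, 12, 11, 13, 12, 11, 12, 25]
scores_1 = moving_average_scores(counts, window=1)  # change from yesterday
scores_3 = moving_average_scores(counts, window=3)
print(scores_1)
print(scores_3)
```

Longer windows smooth out day-to-day noise at the cost of reacting more slowly to a genuine shift.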

Algorithm Performance

Fraction = fraction of spikes detected. Days = days to detect a ramp outbreak.

                                        Allowing one False Alarm    Allowing one False Alarm
                                        per TWO weeks               per SIX weeks
Algorithm                               Fraction    Days            Fraction    Days
standard control chart                  0.39        3.47            0.22        4.13
using yesterday                         0.14        3.83            0.1         4.7
Moving Average 3                        0.36        3.45            0.33        3.79
Moving Average 7                        0.58        2.79            0.51        3.31
Moving Average 56                       0.54        2.72            0.44        3.54
hours_of_daylight                       0.58        2.73            0.43        3.9
hours_of_daylight is_mon                0.7         2.25            0.57        3.12
hours_of_daylight is_mon ... is_tue     0.72        1.83            0.57        3.16
hours_of_daylight is_mon ... is_sat     0.77        2.11            0.59        3.26
CUSUM                                   0.45        2.03            0.15        3.55
sa-mav-1                                0.86        1.88            0.74        2.73
sa-mav-7                                0.87        1.28            0.83        1.87
sa-mav-14                               0.86        1.27            0.82        1.62
sa-regress                              0.73        1.76            0.67        2.21
Cough with denominator                  0.78        2.15            0.59        2.41
Cough with MA                           0.65        2.78            0.57        3.24

Semi-synthetic data: spike outbreaks

1. Take a real time series
2. Add a spike of random height on a random date
3. See what alarm levels your algorithm gives on every day of the data
4. On what fraction of non-spike days is there an equal or higher alarm? (In the example shown: only one.)
5. That's an example of the false positive rate this algorithm would need if it was going to detect the actual spike.

Do this 1000 times to get an average performance.
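The spike procedure can be sketched as follows. Everything named here is an assumption for illustration: `change_from_yesterday` is a toy detector, `max_height` and the uniform height distribution are arbitrary, and the flat baseline is made-up data standing in for a real time series.

```python
import random

def change_from_yesterday(signal):
    """Toy detector (illustration only): alarm level is the rise since
    the previous day; day 0 gets 0."""
    return [0.0] + [signal[t] - signal[t - 1] for t in range(1, len(signal))]

def spike_false_positive_rate(series, detector, rng, max_height=30):
    """Inject one spike, score every day, and return the fraction of
    non-spike days whose alarm level is equal to or higher than the
    alarm level on the spike day."""
    day = rng.randrange(1, len(series))      # random date
    height = rng.uniform(1, max_height)      # random height
    injected = list(series)
    injected[day] += height
    alarms = detector(injected)
    others = [a for t, a in enumerate(alarms) if t != day]
    return sum(a >= alarms[day] for a in others) / len(others)

def average_spike_fpr(series, detector, trials=1000, seed=0):
    """Repeat the injection many times and average the result."""
    rng = random.Random(seed)
    total = sum(spike_false_positive_rate(series, detector, rng)
                for _ in range(trials))
    return total / trials

flat = [10] * 20   # made-up baseline, not real data
print(average_spike_fpr(flat, change_from_yesterday, trials=50))
```

On a perfectly flat baseline every spike stands out, so the required false positive rate is zero; a noisy real baseline is what makes the comparison between detectors informative.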

Semi-synthetic data: ramp outbreaks

1. Take a real time series
2. Add a ramp of random height on a random date
3. See what alarm levels your algorithm gives on every day of the data
4. If you allowed a specific false positive rate, how far into the ramp would you be before you signaled an alarm?

Do this 1000 times to get an average performance.
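A rough sketch of the ramp procedure for a single injected ramp is given below. The names, the linear ramp shape, and the way the threshold is calibrated on the un-injected baseline are assumptions made here, and the random choice of height and date plus the averaging over many repeats are left out for brevity.

```python
def change_from_yesterday(signal):
    """Toy detector (illustration only): alarm level is the rise since
    the previous day; day 0 gets 0."""
    return [0.0] + [signal[t] - signal[t - 1] for t in range(1, len(signal))]

def threshold_for_fp_rate(baseline, detector, fp_rate):
    """Alarm level strictly exceeded on at most `fp_rate` of baseline days."""
    levels = sorted(detector(baseline), reverse=True)
    k = min(int(fp_rate * len(levels)), len(levels) - 1)
    return levels[k]

def days_to_detect_ramp(baseline, detector, fp_rate, start, slope, length):
    """Add a linear ramp starting at `start`, score every day, and return
    how many days into the ramp the first alarm occurs (None if never)."""
    threshold = threshold_for_fp_rate(baseline, detector, fp_rate)
    injected = list(baseline)
    end = min(start + length, len(injected))
    for t in range(start, end):
        injected[t] += slope * (t - start + 1)
    alarms = detector(injected)
    for t in range(start, end):
        if alarms[t] > threshold:
            return t - start + 1
    return None

baseline = [10, 12, 11, 10, 12, 11] * 5   # made-up counts, not real data
detected_after = days_to_detect_ramp(baseline, change_from_yesterday,
                                     fp_rate=1 / 14, start=20,
                                     slope=1.5, length=6)
print(detected_after)
```

Here `fp_rate=1 / 14` corresponds to allowing one false alarm per two weeks of daily data.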

Evaluation methods

All synthetic
You can account for variation in the way the baseline will look.
You can publish evaluation data and share results without data agreement problems.
You can easily generate large numbers of tests.
You know where the outbreaks are.
Your baseline data might be unrealistic.

Semi-Synthetic
Can't account for variation in the baseline.
You can't share data.
You can easily generate large numbers of tests.
You know where the outbreaks are.
Don't know where the outbreaks aren't.
Your baseline data is realistic.
Your outbreak data might be unrealistic.

All real
You can't get many outbreaks to test.
You need experts to decide what is an outbreak.
Some kinds of outbreak have no available data.
You can't share data.
Your baseline data is realistic.
Your outbreak data is realistic.
Is the test typical?

None of these options is satisfactory. Evaluation of Biosurveillance algorithms is really hard. It has got to be. This is a real problem, and we must learn to live with it.

Seasonal Effects

[Figure: Signal plotted against Time]

Fit a periodic function (e.g. sine wave) to previous data. Predict today's signal and 3-sigma confidence intervals. Signal an alarm if we're off. Reduces false alarms from natural outbreaks. Different times of year deserve different thresholds.
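A minimal sketch of the seasonal idea, under stated assumptions: the model is a + b sin(wt) + c cos(wt) with a known period, the history covers a whole number of periods (so the least-squares coefficients reduce to simple projections), and the 3-sigma band comes from the residual standard deviation. The function names and the synthetic history are hypothetical.

```python
import math

def fit_sine(history, period):
    """Least-squares fit of a + b*sin(w*t) + c*cos(w*t). The closed form
    used here is valid when `history` spans a whole number of periods."""
    n = len(history)
    w = 2 * math.pi / period
    a = sum(history) / n
    b = 2 / n * sum(y * math.sin(w * t) for t, y in enumerate(history))
    c = 2 / n * sum(y * math.cos(w * t) for t, y in enumerate(history))
    return a, b, c

def seasonal_alarm(history, today_value, period, n_sigmas=3.0):
    """Predict today's value from the fitted curve and report whether the
    observed value falls outside the n-sigma band. Returns (alarm, expected)."""
    a, b, c = fit_sine(history, period)
    w = 2 * math.pi / period

    def predict(t):
        return a + b * math.sin(w * t) + c * math.cos(w * t)

    residuals = [y - predict(t) for t, y in enumerate(history)]
    sd = math.sqrt(sum(r * r for r in residuals) / (len(residuals) - 3))
    expected = predict(len(history))
    return abs(today_value - expected) > n_sigmas * sd, expected

# Synthetic two-year history: yearly cycle plus a small alternating wobble.
history = [100 + 20 * math.sin(2 * math.pi * t / 365) + (2 if t % 2 == 0 else -2)
           for t in range(730)]
print(seasonal_alarm(history, 101, period=365))
print(seasonal_alarm(history, 120, period=365))
```

For histories that do not cover whole periods, a general least-squares solver would be needed instead of the projection shortcut.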

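Day-of-week effects

Fit a day-of-week component:

$E[\mathrm{Signal}] = a + \delta_{\mathrm{day}}$

E.g. $\delta_{\mathrm{mon}} = +5.42$, $\delta_{\mathrm{tue}} = +2.20$, $\delta_{\mathrm{wed}} = +3.33$, $\delta_{\mathrm{thu}} = +3.10$, $\delta_{\mathrm{fri}} = +4.02$, $\delta_{\mathrm{sat}} = -12.2$, $\delta_{\mathrm{sun}} = -23.42$

A simple form of ANOVA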
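The day-of-week component could be estimated as sketched below. This is one possible parameterization, assumed here: `a` is the overall mean and each delta is that weekday's mean minus `a`. The example deltas on the slide do not sum to zero, so the slide presumably uses a different convention for `a`. The function name and the counts are made up.

```python
from collections import defaultdict

def fit_day_of_week(counts, weekdays):
    """Estimate E[Signal] = a + delta_day, with `a` the overall mean and
    delta_day the mean for that weekday minus `a`."""
    a = sum(counts) / len(counts)
    by_day = defaultdict(list)
    for value, day in zip(counts, weekdays):
        by_day[day].append(value)
    deltas = {day: sum(values) / len(values) - a
              for day, values in by_day.items()}
    return a, deltas

# Two made-up weeks of daily counts with quiet weekends.
days = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"] * 2
counts = [30, 28, 28, 28, 29, 12, 5] * 2
a, deltas = fit_day_of_week(counts, days)
print(round(a, 2), {d: round(v, 2) for d, v in deltas.items()})
```

An alarm rule would then compare each day's count with `a + deltas[day]` rather than with a single overall mean.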

Regression using Hours-in-day & IsMonday

Regression using Mon-Tue

CUSUM

CUmulative SUM Statistics
Keep a running sum of "surprises": a sum of excesses each day over the prediction.
When this sum exceeds a threshold, signal an alarm and reset the sum.
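A small sketch of the running-sum rule. Two details are assumptions taken from the usual one-sided CUSUM formulation rather than from the slide: the sum is floored at zero so that quiet days cannot build up credit, and a `slack` allowance is subtracted from each day's excess. The function name, the constant predictions, and the counts are illustrative.

```python
def cusum_alarms(signal, predictions, threshold, slack=0.0):
    """Keep a running sum of excesses over the prediction (floored at
    zero). When the sum exceeds `threshold`, record an alarm and reset."""
    total = 0.0
    alarms = []
    for t, (observed, predicted) in enumerate(zip(signal, predictions)):
        total = max(0.0, total + (observed - predicted - slack))
        if total > threshold:
            alarms.append(t)
            total = 0.0   # reset after signalling
    return alarms

# Made-up counts that drift upward from day 3 onward.
counts = [10, 11, 10, 12, 13, 14, 15, 10]
predicted = [10] * len(counts)
print(cusum_alarms(counts, predicted, threshold=5, slack=0.5))
```

Because small excesses accumulate, a rule like this can flag a slow ramp that never produces a single large daily deviation.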

The Sickness/Availability Model

Exploiting Denominator Data

Show Walkerton Results

Other state-of-the-art methods


Wavelets
Change-point detection
Kalman filters
Hidden Markov Models
