You are on page 1of 36

Discovering Knowledge in Data

Daniel T. Larose, Ph.D.

Chapter 3 Exploratory Data Analysis


Prepared by James Steck, Graduate Assistant

Discovering Knowledge in Data: An Introduction to Data Mining, By Daniel T. Larose. Copyright 2005 John Wiley & Sons, Inc.

Hypothesis Testing Versus Exploratory Data Analysis


Analyst may have a priori (presupposed by

experience) hypothesis to test For example, has increasing fee-structure led to decreasing market share? Hypothesis Testing: test hypothesis market share has decreased Many statistical hypothesis test procedures available:
Z-test t-test Z-test population mean population mean population proportion

Discovering Knowledge in Data: An Introduction to Data Mining, By Daniel T. Larose. Copyright 2005 John Wiley & Sons, Inc.

Hypothesis Testing Versus Exploratory Data Analysis (contd)


Z-test t-test Z-test


Chi-Square Chi-Square

difference of two population means difference of two population means difference of two population proportions goodness of fit test for multinomial populations test independence among categorical variables

Analysis of variance F-test t-test for the slope of a regression line And many others, including tests for time-series analysis, quality control tests, and nonparametric tests

Discovering Knowledge in Data: An Introduction to Data Mining, By Daniel T. Larose. Copyright 2005 John Wiley & Sons, Inc.

Hypothesis Testing Versus Exploratory Data Analysis (contd)


However, not always have a priori notions about data In this case, use Exploratory Data Analysis (EDA) Approach useful for:
Delving into data Examining important interrelationships between attributes Identifying interesting subsets or patterns Discovering possible relationships between predictors and target variable

Discovering Knowledge in Data: An Introduction to Data Mining, By Daniel T. Larose. Copyright 2005 John Wiley & Sons, Inc.

Getting to Know the Data Set


Graphs, plots, and tables often uncover important relationships in data The 3,333 records and 20 variables in churn data set are explored Clementine from SPPS, Inc. shows first 10 records from data set in Figure 3.1 Simple approach looks at field values of records

Discovering Knowledge in Data: An Introduction to Data Mining, By Daniel T. Larose. Copyright 2005 John Wiley & Sons, Inc.

Getting to Know the Data Set


Eight attributes shown in Figure 3.1:

(contd)

State: Account Length: Area Code: Phone: Intl Plan: VMail Plan: Vmail Messages: Day Mins:

categorical numeric categorical categorical Boolean Boolean numeric numeric

churn attribute (not shown) indicates customers leaving one company in favor of another companys products or services
Discovering Knowledge in Data: An Introduction to Data Mining, By Daniel T. Larose. Copyright 2005 John Wiley & Sons, Inc.

Dealing with Correlated Variables


Using correlated variables in data model:
Should be avoided! Incorrectly emphasizes one or more data inputs Creates model instability and produces unreliable results Matrix plot of Day Minutes, Day Calls, and Day Charge shown in Figure 3.2 (Minitab)

Discovering Knowledge in Data: An Introduction to Data Mining, By Daniel T. Larose. Copyright 2005 John Wiley & Sons, Inc.

Dealing with Correlated Variables


(contd)
As number of Day Minutes increase we expect Day Charge to increase Example of positive correlation Oddly, lack of graphical evidence supports correlation between Day Minutes and Day Calls, or Day Calls and Day Charge Additionally, r = 0.07 indicating variables uncorrelated However, linear relationship exists between Day Charge and Day

Minutes Day Charge is linear function of Day Minutes

Discovering Knowledge in Data: An Introduction to Data Mining, By Daniel T. Larose. Copyright 2005 John Wiley & Sons, Inc.

Dealing with Correlated Variables


(contd)
Regression Analysis: Day Charge versus Day Mins
The regression equation is Day Charge =0.000613 + 0.170 Day Mins Predictor Constant Day Mins S = 0.002864 Coef 0.0006134 0.170000 SE Coef 0.0001711 0.000001 T 3.59 186644.31 P 0.000 0.000

R-Sq = 100.0%

R-Sq(adj) = 100.0%

Estimated regression equation shown in Figure 3.3 (Minitab) expresses relationship Day Charge equals 0.000613 plus 0.17 times Day Minutes Company uses flat-rate billing model of 17 cents/minute R-squared statistic = 1.0 indicates perfect linear relationship Therefore, Day Charge and Day Minutes are correlated
Discovering Knowledge in Data: An Introduction to Data Mining, By Daniel T. Larose. Copyright 2005 John Wiley & Sons, Inc.

Dealing with Correlated Variables


(contd)
One of two variables should be eliminated from model Day Charge arbitrarily chosen for removal Evening, Night, and International variable pairs reflect similar results Therefore, Evening Charge, Night Charge, and International Charge also removed Proceeding to data mining without first eliminating correlated variables may have produced compromised results Number of attributes reduced from 20 to 16 Reduction in dimensionality of solution space beneficial to some data mining algorithms

Discovering Knowledge in Data: An Introduction to Data Mining, By Daniel T. Larose. Copyright 2005 John Wiley & Sons, Inc.

10

Exploring Categorical Variables


Goals: Exploratory Data Analysis
Investigate variables as part of the Data Understanding Phase

Numeric Categorical

Analyze Histograms, Scatter Plots, Statistics Examine Distributions, Cross-tabulations, Web Graphs

Become familiar with data Explore relationships among variable sets While performing EDA, remain focused on objective That is, creating data mining model of customer likely to churn

Discovering Knowledge in Data: An Introduction to Data Mining, By Daniel T. Larose. Copyright 2005 John Wiley & Sons, Inc.

11

Exploring Categorical Variables


(contd)

International Plan
Figure 3.4 shows proportion of customers in International Plan with churn overlay International Plan: yes = 9.69%, no = 90.31% Possibly, greater proportion of those in International Plan are churners?

Discovering Knowledge in Data: An Introduction to Data Mining, By Daniel T. Larose. Copyright 2005 John Wiley & Sons, Inc.

12

Exploring Categorical Variables


(contd)

Again, Proportion of customers in International Plan with churn overlay This time, same-sized bars used for each category (normalized) Graphically, proportion of churners in each category more apparent Those selecting International Plan more likely to churn However, relationship not quantified

Discovering Knowledge in Data: An Introduction to Data Mining, By Daniel T. Larose. Copyright 2005 John Wiley & Sons, Inc.

13

Exploring Categorical Variables


(contd)
Cross-tabulation quantifies relationship between Churn and

International Plan International plan and Churn variables both categorical


Churn False. True. no 2,664 346 yes 186 137

First column: Second column: First row: Second row:

total total total total

International plan = no International plan = yes Churn = False Churn = True

Data set contains 346 + 137 = 483 churners, and 2,664 + 186 = 3,010 non-churners
Discovering Knowledge in Data: An Introduction to Data Mining, By Daniel T. Larose. Copyright 2005 John Wiley & Sons, Inc.

14

Exploring Categorical Variables


(contd)
Therefore, quantifying the relationship: 42.4% of customers in International Plan churned (137 / (137 + 186)) 11.5% of customers not in International Plan churned (346 / (346 + 2,664)) Customers selecting International Plan more than 3X likely to leave company, as compared to those not in plan Why does International Plan apparently cause customers to leave? Data models predicting churn will likely include International Plan as predictor
Discovering Knowledge in Data: An Introduction to Data Mining, By Daniel T. Larose. Copyright 2005 John Wiley & Sons, Inc.

15

Exploring Categorical Variables


(contd)

Voice Mail Plan


Figure 3.7 shows proportion of customers in Voice Mail Plan with churn overlay (normalized) Voicemail Plan: yes = 27.66%, no = 72.34% Those not participating in Voice Mail Plan appear more likely to churn

Discovering Knowledge in Data: An Introduction to Data Mining, By Daniel T. Larose. Copyright 2005 John Wiley & Sons, Inc.

16

Exploring Categorical Variables


(contd)
Churn False. True. No 2,008 403 yes 842 80

Cross-tabulation quantifies relationship between Churn and

Voice Mail Plan

First column: Second column: First row: Second row:

total total total total

Voice Mail Plan = no Voice Mail Plan = yes Churn = False Churn = True

Voice Mail Plan has 842 + 80 = 922 customers Remaining 2,008 + 403 = 2,411 customers not in plan
Discovering Knowledge in Data: An Introduction to Data Mining, By Daniel T. Larose. Copyright 2005 John Wiley & Sons, Inc.

17

Exploring Categorical Variables


(contd)
Only 8.7% = 80/922 of those in plan are churners Of those not in plan, 16.7% = 403/2,411 are churners Therefore, those not participating in plan ~2X likely to churn, as compared to those in plan Perhaps customer loyalty can be increased by simplifying enrollment into Voice Mail Plan? Data models predicting churn likely to include Voice Mail Plan as predictor

Discovering Knowledge in Data: An Introduction to Data Mining, By Daniel T. Larose. Copyright 2005 John Wiley & Sons, Inc.

18

Exploring Categorical Variables


(contd)

Two-way Interactions between Voice Mail Plan and International Plan, with respect to churn shown Voice Mail Plan = no (constant) Many customers have neither plan: 1,878 + 302 = 2,180 Of those, 302/2,180 = 14% are churners Customers in International Plan and not in Voice Mail Plan churn at rate 101/231 = 44%
Discovering Knowledge in Data: An Introduction to Data Mining, By Daniel T. Larose. Copyright 2005 John Wiley & Sons, Inc.

19

Exploring Categorical Variables


(contd)

Here, Voice Mail Plan = yes (constant) Many customers have Voice Mail Plan only: 786 + 44 = 830 Those in both plans: 56 + 36 = 92 Churn rate only 44/830 = 5% when customers participate in Voice Mail Plan only However, those enrolled in both plans churn at 36/92 = 39% Customers in International Plan churning at higher rate, regardless of Voice Mail Plan participation
Discovering Knowledge in Data: An Introduction to Data Mining, By Daniel T. Larose. Copyright 2005 John Wiley & Sons, Inc.

20

Exploring Categorical Variables


(contd)

Directed Web Graph shows relationships between International Plan, Voice Mail Plan, and Churn attributes (Clementine) Examine connections from Voice Mail Plan = yes node to Churn = True and Churn = False Heavier line connecting Churn = False indicates greater proportion of those in plan not churners
Discovering Knowledge in Data: An Introduction to Data Mining, By Daniel T. Larose. Copyright 2005 John Wiley & Sons, Inc.

21

Using EDA to Uncover Anomalous Fields

EDA sometimes uncovers anomalous records For example, examine distribution of Area Code variable Area Code used as categorical variable, grouping records geographically Attribute contains only three values: 408, 415, and 510 All area codes located in California Is this strange? Perhaps not, if all records from California
Discovering Knowledge in Data: An Introduction to Data Mining, By Daniel T. Larose. Copyright 2005 John Wiley & Sons, Inc.

22

Using EDA to Uncover Anomalous Fields (contd)

However, cross-tabulation of Area Code and State turns up anomaly Area codes distributed evenly across all states Data for attribute likely in error; or State attribute may have incorrect values? Domain expert should be consulted before including these variables in data mining models
Discovering Knowledge in Data: An Introduction to Data Mining, By Daniel T. Larose. Copyright 2005 John Wiley & Sons, Inc.

23

Exploring Numeric Variables


Numeric summary measures for several variables shown Includes min and max, mean, median, and standard deviation For example, Account Length has min = 1 and max = 243 Mean and median both ~101, which indicates symmetry Voice Mail Messages not symmetric; mean = 8.1 and median = 0
Discovering Knowledge in Data: An Introduction to Data Mining, By Daniel T. Larose. Copyright 2005 John Wiley & Sons, Inc.

24

Exploring Numeric Variables


Median = 0 indicates half of customers had no voice mail messages Recall use of correlated variables should be avoided Correlations of Customer Service Calls and Day Charge with other numeric variables shown All correlations are Weak except for Day Charge and Day Minutes, where r = 1.0 Indicates perfect linear relationship

(contd)

Discovering Knowledge in Data: An Introduction to Data Mining, By Daniel T. Larose. Copyright 2005 John Wiley & Sons, Inc.

25

Exploring Numeric Variables

(contd)

Histogram for Customer Service Calls attribute shown Increases understanding of attributes distribution Distribution is right-skewed and has mode = 1 However, relationship to Churn not indicated (Left) Figure (Right) shows identical histogram including Churn overlay Determining whether Churn proportion varies across number of Customer Service Calls difficult to discern
26

Discovering Knowledge in Data: An Introduction to Data Mining, By Daniel T. Larose. Copyright 2005 John Wiley & Sons, Inc.

Exploring Numeric Variables

(contd)

Again, histogram of Customer Service Calls shown Normalized values enhance pattern of churn Customers calling customer service 3 or fewer times, far less likely to churn Results: Carefully track number of customer service calls made by customers; Offer incentives to retain those making higher number of calls Data mining model will probably include Customer Service Calls as predictor
Discovering Knowledge in Data: An Introduction to Data Mining, By Daniel T. Larose. Copyright 2005 John Wiley & Sons, Inc.

27

Exploring Numeric Variables


Normalized histogram of Day Minutes shown with Churn overlay (Top) Indicates high usage customers churn at significantly greater rate Results: Carefully track customer Day Minutes as total exceeds 200 Investigate why those with high usage tend to leave Normalized histogram of Evening Minutes shown with Churn overlay (Bottom) Higher usage customers churn slightly more Results: Based on graphical evidence, no specific conclusions drawn

(contd)

Discovering Knowledge in Data: An Introduction to Data Mining, By Daniel T. Larose. Copyright 2005 John Wiley & Sons, Inc.

28

Exploring Numeric Variables

(contd)

Additional EDA concludes no obvious association between Churn and remaining numeric attributes (not shown) These numeric attributes probably not strong predictors in data model However, they should be retained as input to model Important higher-level associations/interactions may exist In this case, let model identify which inputs are important Different EDA task may encounter huge number of inputs Data mining performance adversely affected by many inputs Possibly exclude inputs not associated with target variable Or, use dimension-reduction technique such as principal components analysis
29

Discovering Knowledge in Data: An Introduction to Data Mining, By Daniel T. Larose. Copyright 2005 John Wiley & Sons, Inc.

Exploring Multivariate Relationships

Possible multivariate relationships examined Two and three-dimensional scatter plots used Figure 3.23 shows scatter plot of Customer Service Calls versus

Day Minutes

Upper-left quadrant indicates high-churn area Identifies customers with high number of customer service calls, combined with low day minute usage
Discovering Knowledge in Data: An Introduction to Data Mining, By Daniel T. Larose. Copyright 2005 John Wiley & Sons, Inc.

30

Exploring Multivariate Relationships


(contd)

This relationship not detected using univariate analysis Note, interaction between two variables makes association apparent Univariate analysis determined customers with high number Customer Service Calls churn at higher rates Figure 3.23 shows those with higher day minutes somewhat protected from higher churn rate
Discovering Knowledge in Data: An Introduction to Data Mining, By Daniel T. Larose. Copyright 2005 John Wiley & Sons, Inc.

31

Exploring Multivariate Relationships


(contd)
High-churn rate also shown on far right (Figure 3.23) Those with high day usage churn at higher rate, regardless of their number of customer service calls Three-dimensional scatter plots sometimes enhance analysis For example, figure shows Day Minutes versus Evening Minutes versus Customer Service Calls

Discovering Knowledge in Data: An Introduction to Data Mining, By Daniel T. Larose. Copyright 2005 John Wiley & Sons, Inc.

32

Selecting Interesting Subsets of the Data for Further Investigation


Scatter plots or histograms identify interesting subsets of data Top figure shows selection of churners with high day and evening minutes Clementine enables selection of records for quantification (upper right) Distribution of churn for this subset shown (bottom) 43.5% (192/441) of customers having both high day and evening minutes are churners This is ~3X churn rate of entire data set
33

Discovering Knowledge in Data: An Introduction to Data Mining, By Daniel T. Larose. Copyright 2005 John Wiley & Sons, Inc.

Binning
Binning categorizes an attributes numeric (or categorical) values into reduced set of classes Makes analysis more convenient For example, number of Day Minutes could be binned into Low, Medium, and High categories For example, State values may be binned into regions California, Oregon, Washington, Alaska, and Hawaii are categorized as Pacific Binning defined as both data preparation and data exploration activity Various strategies exist for binning numeric variables One approach equalizes number of records in each class Another partitions values into groups, with respect to target
Discovering Knowledge in Data: An Introduction to Data Mining, By Daniel T. Larose. Copyright 2005 John Wiley & Sons, Inc.

34

Binning

(contd)

Recall those with fewer Customer Service Calls have lower churn rate For example, bin number of Customer Service Calls into low and high categories Figure shows churn rate for low class is 11.25% (Top) However, those within high group have 51.69% churn rate (Bottom) Churn rate more than 4X higher
Discovering Knowledge in Data: An Introduction to Data Mining, By Daniel T. Larose. Copyright 2005 John Wiley & Sons, Inc.

35

Summary
EDA uncovered some insights into churn data set:

Four Charge fields are linear functions of Minutes fields Correlation among remaining numeric attributes Weak Area Code and/or State fields anomalous Customers with International Plan churn at higher rate Those in Voice Mail Plan churn less frequently Customers calling customer service 4 or more times churn 4X higher than others Customer with high day and evening minutes churn 4X higher rate than others

These observations performed using EDA only; no data mining applied Results can be easily formulated into actionable plan designed to reduce churn rate
Discovering Knowledge in Data: An Introduction to Data Mining, By Daniel T. Larose. Copyright 2005 John Wiley & Sons, Inc.

36

You might also like