Discovering Knowledge in Data

Daniel T. Larose, Ph.D.
Chapter 3 Exploratory Data Analysis

Prepared by James Steck, Graduate Assistant
Discovering Knowledge in Data: An Introduction to Data Mining, By Daniel T. Larose. Copyright 2005 John Wiley & Sons, Inc.
Hypothesis Testing Versus Exploratory Data Analysis

Analyst may have an a priori (presupposed by experience) hypothesis to test
For example, has an increasing fee structure led to decreasing market share?
Hypothesis testing: test the hypothesis that market share has decreased
Many statistical hypothesis test procedures are available:
  Z-test: population mean
  t-test: population mean
  Z-test: population proportion
Hypothesis Testing Versus Exploratory Data Analysis (contd)

  Z-test: difference of two population means
  t-test: difference of two population means
  Z-test: difference of two population proportions
  Chi-Square: goodness of fit test for multinomial populations
  Chi-Square: test for independence among categorical variables
  F-test: analysis of variance
  t-test: slope of a regression line
And many others, including tests for time-series analysis, quality control tests, and nonparametric tests
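
As an illustration of the market-share question, the sketch below runs a lower-tailed Z-test for a population proportion. The figures (a historical share of 20% and a survey finding 170 of 1,000 customers) are hypothetical and are not taken from the slides; the function name is likewise an assumption made for this example.

```python
import math


def one_sample_proportion_z_test(successes, n, p0):
    """Lower-tailed Z-test of H0: p = p0 against Ha: p < p0."""
    p_hat = successes / n
    # Standard error is computed under the null hypothesis.
    standard_error = math.sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / standard_error
    # The standard normal CDF at z is the lower-tail p-value.
    p_value = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return z, p_value


# Hypothetical figures: historical share 20%, survey finds 170 of 1,000.
z, p_value = one_sample_proportion_z_test(successes=170, n=1000, p0=0.20)
print(f"z = {z:.2f}, one-sided p-value = {p_value:.4f}")
```

A small p-value here would support the hypothesis that market share has decreased; the normal approximation assumes a reasonably large sample.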
Hypothesis Testing Versus Exploratory Data Analysis (contd)

However, analyst does not always have a priori notions about data
In this case, use Exploratory Data Analysis (EDA)
Approach useful for:
  Delving into data
  Examining important interrelationships between attributes
  Identifying interesting subsets or patterns
  Discovering possible relationships between predictors and target variable
Getting to Know the Data Set

Graphs, plots, and tables often uncover important relationships in data
The 3,333 records and 20 variables in churn data set are explored
Clementine from SPSS, Inc. shows first 10 records from data set in Figure 3.1
Simple approach looks at field values of records
Getting to Know the Data Set (contd)

Eight attributes shown in Figure 3.1:
  State:           categorical
  Account Length:  numeric
  Area Code:       categorical
  Phone:           categorical
  Intl Plan:       Boolean
  VMail Plan:      Boolean
  VMail Messages:  numeric
  Day Mins:        numeric
Churn attribute (not shown) indicates customers leaving one company in favor of another company's products or services
Dealing with Correlated Variables

Using correlated variables in data model:
  Should be avoided!
  Incorrectly emphasizes one or more data inputs
  Creates model instability and produces unreliable results
Matrix plot of Day Minutes, Day Calls, and Day Charge shown in Figure 3.2 (Minitab)

(contd)
As Day Minutes increases, we expect Day Charge to increase
  Example of positive correlation
Oddly, no graphical evidence of correlation between Day Minutes and Day Calls, or between Day Calls and Day Charge
Additionally, r = 0.07 indicates these variables are essentially uncorrelated
However, linear relationship exists between Day Charge and Day Minutes
Day Charge is linear function of Day Minutes

(contd)
Regression Analysis: Day Charge versus Day Mins

The regression equation is
Day Charge = 0.000613 + 0.170 Day Mins

Predictor   Coef        SE Coef     T           P
Constant    0.0006134   0.0001711   3.59        0.000
Day Mins    0.170000    0.000001    186644.31   0.000

S = 0.002864   R-Sq = 100.0%   R-Sq(adj) = 100.0%
Estimated regression equation shown in Figure 3.3 (Minitab) expresses relationship
Day Charge equals 0.000613 plus 0.17 times Day Minutes
Company uses flat-rate billing model of 17 cents/minute
R-squared statistic = 1.0 indicates perfect linear relationship
Therefore, Day Charge and Day Minutes are perfectly correlated
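
The least-squares fit behind this kind of output can be sketched in a few lines. The Day Minutes values below are hypothetical, and the charges are generated from the 17 cents/minute flat rate and rounded to cents, so this only imitates the churn data; `fit_line` is an illustrative helper, not Minitab's procedure.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For simple linear regression, R-squared is the squared correlation.
    r_squared = sxy ** 2 / (sxx * syy)
    return intercept, slope, r_squared


# Hypothetical minutes; charges assume the flat rate, rounded to cents.
day_mins = [110.4, 161.6, 184.5, 223.4, 243.4, 265.1, 299.4]
day_charge = [round(0.17 * m, 2) for m in day_mins]

intercept, slope, r_squared = fit_line(day_mins, day_charge)
print(f"Day Charge = {intercept:.4f} + {slope:.4f} Day Mins, R-Sq = {r_squared:.6f}")
```

The small nonzero intercept and the R-squared marginally below 1 come from rounding charges to whole cents, which may also be why the Minitab constant is not exactly zero.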

(contd)
One of two variables should be eliminated from model
Day Charge arbitrarily chosen for removal
Evening, Night, and International variable pairs reflect similar results
Therefore, Evening Charge, Night Charge, and International Charge also removed
Proceeding to data mining without first eliminating correlated variables may have produced compromised results
Number of attributes reduced from 20 to 16
Reduction in dimensionality of solution space beneficial to some data mining algorithms
Exploring Categorical Variables

Goals: Exploratory Data Analysis
Investigate variables as part of the Data Understanding Phase
  Numeric: analyze histograms, scatter plots, statistics
  Categorical: examine distributions, cross-tabulations, web graphs
Become familiar with data
Explore relationships among variable sets
While performing EDA, remain focused on objective
That is, creating data mining model of customers likely to churn

(contd)
International Plan
Figure 3.4 shows proportion of customers in International Plan with churn overlay
International Plan: yes = 9.69%, no = 90.31%
Possibly, greater proportion of those in International Plan are churners?

(contd)
Again, proportion of customers in International Plan shown with churn overlay
This time, same-sized bars used for each category (normalized)
Graphically, proportion of churners in each category more apparent
Those selecting International Plan more likely to churn
However, relationship not quantified

(contd)
Cross-tabulation quantifies relationship between Churn and International Plan
International Plan and Churn variables both categorical

                  International Plan
                  no        yes
Churn = False     2,664     186
Churn = True        346     137

First column total:   International Plan = no
Second column total:  International Plan = yes
First row total:      Churn = False
Second row total:     Churn = True
Data set contains 346 + 137 = 483 churners, and 2,664 + 186 = 2,850 non-churners

(contd)
Therefore, quantifying the relationship:
  42.4% of customers in International Plan churned (137 / (137 + 186))
  11.5% of customers not in International Plan churned (346 / (346 + 2,664))
Customers selecting International Plan more than 3X as likely to leave company, compared to those not in plan
Why does International Plan apparently cause customers to leave?
Data models predicting churn will likely include International Plan as predictor
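
The percentages above can be reproduced directly from the four cross-tabulation counts. The sketch below is one way to do so; the dictionary layout and the names are choices made for this example, not part of the Clementine output.

```python
# Counts from the cross-tabulation: plan status -> (non-churners, churners).
intl_plan_counts = {"no": (2664, 346), "yes": (186, 137)}


def churn_rate(stayed, churned):
    """Fraction of customers in a category who churned."""
    return churned / (stayed + churned)


rates = {plan: churn_rate(*counts) for plan, counts in intl_plan_counts.items()}
relative_rate = rates["yes"] / rates["no"]

# Row totals across both plan categories.
total_non_churners = sum(stayed for stayed, _ in intl_plan_counts.values())
total_churners = sum(churned for _, churned in intl_plan_counts.values())

print(f"In plan: {rates['yes']:.1%}, not in plan: {rates['no']:.1%}")
print(f"Relative churn rate: {relative_rate:.2f}X")
print(f"Churners: {total_churners}, non-churners: {total_non_churners}")
```

The relative rate works out to roughly 3.7, consistent with "more than 3X".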

(contd)
Voice Mail Plan

Figure 3.7 shows proportion of customers in Voice Mail Plan with churn overlay (normalized)
Voice Mail Plan: yes = 27.66%, no = 72.34%
Those not participating in Voice Mail Plan appear more likely to churn

(contd)
Cross-tabulation quantifies relationship between Churn and Voice Mail Plan

                  Voice Mail Plan
                  no        yes
Churn = False     2,008     842
Churn = True        403      80

First column total:   Voice Mail Plan = no
Second column total:  Voice Mail Plan = yes
First row total:      Churn = False
Second row total:     Churn = True
Voice Mail Plan has 842 + 80 = 922 customers
Remaining 2,008 + 403 = 2,411 customers not in plan

(contd)
Only 8.7% = 80/922 of those in plan are churners
Of those not in plan, 16.7% = 403/2,411 are churners
Therefore, those not participating in plan ~2X as likely to churn, compared to those in plan
Perhaps customer loyalty can be increased by simplifying enrollment into Voice Mail Plan?
Data models predicting churn likely to include Voice Mail Plan as predictor

(contd)
Two-way interactions between Voice Mail Plan and International Plan, with respect to churn, shown
Voice Mail Plan = no (constant)
Many customers have neither plan: 1,878 + 302 = 2,180
Of those, 302/2,180 = 14% are churners
Customers in International Plan and not in Voice Mail Plan churn at rate 101/231 = 44%

(contd)
Here, Voice Mail Plan = yes (constant)
Many customers have Voice Mail Plan only: 786 + 44 = 830
Those in both plans: 56 + 36 = 92
Churn rate only 44/830 = 5% when customers participate in Voice Mail Plan only
However, those enrolled in both plans churn at 36/92 = 39%
Customers in International Plan churning at higher rate, regardless of Voice Mail Plan participation
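
The two-way breakdown can be checked by tabulating all four plan combinations together. The sketch below uses only the counts quoted in the slides; the dictionary keyed by (International Plan, Voice Mail Plan) is a layout assumed for illustration.

```python
# (International Plan, Voice Mail Plan) -> (churners, customers in cell).
cells = {
    ("no", "no"): (302, 2180),
    ("yes", "no"): (101, 231),
    ("no", "yes"): (44, 830),
    ("yes", "yes"): (36, 92),
}

rates = {key: churners / total for key, (churners, total) in cells.items()}

for (intl, vmail), rate in rates.items():
    print(f"Intl Plan = {intl:<3} VMail Plan = {vmail:<3} churn rate = {rate:.0%}")

# Sanity checks: the cells should cover the whole data set.
total_customers = sum(total for _, total in cells.values())
total_churners = sum(churners for churners, _ in cells.values())
print(f"{total_churners} churners among {total_customers} customers")
```

Holding Voice Mail Plan fixed at either value, the International Plan = yes cell shows the higher churn rate, which is the pattern the slides describe.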

(contd)
Directed web graph shows relationships between International Plan, Voice Mail Plan, and Churn attributes (Clementine)
Examine connections from Voice Mail Plan = yes node to Churn = True and Churn = False
Heavier line connecting to Churn = False indicates greater proportion of those in plan are not churners
Using EDA to Uncover Anomalous Fields
EDA sometimes uncovers anomalous records
For example, examine distribution of Area Code variable
Area Code used as categorical variable, grouping records geographically
Attribute contains only three values: 408, 415, and 510
All three area codes located in California
Is this strange? Perhaps not, if all records from California
Using EDA to Uncover Anomalous Fields (contd)
However, cross-tabulation of Area Code and State turns up anomaly
Area codes distributed evenly across all states
Data for Area Code attribute likely in error; or State attribute may have incorrect values?
Domain expert should be consulted before including these variables in data mining models
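
A check of this kind is easy to automate. The sketch below groups states by area code and flags any code that appears in more than one state; the six records are invented for illustration and are not rows from the churn data set.

```python
from collections import defaultdict

# Hypothetical (state, area code) records, invented for illustration only.
records = [
    ("KS", 415), ("OH", 415), ("NJ", 408),
    ("OK", 510), ("CA", 408), ("AL", 510),
]

states_by_area_code = defaultdict(set)
for state, area_code in records:
    states_by_area_code[area_code].add(state)

# An area code normally belongs to one state; flag codes seen in several.
suspicious = {
    code: sorted(states)
    for code, states in states_by_area_code.items()
    if len(states) > 1
}

for code, states in sorted(suspicious.items()):
    print(f"Area code {code} appears in states: {', '.join(states)}")
```

A flagged code does not say which field is wrong, only that the two fields disagree, which is why the slide recommends consulting a domain expert.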
Exploring Numeric Variables

Numeric summary measures for several variables shown
Includes min and max, mean, median, and standard deviation
For example, Account Length has min = 1 and max = 243
Mean and median both ~101, which indicates symmetry
Voice Mail Messages not symmetric; mean = 8.1 and median = 0
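
Comparing mean and median is a quick skewness check that can be done with the standard library. The message counts below are hypothetical, chosen so that most customers have zero messages, as the median of 0 suggests for the real attribute.

```python
import statistics

# Hypothetical voice mail message counts: most customers have none.
vmail_messages = [0, 0, 0, 0, 0, 0, 0, 25, 31, 40]

summary = {
    "min": min(vmail_messages),
    "max": max(vmail_messages),
    "mean": statistics.mean(vmail_messages),
    "median": statistics.median(vmail_messages),
    "stdev": statistics.stdev(vmail_messages),  # sample standard deviation
}

for name, value in summary.items():
    print(f"{name:>6}: {value:.2f}")

# A mean well above the median suggests a right-skewed distribution.
right_skewed = summary["mean"] > summary["median"]
```

When mean and median nearly coincide, as with Account Length, the same comparison points toward symmetry instead.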

(contd)
Median = 0 indicates at least half of customers had no voice mail messages
Recall use of correlated variables should be avoided
Correlations of Customer Service Calls and Day Charge with other numeric variables shown
All correlations are weak except for Day Charge and Day Minutes, where r = 1.0
  Indicates perfect linear relationship

(contd)
Histogram for Customer Service Calls attribute shown (left)
Increases understanding of attribute's distribution
Distribution is right-skewed and has mode = 1
However, relationship to Churn not indicated
Figure (right) shows identical histogram including Churn overlay
Whether Churn proportion varies across number of Customer Service Calls difficult to discern
(contd)
Again, histogram of Customer Service Calls shown
Normalized values enhance pattern of churn
Customers calling customer service 3 or fewer times far less likely to churn
Results:
  Carefully track number of customer service calls made by customers
  Offer incentives to retain those making higher number of calls
Data mining model will probably include Customer Service Calls as predictor

(contd)
Normalized histogram of Day Minutes shown with Churn overlay (top)
  Indicates high-usage customers churn at significantly greater rate
Results:
  Carefully track customer Day Minutes as total exceeds 200
  Investigate why those with high usage tend to leave
Normalized histogram of Evening Minutes shown with Churn overlay (bottom)
  Higher-usage customers churn slightly more
Results: based on graphical evidence, no specific conclusions drawn

(contd)
Additional EDA concludes no obvious association between Churn and remaining numeric attributes (not shown)
These numeric attributes probably not strong predictors in data model
However, they should be retained as input to model
  Important higher-level associations/interactions may exist
  In this case, let model identify which inputs are important
Different EDA task may encounter huge number of inputs
  Data mining performance adversely affected by many inputs
  Possibly exclude inputs not associated with target variable
  Or, use dimension-reduction technique such as principal components analysis
Exploring Multivariate Relationships
Possible multivariate relationships examined
Two- and three-dimensional scatter plots used
Figure 3.23 shows scatter plot of Customer Service Calls versus Day Minutes
Upper-left quadrant indicates high-churn area
  Identifies customers with high number of customer service calls, combined with low day minute usage

(contd)
This relationship not detected using univariate analysis
Note, interaction between two variables makes association apparent
Univariate analysis determined customers with high number of Customer Service Calls churn at higher rates
Figure 3.23 shows those with higher day minutes somewhat protected from higher churn rate

(contd)
High churn rate also shown on far right (Figure 3.23)
Those with high day usage churn at higher rate, regardless of number of customer service calls
Three-dimensional scatter plots sometimes enhance analysis
For example, figure shows Day Minutes versus Evening Minutes versus Customer Service Calls
Selecting Interesting Subsets of the Data for Further Investigation

Scatter plots or histograms identify interesting subsets of data
Top figure shows selection of churners with high day and evening minutes
Clementine enables selection of records for quantification (upper right)
Distribution of churn for this subset shown (bottom)
43.5% (192/441) of customers having both high day and evening minutes are churners
This is ~3X churn rate of entire data set
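
The "~3X" comparison depends on the baseline chosen. The sketch below computes the subset's churn rate against two baselines: the entire data set (483 churners among 3,333 records, as reported earlier in the chapter) and the customers outside the subset. The variable names are illustrative.

```python
subset_churners, subset_size = 192, 441
all_churners, all_customers = 483, 3333  # totals reported earlier in the chapter

subset_rate = subset_churners / subset_size
overall_rate = all_churners / all_customers
# Customers outside the selected subset.
rest_rate = (all_churners - subset_churners) / (all_customers - subset_size)

ratio_vs_overall = subset_rate / overall_rate
ratio_vs_rest = subset_rate / rest_rate

print(f"Subset churn rate:  {subset_rate:.1%}")
print(f"Overall churn rate: {overall_rate:.1%} ({ratio_vs_overall:.1f}X)")
print(f"Rest of data set:   {rest_rate:.1%} ({ratio_vs_rest:.1f}X)")
```

Against the entire data set the ratio is about 3X; against the remaining customers it is a little over 4X, which appears to be the comparison the Summary quotes.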
Binning
Binning categorizes an attribute's numeric (or categorical) values into reduced set of classes
  Makes analysis more convenient
For example, number of Day Minutes could be binned into Low, Medium, and High categories
For example, State values may be binned into regions
  California, Oregon, Washington, Alaska, and Hawaii categorized as Pacific
Binning is both data preparation and data exploration activity
Various strategies exist for binning numeric variables
  One approach equalizes number of records in each class
  Another partitions values into groups, with respect to target
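
The first strategy, equalizing the number of records per class, can be sketched as rank-based assignment. The Day Minutes values and the helper `equal_frequency_bins` are hypothetical and serve only to illustrate the idea.

```python
def equal_frequency_bins(values, labels):
    """Assign each value a label so bins hold (nearly) equal record counts."""
    # Indices of the values, ordered from smallest to largest value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    assigned = [None] * len(values)
    for rank, index in enumerate(order):
        # Map the rank proportionally onto the available labels.
        assigned[index] = labels[rank * len(labels) // len(values)]
    return assigned


# Hypothetical Day Minutes values, invented for illustration.
day_minutes = [265.1, 161.6, 243.4, 299.4, 166.7, 223.4, 218.2, 157.0, 184.5]
bins = equal_frequency_bins(day_minutes, ["Low", "Medium", "High"])

for minutes, label in zip(day_minutes, bins):
    print(f"{minutes:6.1f} -> {label}")
```

Because assignment is by rank, tied values can land in different bins; a production version would likely need a rule for ties.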
Binning (contd)
Recall those with fewer Customer Service Calls have lower churn rate
For example, bin number of Customer Service Calls into low and high categories
Figure shows churn rate for low class is 11.25% (top)
However, those within high group have 51.69% churn rate (bottom)
Churn rate more than 4X higher
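
Binning with respect to the target can be sketched as a threshold rule followed by a per-bin churn rate. The threshold of 4 calls follows the "3 or fewer" observation in the slides; the ten records are hypothetical, so the resulting rates will not match the 11.25% and 51.69% shown in the figure.

```python
# Hypothetical (customer service calls, churned) records for illustration.
records = [
    (0, False), (1, False), (1, False), (2, False), (2, True),
    (3, False), (4, True), (4, False), (5, True), (6, True),
]


def bin_calls(calls, threshold=4):
    """Label a customer 'high' at or above the threshold, else 'low'."""
    return "high" if calls >= threshold else "low"


# bin label -> [churners, customers]
tallies = {"low": [0, 0], "high": [0, 0]}
for calls, churned in records:
    label = bin_calls(calls)
    tallies[label][0] += int(churned)
    tallies[label][1] += 1

churn_rate_by_bin = {
    label: churners / total for label, (churners, total) in tallies.items()
}
for label, rate in churn_rate_by_bin.items():
    print(f"{label:>4}: {rate:.1%} churn rate")
```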
Summary
EDA uncovered some insights into churn data set:
  Four Charge fields are linear functions of Minutes fields
  Correlations among remaining numeric attributes weak
  Area Code and/or State fields anomalous
  Customers with International Plan churn at higher rate
  Those in Voice Mail Plan churn less frequently
  Customers calling customer service 4 or more times churn at more than 4X rate of others
  Customers with high day and evening minutes churn at ~4X rate of others
These observations made using EDA only; no data mining applied
Results can be easily formulated into actionable plan designed to reduce churn rate
