
Data Mining In Excel: Lecture Notes and Cases

Preliminary Draft 2/04

Nitin R. Patel
Peter C. Bruce
(c) Quantlink Corp. 2004
Distributed by:
Resampling Stats, Inc.
612 N. Jackson St.
Arlington, VA 22201
USA
info@xlminer.com
www.xlminer.com

Contents

1 Introduction   1
1.1 Who is This Book For?   1
1.2 What is Data Mining?   2
1.3 Where is Data Mining Used?   2
1.4 The Origins of Data Mining   3
1.5 Terminology and Notation   4
1.6 Organization of Data Sets   5
1.7 Factors Responsible for the Rapid Growth of Data Mining   5

2 Overview of the Data Mining Process   7
2.1 Core Ideas in Data Mining   7
2.1.1 Classification   7
2.1.2 Prediction   7
2.1.3 Affinity Analysis   7
2.1.4 Data Reduction   8
2.1.5 Data Exploration   8
2.1.6 Data Visualization   8
2.2 Supervised and Unsupervised Learning   9
2.3 The Steps In Data Mining   10
2.4 SEMMA   11
2.5 Preliminary Steps   11
2.5.1 Sampling from a Database   11
2.5.2 Pre-processing and Cleaning the Data   12
2.5.3 Partitioning the Data   17
2.6 Building a Model - An Example with Linear Regression   19
2.6.1 Can Excel Handle the Job?   27

3 Supervised Learning - Classification & Prediction   29
3.1 Judging Classification Performance   29
3.1.1 A Two-class Classifier   29
3.1.2 Bayes' Rule for Minimum Error   30
3.1.3 Practical Assessment of a Classifier Using Misclassification Error as the Criterion   32
3.1.4 Asymmetric Misclassification Costs and Bayes' Risk   34
3.1.5 Stratified Sampling and Asymmetric Costs   34
3.1.6 Generalization to More than Two Classes   35
3.1.7 Lift Charts   35
3.1.8 Example: Boston Housing (Two classes)   36
3.1.9 ROC Curve   40
3.1.10 Classification using a Triage strategy   40

4 Multiple Linear Regression   43
4.1 A Review of Multiple Linear Regression   43
4.1.1 Linearity   43
4.1.2 Independence   43
4.1.3 Unbiasedness   44
4.2 Illustration of the Regression Process   45
4.3 Subset Selection in Linear Regression   47
4.4 Dropping Irrelevant Variables   48
4.5 Dropping Independent Variables With Small Coefficient Values   49
4.6 Algorithms for Subset Selection   50
4.6.1 Forward Selection   50
4.6.2 Backward Elimination   50
4.6.3 Step-wise Regression (Efroymson's method)   51
4.6.4 All Subsets Regression   51
4.7 Identifying Subsets of Variables to Improve Predictions   51

5 Logistic Regression   55
5.1 Example 1: Estimating the Probability of Adopting a New Phone Service   55
5.2 Multiple Linear Regression is Inappropriate   56
5.3 The Logistic Regression Model   56
5.4 Odds Ratios   57
5.5 Probabilities   58
5.6 Example 2: Financial Conditions of Banks   59
5.6.1 A Model with Just One Independent Variable   60
5.6.2 Multiplicative Model of Odds Ratios   61
5.6.3 Computation of Estimates   63
5.7 Appendix A - Computing Maximum Likelihood Estimates and Confidence Intervals for Regression Coefficients   63
5.7.1 Data   63
5.7.2 Likelihood Function   64
5.7.3 Loglikelihood Function   64
5.7.4 Algorithm   65
5.8 Appendix B - The Newton-Raphson Method   65

6 Neural Nets   67
6.1 The Neuron (a Mathematical Model)   67
6.2 The Neuron (a Mathematical Model)   69
6.2.1 Single Layer Networks   69
6.2.2 Multilayer Neural Networks   70
6.3 Example 1: Fisher's Iris Data   71
6.4 The Backward Propagation Algorithm - Classification   73
6.4.1 Forward Pass - Computation of Outputs of all the Neurons in the Network   74
6.4.2 Backward Pass: Propagation of Error and Adjustment of Weights   74
6.5 Adjustment for Prediction   75
6.6 Multiple Local Optima and Epochs   75
6.7 Overfitting and the Choice of Training Epochs   75
6.8 Adaptive Selection of Architecture   76
6.9 Successful Applications   76

7 Classification and Regression Trees   77
7.1 Classification Trees   77
7.2 Recursive Partitioning   77
7.3 Example 1 - Riding Mowers   78
7.4 Pruning   84
7.5 Minimum Error Tree   89
7.6 Best Pruned Tree   89
7.7 Classification Rules from Trees   91
7.8 Regression Trees   91

8 Discriminant Analysis   93
8.1 Example 1 - Riding Mowers   93
8.2 Fisher's Linear Classification Functions   95
8.3 Measuring Distance   98
8.4 Classification Error   99
8.5 Example 2 - Classification of Flowers   99
8.6 Appendix - Mahalanobis Distance   103

9 Other Supervised Learning Techniques   105
9.1 K-Nearest Neighbor   105
9.1.1 The K-NN Procedure   106
9.1.2 Example 1 - Riding Mowers   106
9.1.3 K-Nearest Neighbor Prediction   108
9.1.4 Shortcomings of k-NN Algorithms   109
9.2 Naive Bayes   110
9.2.1 Bayes' Theorem   110
9.2.2 The Problem with Bayes' Theorem   111
9.2.3 Simplify - Assume Independence   111
9.2.4 Example 1 - Saris   112

10 Affinity Analysis - Association Rules   115
10.1 Discovering Association Rules in Transaction Databases   115
10.2 Support and Confidence   115
10.3 Example 1 - Electronics Sales   116
10.4 The Apriori Algorithm   117
10.5 Example 2 - Randomly-generated Data   118
10.6 Shortcomings   121

11 Data Reduction and Exploration   123
11.1 Dimensionality Reduction - Principal Components Analysis   123
11.2 Example 1 - Head Measurements of First Adult Sons   123
11.3 The Principal Components   124
11.4 Example 2 - Characteristics of Wine   126
11.5 Normalizing the Data   128
11.6 Principal Components and Orthogonal Least Squares   129

12 Cluster Analysis   131
12.1 What is Cluster Analysis?   131
12.2 Example 1 - Public Utilities Data   131
12.3 Hierarchical Methods   134
12.3.1 Nearest Neighbor (Single Linkage)   134
12.3.2 Farthest Neighbor (Complete Linkage)   135
12.3.3 Group Average (Average Linkage)   135
12.4 Optimization and the k-means Algorithm   137
12.5 Similarity Measures   140
12.6 Other Distance Measures   141

13 Cases   143
13.1 Charles Book Club   143
13.2 German Credit   152
13.3 Textile Cooperatives   158
13.4 Tayko Software Cataloger   161
13.5 IMRB: Segmenting Consumers of Bath Soap   167
Chapter 1

Introduction
1.1 Who is This Book For?

This book arose out of a data mining course at MIT's Sloan School of Management. Preparation for the course revealed that there are a number of excellent books on the business context of data mining, but their coverage of the statistical and machine-learning algorithms that underlie data mining is not sufficiently detailed to provide a practical guide if the instructor's goal is to equip students with the skills and tools to implement those algorithms. On the other hand, there are also a number of more technical books about data mining algorithms, but these are aimed at the statistical researcher or more advanced graduate student, and do not provide the case-oriented business focus that is successful in teaching business students.
Hence, this book is intended for the business student (and practitioner) of data mining techniques, and its goal is threefold:
1. To provide both a theoretical and practical understanding of the key methods of classification, prediction, reduction and exploration that are at the heart of data mining;
2. To provide a business decision-making context for these methods;
3. Using real business cases, to illustrate the application and interpretation of these methods.
An important feature of this book is the use of Excel, an environment familiar to business
analysts. All required data mining algorithms (plus illustrative data sets) are provided in an Excel
add-in, XLMiner. The presentation of the cases is structured so that the reader can follow along
and implement the algorithms on his or her own with a very low learning curve.
While the genesis for this book lay in the need for a case-oriented guide to teaching data mining, analysts and consultants who are considering applying data mining techniques in contexts where they are not currently in use will also find this a useful, practical guide.


1.2 What is Data Mining?

The field of data mining is still relatively new, and in a state of evolution. The first International Conference on Knowledge Discovery and Data Mining (KDD) was held in 1995, and there are a variety of definitions of data mining.
A concise definition that captures the essence of data mining is:
"Extracting useful information from large data sets" (Hand et al.: 2001).
A slightly longer version is:
"Data mining is the process of exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules." (Berry and Linoff: 1997 and 2000)
Berry and Linoff later had cause to regret the 1997 reference to "automatic and semi-automatic means," feeling it shortchanged the role of data exploration and analysis.
Another definition comes from the Gartner Group, the information technology research firm (from their web site, Jan. 2004):
"Data mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques."
A summary of the variety of methods encompassed in the term "data mining" follows below (Core Ideas).

1.3 Where is Data Mining Used?

Data mining is used in a variety of fields and applications. The military might use data mining to learn what roles various factors play in the accuracy of bombs. Intelligence agencies might use it to determine which of a huge quantity of intercepted communications are of interest. Security specialists might use these methods to determine whether a packet of network data constitutes a threat. Medical researchers might use them to predict the likelihood of a cancer relapse.
Although data mining methods and tools have general applicability, in this book most examples are chosen from the business world. Some common business questions one might address through data mining methods include:
1. From a large list of prospective customers, which are most likely to respond? We could use classification techniques (logistic regression, classification trees or other methods) to identify those individuals whose demographic and other data most closely match those of our best existing customers. Similarly, we can use prediction techniques to forecast how much individual prospects will spend.
2. Which customers are most likely to commit fraud (or might already have committed it)? We can use classification methods to identify (say) medical reimbursement applications that have a higher probability of involving fraud, and give them greater attention.


3. Which loan applicants are likely to default? We might use classification techniques to identify them (or logistic regression to assign a "probability of default" value).
4. Which customers are more likely to abandon a subscription service (telephone, magazine, etc.)? Again, we might use classification techniques to identify them (or logistic regression to assign a "probability of leaving" value). In this way, discounts or other enticements might be proffered selectively where they are most needed.

1.4 The Origins of Data Mining

Data mining stands at the confluence of the fields of statistics and machine learning (also known as artificial intelligence). A variety of techniques for exploring data and building models have been around for a long time in the world of statistics - linear regression, logistic regression, discriminant analysis and principal components analysis, for example. But the core tenets of classical statistics - computing is difficult and data are scarce - do not apply in data mining applications where both data and computing power are plentiful.
This gives rise to Daryl Pregibon's description of data mining as "statistics at scale and speed." A useful extension of this is "statistics at scale, speed, and simplicity." Simplicity in this case refers not to simplicity of algorithms, but rather to simplicity in the logic of inference. Due to the scarcity of data in the classical statistical setting, the same sample is used to make an estimate, and also to determine how reliable that estimate might be. As a result, the logic of the confidence intervals and hypothesis tests used for inference is elusive for many, and their limitations are not well appreciated. By contrast, the data mining paradigm of fitting a model with one sample and assessing its performance with another sample is easily understood.
Computer science has brought us "machine learning" techniques, such as trees and neural networks, that rely on computational intensity and are less structured than classical statistical models. In addition, the growing field of database management is also part of the picture.
The emphasis that classical statistics places on inference (determining whether a pattern or interesting result might have happened by chance) is missing in data mining. In comparison to statistics, data mining deals with large data sets in open-ended fashion, making it impossible to put the strict limits around the question being addressed that inference would require.
As a result, the general approach to data mining is vulnerable to the danger of overfitting, where a model is fit so closely to the available sample of data that it describes not merely structural characteristics of the data, but random peculiarities as well. In engineering terms, the model is fitting the noise, not just the signal.

1.5 Terminology and Notation

Because of the hybrid parentage of data mining, its practitioners often use multiple terms to refer to the same thing. For example, in the machine learning (artificial intelligence) field, the variable being predicted is the output variable or the target variable. To a statistician, it is the dependent variable. Here is a summary of terms used:
Algorithm refers to a specific procedure used to implement a particular data mining technique - classification tree, discriminant analysis, etc.
Attribute is also called a feature, variable, or, from a database perspective, a field.
Case is a set of measurements for one entity - e.g. the height, weight, age, etc. of one person; also called record, pattern or row (each row typically represents a record, each column a variable).
Confidence has a specific meaning in association rules of the type "If A and B are purchased, C is also purchased." Confidence is the conditional probability that C will be purchased, IF A and B are purchased.
Confidence also has a broader meaning in statistics (confidence interval), concerning the degree of error in an estimate that results from selecting one sample as opposed to another.
Dependent variable is the variable being predicted in supervised learning; also called output variable, target variable or outcome variable.
Estimation means the prediction of the value of a continuous output variable; also called prediction.
Feature is also called an attribute, variable, or, from a database perspective, a field.
Input variable is a variable doing the predicting in supervised learning; also called independent variable or predictor.
Model refers to an algorithm as applied to a data set, complete with its settings (many of the algorithms have parameters which the user can adjust).
Outcome variable is the variable being predicted in supervised learning; also called dependent variable, target variable or output variable.
Output variable is the variable being predicted in supervised learning; also called dependent variable, target variable or outcome variable.
P(A|B) is read as the probability that A will occur, given that B has occurred.
Pattern is a set of measurements for one entity - e.g. the height, weight, age, etc. of one person; also called record, case or row (each row typically represents a record, each column a variable).
Prediction means the prediction of the value of a continuous output variable; also called estimation.
Record is a set of measurements for one entity - e.g. the height, weight, age, etc. of one person; also called case, pattern or row (each row typically represents a record, each column a variable).
Score refers to a predicted value or class. Scoring new data means to use a model developed with training data to predict output values in new data.


Supervised Learning refers to the process of providing an algorithm (logistic regression, regression tree, etc.) with records in which an output variable of interest is known and the algorithm "learns" how to predict this value with new records where the output is unknown.
Test data refers to that portion of the data used only at the end of the model building and selection process to assess how well the final model might perform on additional data.
Training data refers to that portion of data used to fit a model.
Unsupervised Learning refers to analysis in which one attempts to learn something about the data other than predicting an output value of interest (whether it falls into clusters, for example).
Validation data refers to that portion of the data used to assess how well the model fits, to adjust some models, and to select the best model from among those that have been tried.
Variable is also called a feature, attribute, or, from a database perspective, a field.

1.6 Organization of Data Sets

Data sets are nearly always constructed and displayed so that variables are in columns, and records are in rows. In the example below (the Boston Housing data), the values of 14 variables are recorded for a number of census tracts. Each row represents a census tract - the first tract had a per capita crime rate (CRIM) of 0.02729, had 0 of its residential lots zoned for over 25,000 square feet (ZN), etc. In supervised learning situations, one of these variables will be the outcome variable, typically listed at the end or the beginning (in this case it is median value, MEDV, at the end).

1.7 Factors Responsible for the Rapid Growth of Data Mining

Perhaps the most important factor propelling the growth of data mining is the growth of data. The
mass retailer Walmart in 2003 captured 20 million transactions per day in a 10-terabyte database.
In 1950, the largest companies had only enough data to occupy, in electronic form, several dozen
megabytes (a terabyte is 1,000,000 megabytes).
The growth of data themselves is driven not simply by an expanding economy and knowledge
base, but by the decreasing cost and increasing availability of automatic data capture mechanisms.
Not only are more events being recorded, but more information per event is captured. Scannable
bar codes, point of sale (POS) devices, mouse click trails, and global positioning satellite (GPS)
data are examples.
The growth of the internet has created a vast new arena for information generation. Many of
the same actions that people undertake in retail shopping, exploring a library or catalog shopping
have close analogs on the internet, and all can now be measured in the most minute detail.


In marketing, a shift in focus from products and services to a focus on the customer and his or
her needs has created a demand for detailed data on customers.
The operational databases used to record individual transactions in support of routine business
activity can handle simple queries, but are not adequate for more complex and aggregate analysis.
Data from these operational databases are therefore extracted, transformed and exported to a data
warehouse - a large integrated data storage facility that ties together the decision support systems
of an enterprise. Smaller data marts devoted to a single subject may also be part of the system.
They may include data from external sources (e.g. credit rating data).
Many of the exploratory and analytical techniques used in data mining would not be possible
without today's computational power. The constantly declining cost of data storage and retrieval
has made it possible to build the facilities required to store and make available vast amounts of
data. In short, the rapid and continuing improvement in computing capacity is an essential enabler
of the growth of data mining.

Chapter 2

Overview of the Data Mining Process


2.1 Core Ideas in Data Mining

2.1.1 Classification

Classification is perhaps the most basic form of data analysis. The recipient of an offer might respond or not respond. An applicant for a loan might repay on time, repay late or declare bankruptcy. A credit card transaction might be normal or fraudulent. A packet of data traveling on a network might be benign or threatening. A bus in a fleet might be available for service or unavailable. The victim of an illness might be recovered, still ill, or deceased.
A common task in data mining is to examine data where the classification is unknown or will occur in the future, with the goal of predicting what that classification is or will be. Similar data where the classification is known are used to develop rules, which are then applied to the data with the unknown classification.

2.1.2 Prediction

Prediction is similar to classification, except that we are trying to predict the value of a variable (e.g. amount of purchase), rather than a class (e.g. purchaser or nonpurchaser).
Of course, in classification we are trying to predict a class, but the term "prediction" in this book refers to the prediction of the value of a continuous variable. (Sometimes in the data mining literature, the term "estimation" is used to refer to the prediction of the value of a continuous variable, and "prediction" may be used for both continuous and categorical data.)

2.1.3 Affinity Analysis

Large databases of customer transactions lend themselves naturally to the analysis of associations among items purchased, or "what goes with what." Association rules can then be used in a variety of ways. For example, grocery stores might use such information after a customer's purchases have all been scanned to print discount coupons, where the items being discounted are determined by mapping the customer's purchases onto the association rules.

2.1.4 Data Reduction

Sensible data analysis often requires distillation of complex data into simpler data. Rather than
dealing with thousands of product types, an analyst might wish to group them into a smaller
number of groups. This process of consolidating a large number of variables (or cases) into a
smaller set is termed data reduction.

2.1.5 Data Exploration

Unless our data project is very narrowly focused on answering a specific question determined in advance (in which case it has drifted more into the realm of statistical analysis than of data mining), an essential part of the job is to review and examine the data to see what messages it holds, much as a detective might survey a crime scene. Here, full understanding of the data may require a reduction in its scale or dimension to let us see the forest without getting lost in the trees. Similar variables (i.e. variables that supply similar information) might be aggregated into a single variable incorporating all the similar variables. Analogously, cluster analysis might be used to aggregate records together into groups of similar records.

2.1.6 Data Visualization

Another technique for exploring data to see what information they hold is graphical analysis. For example, combining all possible scatter plots of one variable against another on a single page allows us to quickly visualize relationships among variables.
The Boston Housing data is used to illustrate this. In this data set, each row is a city neighborhood (census tract, actually) and each column is a variable (crime rate, pupil/teacher ratio, etc.). The outcome variable of interest is the median value of a housing unit in the neighborhood. Figure 2.1 takes four variables from this data set and plots them against each other in a series of two-way scatterplots. In the lower left, for example, the crime rate (CRIM) is plotted on the x-axis and the median value (MEDV) on the y-axis. In the upper right, the same two variables are plotted on opposite axes. From the plots in the lower right quadrant, we see that, unsurprisingly, the more lower-economic-status residents a neighborhood has, the lower the median house value. From the upper right and lower left corners we see (again, unsurprisingly) that higher crime rates are associated with lower median values. An interesting result can be seen in the upper left quadrant. All the very high crime rates seem to be associated with a specific, mid-range value of INDUS (proportion of non-retail business acres per neighborhood). That a specific, middling level of INDUS is really associated with high crime rates seems dubious. A closer examination of the data reveals that each specific value of INDUS is shared by a number of neighborhoods, indicating that INDUS is measured for a broader area than that of the census tract neighborhood. The high crime rate associated so markedly with a specific value of INDUS indicates that the few neighborhoods with extremely high crime rates fall mainly within one such broader area.


Figure 2.1 Matrix scatterplot for four variables from the Boston Housing data.
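For readers who would like to reproduce this kind of matrix plot outside of Excel/XLMiner, here is a minimal sketch in Python. It assumes the Boston Housing data have been exported to a CSV file (the file name boston_housing.csv is hypothetical) with the column names used in this chapter.

    import pandas as pd
    import matplotlib.pyplot as plt
    from pandas.plotting import scatter_matrix

    # Load the Boston Housing data (file name is an assumption; export it from the workbook first).
    df = pd.read_csv("boston_housing.csv")

    # All pairwise scatterplots among the four variables discussed in the text.
    scatter_matrix(df[["CRIM", "INDUS", "LSTAT", "MEDV"]], figsize=(8, 8))
    plt.show()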

2.2 Supervised and Unsupervised Learning

A fundamental distinction among data mining techniques is between supervised methods and unsupervised methods.
Supervised learning algorithms are those used in classification and prediction. We must have data available in which the value of the outcome of interest (e.g. purchase or no purchase) is known. These "training data" are the data from which the classification or prediction algorithm "learns," or is "trained," about the relationship between predictor variables and the outcome variable. Once the algorithm has learned from the training data, it is then applied to another sample of data (the "validation data") where the outcome is known, to see how well it does in comparison to other models. If many different models are being tried out, it is prudent to save a third sample of known outcomes (the "test data") to use with the final, selected model to predict how well it will do. The model can then be used to classify or predict the outcome variable of interest in new cases where the outcome is unknown.


Simple linear regression analysis is an example of supervised learning (though rarely called that in the introductory statistics course where you likely first encountered it). The Y variable is the (known) outcome variable. A regression line is drawn to minimize the sum of squared deviations between the actual Y values and the values predicted by this line. The regression line can now be used to predict Y values for new values of X for which we do not know the Y value.
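As an illustration only (the book itself works in Excel/XLMiner), the following sketch fits a least-squares line to a few made-up (X, Y) pairs and then uses it to predict Y for new X values.

    import numpy as np

    # Training data: X values for which the Y outcome is known (values are made up).
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

    # Fit the line that minimizes the sum of squared deviations between actual and predicted Y.
    slope, intercept = np.polyfit(x, y, deg=1)

    # Apply the fitted line to new X values for which Y is unknown.
    new_x = np.array([6.0, 7.0])
    print(slope * new_x + intercept)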
Unsupervised learning algorithms are those used where there is no outcome variable to predict or classify. Hence, there is no "learning" from cases where such an outcome variable is known. Affinity analysis, data reduction methods and clustering techniques are all unsupervised learning methods.

2.3 The Steps In Data Mining

This book focuses on understanding and using data mining algorithms (steps 4-7 below). However, some of the most serious errors in data analysis result from a poor understanding of the problem - an understanding that must be developed well before we get into the details of algorithms to be used. Here is a list of the steps to be taken in a typical data mining effort:
1. Develop an understanding of the purpose of the data mining project (if it is a one-shot effort to answer a question or questions) or application (if it is an ongoing procedure).
2. Obtain the data set to be used in the analysis. This often involves random sampling from a large database to capture records to be used in an analysis. It may also involve pulling together data from different databases. The databases could be internal (e.g. past purchases made by customers) or external (credit ratings). While data mining deals with very large databases, usually the analysis to be done requires only thousands or tens of thousands of records.
3. Explore, clean, and preprocess the data. This involves verifying that the data are in reasonable condition. How should missing data be handled? Are the values in a reasonable range, given what you would expect for each variable? Are there obvious outliers? The data are reviewed graphically - for example, a matrix of scatterplots showing the relationship of each variable with each other variable. We also need to ensure consistency in the definitions of fields, units of measurement, time periods, etc.
4. Reduce the data, if necessary, and (where supervised training is involved) separate it into training, validation and test data sets. This can involve operations such as eliminating unneeded variables, transforming variables (for example, turning "money spent" into "spent > $100" vs. "spent <= $100"), and creating new variables (for example, a variable that records whether at least one of several products was purchased). Make sure you know what each variable means, and whether it is sensible to include it in the model.


5. Determine the data mining task (classification, prediction, clustering, etc.). This involves translating the general question or problem of step 1 into a more specific statistical question.
6. Choose the data mining techniques to be used (regression, neural nets, Ward's method of hierarchical clustering, etc.).
7. Use algorithms to perform the task. This is typically an iterative process - trying multiple variants, and often using multiple variants of the same algorithm (choosing different variables or settings within the algorithm). Where appropriate, feedback from the algorithm's performance on validation data is used to refine the settings.
8. Interpret the results of the algorithms. This involves making a choice as to the best algorithm to deploy, and, where possible, testing our final choice on the test data to get an idea of how well it will perform. (Recall that each algorithm may also be tested on the validation data for tuning purposes; in this way the validation data become a part of the fitting process and are likely to underestimate the error in the deployment of the model that is finally chosen.)
9. Deploy the model. This involves integrating the model into operational systems and running it on real records to produce decisions or actions. For example, the model might be applied to a purchased list of possible customers, and the action might be "include in the mailing if the predicted amount of purchase is > $10."
We concentrate in this book on steps 3-8.

2.4 SEMMA

The above steps encompass the steps in SEMMA, a methodology developed by SAS:
Sample from data sets, partition into training, validation and test data sets
Explore data set statistically and graphically
Modify: transform variables, impute missing values
Model: fit predictive models, e.g. regression, tree, collaborative filtering
Assess: compare models using validation data set
SPSS-Clementine also has a similar methodology, termed CRISP-DM (CRoss-Industry Standard Process for Data Mining).

2.5 Preliminary Steps

2.5.1 Sampling from a Database

Quite often, we will want to do our data mining analysis on less than the total number of records
that are available. Data mining algorithms will have varying limitations on what they can handle
in terms of the numbers of records and variables, limitations that may be specific to computing
power and capacity as well as software limitations. Even within those limits, many algorithms will
execute faster with smaller data sets.


From a statistical perspective, accurate models can be built with as few as several hundred records (see below). Hence, we will often want to sample a subset of records for model building.
If the event we are interested in is rare, however (e.g. customers purchasing a product in response to a mailing), sampling a subset of records may yield so few events (e.g. purchases) that we have little information on them. We would end up with lots of data on non-purchasers, but little on which to base a model that distinguishes purchasers from non-purchasers. In such cases, we would want our sampling procedure to over-weight the purchasers relative to the non-purchasers so that our sample ends up with a healthy complement of purchasers.
For example, if the purchase rate were 1% and we were going to be working with a sample of 1000 records, unweighted sampling would be expected to yield only 10 purchasers. If, on the other hand, a purchaser has a probability of being selected that is 99 times the probability of selecting a non-purchaser, then the proportions selected for the sample will be roughly equal.
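A minimal sketch of such weighted sampling follows. The file name, the column name "purchase" and the target sample size are all hypothetical; XLMiner's own sampling utility can be used to the same effect.

    import pandas as pd

    # Hypothetical prospect database with a rare outcome: purchase = 1 (purchaser), 0 (non-purchaser).
    df = pd.read_csv("prospects.csv")

    purchasers = df[df["purchase"] == 1]
    non_purchasers = df[df["purchase"] == 0]

    # Build a working sample of about 1000 records in which purchasers are heavily
    # over-weighted relative to their 1% share of the database.
    n_pur = min(500, len(purchasers))
    sample = pd.concat([
        purchasers.sample(n=n_pur, random_state=1),
        non_purchasers.sample(n=1000 - n_pur, random_state=1),
    ])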

2.5.2 Pre-processing and Cleaning the Data

2.5.2.1 Types of Variables


There are several ways of classifying variables. Variables can be numeric or text (character). They
can be continuous (able to assume any real numeric value, usually in a given range), integer (assuming only integer values), or categorical (assuming one of a limited number of values). Categorical
variables can be either numeric (1, 2, 3) or text (payments current, payments not current, bankrupt). Categorical variables can also be unordered (North America, Europe, Asia) or ordered (high
value, low value, nil value).
2.5.2.2 Variable Selection
More is not necessarily better when it comes to selecting variables for a model. Other things being equal, parsimony, or compactness, is a desirable feature in a model.
For one thing, the more variables we include, the greater the number of records we will need to assess relationships among the variables. 15 records may suffice to give us a rough idea of the relationship between Y and a single independent variable X. If we now want information about the relationship between Y and fifteen independent variables X1 ... X15, fifteen records will not be enough (each estimated relationship would have an average of only one record's worth of information, making the estimate very unreliable).
2.5.2.3 Overfitting
For another thing, the more variables we include, the greater the risk of overfitting the data. What is overfitting?
Consider the following hypothetical data about advertising expenditures in one time period, and sales in a subsequent time period:

Advertising   Sales
239            514
364            789
602            550
644           1386
770           1394
789           1440
911           1354

Figure 2.2: X-Y Scatterplot for Advertising and Sales Data

We could connect up these points with a smooth and very complex function, one that explains all these data points perfectly and leaves no error (residuals).



X-Y scatterplot, smoothed

However, we can see that such a curve is unlikely to be that accurate, or even useful, in predicting future sales on the basis of advertising expenditures.
A basic purpose of building a model is to describe relationships among variables in such a way that this description will do a good job of predicting future outcome (dependent) values on the basis of future predictor (independent) values. Of course, we want the model to do a good job of describing the data we have, but we are more interested in its performance with data to come.
In the above example, a simple straight line might do a better job of predicting future sales on the basis of advertising than the complex function does.
In this example, we devised a complex function that fit the data perfectly, and in doing so over-reached. We certainly ended up "explaining" some variation in the data that was nothing more than chance variation. We have mislabeled the noise in the data as if it were a signal.
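To make the point concrete, here is a short sketch (illustration only, not part of the XLMiner workflow) that fits both a straight line and a degree-6 polynomial to the seven advertising/sales pairs above. The polynomial has as many coefficients as there are points, so it reproduces the observed sales exactly, yet its predictions at new advertising levels are far less trustworthy than those of the simple line.

    import numpy as np

    advertising = np.array([239, 364, 602, 644, 770, 789, 911], dtype=float)
    sales = np.array([514, 789, 550, 1386, 1394, 1440, 1354], dtype=float)

    x = advertising / 1000.0  # rescale to keep the high-degree fit numerically well behaved

    complex_fit = np.polyfit(x, sales, deg=6)  # passes through every point: zero residuals
    linear_fit = np.polyfit(x, sales, deg=1)   # captures only the broad upward trend

    for ad in (500.0, 850.0, 1000.0):          # hypothetical future advertising levels
        print(ad, np.polyval(complex_fit, ad / 1000.0), np.polyval(linear_fit, ad / 1000.0))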
Similarly, we can add predictors to a model to sharpen its performance with the data at hand. Consider a database of 100 individuals, half of whom have contributed to a charitable cause. Information about income, family size, and zip code might do a fair job of predicting whether or not someone is a contributor. If we keep adding additional predictors, we can improve the performance of the model with the data at hand and reduce the misclassification error to a negligible level. However, this low error rate is misleading, because it likely includes spurious explanations.
For example, one of the variables might be height. We have no basis in theory to suppose that tall people might contribute more or less to charity, but if there are several tall people in our sample and they just happened to contribute heavily to charity, our model might include a term for height - the taller you are, the more you will contribute. Of course, when the model is applied to additional data, it is likely that this will not turn out to be a good predictor.
If the data set is not much larger than the number of predictor variables, then it is very likely that a spurious relationship like this will creep into the model. Continuing with our charity example, with a small sample just a few of whom are tall, whatever the contribution level of tall people may be, the computer is tempted to attribute it to their being tall. If the data set is very large relative to the number of predictors, this is less likely. In such a case, each predictor must help predict the outcome for a large number of cases, so the job it does is much less dependent on just a few cases, which might be flukes.
Overfitting can also result from the application of many different models, from which the best performing one is selected (more about this below).
2.5.2.4 How Many Variables and How Much Data?
Statisticians could give us procedures to learn with some precision how many records we would need to achieve a given degree of reliability with a given data set and a given model. Data miners' needs are usually not so precise, so we can often get by with rough rules of thumb. A good rule of thumb is to have ten records for every predictor variable. Another, used by Delmater and Hancock for classification procedures (2001, p. 68), is to have at least 6*M*N records, where
M = number of outcome classes, and
N = number of variables
(For example, with two outcome classes and ten variables, this rule calls for at least 6*2*10 = 120 records.)
Even when we have an ample supply of data, there are good reasons to pay close attention to the variables that are included in a model. Someone with domain knowledge (i.e. knowledge of the business process and the data) should be consulted - knowledge of what the variables represent can often help build a good model and avoid errors.
For example, "shipping paid" might be an excellent predictor of "amount spent," but it is not a helpful one. It will not give us much information about what distinguishes high-paying from low-paying customers that can be put to use with future prospects.
In general, compactness or parsimony is a desirable feature in a model. A matrix of X-Y plots can be useful in variable selection. In such a matrix, we can see at a glance x-y plots for all variable combinations. A straight line would be an indication that one variable is exactly correlated with another. Typically, we would want to include only one of them in our model. The idea is to weed out irrelevant and redundant variables from our model.
2.5.2.5 Outliers
The more data we are dealing with, the greater the chance of encountering erroneous values resulting from measurement error, data entry error, or the like. If the erroneous value is in the same range as the rest of the data, it may be harmless. If it is well outside the range of the rest of the data (a misplaced decimal, for example), it may have a substantial effect on some of the data mining procedures we plan to use.
Values that lie far away from the bulk of the data are called outliers. The term "far away" is deliberately left vague because what is or is not called an outlier is basically an arbitrary decision. Analysts use rules of thumb like "anything over 3 standard deviations away from the mean is an outlier," but no statistical rule can tell us whether such an outlier is the result of an error. In this statistical sense, an outlier is not necessarily an invalid data point, it is just a distant data point.
The purpose of identifying outliers is usually to call attention to data that need further review. We might come up with an explanation looking at the data - in the case of a misplaced decimal, this is likely. We might have no explanation, but know that the value is wrong - a temperature of 178 degrees F for a sick person. Or, we might conclude that the value is within the realm of possibility and leave it alone. All these are judgments best made by someone with domain knowledge. (Domain knowledge is knowledge of the particular application being considered - direct mail, mortgage finance, etc. - as opposed to technical knowledge of statistical or data mining procedures.) Statistical procedures can do little beyond identifying the record as something that needs review.
If manual review is feasible, some outliers may be identified and corrected. In any case, if the number of records with outliers is very small, they might be treated as missing data.
How do we inspect for outliers? One technique in Excel is to sort the records by the first column, then review the data for very large or very small values in that column. Then repeat for each successive column. For a more automated approach that considers each record as a unit, clustering techniques could be used to identify clusters of one or a few records that are distant from others. Those records could then be examined.
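As a sketch of the "3 standard deviations" rule of thumb applied to every numeric column at once (an automated alternative to sorting column by column; the file name is hypothetical):

    import pandas as pd

    df = pd.read_csv("data.csv")  # hypothetical file name

    # For each numeric column, measure how many standard deviations each value lies from the column mean.
    numeric = df.select_dtypes(include="number")
    z = (numeric - numeric.mean()) / numeric.std()

    # Flag any record containing at least one value more than 3 standard deviations from the mean.
    suspects = df[(z.abs() > 3).any(axis=1)]
    print(suspects)  # candidates for manual review, not necessarily errors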

2.5.2.6 Missing Values


Typically, some records will contain missing values. If the number of records with missing values is small, those records might be omitted.
However, if we have a large number of variables, even a small proportion of missing values can affect a lot of records. Even with only 30 variables, if only 5% of the values are missing (spread randomly and independently among cases and variables), then almost 80% of the records would have to be omitted from the analysis. (The chance that a given record would escape having a missing value is 0.95^30 = 0.215.)
An alternative to omitting records with missing values is to replace the missing value with an imputed value, based on the other values for that variable across all records. For example, if, among the 30 variables, household income is missing for a particular record, we might substitute instead the mean household income across all records.
Doing so does not, of course, add any information about how household income affects the outcome variable. It merely allows us to proceed with the analysis and not lose the information contained in this record for the other 29 variables. Note that using such a technique will understate the variability in a data set. However, since we can assess variability, and indeed the performance of our data mining technique, using the validation data, this need not present a major problem.
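A minimal sketch of this kind of mean imputation for a single variable (the file and column names are hypothetical):

    import pandas as pd

    df = pd.read_csv("data.csv")  # hypothetical file name

    # Replace missing household income with the mean computed over the records that do have a value.
    mean_income = df["household_income"].mean()  # pandas ignores missing values when averaging
    df["household_income"] = df["household_income"].fillna(mean_income)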


2.5.2.7 Normalizing (Standardizing) the Data


Some algorithms require that the data be normalized before the algorithm can be effectively implemented. To normalize the data, we subtract the mean from each value and divide by the standard deviation of the resulting deviations from the mean. In effect, we are expressing each value as the number of standard deviations away from the mean.
To consider why this might be necessary, consider the case of clustering. Clustering typically involves calculating a distance measure that reflects how far each record is from a cluster center, or from other records. With multiple variables, different units will be used - days, dollars, counts, etc. If the dollars are in the thousands and everything else is in the 10s, the dollar variable will come to dominate the distance measure. Moreover, changing units from (say) days to hours or months could completely alter the outcome.
Data mining software, including XLMiner, typically has an option that normalizes the data in those algorithms where it may be required. It is an option, rather than an automatic feature of such algorithms, because there are situations where we want the different variables to contribute to the distance measure in proportion to their scale.
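The transformation itself is simple. A sketch for the numeric columns of a data set follows (illustration only; checking XLMiner's normalization option applies the same idea):

    import pandas as pd

    df = pd.read_csv("data.csv")  # hypothetical file name
    numeric = df.select_dtypes(include="number")

    # Express each value as the number of standard deviations away from its column mean.
    normalized = (numeric - numeric.mean()) / numeric.std()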

2.5.3 Partitioning the Data

In supervised learning, a key question presents itself:


How well will our prediction or classification model perform when we apply it to new data? We are particularly interested in comparing the performance among various models, so we can choose the one we think will do the best when it is actually implemented.
At first glance, we might think it best to choose the model that did the best job of classifying or predicting the outcome variable of interest with the data at hand. However, when we use the same data to develop the model and then assess its performance, we introduce bias.
This is because when we pick the model that does best with the data, this model's superior performance comes from two sources:
A superior model
Chance aspects of the data that happen to match the chosen model better than they match other models.
The latter is a particularly serious problem with techniques (such as trees and neural nets) that do not impose linear or other structure on the data, and thus end up overfitting it.
To address this problem, we simply divide (partition) our data and develop our model using only one of the partitions. After we have a model, we try it out on another partition and see how it does. We can measure how it does in several ways. In a classification model, we can count the proportion of held-back records that were misclassified. In a prediction model, we can measure the residuals (errors) between the predicted values and the actual values.
We will typically deal with two or three partitions.


2.5.3.1 Training Partition


Typically the largest partition, these are the data used to build the various models we are examining.
The same training partition is generally used to develop multiple models.
2.5.3.2 Validation Partition
This partition (sometimes called the "test" partition) is used to assess the performance of each model, so that you can compare models and pick the best one. In some algorithms (e.g. classification and regression trees), the validation partition may be used in an automated fashion to tune and improve the model.
2.5.3.3 Test Partition
This partition (sometimes called the "holdout" or "evaluation" partition) is used if we need to assess the performance of the chosen model with new data.
Why have both a validation and a test partition? When we use the validation data to assess multiple models and then pick the model that does best with the validation data, we again encounter another (lesser) facet of the overfitting problem - chance aspects of the validation data that happen to match the chosen model better than other models.
The random features of the validation data that enhance the apparent performance of the chosen model will not likely be present in new data to which the model is applied. Therefore, we may have overestimated the accuracy of our model. The more models we test, the more likely it is that one of them will be particularly effective in explaining the noise in the validation data. Applying the model to the test data, which it has not seen before, will provide an unbiased estimate of how well it will do with new data.
Sometimes (for example, when we are concerned mainly with finding the best model and less with exactly how well it will do), we might use only training and validation partitions.
The partitioning should be done randomly to avoid getting a biased partition. In XLMiner, the user can supply a variable (column) with a value "t" (training), "v" (validation) and "s" (test) assigned to each case (row). Alternatively, the user can ask XLMiner to do the partitioning randomly.
Note that with nearest neighbor algorithms for supervised learning, each record in the validation
set is compared to all the records in the training set to locate its nearest neighbor(s). In a sense,
the training partition itself is the model - any application of the model to new data requires the
use of the training data. So the use of two partitions is an essential part of the classification or
prediction process, not merely a way to improve or assess it. Nonetheless, we can still interpret the
error in the validation data in the same way we would interpret error from any other model.
XLMiner has a utility that can divide the data up into training, validation and test sets either
randomly according to user-set proportions, or on the basis of a variable that denotes which partition
a record is to belong to. It is possible (though cumbersome) to divide the data into more than 3
partitions by successive partitioning - e.g. divide the initial data into 3 partitions, then take one
of those partitions and partition it further.
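A sketch of random partitioning into 60% training, 30% validation and 10% test (the proportions and file name are arbitrary; XLMiner's partitioning utility accomplishes the same thing):

    import pandas as pd

    df = pd.read_csv("data.csv")  # hypothetical file name

    # Shuffle the rows, then cut the shuffled data into three non-overlapping partitions.
    shuffled = df.sample(frac=1, random_state=1)
    n = len(shuffled)
    train = shuffled.iloc[: int(0.6 * n)]
    validation = shuffled.iloc[int(0.6 * n): int(0.9 * n)]
    test = shuffled.iloc[int(0.9 * n):]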

2.6 Building a Model - An Example with Linear Regression

Let's go through the steps typical of many data mining tasks, using a familiar procedure - multiple linear regression. This will help us understand the overall process before we begin tackling new algorithms. We will illustrate the Excel procedure using XLMiner.
1. Purpose. Let's assume that the purpose of our data mining project is to predict the median house value in small Boston-area neighborhoods.
2. Obtain the data. We will use the Boston Housing data. The data set in question is small enough that we do not need to sample from it - we can use it in its entirety.
3. Explore, clean, and preprocess the data.
Let's look first at the description of the variables (crime rate, number of rooms per dwelling, etc.) to be sure we understand them all. These descriptions are available on the "description" tab on the worksheet, as is a web source for the data set. They all seem fairly straightforward, but this is not always the case. Often variable names are cryptic and their descriptions may be unclear or missing.
This data set has 14 variables, and a description of each variable is given in the table below.

CRIM      Per capita crime rate by town
ZN        Proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS     Proportion of non-retail business acres per town
CHAS      Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
NOX       Nitric oxides concentration (parts per 10 million)
RM        Average number of rooms per dwelling
AGE       Proportion of owner-occupied units built prior to 1940
DIS       Weighted distances to five Boston employment centers
RAD       Index of accessibility to radial highways
TAX       Full-value property-tax rate per $10,000
PTRATIO   Pupil-teacher ratio by town
B         1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
LSTAT     % Lower status of the population
MEDV      Median value of owner-occupied homes in $1000's



The data themselves look like this:

It is useful to pause and think about what the variables mean, and whether they should be
included in the model. Consider the variable TAX. At first glance, tax on a home is usually a
function of its assessed value, so there is some circularity in the model - we want to predict a
home's value using TAX as a predictor, yet TAX itself is determined by a home's value. TAX
might be a very good predictor of home value in a numerical sense, but would it be useful if we
wanted to apply our model to homes whose assessed value might not be known? Reflect, though,
that the TAX variable, like all the variables, pertains to the average in a neighborhood, not to
individual homes. While the purpose of our inquiry has not been spelled out, it is possible that at
some stage we might want to apply a model to individual homes and, in such a case, the
neighborhood TAX value would be a useful predictor. So, we will keep TAX in the analysis for now.
In addition to these variables, the data set also contains an additional variable, CAT.MEDV,
which has been created by categorizing median value (MEDV) into two categories, "high" and
"low". CAT.MEDV is a categorical variable created from MEDV: if MEDV >= $30,000 then
CAT.MEDV = 1, otherwise CAT.MEDV = 0. If we were trying to categorize the cases into high
and low median values, we would use CAT.MEDV instead of MEDV. As it is, we do not need
CAT.MEDV, so we will leave it out of the analysis.
There are a couple of aspects of MEDV - the median house value - that bear noting. For
one thing, it is quite low, since it dates from the 1970's. For another, there are a lot of 50's, the
top value. It could be that median values above $50,000 were recorded as $50,000.
We are left with 13 independent (predictor) variables, which can all be used.
It is also useful to check for outliers that might be errors. For example, suppose the RM (#
of rooms) column looked like this, after sorting the data in descending order based on rooms:


We can tell right away that the 79.29 is in error - no neighborhood is going to have houses that
have an average of 79 rooms. All other values are between 3 and 9. Probably, the decimal
was misplaced and the value should be 7.929. (This hypothetical error is not present in the
data set supplied with XLMiner.)
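A simple illustration of this kind of screening (outside XLMiner) is sketched below in Python; the column name RM and the 3-to-9 plausibility bounds are assumptions taken from the discussion above, not part of the data set's documentation.

    import pandas as pd

    def flag_outliers(df, column='RM', low=3.0, high=9.0):
        """Sort a column in descending order and flag values outside [low, high]."""
        sorted_vals = df[column].sort_values(ascending=False)
        suspects = sorted_vals[(sorted_vals < low) | (sorted_vals > high)]
        return suspects   # values to inspect by hand, e.g. a misplaced decimal

    # usage: flag_outliers(pd.read_csv('BostonHousing.csv'))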

4. Reduce the data and partition it into training, validation and test partitions. Our data set has
only 13 variables, so data reduction is not required. If we had many more variables, at this
stage we might want to apply a variable reduction technique such as Principal Components
Analysis to consolidate multiple similar variables into a smaller number of variables. Our task
is to predict the median house value, and then assess how well that prediction does. We will
partition the data into a training set to build the model, and a validation set to see how well
the model does. This technique is part of the supervised learning process in classification
and prediction problems. These are problems in which we know the class or value of the
outcome variable for some data, and we want to use that data in developing a model that
can then be applied to other data where that value is unknown.
In Excel, select XLMiner → Partition and the following dialog box appears:


Here we specify which data range is to be partitioned, and which variables are to be included
in the partitioned data set.
The partitioning can be handled in one of two ways:
a) The data set can have a partition variable that governs the division into training and
validation partitions (e.g. 1 = training, 2 = validation), or
b) The partitioning can be done randomly. If the partitioning is done randomly, we have the
option of specifying a seed for randomization (which has the advantage of letting us duplicate
the same random partition later, should we need to).
In this case, we will divide the data into two partitions - training and validation. The training
partition is used to build the model, the validation partition is used to see how well the model does
when applied to new data. We need to specify the percent of the data used in each partition.
Note: Although we are not using it here, a test partition might also be used.
Typically, a data mining endeavor involves testing multiple models, perhaps with multiple
settings on each model. When we train just one model and try it out on the validation data, we
can get an unbiased idea of how it might perform on more such data.


However, when we train lots of models and use the validation data to see how each one does,
then pick the best performing model, the validation data no longer provide an unbiased estimate
of how the model might do with more data. By playing a role in picking the best model, the
validation data have become part of the model itself. In fact, several algorithms (classification and
regression trees, for example) explicitly factor validation data into the model building algorithm
itself (in pruning trees, for example).
Models will almost always perform better with the data they were trained on than fresh data.
Hence, when validation data are used in the model itself, or when they are used to select the best
model, the results achieved with the validation data, just as with the training data, will be overly
optimistic.
The test data, which should not be used either in the model building or model selection process,
can give a better estimate of how well the chosen model will do with fresh data. Thus, once we
have selected a final model, we will apply it to the test data to get an estimate of how well it will
actually perform.
1. Determine the data mining task. In this case, as noted, the specic task is to predict the
value of MEDV using the 13 predictor variables.
2. Choose the technique. In this case, it is multiple linear regression.
3. Having divided the data into training and validation partitions, we can use XLMiner to build
a multiple linear regression model with the training data - we want to predict median house
price on the basis of all the other values.

24

2. Overview of the Data Mining Process


4. Use the algorithm to perform the task. In XLMiner, we select Prediction → Multiple Linear
Regression:

The variable MEDV is selected as the output (dependent) variable, the variable CAT.MEDV
is left unused, and the remaining variables are all selected as input (independent or predictor)
variables. We will ask XLMiner to show us the fitted values on the training data, as well as
the predicted values (scores) on the validation data.
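For readers who want to reproduce this step outside XLMiner, a minimal sketch in Python follows. It assumes the data have already been split into training and validation data frames (the names train and valid are invented for the example) and that the column headings match the variable table above.

    import pandas as pd
    from sklearn.linear_model import LinearRegression

    predictors = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
                  'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']

    def fit_and_score(train, valid):
        """Fit a multiple linear regression on the training partition and
        score (predict) the validation partition."""
        model = LinearRegression().fit(train[predictors], train['MEDV'])
        fitted = model.predict(train[predictors])    # fitted values on training data
        scores = model.predict(valid[predictors])    # predicted values on validation data
        return model, fitted, scores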


XLMiner produces standard regression output, but we will defer that for now, as well as the
more advanced options displayed above. See the chapter on multiple linear regression, or the
user documentation for XLMiner, for more information. Rather, we will review the predictions
themselves. Here are the predicted values for the first few records in the training data, along
with the actual values and the residual (prediction error). Note that these predicted values
would often be called the fitted values, since they are for the records that the model was fit to.



And here are the results for the validation data:

Let's compare the prediction error for the training and validation data:

Prediction error can be measured several ways. Three measures produced by XLMiner are
shown above.
On the right is the average error - simply the average of the residuals (errors). In both cases,
it is quite small, indicating that, on balance, predictions average about right - our predictions
are unbiased. Of course, this simply means that the positive errors and negative errors
balance each other out. It tells us nothing about how large those positive and negative errors
are.
The residual sum of squares on the left adds up the squared errors, so whether an error is
positive or negative it contributes just the same. However, this sum does not yield information
about the size of the typical error.
The RMS error, or root mean squared error, is perhaps the most useful measure of all. It is
the square root of the average squared error, and so gives an idea of the typical error (whether
positive or negative) in the same scale as the original data.
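These three measures are easy to compute directly from the residuals; a small illustrative sketch follows (in Python rather than XLMiner, with an invented function name).

    import numpy as np

    def error_summary(actual, predicted):
        """Average error, residual sum of squares, and RMS error for a set of predictions."""
        residuals = np.asarray(predicted) - np.asarray(actual)
        average_error = residuals.mean()        # positive and negative errors cancel here
        residual_ss = (residuals ** 2).sum()    # every error contributes, sign ignored
        rms_error = np.sqrt((residuals ** 2).mean())
        return average_error, residual_ss, rms_error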


As we might expect, the RMS error for the validation data ($5,337), which the model is seeing
for the first time in making these predictions, is larger than for the training data ($4,518),
which were used in training the model.
5. Interpret the results.
At this stage, we would typically try other prediction algorithms (regression trees, for example)
and see how they do, error-wise. We might also try different settings on the various models (for
example, we could use the "best subsets" option in multiple linear regression to choose a reduced
set of variables that might perform better with the validation data). After choosing the best
model (typically, the model with the lowest error while also recognizing that simpler is better),
we then use that model to predict the output variable in fresh data.
These steps will be covered in more detail in the analysis of cases.
6. Deploy the model. After the best model is chosen, it is then applied to new data to predict
MEDV for records where this value is unknown. This, of course, was the overall purpose.

2.6.1 Can Excel Handle the Job?

An important aspect of this process to note is that the heavy-duty analysis does not necessarily
require huge numbers of records. The data set to be analyzed may have millions of records, of
course, but in doing multiple linear regression or applying a classification tree the use of a sample
of (say) 20,000 is likely to yield as accurate an answer as using the whole data set. The principle
involved is the same as the principle behind polling - 2,000 voters, if sampled judiciously, can give
an estimate of the entire population's opinion within one or two percentage points.
Therefore, in most cases, the number of records required in each partition (training, validation
and test) can be accommodated within the rows allowed by Excel.
Of course, we need to get those records into Excel, so the standard version of XLMiner provides
an interface for random sampling of records from an external database.
Likewise, we need to apply the results of our analysis to a large database, so the standard
version of XLMiner has a facility for scoring the output of the model to an external database. For
example, XLMiner would write an additional column (variable) to the database consisting of the
predicted purchase amount for each record.

Chapter 3

Supervised Learning - Classification & Prediction
In supervised learning, we are interested in predicting the class (classification) or continuous value
(prediction) of an outcome variable. In the previous chapter, we worked through a simple example.
Let's now examine the question of how to judge the usefulness of a classifier or predictor and how
to compare different ones.

3.1 Judging Classification Performance

Not only do we have a wide choice of different types of classifiers to choose from but within each
type of classifier we have many options, such as how many nearest neighbors to use in a k-nearest
neighbors classifier, the minimum number of cases we should require in a leaf node in a tree classifier,
which subsets of predictors to use in a logistic regression model, and how many hidden layer neurons
to use in a neural net. Before we study these various algorithms in detail and face decisions on how
to set these options, we need to know how we will measure success.

3.1.1 A Two-class Classifier

Let us first look at a single classifier for two classes. The two-class situation is certainly the most
common and occurs very frequently in practice. We will extend our analysis to more than two
classes later.
A natural criterion for judging the performance of a classifier is the probability that it makes
a misclassification error. A classifier that makes no errors would be perfect but we do not expect
to be able to construct such classifiers in the real world due to noise and to not having all the
information needed to precisely classify cases. Is there a minimum probability of misclassification
we should require of a classifier?
At a minimum, we hope to do better than the crude rule "classify everything as belonging to
the most prevalent class." Imagine that, for each case, we know what the probability is that it
belongs to one class or the other. Suppose that the two classes are denoted by C0 and C1. Let p(C0)
and p(C1) be the apriori probabilities that a case belongs to C0 and C1 respectively. The apriori
probability is the probability that a case belongs to a class without any more knowledge about it
than that it belongs to a population where the proportion of C0's is p(C0) and the proportion of C1's
is p(C1). In this situation we will minimize the chance of a misclassification error by assigning class
C1 to the case if p(C1) > p(C0) and to C0 otherwise. The probability of making a misclassification
error would be the minimum of p(C0) and p(C1). If we are using the misclassification rate as our
criterion, any classifier that uses predictor variables must have an error rate better than this.
What is the best performance we can expect from a classifier? Clearly the more training data
available to a classifier the more accurate it will be. Suppose we had a huge amount of training data,
would we then be able to build a classifier that makes no errors? The answer is no. The accuracy
of a classifier depends critically on how separated the classes are with respect to the predictor
variables that the classifier uses. We can use the well-known Bayes formula from probability
theory to derive the best performance we can expect from a classifier for a given set of predictor
variables if we had a very large amount of training data. Bayes formula uses the distributions of
the decision variables in the two classes to give us a classifier that will have the minimum error
amongst all classifiers that use the same predictor variables. This classifier uses the Minimum Error
Bayes Rule.

3.1.2 Bayes Rule for Minimum Error

Let us take a simple situation where we have just one continuous predictor variable, say X, to use
in predicting our two-class outcome variable. Now X is a random variable, since its value depends
on the individual case we sample from the population consisting of all possible cases of the class to
which the case belongs.
Suppose that we have a very large training data set. Then the relative frequency histogram of
the variable X in each class would be almost identical to the probability density function (p.d.f.)
of X for that class. Let us assume that we have a huge amount of training data and so we know
the p.d.f.s accurately. These p.d.f.s are denoted f0 (x) and f1 (x) for classes C0 and C1 in Fig. 1
below.
Figure 1


Now suppose we wish to classify an object for which the value of X is x0. Let us use Bayes
formula to predict the probability that the object belongs to class 1 conditional on the fact that it
has an X value of x0. Applying Bayes formula, the probability, denoted by p(C1|X = x0), is given
by:
$$p(C_1|X = x_0) = \frac{p(X = x_0|C_1)\,p(C_1)}{p(X = x_0|C_0)\,p(C_0) + p(X = x_0|C_1)\,p(C_1)}$$
Writing this in terms of the density functions, we get
$$p(C_1|X = x_0) = \frac{f_1(x_0)\,p(C_1)}{f_0(x_0)\,p(C_0) + f_1(x_0)\,p(C_1)}$$

Notice that to calculate p(C1|X = x0) we need to know the apriori probabilities p(C0) and
p(C1). Since there are only two possible classes, if we know p(C1) we can always compute p(C0)
because p(C0) = 1 - p(C1). The apriori probability p(C1) is the probability that an object belongs
to C1 without any knowledge of the value of X associated with it. Bayes formula enables us to
update this apriori probability to the aposteriori probability, the probability of the object belonging
to C1 after knowing that its X value is x0.
When p(C1) = p(C0) = 0.5, the formula shows that p(C1|X = x0) > p(C0|X = x0) if f1(x0) >
f0(x0). This means that if x0 is greater than a (Figure 1), and we classify the object as belonging
to C1, we will make a smaller misclassification error than if we were to classify it as belonging to C0.
Similarly, if x0 is less than a, and we classify the object as belonging to C0, we will make a smaller
misclassification error than if we were to classify it as belonging to C1. If x0 is exactly equal to a
we have a 50% chance of making an error for either classification.
Figure 2

What if the prior class probabilities were not the same (Figure 2)? Suppose C0 is twice as likely
apriori as C1 . Then the formula says that p(C1 |X = x0 ) > p(C0 |X = x0 ) if f1 (x0 ) > 2 f0 (x0 ).


The new boundary value, b, for classification will be to the right of a as shown in Figure 2. This is
intuitively what we would expect. If a class is more likely we would expect the cut-off to move in
a direction that would increase the range over which it is preferred.
In general we will minimize the misclassification error rate if we classify a case as belonging
to C1 if p(C1) f1(x0) > p(C0) f0(x0), and to C0 otherwise. This rule holds even when X is a
vector consisting of several components, each of which is a random variable. In the remainder of
this note we shall assume that X is a vector.
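To make the rule concrete, here is a small numerical sketch (an illustration only, not part of the text's example) in which the two class-conditional densities are taken to be Normal; the means, standard deviations and priors below are invented for the illustration.

    from scipy.stats import norm

    # invented class-conditional densities and apriori probabilities for illustration only
    p0, p1 = 2/3, 1/3                      # p(C0), p(C1)
    f0 = norm(loc=0.0, scale=1.0).pdf      # density of X in class C0
    f1 = norm(loc=2.0, scale=1.0).pdf      # density of X in class C1

    def bayes_classify(x0):
        """Minimum Error Bayes Rule: pick C1 when p(C1) f1(x0) > p(C0) f0(x0)."""
        posterior_c1 = p1 * f1(x0) / (p0 * f0(x0) + p1 * f1(x0))
        return (1 if posterior_c1 > 0.5 else 0), posterior_c1

    # e.g. bayes_classify(1.0) returns the predicted class and p(C1 | X = 1.0)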
An important advantage of Bayes Rule is that, as a by-product of classifying a case, we can
compute the conditional probability that the case belongs to each class. This has two advantages.
First, we can use this probability as a "score" for each case that we are classifying. The score
enables us to rank cases that we have predicted as belonging to a class in order of confidence that we
have made a correct classification. This capability is important in developing a lift curve (explained
later) that is important for many practical data mining applications.
Second, it enables us to compute the expected profit or loss for a given case. This gives us a
better decision criterion than misclassification error when the loss due to error is different for the
two classes.

3.1.3 Practical Assessment of a Classifier Using Misclassification Error as the Criterion

In practice, we can estimate p(C1) and p(C0) from the data we are using to build the classifier by
simply computing the proportion of cases that belong to each class. Of course, these are estimates
and they can be incorrect, but if we have a large enough data set and neither class is very rare
our estimates will be reliable. Sometimes, we may be able to use public data such as census data
to estimate these proportions. However, in most practical business settings we will not know f1(x)
and f0(x). If we want to apply Bayes Rule we will need to estimate these density functions in some
way. Many classification methods can be interpreted as being methods for estimating such density
functions.¹ In practice X will almost always be a vector. This complicates the task because of the
curse of dimensionality - the difficulty and complexity of the calculations increases exponentially,
not linearly, as the number of variables increases.
To obtain an honest estimate of classification error, let us suppose that we have partitioned a
data set into training and validation data sets by random selection of cases. Let us assume that we
have constructed a classifier using the training data. When we apply it to the validation data, we
will classify each case into C0 or C1. These classifications can be displayed in what is known as a
confusion table, with rows and columns corresponding to the true and predicted classes respectively.
(Although we can summarize our results in a confusion table for training data as well, the resulting
confusion table is not useful for getting an honest estimate of the misclassification rate due to the
danger of overfitting.)

¹ There are classifiers that focus on simply finding the boundary between the regions to predict each class without
being concerned with estimating the density of cases within each region. For example, Support Vector Machine
classifiers have this characteristic.

Confusion Table (Validation Cases)

                      Predicted Class C0                      Predicted Class C1
True Class C0         True Negatives (number of correctly    False Positives (number of cases
                      classified cases that belong to C0)    incorrectly classified as C1 that
                                                             belong to C0)
True Class C1         False Negatives (number of cases       True Positives (number of correctly
                      incorrectly classified as C0 that      classified cases that belong to C1)
                      belong to C1)

If we denote the number in the cell at row i and column j by Nij, the estimated misclassification
rate Err = (N01 + N10)/Nval, where Nval = (N00 + N01 + N10 + N11), the total number of cases
in the validation data set. If Nval is reasonably large, our estimate of the misclassification rate is
probably reasonably accurate. We can compute a confidence interval using the standard formula
for estimating a population proportion from a random sample.
Note that we are assuming that the cost (or benefit) of making correct classifications is zero. At
first glance, this may seem incomplete. After all, the benefit (negative cost) of correctly classifying a
buyer as a buyer would seem substantial. And, in other circumstances (e.g. scoring our classification
algorithm to fresh data to implement our decisions), it will be appropriate to consider the actual net
dollar impact of each possible classification (or misclassification). Here, however, we are attempting
to assess the value of a classifier in terms of classification error, so it greatly simplifies matters if we
can capture all cost/benefit information in the misclassification cells. So, instead of recording the
benefit of correctly classifying a buyer, we record the cost of failing to classify him as a buyer. It
amounts to the same thing and our goal becomes the minimization of costs, whether the costs are
actual costs or foregone benefits (opportunity costs).
The table below gives an idea of how the accuracy of the estimate varies with Nval. The column
headings are values of the misclassification rate and the rows give the desired accuracy in estimating
the misclassification rate as measured by the half-width of the confidence interval at the 99%
confidence level. For example, if we think that the true misclassification rate is likely to be around
0.05 and we want to be 99% confident that Err is within 0.01 of the true misclassification rate,
we need to have a validation data set with 3,152 cases.

Half-width \ Misclassification rate
             0.01     0.05     0.10     0.15     0.20     0.30     0.40     0.50
  0.025       250      504      956    1,354    1,699    2,230    2,548    2,654
  0.010       657    3,152    5,972    8,461   10,617   13,935   15,926   16,589
  0.005     2,628   12,608   23,889   33,842   42,469   55,741   63,703   66,358
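These sample sizes follow from the standard confidence-interval formula for a proportion; the sketch below reproduces most of the entries under the assumption (mine, not stated in the text) that the 99% confidence level corresponds to a Normal multiplier of about 2.576.

    import math

    def validation_size(err_rate, half_width, z=2.576):
        """Cases needed so the 99% CI for the misclassification rate has the given half-width."""
        return math.ceil(z**2 * err_rate * (1 - err_rate) / half_width**2)

    # e.g. validation_size(0.05, 0.01) gives 3152, matching the table above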

3.1.4 Asymmetric Misclassification Costs and Bayes Risk

Up to this point we have been using the misclassification rate as the criterion for judging the efficacy
of a classifier. However, there are circumstances when this measure is not appropriate. Sometimes
the error of misclassifying a case belonging to one class is more serious than for the other class. For
example, misclassifying a household as unlikely to respond to a sales offer when it belongs to the
class that would respond incurs a greater opportunity cost than the converse error. In the former
case, you are missing out on a sale worth perhaps tens or hundreds of dollars. In the latter, you are
incurring the costs of mailing a letter to someone who will not purchase. In such a scenario using
the misclassification rate as a criterion can be misleading. Consider the situation where the sales
offer is accepted by 1% of the households on a list. If a classifier simply classifies every household as
a non-responder it will have an error rate of only 1% but will be useless in practice. A classifier that
misclassifies 30% of buying households as non-buyers and 2% of the non-buyers as buyers would
have a higher error rate but would be better if the profit from a sale is substantially higher than
the cost of sending out an offer. In these situations, if we have estimates of the cost of both types of
misclassification, we can use the confusion table to compute the expected cost of misclassification
for each case in the validation data. This enables us to compare different classifiers using overall
expected costs as the criterion. However, it does not improve the actual classifications themselves.
A better method is to change the classification rules (and hence the misclassification rates) to reflect
the asymmetric costs. In fact, there is a Bayes classifier for this situation which gives rules that
are optimal for minimizing the overall expected loss from misclassification (including both actual
and opportunity costs). This classifier is known as the Bayes Risk Classifier and the corresponding
minimum expected cost of misclassification is known as the Bayes Risk. The Bayes Risk Classifier
employs the following classification rule:
Classify a case as belonging to C1 if p(C1) f1(x0) C(0|1) > p(C0) f0(x0) C(1|0), and to
C0 otherwise. Here C(0|1) is the cost of misclassifying a C1 case as belonging to C0, and C(1|0) is
the cost of misclassifying a C0 case as belonging to C1. Note that the opportunity cost of correct
classification for either class is zero. Notice also that this rule reduces to the Minimum Error Bayes
Rule when C(0|1) = C(1|0).
Again, as we rarely know f1(x0) and f0(x0), we cannot construct this classifier in practice.
Nonetheless, it provides us with an ideal that the various classifiers we construct for minimizing
expected opportunity cost attempt to emulate.

3.1.5 Stratified Sampling and Asymmetric Costs

When classes are not present in roughly equal proportions, stratified sampling is often used to
oversample the cases from the more rare class and improve the performance of classifiers. If a class
occurs only rarely in the training set, the classifier will have little information to use in learning
what distinguishes it from the other classes. The most commonly used weighted sampling scheme
is to sample an equal number of cases from each class.
It is often the case that the more rare events are the more interesting or important ones -
responders to a mailing, those who commit fraud, defaulters on debt, etc. - and hence the more
costly to misclassify. Hence, after oversampling and training a model on a biased sample, two
adjustments are required:
- Adjusting the responses for the biased sampling (e.g. if a class was over-represented in the
  training sample by a factor of 2, its predicted outcomes need to be divided by 2)
- Translating the results (in terms of numbers of responses) into expected gains or losses in a
  way that accounts for asymmetric costs.

3.1.6 Generalization to More than Two Classes

All the comments made above about two-class classifiers extend readily to classification into more
than two classes. Let us suppose we have k classes C0, C1, C2, ..., C(k-1). Then Bayes formula gives
us:
$$p(C_j|X = x_0) = \frac{f_j(x_0)\,p(C_j)}{\sum_{i=0}^{k-1} f_i(x_0)\,p(C_i)}.$$
The Bayes Rule for Minimum Error is to classify a case as belonging to Cj if
$$p(C_j)\,f_j(x_0) \geq \max_{i=0,1,\dots,k-1}\, p(C_i)\,f_i(x_0).$$
The confusion table has k rows and k columns. The misclassification cost associated with the
diagonal cells is, of course, always zero. If the costs are asymmetric, the Bayes Risk Classifier follows
the rule: classify a case as belonging to Cj if
$$p(C_j)\,f_j(x_0)\,C(\bar{j}|j) \geq \max_{i \neq j}\, p(C_i)\,f_i(x_0)\,C(\bar{i}|i),$$
where $C(\bar{j}|j)$ is the cost of misclassifying a case that belongs to Cj as belonging to any other class Ci, i ≠ j.

3.1.7 Lift Charts

Often in practice, misclassification costs are not known accurately and decision makers would like to
examine a range of possible costs. In such cases, when the classifier gives a probability of belonging
to each class and not just a binary classification to C1 or C0, we can use a very useful device known
as the lift curve, also called a gains curve or gains chart. The lift curve is a popular technique in
direct marketing, and one useful way to think of a lift curve is to consider a data mining model
that attempts to identify the likely responders to a mailing by assigning each case a "probability of
responding" score. The lift curve helps us determine how effectively we can "skim the cream" by
selecting a relatively small number of cases and getting a relatively large portion of the responders.
The input required to construct a lift curve is a validation data set that has been scored by
appending to each case the estimated probability that it will belong to a given class.

3.1.8 Example: Boston Housing (Two classes)

Let us fit a logistic regression model to the Boston Housing data. (We will cover logistic
regression in detail later; for now think of it as like linear regression, except that the outcome
variable being predicted is binary.) We fit a logistic regression model to the training data (304
randomly selected cases) with all 13 variables available in the data set as predictor variables and
with the binary variable CAT.MEDV (1 if median value >= $30,000, 0 otherwise) as the dependent
variable. The model coefficients are applied to the validation data (the remaining 202 cases in the
data set). The first three columns of XLMiner output for the first 30 cases in the validation data
are shown below.

Case   Predicted Log-odds   Predicted Prob.   Actual Value
          of Success          of Success      of HICLASS
  1         3.5993              0.9734             1
  2        -6.5073              0.0015             0
  3         0.4061              0.6002             0
  4       -14.2910              0.0000             0
  5         4.5273              0.9893             1
  6        -1.2916              0.2156             0
  7       -37.6119              0.0000             0
  8        -1.1157              0.2468             0
  9        -4.3290              0.0130             0
 10       -24.5364              0.0000             0
 11       -21.6854              0.0000             0
 12       -19.8654              0.0000             0
 13       -13.1040              0.0000             0
 14         4.4472              0.9884             1
 15         3.5294              0.9715             1
 16         3.6381              0.9744             1
 17        -2.6806              0.0641             0
 18        -0.0402              0.4900             0
 19       -10.0750              0.0000             0
 20       -10.2859              0.0000             0
 21       -14.6084              0.0000             0
 22         8.9016              0.9999             1
 23         0.0874              0.5218             0
 24        -6.0590              0.0023             1
 25        -1.9183              0.1281             1
 26       -13.2349              0.0000             0
 27        -9.6509              0.0001             0
 28       -13.4562              0.0000             0
 29       -13.9340              0.0000             0
 30         1.7257              0.8489             1


The same 30 cases are shown below sorted in descending order of the predicted probability of
being a HICLASS=1 case.

Case   Predicted Log-odds   Predicted Prob.   Actual Value
          of Success          of Success      of HICLASS
 22         8.9016              0.9999             1
  5         4.5273              0.9893             1
 14         4.4472              0.9884             1
 16         3.6381              0.9744             1
  1         3.5993              0.9734             1
 15         3.5294              0.9715             1
 30         1.7257              0.8489             1
  3         0.4061              0.6002             0
 23         0.0874              0.5218             0
 18        -0.0402              0.4900             0
  8        -1.1157              0.2468             0
  6        -1.2916              0.2156             0
 25        -1.9183              0.1281             1
 17        -2.6806              0.0641             0
  9        -4.3290              0.0130             0
 24        -6.0590              0.0023             1
  2        -6.5073              0.0015             0
 27        -9.6509              0.0001             0
 19       -10.0750              0.0000             0
 20       -10.2859              0.0000             0
 13       -13.1040              0.0000             0
 26       -13.2349              0.0000             0
 28       -13.4562              0.0000             0
 29       -13.9340              0.0000             0
  4       -14.2910              0.0000             0
 21       -14.6084              0.0000             0
 12       -19.8654              0.0000             0
 11       -21.6854              0.0000             0
 10       -24.5364              0.0000             0
  7       -37.6119              0.0000             0


First, we need to set a cutoff probability value, above which we will consider a case to be a
"positive" or "1", and below which we will consider a case to be a "negative" or "0". For any given
cutoff level, we can use the sorted table to compute a confusion table. For example, if we use a
cutoff probability level of 0.400, we will predict 10 positives (7 true positives and 3 false positives);
we will also predict 20 negatives (18 true negatives and 2 false negatives). For each cutoff level, we
can calculate the appropriate confusion table. Instead of looking at a large number of confusion
tables, it is much more convenient to look at the cumulative lift curve (sometimes called a gains
chart), which summarizes all the information in these multiple confusion tables into a graph. The
graph is constructed with the cumulative number of cases (in descending order of probability) on
the x axis and the cumulative number of true positives on the y axis, as shown below.
Probability   Predicted Prob.   Actual Value   Cumulative
   Rank         of Success      of HICLASS    Actual Value
     1            0.9999             1              1
     2            0.9893             1              2
     3            0.9884             1              3
     4            0.9744             1              4
     5            0.9734             1              5
     6            0.9715             1              6
     7            0.8489             1              7
     8            0.6002             0              7
     9            0.5218             0              7
    10            0.4900             0              7
    11            0.2468             0              7
    12            0.2156             0              7
    13            0.1281             1              8
    14            0.0641             0              8
    15            0.0130             0              8
    16            0.0023             1              9
    17            0.0015             0              9
    18            0.0001             0              9
    19            0.0000             0              9
    20            0.0000             0              9
    21            0.0000             0              9
    22            0.0000             0              9
    23            0.0000             0              9
    24            0.0000             0              9
    25            0.0000             0              9
    26            0.0000             0              9
    27            0.0000             0              9
    28            0.0000             0              9
    29            0.0000             0              9
    30            0.0000             0              9


The cumulative lift chart is shown below.

The line joining the points (0,0) to (30,9) is a reference line. It represents the expected number
of positives we would predict if we did not have a model but simply selected cases at random. It
provides a benchmark against which we can see the performance of the model. If we had to choose 10
neighborhoods as HICLASS=1 neighborhoods and used our model to pick the ones most likely to
be 1's, the lift curve tells us that we would be right about 7 of them. If we simply selected 10 cases
at random we would expect to be right for about 10 × 9/30 = 3 cases. The model gives us a "lift" in
predicting HICLASS of 7/3 = 2.33. The lift will vary with the number of cases we choose to act on.
A good classifier will give us a high lift when we act on only a few cases (i.e. use the prediction for
the ones at the top). As we include more cases the lift will decrease. The lift curve for the best
possible classifier - a classifier that makes no errors - would overlap the existing curve at the start,
continue with a slope of 1 until it reached 9 successes (all the successes), then continue horizontally
to the right.
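The arithmetic behind the cumulative lift curve is simple enough to sketch. The following illustrative Python fragment (not XLMiner output; the function names are invented) computes the cumulative gains and the lift at a chosen depth.

    import numpy as np

    def cumulative_gains(probs, actuals):
        """Sort cases by predicted probability and accumulate the actual positives."""
        order = np.argsort(probs)[::-1]            # descending order of probability
        return np.cumsum(np.asarray(actuals)[order])

    def lift_at(probs, actuals, n_selected):
        """Lift = positives found in the top n cases / positives expected at random."""
        cum = cumulative_gains(probs, actuals)
        expected_at_random = n_selected * sum(actuals) / len(actuals)
        return cum[n_selected - 1] / expected_at_random

    # with the 30 validation cases above, lift_at(probs, actuals, 10) is about 7/3 = 2.33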
XLMiner automatically creates lift charts from probabilities predicted by logistic regression for
both training and validation data. The charts created for the full Boston Housing data are shown
below.
It is worth mentioning that a curve that captures the same information as the lift curve in a
slightly different manner is also popular in data mining applications. This is the ROC (short for
Receiver Operating Characteristic) curve. It uses the same variable on the y axis as the lift curve
(but expressed as a percentage of the maximum) and on the x axis it shows the false positives (also
expressed as a percentage of the maximum) for differing cutoff levels.

3.1.9 ROC Curve

The ROC curve for our 30-case example above is shown below.
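As with the lift curve, the points on an ROC curve can be computed directly from the sorted scores. The sketch below is an illustration only (with invented function names), sweeping the cutoff over the predicted probabilities.

    import numpy as np

    def roc_points(probs, actuals):
        """True-positive and false-positive percentages (of their maxima) for each cutoff."""
        probs, actuals = np.asarray(probs), np.asarray(actuals)
        points = []
        for cutoff in sorted(set(probs), reverse=True):
            predicted_pos = probs >= cutoff
            tp = np.sum(predicted_pos & (actuals == 1)) / np.sum(actuals == 1)
            fp = np.sum(predicted_pos & (actuals == 0)) / np.sum(actuals == 0)
            points.append((fp * 100, tp * 100))   # x = false positives %, y = true positives %
        return points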

3.1.10 Classification Using a Triage Strategy

In some cases it is useful to have a "can't say" option for the classifier. In a two-class situation
this means that for a case we can make one of three predictions: the case belongs to C0, or the
case belongs to C1, or we cannot make a prediction because there is not enough information to
confidently pick C0 or C1. Cases that the classifier cannot classify are subjected to closer scrutiny,
either by using expert judgment or by enriching the set of predictor variables by gathering additional
information that is perhaps more difficult or expensive to obtain. This is analogous to the strategy
of triage that is often employed during retreat in battle. The wounded are classified into those
who are well enough to retreat, those who are too ill to retreat even if medically treated under the
prevailing conditions, and those who are likely to become well enough to retreat if given medical
attention. An example is in processing credit card transactions, where a classifier may be used to
identify clearly legitimate cases and the obviously fraudulent ones while referring the remaining
cases to a human decision-maker who may look up a database to form a judgment. Since the vast
majority of transactions are legitimate, such a classifier would substantially reduce the burden on
human experts. To gain some insight into forming such a strategy let us revisit the simple two-class,
single-predictor-variable classifier that we examined at the beginning of this chapter.
Clearly the grey area of greatest doubt in classification is the area around a. At a the ratio of
the conditional probabilities of belonging to the classes is one. A sensible way to define the

grey area is the set of x values such that:
$$t > \frac{p(C_1)\,f_1(x_0)}{p(C_0)\,f_0(x_0)} > 1/t$$
where t is a threshold for the ratio. A typical value of t might be in the range 1.05 to 1.2.

Chapter 4

Multiple Linear Regression

4.1 A Review of Multiple Linear Regression

4.1.1 Linearity

Perhaps the most popular mathematical model for making predictions is the multiple linear
regression model encountered in most introductory statistics classes. Multiple linear regression is
applicable to numerous data mining situations. Examples are: predicting customer activity on
credit cards from demographics and historical activity patterns, predicting the time to failure of
equipment based on utilization and environment conditions, predicting expenditures on vacation
travel based on historical frequent flier data, predicting staffing requirements at help desks based
on historical data and product and sales information, predicting sales from cross-selling of products
from historical information, and predicting the impact of discounts on sales in retail outlets.
There are two important conceptual ideas that we will develop:
1. Relaxing the assumption that errors follow a Normal distribution;
2. Identifying subsets of the independent variables to improve predictions.

4.1.2 Independence

Relaxing the Normal distribution assumption


Let us review the typical multiple regression model. There is a continuous random variable
called the dependent variable, Y, and a number of independent variables, x1, x2, ..., xp. Our
purpose is to predict the value of the dependent variable (also referred to as the outcome or response
variable) using a linear function of the independent variables. The values of the independent
variables (also referred to as predictor variables, input variables, regressors or covariates) are known
quantities for purposes of prediction, and the model is:
$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon \qquad (1)$$
where $\varepsilon$, the "noise" variable, is a Normally-distributed random variable with mean 0 and
standard deviation $\sigma$ whose value we do not know. We also do not know the values of the
coefficients $\beta_0, \beta_1, \beta_2, \dots, \beta_p$. We estimate all these (p + 2) unknown values from the available data.
The data consist of n cases (rows of observations) which give us values $y_i, x_{i1}, x_{i2}, \dots, x_{ip}$; $i = 1, 2, \dots, n$.
The estimates for the coefficients are computed so as to minimize the sum of squares
of differences between the fitted (predicted) values and the observed Y values in the data. The sum
of squared differences is given by
$$\sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2} - \cdots - \beta_p x_{ip})^2$$

Let us denote the values of the coefficients that minimize this expression by $\hat\beta_0, \hat\beta_1, \hat\beta_2, \dots, \hat\beta_p$. These
are our estimates for the unknown values and are called OLS (ordinary least squares) estimates.
Once we have computed the estimates $\hat\beta_0, \hat\beta_1, \hat\beta_2, \dots, \hat\beta_p$ we can calculate an unbiased estimate $\hat\sigma^2$
for $\sigma^2$ using the formula:
$$\hat\sigma^2 = \frac{1}{n - p - 1} \sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_{i1} - \hat\beta_2 x_{i2} - \cdots - \hat\beta_p x_{ip})^2 = \frac{\text{Sum of squares of residuals}}{\text{observations} - \text{coefficients}}.$$

4.1.3 Unbiasedness

We plug in the values of $\hat\beta_0, \hat\beta_1, \hat\beta_2, \dots, \hat\beta_p$ in the linear regression model (1) to predict the value
of the dependent variable from known values of the independent variables, $x_1, x_2, \dots, x_p$. The predicted
value, $\hat{Y}$, is computed from the equation $\hat{Y} = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2 + \cdots + \hat\beta_p x_p$. Predictions based
on this equation are the best predictions possible in the sense that they will be unbiased (equal to
the true values on the average) and will have the smallest expected squared error compared to any
unbiased estimates if we make the following assumptions:
1. The expected value of the dependent variable is a linear function of the independent variables.
More specifically, $E(Y|x_1, x_2, \dots, x_p) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$.
2. The noise random variables $\varepsilon_i$ are independent between all the cases. Here $\varepsilon_i$ is the noise
random variable in observation i, for $i = 1, \dots, n$.
3. $E(\varepsilon_i) = 0$ for $i = 1, 2, \dots, n$.
4. Homoskedasticity. The standard deviation of $\varepsilon_i$ equals the same (unknown) value, $\sigma$, for
$i = 1, 2, \dots, n$.
5. Normality. The noise random variables $\varepsilon_i$ are Normally distributed.
An important and interesting fact for our purposes is that even if we drop the last assumption
and allow the noise variables to follow arbitrary distributions, these estimates are very good for


prediction. We can show that predictions based on these estimates are the best linear predictions
in that they minimize the expected squared error. In other words, amongst all linear models, as
defined by equation (1) above, the model using the least squares estimates, $\hat\beta_0, \hat\beta_1, \hat\beta_2, \dots, \hat\beta_p$, will
give the smallest value of squared error on the average.
The Normal distribution assumption is required in the classical implementation of multiple
linear regression to derive confidence intervals for predictions. In this classical world, data are
scarce and the same data are used to fit the regression model and to assess its reliability (with
confidence limits). In data mining applications we have two distinct sets of data: the training
data set and the validation data set, both representative of the relationship between the
dependent and independent variables. The training data is used to estimate the regression
coefficients $\beta_0, \beta_1, \beta_2, \dots, \beta_p$. The validation data set constitutes a hold-out sample and is not used in
computing the coefficient estimates. This enables us to estimate the error in our predictions without
having to assume that the noise variables follow the Normal distribution. We use the training
data to fit the model and to estimate the coefficients. These coefficient estimates are used to make
predictions for each case in the validation data. The prediction for each case is then compared to
the value of the dependent variable that was actually observed in the validation data. The average
of the square of this error enables us to compare different models and to assess the accuracy of the
model in making predictions.

4.2 Illustration of the Regression Process

Example 1: Supervisor Performance Data (adapted from Chatterjee, Hadi and Price)
The data shown in Table 1 are from a large financial organization. The dependent (outcome)
variable is an overall measure of supervisor effectiveness. The independent (predictor) variables are
clerical employees' ratings of these same supervisors on more specific attributes of performance.
All ratings are on a scale of 1 to 5 by 25 clerks reporting to the supervisor. These ratings are
answers to survey questions given to a sample of 25 clerks in each of 30 departments. The purpose
of the analysis was to explore the feasibility of using a questionnaire for predicting effectiveness
of supervisors, thus saving the considerable effort required to directly measure effectiveness. The
variables are answers to questions on the survey and are described below.
Y   Measure of effectiveness of supervisor
X1  Handles employee complaints
X2  Does not allow special privileges
X3  Opportunity to learn new things
X4  Raises based on performance
X5  Too critical of poor performance
X6  Rate of advancing to better jobs



Table 1: Training Data (20 departments)

Case    Y    X1   X2   X3   X4   X5   X6
  1    43    51   30   39   61   92   45
  2    63    64   51   54   63   73   47
  3    71    70   68   69   76   86   48
  4    61    63   45   47   54   84   35
  5    81    78   56   66   71   83   47
  6    43    55   49   44   54   49   34
  7    58    67   42   56   66   68   35
  8    71    75   50   55   70   66   41
  9    72    82   72   67   71   83   31
 10    67    61   45   47   62   80   41
 11    64    53   53   58   58   67   34
 12    67    60   47   39   59   74   41
 13    69    62   57   42   55   63   25
 14    68    83   83   45   59   77   35
 15    77    77   54   72   79   77   46
 16    81    90   50   72   60   54   36
 17    74    85   64   69   79   79   63
 18    65    60   65   75   55   80   60
 19    65    70   46   57   75   85   46
 20    50    58   68   54   64   78   52

The multiple linear regression estimates (as computed by XLMiner) are reported below.

Multiple R-squared      0.656
Residual SS           738.900
Std. Dev. Estimate      7.539

            Coefficient   StdError   t-statistic   p-value
Constant        13.182     16.746        0.787      0.445
X1               0.583      0.232        2.513      0.026
X2              -0.044      0.167       -0.263      0.797
X3               0.329      0.219        1.501      0.157
X4              -0.057      0.317       -0.180      0.860
X5               0.112      0.196        0.570      0.578
X6              -0.197      0.247       -0.798      0.439

The equation to predict performance is
$$\hat{Y} = 13.182 + 0.583 X_1 - 0.044 X_2 + 0.329 X_3 - 0.057 X_4 + 0.112 X_5 - 0.197 X_6.$$
Applying this equation to the validation data gives the predictions and errors shown in Table 2.


Table 2: Predictions for validation cases

Case     Y    X1   X2   X3   X4   X5   X6   Prediction   Error = (Pred - Y)
 21     50    40   33   34   43   64   33      44.46           -5.54
 22     64    61   52   62   66   80   41      63.98           -0.02
 23     53    66   52   50   63   80   37      63.91           10.91
 24     40    37   42   58   50   57   49      45.87            5.87
 25     63    54   42   48   66   75   33      56.75           -6.25
 26     66    77   66   63   88   76   72      65.22           -0.78
 27     78    75   58   74   80   78   49      73.23           -4.77
 28     48    57   44   45   51   83   38      58.19           10.19
 29     85    85   71   71   77   74   55      76.05           -8.95
 30     82    82   39   59   64   78   39      76.10           -5.90
Averages:                                      62.38           -0.52
Std Devs:                                      11.30            7.17

We note that the average error in the predictions is small (-0.52) and so the predictions are
unbiased. Further, the errors are roughly Normal, so this model gives prediction errors that are,
approximately 95% of the time, within ±14.34 (two standard deviations) of the true value. (If we
want to be very conservative we can use a result known as Tchebychev's inequality, which says that
the probability that any random variable is more than k standard deviations away from its mean
is at most 1/k². This more conservative formula tells us that the chances of our prediction being
within 3 × 7.17 = 21.51 of the true value are at least 8/9, i.e. about 89%.)
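A quick sketch of this check follows (illustrative Python, not XLMiner output; the function name is invented). It computes the average error, the standard deviation of the errors, and the rough 95% band used above.

    import numpy as np

    def validation_error_band(actual, predicted):
        """Average error, standard deviation of errors, and an approximate 95% band."""
        errors = np.asarray(predicted) - np.asarray(actual)
        avg, sd = errors.mean(), errors.std(ddof=1)
        return avg, sd, 2 * sd    # predictions fall within +/- 2 sd of actual about 95% of the time

    # with the Table 2 values: avg is about -0.52, sd about 7.17, band about 14.34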

4.3 Subset Selection in Linear Regression

A frequent problem in data mining is that of using a regression equation to predict the value of a
dependent variable when we have many variables available to choose as independent variables in
our model. Given the high speed of modern algorithms for multiple linear regression calculations,
it is tempting in such a situation to take a kitchen-sink approach: why bother to select a subset,
just use all the variables in the model. There are several reasons why this could be undesirable.
- It may be expensive (or not feasible) to collect the full complement of variables for future
  predictions.
- We may be able to measure fewer variables more accurately (for example in surveys).
- We may need to delete fewer observations in data sets with missing values of observations.
- Parsimony is an important property of good models. We obtain more insight into the influence
  of regressors in models with a few parameters.
- Estimates of regression coefficients are likely to be unstable due to multicollinearity in models
  with many variables. (Multicollinearity is the presence of two or more predictor variables
  sharing the same linear relationship with the outcome variable.) Regression coefficients are
  more stable for parsimonious models. One rough rule of thumb (where n = # of cases and
  k = # of variables) is n >= 5(k + 2).
- It can be shown that using independent variables that are uncorrelated with the dependent
  variable will increase the variance of predictions.
- It can be shown that dropping independent variables that have small (non-zero) coefficients
  can reduce the average error of predictions.

Let us illustrate the last two points using the simple case of two independent variables. The
reasoning remains valid in the general situation of more than two independent variables.

4.4 Dropping Irrelevant Variables

Suppose that the true equation for Y, the dependent variable, is:
$$Y = \beta_1 X_1 + \varepsilon \qquad (2)$$
and suppose that we estimate Y (using an additional variable X2 that is actually irrelevant) with
the equation:
$$Y = \beta_1 X_1 + \beta_2 X_2 + \varepsilon. \qquad (3)$$
This equation is true with $\beta_2 = 0$.
We can show that in this situation the least squares estimates $\hat\beta_1$ and $\hat\beta_2$ will have the following
expected values and variances:
$$E(\hat\beta_1) = \beta_1, \quad Var(\hat\beta_1) = \frac{\sigma^2}{(1 - R_{12}^2) \sum x_{i1}^2}$$
$$E(\hat\beta_2) = 0, \quad Var(\hat\beta_2) = \frac{\sigma^2}{(1 - R_{12}^2) \sum x_{i2}^2}$$
where $R_{12}$ is the correlation coefficient between X1 and X2.
We notice that $\hat\beta_1$ is an unbiased estimator of $\beta_1$ and $\hat\beta_2$ is an unbiased estimator of $\beta_2$ since it
has an expected value of zero. However, the variance of $\hat\beta_1$ is larger than it would have been if we
had used equation (2). In that case
$$E(\hat\beta_1) = \beta_1, \quad Var(\hat\beta_1) = \frac{\sigma^2}{\sum x_{i1}^2}.$$
The variance is the expected value of the squared error for an unbiased estimator. So we
are worse off using the irrelevant estimator in making predictions. Even if X2 happens to be
uncorrelated with X1, so that $R_{12}^2 = 0$ and the variance of $\hat\beta_1$ is the same in both models, we can
show that the variance of a prediction based on (3) will be greater than that of a prediction based
on (2) due to the added variability introduced by estimation of $\hat\beta_2$.
Although our analysis has been based on one useful independent variable and one irrelevant
independent variable, the result holds true in general. It is always better to make predictions with
models that do not include irrelevant variables.

4.5 Dropping Independent Variables With Small Coefficient Values

Suppose that the situation is the reverse of what we have discussed above, namely that equation (3)
is the correct equation, but we use equation (2) for our estimates and predictions, ignoring variable
X2 in our model. To keep our results simple let us suppose that we have scaled the values of X1, X2
and Y so that their variances are equal to 1. In this case the least squares estimate $\hat\beta_1$ has the
following expected value and variance:
$$E(\hat\beta_1) = \beta_1 + R_{12}\beta_2, \quad Var(\hat\beta_1) = \sigma^2.$$
Notice that $\hat\beta_1$ is a biased estimator of $\beta_1$ with bias equal to $R_{12}\beta_2$, and its Mean Square Error is
given by:
$$MSE(\hat\beta_1) = E[(\hat\beta_1 - \beta_1)^2] = E[\{\hat\beta_1 - E(\hat\beta_1) + E(\hat\beta_1) - \beta_1\}^2] = [Bias(\hat\beta_1)]^2 + Var(\hat\beta_1) = (R_{12}\beta_2)^2 + \sigma^2.$$
If we use equation (3), the least squares estimates $\hat\beta_1$ and $\hat\beta_2$ have the following expected values
and variances:
$$E(\hat\beta_1) = \beta_1, \quad Var(\hat\beta_1) = \frac{\sigma^2}{(1 - R_{12}^2)}$$
$$E(\hat\beta_2) = \beta_2, \quad Var(\hat\beta_2) = \frac{\sigma^2}{(1 - R_{12}^2)}.$$
Now let us compare the Mean Square Errors for predicting Y at $X_1 = u_1, X_2 = u_2$. For equation
(2),
$$MSE2(\hat{Y}) = E[(\hat{Y} - Y)^2] = E[(u_1\hat\beta_1 - u_1\beta_1)^2] + \sigma^2 = u_1^2\, MSE2(\hat\beta_1) + \sigma^2 = u_1^2 (R_{12}\beta_2)^2 + u_1^2\sigma^2 + \sigma^2.$$
For equation (3),
$$MSE3(\hat{Y}) = E[(\hat{Y} - Y)^2] = E[(u_1\hat\beta_1 + u_2\hat\beta_2 - u_1\beta_1 - u_2\beta_2)^2] + \sigma^2 = Var(u_1\hat\beta_1 + u_2\hat\beta_2) + \sigma^2,$$
because now $\hat{Y}$ is unbiased, so that
$$MSE3(\hat{Y}) = u_1^2\, Var(\hat\beta_1) + u_2^2\, Var(\hat\beta_2) + 2 u_1 u_2\, Covar(\hat\beta_1, \hat\beta_2) + \sigma^2 = \frac{(u_1^2 + u_2^2 - 2 u_1 u_2 R_{12})\,\sigma^2}{(1 - R_{12}^2)} + \sigma^2.$$


Equation (2) can lead to lower mean squared error for many combinations of values for $u_1, u_2, R_{12}$,
and $(\beta_2/\sigma)^2$. For example, if $u_1 = 1$ and $u_2 = 0$, then $MSE2(\hat{Y}) < MSE3(\hat{Y})$ when
$$(R_{12}\beta_2)^2 + \sigma^2 < \frac{\sigma^2}{(1 - R_{12}^2)},$$
or when
$$\frac{|\beta_2|}{\sigma} < \frac{1}{\sqrt{1 - R_{12}^2}}.$$
If $|\beta_2|/\sigma < 1$ this will be true for all values of $R_{12}$; also, if $R_{12} > 0.9$, it will be true for $|\beta_2|/\sigma < 2$.
In general, accepting some bias can reduce MSE. This bias-variance trade-off generalizes to
models with several independent variables and is particularly important for large values of p, since
in that case it is very likely that there are variables in the model that have small coefficients relative
to the standard deviation of the noise term and also exhibit at least moderate correlation with other
variables. Dropping such variables will improve the predictions, as it will reduce the MSE.
This type of bias-variance trade-off is a basic aspect of most data mining procedures for prediction
and classification.
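A small simulation can make this trade-off tangible. The sketch below is an illustration under invented settings (a small β2 and X1, X2 correlated at about 0.9); it compares validation MSE with and without the small-coefficient variable.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(1)
    n = 200
    x1 = rng.normal(size=n)
    x2 = 0.9 * x1 + np.sqrt(1 - 0.9**2) * rng.normal(size=n)   # correlated with x1
    y = 1.0 * x1 + 0.1 * x2 + rng.normal(size=n)               # beta2 = 0.1 is "small"

    train, valid = slice(0, 100), slice(100, 200)

    def valid_mse(columns):
        X = np.column_stack(columns)
        model = LinearRegression().fit(X[train], y[train])
        return np.mean((model.predict(X[valid]) - y[valid]) ** 2)

    # dropping x2 often gives a slightly lower validation MSE in this setting
    print(valid_mse([x1, x2]), valid_mse([x1]))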

4.6 Algorithms for Subset Selection

Selecting subsets to improve MSE is a difficult computational problem for large p. The most
common procedure for p greater than about 20 is to use heuristics to select "good" subsets rather
than to look for the best subset for a given criterion. The heuristics most often used and available in
statistics software are step-wise procedures. There are three common procedures: forward selection,
backward elimination and step-wise regression.

4.6.1 Forward Selection

Here we keep adding variables one at a time to construct what we hope is a reasonably good subset.
The steps are as follows:
1. Start with the constant term only in the subset (S).
2. Compute the reduction in the sum of squares of the residuals (SSR) obtained by including
each variable that is not presently in S. For the variable, say i, that gives the largest reduction
in SSR, compute
$$F_i = \max_{i \notin S} \frac{SSR(S) - SSR(S \cup \{i\})}{\hat\sigma^2(S \cup \{i\})}$$
If $F_i > F_{in}$, add i to S.
3. Repeat step 2 until no variables can be added.
(Typical values for $F_{in}$ are in the range [2, 4].)
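A compact sketch of this procedure follows (illustrative Python; the F-to-enter threshold of 4 and the helper names are choices made for the example, not taken from XLMiner).

    import numpy as np
    from sklearn.linear_model import LinearRegression

    def ssr(X, y, cols):
        """Sum of squared residuals for the model using the given columns (plus a constant)."""
        if not cols:
            return np.sum((y - y.mean()) ** 2)
        model = LinearRegression().fit(X[:, cols], y)
        return np.sum((y - model.predict(X[:, cols])) ** 2)

    def forward_selection(X, y, f_in=4.0):
        n, p = X.shape
        selected = []
        while True:
            candidates = [j for j in range(p) if j not in selected]
            if not candidates:
                break
            scores = {}
            for j in candidates:                     # F statistic for adding each candidate
                new = selected + [j]
                sigma2 = ssr(X, y, new) / (n - len(new) - 1)
                scores[j] = (ssr(X, y, selected) - ssr(X, y, new)) / sigma2
            best = max(scores, key=scores.get)
            if scores[best] > f_in:
                selected.append(best)
            else:
                break
        return selected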

4.6.2 Backward Elimination

1. Start with all variables in the subset (S).
2. Compute the increase in the sum of squares of the residuals (SSR) obtained by excluding
each variable that is presently in S. For the variable, say i, that gives the smallest increase
in SSR, compute
$$F_i = \min_{i \in S} \frac{SSR(S - \{i\}) - SSR(S)}{\hat\sigma^2(S)}$$
If $F_i < F_{out}$, then drop i from S.
3. Repeat step 2 until no variable can be dropped.
(Typical values for $F_{out}$ are in the range [2, 4].)
Backward Elimination has the advantage that all variables are included in S at some stage.
This gets around a problem with forward selection: it will never select a variable that is better than
a previously selected variable with which it is strongly correlated. The disadvantage is that the full
model with all variables is required at the start, and this can be time-consuming and numerically
unstable.

4.6.3 Step-wise Regression (Efroymson's method)

This procedure is like Forward Selection except that at each step we consider dropping variables
as in Backward Elimination.
Convergence is guaranteed if $F_{out} < F_{in}$ (but it is possible for a variable to enter S and then
leave S at a subsequent step, and even rejoin S at a yet later step).
As stated above, these methods pick one "best" subset. There are straightforward variations of
the methods that do identify several close-to-best choices for different sizes of independent variable
subsets.
None of the above methods guarantees that it yields the best subset for any criterion such
as adjusted R² (defined later). They are reasonable methods for situations with large numbers of
independent variables, but for moderate numbers of independent variables the method discussed
next is preferable.

4.6.4 All Subsets Regression

The idea here is to evaluate all subsets. Efficient implementations use branch and bound algorithms
(of the type used for integer programming) to avoid explicitly enumerating all subsets. (In fact the
subset selection problem can be set up as a quadratic integer program.) We compute a criterion
such as $R^2_{adj}$, the adjusted R², for each subset and then choose the best one. (This is only feasible
if p is less than about 20.)

4.7 Identifying Subsets of Variables to Improve Predictions

The All Subsets regression (as well as modifications of the heuristic algorithms) will produce a
number of subsets. Since the number of subsets for even moderate values of p is very large, we need
some way to examine the most promising subsets and to select from them. An intuitive metric to

compare subsets is R². However, since $R^2 = 1 - \frac{SSR}{SST}$, where SST, the Total Sum of Squares, is the
sum of squared residuals for the model with just the constant term, if we use it as a criterion we
will always pick the full model with all p variables. One approach is therefore to select the subset
with the largest R² for each possible size k, k = 2, ..., p + 1. (The size is the number of coefficients in
the model and is therefore one more than the number of variables in the subset, to account for the
constant term.) We then examine the increase in R² as a function of k amongst these subsets and
choose a subset such that subsets that are larger in size give only insignificant increases in R².
Another, more automatic, approach is to choose the subset that maximizes $R^2_{adj}$, a modification
of R² that makes an adjustment to account for size. The formula for $R^2_{adj}$ is
$$R^2_{adj} = 1 - \frac{n - 1}{n - k}(1 - R^2).$$
It can be shown that using $R^2_{adj}$ to choose a subset is equivalent to picking the subset that
minimizes $\hat\sigma^2$. (Note that it is possible, though rare, for $R^2_{adj}$ to be negative.)
Table 3 gives the results of the subset selection procedures applied to the training data in
Example 1.


Table 3: Subset Selection for Example 1

SST = 2149.000      Fin = 3.840      Fout = 2.710

Forward, backward, and all subsets selections

Size   SSR       RSq     RSq(adj)   Cp       Model
2      874.467   0.593   0.570      -0.615   Constant X1
3      786.601   0.634   0.591      -0.161   Constant X1 X3
4      759.413   0.647   0.580       1.361   Constant X1 X3 X6
5      743.617   0.654   0.562       3.083   Constant X1 X3 X5 X6
6      740.746   0.655   0.532       5.032   Constant X1 X2 X3 X5 X6
7      738.900   0.656   0.497       7.000   Constant X1 X2 X3 X4 X5 X6

Stepwise Selection

Size   SSR       RSq     RSq(adj)   Cp       Model
2      874.467   0.593   0.570      -0.615   Constant X1
3      786.601   0.634   0.591      -0.161   Constant X1 X3
4      783.970   0.635   0.567       1.793   Constant X1 X2 X3
5      781.089   0.637   0.540       3.742   Constant X1 X2 X3 X4
6      775.094   0.639   0.511       5.637   Constant X1 X2 X3 X4 X5
7      738.900   0.656   0.497       7.000   Constant X1 X2 X3 X4 X5 X6

Notice that the step-wise heuristic fails to find the best subset for sizes 4, 5, and 6. The Forward
and Backward heuristics do find the best subsets of all sizes and so give identical results to the
All Subsets algorithm. The best subset of size 3, consisting of {X1, X3}, maximizes R^2_{adj} for all the
algorithms. This suggests that we may be better off in terms of MSE of predictions if we use this
subset rather than the full model of size 7 with all six variables in the model. Using this model on
the validation data gives a slightly higher standard deviation of error (7.3) than the full model (7.1),
but this may be a small price to pay if the cost of the survey can be reduced substantially by having
2 questions instead of 6. This example also underscores the fact that we are basing our analysis on
small (tiny by data mining standards!) training and validation data sets. Small data sets make our
estimates of R^2 unreliable.
A criterion that is often used for subset selection is known as Mallows Cp. This criterion assumes
that the full model is unbiased, although it may have variables that, if dropped, would improve the
MSE. With this assumption we can show that if a subset model is unbiased, E(Cp) equals the
number of parameters k + 1, the size of the subset. So a reasonable approach to identifying subset
models with small bias is to examine those with values of Cp that are near k + 1. Cp is also an
estimate of the sum of MSE (standardized by dividing by \sigma^2) for predictions (the fitted values)


at the x-values observed in the training set. Thus good models are those that have values of Cp
near k + 1 and that have small k (i.e., are of small size). Cp is computed from the formula:

C_p = \frac{SSR}{\hat{\sigma}^2_{Full}} + 2(k + 1) - n,

where \hat{\sigma}^2_{Full} is the estimated value of \sigma^2 in the full model that includes all the variables. It is
important to remember that the usefulness of this approach depends heavily on the reliability of the
estimate of \sigma^2 for the full model. This requires that the training set contain a large number of
observations relative to the number of variables. We note that for our example only the subsets of
size 6 and 7 seem to be unbiased, as for the other models Cp differs substantially from k + 1. This is
a consequence of having too few observations to estimate \sigma^2 accurately in the full model.
Finally, a useful point to note is that for a fixed size of subset, R^2, R^2_{adj} and Cp all select the
same subset. In fact there is no difference between them in the order of merit they ascribe to
subsets of a fixed size.
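For moderate p the exhaustive search can be written directly. The sketch below is our own illustration (with no branch-and-bound shortcuts, so it is practical only when p is small); it computes R^2, R^2_{adj} and Cp for every subset using the formulas above.

```python
import numpy as np
from itertools import combinations

def ssr_of(X, y, cols):
    """Sum of squared residuals of an OLS fit on the chosen columns plus a constant term."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def all_subsets(X, y):
    n, p = X.shape
    sst = float(np.sum((y - y.mean()) ** 2))          # SSR of the constant-only model
    sigma2_full = ssr_of(X, y, list(range(p))) / (n - p - 1)
    rows = []
    for nvars in range(1, p + 1):
        for cols in combinations(range(p), nvars):
            ssr = ssr_of(X, y, list(cols))
            k = nvars + 1                              # size = coefficients including the constant
            r2 = 1 - ssr / sst
            r2_adj = 1 - (n - 1) / (n - k) * (1 - r2)
            cp = ssr / sigma2_full + 2 * k - n         # Mallows Cp with sigma^2 from the full model
            rows.append((cols, ssr, r2, r2_adj, cp))
    return rows
```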

Chapter 5

Logistic Regression
Logistic regression extends the ideas of multiple linear regression to the situation where the dependent variable, y, is binary (for convenience we often code these values as 0 and 1). As with
multiple linear regression, the independent variables x1, x2, ..., xk may be categorical or continuous
variables or a mixture of these two types. While in multiple linear regression we end up with an
estimate of the value of the continuous dependent variable, in logistic regression we end up with an
estimate of the probability that the dependent variable is a 1 (as opposed to a 0). We can then
use this probability to classify each case as a 0 or as a 1.
Let us take some examples to illustrate:

5.1 Example 1: Estimating the Probability of Adopting a New Phone Service

The data in Table 1 were obtained in a survey conducted by AT & T in the US from a national sample of cooperating households. We are interested in the adoption rate for a new telecommunications
service, as a function of education, residential stability and income.
Table 1: Adoption of New Telephone Service

                  High School or below                        Some College or above
                  No Change in        Change in               No Change in        Change in
                  Residence during    Residence during        Residence during    Residence during
                  Last five years     Last five years         Last five years     Last five years
Low Income        153/2160 = 0.071    226/1137 = 0.199        61/886 = 0.069      233/1091 = 0.214
High Income       147/1363 = 0.108    139/547 = 0.254         287/1925 = 0.149    382/1415 = 0.270

(For fractions in cells above, the numerator is the number of adopters and the denominator is
the number surveyed in that category.)
Note that the overall probability of adoption in the sample is 1628/10524 = 0.155. However, the
adoption probability varies depending on the categorical independent variables education, residential
stability and income. The lowest value is 0.069 for low-income, no-residence-change households
with some college education, while the highest is 0.270 for high-income residence changers with some
college education.

5.2 Multiple Linear Regression is Inappropriate

The standard multiple linear regression model is inappropriate to model this data for the following
reasons:
1. The model's predicted probabilities could fall outside the range 0 to 1.
2. The dependent variable (adoption) is not normally distributed. In fact a binomial model
would be more appropriate. For example, if a cell total is 11 then this variable can take on
only 12 distinct values: 0, 1, 2, ..., 11. Think of the response of the households in a cell as being
determined by independent flips of a coin with, say, heads representing adoption, with the
probability of heads varying between cells.
3. We cannot use the expedient of considering the normal distribution as an approximation for
the binomial model because the variance of the dependent variable is not constant across all
cells: it will be higher for cells where the probability of adoption, p, is near 0.5 than where it
is near 0 or 1. It will also increase with the total number of households, n, falling in the cell.
The variance equals np(1 - p).

5.3 The Logistic Regression Model

The logistic regression model was developed to account for all these difficulties. It is used in a
variety of fields whenever a structured model is needed to explain or predict binary outcomes.
One such application is in describing choice behavior in econometrics, which is useful in the context
of the above example. In the context of choice behavior the logistic model can be shown to follow
from the random utility theory developed by Manski as an extension of the standard economic
theory of consumer behavior.
In essence the consumer theory states that when faced with a set of choices a consumer makes
a choice which has the highest utility (a numeric measure of worth with arbitrary zero and scale).
It assumes that the consumer has a preference order on the list of choices that satises reasonable
criteria such as transitivity. The preference order can depend on the individual (e.g. socioeconomic
characteristics as in the Example 1 above) as well as attributes of the choice. The random utility
model considers the utility of a choice to incorporate a random element. When we model the
random element as coming from a reasonable distribution, we can logically derive the logistic
model for predicting choice behavior.
If we let y = 1 represent choosing an option versus y = 0 for not choosing it, the logistic
regression model stipulates:

Probability(Y = 1 | x_1, x_2, \ldots, x_k) = \frac{\exp(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}{1 + \exp(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}

where \beta_0, \beta_1, \beta_2, \ldots, \beta_k are unknown constants analogous to the coefficients of the multiple linear regression model.
The independent variables for our model would be:
x1 (Education): high school or below = 0, some college or above = 1
x2 (Residential stability): no change over past five years = 0, change over past five years = 1
x3 (Income): low = 0, high = 1
The data in Table 1 are shown below in another summary format.

x1   x2   x3   # in sample   # adopters   # non-adopters   Fraction adopters
0    0    0    2160          153          2007             0.071
0    0    1    1363          147          1216             0.108
0    1    0    1137          226          911              0.199
0    1    1    547           139          408              0.254
1    0    0    886           61           825              0.069
1    1    0    1091          233          858              0.214
1    0    1    1925          287          1638             0.149
1    1    1    1415          382          1033             0.270
Total          10524         1628         8896             0.155

Typical rows in the actual data file might look like this:

Adopt   X1   X2   X3   etc.
0       0    1    0
1       1    0    1
0       0    0    0
In other words, for given values of X1, X2 and X3, the probability that Y = 1 is estimated by
the expression above.

5.4 Odds Ratios

The logistic model for this example is:

Prob(Y = 1 | x_1, x_2, x_3) = \frac{\exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3)}{1 + \exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3)}.

We obtain a useful interpretation for the coefficients \beta_0, \beta_1, \beta_2 and \beta_3 by noting that:

exp(\beta_0) = Prob(Y = 1 | x1 = x2 = x3 = 0) / Prob(Y = 0 | x1 = x2 = x3 = 0)
            = Odds of adopting in the base case (x1 = x2 = x3 = 0)

exp(\beta_1) = (Odds of adopting when x1 = 1, x2 = x3 = 0) / (Odds of adopting in the base case)

exp(\beta_2) = (Odds of adopting when x2 = 1, x1 = x3 = 0) / (Odds of adopting in the base case)

exp(\beta_3) = (Odds of adopting when x3 = 1, x1 = x2 = 0) / (Odds of adopting in the base case)

The logistic model is multiplicative in odds in the following sense:

Odds of adopting for a given x1, x2, x3 = exp(\beta_0) · exp(\beta_1 x1) · exp(\beta_2 x2) · exp(\beta_3 x3)
  = (Odds for base case) × (Factor due to x1) × (Factor due to x2) × (Factor due to x3)

If x1 = 1 the odds of adoption get multiplied by the same Factor due to X1 , regardless of the
level of x2 and x3 . Similarly the multiplicative factors for x2 and x3 do not vary with the levels of
the remaining factors. The factor for a variable gives us the impact of the presence of that factor
on the odds of adopting.
If \beta_i = 0, the presence of the corresponding factor has no effect (multiplication by one). If
\beta_i < 0, presence of the factor reduces the odds (and the probability) of adoption, whereas if \beta_i > 0,
presence of the factor increases the probability of adoption.
The computations required to produce estimates of the beta coefficients require iterations using
a computer program. The output of a typical program is shown below:

Variable    Coeff.    Std. Error   p-Value   Odds     95% Conf. Intvl. for odds
                                                      Lower Limit   Upper Limit
Constant    -2.500    0.058        0.000     0.082    0.071         0.095
x1           0.161    0.058        0.006     1.175    1.048         1.316
x2           0.992    0.056        0.000     2.698    2.416         3.013
x3           0.444    0.058        0.000     1.560    1.393         1.746

5.5 Probabilities

From the estimated values of the coefficients, we see that the estimated probability of adoption for
a household with values x1, x2 and x3 for the independent variables is:

Prob(Y = 1 | x_1, x_2, x_3) = \frac{\exp(-2.500 + 0.161 x_1 + 0.992 x_2 + 0.444 x_3)}{1 + \exp(-2.500 + 0.161 x_1 + 0.992 x_2 + 0.444 x_3)}.

The estimated number of adopters from this model will be the total number of households with
a given set of values x1, x2 and x3 for the independent variables, multiplied by the above probability,
then summed over all observed combinations of values for X1, X2, and X3.
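The arithmetic behind the table that follows can be reproduced in a few lines. The sketch below (the function name is ours) plugs the estimated coefficients from the output above into the probability formula and multiplies by the cell count.

```python
import math

# Estimated coefficients from the output above: constant, x1, x2, x3
b0, b1, b2, b3 = -2.500, 0.161, 0.992, 0.444

def adoption_prob(x1, x2, x3):
    """Estimated Prob(Y = 1 | x1, x2, x3) from the fitted logistic model."""
    z = b0 + b1 * x1 + b2 * x2 + b3 * x3
    return math.exp(z) / (1 + math.exp(z))

# Base case: high school or below, no residence change, low income (2160 households surveyed)
p = adoption_prob(0, 0, 0)
print(round(p, 3), round(2160 * p))   # about 0.076 and about 164 estimated adopters
```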


The table below shows the estimated number of adopters for the various combinations of the
independent variables.

x1   x2   x3   # in sample   # adopters   Estimated        Fraction    Estimated
                                          (# adopters)     Adopters    Prob(Y = 1 | x1, x2, x3)
0    0    0    2160          153          164              0.071       0.076
0    0    1    1363          147          155              0.108       0.113
0    1    0    1137          226          206              0.199       0.181
0    1    1    547           139          140              0.254       0.257
1    0    0    886           61           78               0.069       0.088
1    1    0    1091          233          225              0.214       0.206
1    0    1    1925          287          252              0.149       0.131
1    1    1    1415          382          408              0.270       0.289

In data mining applications we will have validation data that is a hold-out sample not used in
fitting the model. We can now apply the model to these validation data.
Let us suppose we have the following validation data consisting of 598 households:
x1   x2   x3   # in validation   # adopters in        Estimated        Error
               sample            validation sample    (# adopters)     (Estimate - Actual)
0    0    0    29                3                    2.20             -0.80
0    0    1    23                7                    2.61             -4.39
0    1    0    112               25                   20.30            -4.70
0    1    1    143               27                   36.71             9.71
1    0    0    27                2                    2.37              0.37
1    1    0    54                12                   11.15            -0.85
1    0    1    125               13                   16.34             3.34
1    1    1    85                30                   24.53            -5.47
Totals         598               119                  116.20           -2.80

The total error is -2.798 adopters or a percentage error in estimating adopters of -2.798/119 =
-2.4%.
As with multiple linear regression, we can build more complex models that reflect interactions
between independent variables by including factors that are calculated from the interacting factors.
For example, if we felt that there is an interactive effect between x1 and x2 we would add an interaction term x4 = x1 × x2.

5.6 Example 2: Financial Conditions of Banks

Table 2 gives data on a sample of banks. The second column records the judgment of an expert
on the financial condition of each bank. The last two columns give the values of two ratios used in
financial analysis of banks.


Table 2: Financial Conditions of Banks

Obs   Financial Condition (y)   Total Loans & Leases /   Total Expenses /
                                Total Assets (x1)        Total Assets (x2)
1     1                         0.64                     0.13
2     1                         1.04                     0.10
3     1                         0.66                     0.11
4     1                         0.80                     0.09
5     1                         0.69                     0.11
6     1                         0.74                     0.14
7     1                         0.63                     0.12
8     1                         0.75                     0.12
9     1                         0.56                     0.16
10    1                         0.65                     0.12
11    0                         0.55                     0.10
12    0                         0.46                     0.08
13    0                         0.72                     0.08
14    0                         0.43                     0.08
15    0                         0.52                     0.07
16    0                         0.54                     0.08
17    0                         0.30                     0.09
18    0                         0.67                     0.07
19    0                         0.51                     0.09
20    0                         0.79                     0.13

Financial Condition = 1 for financially weak banks; = 0 for financially strong banks.

5.6.1 A Model with Just One Independent Variable

Consider first a simple logistic regression model with just one independent variable. This is analogous to the simple linear regression model in which we fit a straight line to relate the dependent
variable, y, to a single independent variable, x.
Let us construct a simple logistic regression model for classification of banks using the Total
Loans & Leases to Total Assets ratio as the independent variable in our model. This model would
have the following variables:
Dependent variable: Y = 1, if financially distressed; Y = 0, otherwise.
Independent (or explanatory) variable: x1 = Total Loans & Leases / Total Assets Ratio
The equation relating the dependent variable to the explanatory variable is:

Prob(Y = 1 | x_1) = \frac{\exp(\beta_0 + \beta_1 x_1)}{1 + \exp(\beta_0 + \beta_1 x_1)}


or, equivalently,

Odds(Y = 1 versus Y = 0) = \exp(\beta_0 + \beta_1 x_1).

The maximum likelihood estimates (more on this below) of the coefficients for the model are:

\hat{\beta}_0 = -6.926, \quad \hat{\beta}_1 = 10.989,

so that the fitted model is:

Prob(Y = 1 | x_1) = \frac{\exp(-6.926 + 10.989 x_1)}{1 + \exp(-6.926 + 10.989 x_1)}.

Figure 1 displays the data points and the fitted logistic regression model.

5.6.2 Multiplicative Model of Odds Ratios

We can think of the model as a multiplicative model of odds ratios as we did for Example 1. The
odds that a bank with a Loans & Leases/Assets ratio of zero will be in financial distress =
exp(-6.926) = 0.001. These are the base case odds. The odds of distress for a bank with a ratio
of 0.6 will increase by a multiplicative factor of exp(10.989 × 0.6) = 730 over the base case, so the
odds that such a bank will be in financial distress = 0.730.
Notice that there is a small difference in interpretation of the multiplicative factors for this
example compared to Example 1. While the interpretation of the sign of \beta_i remains as before, its
magnitude gives the amount by which the odds of Y = 1 against Y = 0 are changed for a unit
change in x_i.
If we construct a simple logistic regression model for classification of banks using the Total
Expenses/Total Assets ratio as the independent variable we would have the following variables:
Dependent variable: Y = 1, if financially distressed; Y = 0, otherwise.
Independent (or explanatory) variable: x2 = Total Expenses / Total Assets Ratio
The equation relating the dependent variable to the explanatory variable is:

Prob(Y = 1 | x_2) = \frac{\exp(\beta_0 + \beta_2 x_2)}{1 + \exp(\beta_0 + \beta_2 x_2)}

or, equivalently,

Odds(Y = 1 versus Y = 0) = \exp(\beta_0 + \beta_2 x_2).

The maximum likelihood estimates of the coefficients for the model are: \hat{\beta}_0 = -9.587, \hat{\beta}_2 = 94.345.
Figure 2 displays the data points and the fitted logistic regression model.

5.6.3 Computation of Estimates

As illustrated in Examples 1 and 2, estimation of coefficients is usually carried out based on the
principle of maximum likelihood, which ensures good asymptotic (large sample) properties for the
estimates. Under very general conditions maximum likelihood estimators are:
Consistent: the probability of the estimator differing from the true value approaches zero
with increasing sample size;
Asymptotically Efficient: the variance is the smallest possible among consistent estimators;
Asymptotically Normally-Distributed: this allows us to compute confidence intervals and
perform statistical tests in a manner analogous to the analysis of linear multiple regression
models, provided the sample size is large.
Algorithms to compute the coefficient estimates and confidence intervals are iterative and less
robust than algorithms for linear regression. Computed estimates are generally reliable for well-behaved data sets where the number of observations with dependent variable values of both 0 and
1 is large; their ratio is not too close to either zero or one; and when the number of coefficients
in the logistic regression model is small relative to the sample size (say, no more than 10%). As
with linear regression, collinearity (strong correlation amongst the independent variables) can lead
to computational difficulties. Computationally intensive algorithms have been developed recently
that circumvent some of these difficulties.

5.7 Appendix A - Computing Maximum Likelihood Estimates and Confidence Intervals for Regression Coefficients

We denote the coefficients by the p × 1 column vector \beta with the row element i equal to \beta_i. The n
observed values of the dependent variable will be denoted by the n × 1 column vector y with the
row element j equal to y_j; and the corresponding values of the independent variable i by x_{ij} for
i = 1, \ldots, p; j = 1, \ldots, n.

5.7.1 Data

y_j, x_{1j}, x_{2j}, \ldots, x_{pj}, \quad j = 1, 2, \ldots, n.


5.7.2 Likelihood Function

The likelihood function, L, is the probability of the observed data viewed as a function of the
parameters (the \beta_i in a logistic regression):

L = \prod_{j=1}^{n} \frac{e^{y_j(\beta_0 + \beta_1 x_{1j} + \beta_2 x_{2j} + \cdots + \beta_p x_{pj})}}{1 + e^{\beta_0 + \beta_1 x_{1j} + \beta_2 x_{2j} + \cdots + \beta_p x_{pj}}}
  = \prod_{j=1}^{n} \frac{e^{\sum_i \beta_i y_j x_{ij}}}{1 + e^{\sum_i \beta_i x_{ij}}}
  = \frac{e^{\sum_i (\sum_j y_j x_{ij}) \beta_i}}{\prod_{j=1}^{n} \left[1 + e^{\sum_i \beta_i x_{ij}}\right]}
  = \frac{e^{\sum_i \beta_i t_i}}{\prod_{j=1}^{n} \left[1 + e^{\sum_i \beta_i x_{ij}}\right]}

where t_i = \sum_j y_j x_{ij}.
These are the sufficient statistics for a logistic regression model, analogous to \bar{y} and S in linear
regression.

5.7.3 Loglikelihood Function

This is the logarithm of the likelihood function,

l = \sum_i \beta_i t_i - \sum_j \log\left[1 + e^{\sum_i \beta_i x_{ij}}\right].

We find the maximum likelihood estimates, \hat{\beta}_i, of \beta_i by maximizing the loglikelihood function
for the observed values of y_j and x_{ij} in our data. Since maximizing the log of a function is
equivalent to maximizing the function, we often work with the loglikelihood because it is generally
less cumbersome to use for mathematical operations such as differentiation.
Since the likelihood function can be shown to be concave, we will find the global maximum of
the function (if it exists) by equating the partial derivatives of the loglikelihood to zero and solving
the resulting nonlinear equations for \hat{\beta}_i:
\frac{\partial l}{\partial \beta_i} = t_i - \sum_j \frac{x_{ij} e^{\sum_i \beta_i x_{ij}}}{1 + e^{\sum_i \beta_i x_{ij}}}
  = t_i - \sum_j x_{ij} \pi_j = 0, \quad i = 1, 2, \ldots, p

or \sum_j x_{ij} \pi_j = t_i, where

\pi_j = \frac{e^{\sum_i \beta_i x_{ij}}}{1 + e^{\sum_i \beta_i x_{ij}}} = E(Y_j).

An intuitive way to understand these equations is to note that for i = 1, 2, \ldots, p:

\sum_j x_{ij} E(Y_j) = \sum_j x_{ij} y_j.

In words, the maximum likelihood estimates are such that the expected values of the sufficient
statistics are equal to their observed values.
Note: if the model includes the constant term (x_{ij} = 1 for all j), then \sum_j E(Y_j) = \sum_j y_j, i.e., the
expected number of successes (responses of one) using the MLE estimates of \beta_i equals the observed
number of successes. The \hat{\beta}_i's are consistent, asymptotically efficient and follow a multivariate
Normal distribution (subject to mild regularity conditions).

5.7.4 Algorithm

A popular algorithm for computing \hat{\beta}_i uses the Newton-Raphson method for maximizing twice
differentiable functions of several variables (see Appendix B).
The Newton-Raphson method involves computing the following successive approximations to
find the \hat{\beta}_i that maximize the likelihood function:

\beta^{t+1} = \beta^t + [I(\beta^t)]^{-1} \nabla l(\beta^t),

where \nabla l is the gradient (score) vector of the loglikelihood and I is the information matrix with

I_{ij} = -\frac{\partial^2 l}{\partial \beta_i \partial \beta_j}.

On convergence, the diagonal elements of I(\hat{\beta})^{-1} give squared standard errors (approximate
variances) for the \hat{\beta}_i.
Confidence intervals and hypothesis tests are based on the asymptotic Normal distribution of \hat{\beta}_i.
The loglikelihood function is always negative and does not have a maximum when it can be
made arbitrarily close to zero. In that case the likelihood function can be made arbitrarily close
to one and the first term of the loglikelihood function given above approaches infinity. In this
situation the predicted probabilities for observations with y_j = 0 can be made arbitrarily close to
0 and those for y_j = 1 can be made arbitrarily close to 1 by choosing suitably large absolute
values of some \beta_i. This is the situation when we have a perfect model (at least in terms of the
training data set)! This phenomenon is more likely to occur when the number of parameters is a
large fraction (say > 20%) of the number of observations.

5.8 Appendix B - The Newton-Raphson Method

This method finds the values of \beta_i that maximize a twice differentiable concave function, g(\beta).
If the function is not concave, it finds a local maximum. The method uses successive quadratic
approximations to g based on Taylor series. It converges rapidly if the starting value, \beta^0, is
reasonably close to the maximizing value, \hat{\beta}, of \beta.


The gradient vector \nabla g and the Hessian matrix, H, as defined below, are used to update an
estimate \beta^t to \beta^{t+1}:

\nabla g(\beta^t) = \left[ \cdots \; \frac{\partial g}{\partial \beta_i} \; \cdots \right]_{\beta = \beta^t}, \qquad
H(\beta^t) = \left[ \cdots \; \frac{\partial^2 g}{\partial \beta_i \partial \beta_k} \; \cdots \right]_{\beta = \beta^t}.

The Taylor series expansion around \beta^t gives us:

g(\beta) \approx g(\beta^t) + \nabla g(\beta^t)'(\beta - \beta^t) + \frac{1}{2}(\beta - \beta^t)' H(\beta^t)(\beta - \beta^t).

Provided H(\beta^t) is negative definite (as it is for a strictly concave g), the maximum of this approximation occurs when its derivative is zero:

\nabla g(\beta^t) + H(\beta^t)(\beta - \beta^t) = 0

or

\beta = \beta^t - [H(\beta^t)]^{-1} \nabla g(\beta^t).

This gives us a way to compute \beta^{t+1}, the next value in our iterations:

\beta^{t+1} = \beta^t - [H(\beta^t)]^{-1} \nabla g(\beta^t).

To use this equation H should be non-singular. This is generally not a problem, although sometimes
numerical difficulties can arise due to collinearity.
Near the maximum the rate of convergence is quadratic, as it can be shown that
|\beta_i^{t+1} - \hat{\beta}_i| \le c|\beta_i^t - \hat{\beta}_i|^2 for some c \ge 0 when \beta_i^t is near \hat{\beta}_i for all i.
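The two appendices translate into a short iteration. The following sketch is our own (not XLMiner's implementation) of the Newton-Raphson updates for the logistic loglikelihood; it assumes the design matrix already contains a column of ones for the constant term and includes none of the safeguards needed for the perfect-model situation described in Section 5.7.4.

```python
import numpy as np

def logistic_mle(X, y, iters=25, tol=1e-8):
    """Newton-Raphson maximization of the logistic loglikelihood (X includes a constant column)."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(iters):
        pi = 1 / (1 + np.exp(-(X @ beta)))      # pi_j = E(Y_j) at the current beta
        grad = X.T @ (y - pi)                   # score vector: t_i - sum_j x_ij * pi_j
        W = pi * (1 - pi)
        info = X.T @ (X * W[:, None])           # information matrix (negative Hessian of l)
        step = np.linalg.solve(info, grad)
        beta = beta + step                      # beta^(t+1) = beta^t + I^(-1) * gradient
        if np.max(np.abs(step)) < tol:
            break
    se = np.sqrt(np.diag(np.linalg.inv(info)))  # approximate standard errors on convergence
    return beta, se
```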

Chapter 6

Neural Nets
6.1 The Neuron (a Mathematical Model)

After going through major development periods in the early 60s and mid 80s, artificial neural
networks have emerged as a major paradigm for data mining applications. They were a key development in the field of machine learning. Artificial neural networks were inspired by biological
findings relating to the behavior of the brain as a network of units called neurons. The human
brain is estimated to have around 10 billion neurons, each connected on average to 10,000 other
neurons. Each neuron receives signals through synapses that control the effects of the signal on the
neuron. These synaptic connections are believed to play a key role in the behavior of the brain. The
fundamental building block in an artificial neural network is the mathematical model of a neuron
as shown in Figure 1. The three basic components of the (artificial) neuron are:
1. The synapses or connecting links that provide weights, w_j, to the input values, x_j, for
j = 1, \ldots, m;

2. An adder that sums the weighted input values to compute the input to the activation function,

v = w_0 + \sum_{j=1}^{m} w_j x_j,

where w_0 is called the bias (not to be confused with statistical bias in prediction or estimation) and is a numerical value associated with the neuron. It is convenient
to think of the bias as the weight for an input x_0 whose value is always equal to one, so that

v = \sum_{j=0}^{m} w_j x_j;

3. An activation function g (also called a squashing function) that maps v to g(v), the output
value of the neuron. This function is a monotone function.

Figure 1: A Neuron
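To make the three components concrete, here is a minimal sketch of the neuron of Figure 1; the weights, bias and inputs are hypothetical values chosen only for illustration.

```python
import math

def neuron_output(weights, bias, inputs, g=lambda v: 1 / (1 + math.exp(-v))):
    """Output of one artificial neuron: activation g applied to the bias plus the weighted inputs."""
    v = bias + sum(w * x for w, x in zip(weights, inputs))   # v = w0 + sum_j w_j x_j
    return g(v)

# Two inputs, with the logistic (sigmoid) activation discussed later in the chapter
print(neuron_output([0.5, -0.3], 0.1, [1.0, 2.0]))   # g(0.1 + 0.5 - 0.6) = g(0) = 0.5
```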

While there are numerous different (artificial) neural network architectures that have been
studied by researchers, the most successful applications in data mining of neural networks have been
multilayer feedforward networks. These are networks in which there is an input layer consisting of
nodes that simply accept the input values and successive layers of nodes that are neurons like the
one depicted in Figure 1. The outputs of neurons in a layer are inputs to neurons in the next layer.
The last layer is called the output layer. Layers between the input and output layers are known as
hidden layers. Figure 2 is a diagram for this architecture.

6.2 The Multilayer Neural Networks


Figure 2 : Multilayer Feed-forward Neural Network

In a supervised setting where a neural net is used to predict a numerical quantity there is one
neuron in the output layer and its output is the prediction. When the network is used for classification, the output layer typically has as many nodes as the number of classes, and the output layer
node with the largest output value gives the network's estimate of the class for a given input. In the
special case of two classes it is common to have just one node in the output layer, the classification
between the two classes being made by applying a cut-off to the output value at the node.

6.2.1 Single Layer Networks

Let us begin by examining neural networks with just one layer of neurons (output layer only, no
hidden layers). The simplest network consists of just one neuron with the function g chosen to
be the identity function, g(v) = v for all v. In this case notice that the output of the network is
\sum_{j=0}^{m} w_j x_j, a linear function of the input vector x with components x_j. Does this seem familiar?
It looks similar to multiple linear regression. If we are modeling the dependent variable y using
multiple linear regression, we can interpret the neural network as a structure that predicts a value
\hat{y} for a given input vector x with the weights being the coefficients. If we choose these weights to
minimize the mean square error using observations in a training set, these weights would simply be
the least squares estimates of the coefficients. The weights in neural nets are also often designed to
minimize mean square error in a training data set. There is, however, a different orientation in the
case of neural nets: the weights are "learned" over time, rather than calculated in one step. The
network is presented with cases from the training data one at a time and the weights are revised
after each case in an attempt to minimize the mean square error.
This process of incremental adjustment of weights is based on the error made on training cases
and is known as "training" the neural net. The almost universally used dynamic updating algorithm
for the neural net version of linear regression is known as the Widrow-Hoff rule or the least-mean-square (LMS) algorithm. It is simply stated. Let x(i) denote the input vector x for the i-th case
used to train the network, and denote the weights (before this case is presented to the net) by the vector
w(i). The updating rule is

w(i+1) = w(i) + \eta (y(i) - \hat{y}(i)) x(i), with w(0) = 0.

It can be shown that if the network is trained in this manner by repeatedly presenting the training
data observations one at a time, then for suitably small (absolute) values of \eta the network will learn
(converge to) the optimal values of w. Note that the training data may have to be presented several
times for w(i) to be close to the optimal w. The advantage of dynamic updating is that the network
tracks moderate time trends in the underlying linear model quite effectively.
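A sketch of the Widrow-Hoff (LMS) updating rule is given below; the learning constant eta and the number of passes are illustrative choices, and the column of ones plays the role of the bias input x0 described earlier.

```python
import numpy as np

def lms_train(X, y, eta=0.01, epochs=50):
    """Widrow-Hoff updating: w(i+1) = w(i) + eta * (y(i) - yhat(i)) * x(i), starting from w(0) = 0."""
    X = np.column_stack([np.ones(len(y)), X])   # prepend x0 = 1 so w[0] acts as the bias weight
    w = np.zeros(X.shape[1])
    for _ in range(epochs):                     # the data may need several presentations
        for xi, yi in zip(X, y):
            yhat = w @ xi                       # identity activation: the output is linear
            w = w + eta * (yi - yhat) * xi
    return w
```

For suitably small eta the returned weights approach the least squares coefficients, which is the point made in the text.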
If we consider using the single layer neural net for classification into c classes, we would use c
nodes in the output layer. If we think of classical discriminant analysis in neural network terms,
the coefficients in Fisher's classification functions give us weights for the network that are optimal if
the input vectors come from multivariate Normal distributions with a common covariance matrix.
Maximum likelihood coefficients for logistic regression can also be considered as weights in a
neural network that minimize a function of the residuals called the deviance. In this case the logistic
function g(v) = \frac{e^v}{1 + e^v} is the activation function for the output node.

6.2.2 Multilayer Neural Networks

Multilayer neural networks are undoubtedly the most popular networks used in applications. While
it is possible to consider many activation functions, in practice it has been found that the logistic
(also called the sigmoid) function g(v) = \frac{e^v}{1 + e^v} (or minor variants such as the tanh function) works
best as the activation function to map the sum of the weighted inputs to the neuron's output. In
fact the revival of interest in neural nets was sparked by successes in training neural networks using
this function in place of the historical (biologically inspired) step function (the perceptron).
Notice that using a linear function does not achieve anything in multilayer networks that is beyond
what can be done with single layer networks with linear activation functions. The practical value
of the logistic function arises from the fact that it has a squashing effect on very small or very large
values of v, but is almost linear in the range where g(v) is between 0.1 and 0.9.
In theory it is sufficient to consider networks with two layers of neurons (one hidden and one
output layer) and this is certainly the case for most applications. There are, however, a number of
situations where three and sometimes four and five layers have been more effective. For prediction,
the output node is often given a linear activation function to provide forecasts that are not limited
to the zero to one range. An alternative is to scale the output to the linear part (0.1 to 0.9) of the
logistic function.
Unfortunately there is no clear theory to guide us on choosing the number of nodes in each
hidden layer or indeed the number of layers. The common practice is to use trial and error,
although there are schemes for combining optimization methods such as genetic algorithms with
network training for these parameters.
Since trial and error is a necessary part of neural net applications it is important to have an
understanding of the standard method used to train a multilayered network: backpropagation. It
is no exaggeration to say that the speed of the backprop algorithm made neural nets a practical
tool in the manner that the simplex method made linear optimization a practical tool. The revival
of strong interest in neural nets in the mid 80s was in large measure due to the efficiency of the
backprop algorithm.

6.3 Example 1: Fisher's Iris Data

Let us look at the Iris data that Fisher analyzed using discriminant analysis. Recall that the data
consisted of four measurements on three types of iris flowers. There are 50 observations for each
class of iris. A part of the data is reproduced below.

OBS#   SPECIES           CLASSCODE   SEPLEN   SEPW   PETLEN   PETW
1      Iris-setosa       1           5.1      3.5    1.4      0.2
2      Iris-setosa       1           4.9      3      1.4      0.2
3      Iris-setosa       1           4.7      3.2    1.3      0.2
4      Iris-setosa       1           4.6      3.1    1.5      0.2
5      Iris-setosa       1           5        3.6    1.4      0.2
6      Iris-setosa       1           5.4      3.9    1.7      0.4
7      Iris-setosa       1           4.6      3.4    1.4      0.3
8      Iris-setosa       1           5        3.4    1.5      0.2
9      Iris-setosa       1           4.4      2.9    1.4      0.2
10     Iris-setosa       1           4.9      3.1    1.5      0.1
...    ...               ...         ...      ...    ...      ...
51     Iris-versicolor   2           7        3.2    4.7      1.4
52     Iris-versicolor   2           6.4      3.2    4.5      1.5
53     Iris-versicolor   2           6.9      3.1    4.9      1.5
54     Iris-versicolor   2           5.5      2.3    4        1.3
55     Iris-versicolor   2           6.5      2.8    4.6      1.5
56     Iris-versicolor   2           5.7      2.8    4.5      1.3
57     Iris-versicolor   2           6.3      3.3    4.7      1.6
58     Iris-versicolor   2           4.9      2.4    3.3      1
59     Iris-versicolor   2           6.6      2.9    4.6      1.3
60     Iris-versicolor   2           5.2      2.7    3.9      1.4
...    ...               ...         ...      ...    ...      ...
101    Iris-virginica    3           6.3      3.3    6        2.5
102    Iris-virginica    3           5.8      2.7    5.1      1.9
103    Iris-virginica    3           7.1      3      5.9      2.1
104    Iris-virginica    3           6.3      2.9    5.6      1.8
105    Iris-virginica    3           6.5      3      5.8      2.2
106    Iris-virginica    3           7.6      3      6.6      2.1
107    Iris-virginica    3           4.9      2.5    4.5      1.7
108    Iris-virginica    3           7.3      2.9    6.3      1.8
109    Iris-virginica    3           6.7      2.5    5.8      1.8
110    Iris-virginica    3           7.2      3.6    6.1      2.5

If we use a neural net architecture for this classification problem we will need 4 nodes (not
counting the bias node) in the input layer, one for each of the 4 independent variables, and 3
neurons (one for each class) in the output layer. Let us have one hidden layer with 25 neurons.
Notice that there will be a total of 25 connections from each node in the input layer to each node
in the hidden layer. This makes a total of 4 x 25 = 100 connections between the input layer and
the hidden layer. In addition there will be a total of 3 connections from each node in the hidden
layer to each node in the output layer. This makes a total of 25 x 3 = 75 connections between the
hidden layer and the output layer. Using the standard sigmoid (logistic) activation functions, the
network was trained with a run consisting of 60,000 iterations.
Each iteration consists of presentation to the input layer of the independent variables in a case,
followed by successive computations of the outputs of the neurons of the hidden layer and the output layer using the appropriate weights. The output values of neurons in the output layer are used
to compute the error. This error is used to adjust the weights of all the connections in the network
using backward propagation (backprop) to complete the iteration. Since the training data
has 150 cases, each case was presented to the network 400 times. Another way of stating this is
to say the network was trained for 400 epochs where an epoch consists of one sweep through the
entire training data. The results following the last epoch of training the neural net on this data
are shown below:


Figure 3: XLMiner output for neural network for Iris data

Classification Confusion Matrix
                    Computed Class
Desired Class       1      2      3     Total
1                  50      0      0     50
2                   0     49      1     50
3                   0      1     49     50
Total              50     50     50     150

Error Report
Class      Patterns   # Errors   % Errors   Std Dev.
1          50         0          0.00       (0.00)
2          50         1          2.00       (1.98)
3          50         1          2.00       (1.98)
Overall    150        2          1.3        (0.92)

The classification error of 1.3% is comparable to the error using discriminant analysis, which
was 2% (see section on discriminant analysis). Notice that had we stopped after only one pass of
the data (150 iterations) the error would have been much worse as shown below:

Figure 4: XLMiner output for neural network for Iris data, after only one epoch

Classification Confusion Matrix
                    Computed Class
Desired Class       1      2      3     Total
1                  10      7      2     19
2                  13      1      6     20
3                  12      5      4     21
Total              35     13     12     60

The classification error rate of 1.3% was obtained by careful choice of key control parameters
for the training run by trial and error. If we set the control parameters to poor values we can have
terrible results. To understand the parameters involved we need to understand how the backward
propagation algorithm works.

6.4 The Backward Propagation Algorithm - Classification

The backprop algorithm cycles through two distinct passes, a forward pass followed by a backward
pass through the layers of the network. The algorithm alternates between these passes several times
as it scans the training data. Typically the training data has to be scanned several times before
the network learns to make good classifications.

6.4.1 Forward Pass - Computation of Outputs of all the Neurons in the Network

The algorithm starts with the first hidden layer using as input values the independent variables of
a case (often called an exemplar) from the training data set. The neuron outputs are computed
for all neurons in the first hidden layer by performing the relevant sum and activation function
evaluations. These outputs are the inputs for neurons in the second hidden layer. Again the
relevant sum and activation function calculations are performed to compute the outputs of second
layer neurons. This continues layer by layer until we reach the output layer and compute the
outputs for this layer. These output values constitute the neural net's guess at the value of the
dependent (output) variable. If we are using the neural net for classification, and we have c classes,
the activation functions yield c neuron outputs for the c output nodes. The output node with the
largest value determines the net's classification. (If c = 2, we can use just one output node with a
cut-off value to map a numerical output value to one of the two classes.)
Where do the initial weights come from and how are they adjusted? Let us denote by w_{ij}
the weight of the connection from node i to node j. The values of w_{ij} are initialized to small
(generally random) numbers in the range 0.00 ± 0.05. These weights are adjusted to new values in
the backward pass as described below.

6.4.2 Backward Pass: Propagation of Error and Adjustment of Weights
This phase begins with the computation of error at each neuron in the output layer. A popular
error function is the squared difference between o_k, the output of node k, and y_k, the target value for
that node. The target value is just 1 for the output node corresponding to the class of the exemplar
and zero for other output nodes. (In practice it has been found better to use values of 0.9 and 0.1
respectively.) For each output layer node compute an adjustment term

\delta_k = o_k(1 - o_k)(y_k - o_k).

These terms are used to adjust the weights of the connections between the last-but-one layer of the
network and the output layer. The adjustment is similar to the simple Widrow-Hoff rule that we
saw earlier. The new value of the weight w_{jk} of the connection from node j to node k is given by:

w_{jk}^{new} = w_{jk}^{old} + \eta \, o_j \delta_k.

Here \eta is an important tuning parameter that is chosen by trial and error by
repeated runs on the training data. Typical values for \eta are in the range 0.1 to 0.9. Low values give
slow but steady learning, high values give erratic learning and may lead to an unstable network.
The process is repeated for the connections between nodes in the last hidden layer and the
last-but-one hidden layer. The weight for the connection between nodes i and j is given by:

w_{ij}^{new} = w_{ij}^{old} + \eta \, o_i \delta_j, where \delta_j = o_j(1 - o_j) \sum_k w_{jk} \delta_k, for each node j in the last hidden layer.

The backward propagation of weight adjustments along these lines continues until we reach the
input layer. At this time we have a new set of weights on which we can make a new forward pass
when presented with a training data observation.
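The two passes can be written compactly for a network with a single hidden layer. The sketch below processes one training case; biases are omitted for brevity and the matrix shapes (rows of W_hid and W_out correspond to receiving neurons) are our own convention, not part of the text.

```python
import numpy as np

def sigmoid(v):
    return 1 / (1 + np.exp(-v))

def backprop_one_case(x, target, W_hid, W_out, eta=0.5):
    """One forward pass and one backward pass of weight adjustment for a single exemplar."""
    # Forward pass
    o_hid = sigmoid(W_hid @ x)            # outputs of the hidden-layer neurons
    o_out = sigmoid(W_out @ o_hid)        # outputs of the output-layer neurons
    # Backward pass: adjustment terms as in the rules above
    delta_out = o_out * (1 - o_out) * (target - o_out)
    delta_hid = o_hid * (1 - o_hid) * (W_out.T @ delta_out)
    # Weight updates: w_new = w_old + eta * o * delta
    W_out = W_out + eta * np.outer(delta_out, o_hid)
    W_hid = W_hid + eta * np.outer(delta_hid, x)
    return W_hid, W_out
```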

6.5 Adjustment for Prediction

There is a minor adjustment for prediction problems where we are trying to predict a continuous
numerical value. In that situation we change the activation function for output layer neurons to
the identity function that has output value=input value. (An alternative is to rescale and recenter
the logistic function to permit the outputs to be approximately linear in the range of dependent
variable values).

6.6 Multiple Local Optima and Epochs

Due to the complexity of the function and the large numbers of weights that are being trained as
the network "learns", there is no assurance that the backprop algorithm (or indeed any practical
algorithm) will find the optimum weights that minimize error. The procedure can get stuck at a
local minimum. It has been found useful to randomize the order of presentation of the cases in
a training set between different scans. It is possible to speed up the algorithm by batching, that
is, updating the weights for several cases in a pass, rather than after each case. However, at least
the extreme case of using the entire training data set on each update has been found to get stuck
frequently at poor local minima.
A single scan of all cases in the training data is called an epoch. Most applications of feedforward
networks and backprop require several epochs before errors are reasonably small. A number of
modications have been proposed to reduce the number of epochs needed to train a neural net.
One commonly employed idea is to incorporate a momentum term that injects some inertia in the
weight adjustment on the backward pass. This is done by adding a term to the expression for
weight adjustment for a connection that is a fraction of the previous weight adjustment for that
connection. This fraction is called the momentum control parameter. High values of the momentum
parameter will force successive weight adjustments to be in similar directions. Another idea is to
vary the adjustment parameter so that it decreases as the number of epochs increases. Intuitively
this is useful because it avoids overtting that is more likely to occur at later epochs than earlier
ones.

6.7 Overfitting and the Choice of Training Epochs

A weakness of the neural network is that it can easily be overfitted, causing the error rate on
validation data (and, most importantly, new data) to be too large. It is therefore important to limit
the number of training epochs and not to overtrain the network. One system that some algorithms
use to set the number of training epochs is to use the validation data set periodically, while the
network is being trained, to compute the error rate on it. The validation error decreases in the early
epochs of backprop but after a while it begins to increase. The point of minimum validation error
is a good indicator of the best number of epochs for training, and the weights at that stage are
likely to provide the best error rate in new data.
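In code, the stopping rule amounts to tracking the validation error per epoch and keeping the weights from the epoch where it bottoms out. The error values below are hypothetical and serve only to illustrate the typical U-shaped curve.

```python
def best_epoch(validation_errors):
    """Return the 1-based epoch with the smallest validation error, the stopping point suggested above."""
    best = min(range(len(validation_errors)), key=lambda i: validation_errors[i])
    return best + 1

# Hypothetical validation error curve: it falls at first, then rises as the net begins to overfit
errs = [0.22, 0.17, 0.14, 0.12, 0.11, 0.115, 0.13, 0.15]
print(best_epoch(errs))   # 5
```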

6.8 Adaptive Selection of Architecture

One of the time-consuming and complex aspects of using backprop is that we need to decide on an
architecture before we can use backprop. The usual procedure is to make intelligent guesses using
past experience and to do several trial and error runs on different architectures. Algorithms exist
that grow the number of nodes selectively during training or trim them in a manner analogous to
what we have seen with CART. Research continues on such methods. However, as of now there
seems to be no automatic method that is clearly superior to the trial and error approach.

6.9 Successful Applications

There have been a number of very successful applications of neural nets in engineering applications.
One of the well known ones is ALVINN, an autonomous vehicle driving application for normal
speeds on highways. The neural net uses a 30x32 grid of pixel intensities from a fixed camera on
the vehicle as input; the output is the direction of steering. It uses 30 output units representing
classes such as "sharp left", "straight ahead", and "bear right". It has 960 input units and a single
layer of 4 hidden neurons. The backprop algorithm is used to train ALVINN.
A number of successful applications have been reported in financial applications (see Trippi
and Turban, 1996) such as bankruptcy predictions, currency market trading, picking stocks and
commodity trading. Credit card and CRM (customer relationship management) applications have
also been reported.

Chapter 7

Classification and Regression Trees

If one had to choose a classification technique that performs well across a wide range of situations
without requiring much effort from the analyst while being readily understandable by the consumer
of the analysis, a strong contender would be the tree methodology developed by Breiman, Friedman,
Olshen and Stone (1984). We will discuss this classification procedure first, then in later sections
we will show how the procedure can be extended to prediction of a continuous dependent variable.
The program that Breiman et al. created to implement these procedures was called CART, for
Classification And Regression Trees.

7.1 Classification Trees

There are two key ideas underlying classification trees. The first is the idea of recursive partitioning
of the space of the independent variables. The second is the idea of pruning using validation data. In
the next few sections we describe recursive partitioning; subsequent sections explain the pruning
methodology.

7.2 Recursive Partitioning

Let us denote the dependent (outcome) variable by y and the independent (predictor) variables by
x_1, x_2, x_3, \ldots, x_p. In classification, the outcome variable will be a categorical variable. Recursive
partitioning divides up the p-dimensional space of the x variables into non-overlapping multi-dimensional rectangles. The x variables here are considered to be continuous, binary or ordinal.
This division is accomplished recursively (i.e., sequentially, operating on the results of prior divisions). First, one of the variables is selected, say x_i, and a value of x_i, say s_i, is chosen to split the
p-dimensional space into two parts: one part that contains all the points with x_i \le s_i and the other
with all the points with x_i > s_i. Then one of these two parts is divided in a similar manner by
choosing a variable again (it could be x_i or another variable) and a split value for the variable. This
results in three (multi-dimensional) rectangular regions. This process is continued so that we get
smaller and smaller rectangular regions. The idea is to divide the entire x-space up into rectangles
such that each rectangle is as homogeneous or "pure" as possible. By "pure" we mean containing
points that belong to just one class. (Of course, this is not always possible, as there may be points
that belong to different classes but have exactly the same values for every one of the independent
variables.) Let us illustrate recursive partitioning with an example.

7.3 Example 1 - Riding Mowers

A riding-mower manufacturer would like to find a way of classifying families in a city into those
likely to purchase a riding mower and those not likely to buy one. A pilot random sample of 12
owners and 12 non-owners in the city is undertaken. The data are shown in Table 1 and plotted in
Figure 1 below. The independent variables here are Income (x_1) and Lot Size (x_2). The categorical
y variable has two classes: owners and non-owners.
Table 1

Observation   Income ($000s)   Lot Size (000s sq. ft.)   Owners=1, Non-owners=2
1             60.0             18.4                      1
2             85.5             16.8                      1
3             64.8             21.6                      1
4             61.5             20.8                      1
5             87.0             23.6                      1
6             110.1            19.2                      1
7             108.0            17.6                      1
8             82.8             22.4                      1
9             69.0             20.0                      1
10            93.0             20.8                      1
11            51.0             22.0                      1
12            81.0             20.0                      1
13            75.0             19.6                      2
14            52.8             20.8                      2
15            64.8             17.2                      2
16            43.2             20.4                      2
17            84.0             17.6                      2
18            49.2             17.6                      2
19            59.4             16.0                      2
20            66.0             18.4                      2
21            47.4             16.4                      2
22            33.0             18.8                      2
23            51.0             14.0                      2
24            63.0             14.8                      2

Figure 1

If we apply the classification tree procedure to this data it will choose x_2 for the first split with
a splitting value of 19. The (x_1, x_2) space is now divided into two rectangles, one with the Lot Size
variable x_2 \le 19 and the other with x_2 > 19. See Figure 2.

Figure 2

Notice how the split has created two rectangles, each of which is much more homogeneous than
the rectangle before the split. The upper rectangle contains points that are mostly owners (9 owners
and 3 non-owners) while the lower rectangle contains mostly non-owners (9 non-owners and 3 owners).
How was this particular split selected? The algorithm examined each variable and all possible split
values for each variable to find the best split. What are the possible split values for a variable? They
are simply the mid-points between pairs of consecutive values for the variable. The possible split
points for x_1 are {38.1, 45.3, 50.1, \ldots, 109.5} and those for x_2 are {14.4, 15.4, 16.2, \ldots, 23}. These
split points are ranked according to how much they reduce impurity (heterogeneity of composition).
The reduction in impurity is defined as the overall impurity before the split minus the sum of the
impurities for the two rectangles that result from a split. There are a number of ways we could
measure impurity.
One popular measure of impurity is the Gini index. If we denote the classes by k, k = 1, 2, \ldots, C,
where C is the total number of classes for the y variable, the Gini impurity index for a rectangle A
is defined by

I(A) = 1 - \sum_{k=1}^{C} p_k^2,

where p_k is the fraction of observations in rectangle A that belong to class k. I(A) = 0 if all the
observations belong to a single class, and I(A) is maximized when all classes appear in equal
proportions in rectangle A; its maximum value is (C - 1)/C. XLMiner uses the delta splitting rule,
which is a modified version of the twoing splitting rule. The twoing splitting rule coincides with the
Gini splitting rule when the number of classes is 2; see Classification and Regression
Trees by Leo Breiman (1984, Sec. 11.8).
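The Gini computation for the first split of the riding-mower data can be checked directly. In the sketch below each resulting rectangle is weighted by its share of the cases when the post-split impurities are combined, which is a common convention; the unweighted sum described above ranks candidate splits similarly.

```python
def gini(counts):
    """Gini impurity I(A) = 1 - sum_k p_k^2 for a rectangle with the given class counts."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

# First split of the 24 points: 12 owners and 12 non-owners become two rectangles of 12 points each
before = gini([12, 12])                                      # 0.5, the maximum for two classes
after = (12 / 24) * gini([9, 3]) + (12 / 24) * gini([3, 9])  # combined impurity after the split
print(before, after, before - after)                         # reduction in impurity of 0.125
```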
The next split is on the Income variable, x_1, at the value 84.75. Figure 3 shows that once again
the tree procedure has astutely chosen to split a rectangle to increase the purity of the resulting
rectangles. The left lower rectangle, which contains data points with x_1 \le 84.75 and x_2 \le 19, has
all points that are non-owners (with one exception); while the right lower rectangle, which contains
data points with x_1 > 84.75 and x_2 \le 19, consists exclusively of owners.

Figure 3

The next split is shown below:

Figure 4

We can see how the recursive partitioning is refining the set of constituent rectangles to become
purer as the algorithm proceeds. The final stage of the recursive partitioning is shown in Figure 5.

Figure 5

Notice that now each rectangle is pure - it contains data points from just one of the two classes.


The reason the method is called a classification tree algorithm is that each split can be depicted
as a split of a node into two successor nodes. The first split is shown as a branching of the root
node of a tree in Figure 6.

Figure 6

The tree representing the first three splits is shown in Figure 7 below.

Figure 7

The full tree is shown in Figure 8 below. We have represented the nodes that have successors by
circles. The numbers inside the circle are the splitting values, and the name of the variable chosen
for splitting at that node is shown below the node. The number on the left fork at a decision node
shows the number of points in the decision node that had values less than or equal to the splitting
value, while the number on the right fork shows the number that had a greater value. These are
called decision nodes because if we were to use a tree to classify a new observation for which we
knew only the values of the independent variables, we would "drop" the observation down the tree
in such a way that at each decision node the appropriate branch is taken until we get to a node
that has no successors. Such terminal nodes are called the leaves of the tree. Each leaf node is
depicted with a rectangle, rather than a circle, and corresponds to one of the final rectangles into
which the x-space is partitioned. When the observation has dropped down all the way to a leaf we
can predict a class for it by simply taking a "vote" of all the training data that belonged to the leaf
when the tree was grown. The class with the highest vote is the class that we would predict for the
new observation. The number below the leaf node is the class with the most votes in the rectangle.
The % value in a leaf node shows the percentage of the total number of training observations that
belonged to that node. It is useful to note that the type of trees grown by CART (called binary
trees) has the property that the number of leaf nodes is exactly one more than the number of
decision nodes.

Figure 8

7.4 Pruning

The second key idea in the classification and regression tree procedure, that of using the validation
data to prune back the tree that is grown from training data, was the real innovation. Previously,
methods had been developed that were based on the idea of recursive partitioning but they had used
rules to prevent the tree from growing excessively and overfitting the training data. For example,
CHAID (Chi-Squared Automatic Interaction Detection) is a recursive partitioning method that
predates classification and regression tree (CART) procedures by several years and is widely used
in database marketing applications to this day. It uses a well-known statistical test (the chi-square
test for independence) to assess whether splitting a node improves the purity by a statistically
significant amount. If the test does not show a significant improvement the split is not carried out.
By contrast, CART and CART-like procedures use validation data to prune back the tree that has
been deliberately overgrown using the training data.
The idea behind pruning is to recognize that a very large tree is likely to be overfitting the
training data. In our example, the last few splits resulted in rectangles with very few points (indeed
four rectangles in the full tree have just one point). We can see intuitively that these last splits are
likely to be simply capturing noise in the training set rather than reflecting patterns that would
occur in future data such as the validation data. Pruning consists of successively selecting a decision
node and re-designating it as a leaf node (lopping off the branches extending beyond that decision
node (its subtree) and thereby reducing the size of the tree). The pruning process trades off
misclassification error in the validation data set against the number of decision nodes in the pruned
tree to arrive at a tree that captures the patterns but not the noise in the training data. It uses a
criterion called the cost complexity of a tree to generate a sequence of trees that are successively
smaller to the point of having a tree with just the root node. (What is the classification rule for
a tree with just one node?) We then pick as our best tree the one tree in the sequence that gives
the smallest misclassification error in the validation data.
The cost complexity criterion that classification and regression tree procedures use is simply the
misclassification error of a tree (based on the training data) plus a penalty factor for the size of
the tree. The penalty factor is based on a parameter, let us call it \alpha, that is the per-node penalty.
The cost complexity criterion for a tree is thus Err(T) + \alpha L(T), where Err(T) is the fraction of
training data observations that are misclassified by tree T, L(T) is the number of leaves in tree T
and \alpha is the per-node penalty cost: a number that we will vary upwards from zero. When \alpha = 0
there is no penalty for having too many nodes in a tree and the best tree using the cost complexity
criterion is the full-grown unpruned tree. When we increase \alpha to a very large value the penalty cost
component swamps the misclassification error component of the cost complexity criterion function
and the best tree is simply the tree with the fewest leaves, namely the tree with simply one node.
As we increase the value of \alpha from zero, at some value we will first encounter a situation where,
for some tree T_1 formed by cutting off the subtree at a decision node, we just balance the extra
cost of increased misclassification error (due to fewer leaves) against the penalty cost saved from
having fewer leaves. We prune the full tree at this decision node by cutting off its subtree and
redesignating this decision node as a leaf node. Let's call this tree T_1. We now repeat the logic that
we had applied previously to the full tree, with the new tree T_1, by further increasing the value of
\alpha. Continuing in this manner we generate a succession of trees with diminishing numbers of nodes
all the way to the trivial tree consisting of just one node.
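The effect of the per-node penalty is easy to see numerically. The (training error, number of leaves) pairs below are hypothetical, but they show how raising alpha makes successively smaller trees the winners under the cost complexity criterion.

```python
def cost_complexity(err, leaves, alpha):
    """Cost complexity of a tree: training misclassification rate plus a per-leaf penalty."""
    return err + alpha * leaves

# Hypothetical pruning sequence: (training error, number of leaves) for successively smaller trees
trees = [(0.00, 31), (0.05, 20), (0.12, 10), (0.25, 4), (0.38, 1)]

for alpha in (0.0, 0.01, 0.04, 0.1):
    best = min(trees, key=lambda t: cost_complexity(t[0], t[1], alpha))
    print(alpha, best)   # as alpha grows, the preferred tree shrinks from 31 leaves to a single node
```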
From this sequence of trees it seems natural to pick the one that gave the minimum misclassification error on the validation data set. We call this the Minimum Error Tree.
Let us use the Boston Housing data to illustrate. (Note: there are both 2-class and 3-class
versions of this problem. The 3-class version has median house value splits at $15,000 and $30,000;
the 2-class version has a split only at $30,000.) Shown below is the output that XLMiner generates
when it is using the training data in the tree-growing phase of the algorithm:
Training Log
Growing the Tree
# Nodes Error
0
38.16
1
15.64
2
5.75
3
3.25
4
2.94
5
1.86
6
1.42
7
1.26
8
1.2
9
0.63
10
0.59
11
0.49
12
0.42
13
0.35
14
0.34
15
0.32

86

7. Classication and Regression Trees


Growing the Tree
# Nodes Error
16
0.25
17
0.22
18
0.21
19
0.15
20
0.09
21
0.09
22
0.09
23
0.08
24
0.05
25
0.03
26
0.03
27
0.02
28
0.01
29
0
30
0
Training Misclassication Summary
Classication Confusion Matrix
Predicted Class
Actual Class 1
2
3
1
59
0
0
2
0 194
0
3
0
0
51

Class
1
2
3
Overall

Error Report
# Cases # Errors
59
0
19.4
0
51
0
30.4
0

% Error
0.00
0.00
0.00
0.00

(These are cases in the training data)


The top table logs the tree-growing phase by showing in each row the number of decision nodes
in the tree at each stage and the corresponding (percentage) misclassication error for the training
data applying the voting rule at the leaves. We see that the error steadily decreases as the number
of decision nodes increases from zero (where the tree consists of just the root node) to thirty. The
error drops steeply in the beginning, going from 36% to 3% with just an increase of decision nodes
from 0 to 3. Thereafter the improvement is slower as we increase the size of the tree. Finally we

7.4 Pruning

87

stop at a full tree of 30 decision nodes (equivalently, 31 leaves) with no error in the training data,
as is also shown in the confusion table and the error report by class.
The output generated by XLMiner during the pruning phase is shown below.
# Decision
Nodes
30
29
28
27
26
25
24
23
22
21
20
19
18
17
16
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1

Training
Error
0.00%
0.00%
0.01%
0.02%
0.03%
0.03%
0.05%
0.08%
0.09%
0.09%
0.09%
0.15%
0.21%
0.22%
0.25%
0.32%
0.34%
0.35%
0.42%
0.49%
0.59%
0.63%
1.20%
1.28%
1.42%
1.88%
2.94%
3.29%
5.75%
15.64%

Validation Error
15.84%
15.84%
15.84%
15.84%
15.84%
15.84%
15.84%
15.84%
15.84%
16.84%
15.84%
15.84%
15.84%
15.84%
15.84%
15.84%
15.35%
14.85%
14.85%
15.35%
14.85%
15.84%
15.84%
16.83%
16.83%
15.84%
21.78%
21.78%
30.20%
33.66%

Minimum Error Prune| Std.EIT.|0.02501957

Best Prune

Validation Misclassication Summary


Classication Confusion Matrix
Predicted Class
Actual Class 1
2
3
1
25 10
0
2
5 120
9
3
0
8
25

88

7. Classication and Regression Trees

Class
1
2
3
Overall

Error Report
# Cases # Errors
35
10
134
14
33
8
202
32

% Error
28.57
10.45
24.24
15.84

Notice now that as the number of decision nodes decreases, the error in the validation data
has a slow decreasing trend (with some uctuation) up to a 14.85% error rate for the tree with 10
nodes. This is more readily visible from the graph below. Thereafter, the error increases, going up
sharply when the tree is quite small. The Minimum Error Tree is selected to be the one with 10
decision nodes (why not the one with 13 decision nodes?).

7.5 Minimum Error Tree

7.5

89

Minimum Error Tree

This Minimum Error Tree is shown in Figure 9.


Figure 9

7.6

Best Pruned Tree

You will notice that the XLMiner output from the pruning phase highlights another tree besides
the Minimum Error Tree. This is the Best Pruned Tree, the tree with 5 decision nodes. The reason
this tree is important is that it is the smallest tree in the pruning sequence that has an error that
is within one standard error of the Minimum Error Tree. The estimate of error that we get from
the validation data is just an estimate. If wed had another set of validation data, the minimum
error would have been dierent. The minimum error rate we have computed can be viewed as an
observed value of a random variable with standard error (estimated standard deviation) equal to

where [Emin (1 Emin )/Nval ] is the error rate (as a fraction) for the minimum error tree and
Nval is the number of observations in the validation data set. For our example Emin = 0.1485 and
Nval = 202, so that the standard error is 0.025. The Best Pruned Tree is shown in Figure 10.

90

7. Classication and Regression Trees


Figure 10

We show the confusion table and summary of classication errors for the Best Pruned Tree
below.

7.7 Classication Rules from Trees

7.7

91

Classication Rules from Trees

One of the reasons tree classiers are very popular is that they provide easily understandable
classication rules (at least if the trees are not too large). Each leaf is equivalent to a classication
rule. For example, the upper left leaf in the Best Pruned Tree, above, gives us the rule:
IF(LSTAT 15.145) AND (ROOM 6.5545) THEN CLASS = 2.
Compared to the output of other classiers such as discriminant functions, such rules are easily
explained to managers and operating sta. Their logic is certainly far more transparent than that
of weights in neural networks!

7.8

Regression Trees

Regression trees for prediction operate in much the same fashion as classication trees. The output
variable is a continuous variable in this case, but both the principle and the procedure are the same
- many splits are attempted and, for each, we measure impurity in each branch of the resulting
tree (e.g. squared residuals). The tree procedure then selects the split that minimizes the sum of
such measures.
The tree method is a good o-the-shelf classier and predictor.
We said at the beginning of this chapter that classication trees require relatively little eort
from developers. Let us give our reasons for this statement. Trees need no tuning parameters.
There is no need for transformation of variables (any monotone transformation of the variables will
give the same trees). Variable subset selection is automatic since it is part of the split selection;
in our example notice that the Best Pruned Tree has automatically selected just three variables
(LSTAT, RM and CRIM) out of the set thirteen variables available. Trees are also intrinsically
robust to outliers, since the choice of a split depends on the ordering of observation values and not
on the absolute magnitudes of these values.
Finally, trees handle missing data without having to impute values or delete observations with
missing values. The method can also be extended to incorporate an importance ranking for the
variables in terms of their impact on quality of the classication.
Notes:
1. We have not described how categorical independent variables are handled in CART. In principle there is no diculty. The split choices for a categorical variable are all ways in which the
set of categorical values can be divided into two subsets. For example a categorical variable
with 4 categories, say {1,2,3,4} can be split in 7 ways into two subsets: {1} and {2,3,4};
{2} and {1,3,4}; {3} and {1,2,4}; {4} and {1,2,3}; {1.2} and {3,4}; {1,3} and {2,4}; {1,4}
and {2,3}. When the number of categories is large the number of splits becomes very large.
XLMiner supports only binary categorical variables (coded as numbers). If you have a categorical independent variable that takes more than two values, you will need to replace the

92

8. Discriminant Analysis
variable with several dummy variables each of which is binary in a manner that is identical
to the use of dummy variables in regression.
2. Besides CHAID, another popular tree classication method is ID3 (and its successor C4.5).
This method was developed by Quinlan, a leading researcher in machine learning, and is
popular with developers of classiers who come from a background in machine learning.

Chapter 8

Discriminant Analysis
Introduction
Discriminant analysis uses continuous variable measurements on dierent groups of items to
highlight aspects that distinguish the groups and to use these measurements to classify new items.
Common uses of the method have been in classifying organisms into species and sub-species, classifying applications for loans, credit cards and insurance into low risk and high risk categories,
classifying customers of new products into early adopters, early majority, late majority and laggards, classifying bonds into bond rating categories, classifying skulls of human fossils, as well as
in research studies involving disputed authorship, decision on college admission, medical studies
involving alcoholics and non-alcoholics, and methods to identify human ngerprints.
It will be easier to understand the discriminant analysis method if we rst consider an example.

8.1

Example 1 - Riding Mowers

A riding-mower manufacturer would like to nd a way of classifying families in a city into those
likely to purchase a riding mower and those not likely to buy one. A pilot random sample of 12
owners and 12 non-owners in the city is undertaken. The data are shown in Table I and plotted in
Figure 1 below:

93

94

8. Discriminant Analysis
Table 1
Observation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24

Income
($ 000s)
60
85.5
64.8
61.5
87
110.1
108
82.8
69
93
51
81
75
52.8
64.8
43.2
84
49.2
59.4
66
47.4
33
51
63

Lot Size
(000s sq. ft.)
18.4
16.8
21.6
20.8
23.6
19.2
17.6
22.4
20
20.8
22
20
19.6
20.8
17.2
20.4
17.6
17.6
16
18.4
16.4
18.8
14
14.8

Owners=1,
Non-owners=2
1
1
1
1
1
1
1
1
1
1
1
1
2
2
2
2
2
2
2
2
2
2
2
2

8.2 Fishers Linear Classication Functions

95

Figure 1: Riding mower owners and non-owners, separated by a line placed by hand

We can think of a linear classication rule as a line that separates the x1 , x2 region into
two parts where most of the owners are in one half-plane and most of the non-owners are in the
complementary half-plane. A good classication rule would separate out the data so that the fewest
points are misclassied: the line shown in Figure 1 seems to do a good job in discriminating between
the two groups as it makes 4 misclassications out of 24 points. Can we do better?

8.2

Fishers Linear Classication Functions

Linear classication functions that were suggested by the noted statistician R. A. Fisher can help
us do better. Well learn more about these functions below. First, lets review the output of
XLMiners Discriminant Analysis routine. Figure 2 shows the results of invoking the discriminant
routine where we select the option for displaying the classication functions.

96

8. Discriminant Analysis
.

Figure 2: Discriminant Analysis output

We note that it is possible to have a misclassication rate (3 in 24) that is lower than that our
initial lines (4 in 24) by using the classication functions specied in the output. Heres how these
classication functions work: A family is classied into Class 1 of owners if Function 1 is higher
than Function 2, and into Class 2 if the reverse is the case. These functions are specied in a way
that can be easily generalized to more than two classes. The values given for the functions are
simply the weights to be associated with each variable in the linear function in a manner analogous
to multiple linear regression. Let us compute these functions for the observations in our data set.
The results are shown in Figure 3.

8.2 Fishers Linear Classication Functions

97
Figure 3

XLMiner output, discriminant analysis classication for Mower data

Notice that observations 1, 13 and 17 are misclassied (summarized in the confusion matrix
and error report in Figure 1).
Let us describe the reasoning behind Fishers linear classication rules. Figure 4 depicts the
logic.

98

8. Discriminant Analysis
Figure 4

In the X1 , X2 space, consider various directions such as directions D1 and D2 shown in Figure 4.
One way to identify a good linear discriminant function is to choose amongst all possible directions
the one that has the property that when we project (drop a perpendicular line from) the means
of the two groups (owners and non-owners) onto a line in the chosen direction the projections of
the group means (feet of the perpendiculars, e.g. P1 and P2 in direction D1) are separated by the
maximum possible distance. The means of the two groups are:

Mean1 (owners)
Mean2 (non-owners)

8.3

Income
79.5
57.4

Lot Size
20.3
17.6

Measuring Distance

We still need to decide how to measure the distance. We could simply use Euclidean distance.
This has two drawbacks. First, the distance would depend on the units we choose to measure
the variables. We will get dierent answers if we decided to measure lot size in say, square yards
instead of thousands of square feet. Second, we would not be taking any account of the correlation
structure. This is often a very important consideration, especially when we are using many variables
to separate groups. In this case often there will be variables which, by themselves, are useful
discriminators between groups but, in the presence of other variables, are practically redundant as
they capture the same eects as the other variables.

8.4 Classication Error

99

Fishers method gets over these objections by using a measure of distance that is a generalization
of Euclidean distance known as Mahalanobis distance (see appendix to this chapter for details).
If we had a list of prospective customers with data on income and lot size, we could use the
classication functions to identify the sublist of families that are classied as group 1: predicted
purchasers of the product.

8.4

Classication Error

What is the accuracy we should expect from our classication functions? We have an error rate of
12.5% in our example. However, this is a biased estimate - it is overly optimistic. This is because
we have used the same data for tting the classication parameters and for estimating the error.
Common sense tells us that if we used these same classication functions with a fresh data set, we
would get a higher error rate. In data mining applications we would randomly partition our data
into training and validation subsets. We would use the training part to estimate the classication
functions and hold out the validation part to get a more reliable, unbiased estimate of classication
error.
So far we have assumed that our objective is to minimize the classication error. the method
as presented to its point also assumes that the chances of encountering an item from either group
requiring classication is the same. If the probability of encountering an item for classication in
the future is not equal for both groups we should modify our functions to reduce our expected (long
run average) error rate. Also, we may not want to minimize the misclassication rate in certain
situations. If the cost of mistakenly classifying a group 1 item as group 2 is very dierent from the
cost of classifying a group 2 item as a group 1 item we may want to minimize the expected cost
of misclassication rather than the simple error rate (which does not take cognizance of unequal
misclassication costs.) It is simple to incorporate these situations into our framework. All we need
to provide are estimates of the ratio of the chances of encountering an item in class 1 as compared
to class 2 in future classication and the ratio of the costs of making the two kinds of classication
error. These ratios will alter the constant terms in the linear classication functions to minimize
the expected cost of misclassication. XL-Miner has choices in its discriminant analysis dialog box
to specify the rst of these ratios.
The above analysis for two classes is readily extended to more than two classes. Example 2
illustrates this setting.

8.5

Example 2 - Classication of Flowers

This is a classic example used by R. A. Fisher to illustrate his method for computing classication
functions. The data consist of four length measurements on dierent varieties of iris owers. Fifty

100

8. Discriminant Analysis

dierent owers were measured for each species of iris. The full data set is available as the XLMiner
data set Iris.xls. A sample of the data are given in Table 3 below:
Table 3
OBS
#
1
2
3
4
5
6
7
8
9
10

51
52
53
54
55
56
57
58
59
60

101
102
103
104
105
106
107
108
109
110

SPECIES
Iris-setosa
Iris-setosa
Iris-setosa
Iris-setosa
Iris-setosa
Iris-setosa
Iris-setosa
Iris-setosa
Iris-setosa
Iris-setosa

Iris-versicolor
Iris-versicolor
Iris-versicolor
Iris-versicolor
Iris-versicolor
Iris-versicolor
Iris-versicolor
Iris-versicolor
Iris-versicolor
Iris-versicolor

Iris-virginica
Iris-virginica
Iris-virginica
Iris-virginica
Iris-virginica
Iris-virginica
Iris-virginica
Iris-virginica
Iris-virginica
Iris-virginica

CLASS
CODE
1
1
1
1
1
1
1
1
1
1

2
2
2
2
2
2
2
2
2
2

3
3
3
3
3
3
3
3
3
3

SEPLEN

SEPW

PETLEN

PETW

5.1
4.9
4.7
4.6
5
5.4
4.6
5
4.4
4.9

7
6.4
6.9
5.5
6.5
5.7
6.3
4.9
6.6
5.2

6.3
5.8
7.1
6.3
6.5
7.6
4.9
7.3
6.7
7.2

3.5
3
3.2
3.1
3.6
3.9
3.4
3.4
2.9
3.1

3.2
3.2
3.1
2.3
2.8
2.8
3.3
2.4
2.9
2.7

3.3
2.7
3
2.9
3
3
2.5
2.9
2.5
3.6

1.4
1.4
1.3
1.5
1.4
1.7
1.4
1.5
1.4
1.5

4.7
4.5
4.9
4
4.6
4.5
4.7
3.3
4.6
3.9

6
5.1
5.9
5.6
5.8
6.6
4.5
6.3
5.8
6.1

0.2
0.2
0.2
0.2
0.2
0.4
0.3
0.2
0.2
0.1

1.4
1.5
1.5
1.3
1.5
1.3
1.6
1
1.3
1.4

2.5
1.9
2.1
1.8
2.2
2.1
1.7
1.8
1.8
2.5

The results from applying the discriminant analysis procedure of XLMiner are shown in Figure
5. Again, XLMiner refers by default to the training data when no partitions have been specied.

8.5 Example 2 - Classication of Flowers

101

Figure 5 : XL Miner output for discriminant analysis for Iris data

For illustration the computations of the classication function values for observations 40 to 55
and 125 to 135 are shown in Table 4.
For observation # 40, the classication function for class 1 had a value of 85.85. The functions
for classes 2 and 3 were 40.15 and 5.44. The maximum is 85.85, therefore observation # 40 is
classied as class 1 (setosa).
When the classication variables follow a multivariate Normal distribution with variance matrices that dier substantially among dierent groups, the linear classication rule is no longer
optimal. In that case, the optimal classication function is quadratic in the classication variables.
However, in practice this has not been found to be useful except when the dierence in the variance
matrices is large and the number of observations available for training and testing is large. The
reason is that the quadratic model requires many more parameters that are all subject to error to
be estimated. If there are c classes and p variables, the number of parameters to be estimated for
the dierent variance matrices is cp(p+1)/2. This is an example of the importance of regularization
in practice.

102

8. Discriminant Analysis
Table 4

OBS
#
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
125
126
127
128
129
130
131
132
133
134
135

IRIS
Species
setosa
setosa
setosa
setosa
setosa
setosa
etosa
setos
setosa
setosa
setosa
versicolor
versicolor
versicolor
versicolor
versicolor
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica

Class
Code
1
1
1
1
1
1
1
1
1
1
1
2
2
2
2
2
3
3
3
3
3
3
3
3
3
3
3

SEP
LEN
5.1
5
4.5
4.4
5
5.1
4.8
5.1
4.6
4.3
5
7
6.4
6.9
5.5
6.5
6.7
7.2
6.2
6.1
6.4
7.2
7.4
7.9
6.4
6.3
6.

SEP
W
3.4
3.5
2.3
3.2
3.5
3.8
3
3.8
3.2
3.7
3.3
3.2
3.2
3.1
2.3
2.8
3.3
3.2
2.8
3
2.8
3
2.8
3.8
2.8
2.8
2.6

PET
LEN
1.5
1.3
1.3
1.3
1.6
1.9
1.4
1.6
1.4
1.5
1.4
4.7
4.5
4.9
4
4.6
5.7
6
4.8
4.9
5.6
5.8
6.1
6.4
5.6
5.1
5.6

PET
W
0.2
0.3
0.3
0.2
0.6
0.4
0.3
0.2
0.2
0.2
0.2
1.4
1.5
1.5
1.3
1.5
2.1
1.8
1.8
1.8
2.1
1.6
1.9
2
2.2
1.5
1.4

Fn 1
85.85
87.29
47.28
67.95
77.03
85.19
69.20
93.65
71.02
97.61
82.76
52.24
39.60
42.49
8.97
30.90
18.74
28.66
15.21
15.96
1.53
30.79
20.50
49.14
-0.27
18.10
2.40

Fn 2
40.15
38.85
2.65
26.71
42.32
46.30
32.76
43.46
30.38
45.37
37.35
93.06
83.21
92.48
58.92
82.53
98.74
105.59
80.76
81.1
90.02
101.88
107.11
124.13
90.65
82.03
79.51

Fn 3
-5.44
-6.80
-17.18
-17.43
3.36
5.28
-9.74
-2.78
-13.65
-1.91
-8.31
83.91
75.96
86.99
50.99
77.11
108.07
111.50
82.25
82.97
101.32
104.04
116.01
131.65
103.43
81.03
82.14

Max
85.85
87.29
47.28
67.95
77.03
85.19
69.20
93.65
71.02
97.61
82.76
93.06
83.21
92.48
58.92
82.53
108.07
111.50
82.25
82.97
101.32
104.04
116.01
131.65
103.43
82.03
82.14

Pred.
Class
1
1
1
1
1
1
1
1
1
1
1
2
2
2
2
2
3
3
3
3
3
3
3
3
3
2
3

8.6 Appendix - Mahalanobis Distance

8.6

103

Appendix - Mahalanobis Distance

Mahalanobis distance is dened with respect to a positive denite matrix . The squared Mahalanobis distance between two pdimensional (column) vectors y1 and y2 is (y1 y2 ) 1 (y1 y2 )
where is a symmetric positive denite square matrix with dimension p. Notice that if is the
identity matrix the Mahalanobis distance is the same as Euclidean distance. In linear discriminant
analysis we use the pooled sample variance matrix of the dierent groups. If X1 and X2 are the
n1 p and n2 p matrices of observations for groups 1 and 2, and the respective sample variance
matrices are S1 and S2 , the pooled matrix S is equal to {(n1 1)S1 + (n2 1)S2 }/(n1 + n2 2). The
matrix S denes the optimum direction (actually the eigenvector associated with its largest eigenvalue) that we referred to when we discussed the logic behind Figure 4. This choice Mahalanobis
distance can also be shown to be optimal1 in the sense of minimizing the expected misclassication
error when the variable values of the populations in the two groups (from which we have drawn
our samples) follow a multivariate normal distribution with a common covariance matrix. If we
have large samples approximate normality is generally adequate for this procedure to be close to
optimal.

This is true asymptotically, i.e. for large training samples. Large training samples are required for S, the pooled
sample variance matrix, to be a good approximation for the population variance matrix.

104

9. Other Supervised Learning Techniques

Chapter 9

Other Supervised Learning


Techniques
9.1

K-Nearest neighbor

The idea behind the k-Nearest Neighbor algorithm is to build a classication (and prediction)
method using no assumptions about the form of the function, y = f (x1 , x2 , , xp ) relating the
dependent variable, y, to the independent variables x1 , x2 , , xp . This is a non-parametric method
because it does not involve estimation of parameters in an assumed function form such as the linear
form that we encountered in linear regression.
We have training data in which each observation has a y value which is just the class to which
the observation belongs. For example, if we have two classes, y is a binary variable. The idea
in k-Nearest Neighbor methods is to dynamically identify k observations in the training data set
that are similar to a new observation, say (u1 , u2 , , up ) that we wish to classify. We then use
similar (neighboring) observations to classify the observation into a class. Specically, we look for
observations in our training data that are similar or near to the observation to be classied,
based on the values of the independent variables. Then, based on the classes of those proximate
observations, we assign a class to the observation we want to classify. First, we must nd a distance
or dissimilarity measure that we can compute between observations based on the independent variables. For the moment we will continue ourselves to the most popular measure of distance Euclidean
 distance. The Euclidean distance between the points (X1 , X2 , , Xp ) and (u1 , u2 , , up ) is
(x1 u1 )2 ) + (x2 u2 )2 + + (xp up )2 . (We will examine other ways to dene distance between points in the space of predictor variables when we discuss clustering methods). Then, we
need a role to assign a class to the observation to be classied, based on the classes of the neighbors.
The simplest case is k = 1 where we nd the observation that is closest (the nearest neighbor)
and set v = y where y is the class of this single nearest neighbor. It is a remarkable fact that
this simple, intuitive idea of using a single nearest neighbor to classify observations can be very
powerful when we have a large number of observations in our training set. It is possible to prove
that the misclassication error of the 1-NN scheme has a misclassication probability that is no
worse than twice that of the situation where we know exactly the probability density functions for
105

106

9. Other Supervised Learning Techniques

each class. In other words if we have a large amount of data and used an arbitrarily sophisticated
classication rule, we would be able to reduce the misclassication error at best to half that of the
simple 1-NN rule.

9.1.1

The K-NN Procedure

For K-NN we extend the idea of 1-NN as follows. Find the nearest k neighbors and then use
a majority decision rule to classify a new observation. The advantage is that higher values of k
provide smoothing that reduces the risk of overtting due to noise in the training data. In typical
applications k is in units or tens rather than in hundreds or thousands. Notice that if k = n,
the number of observations in the training data set, we will simply assign all observations to the
same class as the class that has the majority in the training data, irrespective of the values of
(u1 , u2 , , up ). This is clearly a case of oversmoothing unless there is no information at all in the
independent variables about the dependent variable.

9.1.2

Example 1 - Riding Mowers

A riding-mower manufacturer would like to nd a way of classifying families into those likely to
purchase a riding mower and those not likely to buy one. A pilot random sample of 12 owners and
12 non-owners is undertaken. The data are shown in Table I and Figure 1 below:
Table 1
Observation
1
2
3
4
5
6
7
8
9
10

Income
($ 000s)
60.0
85.5
64.8
61.5
87.0
110.1
108.0
82.8
69.0
93.0

Lot Size
(000s sq. ft.)
18.4
16.8
21.6
20.8
23.6
19.2
17.6
22.4
20.0
20.8

Owners=1,
Non-owners=2
1
1
1
1
1
1
1
1
1
1

9.1 K-Nearest Neighbors

107

Observation
11
12
13
14
15
16
17
18
19
20
21
22
23
24

Income
($ 000s)
51.0
81.0
75.0
52.8
64.8
43.2
84.0
49.2
59.4
66.0
47.4
33.0
51.0
63.0

Lot Size
(000s sq. ft.)
22.0
20.0
19.6
20.8
17.2
20.4
17.6
17.6
16.0
18.4
16.4
18.8
14.0
14.8

Owners=1,
Non-owners=2
1
1
2
2
2
2
2
2
2
2
2
2
2
2

How do we choose k? In data mining we use the training data to classify the cases in the
validation data, then compute error rates for various choices of k. For our example we have
randomly divided the data into a training set with 18 cases and a validation set of 6 cases. Of
course, in a real data mining situation we would have sets of much larger sizes. The validation set
consists of observations 6, 7, 12, 14, 19, 20 of Table 1. The remaining 18 observations constitute the
training data. Figure 1 displays the observations in both training and validation data sets. Notice
that if we choose k = 1 we will classify in a way that is very sensitive to the local characteristics of
our data. On the other hand if we choose a large value of k we average over a large number of data
points and average out the variability due to the noise associated with individual data points. If
we choose k = 18 we would simply predict the most frequent class in the data set in all cases. This
is a very stable prediction but it completely ignores the information in the independent variables.

108

9. Other Supervised Learning Techniques


Figure 1

3
2

Owners (Training)
Non-owners (Training)
Owners (Validation)
Non-owners (Validation)

Table 2 shows the misclassication error rate for observations in the validation data for dierent
choices of k.
Table 2: Classication Error in the Riding Mower Data

Misclassication Error

k
%

1
33

3
33

5
33

7
33

9
33

11
17

13
17

15
17

18
50

We would choose k = 11 (or possibly 13) in this case. This choice optimally trades o the
variability associated with a low value of k against the oversmoothing associated with a high value
of k. It is worth remarking that a useful way to think of k is through the concept of eective
number of parameters. The eective number of parameters corresponding to k is N/k where n
is the number of observations in the training data set. Thus a choice of k = 11 has an eective
number of parameters of about 2 and is roughly similar in the extent of smoothing to a linear
regression t with two coecients.

9.1.3

K-Nearest Neighbor Prediction

The idea of K-NN can be readily extended to predicting a continuous value (as is our aim with
multiple linear regression models). Instead of taking a majority vote of the neighbors to determine
class, we use as our predicted value, the average value of the dependent variable for the k nearest
neighbors. Often this average is a weighted average with the weight decreasing with increasing
distance from the point at which the prediction is required.

9.1 K-Nearest Neighbors

9.1.4

109

Shortcomings of k-NN algorithms

There are two diculties with the practical exploitation of the power of the k-NN approach.
First, while there is no time required to estimate parameters from the training data (as would
be the case for parametric models such as regression) the time to nd the nearest neighbors in a
large training set can be prohibitive. A number of ideas have been implemented to overcome this
diculty. The main ideas are:
(1) Reduce the time taken to compute distances by working in a reduced dimension using dimension
reduction techniques such as principal components;
(2) Use sophisticated data structures such as search trees to speed up identication of the nearest
neighbor. This approach often settles for an almost nearest neighbor to improve speed.
(3) Edit the training data to remove redundant or almost redundant points in the training set
to speed up the search for the nearest neighbor. An example is to remove observations in
the training data set that have no eect on the classication because they are surrounded by
observations that all belong to the same class.
Second, the number of observations required in the training data set to qualify as large increases
exponentially with the number of dimensions p. This is because the expected distance to the nearest
neighbor goes up dramatically with p unless the size of the training data set increases exponentially
with p: An illustration of this phenomenon, known as the curse of dimensionality, is the fact
that if the independent variables in the training data are distributed uniformly in a hypercube of
dimension p, the probability that a point is within a distance of 0.5 units from the center is
p/2
.
2p1 p(p/2)
The table below is designed to show how rapidly this drops to near zero for dierent combinations
of p and n, the size of the training data set. It shows the expected number of points within 0.5
units of the center of the hypercube.

110

9. Other Supervised Learning Techniques


Table 3 : Number of points within 0.5 units of hypercube center.
P

n
10,000
100,000
1,000,000
10,000,000

2
7854
78540
785398
7853982

3
5236
52360
523600
5236000

4
3084
30843
308425
3084251

5
1645
16449
164493
1644934

10
25
249
2490
24904

20
0.0002
0.0025
0.0246
0.2461

30
21010
2 109
2 108
2 107

40
3 1017
3 1016
3 1015
3 1014

The curse of dimensionality is a fundamental issue pertinent to all classication, prediction


and clustering techniques. This is why we often seek to reduce the dimensionality of the space
of predictor variables through methods such as selecting subsets of the predictor variables for
our model or by combining them using methods such as principal components, singular value
decomposition and factor analysis. In the articial intelligence literature dimension reduction is
often referred to as factor selection or feature extraction.

9.2
9.2.1

Naive Bayes
Bayes Theorem

In probability theory, Bayes Theorem provides the probability of a prior event, given that a certain
subsequent event has occurred. For example, the probability that a subject is HIV-positive given
that he or she tested positive on a screening test for HIV.1 In the context of classication, Bayes
theorem provides a formula for updating the probability that a given object belongs to a class,
given the objects attributes. Suppose that we have m classes, C1 , C2 ...Cm and we know that
the proportion of objects in these classes are P (C1 ), P (C2 )...P (Cm ). We have an object O with n
attributes with values X1 , X2 ...Xn . We do not know the class of the object and would like to classify
it on the basis of these attribute values. If we know the probability of occurrence of the attribute
values X1 , X2 ...Xn for each class, Bayes theorem gives us the following formula to compute the
probability that the object belongs to class Ci .
1

Here is a brief review of Bayes Theorem:


Consider, hypothetically, that 1% of the population is HIV positive (class 1 or C1 ), and 99% HIV negative (class
0 or C0 ). That is, in the absence of any other information, the probability that an individual is HIV positive P(C1 )
= 0.01.
Consider also that an HIV positive person has a 98% chance of testing positive (Tpos ) on a screening test: P
(Tpos |C1 ) = 0.98 (the probability of Tpos given C1 is 0.98), and that an HIV negative person has a 5% chance of
triggering a false positive on the test: P (Tpos |C0 ) = 0.05.
What is the probability that a person is HIV positive, if they test positive on the test?
There are two sources of positive test results - HIV negatives (C0 ) triggering false positives, and HIV positives
(C1 ) with true positives. The proportion of HIV positives amongst the positive test results is the probability that a
person is HIV positive, if they test positive on the test. In notation:
P (C1 |Tpos ) =

P (Tpos |C1 ) P (C1 )


0.98 0.01
=
= 0.165
P (Tpos |C1 ) P (C1 ) + P (Tpos |C0 ) P (C0 )
0.98 0.01 + 0.05 0.99

In this hypothetical example, the false positives swamp the true positives, yielding this surprisingly low probability.

9.2 Naive Bayes

P (Ci |X1 , X2 , , Xn ) =

111

P (X1 , X2 , Xn |Ci )P (Ci )


P (X1 , X2 , , Xn |C1 )P (C1 ) + + P (X1 , X2 , , Xn |Cm )P (Cm )

This is known as the posterior probability to distinguish it from P (Ci ) the probability of an
object belonging to class Ci in the absence of any information about its attributes. For purposes of
classication we only need to know which class Ci has the highest probability. Notice that since the
denominator is the same for all classes we do not need to calculate it for purposes of classication.

9.2.2

The Problem with Bayes Theorem

The diculty with using this formula is that if the number of variables, n, is even modestly large,
say, 20, and the number of classes, m is 2, even if all variables are binary, we would need a large
data set with several million observations to get reasonable estimates for P (X1 , X2 , , Xn |Ci ), the
probability of observing an object with the attribute vector (X1 , X2 , , Xn ). In fact the vector
may not be present in our training set for all classes as required by the formula; it may even be
missing in our entire data set! For example, in predicting voting, even a sizeable data set may not
contain many individuals who are male hispanics with high income from the midwest who voted in
the last election, did not vote in the prior election, have 4 children, are diverced, etc.

9.2.3

Simplify - assume independence

If it is reasonable to assume that the attributes are all mutually independent within each class,
we can considerably simplify the expression and make it useful in practice. Independence of the
attributes within each class gives us the following simplication which follows from the product rule
for probabilities of independent events (the probability of multiple events occurring is the product
of their individual probabilities):
P (X1 , X2 , , Xm |Ci ) = P (X1 |Ci )P (X2 |Ci )P (X3 |Ci ) P (Xm |Ci )
The terms on the right can be estimated simply from frequency counts with the estimate of
P (Xj |Ci ) being equal to the number of occurrences of the value Xj in the training data in class Ci
divided by the total number of observations in that class. We would like to have each possible value
for each attribute to be available in the training data. If this is not true for a particular attribute
value for a class, the estimated probability will be zero for the class for objects with that attribute
value. Often this is reasonable so we can relax our requirement of having every possible value for
every attribute being present in the training data. In any case the observations required will be far
fewer than in the formula without making the independence assumption. This is a very simplistic
assumption since the attributes are very likely to be correlated. Surprisingly this Naive Bayes
approach, as it is called, does work well in practice where there are many variables and they are
binary or categorical with a few discrete levels.

112

9.2.4

9. Other Supervised Learning Techniques

Example 1 - Saris

Saris, the primary product of the hand craft textile industry in India, are colorful owing garments
worn by women and made with a single piece of cloth six yards long and one yard wide. A sari has
many characteristics; important ones include the shade, color and design of the sari body, border
and palav (the most prominent and decorative portion of the sari), the size of the border, type of
fabric, and whether the sari is one- or two-sided. The object is to predict the classication of a
sari - sale =1 or no sale =0, on the basis of these predictor variables. The sample data set has 400
cases, of which 242 (60.5%) are sales.
An excerpt from the data follows:

These data were partitioned into training and validation sets, each with 200 cases. XLMiners
Naive Bayes classier was trained on the training set and scored to the validation data. An excerpt
from the scores on the validation set is shown below:

113
.

Row ID is the row identier from the main data set. Predicted class is the class predicted for
this row by XLMiner, applying Bayes classier to the predictor variable values for that row. As we
discussed, it answers the question, Given that SILKWTCat = 4, ZARIWTCat = 3, BODYCOL =
17, etc., what is the most likely class? Actual class is the actual class for that row, and Prob.
for 1 is the calculated probability that this case will be a 1, based on the application of Bayes
Rule. The rest of the columns are the values for the rst few variables for these cases.

114

10. Anity Analysis - Association Rules

Chapter 10

Anity Analysis - Association Rules


Put simply, anity analysis is the study of what goes with what. For example, a medical
researcher might be interested in learning what symptoms go with what conrmed diagnoses.
These methods are also called market basket analysis, since they originated with the study of
customer transactions databases in order to determine correlations between purchases.

10.1

Discovering Association Rules in Transaction Databases

The availability of detailed information on customer transactions has led to the development of
techniques that automatically look for associations between items that are stored in the database.
An example is data collected using bar-code scanners in supermarkets. Such market basket databases consist of a large number of transaction records. Each record lists all items bought by a
customer on a single purchase transaction. Managers would be interested to know if certain groups
of items are consistently purchased together. They could use this data for store layouts to place
items optimally with respect to each other, they could use such information for cross-selling, for
promotions, for catalog design and to identify customer segments based on buying patterns. Association rules provide information of this type in the form of if-then statements. These rules are
computed from the data and, unlike the if-then rules of logic, association rules are probabilistic in
nature.

10.2

Support and Condence

In addition to the antecedent (the if part) and the consequent (the then part) an association
rule has two numbers that express the degree of uncertainty about the rule. In association analysis
the antecedent and consequent are sets of items (called item sets) that are disjoint (do not have
any items in common).
The rst number is called the support for the rule. The support is simply the number of
transactions that include all items in the antecedent and consequent parts of the rule. (The support
is sometimes expressed as a percentage of the total number of records in the database.)
115

116

10. Anity Analysis - Association Rules

The other number is known as the condence of the rule. Condence is the ratio of the number
of transactions that include all items in the consequent as well as the antecedent (namely, the
support) to the number of transactions that include all items in the antecedent. For example if a
supermarket database has 100,000 point-of-sale transactions, out of which 2,000 include both items
A and B and 800 of these include item C, the association rule If A and B are purchased then C is
purchased on the same trip has a support of 800 transactions (alternatively 0.8% = 800/100,000)
and a condence of 40% (= 800/2,000).
Note : This concept is dierent from and unrelated to the ideas of condence intervals and condence levels used in statistical inference.
One way to think of support is that it is the probability that a randomly selected transaction
from the database will contain all items in the antecedent and the consequent, whereas the condence is the conditional probability that a randomly selected transaction will include all the items
in the consequent given that the transaction includes all the items in the antecedent.

10.3

Example 1 - Electronics Sales

The manager of the All Electronics retail store would like to know what items sell together. He
has a database of transactions as shown below:
Transaction ID
1
2
3
4
5
6
7
8
9

Item Codes
1 2 5
2 4
2 3
1 2 4
1 3
2 3
1 3
1 2 3 5
1 2 3

There are 9 transactions. Each transaction is a record of the items bought together in that
transaction. Transaction 1 is a point-of-sale purchase of items 1, 2, and 5. Transaction 2 is a joint
purchase of items 2 and 4, etc.
Suppose that we want association rules between items for this database that have a support
count of at least 2 (equivalent to a percentage support of 2/9=22%). By enumeration we can see
that only the following item sets have a count of at least 2:
{1} with support count of 6;
{2} with support count of 7;
{3} with support count of 6;
{4} with support count of 2;
{5} with support count of 2;
{1, 2} with support count of 4;

10.4 The Apriori Algorithm

117

{1, 3} with support count of 4;


{1, 5} with support count of 2;
{2, 3} with support count of 4;
{2, 4} with support count of 2;
{2, 5} with support count of 2;
{1, 2, 3} with support count of 2;
{1, 2, 5} with support count of 2.
Notice that once we have created a list of all item sets that have the required support, we can
deduce the rules that meet the desired condence ratio by examining all subsets of each item set in
the list. Since any subset of a set must occur at least as frequently as the set, each subset will also
be in the list. It is then straightforward to compute the condence as the ratio of the support for
the item set to the support for each subset of the item set. We retain the corresponding association
rule only if it exceeds the desired cut-o value for condence. For example, from the item set
{1,2,5} we get the following association rules:
{1, 2} {5} with condence = support count of {1, 2, 5} divided by support count of {1, 2} =
2/4 = 50%;
{1, 5} {2} with condence = support count of {1, 2, 5} divided by support count of {1, 5}
= 2/2 = 100%;
{2, 5} {1} with condence = support count of {1, 2, 5} divided by support count of {2, 5}
= 2/2 = 100%;
{1} {2, 5} with condence = support count of {1, 2, 5} divided by support count of {1} =
2/6 = 33%;
{2} {1, 5} with condence = support count of {1, 2, 5} divided by support count of {2} =
2/7 = 29%;
{5} {1, 2} with condence = support count of {1, 2, 5} divided by support count of {5} =
2/2 = 100%.
If the desired condence cut-o was 70%, we would report only the second, third, and last rules.
We can see from the above that the problem of generating all association rules that meet
stipulated support and condence requirements can be decomposed into two stages. First we nd
all item sets with the requisite support (these are called frequent or large item sets); and then we
generate, from each item set so identied, association rules that meet the condence requirement.
For most association analysis data, the computational challenge is the rst stage.

10.4

The Apriori Algorithm

Although several algorithms have been proposed for generating association rules, the classic algorithm is the Apriori algorithm of Agrawal and Srikant (1993). The key idea of the algorithm is to
begin by generating frequent item sets with just one item (1-item sets) and to recursively generate
frequent item sets with 2 items, then frequent 3-item sets and so on until we have generated frequent item sets of all sizes. Without loss of generality we will denote items by unique, consecutive
(positive) integers and that the items in each item set are in increasing order of this item number.

118

10. Anity Analysis - Association Rules

The example above illustrates this notation. When we refer to an item in a computation we actually
mean this item number.
It is easy to generate frequent 1-item sets. All we need to do is to count, for each item, how
many transactions in the database include the item. These transaction counts are the supports for
the 1-item sets. We drop 1-item sets that have support below the desired cut-o value to create a
list of the frequent 1-item sets.
The general procedure to obtain k-item sets from (k 1)-item sets for k = 2, 3, , is as follows.
Create a candidate list of k-item sets by performing a join operation on pairs of (k 1)-item sets
in the list. A pair is combined only if the rst (k 2) items are the same in both item sets. (When
k = 2, this simply means that all possible pairs are to be combined.) If this condition is met the
join of pair is a k-item set that contains the common rst (k 2) items and the two items that
are not in common, one from each member of the pair. All frequent k-item sets must be in this
candidate list since every subset of size (k 1) of a frequent k-item set must be a frequent (k 1)
item set. However, some k-item sets in the candidate list may not be frequent k-item sets. We
need to delete these to create the list of frequent k-item sets. To identify the k-item sets that are
not frequent we examine all subsets of size (k 1) of each candidate k-item set. Notice that we
need examine only (k 1)-item sets that contain the last two items of the candidate k-item set
(Why?). If any one of these subsets of size (k 1) is not present in the frequent (k 1) item set
list, we know that the candidate k-item set cannot be a frequent item set. We delete such k-item
sets from the candidate list. Proceeding in this manner with every item set in the candidate list
we are assured that at the end of our scan the k-item set candidate list will have been pruned to
become the list of frequent k-item sets. We repeat the procedure recursively by incrementing k.
We stop only when the candidate list is empty.
A critical aspect for eciency in this algorithm is the data structure of the candidate and
frequent item set lists. Hash trees were used in the original version but there have been several
proposals to improve on this structure.
There are also other algorithms that can be faster than the Apriori algorithm in practice.
What about condence in the non-technical sense? How sure can we be that the rules we
develop are meaningful? Considering the matter from a statistical perspective, we can ask Are we
nding associations that are really just chance occurrences?

10.5

Example 2 - Randomly-generated Data

Let us examine the output from an application of this algorithm to a small randomly generated
database of 50 records shown in Example 2.

10.5 Example 2 - Randomly-generated Data


Tr#
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36

119
Items
8
3
8
3
9
1
6
3
8
8
1
1
5
6
3
1
6
8
8
9
2
4
4
8
6
1
5
4
9
8
1
3
7
7
3
1

9
8
9
5

7
4
7
7
7
4
7

9
5
9
8
9
9
8

5
6
9
9
8
6
8
8

6
9

5
6
9
8
4
4

8
9

8
9

9
6
8

120

10. Anity Analysis - Association Rules


Tr#
37
38
39
40
41
42
43
44
45
46
47
48
49
50

Association Rules Output


Input
Data:
$A$5:$E$54
Min.
Support:
2
Min.
Conf. %:
70
Rule #
1
2
3
4
5
6
7
8
9

Confidence
%
80
100
100
100
100
100
100
100
100

Antecedent
(a)
2
5, 7
6, 7
1, 5
2, 7
3, 8
3, 4
3, 7
4, 5

4%

Consequent
(c)
9
9
8
8
9
4
8
9
9

4
8
4
2
2
1
5
1
8
2
4
9
9
6

Items
7 8
9
5 7 9
8 9
5 9
2 7 9
8
7 8
7
6

9
9

Support
(a)
5
3
3
2
2
2
2
2
2

Support
(c)
27
27
29
29
27
11
29
27
27

Support
(a c)
4
3
3
2
2
2
2
2
2

Confidence
If pr(ca) = pr(c)
%
54
54
58
58
54
22
58
547
54

Lift
Ratio
(conf/prev.col. )
1.5
1.9
1.7
1.7
1.9
4.5
1.7
1.9
1.9

A high value of condence suggests a strong association rule. However, this can be deceptive
because if the antecedent and/or the consequent have a high support, we can have a high value
for condence even when they are independent! (If nearly all customers buy beer and nearly
all customers buy ice cream, the condence level will be high regardless of whether there is an
association between the items.)
A better measure to judge the strength of an association rule is to compare the condence of
the rule with the benchmark value where we assume that the occurrence of the consequent item set
in a transaction is independent of the occurance of the antecedent for each rule. We can compute
this benchmark from the frequency counts of the frequent item sets. The benchmark condence
value for a rule is the support for the consequent divided by the number of transactions in the
database. This enables us to compute the lift ratio of a rule. The lift ratio is the condence of
the rule divided by the condence assuming independence of consequent from antecedent. A lift
ratio greater than 1.0 suggests that there is some usefulness to the rule. The larger the lift ratio,
the greater is the strength of the association. (What does a ratio less than 1.00 mean? Can it be
useful to know such rules?)
In our example the lift ratios highlight Rule 6 as most interesting in that it suggests purchase
of item 4 is almost 5 times as likely when items 3 and 8 are purchased than if item 4 was not

10.6 Shortcomings

121

associated with the item set {3,8}.

10.6

Shortcomings

Association rules have not been as useful in practice as one would have hoped. One major shortcoming is that the support condence framework often generates too many rules. Another is that
often most of them are obvious. Insights such as the celebrated on Friday evenings diapers and
beers are bought together story are not as common as might be expected. There is need for skill
in association analysis and it seems likely, as some researchers have argued, that a more rigorous
statistical discipline to cope with rule proliferation would be benecial.

122

11. Data Reduction and Exploration

Chapter 11

Data Reduction and Exploration


11.1

Dimensionality Reduction - Principal Components Analysis

In data mining one often encounters situations where there are a large number of variables in the
database. In such situations it is very likely that subsets of variables are highly correlated with
each other. Including highly correlated variables, or variables that are unrelated to the outcome of
interest, in a classication or prediction model can lead to overtting, and accuracy and reliability
can suer. Large numbers of variables also pose computational problems for some models (aside
from questions of questions of correlation.) In model deployment, superuous variables can increase
costs due to collection and processing of these variables.
The dimensionality of a model is the number of independent or input variables used by the
model. One of the key steps in data mining, therefore, is nding ways to reduce dimensionality
without sacricing accuracy.
A useful procedure for this purpose is to analyze the principal components (illustrated below) of the input variables. It is especially valuable when we have subsets of measurements that
are measured on the same scale and are highly correlated. In that case it provides a few (often
as few as three) variables that are weighted combinations of the original variables that retain the
explanatory power of the full original set.

11.2

Example 1 - Head Measurements of First Adult Sons

The data below give 25 pairs of head measurements for rst adult sons in a sample (Elston and
Grizzle, 1962).
For this data the meansof the variables x1 and x2 are 185.7 and 151.7 and the covariance
95.29 52.87
.
matrix, S =
52.87 54.36

123

124

11. Data Reduction and Exploration


First Adult Son
Head Length (x1 ) Head Breadth (x2 )
191
155
195
149
181
148
183
153
176
144
208
157
189
150
197
159
188
152
192
150
179
158
183
147
174
150
190
159
188
151
163
137
195
155
186
153
181
145
175
140
192
154
174
143
176
139
197
167
190
163

11.3

The Principal Components

Figure 1 below shows the scatter plot of points (x1 , x2 ). The principal component directions are
shown by the axes z1 and z2 that are centered at the means of x1 and x2 . The line z1 is the direction
of the rst principal component of the data. It is the line that captures the most variation in the
data if we decide to reduce the dimensionality of the data from two to one. Amongst all possible
lines, it is the line for which, if we project the points in the data set orthogonally to get a set of 25
(one dimensional) values using the z1 co-ordinate, the variance of the z1 values will be maximum.
It is also the line that minimizes the sum of squared perpendicular distances from the line. (Show
why this follows from Pythagoras theorem. How is this line dierent from the regression line of x2
on x1 ?) The z2 axis is perpendicular to the z1 axis.
The directions of the axes is given by the eigenvectors of S. For our example the eigenvalues
are 131.5 and 18.14. The eigenvector corresponding to the larger eigenvalue is (0.825,0.565) and

11.3 The Principal Components

125

gives us the direction of the z1 axis. The eigenvector corresponding to the smaller eigenvalue is (0.565,.825) and this is the direction of the z2 axis.
The lengths of the major and minor axes of the ellipse that would enclose about 65% of the points
(provided the points had a bivariate Normal distribution) are the square roots of the eigenvalues.
This corresponds to the rule that about 65% of the data in a univariate Normal distribution lie
within one standard deviation of the mean. Similarly, doubling the axes lengths of the ellipse will
enclose 95% of the points and tripling them would enclose 99% of the points. In our example, the

length of the major axis is 131.5 = 11.47 and the lenght of the minor is 18.14 = 4.26. In Figure
1 the inner ellipse has these axes lengths while the outer ellipse has axes with twice these lengths.
Figure 1

The values of z1 and z2 for the observations are known as the principal component scores and
are shown below. The scores are computed as the inner products of the data points and the rst
and second eigenvectors (in order of decreasing eigenvalue).

126

11. Data Reduction and Exploration


Principal Component Scores
Observation No.
z1
z2
1
6.549
0.216
2
6.457
6.994
3
5.657
0.094
4
1.181
3.088
5
12.043 0.379
6
21.703
7.743
7
2.073
2.778
8
13.759
0.125
9
2.378
0.563
10
4.547
4.474
11
1.655
9.474
12
4.573 41.861
13
10.301
5.701
14
7.985
4.081
15
1.813
1.388
16
26.724
1.194
17
9.848
2.045
18
1.294
1.393
19
7.353
2.381
20
15.129 3.114
21
6.808
1.174
22
14.258 0.074
23
14.869 4.504
24
18.281
6.724
25
10.246
7.381

The means of z1 and z2 are zero. This follows from our choice of the origin for the (z1 , z2 )
coordinate system to be the means of x1 and x2 . The variances are more interesting. The variances
of z1 and z2 are 131.5 and 18.14 respectively. The rst principal component, z1 , accounts for 88%
of the total variance. Since it captures most of the variability in the data, it seems reasonable to
use one variable, the rst principal score, to represent the two variables in the original data.

11.4

Example 2 - Characteristics of Wine

The data in Table 2 gives measurements on 13 characteristics of 60 dierent wines from a region.
Let us see how principal component analysis would enable us to reduce the number of dimensions
in the data.

11.4 Example 2 - Characteristics of Wine

127
Table 2

Ash
Alc
Ash

Magnsium
alinity

Total

Alcohol

Malic
Acid

Flava
Phenols

14.23
13.2
13.16
14.37
13.24
14.83
13.86
14.1
14.12
13.75
14.75
14.38
13.63
14.3
13.83
14.19
12.37
12.33
12.64
13.67
12.37
12.17
12.37
13.11
12.37
13.34
12.21
12.29
13.86
13.49
12.99
11.96
11.66
13.03
11.84
12.33
12.86
12.88
12.81
12.7
12.51
12.6
12.25
12.53
13.49
12.84
12.93
13.36
13.52
13.62
12.25
13.16
13.88
12.87
13.32
13.08

1.71
1.78
2.36
1.95
2.59
1.64
1.35
2.16
1.48
1.73
1.73
1.87
1.81
1.92
1.57
1.59
0.94
1.1
1.36
1.25
1.13
1.45
1.21
1.01
1.17
0.94
1.19
1.61
1.51
1.66
1.67
1.09
1.88
0.9
2.89
0.99
1.35
2.99
2.31
3.55
1.24
2.46
4.72
5.51
3.59
2.96
2.81
2.56
3.17
4.95
3.88
3.57
5.04
4.61
3.24
3.9

2.43
2.14
2.67
2.5
2.87
2.17
2.27
2.3
2.32
2.41
2.39
2.38
2.7
2.72
2.62
2.48
1.36
2.28
2.02
1.92
2.16
2.53
2.56
1.7
1.92
2.36
1.75
2.21
2.67
2.24
2.6
2.3
1.92
1.71
2.23
1.95
2.32
2.4
2.4
2.36
2.25
2.2
2.54
2.64
2.19
2.61
2.7
2.35
2.72
2.35
2.2
2.15
2.23
2.48
2.38
2.36

15.6
11.2
18.6
16.8
21
14
16
18
16.8
16
11.4
12
17.2
20
20
16.5
10.6
16
16.8
18
19
19
18.1
15
19.6
17
16.8
20.4
25
24
30
21
16
16
18
14.8
18
20
24
21.5
17.5
18.5
21
25
19.5
24
21
20
23.5
20
18.5
21
20
21.5
21.5
21.5

127
100
101
113
118
97
98
105
95
89
91
102
112
120
115
108
88
101
100
94
87
104
98
78
78
110
151
103
86
87
139
101
97
86
112
136
122
104
98
106
85
94
89
96
88
101
96
89
97
92
112
102
80
86
92
113

2.8
2.65
2.8
3.85
2.8
2.8
2.98
2.95
2.2
2.6
3.1
3.3
2.85
2.8
2.95
3.3
1.98
2.05
2.02
2.1
3.5
1.89
2.42
2.98
2.11
2.53
1.85
1.1
2.95
1.88
3.3
3.38
1.61
1.95
1.72
1.9
1.51
1.3
1.15
1.7
2
1.62
1.38
1.79
1.62
2.32
1.54
1.4
1.55
2
1.38
1.5
0.98
1.7
1.93
1.41

ava
noids
Phenols
3.06
2.76
3.24
3.49
2.69
2.98
3.15
3.32
2.43
2.76
3.69
3.64
2.91
3.14
3.4
3.93
0.57
1.09
1.41
1.79
3.1
1.75
2.65
3.18
2
1.3
1.28
1.02
2.86
1.84
2.89
2.14
1.57
2.03
1.32
1.85
1.25
1.22
1.09
1.2
0.58
0.66
0.47
0.6
0.48
0.6
0.5
0.5
0.52
0.8
0.78
0.55
0.34
0.65
0.76
1.39

Non
Proan
noid
nins
0.28
0.26
0.3
0.24
0.39
0.29
0.22
0.22
0.26
0.29
0.43
0.29
0.3
0.33
0.4
0.32
0.28
0.63
0.53
0.32
0.19
0.45
0.37
0.26
0.27
0.55
0.14
0.37
0.21
0.27
0.21
0.13
0.34
0.24
0.43
0.35
0.21
0.24
0.27
0.17
0.6
0.63
0.53
0.63
0.58
0.53
0.53
0.37
0.5
0.47
0.29
0.43
0.4
0.47
0.45
0.34

Inten
thocya

Color
Hue
sity

0D315
Hue

2.29
1.28
2.81
2.18
1.82
1.98
1.85
2.38
1.57
1.81
2.81
2.96
1.46
1.97
1.72
1.86
0.42
0.41
0.62
0.73
1.87
1.03
2.08
2.28
1.04
0.42
2.5
1.46
1.87
1.03
1.96
1.65
1.15
1.46
0.95
2.76
0.94
0.83
0.83
0.84
1.25
0.94
0.8
1.1
0.88
0.81
0.75
0.64
0.55
1.02
1.14
1.3
0.68
0.86
1.25
1.14

5.64
4.38
5.68
7.8
4.32
5.2
7.22
5.75
5
5.6
5.4
7.5
7.3
6.2
6.6
8.7
1.95
3.27
5.75
3.8
4.45
2.95
4.6
5.3
4.68
3.17
2.85
3.05
3.38
3.74
3.35
3.21
3.8
4.6
2.65
3.4
4.1
5.4
5.7
5
5.45
7.1
3.85
5
5.7
4.92
4.6
5.6
4.35
4.4
8.21
4
4.9
7.65
8.42
9.4

1.04
1.05
1.03
0.86
1.04
1.08
1.01
1.25
1.17
1.15
1.25
1.2
1.28
1.07
1.13
1.23
1.05
1.25
0.98
1.23
1.22
1.45
1.19
1.12
1.12
1.02
1.28
0.91
1.36
0.98
1.31
0.99
1.23
1.19
0.96
1.06
0.76
0.74
0.66
0.78
0.75
0.73
0.75
0.82
0.81
0.89
0.77
0.7
0.89
0.91
0.65
0.6
0.58
0.54
0.55
0.57

OD280/
Proline

3.92
3.4
3.17
3.45
2.93
2.85
3.55
3.17
2.82
2.9
2.73
3
2.88
2.65
2.57
2.82
1.82
1.67
1.59
2.46
2.87
2.23
2.3
3.18
3.48
1.93
3.07
1.82
3.16
2.78
3.5
3.13
2.14
2.48
2.52
2.31
1.29
1.42
1.36
1.29
1.51
1.58
1.27
1.69
1.82
2.15
2.31
2.47
2.06
2.05
2
1.68
1.33
1.86
1.62
1.33

1065
1050
1185
1480
735
1045
1045
1510
1280
1320
1150
1547
1310
1280
1130
1680
520
680
450
630
420
355
678
502
510
750
718
870
410
472
985
886
428
392
500
750
630
530
560
600
650
695
720
515
580
590
600
780
520
550
855
830
415
625
650
550

The output from running a principal components analysis on these data is shown in Figure 2
below. The rows of Figure 2 are in the same order as the columns of Table 2. For example, row 1
for each principal component gives the weight for alcohol and row 13 gives the weight for proline.



Figure 2: Principal components for the Wine data

                          Principal Components
                           1        2        3        4        5
Alcohol                0.247    0.343   -0.245    0.166    0.044
Malic Acid            -0.255    0.402    0.020   -0.065    0.144
Ash                    0.045    0.488    0.417    0.291   -0.155
Ash Alcalinity        -0.187    0.213    0.588    0.156    0.392
Magnesium              0.138    0.023    0.485   -0.620   -0.440
Total Phenols          0.376    0.053    0.045    0.241    0.022
Flavanoids             0.409    0.044    0.003    0.091    0.085
Nonflavanoid Phenols  -0.243    0.168   -0.112    0.402   -0.669
Proanthocyanins        0.349    0.044    0.043   -0.184    0.040
Color Intensity        0.076    0.477   -0.338   -0.242    0.122
Hue                    0.295   -0.306    0.172    0.335   -0.181
OD280/OD315            0.364   -0.045    0.111    0.133    0.198
Proline                0.323    0.281   -0.111   -0.157   -0.250
Variance               5.444    2.327    1.370    0.972    0.808
% Variance           41.876%  17.900%  10.540%   7.477%   6.218%
Cumulative %         41.876%  59.776%  70.316%  77.793%  84.011%

The remaining components 6 through 13 account for 3.775%, 3.287%, 2.207%, 1.917%, 1.742%, 1.420%, 1.281% and 0.361% of the variance respectively, bringing the cumulative percentage to 100%. (Figure 2 also lists the corresponding weight vectors for components 6 through 13, with rows in the same order as above.)

Notice that the first five components account for more than 80% of the total variation associated with all 13 of the original variables. This suggests that we can capture most of the variability in the data with less than half the number of original dimensions in the data. A further advantage of the principal components compared to the original data is that they are uncorrelated (correlation coefficient = 0). If we construct regression models using these principal components as independent variables we will not encounter problems of multicollinearity.
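A sketch of how these quantities could be reproduced outside XLMiner (illustrative only; the file name Wine.csv and a layout of 13 numeric columns are assumptions, not taken from the text): it standardizes the measurements, extracts the principal components, and verifies that the resulting scores are uncorrelated.

```python
import numpy as np
import pandas as pd

wine = pd.read_csv("Wine.csv")            # assumed: one row per wine, 13 numeric columns
Z = (wine - wine.mean()) / wine.std()     # standardize each variable to unit variance

S = np.cov(Z.values, rowvar=False)        # covariance of the standardized data
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

pct_variance = 100 * eigvals / eigvals.sum()
print(pd.DataFrame({"variance": eigvals,
                    "% variance": pct_variance,
                    "cumulative %": pct_variance.cumsum()}))

scores = Z.values @ eigvecs               # principal component scores
# Off-diagonal correlations between score columns are zero (up to rounding),
# which is why the scores can be used as regressors without multicollinearity.
print(np.round(np.corrcoef(scores, rowvar=False), 6))
```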

11.5 Normalizing the Data

The principal components shown in Figure 2 were computed after replacing each original variable by a standardized version of the variable that has unit variance. This is easily accomplished by dividing each variable by its standard deviation. (Sometimes the mean is subtracted as well.) The effect of this normalization (standardization) is to give all variables equal importance in terms of variability.
When should we normalize the data like this? This depends on the nature of the data. When the units of measurement are common for the variables (e.g. dollars), and when their scale reflects their importance (sales of jet fuel, sales of heating oil), it is probably best not to normalize (rescale the data so that it has unit variance). If the variables are measured in quite differing units, so that it is unclear how to compare the variability of different variables (e.g. dollars for some, parts per million for others), or if, for variables measured in the same units, scale does not reflect importance (earnings per share, gross revenues), it is generally advisable to normalize. In this way, changes in units of measurement do not change the principal component weights. In the rare situations where we can give relative weights to variables, we would multiply the normalized variables by these weights before doing the principal components analysis.
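The sketch below (illustrative, reusing the assumed Wine.csv from the previous sketch) contrasts the share of variance captured by the first component with and without standardization; without it, a variable with a very large standard deviation, such as Proline here, dominates the first component.

```python
import numpy as np
import pandas as pd

def first_pc_share(data):
    """Fraction of total variance captured by the first principal component."""
    eigvals = np.linalg.eigvalsh(np.cov(data, rowvar=False))
    return eigvals.max() / eigvals.sum()

wine = pd.read_csv("Wine.csv")                      # assumed file, as before
raw = wine.values
standardized = ((wine - wine.mean()) / wine.std()).values

print("raw data:         ", round(first_pc_share(raw), 3))           # close to 1 (Proline dominates)
print("standardized data:", round(first_pc_share(standardized), 3))  # about 0.42, as in Figure 2
```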

Example 2 (continued)
Normalizing the variables in the wine data is important due to the heterogeneous nature of the variables. The first five principal components computed on the raw (non-normalized) data are shown in Table 3. Notice that the first principal component is comprised mainly of the variable Proline, and this component explains almost all the variance in the data. This is because Proline's standard deviation is 351, compared to the next largest standard deviation of 15 for the variable Magnesium. The second principal component is essentially Magnesium. The standard deviations of all the other variables are about 1% (or less) of that of Proline.
Table 3: Principal Components of non-normalized Wine data

                              Principal Components
                            1        2        3        4        5    Std. Dev.
Alcohol                 0.001    0.013    0.014   -0.030    0.129        0.8
Malic Acid             -0.001    0.009    0.167   -0.427   -0.402        1.2
Ash                     0.000   -0.002    0.054   -0.009    0.006        0.3
Ash Alcalinity         -0.004   -0.045    0.976    0.176    0.060        3.6
Magnesium               0.014   -0.998   -0.040   -0.031    0.006       14.7
Total Phenols           0.001    0.002   -0.015    0.164    0.316        0.7
Flavanoids              0.002    0.000   -0.049    0.214    0.545        1.1
Nonflavanoid Phenols    0.000    0.002    0.004   -0.025   -0.040        0.1
Proanthocyanins         0.001   -0.007   -0.031    0.082    0.244        0.7
Color Intensity         0.002    0.022    0.097   -0.804    0.536        1.6
Hue                     0.000   -0.002   -0.021    0.096    0.064        0.2
OD280/OD315             0.001   -0.002   -0.022    0.220    0.261        0.7
Proline                 1.000    0.014    0.004    0.001   -0.004      351.5
Variance           123594.453  194.345   11.424    2.388    1.391
% Variance            99.830%   0.157%   0.009%   0.002%   0.001%
Cumulative %          99.830%  99.987%  99.996%  99.998%  99.999%

The principal components analysis without normalization is trivial for this data set. The first four components are the four variables with the largest variances in the data and account for almost 100% of the total variance in the data.

11.6 Principal Components and Orthogonal Least Squares

The weights computed by principal components analysis have an interesting alternative interpretation. Suppose we want to fit a linear surface (a straight line for 2 dimensions and a plane for 3 dimensions) to the data points, where the objective is to minimize the sum of squared errors measured by the squared orthogonal distances (squared lengths of perpendiculars) from the points to the fitted linear surface. The weights of the first principal component would define the best such linear surface, the one that minimizes this sum. The variance of the first principal component, expressed as a percentage of the total variation in the data, would be the portion of the variability explained by the fit, in a manner analogous to R2 in multiple linear regression. This property can be exploited to find nonlinear structure in high-dimensional data by considering perpendicular projections on non-linear surfaces (Hastie and Stuetzle, 1989).

Chapter 12
Cluster Analysis
12.1 What is Cluster Analysis?

Cluster analysis is concerned with forming groups of similar objects based on several measurements of different kinds made on the objects. The key idea is to form these clusters in ways that would be useful for the aims of the analysis. This idea has been applied in many areas, including astronomy, archaeology, medicine, chemistry, education, psychology, linguistics and sociology. For example, biologists have made extensive use of classes and sub-classes to organize species. A spectacular success of the clustering idea in chemistry was Mendeleev's periodic table of the elements. In marketing and political forecasting, clustering of neighborhoods using US postal zip codes has been used successfully to group neighborhoods by lifestyles. Claritas, a company that pioneered this approach, grouped neighborhoods into 40 clusters using various measures of consumer expenditure and demographics. Examining the clusters enabled Claritas to come up with evocative names, such as "Bohemian Mix," "Furs and Station Wagons" and "Money and Brains," for the groups that captured the dominant lifestyles in the neighborhoods. Knowledge of lifestyles can be used to estimate the potential demand for products such as sports utility vehicles and services such as pleasure cruises.
The objective of this chapter is to help you understand the key ideas underlying the most commonly used techniques for cluster analysis and to appreciate their strengths and weaknesses. We cannot aspire to be comprehensive, as there are literally hundreds of methods (there is even a journal dedicated to clustering ideas: The Journal of Classification!).
Typically, the basic data used to form clusters is a table of measurements on several variables, where each column represents a variable and a row represents an object (case). Our goal is to form groups of cases so that similar cases are in the same group. The number of groups may be specified or determined from the data.

12.2 Example 1 - Public Utilities Data

Table 1 below gives corporate data on 22 US public utilities. We are interested in forming groups of similar utilities. The objects to be clustered are the utilities, and there are 8 measurements on each utility, described in Table 2. An example where clustering would be useful is a study to predict the cost impact of deregulation. To do the requisite analysis, economists would need to build a detailed cost model of the various utilities. It would save a considerable amount of time and effort if we could cluster similar types of utilities, build detailed cost models for just one "typical" utility in each cluster, and then scale up from these models to estimate results for all utilities.
Before we can use any technique for clustering we need to define a measure of the distance between utilities, so that similar utilities are a short distance apart and dissimilar ones are far from each other. A popular distance measure for variables that take on continuous values is to normalize the values (subtract the mean and divide by the standard deviation; sometimes other measures, such as the range, are used for scaling) and then compute the distance between objects using the Euclidean metric.
The Euclidean distance $d_{ij}$ between two cases, i and j, with normalized variable values $(x_{i1}, x_{i2}, \ldots, x_{ip})$ and $(x_{j1}, x_{j2}, \ldots, x_{jp})$ is defined by:

$$d_{ij} = \sqrt{(x_{i1}-x_{j1})^2 + (x_{i2}-x_{j2})^2 + \cdots + (x_{ip}-x_{jp})^2}.$$

All our variables are continuous in this example, so we compute distances using this metric. The result of the calculations is given in Table 3 below.
If we felt that some variables should be given more importance than others, we would modify the squared difference terms by multiplying them by weights (positive numbers adding up to one), using larger weights for the more important variables. The weighted Euclidean distance measure is given by:

$$d_{ij} = \sqrt{w_1 (x_{i1}-x_{j1})^2 + w_2 (x_{i2}-x_{j2})^2 + \cdots + w_p (x_{ip}-x_{jp})^2},$$

where $w_1, w_2, \ldots, w_p$ are the weights for variables $1, 2, \ldots, p$, with $w_m \ge 0$ and $\sum_{m=1}^{p} w_m = 1$.
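A brief sketch of these two distance calculations (illustrative only; the file Utilities.csv and the column names X1 through X8 are assumptions, not taken from the text):

```python
import numpy as np
import pandas as pd

util = pd.read_csv("Utilities.csv")                 # assumed: 22 rows, columns X1..X8
X = util[[f"X{k}" for k in range(1, 9)]].values
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)    # standardized variable values

def euclidean(zi, zj, weights=None):
    """Euclidean (or weighted Euclidean) distance between two standardized cases."""
    diff2 = (zi - zj) ** 2
    if weights is None:
        return np.sqrt(diff2.sum())
    return np.sqrt((weights * diff2).sum())

# Distance between utilities 1 and 18 (rows 0 and 17), unweighted and weighted.
w = np.full(8, 1 / 8)                               # equal weights summing to one
print(euclidean(Z[0], Z[17]))
print(euclidean(Z[0], Z[17], weights=w))
```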

Table 1: Public Utilities Data

No.  Company                              X1     X2    X3    X4    X5     X6    X7     X8
1    Arizona Public Service             1.06    9.2   151  54.4   1.6   9077   0.0  0.628
2    Boston Edison Company              0.89   10.3   202  57.9   2.2   5088  25.3  1.555
3    Central Louisiana Electric Co.     1.43   15.4   113  53.0   3.4   9212   0.0  1.058
4    Commonwealth Edison Co.            1.02   11.2   168  56.0   0.3   6423  34.3  0.700
5    Consolidated Edison Co. (NY)       1.49    8.8   192  51.2   1.0   3300  15.6  2.044
6    Florida Power and Light            1.32   13.5   111  60.0  -2.2  11127  22.5  1.241
7    Hawaiian Electric Co.              1.22   12.2   175  67.6   2.2   7642   0.0  1.652
8    Idaho Power Co.                    1.10    9.2   245  57.0   3.3  13082   0.0  0.309
9    Kentucky Utilities Co.             1.34   13.0   168  60.4   7.2   8406   0.0  0.862
10   Madison Gas & Electric Co.         1.12   12.4   197  53.0   2.7   6455  39.2  0.623
11   Nevada Power Co.                   0.75    7.5   173  51.5   6.5  17441   0.0  0.768
12   New England Electric Co.           1.13   10.9   178  62.0   3.7   6154   0.0  1.897
13   Northern States Power Co.          1.15   12.7   199  53.7   6.4   7179  50.2  0.527
14   Oklahoma Gas and Electric Co.      1.09   12.0    96  49.8   1.4   9673   0.0  0.588
15   Pacific Gas & Electric Co.         0.96    7.6   164  62.2  -0.1   6468   0.9  1.400
16   Puget Sound Power & Light Co.      1.16    9.9   252  56.0   9.2  15991   0.0  0.620
17   San Diego Gas & Electric Co.       0.76    6.4   136  61.9   9.0   5714   8.3  1.920
18   The Southern Co.                   1.05   12.6   150  56.7   2.7  10140   0.0  1.108
19   Texas Utilities Co.                1.16   11.7   104  54.0  -2.1  13507   0.0  0.636
20   Wisconsin Electric Power Co.       1.20   11.8   148  59.9   3.5   7297  41.1  0.702
21   United Illuminating Co.            1.04    8.6   204  61.0   3.5   6650   0.0  2.116
22   Virginia Electric & Power Co.      1.07    9.3   178  54.3   5.9  10093  26.6  1.306

Table 2: Explanation of Variables

X1: Fixed-charge covering ratio (income/debt)
X2: Rate of return on capital
X3: Cost per KW capacity in place
X4: Annual load factor
X5: Peak KWH demand growth from 1974 to 1975
X6: Sales (KWH use per year)
X7: Percent nuclear
X8: Total fuel costs (cents per KWH)

Table 3: Distances between the 22 utilities based on standardized variable values - a symmetric 22 x 22 matrix of Euclidean distances (for example, the distance between utilities 1 and 2 is 3.1, and each utility is at distance 0.0 from itself).

Clustering Algorithms
A large number of techniques have been proposed for forming clusters from distance matrices. The most important types are hierarchical techniques, optimization techniques and mixture models. We discuss the first two types here.

12.3 Hierarchical Methods

There are two major types of hierarchical techniques: divisive and agglomerative. Agglomerative hierarchical techniques are the more commonly used. The idea behind this set of techniques is to start with each cluster comprising exactly one object and then progressively agglomerate (combine) the two nearest clusters until there is just one cluster left consisting of all the objects. Nearness of clusters is based on a measure of distance between clusters.
How do we measure distance between clusters? All agglomerative methods require as input a distance measure between all the objects that are to be clustered. This measure of distance between objects is then mapped into a metric for the distance between clusters (sets of objects). The only difference between the various agglomerative techniques is the way in which this inter-cluster distance metric is defined. The most popular agglomerative techniques are listed below.

12.3.1 Nearest neighbor (Single linkage)

Here the distance between two clusters is defined as the distance between the nearest pair of objects in the two clusters (one object in each cluster). If cluster A is the set of objects $A_1, A_2, \ldots, A_m$ and cluster B is $B_1, B_2, \ldots, B_n$, the single linkage distance between A and B is $\min\{\mathrm{distance}(A_i, B_j) \mid i = 1, \ldots, m;\ j = 1, \ldots, n\}$. This method has a tendency to cluster together, at an early stage, objects that are distant from each other because of a chain of intermediate objects in the same cluster. Such clusters have elongated, sausage-like shapes when visualized as objects in space.


12.3.2 Farthest neighbor (Complete linkage)

Here the distance between two clusters is defined as the distance between the farthest pair of objects, with one object in the pair belonging to each cluster. If cluster A is the set of objects $A_1, A_2, \ldots, A_m$ and cluster B is $B_1, B_2, \ldots, B_n$, the complete linkage distance between A and B is $\max\{\mathrm{distance}(A_i, B_j) \mid i = 1, \ldots, m;\ j = 1, \ldots, n\}$. This method tends to produce clusters at the early stages whose objects lie within a narrow range of distances from each other. If we visualize them as objects in space, the objects in such clusters would form roughly spherical shapes.

12.3.3 Group average (Average linkage)

Here the distance between two clusters is defined as the average distance between all possible pairs of objects, with one object in each pair belonging to each cluster. If cluster A is the set of objects $A_1, A_2, \ldots, A_m$ and cluster B is $B_1, B_2, \ldots, B_n$, the average linkage distance between A and B is $\frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n} \mathrm{distance}(A_i, B_j)$.
Note that the results of the single linkage and the complete linkage methods depend only on the order of the inter-object distances, and so are invariant to monotonic transformations of the inter-object distances.
The nearest neighbor clusters for the utilities are displayed in Figure 1 below in a useful graphic format called a dendrogram. For any given number of clusters we can determine the cases in the clusters by sliding a horizontal line up and down until the number of vertical intersections of the horizontal line equals the desired number of clusters. For example, if we wanted to form 6 clusters we would find that the clusters are:
{1, 2, 4, 10, 13, 20, 7, 12, 21, 15, 14, 19, 18, 9, 3}; {8, 16}; {6}; {17}; {11}; and {5}.
Notice that if we wanted 5 clusters they would be the same as for six, with the exception that the first two clusters above would be merged into one cluster. In general, all hierarchical methods produce clusters that are nested within each other as we decrease the number of clusters we desire. This is a valuable property for interpreting clusters and is essential in certain applications, such as taxonomy of varieties of living organisms.
The average linkage dendrogram is shown in Figure 2. If we want six clusters using average linkage, they would be:
{1, 18, 14, 19, 6, 3, 9}; {2, 22, 4, 20, 10, 13}; {7, 12, 21, 15}; {17}; {5}; {8, 16, 11}.
Notice that both methods identify {5} and {17} as singleton clusters. The clusters tend to group geographically: for example, there is a southern group {1, 18, 14, 19, 6, 3, 9} and an east/west seaboard group {7, 12, 21, 15}.
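For readers who want to reproduce these dendrogram-based clusterings outside XLMiner, the sketch below (illustrative; it reuses the assumed Utilities.csv and standardization from the earlier sketch) applies single and average linkage with SciPy and cuts each tree at six clusters.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

util = pd.read_csv("Utilities.csv")                 # assumed: 22 rows, columns X1..X8
X = util[[f"X{k}" for k in range(1, 9)]].values
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

d = pdist(Z, metric="euclidean")                    # condensed form of Table 3

for method in ("single", "average"):
    tree = linkage(d, method=method)
    labels = fcluster(tree, t=6, criterion="maxclust")   # cut into 6 clusters
    print(method, "linkage clusters:")
    for c in np.unique(labels):
        print("  cluster", c, ":", list(np.where(labels == c)[0] + 1))  # 1-based utility numbers
```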

Figure 1: Dendrogram - Single Linkage

Figure 2: Dendrogram - Average Linkage between groups

12.4 Optimization and the k-means algorithm

A non-hierarchical approach to forming good clusters is to specify a desired number of clusters, say k, and to assign each case to one of k clusters so as to minimize a measure of dispersion within the clusters. A very common measure is the sum of distances, or the sum of squared Euclidean distances, from the mean of each cluster. The problem can be set up as an integer programming problem (see Appendix), but because solving integer programs with a large number of variables is time consuming, clusters are often computed using a fast, heuristic method that generally produces good (but not necessarily optimal) solutions. The k-means algorithm is one such method.
The k-means algorithm starts with an initial partition of the cases into k clusters. Subsequent steps modify the partition to reduce the sum of the distances of each case from the mean of the cluster to which the case belongs. The modification consists of allocating each case to the nearest of the k means of the previous partition. This leads to a new partition for which the sum of distances is strictly smaller than before. The means of the new clusters are computed and the improvement step is repeated until the improvement is very small. The method is very fast. There is a possibility that the improvement step leads to fewer than k partitions. In this situation one of the partitions (generally the one with the largest sum of distances from the mean) is divided into two or more parts to reach the required number of k partitions. The algorithm can be rerun with different randomly generated starting partitions to reduce the chances of the heuristic producing a poor solution. Generally the number of clusters in the data is not known, so it is a good idea to run the algorithm with different values of k near the number of clusters one expects from the data, to see how the sum of distances decreases with increasing values of k. The results of running the k-means algorithm for Example 1 with k = 6 are shown below, following a brief sketch of the algorithm. The clusters developed using different values of k will not be nested (unlike those developed by hierarchical methods).
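A minimal sketch of the improvement step just described (this is an illustration, not the XLMiner implementation; the data array and k are placeholders):

```python
import numpy as np

def k_means(Z, k, n_iter=20, seed=0):
    """Minimal k-means: start from a random partition, then repeatedly
    reassign each case to the nearest cluster mean and recompute the means."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(Z))          # initial random partition
    for _ in range(n_iter):
        # Means of the current clusters (reseed any cluster that became empty).
        means = np.array([Z[labels == c].mean(axis=0) if np.any(labels == c)
                          else Z[rng.integers(len(Z))] for c in range(k)])
        # Allocate each case to the nearest mean.
        dists = np.linalg.norm(Z[:, None, :] - means[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):        # no further improvement
            break
        labels = new_labels
    return labels, means

# Example: cluster the standardized utility data Z (from the earlier sketches) with k = 6.
# labels, means = k_means(Z, k=6)
```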

Output of k-means clustering

Initial Cluster Centers (standardized variables)

       Cluster 1  Cluster 2  Cluster 3  Cluster 4  Cluster 5  Cluster 6
X1        0.25      -1.92      -0.13       0.19       2.04       1.12
X2       -0.37      -1.93       0.56       0.88      -0.86       1.23
X3        2.03      -0.78      -1.75       0.75       0.58      -1.39
X4       -0.22       1.10      -1.61      -0.73      -1.30       0.68
X5        1.91       1.85      -0.59       1.01      -0.72      -1.74
X6        1.99      -0.90       0.21      -0.49      -1.58       0.62
X7       -0.71      -0.22      -0.71       2.27       0.21       0.63
X8       -0.87       1.47      -0.93      -1.04       1.69       0.25

Minimum distance is between initial centers 3 and 6 = 3.469

Iteration History (change in cluster centers)

Iteration  Cluster 1  Cluster 2  Cluster 3  Cluster 4  Cluster 5  Cluster 6
    1        1.533      2.350      1.211      1.730      0.000      1.676
    2        0.950      0.000      0.000      0.000      0.000      0.946
    3        0.000      0.000      0.000      0.000      0.000      0.000

Cluster Membership (case numbers, with each case's distance from its cluster center)

Cluster 1: 8 (1.560), 11 (2.177), 16 (1.536)
Cluster 2: 12 (1.440), 15 (1.589), 17 (2.350), 21 (1.074)
Cluster 3: 1 (1.666), 3 (2.241), 14 (1.211), 18 (1.334), 19 (1.555)
Cluster 4: 2 (1.989), 4 (1.195), 10 (1.109), 13 (1.730), 20 (1.494), 22 (1.756)
Cluster 5: 5 (0.000)
Cluster 6: 6 (2.173), 7 (1.640), 9 (1.893)

Final Cluster Centers

       Cluster 1  Cluster 2  Cluster 3  Cluster 4  Cluster 5  Cluster 6
X1       -0.60      -0.77       0.24      -0.21       2.04       0.97
X2       -0.83      -1.05       0.64       0.24      -0.86       0.96
X3        1.34       0.06      -1.10       0.32       0.58      -0.41
X4       -0.48       1.08      -0.76      -0.26      -1.30       1.28
X5        0.99       0.25      -0.59       0.08      -0.72      -0.27
X6        1.86      -0.75       0.40      -0.51      -1.58       0.04
X7       -0.71      -0.58      -0.71       1.44       0.21      -0.27
X8       -0.97       1.31      -0.54      -0.36       1.69       0.27

Distances between Final Cluster Centers

          1       2       3       4       5       6
1     0.000   4.087   3.706   3.720   5.557   4.295
2     4.087   0.000   3.752   3.286   4.024   3.072
3     3.706   3.752   0.000   2.927   4.287   2.519
4     3.720   3.286   2.927   0.000   3.861   2.924
5     5.557   4.024   4.287   3.861   0.000   4.141
6     4.295   3.072   2.519   2.924   4.141   0.000

Number of Cases in each Cluster

Cluster:   1    2    3    4    5    6
Cases:     3    4    5    6    1    3
Valid: 22    Missing: 0

The ratio of the sum of squared distances for a given k to the sum of squared distances to the mean of all the cases (k = 1) is a useful measure of the usefulness of the clustering. If the ratio is near 1.0 the clustering has not been very effective; if it is small we have well-separated groups.
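A short sketch of this ratio (illustrative; it assumes the labels and means returned by the k_means sketch above):

```python
import numpy as np

def ss_ratio(Z, labels, means):
    """Within-cluster sum of squared distances divided by the total
    sum of squared distances to the overall mean (the k = 1 solution)."""
    within = sum(((Z[labels == c] - means[c]) ** 2).sum() for c in range(len(means)))
    total = ((Z - Z.mean(axis=0)) ** 2).sum()
    return within / total

# A value near 1 means the clustering explains little; a small value means
# well-separated groups.
# print(ss_ratio(Z, labels, means))
```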

12.5 Similarity Measures

Sometimes it is more natural or convenient to work with a similarity measure between cases rather than with a distance, which measures dissimilarity. A popular similarity measure is the square of the correlation coefficient, $r_{ij}^2$, where

$$r_{ij}^2 = \frac{\left[\sum_{m=1}^{p} (x_{im} - \bar{x}_m)(x_{jm} - \bar{x}_m)\right]^2}{\sum_{m=1}^{p} (x_{im} - \bar{x}_m)^2 \;\sum_{m=1}^{p} (x_{jm} - \bar{x}_m)^2}.$$

Such measures can always be converted to distance measures. In the above example we could define a distance measure $d_{ij} = 1 - r_{ij}^2$.
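A small sketch of this similarity and the derived distance (illustrative only; the data values are placeholders):

```python
import numpy as np

def r2_similarity(xi, xj, xbar):
    """Squared correlation similarity between cases i and j, centering each
    variable by its mean xbar (one mean per variable, as in the formula above)."""
    num = np.sum((xi - xbar) * (xj - xbar)) ** 2
    den = np.sum((xi - xbar) ** 2) * np.sum((xj - xbar) ** 2)
    return num / den

# X is an n x p data matrix; xbar holds the p variable means.
X = np.array([[1.2, 0.4, 3.1], [0.9, 0.7, 2.8], [2.0, 1.1, 2.2]])
xbar = X.mean(axis=0)
d_01 = 1.0 - r2_similarity(X[0], X[1], xbar)   # distance d_ij = 1 - r_ij^2
print(d_01)
```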
In the case of binary values of x it is more intuitively appealing to use similarity measures. Suppose we have binary values for all the x_ij's, and for individuals i and j we have the following 2 x 2 table of counts of variables:

                     Individual j
                       0      1
Individual i   0       a      b      a + b
               1       c      d      c + d
                     a + c  b + d      p

The most useful similarity measures in this situation are:

- The matching coefficient, (a + d)/p.
- Jaccard's coefficient, d/(b + c + d). This coefficient ignores zero matches, which is desirable when we do not want to consider two individuals to be similar simply because they both lack a large number of characteristics.
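A sketch of both coefficients for two binary profiles (illustrative only; the example vectors are placeholders):

```python
import numpy as np

def matching_and_jaccard(xi, xj):
    """Matching coefficient (a + d)/p and Jaccard's coefficient d/(b + c + d)
    for two binary vectors, using the 2 x 2 table notation of the text."""
    xi, xj = np.asarray(xi), np.asarray(xj)
    a = np.sum((xi == 0) & (xj == 0))
    b = np.sum((xi == 0) & (xj == 1))
    c = np.sum((xi == 1) & (xj == 0))
    d = np.sum((xi == 1) & (xj == 1))
    p = a + b + c + d
    return (a + d) / p, d / (b + c + d)

print(matching_and_jaccard([1, 0, 0, 1, 0], [1, 0, 1, 1, 0]))   # (0.8, 0.667)
```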
When the variables are mixed, a similarity coefficient suggested by Gower [2] is very useful. It is defined as

$$s_{ij} = \frac{\sum_{m=1}^{p} w_{ijm}\, s_{ijm}}{\sum_{m=1}^{p} w_{ijm}},$$

with $w_{ijm} = 1$ subject to the following rules:

- $w_{ijm} = 0$ when the value of variable m is not known for one of the pair of individuals, or for binary variables when we wish to remove zero matches.
- For non-binary categorical variables, $s_{ijm} = 0$ unless the individuals are in the same category, in which case $s_{ijm} = 1$.
- For continuous variables, $s_{ijm} = 1 - |x_{im} - x_{jm}| / (x_{m:\max} - x_{m:\min})$.

12.6 Other distance measures

Useful measures of dissimilarity other than the Euclidean distance that satisfy the triangular inequality, and so qualify as distance metrics, include the following.

Mahalanobis distance, defined by

$$d_{ij} = \sqrt{(x_i - x_j)' S^{-1} (x_i - x_j)},$$

where $x_i$ and $x_j$ are the p-dimensional vectors of variable values for cases i and j respectively, and S is the covariance matrix for these vectors. This measure takes into account the correlation between variables; with this measure, variables that are highly correlated with other variables do not contribute as much as variables that are uncorrelated or mildly correlated.

Manhattan distance, defined by

$$d_{ij} = \sum_{m=1}^{p} |x_{im} - x_{jm}|.$$

Maximum co-ordinate distance, defined by

$$d_{ij} = \max_{m = 1, 2, \ldots, p} |x_{im} - x_{jm}|.$$
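A sketch of the three distances (illustrative only; the toy matrix below uses the first three variables of the first five utilities in Table 1):

```python
import numpy as np

X = np.array([[1.06,  9.2, 151.0],
              [0.89, 10.3, 202.0],
              [1.43, 15.4, 113.0],
              [1.02, 11.2, 168.0],
              [1.49,  8.8, 192.0]])

xi, xj = X[0], X[1]
S_inv = np.linalg.inv(np.cov(X, rowvar=False))      # inverse covariance matrix

mahalanobis = np.sqrt((xi - xj) @ S_inv @ (xi - xj))
manhattan = np.abs(xi - xj).sum()
max_coordinate = np.abs(xi - xj).max()

print(mahalanobis, manhattan, max_coordinate)
```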

Chapter 13
Cases
13.1 Charles Book Club
THE BOOK INDUSTRY


Approximately 50,000 new titles, including new editions, are published each year in the US, giving rise to a $25 billion industry in 2001.[1] In terms of percentage of sales, this industry may be segmented as follows:

16%  Textbooks
16%  Trade books sold in bookstores
21%  Technical, scientific and professional books
10%  Book clubs and other mail-order books
17%  Mass-market paperbound books
20%  All other books

[1] This case was derived, with the assistance of Ms. Vinni Bhandari, from "The Bookbinders Club, a Case Study in Database Marketing," prepared by Nissan Levin and Jacob Zahavi, Tel Aviv University. Permission pending.

Book retailing in the US in the 1970s was characterized by the growth of bookstore chains located in shopping malls. The 1980s saw increased purchases in bookstores stimulated through the widespread practice of discounting. By the 1990s, the superstore concept of book retailing gained acceptance and contributed to double-digit growth of the book industry. Conveniently situated near large shopping centers, superstores maintain large inventories of 30,000 to 80,000 titles, and employ well-informed sales personnel. Superstores applied intense competitive pressure on book clubs and mail-order firms as well as traditional book retailers. (Association of American Publishers, Industry Statistics, 2002.) In response to these pressures, book clubs sought out alternative business models that were more responsive to their customers' individual preferences.
Historically, book clubs offered their readers different types of membership programs. Two common membership programs are the "continuity" and "negative option" programs, which were extended contractual relationships between the club and its members.
Under a continuity program, a reader would sign up by accepting an offer of several books for just a few dollars (plus shipping and handling) and an agreement to receive a shipment of one or two books each month thereafter at more standard pricing. The continuity program was most common in the children's book market, where parents are willing to delegate the rights to the book club to make a selection, and much of the club's prestige depends on the quality of its selections.
In a negative option program, readers get to select which and how many additional books they would like to receive. However, the club's selection of the month will be automatically delivered to them unless they specifically mark "no" by a deadline date on their order form. Negative option programs sometimes result in customer dissatisfaction and always give rise to significant mailing and processing costs.
In an attempt to combat these trends, some book clubs have begun to offer books on a "positive option" basis, but only to specific segments of their customer base that are likely to be receptive to specific offers. Rather than expanding the volume and coverage of mailings, some book clubs are beginning to use database-marketing techniques to more accurately target customers. Information contained in their databases is used to identify who is most likely to be interested in a specific offer. This information enables clubs to carefully design special programs tailored to meet their customer segments' varying needs.
DATABASE MARKETING AT CHARLES

The club
The Charles Book Club (CBC) was established in December of 1986, on the premise that a book club could differentiate itself through a deep understanding of its customer base and by delivering uniquely tailored offerings. CBC focused on selling specialty books by direct marketing through a variety of channels, including media advertising (TV, magazines, newspapers) and mailing. CBC is strictly a distributor and does not publish any of the books that it sells. In line with its commitment to understanding its customer base, CBC built and maintained a detailed database about its club members. Upon enrollment, readers were required to fill out an insert and mail it to CBC. Through this process, CBC has created an active database of 500,000 readers. CBC acquired most of these customers through advertising in specialty magazines.

The problem
CBC sent mailings to its club members each month containing its latest offering. On the surface, CBC looked very successful: mailing volume was increasing, book selection was diversifying and growing, and its customer database was growing. However, its bottom-line profits were falling. The decreasing profits led CBC to revisit its original plan of using database marketing to improve its mailing yields and to stay profitable.

A possible solution
CBC embraced the idea that deriving intelligence from its data would allow the company to know its customers better and enable multiple targeted campaigns, where each target audience would receive appropriate mailings. CBC's management decided to focus its efforts on the most profitable customers and prospects, and to design targeted marketing strategies to best reach them.
The two processes they had in place were:
1. Customer acquisition:
- New members would be acquired by advertising in specialty magazines, newspapers and TV.
- Direct mailing and telemarketing would contact existing club members.
- Every new book would be offered to the club members before general advertising.
2. Data collection:
- All customer responses would be recorded and maintained in the database.
- Any critical information not being collected would be requested from the customer.
To derive intelligence from these processes they decided to use a two-step approach for each new title:
(a) Conduct a market test involving a random sample of 10,000 customers from the database to enable analysis of customer responses. The analysis would create and calibrate response models for the current book offering.
(b) Based on the response models, compute a score for each customer in the database. Use this score and a cut-off value to extract a target customer list for direct mail promotion.

Targeting promotions was considered to be of prime importance. There were, in addition, other opportunities to create successful marketing campaigns based on customer behavior data such as returns, inactivity, complaints, and compliments. CBC planned to address these opportunities at a subsequent stage.

Art History of Florence
A new title, The Art History of Florence, is ready for release. CBC has sent a test mailing to a random sample of 4,000 customers from its customer base. The customer responses have been collated with past purchase data. The data has been randomly partitioned into 3 parts: Training Data (1800 customers), the initial data to be used to fit response models; Validation Data (1400 customers), hold-out data used to compare the performance of different response models; and Test Data (800 customers), data to be used only after a final model has been selected, to estimate the likely accuracy of the model when it is deployed. Each row (or case) in the spreadsheet (other than the header) corresponds to one market test customer. Each column is a variable, with the header row giving the name of the variable. The variable names and descriptions are given in Table 1, below.

Table 1: List of Variables in Charles Book Club Data Set


Variable Name       Description
Seq#                Sequence number in the partition
ID#                 Identification number in the full (unpartitioned) market test data set
Gender              0 = Male, 1 = Female
M                   Monetary - total money spent on books
R                   Recency - months since last purchase
F                   Frequency - total number of purchases
FirstPurch          Months since first purchase
ChildBks            Number of purchases from the category: Child books
YouthBks            Number of purchases from the category: Youth books
CookBks             Number of purchases from the category: Cookbooks
DoItYBks            Number of purchases from the category: Do It Yourself books
RefBks              Number of purchases from the category: Reference books (Atlases, Encyclopedias, Dictionaries)
ArtBks              Number of purchases from the category: Art books
GeoBks              Number of purchases from the category: Geography books
ItalCook            Number of purchases of book title: Secrets of Italian Cooking
ItalAtlas           Number of purchases of book title: Historical Atlas of Italy
ItalArt             Number of purchases of book title: Italian Art
Florence            = 1 if The Art History of Florence was bought, = 0 if not
Related purchase    Number of related books purchased

DATA MINING TECHNIQUES

There are various data mining techniques that can be used to mine the data collected from the market test. No one technique is universally better than another. The particular context and the particular characteristics of the data are the major factors in determining which techniques perform better in an application. For this assignment, we will focus on two fundamental techniques:
- k-Nearest Neighbor
- Logistic regression
We will compare them with each other as well as with a standard industry practice known as RFM segmentation.

RFM Segmentation
The segmentation process in database marketing aims to partition customers in a list of prospects into homogeneous groups (segments) that are similar with respect to buying behavior. The homogeneity criterion we need for segmentation is propensity to purchase the offering. But since we cannot measure this attribute, we use variables that are plausible indicators of this propensity.
In the direct marketing business the most commonly used variables are the RFM variables:
R - Recency - time since last purchase
F - Frequency - the number of previous purchases from the company over a period
M - Monetary - the amount of money spent on the company's products over a period.
The assumption is that the more recent the last purchase, the more products bought from the company in the past, and the more money spent in the past buying the company's products, the more likely the customer is to purchase the product offered.
The 1800 observations in the training data and the 1400 observations in the validation data have been divided into Recency, Frequency and Monetary categories as follows:
Recency:
0-2 months (Rcode = 1)
3-6 months (Rcode = 2)
7-12 months (Rcode = 3)
13 months and up (Rcode = 4)
Frequency:
1 book (Fcode = 1)
2 books (Fcode = 2)
3 books and up (Fcode = 3)
Monetary:
$0 - $25 (Mcode = 1)
$26 - $50 (Mcode = 2)
$51 - $100 (Mcode = 3)
$101 - $200 (Mcode = 4)
$201 and up (Mcode = 5)
The tables below display the 1800 customers in the training data, cross-tabulated by these categories. The buyers are summarized in the first five tables and all customers (buyers and non-buyers) in the next five tables. These tables are available for Excel computations in the RFM spreadsheet in the data file.
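A sketch of how this RFM coding and cross-tabulation could be reproduced programmatically (illustrative only; the file name CharlesBookClub.csv and the column names R, F, M and Florence are assumptions):

```python
import pandas as pd

cbc = pd.read_csv("CharlesBookClub.csv")   # assumed: one row per market-test customer

# Bin R, F and M into the Rcode, Fcode and Mcode categories defined above.
cbc["Rcode"] = pd.cut(cbc["R"], bins=[-1, 2, 6, 12, float("inf")], labels=[1, 2, 3, 4])
cbc["Fcode"] = pd.cut(cbc["F"], bins=[0, 1, 2, float("inf")], labels=[1, 2, 3])
cbc["Mcode"] = pd.cut(cbc["M"], bins=[-1, 25, 50, 100, 200, float("inf")],
                      labels=[1, 2, 3, 4, 5])

# Cross-tabulate buyers of "The Art History of Florence" by Fcode and Mcode within each Rcode.
buyers = cbc[cbc["Florence"] == 1]
for rcode, group in buyers.groupby("Rcode", observed=True):
    print("Rcode =", rcode)
    print(pd.crosstab(group["Fcode"], group["Mcode"]))
```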

Tables: cross-tabulations of the 1800 training-data customers by Fcode and Mcode within each Rcode - first for buyers of The Art History of Florence, then for all customers (buyers and non-buyers). These tables are provided in the RFM spreadsheet in the data file.
(a) What is the response rate for the training data customers taken as a whole? What is the response rate for each of the 4 x 5 x 3 = 60 combinations of RFM categories? Which combinations have response rates in the training data that are above the overall response rate in the training data? (A code sketch for these computations appears after part (c).)

(b) Suppose that we decide to send promotional mail only to the RFM combinations identified in part (a). Compute the response rate in the validation data using these combinations.
(c) Rework parts (a) and (b) with three segments: segment 1 consisting of RFM combinations that have response rates exceeding twice the overall response rate, segment 2 consisting of RFM combinations that exceed the overall response rate but do not exceed twice that rate, and segment 3 consisting of the remaining RFM combinations. Draw the cumulative lift curve (consisting of three points for these three segments) showing the number of customers in the validation data set on the x axis and the cumulative number of buyers in the validation data set on the y axis.
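The sketch referred to in part (a) (illustrative only; it assumes a file holding the 1800 training customers with the Rcode, Fcode and Mcode columns constructed as in the earlier RFM sketch):

```python
import pandas as pd

train = pd.read_csv("CBC_train_rfm.csv")   # assumed training partition with RFM codes

overall_rate = train["Florence"].mean()
combo_rates = (train.groupby(["Rcode", "Mcode", "Fcode"])["Florence"]
                    .agg(["mean", "count"]))

# RFM combinations whose training response rate exceeds the overall rate.
above = combo_rates[combo_rates["mean"] > overall_rate]
print("overall response rate:", round(overall_rate, 4))
print(above.sort_values("mean", ascending=False))
```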
k-Nearest Neighbor
The k-Nearest Neighbor technique can be used to create segments based on the proximity of the offered product to similar products, as well as on propensity to purchase (as measured by the RFM variables). For The Art History of Florence, a possible segmentation by product proximity could be created using the following variables:
1. M: Monetary - total money ($) spent on books
2. R: Recency - months since last purchase
3. F: Frequency - total number of past purchases
4. FirstPurch: months since first purchase
5. RelatedPurch: total number of past purchases of related books, i.e. the sum of purchases from the Art and Geography categories and of the titles Secrets of Italian Cooking, Historical Atlas of Italy, and Italian Art.
(d) Use the k-Nearest Neighbor option under the Classify menu choice in XLMiner to classify cases with k = 1, k = 3 and k = 11. Use normalized data (note the checkbox "normalize input data" in the dialog box) and all five variables.
(e) Use the k-Nearest Neighbor option under the Prediction menu choice in XLMiner to compute a cumulative gains curve for the validation data for k = 1, k = 3 and k = 11. Use normalized data (note the checkbox "normalize input data" in the dialog box) and all five variables. The k-NN prediction algorithm gives a numerical value, which is a weighted sum of the values of the Florence variable for the k nearest neighbors, with weights that are inversely proportional to distance.
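A sketch of that distance-weighted k-NN prediction (an illustration only, not the XLMiner implementation; the training arrays are assumed to hold the normalized variables listed above):

```python
import numpy as np

def knn_predict(x_new, X_train, y_train, k=3, eps=1e-9):
    """Distance-weighted k-NN prediction: a weighted average of the Florence
    values of the k nearest (normalized) training cases, with weights 1/distance."""
    dists = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(dists)[:k]
    weights = 1.0 / (dists[nearest] + eps)        # eps guards against zero distance
    return np.sum(weights * y_train[nearest]) / weights.sum()

# X_train: normalized M, R, F, FirstPurch, RelatedPurch; y_train: 0/1 Florence values.
```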

Logistic Regression
The logistic regression model offers a powerful method for modeling response because it yields well-defined purchase probabilities. (The model is especially attractive in consumer choice settings because it can be derived from the random utility theory of consumer behavior, under the assumption that the error term in the customer's utility function follows a type I extreme value distribution.)
Use the training set data of 1800 observations to construct three logistic regression models with Florence as the dependent variable and, as independent variables:
- the full set of 15 predictors in the data set,
- a subset of predictors that you judge to be the best,
- only the R, F, and M variables.
(f) Score the customers in the validation sample and arrange them in descending order of purchase probability.
(g) Create a cumulative gains chart summarizing the results from the three logistic regression models created above, along with the expected cumulative gains for a random selection of an equal number of customers from the validation data set.
(h) If the cutoff criterion for a campaign is a 30% likelihood of a purchase, find the customers in the validation data that would be targeted and count the number of buyers in this set.
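A sketch of the scoring and cutoff steps in parts (f) and (h) (illustrative only; the file names and the use of scikit-learn are assumptions, not the XLMiner workflow):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

train = pd.read_csv("CBC_train.csv")      # assumed files with Florence plus predictor columns
valid = pd.read_csv("CBC_valid.csv")
predictors = [c for c in train.columns if c != "Florence"]

model = LogisticRegression(max_iter=1000).fit(train[predictors], train["Florence"])

valid = valid.assign(purchase_prob=model.predict_proba(valid[predictors])[:, 1])
valid = valid.sort_values("purchase_prob", ascending=False)        # part (f)

targeted = valid[valid["purchase_prob"] >= 0.30]                   # part (h): 30% cutoff
print("customers targeted:", len(targeted))
print("buyers among them:", int(targeted["Florence"].sum()))
```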

13.2 German Credit

The German Credit data set (available at ftp.ics.uci.edu/pub/machine-learning-databases/statlog/) has 30 variables and 1000 records, each record being a prior applicant for credit. Each applicant was rated as "good credit" (700 cases) or "bad credit" (300 cases).
New applicants for credit can also be evaluated on these 30 predictor variables and classified as a good credit risk or a bad credit risk, based on the predictor variables. All the variables are explained in Table 1.1. (Note: the original data set had a number of categorical variables, some of which have been transformed into a series of binary variables so that they can be appropriately handled by XLMiner. Several ordered categorical variables have been left as is, to be treated by XLMiner as numerical. The data has been organized in the spreadsheet German Credit.xls.)

Table 1.1 Variables for the German Credit data.

Var. #  Variable Name      Type         Description and Codes
1.      OBS#               Categorical  Observation number (sequence number in data set)
2.      CHK ACCT           Categorical  Checking account status: 0: < 0 DM; 1: 0-200 DM; 2: >= 200 DM; 3: no checking account
3.      DURATION           Numerical    Duration of credit in months
4.      HISTORY            Categorical  Credit history: 0: no credits taken; 1: all credits at this bank paid back duly; 2: existing credits paid back duly till now; 3: delay in paying off in the past; 4: critical account
5.      NEW CAR            Binary       Purpose of credit: car (new); 0: No, 1: Yes
6.      USED CAR           Binary       Purpose of credit: car (used); 0: No, 1: Yes
7.      FURNITURE          Binary       Purpose of credit: furniture/equipment; 0: No, 1: Yes
8.      RADIO/TV           Binary       Purpose of credit: radio/television; 0: No, 1: Yes
9.      EDUCATION          Binary       Purpose of credit: education; 0: No, 1: Yes
10.     RETRAINING         Binary       Purpose of credit: retraining; 0: No, 1: Yes
11.     AMOUNT             Numerical    Credit amount
12.     SAV ACCT           Categorical  Average balance in savings account: 0: < 100 DM; 1: 100-500 DM; 2: 500-1000 DM; 3: >= 1000 DM; 4: unknown/no savings account
13.     EMPLOYMENT         Categorical  Present employment since: 0: unemployed; 1: < 1 year; 2: 1-4 years; 3: 4-7 years; 4: >= 7 years
14.     INSTALL RATE       Numerical    Installment rate as % of disposable income
15.     MALE DIV           Binary       Applicant is male and divorced; 0: No, 1: Yes
16.     MALE SINGLE        Binary       Applicant is male and single; 0: No, 1: Yes
17.     MALE MAR WID       Binary       Applicant is male and married or a widower; 0: No, 1: Yes
18.     CO-APPLICANT       Binary       Application has a co-applicant; 0: No, 1: Yes
19.     GUARANTOR          Binary       Applicant has a guarantor; 0: No, 1: Yes
20.     PRESENT RESIDENT   Categorical  Present resident since (years): 0: <= 1 year; 1: 1-2 years; 2: 2-3 years; 3: > 4 years
21.     REAL ESTATE        Binary       Applicant owns real estate; 0: No, 1: Yes
22.     PROP UNKN NONE     Binary       Applicant owns no property (or unknown); 0: No, 1: Yes
23.     AGE                Numerical    Age in years
24.     OTHER INSTALL      Binary       Applicant has other installment plan credit; 0: No, 1: Yes
25.     RENT               Binary       Applicant rents; 0: No, 1: Yes
26.     OWN RES            Binary       Applicant owns residence; 0: No, 1: Yes
27.     NUM CREDITS        Numerical    Number of existing credits at this bank
28.     JOB                Categorical  Nature of job: 0: unemployed/unskilled non-resident; 1: unskilled resident; 2: skilled employee/official; 3: management/self-employed/highly qualified employee/officer
29.     NUM DEPENDENTS     Numerical    Number of people for whom liable to provide maintenance
30.     TELEPHONE          Binary       Applicant has phone in his or her name; 0: No, 1: Yes
31.     FOREIGN            Binary       Foreign worker; 0: No, 1: Yes
32.     RESPONSE           Binary       Credit rating is good; 0: No, 1: Yes

Table 1.2, below, shows the values of these variables for the first several records in the case.

Table 1.2 The data (first several rows)

The consequences of misclassification have been assessed as follows: the costs of a false positive (incorrectly saying an applicant is a good credit risk) outweigh the benefits of a true positive (correctly saying an applicant is a good credit risk) by a factor of five. This can be summarized in the following table.
Table 1.3 Opportunity Cost Table (in Deutsche Marks)

                    Predicted (Decision)
Actual          Good (Accept)    Bad (Reject)
Good                 0              100 DM
Bad                500 DM             0


The opportunity cost table was derived from the average net profit per loan, as shown below:

Table 1.4 Average Net Profit

                    Predicted (Decision)
Actual          Good (Accept)    Bad (Reject)
Good               100 DM             0
Bad               -500 DM             0

Let us use this table in assessing the performance of the various models, because it is simpler to explain to decision-makers who are used to thinking of their decisions in terms of net profits.
Assignment
1. Review the predictor variables and guess at what their role might be in a credit decision. Are there any surprises in the data?
2. Divide the data into training and validation partitions, and develop classification models using the following data mining techniques in XLMiner:
- Logistic regression
- Classification trees
- Neural networks.
3. Choose one model from each technique and report the confusion matrix and the cost/gain matrix for the validation data. Which technique has the highest net profit?
4. Let's see if we can improve our performance. Rather than accepting XLMiner's initial classification of everyone's credit status, let's use the predicted probability of success in logistic regression (where "success" means 1) as a basis for selecting the best credit risks first, followed by poorer-risk applicants. (A code sketch of this calculation follows the list.)
a. Sort the validation data on the predicted probability of success.
b. For each case, calculate the net profit of extending credit.
c. Add another column for cumulative net profit.
d. How far into the validation data do you go to get maximum net profit? (Often this is specified as a percentile or rounded to deciles.)
e. If this logistic regression model is scored to future applicants, what probability of success cutoff should be used in extending credit?
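The sketch referred to in item 4 (illustrative only; the scored-validation file and its column names are assumptions):

```python
import pandas as pd

# Assumed: a file of validation cases with the model's predicted probability of
# "good" (prob_good) and the actual outcome (actual_good, 1 = good, 0 = bad).
valid = pd.read_csv("german_valid_scored.csv")
valid = valid.sort_values("prob_good", ascending=False)            # step (a)

valid["net_profit"] = valid["actual_good"].map({1: 100, 0: -500})  # step (b), from Table 1.4
valid["cum_net_profit"] = valid["net_profit"].cumsum()             # step (c)

best = valid["cum_net_profit"].idxmax()                            # step (d)
print("maximum cumulative net profit:", valid.loc[best, "cum_net_profit"])
print("probability cutoff at the maximum:", valid.loc[best, "prob_good"])  # step (e)
```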

13.3 Textile Cooperatives

Background
The traditional handloom segment of the textile industry in India employs 10 million people and accounts for 30% of total textile production. It has a distinct, non-industrial character. Its products, particularly saris, are created by artisans who have learned the art of making textiles from their forefathers. These products have very individual characteristics.
Each firm in the industry is a cooperative of families, most of which have been engaged in the production of these textiles for generations. Different families within the cooperatives undertake different production tasks such as spinning fiber, making natural colors (now not much in practice), dyeing, and weaving saris or fabric for other uses.
The handloom sector is protected by state governments, which grant exemptions from various types of taxes, and also award subsidies and developmental grants. Most state governments in India have also set up state-run corporations, which undertake training and marketing activities to support the handloom industry. This family-centered cooperative industry competes with large mills that produce a more uniform product.
Saris, the primary product of the handloom sector, are colorful flowing garments worn by women and made with a single piece of cloth six yards long and one yard wide. A sari has many characteristics; important ones include:
- Body (made up of warp and weft) shade - dark, light, medium or shining - colors, and design
- Border shade, color and design
- Border size - broad, medium or narrow
- Palav shade, color and design (the palav is the part of the sari that is most prominent, showing at the top and front when worn, and typically is the most decorated)
- Sari side - one-sided or double-sided
- Fabric - cotton or polyester (or many combinations of natural and synthetic fiber).
Tens of thousands of combinations are possible, and the decentralized nature of the industry means that there is tremendous variety in what is produced.
The price of a sari varies from 500 rupees to 10,000 rupees depending upon the fabric and intricacies of design. The average price is about 1500 rupees. Hand-made saris have a competitive advantage stemming from the very fact that they are a hand-made, craft product (in contrast to the more uniform garments mass produced by textile mills).
Cooperatives selling hand-made saris also face competitive disadvantages, chief among them:
(a) Greater cost of production, due to the use of labor-intensive technology and a highly decentralized production process

(b) Difficulty in coordinating demand with supply.
The decentralization leads to high supervisory and procurement costs. Per capita throughput in handlooms is lower than at mills, leading to higher costs of production. Further, decentralization leaves the production system unaware of consumer preferences. This results in stockpiling of finished goods at the sales outlets. In addition, the weavers are tradition-bound and lack the desire either to adopt new designs and products or to change to better techniques of production. The combined effect of all these factors is a mismatch between what is produced by the weavers and what is preferred by the consumers.
The government, too, has contributed to the sense of complacency in the handlooms sector. Non-moving stocks are disposed of at government-backed rebates of 20 percent, and so there is no pressure to change the status quo.
In spite of these disadvantages, the potential of the handlooms sector is substantial because of its intrinsic ability to produce a highly diversified product range.
A study was initiated at the Indian Institute of Management, Ahmedabad, India to find ways of improving the performance of this sector. The following areas were identified for the study:
(a) Develop suitable models for sales forecasting and monitoring based on consumer preferences
(b) Design methods to improve the flow of information between the marketing and production sub-systems
(c) Analyze the production sub-system to reduce its response time and achieve better quality of output
(d) Develop an optimal product mix for an organization and, if possible, an optimal product mix for the production centers
(e) Analyze the process of procurement, stocking, and distribution of raw materials to reduce cost and improve the quality of inputs.
Matching Supply and Demand
India is a large country with a great diversity of peoples, religions, languages and festivals (holidays), and offers a great diversity of demand for saris. Different categories of saris - shade, color, design and fabric combinations - are in demand in different parts of the country at different times of year, coinciding with the local holiday festivals.
If there were just a few types of saris produced, the problem of adjusting supply to demand would be relatively simple: the varieties that sell well would be produced in greater quantity, and those that do not sell well would be reduced or discontinued.
With tens of thousands of varieties being sold, however, this is not possible. A simple review of sales records will not reveal much information that is useful for adjusting production. Data mining might help in two ways:
1. Identifying important sari characteristics (shade, color, design and fabric combinations) that influence sales, and

160

13. Cases, nearly done

2. Providing a black box that can classify proposed designs into likely sellers and less likely
sellers.
Inaccurate forecasting results in three types of costs:
Inventory costs of storing saris for six months to a year until the next festival time
Transport and handling costs of moving saris to a different location
Losses from price discounting to clear inventories.
These costs average about 12% of the cost of production (compared to the normal profit margin
of 20%).
For forecasting purposes, a set of sari characteristics was identified and a coding system was
developed to represent these characteristics as shown in Table 1.
Table 1: Code List

ID            case number
SILKWT        silk weight
ZARIWT        zari weight
SILKWT Cat    categorical version of SILKWT
ZARIWT Cat    categorical version of ZARIWT
BODYCOL       body color; series of binary variables, 1 = body is that color
BRDCOL        border color; series of binary variables, 1 = border is that color
BODYSHD       body shade; 1 = pale, 4 = bright
BRDSHD        border shade; 1 = pale, 4 = bright
SARSIDE       1- or 2-sided sari; 1 = 1-sided, 2 = 2-sided
BODYDES       body design; series of binary variables, 1 = body is that design
BRDDES        border design; series of binary variables, 1 = border is that design
PALDES        pallav design; series of binary variables, 1 = pallav is that design
BRDSZ         border size
PALSZ         pallav size
SALE          1 = sale, 0 = no sale

Note: The colors and designs selected for the binary variables were those that were most
common.


Using Data Mining Techniques for Sales Forecasting


An experiment was conducted to see how data mining techniques can be used to develop sales forecasts for different categories of saris. A random sample of 3695 items in a store at a market (a town) was selected for study. Data were collected for the saris present in the store at the beginning of a 15-day festival-season sale period, indicating whether each item was sold or unsold during that period (see file Textiles.xls).
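To illustrate how such a classifier might be built outside of XLMiner, here is a minimal Python sketch that fits a classification tree to the coded sari data. It is only a sketch, not part of the case: the file name Textiles.xls comes from the case, but the exact column layout is assumed to follow the code list in Table 1 (reading an .xls file with pandas may also require the xlrd package).

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import classification_report

    # Load the coded sari data (column names assumed to follow Table 1).
    saris = pd.read_excel("Textiles.xls")

    X = saris.drop(columns=["ID", "SALE"])   # predictors: weight, color, shade and design codes
    y = saris["SALE"]                        # 1 = sold during the festival period, 0 = unsold

    # Hold out part of the data to judge how well the tree separates sellers from non-sellers.
    X_train, X_valid, y_train, y_valid = train_test_split(
        X, y, test_size=0.4, random_state=1, stratify=y)

    tree = DecisionTreeClassifier(max_depth=4, random_state=1)
    tree.fit(X_train, y_train)
    print(classification_report(y_valid, tree.predict(X_valid)))

The same pipeline would work with logistic regression or naive Bayes; the point is simply that the coded characteristics of Table 1 become the predictor matrix and SALE becomes the outcome.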
Assignment

1. Identify the characteristics of saris that are:


a. Most popular
b. Average interest
c. Slow moving
Establish and state your own criteria for the above definitions.
2. Predict the sale of saris, and the gain to the stores, using several data mining techniques.

13.4 Tayko Software Cataloger

Tayko is a software catalog firm that sells games and educational software. It started out as a software manufacturer and added third-party titles to its offerings. It has recently put together a
revised collection of items in a new catalog, which it is preparing to roll out in a mailing.
In addition to its own software titles, Tayko's customer list is a key asset. In an attempt to grow
its customer base, it has recently joined a consortium of catalog rms that specialize in computer
and software products.
The consortium affords members the opportunity to mail catalogs to names drawn from a pooled
list of customers. Members supply their own customer lists to the pool, and can withdraw an
equivalent number of names each quarter. Members are allowed to do predictive modeling on the
records in the pool so they can do a better job of selecting names from the pool.
Tayko has supplied its customer list of 200,000 names to the pool, which totals over 5,000,000
names, so it is now entitled to draw 200,000 names for a mailing. Tayko would like to select the
names that have the best chance of performing well, so it conducts a test - it draws 20,000 names
from the pool and does a test mailing of the new catalog to them.
This mailing yielded 1065 purchasers - a response rate of 0.053. Average spending was $103
for each of the purchasers, or $5.46 per catalog mailed. To optimize the performance of the data
mining techniques, it was decided to work with a stratified sample that contained equal numbers
of purchasers and non-purchasers. For ease of presentation, the data set for this case includes just
1000 purchasers and 1000 non-purchasers, an apparent response rate of 0.5. Therefore, after using
the data set to predict who will be a purchaser, we must adjust the predicted purchase rate back down by multiplying each case's probability of purchase by the ratio of the true response rate to the sample response rate: (1065/20,000)/0.5, or about 0.107.
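To make the arithmetic concrete, the following small Python sketch applies this adjustment; the numbers come from the paragraph above, while the function name is ours.

    # Over-sampling correction: the balanced sample has a 0.5 response rate, while the
    # test mailing's rate was 1065/20,000, so model probabilities are scaled back by
    # the ratio of the two rates (about 0.107).
    TRUE_RATE = 1065 / 20000               # about 0.053
    SAMPLE_RATE = 0.5
    ADJUSTMENT = TRUE_RATE / SAMPLE_RATE   # about 0.107

    def adjust_probability(p_model):
        """Scale a probability estimated on the balanced sample back to the mailing population."""
        return p_model * ADJUSTMENT

    print(round(adjust_probability(0.40), 3))   # a model score of 0.40 becomes roughly 0.043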


There are two response variables in this case. Purchase indicates whether or not a prospect
responded to the test mailing and purchased something. Spending indicates, for those who made
a purchase, how much they spent. The overall procedure in this case will be to develop two models.
One will be used to classify records as purchase or no purchase. The second will be used for
those cases that are classified as purchase, and will predict the amount they will spend.
The following table provides a description of the variables available in this case. A partition
variable is used because we will be developing two different models in this case and want to preserve
the same partition structure for assessing each model.
Codelist

Var. #   Variable Name           Description                                              Variable Type   Code
1        US                      Is it a US address?                                      binary          1: yes, 0: no
2 - 16   Source *                Source catalog for the record (15 possible sources)      binary          1: yes, 0: no
17       Freq.                   Number of transactions in last year at source catalog    numeric
18       last update days ago    How many days ago was last update to cust. record        numeric
19       1st update days ago     How many days ago was 1st update to cust. record         numeric
20       RFM%                    Recency-frequency-monetary percentile, as reported       numeric
                                 by source catalog (see CBC case)
21       Web order               Customer placed at least 1 order via web                 binary          1: yes, 0: no
22       Gender=mal              Customer is male                                         binary          1: yes, 0: no
23       Address is res          Address is a residence                                   binary          1: yes, 0: no
24       Purchase                Person made purchase in test mailing                     binary          1: yes, 0: no
25       Spending                Amount spent by customer ($) in test mailing             numeric
26       Partition               Variable indicating which partition the record will     alpha           t: training, v: validation,
                                 be assigned to                                                           s: test
[Figures omitted: the first few rows of the data - one showing the sequence number plus the first 14 variables, the other showing the remaining 11 variables for the same rows.]
Assignment

(1) Each catalog costs approximately $2 to mail (including printing, postage and mailing costs).
Estimate the gross profit that the firm could expect from its remaining 180,000 names if it
randomly selected them from the pool.
(2) Develop a model for classifying a customer as a purchaser or non-purchaser.
(a) Partition the data into training, validation and test sets on the basis of the partition variable, which has 800 t's, 700 v's and 500 s's (standing for training, validation and test data, respectively) randomly assigned to the cases.
(b) Using the best subset option in logistic regression, implement the full logistic regression model, select the best subset of variables, then implement a regression model with
just those variables to classify the data into purchasers and non-purchasers. (Logistic
regression is used because it yields an estimated probability of purchase, which is
required later in the analysis.)

(3) Develop a model for predicting spending among the purchasers


(a) Make a copy of the data sheet (call it data2), sort by the Purchase variable, and
remove the records where Purchase = 0 (the resulting spreadsheet will contain only
purchasers).
(b) Partition this data set into training and validation partitions on the basis of the partition
variable.
(c) Develop models for predicting spending using
(i) Multiple linear regression (use best subset selection)
(ii) Regression trees
(d) Choose one model on the basis of its performance with the validation data. MLR, with
a lift of about 2.7 in the first decile, was chosen.
(4) Return to the original test data partition. Note that this test data partition includes both
purchasers and non-purchasers. Note also that, although it contains the scoring of the chosen
classification model, we have not used this partition in any of our analysis to this point; thus
it will give an unbiased estimate of the performance of our models. It is best to make a copy
of the test data portion of this sheet to work with, since we will be adding analysis to it. This
copy is called Score Analysis.
(a) Copy the predicted probability of success (success = purchase) column from the classification of test data to this sheet.
(b) Score the chosen prediction model to this data sheet.
(c) Arrange the following columns so they are adjacent:
(i) Predicted probability of purchase (success)
(ii) Actual spending $
(iii) Predicted spending $
(d) Add a column for adjusted prob. of purchase by multiplying predicted prob. of
purchase by 0.107. This is to adjust for over-sampling the purchasers (see above).
(e) Add a column for expected spending [adjusted prob. of purchase * predicted spending]
(f) Sort all records on the expected spending column
(g) Calculate cumulative lift (= cumulative actual spending divided by the average spending that would result from random selection [each adjusted by the .107])
(5) Using this cumulative lift curve, estimate the gross profit that would result from mailing to the 180,000 names on the basis of your data mining models. (A code sketch of steps (4)(d)-(g) follows below.)
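The calculations in steps (4)(d)-(g) can also be expressed compactly in code. The following is a minimal Python sketch rather than the XLMiner worksheet workflow described above; the file name, sheet name and column names (prob_purchase from the classification model, pred_spending from the chosen prediction model, Spending for actual spending) are our assumptions, not names taken from the case workbook.

    import numpy as np
    import pandas as pd

    # Assumed file and sheet names; the sheet is taken to hold the Score Analysis records of step (4).
    score = pd.read_excel("Tayko.xls", sheet_name="Score Analysis")

    score["adj_prob"] = score["prob_purchase"] * 0.107                        # step (d): over-sampling adjustment
    score["expected_spending"] = score["adj_prob"] * score["pred_spending"]   # step (e)
    score = score.sort_values("expected_spending", ascending=False)           # step (f)

    # Step (g): cumulative lift = cumulative actual spending divided by what random selection
    # would yield on average. The 0.107 factor appears in both numerator and denominator
    # (as in the steps above) and therefore cancels in the ratio.
    actual = score["Spending"] * 0.107
    cum_actual = actual.cumsum().to_numpy()
    random_baseline = actual.mean() * np.arange(1, len(score) + 1)
    score["cum_lift"] = cum_actual / random_baseline

    print(score[["expected_spending", "cum_lift"]].head())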


Note : Tayko is a hypothetical company name, but the data in this case were supplied by a real
company that sells software through direct sales, and have been modified slightly for illustrative
purposes. While this firm did not participate in a catalog consortium, the concept is based upon
the Abacus Catalog Alliance. Details can be found at
http://www.doubleclick.com/us/solutions/marketers/database/catalog/.

13.5 IMRB: Segmenting Consumers of Bath Soap

Business Situation
The Indian Market Research Bureau (IMRB) is a leading market research agency that specializes in tracking consumer purchase behavior in consumer goods (both durable and non-durable).
IMRB tracks about 30 product categories (e.g. detergents, etc.) and, within each category,
about 60 - 70 brands. To track purchase behavior, IMRB constituted a panel of about 50,000 households in 105 cities and towns in India, covering about 80% of the Indian urban market. (In addition, there are 25,000 sample households selected in rural areas; however, we are working with only the urban market data.) The households are carefully selected using stratified sampling. The strata are defined on the basis of socio-economic status and the market (a collection of cities).
IMRB has both transaction data (each row is a transaction) and household data (each row is
a household), and, for the household data, maintains the following information:
Demographics of the households (updated annually).
Possession of durable goods (car, washing machine, etc.; updated annually); an affluence
index is computed from this information.
Purchase data of product categories and brands (updated monthly).
IMRB has two categories of clients: (1) Advertising agencies who subscribe to the database
services. They obtain updated data every month and use it to advise their clients on advertising
and promotion strategies. (2) Consumer goods manufacturers who monitor their market share
using the IMRB database.
Key Problems
IMRB has traditionally segmented markets on the basis of purchaser demographics. They
would like now to segment the market based on two key sets of variables more directly related to
the purchase process and to brand loyalty:
1. Purchase behavior (volume, frequency, susceptibility to discounts, and brand loyalty), and
2. Basis of purchase (price, selling proposition)
Doing so would allow IMRB to gain information about what demographic attributes are associated with different purchase behaviors and degrees of brand loyalty, and more effectively deploy
promotion budgets.

168

13. Cases, nearly done

The better and more effective market segmentation would enable IMRB's clients to design more cost-effective promotions targeted at appropriate segments. Thus, multiple promotions could be launched, each targeted at a different market segment at a different time of the year. This would result in a more cost-effective allocation of the promotion budget across market segments. It would also enable IMRB to design more effective customer reward systems and thereby increase brand loyalty.
Data
The data in this sheet profile each household - each row contains the data for one household.
Member Identification
Member id          Unique identifier for each household

Demographics
SEC                1 - 5 categories     Socio-economic class (1 = high, 5 = low)
FEH                1 - 3 categories     Food eating habits (1 = vegetarian, 2 = veg. but eat eggs, 3 = non-veg., 0 = not specified)
MT                                      Native language (see table in worksheet)
SEX                1: male, 2: female   Sex of homemaker
AGE                                     Age of homemaker
EDU                1 - 9 categories     Education of homemaker (1 = minimum, 9 = maximum)
HS                 1 - 9                Number of members in the household
CHILD              1 - 4 categories     Presence of children in the household
CS                 1 - 2                Television availability (1 = available, 2 = not available)
Affluence Index                         Weighted value of durables possessed


Summarized Purchase Data

Purchase summary of the household over the period:
No. of Brands            Number of brands purchased
Brand Runs               Number of instances of consecutive purchase of brands
Total Volume             Sum of volume
No. of Trans             Number of purchase transactions; multiple brands purchased in a month are counted as separate transactions
Value                    Sum of value
Trans / Brand Runs       Avg. transactions per brand run
Vol/Tran                 Avg. volume per transaction
Avg. Price               Avg. price of purchase

Purchase within promotion:
Pur Vol No Promo - %     Percent of volume purchased under no promotion
Pur Vol Promo 6 %        Percent of volume purchased under Promotion Code 6
Pur Vol Other Promo %    Percent of volume purchased under other promotions

Brand-wise purchase:
Br. Cd. (57, 144), 55, 272, 286, 24, 481, 352, 5 and 999 (others)
                         Percent of volume purchased of the brand

Price-category-wise purchase:
Price Cat 1 to 4         Percent of volume purchased under the price category

Selling-proposition-wise purchase:
Proposition Cat 5 to 15  Percent of volume purchased under the product proposition category

Measuring Brand Loyalty


Several variables in this case measure aspects of brand loyalty. The number of different
brands purchased by the customer is one measure. However, a consumer who purchases one or two
brands in quick succession and then settles on a third for a long streak is different from a consumer
who constantly switches back and forth among three brands. So, how often customers switch from
one brand to another is another measure of loyalty. Yet a third perspective on the same issue is
the proportion of purchases that go to different brands - a consumer who spends 90% of his or
her purchase money on one brand is more loyal than a consumer who spends more equally among
several brands.
All three of these components can be measured with the data in the purchase summary worksheet.
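As a rough illustration, the following Python sketch derives the three loyalty perspectives with pandas rather than in the spreadsheet. The file name and the exact column labels (taken here from the descriptions above, including the "Br. Cd." prefix for the brand-share columns) are assumptions and may not match the actual worksheet headers.

    import pandas as pd

    hh = pd.read_excel("IMRB_SummaryData.xls")   # assumed file name; one row per household

    # Columns holding percent of volume by brand code (Br. Cd. 57/144, 55, 272, ..., 999).
    brand_share_cols = [c for c in hh.columns if c.startswith("Br. Cd.")]

    loyalty = pd.DataFrame({
        "n_brands": hh["No. of Brands"],                          # breadth: how many brands are bought
        "trans_per_run": hh["No. of Trans"] / hh["Brand Runs"],   # persistence: how long a brand is kept
        "max_brand_share": hh[brand_share_cols].max(axis=1),      # concentration: share of the favorite brand
    })

The max_brand_share column is one candidate for the single derived variable suggested in Note 2 below: it treats a household loyal to brand A the same as one loyal to brand B.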
Assignments

1. Use k-means clustering to identify clusters of households based on


a. The variables that describe purchase behavior (including brand loyalty).
b. The variables that describe basis-for-purchase.
c. The variables that describe both purchase behavior and basis of purchase.
Note 1: How should k be chosen? Think about how the clusters would be used. It is likely that the marketing efforts would support 2-5 different promotional approaches. (A k-means sketch that compares several values of k appears after assignment 3.)
Note 2: How should the percentages of total purchases accounted for by the various brands be treated? Isn't a customer who buys only brand A just as loyal as a customer who buys only brand B? What will be the effect on any distance measure of using the brand-share variables as is? Consider using a single derived variable, such as the maximum brand share computed in the sketch above.
2. Select what you think is the best segmentation and comment on the characteristics (demographic, brand loyalty and basis-for-purchase) of these clusters. (This information would be
used to guide the development of advertising and promotional campaigns.)


3. Develop a model that classifies the data into these segments. Since this information would most likely be used in targeting direct-mail promotions, it would be useful to select a market segment that would be defined as a success in the classification model.
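The k-means sketch referred to in Note 1 above is given here; it continues from the hh and loyalty objects of the earlier sketch. The variables are standardized first because k-means is distance-based, and the particular set of purchase-behavior columns chosen is an assumption, not a prescription.

    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    behavior = hh[["No. of Brands", "Brand Runs", "No. of Trans", "Total Volume",
                   "Value", "Avg. Price", "Pur Vol No Promo - %"]].copy()
    behavior["max_brand_share"] = loyalty["max_brand_share"]   # derived loyalty variable (Note 2)

    X = StandardScaler().fit_transform(behavior)

    # Try the range of k that the promotional budget could plausibly support (Note 1).
    for k in range(2, 6):
        km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
        print(k, round(km.inertia_, 1))   # within-cluster sum of squares, for comparing values of k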
APPENDIX
Although they are not used in the assignment, two additional data sets are provided that were
used in the derivation of the summary data.
IMRB Purchase Data is a transaction database, where each row is a transaction. Multiple
rows in this data set corresponding to a single household were consolidated into a single household
row in IMRB Summary Data.
The Durables sheet in IMRB SummaryData contains the information used to calculate the affluence index. Each row is a household, and each column represents a durable consumer good. A 1 in a column indicates that the household possesses the durable; a 0 indicates it does not. This indicator is multiplied by the weight assigned to the durable item (so, for example, a weighted value of 5 indicates possession of a durable whose weight is 5). The sum of the weighted values of all the durables possessed equals the Affluence Index.
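A tiny sketch of that calculation, with made-up households and weights (both purely hypothetical, to show the arithmetic):

    import pandas as pd

    # Hypothetical 0/1 possession indicators for two households and hypothetical weights.
    owned = pd.DataFrame({"car": [1, 0], "washing_machine": [1, 1], "television": [0, 1]})
    weights = pd.Series({"car": 5, "washing_machine": 4, "television": 3})

    # Affluence Index = sum over durables of (possession indicator x durable weight).
    affluence_index = (owned * weights).sum(axis=1)
    print(affluence_index.tolist())   # [9, 7]: 5 + 4 + 0 and 0 + 4 + 3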

Index

activation function, 63
affinity analysis, 7, 111
algorithm, 4
antecedent, 111
Apriori algorithm, 114
artificial intelligence, 4
association rules, 7, 111
attribute, 4
average linkage, 131
backward elimination, 48
backward propagation, 69
Bayes formula, 27
Bayes risk, 30
Bayes Theorem, 106
Bayesian classifier, 107
bias, 43, 63
bias-variance trade-off, 47
CART, 73
case, 4
categorical variables, 12
CHAID, 80
classification, 7
Classification Trees, 73
classifier, 25
cluster analysis, 127
complete linkage, 131
confidence, 4
consequent, 111
continuous variables, 12
data marts, 6
data reduction, 8
data warehouse, 6
decision node, 79
dendrogram, 131
dependent variable, 4
dimension reduction, 106
dimensionality, 119
dimensionality curse, 106
disjoint, 111
distance measures, 128
effective number of parameters, 104
epoch, 68
estimation, 4
Euclidean distance, 101
Euclidian distance, 128
factor selection, 106
farthest neighbor clustering, 131
feature, 4
feature extraction, 106
field, 4
Forward selection, 47
group average clustering, 131
holdout data, 16
homoskedasticity, 40
input variable, 4
k-means clustering, 133
k-Nearest Neighbor algorithm, 101
leaf node, 79
likelihood function, 59
logistic regression, 51
machine learning, 4
majority decision rule, 102
Mallows CP, 50
market basket analysis, 111
maximum likelihood, 59
Minimum Error, 26
misclassification, 28
missing data, 15
model, 4, 11-13
momentum parameter, 71
Naive Bayes, 108
nearest neighbor, 101
nearest neighbor clustering, 130
neural nets, 63
Newton Raphson method, 61
normalizing data, 15, 124
numeric variables, 12
outcome variable, 4
outliers, 10, 19
output variable, 4
overfitting, 3, 12, 16
oversampling, 11
partitions, 16
pattern, 4
prediction, 4, 7
Principal Components Analysis, 119
probability, 27
pruning, 80
R²adj in regression, 50
random sampling, 10
record, 4, 14
recursive partitioning, 73
regression, 39
Regression Trees, 73
row, 4
sample, 3, 4
score, 4
sigmoid function, 66
similarity measures, 136
single linkage, 130
squashing function, 63
standardizing data, 124
step-wise regression, 48
steps in data mining, 10
stratified sampling, 30
subset selection in linear regression, 43
supervised learning, 5, 8
terabyte, 5
terminal node, 79
test data, 10
test partition, 16
text variables, 12
training data, 8
training partition, 16
Triage strategy, 36
unsupervised learning, 5, 9
validation partition, 16
variable, 5
weighted Euclidian distance, 128
