Charles Elkan
elkan@cs.ucsd.edu
Contents
1 Introduction
1.1 Limitations of predictive analytics
1.2 Overview
3 Introduction to Rapidminer
3.1 Standardization of features
3.2 Example of a Rapidminer process
3.3 Other notes on Rapidminer
9 Recommender systems
9.1 Applications of matrix approximation
9.2 Measures of performance
9.3 Additive models
9.4 Multiplicative models
9.5 Combining models by fitting residuals
9.6 Further issues
10 Text mining
10.1 The bag-of-words representation
10.2 The multinomial distribution
10.3 Training Bayesian classifiers
10.4 Burstiness
10.5 Discriminative classification
10.6 Clustering documents
Bibliography
Chapter 1
Introduction
There are many definitions of data mining. We shall take it to mean the application
of learning algorithms and statistical methods to real-world datasets. There are
numerous domains in science, engineering, business, and elsewhere where
data mining is useful. We shall focus on applications that are related to business, but
the methods that are most useful are mostly the same for applications in science or
engineering.
The focus will be on methods for making predictions. For example, the available
data may be a customer database, along with labels indicating which customers failed
to pay their bills. The goal will then be to predict which other customers might fail
to pay in the future. In general, analytics is a newer name for data mining. Predictive
analytics indicates a focus on making predictions.
The main alternative to predictive analytics can be called descriptive analytics.
This area is often also called “knowledge discovery in data” or KDD. In a nutshell,
the goal of descriptive analytics is to discover patterns in data. Finding patterns is
often fascinating and sometimes highly useful, but in general it is harder to obtain
direct benefit from descriptive analytics than from predictive analytics. For example,
suppose that customers of Whole Foods tend to be liberal and wealthy. This pattern
may be noteworthy and interesting, but what should Whole Foods do with the find-
ing? Often, the same finding suggests two courses of action that are both plausible,
but contradictory. In such a case, the finding is really not useful in the absence of
additional knowledge. For example, perhaps Whole Foods should direct its marketing
towards additional wealthy and liberal people. Or perhaps that demographic is
saturated, and it should aim its marketing at a different, currently less tapped group
of people?
In contrast, predictions can typically be used directly to make decisions that
maximize benefit to the decision-maker. For example, customers who are more likely
not to pay in the future can have their credit limit reduced now. It is important to
understand the difference between a prediction and a decision. Data mining lets us
make predictions, but predictions are useful to an agent only if they allow the agent
to make decisions that have better outcomes.
Some people may feel that the focus in this course on maximizing profit is dis-
tasteful or disquieting. After all, maximizing profit for a business may be at the
expense of consumers, and may not benefit society at large. There are several re-
sponses to this feeling. First, maximizing profit in general is maximizing efficiency.
Society can use the tax system to spread the benefit of increased profit. Second, in-
creased efficiency often comes from improved accuracy in targeting, which benefits
the people being targeted. Businesses have no motive to send advertising to people
who will merely be annoyed and not respond.
Fico, the company behind the credit score, recently launched a service
that pre-qualifies borrowers for modification programmes using their in-
house scoring data. Lenders pay a small fee for Fico to refer potential
candidates for modifications that have already been vetted for inclusion
in the programme. Fico can also help lenders find borrowers that will
best respond to modifications and learn how to get in touch with them.
It is hard to see how this could be a successful application of data mining, because
it is hard to see how a useful labeled training set could exist. The target concept
is “borrowers that will best respond to modifications.” From a lender’s perspective
(and Fico works for lenders, not borrowers), such a borrower is one who would not pay
under his current contract, but who would pay if given a modified contract. Especially
in 2009, lenders had no long historical experience with offering modifications to
borrowers, so Fico did not have relevant data. Moreover, the definition of the target
is based on a counterfactual, that is, on reading the minds of borrowers. Data mining
cannot read minds.
For a successful data mining application, the actions to be taken based on pre-
dictions need to be defined clearly and to have reliable profit consequences. The
difference between a regular payment and a modified payment is often small, for ex-
ample $200 in the case described in the newspaper article. It is not clear that giving
people modifications will really change their behavior dramatically.
Also, for a successful data mining application, actions must not have major unintended
consequences. Here, modifications may change behavior in undesired ways.
A person requesting a modification is already thinking of not paying. Those who get
modifications may be motivated to return and request further concessions.
Additionally, for predictive analytics to be successful, the training data must be
representative of the test data. Typically, the training data come from the past, while
the test data arise in the future. If the phenomenon to be predicted is not stable
over time, then predictions are likely not to be useful. Here, changes in the general
economy, in the price level of houses, and in social attitudes towards foreclosures,
are all likely to change the behavior of borrowers in the future, in systematic but
unknown ways.
Last but not least, for a successful application it helps if the consequences of
actions are essentially independent for different examples. This may not be the case
here. Rational borrowers who hear about others getting modifications will try to
make themselves appear to be candidates for a modification. So each modification
generates a cost that is not restricted to the loss incurred with respect to the individual
getting the modification.
An even clearer example of an application of predictive analytics that is unlikely
to succeed is learning a model to predict which persons will commit a major
terrorist act. There are so few positive training examples that statistically reliable
patterns cannot be learned. Moreover, intelligent terrorists will take steps not to fit in
with patterns exhibited by previous terrorists [Jonas and Harper, 2006].
1.2 Overview
In this course we shall only look at methods that have state-of-the-art accuracy, that
are sufficiently simple and fast to be easy to use, and that have well-documented
successful applications. We will tread a middle ground between focusing on theory
at the expense of applications, and understanding methods only at a cookbook level.
Often, we shall not look at multiple methods for the same task, when there is one
method that is at least as good as all the others from most points of view. In particular,
for classifier learning, we will look at support vector machines (SVMs) in detail. We
will not examine alternative classifier learning methods such as decision trees, neural
networks, boosting, and bagging. All these methods are excellent, but it is hard to
identify clearly important scenarios in which they are definitely superior to SVMs.
We may also look at random forests, a nonlinear method that is often superior to
linear SVMs, and which is widely used in commercial applications nowadays.
Chapter 2
Predictive analytics in general
This chapter explains supervised learning, linear regression, and data cleaning and
recoding.
a column vector of y values. The cardinality of the training set is n, while its dimen-
sionality is p. We use the notation xij for the value of feature number j of example
number i. The label of example i is yi . True labels are known for training examples,
but not for test examples.
ues, However, in many applications, both these approaches eliminate too much useful
training data.
Also, the fact that a particular feature is missing may itself be a useful predictor.
Therefore, it is often beneficial to create an additional binary feature that is 0 for
missing and 1 for present. If a feature with missing values is retained, then it is
reasonable to replace each missing value by the mean or mode of the non-missing
values. This process is called imputation. More sophisticated imputation procedures
exist, but they are not always better.
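The imputation step just described can be sketched in a few lines. This is a minimal illustration using NumPy, not a prescribed implementation; the function name `impute_with_indicator` and the use of NaN to mark missing values are assumptions made for the example.

```python
import numpy as np

def impute_with_indicator(column):
    """Mean-impute one numerical feature and build a presence indicator.

    Assumption for this sketch: missing values are encoded as NaN.
    """
    values = np.asarray(column, dtype=float)
    present = ~np.isnan(values)              # True where a value was observed
    mean = values[present].mean()            # mean of the non-missing values
    imputed = np.where(present, values, mean)
    indicator = present.astype(float)        # 0.0 for missing, 1.0 for present
    return imputed, indicator

imputed, indicator = impute_with_indicator([2.0, np.nan, 4.0, 6.0])
```

In practice the mean should be computed on the training set only and then reused unchanged for test examples.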
Some training algorithms can only handle categorical features. For these, features
that are numerical can be discretized. The range of the numerical values is partitioned
into a fixed number of intervals that are called bins. The word “partitioned” means
that the bins are exhaustive and mutually exclusive, i.e. non-overlapping. One can
set boundaries for the bins so that each bin has equal width, i.e. the boundaries are
regularly spaced, or one can set boundaries so that each bin contains approximately
the same number of training examples, i.e. the bins are “equal count.” Each bin is
given an arbitrary name. Each numerical value is then replaced by the name of the
bin in which the value lies. It often works well to use ten bins.
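The two ways of setting bin boundaries can be compared with a short sketch. It assumes NumPy; the helper names are hypothetical, and integer bin indices stand in for the arbitrary bin names.

```python
import numpy as np

def equal_width_edges(values, n_bins=10):
    """Interior boundaries that are regularly spaced over the observed range."""
    lo, hi = np.min(values), np.max(values)
    return np.linspace(lo, hi, n_bins + 1)[1:-1]

def equal_count_edges(values, n_bins=10):
    """Interior boundaries at quantiles, so bins hold about equally many examples."""
    quantiles = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    return np.quantile(values, quantiles)

def assign_bins(values, edges):
    """Replace each numerical value by the index of the bin in which it lies."""
    return np.digitize(values, edges)

values = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
width_bins = assign_bins(values, equal_width_edges(values, n_bins=2))
count_bins = assign_bins(values, equal_count_edges(values, n_bins=2))
```

With one outlier at 100, equal-width binning puts four of the five values into the same bin, while equal-count binning splits them more evenly. Boundaries should be chosen from training data and then applied unchanged to test data.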
Other training algorithms can only handle real-valued features. For these, cat-
egorical features must be made numerical. The values of a binary feature can be
recoded as 0.0 or 1.0. It is conventional to code “false” or “no” as 0.0, and “true” or
“yes” as 1.0. Usually, the best way to recode a feature that has k different categorical
values is to use k real-valued features. For the jth categorical value, set the jth of
these features equal to 1.0 and set all k − 1 others equal to 0.0 (see footnote 1).
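As an illustration of this recoding, the following sketch builds the k real-valued features for a small categorical feature. The function `one_hot` and its `drop_last` option are hypothetical names; `drop_last=True` gives the variant with only k − 1 features, in which the last category is coded as all zeros.

```python
import numpy as np

def one_hot(values, categories, drop_last=False):
    """Recode a categorical feature as 0.0/1.0 columns, one per category."""
    index = {category: j for j, category in enumerate(categories)}
    encoded = np.zeros((len(values), len(categories)))
    for i, value in enumerate(values):
        encoded[i, index[value]] = 1.0       # jth feature is 1.0, the others stay 0.0
    return encoded[:, :-1] if drop_last else encoded

colors = ["red", "green", "blue", "green"]
full = one_hot(colors, ["red", "green", "blue"])
reduced = one_hot(colors, ["red", "green", "blue"], drop_last=True)
```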
Categorical features with many values (say, over 20) are often difficult to deal
with. Typically, human intervention is needed to recode them intelligently. For example,
zipcodes may be recoded as just their first one or two digits, since these
indicate meaningful regions of the United States. If one has a large dataset with
many examples from each of the 50 states, then it may be reasonable to leave this
as a categorical feature with 50 values. Otherwise, it may be useful to group small
states together in sensible ways, for example to create a New England group for MA,
CT, VT, ME, NH.
An intelligent way to recode discrete predictors is to replace each discrete value
by the mean of the target conditioned on that discrete value. For example, if the
average label value is 20 for men versus 16 for women, these values could replace
the male and female values of a variable for gender. This idea is especially useful as
a way to convert a discrete feature with many values, for example the 50 U.S. states,
into a useful single numerical feature.
1
The ordering of the values, i.e. which value is associated with j = 1, etc., is arbitrary. Mathematically
it is preferable to use only k − 1 real-valued features. For the last categorical value, set all k − 1
features equal to 0.0. For the jth categorical value where j < k, set the jth feature value to 1.0 and set
all k − 2 others equal to 0.0.
However, as just explained, the standard way to recode a discrete feature with
m values is to introduce m − 1 binary features. With this standard approach, the
training algorithm can learn a coefficient for each new feature that corresponds to an
optimal numerical value for the corresponding discrete value. Conditional means are
likely to be meaningful and useful, but they will not yield better predictions than the
coefficients learned in the standard approach. A major advantage of the conditional-
means approach is that it avoids an explosion in the dimensionality of training and
test examples.
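The conditional-means recoding can be sketched as follows, reusing the gender example with averages 20 and 16. The function names are hypothetical, and falling back to the overall training mean for a value never seen in training is an assumption of this sketch, not something the text prescribes.

```python
import numpy as np

def conditional_means(categories, targets):
    """Mean of the target for each discrete value, computed on training data."""
    sums, counts = {}, {}
    for category, y in zip(categories, targets):
        sums[category] = sums.get(category, 0.0) + y
        counts[category] = counts.get(category, 0) + 1
    return {category: sums[category] / counts[category] for category in sums}

def encode(categories, means, default):
    """Replace each discrete value by its conditional mean (default if unseen)."""
    return np.array([means.get(category, default) for category in categories])

train_gender = ["male", "female", "male", "female"]
train_label = [22.0, 15.0, 18.0, 17.0]
means = conditional_means(train_gender, train_label)
overall = float(np.mean(train_label))        # assumed fallback for unseen values
encoded = encode(["female", "male", "unknown"], means, default=overall)
```

However many distinct values the discrete feature has, the result is a single numerical column, which is the dimensionality advantage noted above.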
Mixed types.
Sparse data.
Normalization. After conditional-mean new values have been created, they can
be scaled to have zero mean and unit variance in the same way as other features.
$$y = b_0 + b_1 x_1 + b_2 x_2 + \ldots + b_p x_p.$$
The righthand side above is called a linear function of x. The linear function is
defined by its coefficients b0 to bp . These coefficients are the output of the data
mining algorithm.
The coefficient b0 is called the intercept. It is the value of y predicted by the
model if xi = 0 for all i. Of course, it may be completely unrealistic that all features
xi have value zero. The coefficient bi is the amount by which the predicted y value
increases if xi increases by 1, if the value of all other features is unchanged. For
example, suppose xi is a binary feature where xi = 0 means female and xi = 1
means male, and suppose bi = −2.5. Then the predicted y value for males is lower
by 2.5, everything else being held constant.
Suppose that the training set has cardinality n, i.e. it consists of n examples of
the form $\langle x_i, y_i \rangle$, where $x_i = \langle x_{i1}, \ldots, x_{ip} \rangle$. Let b be any set of coefficients. The
predicted value for $x_i$ is
$$\hat{y}_i = f(x_i; b) = b_0 + \sum_{j=1}^{p} b_j x_{ij}.$$
The semicolon in the expression $f(x_i; b)$ emphasizes that the vector $x_i$ is a variable
input, while b is a fixed set of parameter values. If we define $x_{i0} = 1$ for every i, then
we can write
$$\hat{y}_i = \sum_{j=0}^{p} b_j x_{ij}.$$
The objective function $\sum_i (y_i - \sum_j b_j x_{ij})^2$ is called the sum of squared errors, or
SSE for short. Note that during training the n different $x_i$ and $y_i$ values are fixed,
while the parameters b are variable.
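To make the roles of the fixed data and the variable parameters concrete, here is a minimal NumPy sketch that computes predictions and SSE for a given b, and then finds coefficients that minimize SSE. Using `np.linalg.lstsq` as the solver is an assumption of this sketch; the toy data are invented.

```python
import numpy as np

def predict(X, b):
    """Predicted values, using the convention x_i0 = 1 for the intercept b[0]."""
    X1 = np.column_stack([np.ones(len(X)), X])
    return X1 @ b

def sse(X, y, b):
    """Sum of squared errors for coefficients b on fixed data (X, y)."""
    return float(np.sum((y - predict(X, b)) ** 2))

def fit_least_squares(X, y):
    """Coefficients minimizing SSE, found with a standard least-squares solver."""
    X1 = np.column_stack([np.ones(len(X)), X])
    b, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return b

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])           # generated as y = 1 + 2x
b_hat = fit_least_squares(X, y)
```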
The optimal coefficient values b̂ are not defined uniquely if the number n of train-
ing examples is less than the number p of features. Even if n > p is true, the optimal
coefficients have multiple equivalent values if some features are themselves related
linearly. Here, “equivalent” means that the different sets of coefficients achieve the
same minimum SSE. For an intuitive example, suppose features 1 and 2 are height
and weight respectively. Suppose that x2 = 120 + 5(x1 − 60) = −180 + 5x1 ap-
proximately, where the units of measurement are pounds and inches. Then the same
model can be written in many different ways:
• y = b0 + b1 x1 + b2 x2
errors (SSE) plus a function that penalizes large values of the coefficients. A simple
penalty function of this type is $\sum_{j=1}^{p} b_j^2$. A parameter λ can control the relative
importance of the two objectives, namely SSE and penalty:
$$\hat{b} = \mathrm{argmin}_b \; \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \frac{1}{p} \sum_{j=1}^{p} b_j^2.$$
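The penalized objective above has a closed-form minimizer, which the following sketch computes. It assumes NumPy and synthetic data, and it leaves the intercept unpenalized because the penalty sums over j = 1 to p only; the function names are hypothetical.

```python
import numpy as np

def ridge_objective(X, y, b, lam):
    """(1/n) * SSE plus lam * (1/p) * sum of squared non-intercept coefficients."""
    y_hat = b[0] + X @ b[1:]
    return float(np.mean((y - y_hat) ** 2) + lam * np.mean(b[1:] ** 2))

def fit_ridge(X, y, lam):
    """Minimize ridge_objective by setting its gradient to zero."""
    n, p = X.shape
    X1 = np.column_stack([np.ones(n), X])
    D = np.eye(p + 1)
    D[0, 0] = 0.0                            # the intercept b_0 is not penalized
    A = X1.T @ X1 / n + (lam / p) * D
    return np.linalg.solve(A, X1.T @ y / n)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)
b_weak = fit_ridge(X, y, lam=0.0)            # ordinary least squares
b_strong = fit_ridge(X, y, lam=100.0)        # coefficients shrunk towards zero
```

Increasing λ trades a larger SSE for smaller coefficients, which is the balance between the two objectives that the parameter controls.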
2
HDL cholesterol is considered beneficial and is sometimes called “good” cholesterol. Source:
http://www.jerrydallal.com/LHSP/importnt.htm. Predictors have been reordered
here from most to least statistically significant, as measured by p-value.
From most to least statistically significant, the predictors are body mass index, the
log of total cholesterol, diastolic blood pressure, vitamin C level in blood, systolic
blood pressure, skinfold thickness, and age in years. (It is not clear what GLUM is.)
The example illustrates at least two crucial issues. First, if predictors are collinear,
then one may appear significant and the other not, when in fact both are significant or
both are not. Above, diastolic blood pressure is statistically significant, but systolic
is not. This may possibly be true for some physiological reason. But it may also be
an artifact of collinearity.
Second, a predictor may be practically important, and statistically significant,
but still useless for interventions. This happens if the predictor and the outcome
have a common cause, or if the outcome causes the predictor. Above, vitamin C
is statistically significant. But it may be that vitamin C is simply an indicator of a
generally healthy diet high in fruits and vegetables. If this is true, then merely taking
a vitamin C supplement will not cause an increase in HDL level.
A third crucial issue is that a correlation may disagree with other knowledge and
assumptions. For example, vitamin C is generally considered beneficial or neutral.
If lower vitamin C was associated with higher HDL, one would be cautious about
believing this relationship, even if the association was statistically significant.
with known labels. We train the classifier on the training set, apply it to the test set,
and then measure performance by comparing the predicted labels with the true labels
(which were not available to the training algorithm).
It is absolutely vital to measure the performance of a classifier on an independent
test set. Every training algorithm looks for patterns in the training data, i.e. corre-
lations between the features and the class. Some of the patterns discovered may be
spurious, i.e. they are valid in the training data due to randomness in how the train-
ing data was selected from the population, but they are not valid, or not as strong,
in the whole population. A classifier that relies on these spurious patterns will have
higher accuracy on the training examples than it will on the whole population. Only
accuracy measured on an independent test set is a fair estimate of accuracy on the
whole population. The phenomenon of relying on patterns that are strong only in the
training data is called overfitting. In practice it is an omnipresent danger.
Most training algorithms have some settings that the user can choose between.
For ridge regression the main algorithmic parameter is the degree of regularization
λ. Other algorithmic choices are which sets of features to use. It is natural to run
a supervised learning algorithm many times, and to measure the accuracy of the
function (classifier or regression function) learned with different settings. A set of
labeled examples used to measure accuracy with different algorithmic settings, in
order to pick the best settings, is called a validation set. If you use a validation set,
it is important to have a final test set that is independent of both the training set
and the validation set. For fairness, the final test set must be used only once. The
only purpose of the final test set is to estimate the true accuracy achievable with the
settings previously chosen using the validation set.
Dividing the available data into training, validation, and test sets should be done
randomly, in order to guarantee that each set is a random sample from the same distri-
bution. However, a very important real-world issue is that future real test examples,
for which the true label is genuinely unknown, may not be a sample from the same
distribution.
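The protocol of training, validation, and final test sets can be sketched as follows. The data are synthetic, the 60/20/20 split and the candidate values of the regularization strength are arbitrary choices for illustration, and `lam` here is an unscaled ridge penalty rather than the normalized form used earlier.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Least squares with penalty lam * sum(b_j^2) on the non-intercept coefficients."""
    X1 = np.column_stack([np.ones(len(X)), X])
    D = np.eye(X1.shape[1])
    D[0, 0] = 0.0
    return np.linalg.solve(X1.T @ X1 + lam * D, X1.T @ y)

def rmse(X, y, b):
    return float(np.sqrt(np.mean((y - (b[0] + X @ b[1:])) ** 2)))

rng = np.random.default_rng(0)
n = 300
X = rng.normal(size=(n, 5))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.normal(size=n)

# Random division into three disjoint sets.
order = rng.permutation(n)
train, valid, test = order[:180], order[180:240], order[240:]

# Choose the setting using the validation set only.
candidates = [0.01, 0.1, 1.0, 10.0, 100.0]
valid_rmse = {lam: rmse(X[valid], y[valid], fit_ridge(X[train], y[train], lam))
              for lam in candidates}
best_lam = min(valid_rmse, key=valid_rmse.get)

# Use the final test set once, with the setting already chosen.
final_b = fit_ridge(X[train], y[train], best_lam)
test_rmse = rmse(X[test], y[test], final_b)
```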
Quiz question
Suppose you are building a model to predict how many dollars someone will spend at
Sears. You know the gender of each customer, male or female. Since you are using
linear regression, you must recode this discrete feature as continuous. You decide
to use two real-valued features, x11 and x12 . The coding is a standard “one of n”
scheme, as follows:
gender x11 x12
male 1 0
female 0 1
Learning from a large training set yields the model
y = . . . + 15x11 + 75x12 + . . .
Dr. Roebuck says “Aha! The average woman spends $75, but the average man spends
only $15.”
Write your name below, and then answer the following three parts with one or
two sentences each:
(a) Explain why Dr. Roebuck’s conclusion is not valid. The model only predicts
spending of $75 for a woman if all other features have value zero. This may not be
true for the average woman. Indeed it will not be true for any woman, if features
such as “age” are not normalized.
(b) Explain what conclusion can actually be drawn from the numbers 15 and 75.
The conclusion is that if everything else is held constant, then on average a woman
will spend $60 more than a man. Note that if some feature values are systematically
different for men and women, then even this conclusion is not useful, because it is not
reasonable to hold all other feature values constant.
(c) Explain a desirable way to simplify the model. The two features x11 and
x12 are linearly related. Hence, the optimal model is not uniquely defined in the
absence of regularization. It would be good to eliminate one of these two features.
The expressiveness of the model would be unchanged.
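Parts (b) and (c) can be checked numerically. In the sketch below the intercept value 10 is hypothetical, since the quiz does not state the intercept; the point is that moving a constant between the intercept and the two gender coefficients leaves every prediction unchanged, so only the difference of 60 is determined by the data.

```python
import numpy as np

# Columns are x11 (male) and x12 (female); every row satisfies x11 + x12 = 1.
gender = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]])

def predict(intercept, b_male, b_female):
    return intercept + gender @ np.array([b_male, b_female])

original = predict(intercept=10.0, b_male=15.0, b_female=75.0)
shifted = predict(intercept=25.0, b_male=0.0, b_female=60.0)   # x11 effectively removed

# With an intercept column, the design matrix has 3 columns but only rank 2.
design = np.column_stack([np.ones(len(gender)), gender])
rank = int(np.linalg.matrix_rank(design))
```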
Quiz for April 6, 2010
Your name:
Suppose that you are training a model to predict how many transactions a credit card
customer will make. You know the education level of each customer. Since you are
using linear regression, you recode this discrete feature as continuous. You decide to
use two real-valued features, x37 and x38 . The coding is a “one of two” scheme, as
follows:
x37 x38
college grad 1 0
not college grad 0 1
Learning from a large training set yields the model
y = . . . + 5.5x37 + 3.2x38 + . . .
(a) Dr. Goldman concludes that the average college graduate makes 5.5 transactions.
Explain why Dr. Goldman’s conclusion is likely to be false.
The model only predicts 5.5 transactions if all other features, including the in-
tercept, have value zero. This may not be true for the average college grad. It will
certainly be false if features such as “age” are not normalized.
(b) Dr. Sachs concludes that being a college graduate causes a person to make 2.3
more transactions, on average. Explain why Dr. Sachs’ conclusion is likely to be
false also.
First, if any other feature has different values on average for college graduates and non-graduates,
for example income, then 5.5 − 3.2 = 2.3 is not the average difference in predicted
y value between groups. Said another way, it is unlikely to be reasonable to hold all
other feature values constant when comparing groups.
Second, even if 2.3 is the average difference, one cannot say that this difference
is caused by being a college graduate. There may be some other unknown common
cause, for example.
Linear regression assignment
This assignment is due at the start of class on Tuesday April 12, 2011. You should
work in a team of two. Choose a partner who has a different background from you.
Download the file cup98lrn.zip from
http://archive.ics.uci.edu/ml/databases/kddcup98/kddcup98.html. Read the associated
documentation. Load the data into Rapidminer (or other software for data mining such
as R). Select the 4843 records that have feature TARGET_B=1. Save these as a
native-format Rapidminer example set.
Now, build a linear regression model to predict the field TARGET_D as accurately
as possible. Use root mean squared error (RMSE) as the definition of error, and use
ten-fold cross-validation to measure RMSE. Do a combination of the following:
Do the steps above repeatedly in order to explore alternative ways of using the data.
The outcome should be the best possible model that you can find that uses 30 or
fewer of the original features.
To make your work more efficient, be sure to save the 4843 records in a format
that Rapidminer can load quickly. You can use three-fold cross-validation during
development to save time also. If you normalize all input features, and you use
strong regularization (ridge parameter 10^7 perhaps), then the regression coefficients
will indicate the relative importance of features.
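For readers working outside Rapidminer, k-fold cross-validated RMSE can be sketched with NumPy as below. The data are synthetic stand-ins, not the KDD Cup records, and the function names are hypothetical; pooling the squared errors of all held-out examples before taking the root is one reasonable convention among several.

```python
import numpy as np

def cross_validated_rmse(X, y, fit, predict, k=10, seed=0):
    """RMSE over all examples, each predicted by a model that never saw it."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    squared_errors = []
    for i in range(k):
        held_out = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])
        squared_errors.append((y[held_out] - predict(model, X[held_out])) ** 2)
    return float(np.sqrt(np.mean(np.concatenate(squared_errors))))

def fit_ls(X, y):
    X1 = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(X1, y, rcond=None)[0]

def predict_ls(b, X):
    return b[0] + X @ b[1:]

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = 5.0 + X @ np.array([1.0, -1.0]) + rng.normal(scale=0.5, size=200)
rmse_10 = cross_validated_rmse(X, y, fit_ls, predict_ls, k=10)
rmse_3 = cross_validated_rmse(X, y, fit_ls, predict_ls, k=3)   # faster during development
```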
The deliverable is a brief report that is formatted similarly to this assignment
description. Describe what you did that worked, and your results. Explain any as-
sumptions that you made, and any limitations on the validity or reliability of your
results. If you use Rapidminer, include a printout of your final process. Include your
final regression model. Do not speculate about future work, and do not explain ideas
that did not work. Write in the present tense. Organize your report logically, not
chronologically.
Comments on the regression assignment
It is typically useful to rescale predictors to have mean zero and variance one. However,
rescaling the target variable loses interpretability. Note that if all predictors
have mean zero, then the intercept of a linear regression model is the mean of the
target, $15.6243 here, assuming that the intercept is not regularized.
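The claim about the intercept follows from a short derivation, sketched here under the stated assumption that the penalty does not involve $b_0$. Setting the derivative of the objective with respect to $b_0$ to zero gives:

```latex
\frac{\partial}{\partial b_0} \sum_{i=1}^{n} \Big( y_i - b_0 - \sum_{j=1}^{p} b_j x_{ij} \Big)^2
  = -2 \sum_{i=1}^{n} \Big( y_i - b_0 - \sum_{j=1}^{p} b_j x_{ij} \Big) = 0
\quad\Longrightarrow\quad
b_0 = \frac{1}{n} \sum_{i=1}^{n} y_i \;-\; \sum_{j=1}^{p} b_j \Big( \frac{1}{n} \sum_{i=1}^{n} x_{ij} \Big)
    = \frac{1}{n} \sum_{i=1}^{n} y_i ,
```

where the last step uses the fact that each predictor has mean zero, so every term $\frac{1}{n} \sum_i x_{ij}$ vanishes.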
The assignment specifically asks you to report root mean squared error, RMSE.
One could also report mean squared error, MSE, but whichever is chosen should be
used consistently. In general, do not confuse readers by switching between multiple
performance measures without a good reason.
As is often the case, good performance can be achieved with a very simple model.
The most informative single feature is LASTGIFT, the dollar amount of the person’s
most recent gift. A model based on just this single feature achieves RMSE of $9.98.
In 2009, three of 11 teams achieved similar final RMSEs that were slightly better than
$9.00. The two teams that omitted LASTGIFT achieved RMSE worse than $11.00.
However, it is possible to do significantly better.
The assignment asks you to produce a final model based on at most 30 of the
original features. Despite this directive, it is not a good idea to begin by choosing
a subset of the 480 original features based on human intuition. The teams that did
this all omitted features that in fact would have made their final models considerably
better, including sometimes the feature LASTGIFT. As explained above, it is also
not a good idea to automatically eliminate features with missing values.
Rapidminer has operators that search for highly predictive subsets of variables.
These operators have two major problems. First, they are too slow to be used on a
large initial set of variables, so it is easy for human intuition to pick an initial set
that is bad. Second, these operators try a very large number of alternative subsets
of variables, and pick the one that performs best on some dataset. Because of the
high number of alternatives considered, this subset is likely to overfit substantially
the dataset on which it is best. For more discussion of this problem, see the section
on nested cross-validation in a later chapter.
Chapter 3
Introduction to Rapidminer
• Convert each nominal feature with k alternative values into k different binary
features.
• Optionally, drop all binary features with fewer than 100 examples for either
binary value.
• Convert each binary feature into a numerical feature with values 0.0 and 1.0.
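The three steps in the list above can be sketched outside Rapidminer as follows. This is an illustration in NumPy with a hypothetical function name and invented data; it is not a description of what the Rapidminer operators do internally.

```python
import numpy as np

def nominal_to_numeric(values, min_count=100):
    """One 0.0/1.0 column per nominal value, dropping columns that are too rare.

    A column is dropped if either of its two binary values occurs in fewer
    than min_count examples.
    """
    values = np.asarray(values)
    columns = {}
    for category in np.unique(values):
        column = (values == category).astype(float)   # numerical 0.0 / 1.0
        ones = int(column.sum())
        zeros = len(column) - ones
        if min(ones, zeros) >= min_count:
            columns[str(category)] = column
    return columns

states = ["CA"] * 300 + ["NY"] * 150 + ["VT"] * 5
features = nominal_to_numeric(states, min_count=100)
```

Here the column for VT is dropped because only 5 examples have value 1 for it.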
Z-scoring
Normalization
Cross-validation
XValidation
ApplierChain
OperatorChain
Applier
ModelApplier
sion “dat.” The easiest way to create these files is by clicking on “Start Data Loading
Wizard.” The first step with this wizard is to specify the file to read data from, the
character that begins comment lines, and the decimal point character. Ticking the
box for “use double quotes” can avoid some error messages.
In the next panel, you specify the delimiter that divides fields within each row
of data. If you choose the wrong delimiter, the data preview at the bottom will look
wrong. In the next panel, tick the box to make the first row be interpreted as field
names. If this is not true for your data, the easiest approach is to make it true outside
Rapidminer, with a text editor or otherwise. When you click Next on this panel, all rows
of data are loaded. Error messages may appear in the bottom pane. If there are no
errors and the data file is large, then Rapidminer hangs with no visible progress. The
same thing happens if you click Previous from the following panel. You can use a
CPU monitor to see what Rapidminer is doing.
The next panel asks you to specify the type of each attribute. The wizard guesses
this based only on the first row of data, so it often makes mistakes, which you have to
fix by trial and error. The following panel asks you to say which features are special.
The most common special choice is “label” which means that an attribute is a target
to be predicted.
Finally, you specify a file name that is used with “aml” and “dat” extensions to
save the data in Rapidminer format.
To keep just features with certain names, use the operator FeatureNameFilter. Let
the argument skip_features_with_name be .* and let the argument except_
features_with_name identify the features to keep. In our sample process, it is
(.*AMNT.*)|(.*GIFT.*)(YRS.*)|(.*MALE)|(STATE)|(PEPSTRFL)|(.*GIFT)
|(MDM.*)|(RFA_2.*).
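The same kind of name-based filtering can be sketched in Python. The pattern below is a shortened, hypothetical variant of the one above, the list of feature names is illustrative, and the sketch assumes that the pattern must match the whole feature name, which is how `re.fullmatch` behaves.

```python
import re

# Shortened, hypothetical pattern: keep amounts, gifts, STATE, and RFA_2 features.
keep_pattern = re.compile(r"(.*AMNT.*)|(.*GIFT.*)|(STATE)|(RFA_2.*)")

# Illustrative feature names only.
feature_names = ["LASTGIFT", "MAXRAMNT", "STATE", "AGE", "RFA_2F", "ZIP"]

# Skip every feature except those whose full name matches the pattern.
kept = [name for name in feature_names if keep_pattern.fullmatch(name)]
```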
In order to convert a discrete feature with k different values into k real-valued
0/1 features, two operators are needed. The first is Nominal2Binominal, while
the second is Nominal2Numerical. Note that the documentation of the latter
operator in Rapidminer is misleading: it cannot convert a discrete feature into multi-
ple numerical features directly. The operator Nominal2Binominal is quite slow.
Applying it to discrete features with more than 50 alternative values is not recom-
mended.
The simplest way to find a good value for an algorithm setting is to use the
XValidation operator nested inside the GridParameterOptimization op-
erator. The way to indicate nesting of operators is by dragging and dropping. First
create the inner operator subtree. Then insert the outer operator. Then drag the root
of the inner subtree, and drop it on top of the outer operator.
Chapter 4
Support vector machines
This chapter explains soft-margin support vector machines (SVMs), including linear
and nonlinear kernels. It also discusses detecting overfitting via cross-validation, and
preventing overfitting via regularization.
We have seen how to use linear regression to predict a real-valued label. Now we
will see how to use a similar model to predict a binary label. In later chapters, where
we think probabilistically, it will be convenient to assume that a binary label takes on
true values 0 or 1. However, in this chapter it will be convenient to assume that the
label y has true value either +1 or −1.
which is called the 0-1 loss function. The usual definition of accuracy uses this loss
function. However, it has undesirable properties. First, it loses information: it does
not distinguish between predictions f (x) that are almost right, and those that are
which is infinitely differentiable everywhere, and does not lose information when the
prediction f (x) is real-valued. However, this loss function says that the prediction
f (x) = 1.5 is as undesirable as f (x) = 0.5 when the true label is y = 1. Intuitively,
if the true label is +1, then a prediction with the correct sign that is greater than 1
should not be considered incorrect.
The following loss function, which is called hinge loss, satisfies the intuition just
suggested:
$$l(f(x), y) = \max\{0, 1 - y f(x)\}.$$
The hinge loss function deserves some explanation. Suppose the true label is y = 1.
Then the loss is zero as long as the prediction f (x) ≥ 1. The loss is positive, but less
than 1, if 0 < f (x) < 1. The loss is large, i.e. greater than 1, if f (x) < 0.
Using hinge loss is the first major insight behind SVMs. An SVM classifier f
is trained to minimize hinge loss. The training process aims to achieve predictions
f (x) ≥ 1 for all training instances x with true label y = +1, and to achieve predic-
tions f (x) ≤ −1 for all training instances x with y = −1. Overall, training seeks
to classify points correctly, and to distinguish clearly between the two classes, but it
does not seek to make predictions be exactly +1 or −1. In this sense, the training
process intuitively aims to find the best possible classifier, without trying to satisfy
any unnecessary additional objectives.
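As a small illustration of the behavior just described, the following sketch evaluates hinge loss for a few predictions when the true label is +1. The function names are hypothetical, and the tie-breaking rule for the 0-1 loss at f(x) = 0 is an assumption made only for this example.

```python
def hinge_loss(prediction, label):
    """Hinge loss max{0, 1 - y f(x)} for a label y in {+1, -1}."""
    return max(0.0, 1.0 - label * prediction)


def zero_one_loss(prediction, label):
    """0-1 loss: 1 if the sign of the prediction disagrees with the label.

    Assumption for this sketch: a prediction of exactly 0 counts as +1.
    """
    predicted_label = 1 if prediction >= 0 else -1
    return 0 if predicted_label == label else 1


# Predictions for an example whose true label is +1.
for f in (1.5, 1.0, 0.5, -0.5):
    print(f, hinge_loss(f, +1), zero_one_loss(f, +1))
```

The predictions 1.5 and 1.0 incur no hinge loss, 0.5 incurs a loss between 0 and 1 even though its sign is correct, and -0.5 incurs a loss greater than 1.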
4.2 Regularization
Given a set of training examples ⟨x_i, y_i⟩ for i = 1 to i = n, the total training loss
(sometimes called empirical loss) is the sum
\[ \sum_{i=1}^{n} l(f(x_i), y_i). \]
restricted, then we run the risk of underfitting the data. In general, we do not know in
advance what the best space F is for a particular training set. A possible solution to
this dilemma is to choose a flexible space F , but at the same time to impose a penalty
on the complexity of f . Let c(f ) be some real-valued measure of complexity. The
learning process then becomes to solve
\[ f = \operatorname{argmin}_{f \in F} \; \lambda c(f) + \frac{1}{n} \sum_{i=1}^{n} l(f(x_i), y_i). \]
Here, λ is a parameter that controls the relative strength of the two objectives, namely
to minimize the complexity of f and to minimize training error.
Suppose that the space of candidate functions is defined by a vector w ∈ R^d of
parameters, i.e. we can write f(x) = g(x; w) where g is some fixed function. In this
case we can define the complexity of each candidate function to be the norm of the
corresponding w. Most commonly we use the square norm:
\[ c(f) = \|w\|^2 = \sum_{j=1}^{d} w_j^2. \]
or the L1 norm
\[ c(f) = \|w\|_1 = \sum_{j=1}^{d} |w_j|. \]
unique. Moreover, the objective function is convex, so there are no local minima.
Note that d is the dimensionality of x, and w has the same dimensionality.
An equivalent way of writing the same optimization problem is
\[ w = \operatorname{argmin}_{w \in \mathbb{R}^d} \; \|w\|^2 + C \sum_{i=1}^{n} \max\{0, 1 - y_i (w \cdot x_i)\} \]
with C = 1/(nλ). Many SVM implementations let the user specify C as opposed
to λ. A small value for C corresponds to strong regularization, while a large value
corresponds to weak regularization. Intuitively, everything else being equal, a smaller
training set should require a smaller C value. However, useful guidelines are not
known for what the best value of C might be for a given dataset. In practice, one has
to try multiple values of C, and find the best value experimentally.
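The relationship C = 1/(nλ) can be checked numerically. The sketch below, using a made-up dataset and an arbitrary weight vector (both assumptions for illustration only), evaluates the two objective functions and confirms that one is the other divided by λ, so both have the same minimizer.

```python
def dot(u, v):
    """Dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def hinge(f, y):
    return max(0.0, 1.0 - y * f)


def objective_lambda(w, X, Y, lam):
    """lam * ||w||^2 + (1/n) * (sum of hinge losses)."""
    n = len(X)
    return lam * dot(w, w) + sum(hinge(dot(w, x), y) for x, y in zip(X, Y)) / n


def objective_C(w, X, Y, C):
    """||w||^2 + C * (sum of hinge losses)."""
    return dot(w, w) + C * sum(hinge(dot(w, x), y) for x, y in zip(X, Y))


# Hypothetical toy data: three instances in R^2 with labels +1 or -1.
X = [[1.0, 2.0], [-1.0, 0.5], [0.5, -1.5]]
Y = [+1, -1, -1]
lam = 0.1
C = 1.0 / (len(X) * lam)
w = [0.3, -0.2]

# The C-form objective equals the lambda-form objective divided by lam.
print(objective_C(w, X, Y, C), objective_lambda(w, X, Y, lam) / lam)
```

Because the two objectives differ only by the positive constant factor 1/λ, any w that minimizes one also minimizes the other.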
Mathematically, the optimization problem above is called an unconstrained pri-
mal formulation. There is an alternative formulation that is equivalent, and is useful
theoretically. This so-called dual formulation is
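\[ \max_{\alpha \in \mathbb{R}^n} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) \quad \text{subject to } 0 \le \alpha_i \le C. \]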
The primal and dual formulations are different optimization problems, but they have
the same unique solution. The solution to the dual problem is a coefficient αi for
each training example. Notice that the optimization is over R^n, whereas it is over R^d
in the primal formulation. The trained classifier is f(x) = w · x where the vector
\[ w = \sum_{i=1}^{n} \alpha_i y_i x_i. \]
This equation says that w is a weighted linear combination of the training instances
xi , where the weight of each instance is between 0 and C, and the sign of each
instance is its label yi . The training instances xi such that αi > 0 are called support
vectors. These instances are the only ones that contribute to the final classifier.
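To make the role of the coefficients concrete, here is a minimal sketch that builds w from given α values and lists the support vectors. The α values below are hypothetical numbers chosen for illustration; they are not the output of actually solving the dual problem.

```python
def weight_vector(alphas, labels, instances):
    """Compute w = sum_i alpha_i * y_i * x_i."""
    d = len(instances[0])
    w = [0.0] * d
    for a, y, x in zip(alphas, labels, instances):
        for j in range(d):
            w[j] += a * y * x[j]
    return w


def support_vector_indices(alphas, tol=1e-12):
    """Indices of training instances with alpha_i > 0."""
    return [i for i, a in enumerate(alphas) if a > tol]


# Hypothetical coefficients: the second instance has alpha = 0,
# so it does not influence w at all.
alphas = [0.5, 0.0, 0.5]
labels = [+1, +1, -1]
instances = [[1.0, 0.0], [2.0, 2.0], [0.0, 1.0]]

print(weight_vector(alphas, labels, instances))
print(support_vector_indices(alphas))
```

Changing or deleting the second instance would leave w unchanged, which is the sense in which only support vectors contribute to the final classifier.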
The constrained dual formulation is the basis of the training algorithms used by
standard SVM implementations, but recent research has shown that the unconstrained
primal formulation in fact leads to faster training algorithms, at least in the linear case
as above. Moreover, the primal version is easier to understand and easier to use as a
foundation for proving bounds on generalization error. However, the dual version is
easier to extend to obtain nonlinear SVM classifiers. This extension is based on the
idea of a kernel function.
4.4 Nonlinear kernels
says that the prediction for a test example x is a weighted average of the training
labels yi , where the weight of yi is the product of αi and the degree to which x is
similar to xi .
Consider a re-representation of instances x ↦ Φ(x) where the transformation Φ
is a function R^d → R^{d'}. In principle, we could use dot-product to define similarity
in the new space R^{d'}, and train an SVM classifier in this space. However, suppose
we have a function K(x_i, x_j) = Φ(x_i) · Φ(x_j). This function is all we need in order
to write down the optimization problem and its solution; we do not need to know the
function Φ in any explicit way. Specifically, let k_ij = K(x_i, x_j). The learning task
is to solve
\[ \max_{\alpha \in \mathbb{R}^n} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j k_{ij} \quad \text{subject to } 0 \le \alpha_i \le C. \]
The solution is
\[ f(x) = \Big[ \sum_{i=1}^{n} \alpha_i y_i \Phi(x_i) \Big] \cdot \Phi(x) = \sum_{i=1}^{n} \alpha_i y_i K(x_i, x). \]
This classifier is a weighted combination of at most n functions, one for each training
instance xi . These are called basis functions.
The result above says that in order to train a nonlinear SVM classifier, all that we
need is the kernel matrix of size n by n whose entries are k_ij. And in order to apply
the trained nonlinear classifier, all that we need is the kernel function K. The function
Φ never needs to be known explicitly. Using K exclusively in this way instead of Φ
is called the “kernel trick.” Practically, the function K can be much easier to deal
with than Φ, because K is just a mapping to R, rather than to a high-dimensional
space R^{d'}.
Footnote 1: If the instances have unit length, that is ||x_i|| = ||x_j|| = 1, then Euclidean distance and dot
product similarity are perfectly anticorrelated. For many applications of support vector machines, it
is advantageous to normalize features to have the same mean and variance. It can be advantageous
also to normalize instances so that they have unit length. However, in general one cannot have both
normalizations be true at the same time.
Intuitively, regardless of which kernel K is used, that is regardless of which re-
representation Φ is used, the complexity of the classifier f is limited, since it is
defined by at most n coefficients αi . The function K is fixed, so it does not increase
the intuitive complexity of the classifier.
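The kernel trick can be seen in a tiny example. The quadratic kernel K(x, z) = (x · z)^2 used below is not discussed in this chapter; it is chosen here only because its re-representation Φ is easy to write down, namely all d^2 pairwise products of features. The sketch checks that computing K directly gives the same number as the dot product in the larger space.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def phi(x):
    """Explicit re-representation: all pairwise products x_a * x_b (d^2 values)."""
    return [a * b for a in x for b in x]


def quadratic_kernel(x, z):
    """K(x, z) = (x . z)^2, computed without ever constructing phi."""
    return dot(x, z) ** 2


x = [1.0, 2.0, 3.0]
z = [0.5, -1.0, 2.0]

# Both lines print the same value (up to rounding): 20.25
print(dot(phi(x), phi(z)))
print(quadratic_kernel(x, z))
```

Evaluating K takes time proportional to d, while forming Φ(x) explicitly takes time proportional to d^2, which is one reason why working with K alone is attractive.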
One particular kernel is especially important. The radial basis function (RBF)
kernel is the function
\[ K(x_i, x) = \exp(-\gamma \|x_i - x\|^2) \]
where γ > 0 is an adjustable parameter. Using an RBF kernel, each basis function
K(x_i, x) is “radial” since it is based on the Euclidean distance ||x_i − x|| from x_i to x.
With an RBF kernel, the classifier f(x) = Σ_i α_i y_i K(x_i, x) is similar to a nearest-
neighbor classifier. Given a test instance x, its predicted label f(x) is a weighted
average of the labels y_i of the support vectors x_i. The support vectors that contribute
non-negligibly to the predicted label are those for which the Euclidean distance ||x_i −
x|| is small.
The RBF kernel can also be written
\[ K(x_i, x) = \exp(-\|x_i - x\|^2 / \sigma^2) \]
where σ² = 1/γ. This notation emphasizes the similarity with a Gaussian distribution.
A smaller value for γ, i.e. a larger value for σ², corresponds to basis functions
that are less peaked, i.e. that are significantly non-zero for a wider range of x values.
Using a larger value for σ² is similar to using a larger number k of neighbors in a
nearest neighbor classifier.
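The following sketch shows how an RBF classifier would be applied to a test instance, using the standard form exp(−γ||x_i − x||²) of the kernel. The support vectors and α values are hypothetical, chosen only to show the nearest-neighbor flavor of the prediction and the effect of γ.

```python
import math


def rbf_kernel(xi, x, gamma):
    """exp(-gamma * squared Euclidean distance between xi and x)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(xi, x))
    return math.exp(-gamma * squared_distance)


def decision_value(x, support_vectors, labels, alphas, gamma):
    """f(x) = sum_i alpha_i * y_i * K(x_i, x)."""
    return sum(a * y * rbf_kernel(sv, x, gamma)
               for sv, y, a in zip(support_vectors, labels, alphas))


# Hypothetical trained model with one support vector per class.
support_vectors = [[0.0, 0.0], [3.0, 3.0]]
labels = [+1, -1]
alphas = [1.0, 1.0]

print(decision_value([0.1, 0.1], support_vectors, labels, alphas, gamma=1.0))
print(decision_value([2.9, 2.9], support_vectors, labels, alphas, gamma=1.0))

# A smaller gamma gives a wider, less peaked basis function.
print(rbf_kernel([0.0, 0.0], [2.0, 0.0], gamma=1.0))
print(rbf_kernel([0.0, 0.0], [2.0, 0.0], gamma=0.1))
```

A test point close to the positive support vector gets a positive score and a point close to the negative one gets a negative score, since the distant support vector contributes almost nothing.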
3. Use cross-validation to find the best value for C for a linear kernel.
4. Use cross-validation to find the best values for C and γ for an RBF kernel.
5. Train on the entire available data using the parameter values found to be best
via cross-validation.
It is reasonable to start with C = 1 and γ = 1, and to try values that are smaller and
larger by factors of 2:
Quiz question
(a) Draw the hinge loss function for the case where the true label y = 1. Label the
axes clearly.
(b) Explain where the derivative is (i) zero, (ii) constant but not zero, or (iii) not
defined.
(c) For each of the three cases for the derivative, explain its intuitive implications
for training an SVM classifier.
Quiz for April 20, 2010
Your name:
Suppose that you have trained SVM classifiers carefully for a given learning task.
You have selected the settings that yield the best linear classifier and the best RBF
classifier. It turns out that both classifiers perform equally well in terms of accuracy
for your task.
(a) Now you are given many times more training data. After finding optimal settings
again and retraining, do you expect the linear classifier or the RBF classifier to have
better accuracy? Explain very briefly.
(b) Do you expect the optimal settings to involve stronger regularization, or weaker
regularization? Explain very briefly.
CSE 291 Assignment
This assignment is due at the start of class on Tuesday April 20. As before, you
should work in a team of two, choosing a partner with a different background. You
may keep the same partner as before, or change partners.
This project uses data published by Kevin Hillstrom, a well-known data mining
consultant. You can find the data at
http://cseweb.ucsd.edu/users/elkan/250B/HillstromData.csv. For a detailed description, see
http://minethatdata.com/blog/2008/03/minethatdata-e-mail-analytics-and-data.html.
For this assignment, use only the data for customers who are not sent any email
promotion. Your job is to train a good model to predict which customers visit the
retailer’s website. For now, you should ignore information about which customers
make a purchase, and how much they spend.
Build support vector machine (SVM) models to predict the target label as ac-
curately as possible. In the same general way as for linear regression, recode non-
numerical features as numerical, and transform features to improve their usefulness.
Train the best possible model using a linear kernel, and also the best possible model
using a radial basis function (RBF) kernel. The outcome should be the two most
accurate SVM classifiers that you can find, without overfitting or underfitting.
Decide thoughtfully which measure of accuracy to use, and explain your choice
in your report. Use nested cross-validation carefully to find the best settings for
training, and to evaluate the accuracy of the best classifiers as fairly as possible. In
particular, you should identify good values of the soft-margin C parameter for both
kernels, and of the width parameter for the RBF kernel.
For linear SVMs, the Rapidminer operator named FastLargeMargin is recom-
mended. Because it is fast, you can explore models based on a large number of
transformed features. Training nonlinear SVMs is much slower, but one can hope
that good performance can be achieved with fewer features.
As before, the deliverable is a well-organized, well-written, and well-formatted
report of about two pages. Describe what you did that worked, and your results. Ex-
plain any assumptions that you made, and any limitations on the validity or reliability
of your results. Explain carefully your nested cross-validation procedure.
Include a printout of your final Rapidminer process, and a description of your
two final models (not included in the two pages). Do not speculate about future
work, and do not explain ideas that do not work. Write in the present tense. Organize
your report logically, not chronologically.
Chapter 5
Classification with a rare class
In many data mining applications, the goal is to find needles in a haystack. That is,
most examples are negative but a few examples are positive. The goal is to iden-
tify the rare positive examples, as accurately as possible. For example, most credit
card transactions are legitimate, but a few are fraudulent. We have a standard bi-
nary classifier learning problem, but both the training and test sets are unbalanced.
In a balanced set, the fraction of examples of each class is about the same. In an
unbalanced set, some classes are rare while others are common.
A major difficulty with unbalanced data is that accuracy is not a meaningful measure
of performance. Suppose that 99% of credit card transactions are legitimate. Then
we can get 99% accuracy by predicting trivially that every transaction is legitimate.
On the other hand, suppose we somehow identify 5% of transactions for further in-
vestigation, and half of all fraudulent transactions occur among these 5%. Clearly the
identification process is doing something worthwhile and not trivial. But its accuracy
is only 95%.
For concreteness in further discussion, we will consider only the two class case,
and we will call the rare class positive. Rather than talk about fractions or percentages
of a set, we will talk about actual numbers (also called counts) of examples. It turns
out that thinking about actual numbers leads to less confusion and more insight than
thinking about fractions. Suppose the test set has a certain total size n, say n = 1000.
We can represent the performance of the trivial classifier as follows:
38 CHAPTER 5. CLASSIFICATION WITH A RARE CLASS
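                        predicted
                    positive   negative
truth   positive        0         10
        negative        0        990

The performance of the non-trivial classifier, which selects 5% of transactions for
investigation, is as follows:

                        predicted
                    positive   negative
truth   positive        5          5
        negative       45        945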
A table like the ones above is called a 2×2 contingency table. Above, rows corre-
spond to actual labels, while columns correspond to predicted labels. It would be
equally valid to swap the rows and columns. Unfortunately, there is no standard
convention about whether rows are actual or predicted. Remember that there is a
universal convention that in notation like xij the first subscript refers to rows while
the second subscript refers to columns.
A table like the ones above is also called a confusion matrix. For supervised
learning with discrete predictions, only a confusion matrix gives complete informa-
tion about the performance of a classifier. No single number that summarizes per-
formance, for example accuracy, can provide a full picture of the usefulness of a
classifier.
The four entries in a 2×2 contingency table have standard names. They are called
true positives tp, false positives fp, true negatives tn, and false negatives fn,
as follows:

                        predicted
                    positive   negative
truth   positive       tp         fn
        negative       fp         tn
The terminology true positive, etc., is standard, but as mentioned above, whether
columns correspond to predicted and rows to actual, or vice versa, is not standard.
As mentioned, the entries in a confusion matrix are counts, i.e. integers. The total
of the four entries is tp + tn + fp + fn = n, the number of test examples. Depending on
the application, different summaries are computed from these entries. In particular,
accuracy a = (tp + tn)/n. Assuming that n is known, three of the counts in a
confusion matrix can vary independently. This is the reason why no single number
The names “precision” and “recall” come from the field of information retrieval. In
other research areas, recall is often called sensitivity, while precision is sometimes
called positive predictive value.
Precision is undefined for a classifier that predicts that every test example is neg-
ative, that is when tp + fp = 0. Worse, precision can be misleadingly high for a
classifier that predicts that only a few test examples are positive. Consider the fol-
lowing confusion matrix:

                        predicted
                    positive   negative
truth   positive        1          9
        negative        0        990
Precision is 100% but 90% of actual positives are missed. F-measure is a widely
used metric that overcomes this limitation of precision. It is the harmonic average of
precision and recall:
\[ F = \frac{2}{1/p + 1/r} = \frac{2pr}{r + p}. \]
For the confusion matrix above F = 2 · 1 · 0.1/(1 + 0.1) = 0.18.
Besides accuracy, precision, recall, and F-measure, many other summaries are
also commonly computed from confusion matrices. Some of these are called speci-
ficity, false positive rate, false negative rate, positive and negative likelihood ratio,
kappa coefficient, and more. Rather than rely on agreement and understanding of the
definitions of these, it is preferable simply to report a full confusion matrix explicitly.
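The summaries discussed in this section are simple to compute from the four counts. The sketch below does so for the three confusion matrices shown in this chapter; the function name and the choice to return None for undefined precision are assumptions made for this example.

```python
def summarize(tp, fn, fp, tn):
    """Accuracy, precision, and recall from the four confusion-matrix counts.

    Precision is returned as None when no example is predicted positive.
    Assumes at least one actual positive, so recall is always defined.
    """
    n = tp + fn + fp + tn
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if tp + fp > 0 else None
    recall = tp / (tp + fn)
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Trivial classifier that predicts every transaction is legitimate.
print(summarize(tp=0, fn=10, fp=0, tn=990))
# Classifier that flags 5% of transactions and catches half of the fraud.
print(summarize(tp=5, fn=5, fp=45, tn=945))
# Classifier with perfect precision that misses 90% of actual positives.
print(summarize(tp=1, fn=9, fp=0, tn=990))
```

The trivial classifier has the higher accuracy of the first two, yet its recall is zero, which illustrates why accuracy alone is not meaningful when one class is rare.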
the threshold zero to obtain a discrete yes/no prediction. Confusion matrices cannot
represent information about the usefulness of underlying real-valued predictions. We
shall return to this issue below, but first we shall consider the issue of selecting a
threshold.
Setting the threshold determines the number tp + f p of examples that are pre-
dicted to be positive. In some scenarios, there is a natural threshold such as zero for
an SVM. However, even when a natural threshold exists, it is possible to change the
threshold to achieve a target number of positive predictions. This target number is
often based on a so-called budget constraint. Suppose that all examples predicted to
be positive are subjected to further investigation. An external limit on the resources
available will determine how many examples can be investigated. This number is a
natural target for the value f p + tp. Of course, we want to investigate those examples
that are most likely to be actual positives, so we want to investigate the examples x
with the highest prediction scores f (x). Therefore, the correct strategy is to choose
a threshold t such that we investigate all examples x with f (x) ≥ t and the number
of such examples is determined by the budget constraint.
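Choosing a threshold from a budget amounts to sorting the scores. The sketch below assumes the budget is given as a number of examples and that no two scores are tied at the cutoff; the scores themselves are made up.

```python
def threshold_for_budget(scores, budget):
    """Return t such that exactly `budget` examples have score >= t.

    Assumes 1 <= budget <= len(scores) and no tied scores at the cutoff.
    """
    ranked = sorted(scores, reverse=True)
    return ranked[budget - 1]


# Hypothetical real-valued predictions f(x) for five examples.
scores = [0.2, 1.5, -0.3, 0.9, -1.1]
t = threshold_for_budget(scores, budget=2)
flagged = [s for s in scores if s >= t]
print(t, flagged)
```

Note that the resulting threshold here is well above zero: the budget, not the natural threshold of the classifier, determines how many examples are predicted to be positive.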
Given that a fixed number of examples are predicted to be positive, a natural ques-
tion is how good a classifier is at capturing the actual positives within this number.
This question is answered by a measure called lift. The definition is a bit complex.
First, let the fraction of examples predicted to be positive be x = (tp + fp)/n. Next,
let the base rate of actual positive examples be b = (tp + fn)/n. Let the density of
actual positives within the predicted positives be d = tp/(tp + fp). Now, the lift at
x is defined to be the ratio d/b. Intuitively, a lift of 2 means that actual positives are
twice as dense among the predicted positives as they are among all examples.
Lift can be expressed as
\[ \frac{d}{b} = \frac{tp/(tp + fp)}{(tp + fn)/n} = \frac{tp \cdot n}{(tp + fp)(tp + fn)} = \frac{tp}{tp + fn} \cdot \frac{n}{tp + fp} = \frac{\text{recall}}{\text{prediction rate}}. \]
Lift is a useful measure of success if the number of examples that should be predicted
to be positive is determined by external considerations. However, budget constraints
should normally be questioned. In the credit card scenario, perhaps too many trans-
actions are being investigated and the marginal benefit of investigating some trans-
actions is negative. Or, perhaps too few transactions are being investigated; there
would be a net benefit if additional transactions were investigated. Making optimal
decisions about how many examples should be predicted to be positive is discussed
in the next chapter.
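As a worked example of the definition of lift, the sketch below computes it from confusion-matrix counts and also from a ranked list of scores. The function names, the 0/1 label coding, and the rounding rule for the number of predicted positives are assumptions for illustration.

```python
def lift_from_counts(tp, fn, fp, tn):
    """Lift d/b, where d = tp/(tp+fp) and b = (tp+fn)/n."""
    n = tp + fn + fp + tn
    density = tp / (tp + fp)
    base_rate = (tp + fn) / n
    return density / base_rate


def lift_at(scores, labels, fraction):
    """Lift when the top `fraction` of examples by score are predicted positive.

    Labels are 1 for positive and 0 for negative.
    """
    n = len(scores)
    k = max(1, round(fraction * n))
    ranked = sorted(zip(scores, labels), reverse=True)
    tp = sum(label for _, label in ranked[:k])
    return (tp / k) / (sum(labels) / n)


# The classifier that flags 5% of transactions: d = 5/50, b = 10/1000.
print(lift_from_counts(tp=5, fn=5, fp=45, tn=945))

# Hypothetical scores for ten examples, two of which are positive.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
print(lift_at(scores, labels, fraction=0.2))
```

For the first example the lift is 10, which agrees with recall divided by prediction rate, 0.5/0.05.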
While a budget constraint is not normally a rational way of choosing a threshold
for predictions, it is still more rational than choosing an arbitrary threshold. In
particular, the threshold zero for an SVM classifier has some mathematical meaning but
is not a rational guide for making decisions.
5.3 Ranking examples
Figure 5.1: ROC curves for three alternative binary classifiers. Source: Wikipedia.
equal. This is what an ROC curve does. Concretely, an ROC curve is a plot of
the performance of a classifier, where the horizontal axis measures false positive rate
(fpr) and the vertical axis measures true positive rate (tpr). These are defined as
\[ fpr = \frac{fp}{fp + tn}, \qquad tpr = \frac{tp}{tp + fn}. \]
Note that tpr is the same as recall and is sometimes also called “hit rate.”
In an ROC plot, the ideal point is at the top left. One classifier uniformly domi-
nates another if its curve is always above the other’s curve. It happens often that the
ROC curves of two classifiers cross, which implies that neither one dominates the
other uniformly.
ROC plots are informative, but do not provide a quantitative measure of the per-
formance of classifiers. A natural quantity to consider is the area under the ROC
curve, often abbreviated AUC. The AUC is 0.5 for a classifier whose predictions are
no better than random, while it is 1 for a perfect classifier. The AUC has an intuitive
meaning: it can be proved that it equals the probability that the classifier will rank
correctly two randomly chosen examples where one is positive and one is negative.
One reason why AUC is widely used is that, as shown by the probabilistic mean-
ing just mentioned, it has built into it the implicit assumption that the rare class is
more important than the common class.
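The probabilistic meaning of AUC suggests a direct, if slow, way to compute it: examine every pair consisting of one positive and one negative example. The sketch below does this; counting a tied pair as half correct is a common convention that is assumed here, and the data are made up.

```python
def auc_by_pairs(scores, labels):
    """Fraction of (positive, negative) pairs that the scores rank correctly.

    Labels are 1 for positive and 0 for negative. Ties count as 0.5.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    correct = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                correct += 1.0
            elif p == q:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Three of the four positive-negative pairs are ranked correctly.
print(auc_by_pairs([0.9, 0.4, 0.6, 0.2], [1, 1, 0, 0]))
# Constant predictions are no better than random.
print(auc_by_pairs([0.5, 0.5, 0.5, 0.5], [1, 1, 0, 0]))
```

This computation takes time proportional to the number of positives times the number of negatives, so it is intended only to clarify the definition.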
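values sorted in the same order. For each f^j we want to find an output value g_j such
that the g_j values are monotonically increasing, and squared error relative to the y^j
values is minimized. Formally, the optimization problem is
\[ \min_{g_1, \ldots, g_n} \sum_{j=1}^{n} (y^j - g_j)^2 \quad \text{subject to } g_j \le g_{j+1} \text{ for } j = 1 \text{ to } j = n - 1. \]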
It is a remarkable fact that if squared error is minimized, then the resulting predictions
are well-calibrated probabilities.
There is an elegant algorithm called “pool adjacent violators” (PAV) that solves
this problem in linear time. The algorithm is as follows, where pooling a set means
replacing each member of the set by the arithmetic mean of the set.
Let g_j = y^j for all j
Start with j = 1 and increase j until the first j such that g_j > g_{j+1}
Pool g_j and g_{j+1}
Move left: If g_{j−1} > g_j then pool g_{j−1} to g_{j+1}
Continue to the left until monotonicity is satisfied
Proceed to the right
Given a test example x, the procedure to predict a well-calibrated probability is as
follows:
Apply the classifier to obtain f(x)
Find j such that f^j ≤ f(x) ≤ f^{j+1}
The predicted probability is g_j.
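The two procedures above can be sketched in a few lines. This version keeps a stack of pooled blocks, which is a common way to organize the pooling and yields the same result as moving left after each violation. The function names, the handling of scores outside the training range, and the toy data are assumptions for illustration.

```python
from bisect import bisect_right


def pool_adjacent_violators(y_sorted):
    """Return monotonically increasing values g that minimize squared error."""
    blocks = []  # each block is [sum of pooled values, number of pooled values]
    for y in y_sorted:
        blocks.append([float(y), 1])
        # Pool to the left while the last two block means violate monotonicity.
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    g = []
    for total, count in blocks:
        g.extend([total / count] * count)  # every member gets the block mean
    return g


def fit_isotonic(scores, labels):
    """Sort training scores, and sort 0/1 labels in the same order, then pool."""
    pairs = sorted(zip(scores, labels))
    f_sorted = [f for f, _ in pairs]
    g = pool_adjacent_violators([y for _, y in pairs])
    return f_sorted, g


def predict_probability(fx, f_sorted, g):
    """Find j with f^j <= f(x) <= f^(j+1) and return g_j.

    Assumption: scores below the smallest training score map to g_1.
    """
    j = bisect_right(f_sorted, fx) - 1
    j = min(max(j, 0), len(g) - 1)
    return g[j]


f_sorted, g = fit_isotonic([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], [0, 1, 0, 0, 1, 1])
print(g)
print(predict_probability(0.35, f_sorted, g))
```

In this example the labels 1, 0, 0 in positions two to four violate monotonicity, so they are pooled and each replaced by their mean 1/3.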
loss
\[ \sum_i l\Big( \frac{1}{1 + e^{-(a + b f_i)}}, \; y_i \Big). \]
The precise loss function l that is typically used in the optimization is called condi-
tional log likelihood (CLL) and is explained in Chapter ??? below. However, one
could also use squared error, which would be consistent with isotonic regression.
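A minimal sketch of fitting the two parameters a and b follows. It uses plain gradient descent on the negative conditional log likelihood, which is one simple choice among several; the learning rate, the number of iterations, and the toy data are assumptions, and real implementations typically use a more robust optimizer.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_univariate_logistic(scores, labels, learning_rate=0.1, iterations=5000):
    """Fit p(y = 1 | f) = 1 / (1 + exp(-(a + b f))) for labels in {0, 1}."""
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(iterations):
        grad_a = 0.0
        grad_b = 0.0
        for f, y in zip(scores, labels):
            # Derivative of the negative log likelihood with respect to (a + b f).
            error = sigmoid(a + b * f) - y
            grad_a += error
            grad_b += error * f
        a -= learning_rate * grad_a / n
        b -= learning_rate * grad_b / n
    return a, b


# Hypothetical classifier scores f_i with 0/1 labels (not perfectly separable).
scores = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
labels = [0, 0, 1, 0, 1, 1]
a, b = fit_univariate_logistic(scores, labels)
print(a, b)
print(sigmoid(a + b * 2.0), sigmoid(a + b * -2.0))
```

Because only two parameters are fitted, the mapping from scores to probabilities is always monotonic when b is positive, so the ranking of examples produced by the original classifier is preserved.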
Quiz
All questions below refer to a classifier learning task with two classes, where the base
rate for the positive class y = 1 is 5%.
(a) Suppose that a probabilistic classifier predicts p(y = 1|x) = c for some constant
c, for all test examples x. Explain why c = 0.05 is the best value for c.
The value 0.05 is the only constant that is a well-calibrated probability. “Well-
calibrated” means that c = 0.05 equals the average frequency of positive examples
in sets of examples with predicted score c.
(b) What is the error rate of the classifier from part (a)? What is its MSE?
With a prediction threshold of 0.5, all examples are predicted to be negative, so
the error rate is 5%. The MSE is 0.05 · (1 − 0.05)² + 0.95 · (0 − 0.05)² = 0.0475.
2009 Assignment
This assignment is due at the start of class on Tuesday April 28, 2009. As before, you
should work in a team of two. You may change partners, or keep the same partner.
Like previous assignments, this one uses the KDD98 training set. However, you
should now use the entire dataset. (Revised.) The goal is to train a classifier with real-
valued outputs that identifies test examples with TARGET_B = 1. Specifically, the
measure of success to optimize is lift at 10%. That is, as many positive test examples
as possible should be among the 10% of test examples with highest prediction score.
You should use logistic regression first, because this is a fast and reliable method
for training probabilistic classifiers. If you are using Rapidminer, then use the logis-
tic regression option of the FastLargeMargin operator. As before, recode and
transform features to improve their usefulness. Do feature selection to reduce the
size of the training set as much as is reasonably possible.
Next, you should apply a different learning method to the same training set that
you developed using logistic regression. The objective is to see whether this other
method can perform better than logistic regression, using the same data coded in the
same way. The second learning method can be a support vector machine, a neural
network, or a decision tree, for example. Apply cross-validation to find good algo-
rithm settings.
(Deleted: When necessary, use a postprocessing method (isotonic regression or
logistic regression) to obtain calibrated estimates of conditional probabilities. In-
vestigate the probability estimates produced by your two methods. What are the
minimum, mean, and maximum predicted probabilities? Discuss whether these are
reasonable.)
Assignment
This assignment is due at the start of class on Tuesday April 27, 2010. As before, you
should work in a team of two. You may change partners, or keep the same partner.
Like the previous assignment, this one uses the e-commerce data published by
Kevin Hillstrom. However, the goal now is to predict who makes a purchase on the
website (again, for customers who are not sent any promotion). This is a highly
unbalanced classification task.
First, use logistic regression. If you are using Rapidminer, then use the logis-
tic regression option of the FastLargeMargin operator. As before, recode and
transform features to improve their usefulness. Investigate the probability estimates
produced by your two methods. What are the minimum, mean, and maximum pre-
dicted probabilities? Discuss whether these are reasonable.
Second, use your creativity to get the best possible performance in predicting
who makes a purchase. The measure of success to optimize is lift at 25%. That is, as
many positive test examples as possible should be among the 25% of test examples
with highest prediction score. You may apply any learning algorithms that you like.
Can any method achieve better accuracy than logistic regression?
For predicting who makes a purchase, compare learning a classifier directly with
learning two classifiers, the first to predict who visits the website and the second to
predict which visitors make purchases. Note that mathematically
\[ p(\text{buy} \mid x) = p(\text{buy} \mid x, \text{visit}) \, p(\text{visit} \mid x). \]
As before, be careful not to fool yourself about the success of your methods.
Quiz for April 27, 2010
Your name:
Assignment
The paper you should read is Predicting protein-protein interactions from primary
structure by Joel Bock and David Gough, published in the journal Bioinformatics in
2001. The full text of this paper in PDF is supposed to be available free. If you have
difficulty obtaining it, please post on the class message board.
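You should figure out and describe three major flaws in the paper. The flaws
concern
(1) how the dataset is constructed,
(2) how each example is represented, and
(3) how performance is measured and reported.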
Each of the three mistakes is serious. The paper has 248 citations according to
Google Scholar as of May 26, 2009, but unfortunately each flaw by itself makes
the results of the paper not useful as a basis for future research. Each mistake is
described sufficiently clearly in the paper: it is a sin of commission, not a sin of
omission.
The second mistake, how each example is represented, is the most subtle, but at
least one of the papers citing this paper does explain it clearly. It is connected with
how SVMs are applied here. Remember the slogan: “If you cannot represent it then
you cannot learn it.”
Separately, provide a brief critique of the four benefits claimed for SVMs in the
section of the paper entitled Support vector machine learning. Are these benefits
true? Are they unique to SVMs? Does the work described in this paper take advan-
tage of them?
Sample answers
Here is a brief summary of what I see as the most important flaws of the paper
Predicting protein-protein interactions from primary structure.
(1) How the dataset is constructed. The problem here is that the negative exam-
ples are not pairs of genuine proteins. Instead, they are pairs of randomly generated
amino acid sequences. It is quite possible that these artificial sequences could not
fold into actual proteins at all. The classifiers reported in this paper may learn mainly
to distinguish between real proteins and non-proteins.
The authors acknowledge this concern, but do not overcome it. They could have
used pairs of genuine proteins as negative examples. It is true that one cannot be
sure that any given pair really is non-interacting. However, the great majority of
pairs do not interact. Moreover, if a negative example really is an interaction, that
will presumably slightly reduce the apparent accuracy of a trained classifier, but not
change overall conclusions.
(2) How each example is represented. This is a subtle but clear-cut and important
issue, assuming that the research uses a linear classifier.
Let x1 and x2 be two proteins and let f(x1) and f(x2) be their representations
as vectors in R^d. The pair of proteins is represented as the concatenated vector
⟨f(x1) f(x2)⟩ ∈ R^{2d}. Suppose a trained linear SVM has parameter vector w. By
definition w ∈ R^{2d} also. (If there is a bias coefficient, so w ∈ R^{2d+1}, the conclusion
is the same.)
Now suppose the first protein x1 is fixed and consider varying the second protein
x2. Proteins x2 will be ranked according to the numerical value of the dot product
w · ⟨f(x1) f(x2)⟩. This is equal to w1 · f(x1) + w2 · f(x2) where the vector w
is written as ⟨w1 w2⟩. If x1 is fixed, then the first term is constant and the second
term w2 · f(x2) determines the ranking. The ranking of x2 proteins will be the same
regardless of what the x1 protein is. This fundamental drawback of linear classifiers
for predicting interactions is pointed out in [Vert and Jacob, 2008, Section 5].
With a concatenated representation of protein pairs, a linear classifier can at best
learn the propensity of individual proteins to interact. Such a classifier cannot repre-
sent patterns that depend on features that are true only for specific protein pairs. This
is the relevance of the slogan “If you cannot represent it then you cannot learn it.”
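The argument about concatenated representations is easy to verify numerically. In the sketch below the weight vector and the feature vectors are arbitrary made-up numbers; the point is only that the ranking of candidate partners comes out the same for two very different first proteins.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def score_pair(w, f1, f2):
    """Linear score of the concatenated representation <f1 f2>."""
    return dot(w, f1 + f2)  # list concatenation gives a vector in R^(2d)


def rank_partners(w, f1, candidates):
    """Indices of candidate second proteins, from highest to lowest score."""
    return sorted(range(len(candidates)),
                  key=lambda k: score_pair(w, f1, candidates[k]),
                  reverse=True)


# Hypothetical weight vector in R^4 (so d = 2) and feature vectors in R^2.
w = [0.5, -1.0, 2.0, 0.3]
first_a = [1.0, 0.0]
first_b = [-3.0, 4.0]
candidates = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]]

print(rank_partners(w, first_a, candidates))
print(rank_partners(w, first_b, candidates))
```

Both rankings are identical because changing the first protein only adds a constant to every score.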
Note: Previously I was under the impression that the authors stated that they used
a linear kernel. On rereading the paper, it fails to mention at all what kernel they use.
If the research uses a linear kernel, then the argument above is applicable.
(3) How performance is measured and reported. Most pairs of proteins are
non-interacting. It is reasonable to use training sets where negative examples (non-
interacting pairs) are undersampled. However, it is not reasonable or correct to report
performance (accuracy, precision, recall, etc.) on test sets where negative examples
are under-represented, which is what is done in this paper.
As for the four claimed advantages of SVMs:
3. SVMs have fast training, which is essential for screening large datasets.
SVM training is slow compared to many other classifier learning methods, ex-
cept for linear classifiers trained by fast algorithms that were only published
after 2001, when this paper was published. As mentioned above, a linear clas-
sifier is not appropriate with the representation of protein pairs used in this
paper.
In any case, what is needed for screening many test examples is fast classifier
application, not fast training. Applying a linear classifier is fast, whether it is
an SVM or not. Applying a nonlinear SVM typically has the same order-of-
magnitude time complexity as applying a nearest-neighbor classifier, which is
the slowest type of classifier in common use.
Detecting overfitting:
cross-validation
6.1 Cross-validation
Usually we have a fixed database of labeled examples available, and we are faced
with a dilemma: we would like to use all the examples for training, but we would
also like to use many examples as an independent test set. Cross-validation is a
procedure for overcoming this dilemma. It is the following algorithm.
The output of this model selection procedure is v̂. The input set V of alternative
settings can be a grid of parameter values.
But, any procedure for selecting parameter values is itself part of the learning
algorithm. It is crucial to understand this point. The setting v̂ is chosen to maximize
M (v), so M (v̂) is not a fair estimate of the performance to be expected from v̂ on
future data. Stated another way, v̂ is chosen to optimize performance on all of S, so
v̂ is likely to overfit S.
Notice that the division of S into subsets happens just once, and the same division
is used for all settings v. This choice reduces the random variability in the evaluation
of different v. A new partition of S could be created for each setting v, but this would
not overcome the issue that v̂ is chosen to optimize performance on S.
What should we do about the fact that any procedure for selecting parameter
values is itself part of the learning algorithm? One answer is that this procedure
should itself be evaluated using cross-validation. This process is called nested cross-
validation, because one cross-validation is run inside another.
Specifically, nested cross-validation is the following process:
Now, the final reported average ei is the estimated performance of the classifier ob-
tained by running the same model selection procedure (i.e. the search over each set-
ting v) on the whole dataset S.
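A minimal sketch of nested cross-validation as described here: an inner cross-validation chooses a setting using only the outer training folds, and the outer loop scores the chosen setting on data that played no role in the choice. The callable `fit_and_score(setting, train_idx, test_idx)`, which is assumed to train with one setting and return a score where higher is better, and all other names are hypothetical.

```python
import random

def make_folds(indices, k, rng):
    """Shuffle the given indices and split them into k disjoint folds."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

def cv_score(indices, folds, setting, fit_and_score):
    """Average score of one setting over a fixed partition of `indices`."""
    total = 0.0
    for fold in folds:
        held = set(fold)
        train_idx = [i for i in indices if i not in held]
        total += fit_and_score(setting, train_idx, fold)
    return total / len(folds)

def select_setting(indices, settings, fit_and_score, k, rng):
    """Inner loop: pick the setting with the best cross-validated score."""
    folds = make_folds(indices, k, rng)  # the same partition for every setting
    return max(settings, key=lambda v: cv_score(indices, folds, v, fit_and_score))

def nested_cv(n, settings, fit_and_score, k_outer=5, k_inner=5, seed=0):
    """Outer loop: estimate the performance of the whole selection procedure."""
    rng = random.Random(seed)
    all_idx = list(range(n))
    scores = []
    for test_fold in make_folds(all_idx, k_outer, rng):
        held = set(test_fold)
        train_idx = [i for i in all_idx if i not in held]
        best = select_setting(train_idx, settings, fit_and_score, k_inner, rng)
        # The outer test fold played no part in choosing `best`.
        scores.append(fit_and_score(best, train_idx, test_fold))
    return sum(scores) / len(scores)
```

The returned average is a fair estimate for the procedure "select a setting, then train", not for any single setting, since different outer folds may select different settings.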
Some notes:
2. Above, the same partition of T is used for each setting v. This reduces ran-
domness a little bit compared to using a different partition for each v, but the
latter would be correct also.
Quiz for April 13, 2010
(a) Suppose that you choose the R value for which the reported RMSE is lowest.
Explain why this method is likely to be overoptimistic, as an estimate of the RMSE
to be expected on future data.
Each R value is being evaluated on the entire set S. The value that seems to be
best is likely overfitting this set.
(b) Very briefly, suggest an improved variation of the method above.
The procedure to select a value for R is part of the learning algorithm. This
whole procedure should be evaluated using a separate test set, or via cross-validation,
• “The partition of S should be stratified.” No; first of all, we are doing
regression, so stratification is not well-defined, and second, failing to stratify
increases variability but does not cause overfitting.
• “The partition of S should be done separately for each R value, not just once.”
No; a different partition for each R value might increase the variability in the
evaluation of each R, but it would not change the fact that the best R is being
selected according to its performance on all of S.
Two basic points to remember are that it is never fair to evaluate a learning method on
its training set, and that any search for settings for a learning algorithm (e.g. search
for a subset of features, or for algorithmic parameters) is part of the learning method.
Chapter 7
Making optimal decisions
This chapter discusses making optimal decisions based on predictions, and maximiz-
ing the value of customers.
Decisions and predictions are conceptually very different. For example, a prediction
concerning a patient may be “allergic” or “not allergic” to aspirin, while the corre-
sponding decision is whether or not to administer the drug. Predictions can often be
probabilistic, while decisions typically cannot.
Suppose that examples are credit card transactions and the label y = 1 designates
a legitimate transaction. Then making the decision y = 1 for an attempted transaction
means acting as if the transaction is legitimate, i.e. approving the transaction. The
essence of cost-sensitive decision-making is that it can be optimal to act as if one
class is true even when some other class is more probable. For example, if the cost
of approving a fraudulent transaction is proportional to the dollar amount involved,
then it can be rational not to approve a large transaction, even if the transaction is
most likely legitimate. Conversely, it can be rational to approve a small transaction
even if there is a high probability it is fraudulent.
Mathematically, let i be the predicted class and let j be the true class. If i = j
then the prediction is correct, while if i ≠ j the prediction is incorrect. The (i, j)
entry in a cost matrix c is the cost of acting as if class i is true, when in fact class j
is true. Here, predicting i means acting as if i is true, so one could equally well call
this deciding i.
A cost matrix c has the following structure when there are only two classes:
                      actual negative    actual positive
predict negative      c(0, 0) = c00      c(0, 1) = c01
predict positive      c(1, 0) = c10      c(1, 1) = c11
The cost of a false positive is c10 while the cost of a false negative is c01 . We fol-
low the convention that cost matrix rows correspond to alternative predicted classes,
while columns correspond to actual classes. In short the convention is row/column =
i/j = predicted/actual. (This convention is the opposite of the one in Section 5.1 so
perhaps we should switch one of these to make the conventions similar.)
The optimal prediction for an example x is the class i that minimizes the expected
cost
$$e(x, i) = \sum_j p(j|x)\, c(i, j). \tag{7.1}$$
For each i, e(x, i) is an expectation computed by summing over the alternative pos-
sibilities for the true class of x. In this framework, the role of a learning algorithm is
to produce a classifier that for any example x can estimate the probability p(j|x) of
each class j being the true class of x.
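A minimal sketch of the decision rule in Equation (7.1), assuming the cost matrix is stored as a nested list with `cost[i][j]` the cost of predicting i when the true class is j. The function names and the numeric costs are hypothetical and chosen only for illustration.

```python
def expected_cost(probs, cost, i):
    """e(x, i) = sum over j of p(j|x) * c(i, j), where probs[j] = p(j|x)."""
    return sum(p_j * cost[i][j] for j, p_j in enumerate(probs))

def optimal_decision(probs, cost):
    """Return the class i with the smallest expected cost."""
    return min(range(len(cost)), key=lambda i: expected_cost(probs, cost, i))

# Hypothetical costs: wrongly predicting class 0 is 50 times as costly
# as wrongly predicting class 1.
cost = [[0.0, 50.0],
        [1.0, 0.0]]
probs = [0.9, 0.1]                         # class 0 is far more probable
decision = optimal_decision(probs, cost)   # class 1 is still the cheaper choice
```

With these numbers the expected cost of predicting class 0 is 0.1 · 50 = 5, while the expected cost of predicting class 1 is 0.9 · 1 = 0.9, so the optimal decision is the less probable class.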
if a constant is added to each entry in the matrix. This shifting corresponds to chang-
ing the baseline away from which costs are measured. By scaling and shifting entries,
any two-class cost matrix
$$\begin{pmatrix} c_{00} & c_{01} \\ c_{10} & c_{11} \end{pmatrix}$$
that satisfies the reasonableness conditions can be transformed into a simpler matrix
that always leads to the same decisions:
$$\begin{pmatrix} 0 & c'_{01} \\ 1 & c'_{11} \end{pmatrix}$$
where c′01 = (c01 − c00)/(c10 − c00) and c′11 = (c11 − c00)/(c10 − c00). From a
matrix perspective, a 2x2 cost matrix effectively has two degrees of freedom.
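The following sketch checks the claim numerically for one hypothetical cost matrix: subtracting c00 from every entry and dividing by c10 − c00 leaves every decision unchanged. The function names and the example costs are assumptions made for illustration.

```python
def normalize(cost):
    """Shift by c00 and scale by (c10 - c00) so the first column becomes (0, 1)."""
    (c00, c01), (c10, c11) = cost
    scale = c10 - c00  # positive under the reasonableness condition c10 > c00
    return [[0.0, (c01 - c00) / scale], [1.0, (c11 - c00) / scale]]

def decide(p, cost):
    """Class with the lower expected cost, where p = p(y = 1 | x)."""
    e0 = (1 - p) * cost[0][0] + p * cost[0][1]
    e1 = (1 - p) * cost[1][0] + p * cost[1][1]
    return 1 if e1 <= e0 else 0

original = [[2.0, 12.0], [6.0, 4.0]]   # hypothetical costs
simple = normalize(original)           # two free entries remain
# Compare the decisions of both matrices on a grid of probabilities.
same = all(decide(k / 10, original) == decide(k / 10, simple) for k in range(11))
```

Only the two entries in the second column of the normalized matrix remain free, which is the sense in which a 2x2 cost matrix has two degrees of freedom.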
Here examples are people who apply for a loan from a bank. “Actual good” means
that a customer would repay a loan while “actual bad” means that the customer would
default. The action associated with “predict bad” is to deny the loan. Hence, the
cashflow relative to any baseline associated with this prediction is the same regardless
              fraudulent    legitimate
refuse        $20           −$20
approve       −x            0.02x
the predicted average cost, and is computed using the conditional probability of each
class given the example.
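In the two-class case, the optimal prediction is class 1 if and only if the expected
cost of this prediction is less than or equal to the expected cost of predicting class 0,
i.e. if and only if
$$p(y = 0|x)\, c_{10} + p(y = 1|x)\, c_{11} \le p(y = 0|x)\, c_{00} + p(y = 1|x)\, c_{01}$$
which is equivalent to
$$(1 - p)\, c_{10} + p\, c_{11} \le (1 - p)\, c_{00} + p\, c_{01}$$
given p = p(y = 1|x). If this inequality is in fact an equality, then predicting either
class is optimal.
The threshold for making optimal decisions is p∗ such that
$$(1 - p^*)\, c_{10} + p^* c_{11} = (1 - p^*)\, c_{00} + p^* c_{01}.$$
Assuming the reasonableness conditions c10 > c00 and c01 ≥ c11, the optimal
prediction is class 1 if and only if p ≥ p∗. Rearranging the equation for p∗ leads to
$$p^* = \frac{c_{10} - c_{00}}{c_{10} - c_{00} + c_{01} - c_{11}}.$$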
This says that the examples ranked lowest are positive with probability equal to t/2.
In other words, the lift for the lowest ranked examples cannot be driven below 0.5.
Let c be the cost of a contact and let b be the benefit of a positive. This means
that the benefit matrix is
                      actual positive    actual negative
contact               b − c              −c
do not contact        0                  0
                   nature simple    nature complex
predict simple     succeed          fail
predict complex    fail             fail
Here “simple” means that the hypothesis is drawn from a space with low cardinality, while “complex”
means it is drawn from a space with high cardinality.
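Let n be the size of the test set. The profit at q is the total benefit minus the total cost,
which is
$$nqbt\sqrt{1/q} - nqc = nc\left(\frac{bt\sqrt{q}}{c} - q\right) = nc\,(k\sqrt{q} - q)$$
where k = tb/c. Profit is maximized when
$$0 = \frac{d}{dq}\left(kq^{0.5} - q\right) = k(0.5)q^{-0.5} - 1.$$
The solution is q = (k/2)^2, which gives a maximum profit of
nc(k · k/2 − (k/2)^2) = nck^2/4 = ncq.
Remarkably, this is always the same as the cost of running the campaign, which is
nqc.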
As k decreases, the attainable profit decreases fast, i.e. quadratically. If k < 0.4
then q < 0.04 and the attainable profit is less than 0.04nc. Such a campaign may
have high risk, because the lift attainable at small q is typically worse than suggested
by the rule of thumb. Note however that a small adverse change in c, b, or t is not
likely to make the campaign lose money, because the expected revenue is always
twice the campaign cost.
It is interesting to consider whether data mining is beneficial or not from the
point of view of society at large. Remember that c is the cost of one contact from the
perspective of the initiator of the campaign. Each contact also has a cost or benefit
for the recipient. Generally, if the recipient responds, one can assume that the contact
was beneficial for him or her, because s/he only responds if the response is beneficial.
However, if the recipient does not respond, which is the majority case, then one can
assume that the contact caused a net cost to him or her, for example a loss of time.
From the point of view of the initiator of a campaign a cost or benefit for a
respondent is an externality. It is not rational for the initiator to take into account
these benefits or costs, unless they cause respondents to take actions that affect the
initiator, such as closing accounts. However, it is rational for society to take into
account these externalities.
The conclusion above is that the revenue from a campaign, for its initiator, is
roughly twice its cost. Suppose that the benefit of responding for a respondent is λb
where b is the benefit to the initiator. Suppose also that the cost of a solicitation to
a person is µc where c is the cost to the initiator. The net benefit to respondents is
positive as long as µ < 2λ.
The reasoning above clarifies why spam email campaigns are harmful to society.
For these campaigns, the cost c to the initiator is tiny. However, the cost to a recipient
is not tiny, so µ is large. Whatever λ is, it is likely that µ > 2λ.
In summary, data mining is only beneficial in a narrow sweet spot, where
tb/2 ≤ c ≤ αtb
where α is some constant greater than 1. The product tb is the average benefit of
soliciting a random customer. If the cost c of solicitation is less than half of this, then
it is rational to contact all potential respondents. If c is much greater than the average
benefit, then the campaign is likely to have high risk for the initiator.
As an example of the reasoning above, consider the scenario of the 1998 KDD
contest. Here t = 0.05 approximately, c = $0.68, and the average benefit is b = $15
approximately. We have k = tb/c = 75/68 = 1.10. The rule of thumb predicts that
the optimal fraction of people to solicit is q = 0.30, while the achievable profit per
person is cq = $0.21. In fact, the methods that perform best in this domain achieve
profit of about $0.16, while soliciting about 70% of all people.
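The arithmetic of this example can be reproduced with a short sketch. It assumes the rule of thumb that profit per person at solicitation fraction q is c(k√q − q) with k = tb/c; setting the derivative to zero gives q = (k/2)^2, capped at 1 when k > 2. The function name is hypothetical.

```python
def campaign_rule_of_thumb(t, b, c):
    """Return (k, optimal fraction q, predicted profit per person).

    t: base rate of positives, b: benefit of a positive, c: cost of a contact.
    """
    k = t * b / c
    q = min(1.0, (k / 2) ** 2)   # q = 1 means soliciting everyone
    profit_per_person = c * (k * q ** 0.5 - q)
    return k, q, profit_per_person

# Approximate values quoted for the 1998 KDD contest.
k, q, profit = campaign_rule_of_thumb(t=0.05, b=15.0, c=0.68)
```

Without rounding k to 1.10 first, q comes out slightly above 0.30, and the profit per person still rounds to $0.21.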
Quiz (2009)
Suppose your lab is trying to crystallize a protein. You can try experimental condi-
tions x that differ on temperature, salinity, etc. The label y = 1 means crystallization
is successful, while y = 0 means failure. Assume (not realistically) that the results
of different experiments are independent.
You have a classifier that predicts p(y = 1|x), the probability of success of ex-
periment x. The cost of doing one experiment is $60. The value of successful crys-
tallization is $9000.
(a) Write down the benefit matrix involved in deciding rationally whether or not
to perform a particular experiment.
The benefit matrix is
                  success        failure
do experiment     9000 − 60      −60
don’t             0              0
Suppose that you work for a bank that wants to prevent criminal laundering of money.
The label y = 1 means a money transfer is criminal, while y = 0 means the transfer
is legal. You have a classifier that estimates the probability p(y = 1|x) where x is a
vector of feature values describing a money transfer.
Let z be the dollar amount of the transfer. The matrix of costs (negative) and
benefits (positive) involved in deciding whether or not to deny a transfer is as follows:
          criminal    legal
deny      0           −0.10z
allow     −z          0.01z
Work out the rational policy based on p(y = 1|x) for deciding whether or not to
allow a transfer.
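One way to approach this question is to compare the expected benefit of each action numerically. The sketch below is only an illustration under the benefit matrix given above; the function names are hypothetical.

```python
def expected_benefit(p, row):
    """row = (benefit if criminal, benefit if legal); p = p(y = 1 | x)."""
    criminal, legal = row
    return p * criminal + (1 - p) * legal

def best_action(p, z):
    """Return the action with the higher expected benefit for a transfer of z dollars."""
    benefits = {"deny": (0.0, -0.10 * z), "allow": (-z, 0.01 * z)}
    return max(benefits, key=lambda action: expected_benefit(p, benefits[action]))

low_risk = best_action(0.05, 1000.0)    # allowing has the higher expected benefit
high_risk = best_action(0.20, 1000.0)   # denying has the higher expected benefit
```

Setting the two expected benefits equal gives pz = 0.11z(1 − p), so the break-even probability is 0.11/1.11, about 0.099, whatever the amount z.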
2009 Assignment
This assignment is due at the start of class on Tuesday May 5. As before, you should
work in a team of two, and you are free to change partners or not.
This assignment is the last one to use the KDD98 data. You should now train
on the entire training set, and measure final success on the test set that you have not
previously used. The goal is to solicit an optimal subset of the test examples. The
measure of success to maximize is total donations received minus $0.68 for every
solicitation.
You should train a regression function to predict donation amounts, and a classi-
fier to predict donation probabilities. For a test example x, let the predicted donation
be a(x) and let the predicted donation probability be p(x). You should decide to send
a donation request to person x if and only if
$$p(x) \cdot a(x) \ge 0.68.$$
You should use the training set for all development work. In particular, you should
use part of the training set for debugging your procedure for reading in test examples,
making decisions concerning them, and tallying total profit. Only use the actual test
set once, to measure the final success of your method.
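A minimal sketch of the tallying step, assuming the decision rule is to solicit whenever the expected donation p(x) · a(x) is at least the $0.68 cost of a solicitation. The data layout (triples of predicted probability, predicted amount, and actual donation) and the function name are assumptions; the numbers are made up.

```python
def campaign_profit(examples, cost=0.68):
    """Return (number solicited, total profit).

    examples: iterable of (p, a, actual) triples, where p is the predicted
    donation probability, a the predicted donation amount, and actual the
    donation really received (0.0 if the person did not donate).
    """
    solicited = 0
    profit = 0.0
    for p, a, actual in examples:
        if p * a >= cost:              # expected donation covers the mailing cost
            solicited += 1
            profit += actual - cost    # pay the cost whether or not they donate
    return solicited, profit

# Three hypothetical people; the second is not worth soliciting.
people = [(0.10, 10.0, 0.0), (0.05, 10.0, 20.0), (0.50, 5.0, 5.0)]
count, total = campaign_profit(people)
```

Donations from people who were not solicited are never counted, so the second person's $20 does not enter the total.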
Notes: The test instances are in the file cup98val.zip at
http://archive.ics.uci.edu/ml/databases/kddcup98/kddcup98.html. The test set
labels are in valtargt.txt. The labels are sorted by CONTROLN, unlike the test
instances.
2010 Assignment
This week’s assignment is to participate in the PAKDD 2010 data mining contest.
Details of the contest are at http://sede.neurotech.com.br/PAKDD2010/.
Each team of two students should register and download the files for the contest.
Your first goal should be to understand the data and submission formats, and to
submit a correct set of predictions. Make sure that when you have good predictions
later, you will not run into any technical difficulties.
Your second goal should be to understand the contest scenario and the differences
between the training and two test datasets. Do some exploratory analysis of the three
datasets. In your written report, explain your understanding of the scenario, and your
general findings about the datasets.
Next, based on your general understanding, design a sensible approach for achiev-
ing the contest objective. Implement this approach and submit predictions to the con-
test. Of course, you may refine your approach iteratively and you may make multiple
submissions. Meet the May 3 deadline for uploading your best predictions and a
copy of your report. The contest rules ask each team to submit a paper of four pages.
You can find template files for LaTeX and Word at
http://www.springer.com/computer/lncs?SGWID=0-164-6-793341-0. Do not worry about
formatting details.
Chapter 8
Learning classifiers despite missing labels
This chapter discusses how to learn two-class classifiers from nonstandard training
data. Specifically, we consider three different but related scenarios where labels are
missing for some but not all training examples.
p(x, y) = p(x)p(y|x)
and also
p(x, y) = p(y)p(x|y)
without loss of generality and without making any assumptions. The equations above
are sometimes called the chain rule of probabilities. They are true both when the
probability values are probability masses, and when they are probability densities.
assuming that p(s = 1|x) > 0 for all x. Therefore we can learn a correct model of
p(y|x) from just the labeled training data, without using the unlabeled data in any
way. This case is rather misleadingly called “missing at random” (MAR). It is not
the case that labels are missing in a totally random way, because missingness does
depend on x. It is also not the case that s and y are independent. However, it is true
that s and y are conditionally independent, conditional on x. Concretely, for each
value of x the equation p(y|x, s = 0) = p(y|x) = p(y|x, s = 1) holds.
The assumption that p(s = 1|x) > 0 is important. The real-world meaning of
this assumption is that label information must be available with non-zero probability
for every possible instance x. Otherwise, it might be the case for some x that no
labeled training examples are available from which to estimate p(y|x, s = 1).
Suppose that even when x is known, there is still some correlation between s
and y. In this case the label y is said to be “missing not at random” (MNAR) and
inference is much more difficult. We do not discuss this case further here.
be correlated with the value of y, but this correlation disappears after conditioning
on x.
If missingness does depend on y, even after conditioning on x, then we are in the
MNAR situation and in general we can draw no firm conclusions. An example of
this situation is “survivor bias.” Suppose our analysis is based on historical records,
and those records are more likely to exist for survivors, everything else being equal.
Then p(s = 1|y = 1, x) > p(s = 1|y = 0, x) where “everything else being equal”
means that x is the same on both sides.
A useful fact is the following. Suppose we want to estimate E[f (z)] where z
follows the distribution p(z), but we can only draw samples of z from the distribution
q(z). The fact is
$$E[f(z) \mid z \sim p(z)] = E\left[f(z)\,\frac{p(z)}{q(z)} \,\Big|\, z \sim q(z)\right],$$
assuming q(z) > 0 for all z. More generally the requirement is that q(z) > 0
whenever p(z) > 0, assuming the definition 0/0 = 0. The equation above is called
the importance-sampling identity.
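The identity can be checked on a small discrete example. The sketch below is an illustration only: the distributions p and q, the function f, and the name `importance_estimate` are all hypothetical.

```python
def importance_estimate(samples, f, p, q):
    """Average of f(z) * p(z) / q(z) over samples z drawn from q."""
    return sum(f(z) * p(z) / q(z) for z in samples) / len(samples)

# Hypothetical discrete example: target p, sampling distribution q (uniform).
p = {0: 0.5, 1: 0.3, 2: 0.2}.get
q = {0: 1 / 3, 1: 1 / 3, 2: 1 / 3}.get
f = lambda z: z * z

# A sample whose empirical frequencies match q exactly, so the estimate
# coincides with the exact expectation under p: 0.3 * 1 + 0.2 * 4 = 1.1.
samples = [0, 1, 2] * 100
estimate = importance_estimate(samples, f, p, q)
```

With a genuinely random sample from q the estimate would fluctuate around 1.1 rather than equal it; values of z where p(z)/q(z) is large contribute most to that fluctuation.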
Let the goal be to compute E[f (x, y)|x, y ∼ p(x, y)] for any function f . To make
notation more concise, write this as E[f ] and write E[f (x, y)|x, y ∼ p(x, y, s = 1)]
as E[f |s = 1]. We have
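$$E[f] = E\left[f\,\frac{p(x)\,p(y|x)}{p(x|s=1)\,p(y|x,s=1)} \,\Big|\, s=1\right] = E\left[f\,\frac{p(x)}{p(x|s=1)} \,\Big|\, s=1\right]$$
since p(y|x) = p(y|x, s = 1). Applying Bayes' rule to p(x|s = 1) gives
$$E[f] = E\left[f\,\frac{p(x)}{p(s=1|x)\,p(x)/p(s=1)} \,\Big|\, s=1\right] = E\left[f\,\frac{p(s=1)}{p(s=1|x)} \,\Big|\, s=1\right].$$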
The constant p(s = 1) can be estimated as r/n where r is the number of labeled
training examples and n is the total number of training examples. Let p̂(s = 1|x) be
a trained model of the conditional probability p(s = 1|x). The estimate of E[f ] is
then
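$$\frac{1}{r}\sum_{i=1}^{r}\frac{r}{n}\,\frac{f(x_i, y_i)}{\hat{p}(s=1|x_i)} = \frac{1}{n}\sum_{i=1}^{r}\frac{f(x_i, y_i)}{\hat{p}(s=1|x_i)}.$$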
This estimate is called a “plug-in” estimate because it is based on plugging the
observed values of ⟨x, y⟩ into a formula that would be correct if based on integrating
over all values of ⟨x, y⟩.
The weighting approach just explained is correct in the statistical sense that it
is unbiased, if the propensity estimates p̂(s = 1|x) are correct. However, this ap-
proach typically has high variance since a few labeled examples with high values
for 1/p̂(s = 1|x) dominate the sum. Therefore an important question is whether
alternative approaches exist that have lower variance. One simple heuristic is to
place a ceiling on the values 1/p̂(s = 1|x). For example the ceiling 1000 is used
by [Huang et al., 2006]. However, no good method for selecting the ceiling value is
known.
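A minimal sketch of the weighted estimate with an optional ceiling on the weights. It assumes the estimate averages f · p(s = 1)/p̂(s = 1|x) over the r labeled examples with p(s = 1) estimated as r/n, which simplifies to dividing the weighted sum by n. The function name and the numbers are hypothetical.

```python
def weighted_estimate(f_values, propensities, n, ceiling=None):
    """Estimate E[f] from the r labeled examples.

    f_values[i] is f(x_i, y_i), propensities[i] is the modeled p(s = 1 | x_i),
    and n is the total number of training examples (labeled plus unlabeled).
    """
    total = 0.0
    for f_i, g_i in zip(f_values, propensities):
        weight = 1.0 / g_i
        if ceiling is not None:
            weight = min(weight, ceiling)   # heuristic cap to reduce variance
        total += f_i * weight
    return total / n

# Three labeled examples out of n = 8; the second has a low propensity.
plain = weighted_estimate([1.0, 1.0, 0.0], [0.5, 0.25, 0.5], n=8)
capped = weighted_estimate([1.0, 1.0, 0.0], [0.5, 0.25, 0.5], n=8, ceiling=3.0)
```

Capping trades variance for bias: the capped estimate is no longer unbiased even when the propensity model is correct.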
In medical research, the ratio p(s = 1)/p(s = 1|x) is called the “inverse proba-
bility of treatment” (IPT) weight.
When does the reject inference scenario give rise to MNAR bias?
Without some assumption about which positive examples are labeled, it is impossi-
ble to make progress. A common assumption is that the labeled positive examples
are chosen completely randomly from all positive examples. Let this be called the
“selected completely at random” assumption. Stated formally, it is that
$$p(s = 1 \mid x, y = 1) = p(s = 1 \mid y = 1).$$
Another way of stating the assumption is that s and x are conditionally independent
given y.
A training set consists of two subsets, called the labeled (s = 1) and unlabeled
(s = 0) sets. Suppose we provide these two sets as inputs to a standard training
algorithm. This algorithm will yield a function g(x) such that g(x) = p(s = 1|x)
approximately. The following lemma shows how to obtain a model of p(y = 1|x)
from g(x).
Lemma 1. Suppose the “selected completely at random” assumption holds. Then
p(y = 1|x) = p(s = 1|x)/c where c = p(s = 1|y = 1).
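Proof. Remember that the assumption is p(s = 1|y = 1, x) = p(s = 1|y = 1).
Now consider p(s = 1|x). Because only positive examples are labeled, we have that
$$p(s=1|x) = p(y=1 \wedge s=1 \mid x) = p(y=1|x)\,p(s=1|y=1,x) = p(y=1|x)\,p(s=1|y=1).$$
The result follows by dividing each side by p(s = 1|y = 1).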
Note that in principle any single example from P is sufficient to determine c, but that
in practice averaging over all members of P is preferable.
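A minimal sketch of how Lemma 1 might be applied, assuming c is estimated as the average of g(x) over the labeled positive examples P, ideally examples held out from the training of g. The function names are hypothetical, and the clipping to 1 is an added safeguard because g is only an approximation.

```python
def estimate_c(g_on_labeled):
    """Average of g(x) over the labeled positive examples P."""
    return sum(g_on_labeled) / len(g_on_labeled)

def posterior_positive(g_x, c):
    """Lemma 1: p(y = 1 | x) = p(s = 1 | x) / c, clipped to at most 1."""
    return min(1.0, g_x / c)

# Hypothetical outputs of the nontraditional classifier g on members of P.
c_hat = estimate_c([0.4, 0.5, 0.3])
prob = posterior_positive(0.2, c_hat)   # half as likely to be labeled as a sure positive
```

Averaging over many members of P reduces the effect of errors in individual values of g(x), which is why it is preferred to using a single example.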
There is an alternative way of using Lemma 1. Let the goal be to estimate
Ep(x,y,s) [h(x, y)] for any function h, where p(x, y, s) is the overall distribution. To
make notation more concise, write this as E[h]. We want an estimator of E[h] based
on a positive-only training set of examples of the form ⟨x, s⟩.
where
$$w(x) = p(y = 1 \mid x, s = 0) = \frac{1 - c}{c}\,\frac{p(s = 1|x)}{1 - p(s = 1|x)} \tag{8.3}$$
and m is the cardinality of the training set. What this says is that each labeled exam-
ple is treated as a positive example with unit weight, while each unlabeled example
is treated as a combination of a positive example with weight p(y = 1|x, s = 0)
and a negative example with complementary weight 1 − p(y = 1|x, s = 0). The
probability p(s = 1|x) is estimated as g(x) where g is the nontraditional classifier
explained in the previous section.
The result above on estimating E[h] can be used to modify a learning algorithm
in order to make it work with positive and unlabeled training data. One method is to
give training examples individual weights. Positive examples are given unit weight
and unlabeled examples are duplicated; one copy of each unlabeled example is made
positive with weight p(y = 1|x, s = 0) and the other copy is made negative with
weight 1 − p(y = 1|x, s = 0).
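The duplication scheme can be sketched as follows, using the weight w(x) = ((1 − c)/c) · g(x)/(1 − g(x)) for unlabeled examples, where g(x) estimates p(s = 1|x). The function names and data layout are hypothetical, and clipping the weight to 1 is an added safeguard for imperfect estimates of g.

```python
def unlabeled_weight(g_x, c):
    """w(x) = p(y = 1 | x, s = 0) = ((1 - c) / c) * g(x) / (1 - g(x))."""
    return (1 - c) / c * g_x / (1 - g_x)

def weighted_training_set(examples, g, c):
    """examples: list of (x, s) pairs. Returns a list of (x, label, weight)."""
    weighted = []
    for x, s in examples:
        if s == 1:
            weighted.append((x, 1, 1.0))          # labeled, hence positive
        else:
            w = min(1.0, unlabeled_weight(g(x), c))
            weighted.append((x, 1, w))            # positive copy
            weighted.append((x, 0, 1.0 - w))      # negative copy
    return weighted

# Hypothetical classifier output g(x) = x, with c = 0.4.
result = weighted_training_set([(0.9, 1), (0.2, 0)], g=lambda x: x, c=0.4)
```

The resulting triples can be passed to any learning algorithm that accepts per-example weights.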
where
$$w(z) = \frac{p(z)}{q(z)}.$$
We assume that q(z) > 0 for all z such that p(z) > 0, and we define 0/0 = 0.
Suppose that the training set consists of values z sampled according to the probability
distribution q(z). Explain intuitively which members of the training set will have
greatest influence on the estimate of E[f (z)|z ∼ p(z)].
Quiz for 2009
(a) Suppose you use the weighting approach to deal with reject inference. What are
the minimum and maximum possible values of the weights?
Let x be a labeled example, and let its weight be p(s = 1)/p(s = 1|x). Intu-
itively, this weight is how many copies are needed to allow the one labeled example
to represent all the unlabeled examples that are similar. The conditional probability
p(s = 1|x) can range between 0 and 1, so the weight can range between p(s = 1)
and infinity.
(b) Suppose you use the weighting approach to learn from positive and unlabeled
examples. What are the minimum and maximum possible values of the weights?
In this scenario, weights are assigned to unlabeled examples, not to labeled ex-
amples as above. The weights here are probabilities p(y = 1|x, s = 0), so they range
between 0 and 1.
(c) Explain intuitively what can go wrong if the “selected completely at random”
assumption is false, when learning from positive and unlabeled examples.
The “selected completely at random” assumption says that the positive examples
with known labels are perfectly representative of the positive examples with unknown
labels. If this assumption is false, then there will be unlabeled examples that in fact
are positive, but that we treat as negative, because they are different from the labeled
positive examples. The trained model of the positive class will be too narrow.
Assignment (revised)
The goal of this assignment is to train useful classifiers using training sets with miss-
ing information of three different types: (i) covariate shift, (ii) reject inference, and
(iii) no labeled negative training examples. In
http://www.cs.ucsd.edu/users/elkan/291/dmfiles.zip you can find four datasets: one test set, and
one training set for each of the three scenarios. Training set (i) has 5,164 examples,
sets (ii) and (iii) have 11,305 examples, and the test set has 11,307 examples. Each
example has values for 13 predictors. (Many thanks to Aditya Menon for creating
these files.)
You should train a classifier separately based on each training set, and measure
performance separately but on the same test set. Use accuracy as the primary measure
of success, and use the logistic regression option of FastLargeMargin as the
main training method. In each case, you should be able to achieve better than 82%
accuracy.
All four datasets are derived from the so-called Adult dataset that is available
at http://archive.ics.uci.edu/ml/datasets/Adult. Each example
describes one person. The label to be predicted is whether or not the person earns
over $50,000 per year. This is an interesting label to predict because it is analogous
to a label describing whether or not the person is a customer that is desirable in some
way. (We are not using the published weightings, so the fnlwgt feature has been
omitted from our datasets.)
First, do cross-validation on the test set to establish the accuracy that is achievable
when all training set labels are known. In your report, show graphically a learning
curve, that is accuracy as a function of the number of training examples, for 1000,
2000, etc. training examples.
Training set (i) requires you to overcome covariate shift, since it does not follow
the same distribution as the population of test examples. Evaluate experimentally the
effectiveness of learning p(y|x) from biased training sets of size 1000, 2000, etc.
Training set (ii) requires reject inference, because the training examples are a
random sample from the test population, but the training label is known only for some
training examples. The persons with known labels, on average, are better prospects
than the ones with unknown labels. Compare learning p(y|x) directly with learning
p(y|x) after reweighting.
In training set (iii), a random subset of positive training examples have known
labels. Other training examples may be negative or positive. Use an appropriate
weighting method. Explain how you estimate the constant c = p(s = 1|y = 1) and
discuss the accuracy of this estimate. The true value is c = 0.4; in a real application
we would not know this, of course.
For each scenario and each training method, include in your report a learning
curve figure that shows accuracy as a function of the number (1000, 2000, etc.)
of labeled training examples used. Discuss the extent to which each of the three
missing-label scenarios reduces achievable accuracy.
Chapter 9
Recommender systems
The collaborative filtering (CF) task is to recommend items to a user that he or she
is likely to like, based on ratings for different items provided by the same user and
on ratings provided by other users. The general assumption is that users who give
similar ratings to some items are likely also to give similar ratings to other items.
From a formal perspective, the input to a collaborative filtering algorithm is a
matrix of incomplete ratings. Each row corresponds to a user, while each column
corresponds to an item. If user 1 ≤ i ≤ m has rated item 1 ≤ j ≤ n, the matrix
entry xij is the value of this rating. Often rating values are integers between 1 and 5.
If a user has not rated an item, the corresponding matrix entry is missing. Missing
ratings are often represented as 0, but this should not be viewed as an actual rating
value.
The output of a collaborative filtering algorithm is a prediction of the value of
each missing matrix entry. Typically the predictions are not required to be integers.
Given these predictions, many different real-world tasks can be performed. For ex-
ample, a recommender system might suggest to a user those items for which the
predicted ratings by this user are highest.
There are two main general approaches to the formal collaborative filtering task:
nearest-neighbor-based and model-based. Given a user and item for which a predic-
tion is wanted, nearest neighbor (NN) approaches use a similarity function between
rows, and/or a similarity function between columns, to pick a subset of relevant other
users or items. The prediction is then some sort of average of the known ratings in
this subset.
Model-based approaches to collaborative filtering construct a low-complexity
representation of the complete xij matrix. This representation is then used instead of
the original matrix to make predictions. Typically each prediction can be computed
in O(1) time using only a fixed number of coefficients from the representation.
tribution while MSE is minimized by taking the mean of the distribution. In general
the mean and the median are different, so in general predictions that minimize MAE
are different from predictions that minimize MSE.
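A small numeric illustration of this point, using three made-up ratings: the median gives the lower mean absolute error, while the mean gives the lower mean squared error. The helper names are hypothetical.

```python
from statistics import mean, median

def mae(prediction, values):
    """Mean absolute error of one constant prediction."""
    return sum(abs(prediction - v) for v in values) / len(values)

def mse(prediction, values):
    """Mean squared error of one constant prediction."""
    return sum((prediction - v) ** 2 for v in values) / len(values)

ratings = [1, 1, 5]                        # hypothetical ratings for one item
m, med = mean(ratings), median(ratings)    # mean is 7/3, median is 1

median_wins_mae = mae(med, ratings) < mae(m, ratings)
mean_wins_mse = mse(m, ratings) < mse(med, ratings)
```

For skewed rating distributions like this one, a method tuned for MSE can therefore look noticeably worse when it is scored by MAE, and vice versa.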
Most matrix approximation algorithms aim to minimize MSE between the train-
ing data (a complete or incomplete matrix) and the approximation obtained from the
learned low-complexity representation. Some methods can be used with equal ease
to minimize MSE or MAE.
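$$\frac{\partial}{\partial r_i} E = \sum_{\langle i,j\rangle \in I} \frac{\partial}{\partial r_i}\, e(f(r_i, c_j), x_{ij}) = \sum_{\langle i,j\rangle \in I} \frac{\partial}{\partial f(r_i, c_j)}\, e(f(r_i, c_j), x_{ij})\; \frac{\partial}{\partial r_i} f(r_i, c_j).$$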
As before, I is the set of matrix indices for which xij is known. Consider the special
case where e(u, v) = |u − v|^p for p > 0. This case is a generalization of MAE and
MSE: p = 2 corresponds to MSE and p = 1 corresponds to MAE. We have
$$\frac{\partial}{\partial u} e(u, v) = p\,|u - v|^{p-1}\, \frac{\partial}{\partial u}|u - v| = p\,|u - v|^{p-1}\, \mathrm{sgn}(u - v).$$
Here sgn(a) is the sign of a, that is −1, 0, or 1 if a is negative, zero, or positive. For
computational purposes we can ignore the non-differentiable case u = v. Therefore
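$$\frac{\partial}{\partial r_i} E = \sum_{\langle i,j\rangle \in I} p\,|f(r_i, c_j) - x_{ij}|^{p-1}\, \mathrm{sgn}(f(r_i, c_j) - x_{ij})\, \frac{\partial}{\partial r_i} f(r_i, c_j).$$
Now suppose f(ri, cj) = ri + cj so ∂f(ri, cj)/∂ri = 1. We obtain
$$\frac{\partial}{\partial r_i} E = p \sum_{\langle i,j\rangle \in I} |r_i + c_j - x_{ij}|^{p-1}\, \mathrm{sgn}(r_i + c_j - x_{ij}).$$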
Alternatively, suppose f(ri, cj) = ri cj so ∂f(ri, cj)/∂ri = cj. We get
$$\frac{\partial}{\partial r_i} E = p \sum_{\langle i,j\rangle \in I} |r_i c_j - x_{ij}|^{p-1}\, \mathrm{sgn}(r_i c_j - x_{ij})\, c_j.$$
Given the gradients above, we apply online (stochastic) gradient descent. This means
that we iterate over each triple ⟨ri, cj, xij⟩ in the training set, compute the gradient
with respect to ri and cj based just on this one example, and perform the updates
$$r_i := r_i - \lambda\, \frac{\partial}{\partial r_i}\, e(f(r_i, c_j), x_{ij})$$
and
$$c_j := c_j - \lambda\, \frac{\partial}{\partial c_j}\, e(f(r_i, c_j), x_{ij})$$
where λ is a learning rate. Note that λ determines the step sizes for stochastic gradient
descent; no separate algorithm parameter is needed for this.
After any finite number of epochs, stochastic gradient descent does not converge
fully. The choice of 30 epochs is a type of early stopping that leads to good results by
not overfitting the training data. For the learning rate we use a decreasing schedule:
λ = 0.2/e for additive models and λ = 0.4/e for multiplicative models, where
1 ≤ e ≤ 30 is the number of the current epoch.
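A minimal sketch of these updates for a rank-one multiplicative model f(ri, cj) = ri · cj with squared error e(u, v) = (u − v)^2, whose derivative with respect to u is 2(u − v). The learning rate follows the decreasing schedule λ = base_rate/e. The function names, the initialization at 1, and the toy data are assumptions; the toy run uses a smaller base rate than the 0.4 quoted above simply to keep this tiny example well behaved.

```python
import random

def sgd_rank_one(ratings, m, n, epochs=30, base_rate=0.4, seed=0):
    """Fit x_ij ~ r[i] * c[j] by stochastic gradient descent.

    ratings: list of (i, j, x_ij) triples for the known matrix entries.
    """
    rng = random.Random(seed)
    r = [1.0] * m
    c = [1.0] * n
    data = list(ratings)
    for e in range(1, epochs + 1):
        lam = base_rate / e                  # decreasing schedule
        rng.shuffle(data)
        for i, j, x in data:
            err = r[i] * c[j] - x
            # Both gradients use the values held before this update.
            grad_r = 2 * err * c[j]
            grad_c = 2 * err * r[i]
            r[i] -= lam * grad_r
            c[j] -= lam * grad_c
    return r, c

def mean_squared_error(ratings, r, c):
    return sum((r[i] * c[j] - x) ** 2 for i, j, x in ratings) / len(ratings)

# Toy data generated from a hypothetical rank-one matrix.
r_true, c_true = [1.0, 1.2, 0.8], [1.0, 0.9, 1.1]
toy = [(i, j, r_true[i] * c_true[j]) for i in range(3) for j in range(3)]
r_fit, c_fit = sgd_rank_one(toy, m=3, n=3, base_rate=0.1)
```

Stopping after a fixed number of epochs means the fit is approximate by design; only entries present in `ratings` ever contribute to an update, so missing entries need no special handling.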
Gradient descent as described above directly optimizes precisely the same ob-
jective function (given the MSE error function) that is called “incomplete data likeli-
hood” in EM approaches to factorizing matrices with missing entries. It is sometimes
forgotten that EM is just one approach to solving maximum-likelihood problems; in-
complete matrix factorization is an example of a maximum-likelihood problem where
an alternative solution method is superior.
Quiz
For each part below, say whether the statement in italics is true or false, and then
explain your answer briefly.
(a) For any model, the average predicted rating for unrated movies is expected to
be less than the average actual rating for rated movies.
True. People are more likely to like movies that they have actually watched, than
random movies that they have not watched. Any good model should capture this fact.
In more technical language, the value of a rating is correlated with whether or
not it is missing.
(b) Let the predicted value of rating xij be ri cj + si dj and suppose ri and cj
are trained first. For all viewers i, ri should be positive, but for some i, si should be
negative.
True. Ratings xij are positive, and xij = ri cj on average, so ri and cj should
always be positive. (They could always both be negative, but that would be unintuitive
without being more expressive.)
The term si dj models the difference xij − ri cj . This difference is on average
zero, so it is sometimes positive and sometimes negative. In order to allow si dj
to be negative, si must be negative sometimes. (Making si be always positive, while
allowing dj to be negative, might be possible. However in this case the expressiveness
of the model si dj would be reduced.)
(c) We have a training set of 500,000 ratings for 10,000 viewers and 1000 movies,
and we train a rank-50 unregularized factor model. This model is likely to overfit the
training data.
True. The unregularized model has 50 · (10,000 + 1000) = 550,000 parameters,
which is more than the number of data points for training. Hence, overfitting is
practically certain.
Quiz for May 25, 2010
Page 87 of the lecture notes says
\[ r_i := r_i - \lambda \frac{\partial}{\partial r_i} e(f(r_i, c_j), x_{ij}) \]
and
\[ c_j := c_j - \lambda \frac{\partial}{\partial c_j} e(f(r_i, c_j), x_{ij}) \]
where λ is a learning rate.
State and explain what the first update rule is for the special case $e(u, v) = (u - v)^2$
and $f(r_i, c_j) = r_i \cdot c_j$.
Assignment
The goal of this assignment is to apply a collaborative filtering method to the task of
predicting movie ratings. You should use the small MovieLens dataset available at
http://www.grouplens.org/node/73. This has 100,000 ratings given by
943 users to 1682 movies.
You can select any collaborative filtering method that you like. You may reuse
existing software, or you may write your own, in Matlab or in another programming
language. Whatever your choice, you must understand fully the algorithm that you
apply, and you should explain it with mathematical clarity in your report. The method
that you choose should handle missing values in a sensible and efficient way.
When reporting final results, do five-fold cross-validation, where each rating is
assigned randomly to one fold. Note that this experimental procedure makes the task
easier, because it is likely that every user and every movie is represented in each
training set. Moreover, evaluation is biased towards users who have provided more
ratings; it is easier to make accurate predictions for these users.
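The random assignment of ratings to folds can be sketched as follows in Python. This is only one possible way to do it; the function names and the synthetic (user, movie, rating) triples are hypothetical and stand in for the real MovieLens data.

```python
import random

def assign_folds(n_ratings, n_folds=5, seed=0):
    """Assign each rating independently and uniformly at random to one fold."""
    rng = random.Random(seed)
    return [rng.randrange(n_folds) for _ in range(n_ratings)]

def split(ratings, folds, test_fold):
    """Return (training set, test set) for one round of cross-validation."""
    train = [x for x, f in zip(ratings, folds) if f != test_fold]
    test = [x for x, f in zip(ratings, folds) if f == test_fold]
    return train, test

# Hypothetical (user, movie, rating) triples standing in for the real data.
ratings = [(u, m, 1 + (u + m) % 5) for u in range(20) for m in range(10)]
folds = assign_folds(len(ratings))
sizes = [len(split(ratings, folds, k)[1]) for k in range(5)]
print(sizes)
```

Because folds are drawn per rating rather than per user, the fold sizes are only approximately equal, and most users and movies appear in every training set.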
In your report, show mean absolute error graphically as a function of a measure of
complexity of your chosen method. If you select a matrix factorization method, this
measure of complexity will likely be rank. Also show timing information graphically.
Discuss whether you could run your chosen method on the full Netflix dataset of
about 10^8 ratings. Also discuss whether your chosen method needs a regularization
technique to reduce overfitting.
Good existing software includes the following:
• Jason Rennie’s fast maximum margin matrix factorization for collaborative fil-
tering (MMMF) at http://people.csail.mit.edu/jrennie/matlab/.
You are also welcome to write your own code, or to choose other software. If you
choose other existing software, you may want to ask the instructor for comments
first.
Chapter 10
Text mining
This chapter explains how to do data mining on datasets that are collections of doc-
uments. Text mining tasks include
• classifier learning,
• clustering,
• topic modeling, and
• latent semantic analysis.
Classifiers for documents are useful for many applications. Major uses for binary
classifiers include spam detection and personalization of streams of news articles.
Multiclass classifiers are useful for routing messages to recipients.
Most classifiers for documents are designed to categorize according to subject
matter. However, it is also possible to learn to categorize according to qualitative
criteria such as helpfulness for product reviews submitted by consumers.
Classifiers are useful for ranking documents as well as for dividing them into
categories. With a training set of very helpful product reviews, and another training
set of very unhelpful reviews, we can learn a scoring function that sorts other reviews
according to their degree of helpfulness. There is often no need to pick a threshold,
which would be arbitrary, to separate marginally helpful from marginally unhelpful
reviews.
In many applications of multiclass classification, a single document can belong
to more than one category, so it is correct to predict more than one label. This task is
specifically called multilabel classification. In standard multiclass classification, the
classes are mutually exclusive, i.e. a special type of negative correlation is fixed in
advance. In multilabel classification, it is important to learn the positive and negative
correlations between classes.
Then, given a test document, we can evaluate its probability according to the
model. The higher this probability is, the more similar the test document is to the
training set.
The probability distribution that we use is the multinomial. Mathematically, this
distribution is
\[ p(x; \theta) = \frac{n!}{\prod_{j=1}^m x_j!} \prod_{j=1}^m \theta_j^{x_j} \]
where the data x are a vector of non-negative integers and the parameters θ are a
real-valued vector. Both vectors have the same length m. The components of θ are
non-negative and have unit sum: $\sum_{j=1}^m \theta_j = 1$.
Intuitively, θj is the probability of word j while xj is the count of word j. Each
time word j appears in the document it contributes an amount θj to the total proba-
bility, hence the term θj to the power xj .
Like any discrete distribution, a multinomial has to sum to one, where the sum is
over all possible data points. Here, a data point is a document containing n words.
The number of such documents is exponential in their length n: it is m^n. The
probability of any individual document will therefore be very small. What is important is
the relative probability of different documents. A document that mostly uses words
with high probability will have higher relative probability.
At first sight, computing the probability of a document requires O(m) time because
of the product over j. However, if $x_j = 0$ then $\theta_j^{x_j} = 1$ so the jth factor can
be omitted from the product. Similarly, 0! = 1 so the jth factor can be omitted from
$\prod_{j=1}^m x_j!$. Hence, computing the probability of a document needs only O(n) time.
Because the probabilities of individual documents decline exponentially with
length n, it is necessary to do numerical computations with log probabilities:
\[ \log p(x; \theta) = \log n! - \Big[ \sum_{j=1}^m \log x_j! \Big] + \Big[ \sum_{j=1}^m x_j \cdot \log \theta_j \Big] \]
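A minimal Python sketch of this log-probability computation follows. The sparse dictionary representation of a document (word index mapped to its nonzero count) and the function name are assumptions made for the example; math.lgamma(k + 1) is used to obtain log k! without overflow.

```python
import math

def log_multinomial(counts, theta):
    """Log probability of a document under a multinomial.

    counts: dict mapping word index j to its count x_j (nonzero counts only)
    theta:  sequence of word probabilities, all assumed strictly positive
    """
    n = sum(counts.values())
    logp = math.lgamma(n + 1)              # log n!
    for j, xj in counts.items():           # only words that occur: O(n) work
        logp -= math.lgamma(xj + 1)        # minus log x_j!
        logp += xj * math.log(theta[j])    # plus x_j * log theta_j
    return logp

# Hypothetical three-word vocabulary; word 0 appears twice, word 2 once.
example = log_multinomial({0: 2, 2: 1}, [0.5, 0.3, 0.2])
print(math.exp(example))
```

Words with zero count never enter the loop, which is exactly why the cost depends on the document length n and not on the vocabulary size m.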
Given a set of training documents, the maximum-likelihood estimate of the jth
parameter is
\[ \theta_j = \frac{1}{T} \sum_x x_j \]
where the sum is over all documents x belonging to the training set. The normalizing
constant is $T = \sum_x \sum_j x_j$, which is the sum of the sizes of all training documents.
If a multinomial has θj = 0 for some j, then every document with xj > 0 for this
j has zero probability, regardless of any other words in the document. Probabilities
that are perfectly zero are undesirable, so we want θj > 0 for all j. Smoothing with
where the symbol ∝ means “proportional to.” The constant c is called a pseudocount.
Intuitively, it is a notional number of appearances of word j that are assumed to exist,
regardless of the true number of appearances. Typically c is chosen in the range
0 < c ≤ 1. Because the equality $\sum_j \theta_j = 1$ must be preserved, the normalizing
constant must be $T' = mc + T$ in
\[ \theta_j = \frac{1}{T'} \Big( c + \sum_x x_j \Big). \]
In order to avoid big changes in the estimated probabilities θ_j, one should have
c < T/m.
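The estimate with a pseudocount can be sketched in Python as follows. The function name and the representation of each document as a dictionary of nonzero word counts are assumptions made for the example; setting c = 0 gives the unsmoothed maximum-likelihood estimate.

```python
def estimate_theta(documents, m, c=0.0):
    """Estimate multinomial parameters from training documents.

    documents: list of dicts mapping word index j to count x_j
    m:         vocabulary size
    c:         pseudocount added to the total count of every word
    """
    totals = [0] * m
    for counts in documents:
        for j, xj in counts.items():
            totals[j] += xj                 # sum over documents of x_j
    T = sum(totals)                         # total size of all documents
    T_prime = m * c + T                     # normalizer that keeps the sum at 1
    return [(c + t) / T_prime for t in totals]

# Hypothetical training set: two documents over a three-word vocabulary.
docs = [{0: 2, 1: 1}, {0: 1}]
unsmoothed = estimate_theta(docs, m=3)
smoothed = estimate_theta(docs, m=3, c=1.0)
print(unsmoothed, smoothed)
```

In this toy case word 2 never occurs, so its unsmoothed estimate is exactly zero, while with c = 1 every word receives positive probability.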
Technically, one multinomial is a distribution over all documents of a fixed size
n. Therefore, what is learned by the maximum-likelihood process just described is
in fact a different distribution for each size n. These distributions, although separate,
have the same parameter values.
Generative process. Sampling with replacement.
the sum is taken over the documents in one class. Unfortunately, the exact value of c
can strongly influence classification accuracy.
As discussed in Chapter ?? above, when one class of documents is rare, it is not
reasonable to use accuracy to measure the success of a classifier for documents.
Instead, it is common to use the so-called F-measure. This measure
is the harmonic mean of precision and recall:
\[ f = \frac{2}{1/p + 1/r} \]
where p and r are precision and recall for the rare class.
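As a small worked example, the harmonic mean of precision p and recall r is 2/(1/p + 1/r), which equals 2pr/(p + r). The Python sketch below computes it from counts for the rare class; the function name and the convention of returning 0 when there are no true positives are assumptions made for the example.

```python
def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall for the rare (positive) class."""
    if tp == 0:
        return 0.0                    # assumed convention: no true positives
    p = tp / (tp + fp)                # precision
    r = tp / (tp + fn)                # recall
    return 2.0 / (1.0 / p + 1.0 / r)

# Hypothetical counts: precision 0.8, recall 0.5.
score = f_measure(tp=8, fp=2, fn=8)
print(score)
```

Note that true negatives do not appear anywhere, so a classifier cannot score well merely by labeling everything as the common class.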
10.4 Burstiness
The multinomial model says that each appearance of the same word j always has
the same probability θj . In reality, additional appearances of the same word are less
surprising, i.e. they have higher probability. Consider the following excerpt from a
newspaper article, for example.
The multinomial distribution arises from a process of sampling words with
replacement. An alternative distribution named the Dirichlet compound multinomial
(DCM) arises from an urn process that captures the authorship process better.
Consider a bucket with balls of |V | different colors. After a ball is selected randomly,
it is not just replaced, but also one more ball of the same color is added. Each time a
ball is drawn, the chance of drawing the same color again is increased. This increased
probability models the phenomenon of burstiness.
Let the initial number of balls with color j be βj . These initial values are the
parameters of the DCM distribution. The DCM parameter vector β has length |V |,
like the multinomial parameter vector, but the sum of the components of β is un-
constrained. This one extra degree of freedom allows the DCM to discount multiple
observations of the same word, in an adjustable way. The smaller the parameter values
βj are, the more words are bursty.
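The urn process can be simulated directly. The Python sketch below is an illustration under stated assumptions: the function name is hypothetical, and the initial "numbers of balls" β_j are allowed to be fractional weights, which the standard library function random.choices accepts.

```python
import random

def sample_dcm_document(beta, n, seed=None):
    """Draw word counts for one document of length n from the urn process."""
    rng = random.Random(seed)
    weights = list(beta)                  # current number of balls of each color
    counts = [0] * len(beta)
    for _ in range(n):
        j = rng.choices(range(len(weights)), weights=weights)[0]
        counts[j] += 1
        weights[j] += 1.0                 # put the ball back plus one more like it
    return counts

# Hypothetical parameters: small beta values should give bursty documents.
bursty = sample_dcm_document([0.1, 0.1, 0.1, 0.1], n=20, seed=1)
smooth = sample_dcm_document([50.0, 50.0, 50.0, 50.0], n=20, seed=1)
print(bursty, smooth)
```

With small β_j the first few draws dominate the urn, so a few words tend to account for most of the document; with large β_j each added ball changes the proportions little and the process behaves almost like a multinomial.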
Above, tp is the number of positive training examples containing the word, and fn
is the number of these examples not containing the word, while fp is the number
of negative training examples containing the word, and tn is the number of these
examples not containing the word. If any of these numbers is zero, we replace it by
0.5, which of course is less than any of these numbers that is genuinely non-zero.
Notice that in the formula above the positive and negative classes are treated in a
perfectly symmetric way. The value | log tp/fn − log fp/tn | is large if tp/fn and
fp/tn have very different values. In this case the word is highly diagnostic for at
least one of the two classes.
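The quantity | log tp/fn − log fp/tn | with the zero-replacement rule can be computed as in the following Python sketch. The function name is hypothetical; the four counts are as defined in the text.

```python
import math

def logodds_score(tp, fn, fp, tn):
    """Absolute difference of log odds of a word in the two classes."""
    # Replace any zero count by 0.5 so that every ratio and log is defined.
    tp, fn, fp, tn = [0.5 if v == 0 else v for v in (tp, fn, fp, tn)]
    return abs(math.log(tp / fn) - math.log(fp / tn))

# Hypothetical counts for three words.
uninformative = logodds_score(tp=10, fn=10, fp=10, tn=10)
diagnostic = logodds_score(tp=20, fn=5, fp=5, tn=20)
never_positive = logodds_score(tp=0, fn=10, fp=5, tn=5)
print(uninformative, diagnostic, never_positive)
```

Swapping the roles of the two classes exchanges tp with fp and fn with tn, which leaves the absolute value unchanged; this is the symmetry noted above.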
10.6 Clustering documents
Here, K is the number of components in the mixture model. For each k, p(x; θk ) is
the distribution of component number k. The scalar αk is the proportion of compo-
nent number k.
Each component is a cluster.
where i ranges over the classes, tp_i is the number of training examples in class i
containing the word, and fn_i is the number of these examples not containing the
word.
It is also not known if combining the log transformation and logodds weighting
is beneficial, as in
Quiz
(a) Explain why, with a multinomial distribution, “the probabilities of individual doc-
uments decline exponentially with length n.”
The probability of document x of length n according to a multinomial distribution
is
\[ p(x; \theta) = \frac{n!}{\prod_{j=1}^m x_j!} \prod_{j=1}^m \theta_j^{x_j}. \]
[Rough argument.] Each θ_j value is less than 1. In total $n = \sum_j x_j$ of these
values are multiplied together. Hence as n increases the product $\prod_{j=1}^m \theta_j^{x_j}$
decreases exponentially. Note the multinomial coefficient $n! / \prod_{j=1}^m x_j!$ does
increase with n, but more slowly.
(b) Consider the multiclass Bayesian classifier
\[ \hat{y} = \arg\max_k \frac{p(x \mid y = k)\, p(y = k)}{p(x)}. \]
Simplify the expression inside the argmax operator as much as possible, given that
the model p(x|y = k) for each class is a multinomial distribution.
The denominator p(x) is the same for all k, so it does not influence which k is the
argmax. Within the multinomial distributions p(x|y = k), the multinomial coefficient
does not depend on k, so it is constant for a single x and it can be eliminated also,
giving
\[ \hat{y} = \arg\max_k \; p(y = k) \prod_{j=1}^m \theta_{kj}^{x_j}. \]
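In practice the simplified argmax is evaluated with logarithms, since the product of many θ values underflows for long documents. The Python sketch below is one way to do this; the function name, the sparse document representation, and the toy parameters are assumptions made for the example.

```python
import math

def classify(counts, log_priors, log_thetas):
    """Return the class k maximizing log p(y=k) + sum_j x_j * log theta_kj.

    counts:     dict mapping word index j to count x_j
    log_priors: log_priors[k] = log p(y = k)
    log_thetas: log_thetas[k][j] = log theta_kj
    """
    best_k, best_score = None, -math.inf
    for k, (lp, lt) in enumerate(zip(log_priors, log_thetas)):
        score = lp + sum(xj * lt[j] for j, xj in counts.items())
        if score > best_score:
            best_k, best_score = k, score
    return best_k

# Hypothetical two-class problem over a three-word vocabulary.
thetas = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
log_thetas = [[math.log(t) for t in row] for row in thetas]
log_priors = [math.log(0.5), math.log(0.5)]
print(classify({0: 3, 2: 1}, log_priors, log_thetas))
print(classify({2: 2}, log_priors, log_thetas))
```

Neither p(x) nor the multinomial coefficient is computed, because both are the same for every class k.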
\[ \hat{y} = I\Big( \log p(y = 1) + \sum_{j=1}^m x_j \log \theta_{1j} - \log p(y = 0) - \sum_{j=1}^m x_j \log \theta_{0j} > 0 \Big). \]
where the coefficients are c0 = log p(y = 1) − log p(y = 0) and cj = log θ1j −
log θ0j . The expression inside the indicator function is a linear function of x.
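The coefficients c_0 and c_j can be computed once from the two trained multinomials, after which classification is a single sparse dot product. The Python sketch below assumes that class 1 is predicted when the linear function is strictly positive and that p(y = 0) = 1 − p(y = 1); the function names and toy parameters are hypothetical.

```python
import math

def linear_coefficients(prior1, theta1, theta0):
    """Turn a two-class multinomial Bayesian classifier into linear weights."""
    c0 = math.log(prior1) - math.log(1.0 - prior1)     # log p(y=1) - log p(y=0)
    c = [math.log(t1) - math.log(t0) for t1, t0 in zip(theta1, theta0)]
    return c0, c

def predict(counts, c0, c):
    """Predict 1 if c0 + sum_j c_j * x_j is positive, else 0."""
    return 1 if c0 + sum(xj * c[j] for j, xj in counts.items()) > 0 else 0

# Hypothetical parameters: word 2 favors class 1, word 0 favors class 0.
c0, c = linear_coefficients(0.5, theta1=[0.1, 0.2, 0.7], theta0=[0.7, 0.2, 0.1])
print(predict({2: 2}, c0, c), predict({0: 3, 2: 1}, c0, c))
```

A word with the same probability in both classes gets coefficient zero, so it has no influence on the prediction.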
Quiz for May 18, 2010
Consider the task of learning to classify documents into one of two classes, using the
bag-of-words representation. Explain why a regularized linear SVM is expected to
be more accurate than a Bayesian classifier using one maximum-likelihood multino-
mial for each class.
Assignment (2009)
The purpose of this assignment is to compare two different approaches to learning
a classifier for text documents. The dataset to use is called Classic400. It consists
of 400 documents from three categories over a vocabulary of 6205 words. The cate-
gories are quite distinct from a human point of view, and high accuracy is achievable.
First, you should try Bayesian classification using a multinomial model for each
of the three classes. Note that you may need to smooth the multinomials with pseu-
docounts. Second, you should train a support vector machine (SVM) classifier. You
will need to select a method for adapting SVMs for multiclass classification. Since
there are more features than training examples, regularization is vital.
For each of the two classifier learning methods, investigate whether you can
achieve better accuracy via feature selection, feature transformation, and/or feature
weighting. Because there are relatively few examples, using ten-fold cross-validation
to measure accuracy is suggested. Try to evaluate the statistical significance of dif-
ferences in accuracy that you find. If you allow any leakage of information from the
test fold to training folds, be sure to explain this in your report.
The dataset is available at http://www.cs.ucsd.edu/users/elkan/
291/classic400.zip, which contains three files. The main file cl400.csv is
a comma-separated 400 by 6205 array of word counts. The file truelabels.csv
gives the actual class of each document, while wordlist gives the string corre-
sponding to the word indices 1 to 6205. These strings are not needed for train-
ing or applying classifiers, but they are useful for interpreting classifiers. The file
classic400.mat contains the same three files in the form of Matlab matrices.
Assignment due on May 18, 2010
The purpose of this assignment is to compare two different approaches to learning a
binary classifier for text documents. The dataset to use is the movie review polarity
dataset, version 2.0, published by Lillian Lee at http://www.cs.cornell.
edu/People/pabo/movie-review-data/. Be sure to read the README
carefully, and to understand the data and task fully.
First, you should try Bayesian classification using a multinomial model for each
of the two classes. You should smooth the multinomials with pseudocounts. Sec-
ond, you should train a linear discriminative classifier, either logistic regression or a
support vector machine. Since there are more features than training examples, regu-
larization is vital.
For both classifier learning methods, investigate whether you can achieve better
accuracy via feature selection, feature transformation, and/or feature weighting. Try
to evaluate the statistical significance of differences in accuracy that you find. Think
carefully about any ways in which you may be allowing leakage of information from
test subsets that makes estimates of accuracy be biased. Discuss this issue in your
report.
Compare the accuracy that you can achieve with accuracies reported in some of
the many published papers that use this dataset. You can find links to these papers at
http://www.cs.cornell.edu/People/pabo/movie-review-data/
otherexperiments.html. Also, analyze your trained classifier to identify what
features of movie reviews are most indicative of the review being favorable or unfa-
vorable.
Feedback on this text mining assignment: Leakage, overfitting, over 90% accu-
racy. For SVMs the C parameter is not the strength of regularization but rather the
reverse. Those who did not try strong regularization found worse performance with
an SVM than with the Bayesian classifier.
Chapter 11
Social network analytics
A social network is a graph where nodes represent individual people and edges rep-
resent relationships.
Example: Telephone network. Nodes are subscribers, and edges are phone calls.
Call detail records (CDRs).
There are two types of data mining one can do with a network: supervised and
unsupervised. The aim of supervised learning is to obtain a model that can predict
labels for nodes, or labels for edges. For nodes, this is sometimes called collective
classification. For edges, the most basic label is existence. Predicting whether or not
edges exist is called link prediction.
Examples of tasks that involve collective classification: predicting churn of sub-
scribers; recognizing fraudulent applicants.
Examples of tasks that involve link prediction: suggesting new friends on Face-
book, identifying appropriate citations between scientific papers, predicting which
pairs of terrorists know each other.
108 CHAPTER 11. SOCIAL NETWORK ANALYTICS
at training time. We will not solve the cold-start problem of making predictions for
nodes that are not known during training.
Social networks, and other graphs in data mining, can be complex. In particular,
edges may be directed or undirected, they can be of multiple types, and/or they can
be weighted. The graph can be bipartite or not.
Given a social network, nodes have two fundamental characteristics. First, they
have identity. This means that we know which nodes are the same and which are
different in the graph, but we do not necessarily know anything else about nodes.
Second, nodes may be associated with vectors that specify the values of features.
These vectors are sometimes called side-information.
For many data mining tasks where the examples are people, there are five general
types of feature: personal demographic, group demographic, behavioral, incentive,
and social. If group demographic features have predictive power, it is often because
they are reflections of social features. For example, if zipcode is predictive of the
brand of automobile a person will purchase, that is likely because of contagion effects
between people who know each other, or see each other on the streets.
2. Low rank approximation of the adjacency matrix. Empirically, the Enron email
adjacency matrix has rank approximately 2 only
2. Sum of neighbors and sum of papers are next most predictive; these features
measure the propensity of an individual as opposed to any interaction between
individuals.
if labels are not independent, then in principle a classifier can achieve higher accu-
racy by predicting labels for related examples simultaneously. This situation is called
collective classification.
Suppose that examples are nodes in a graph, and nodes are joined by edges.
Edges can have labels and/or weights, so in effect multiple graphs can be overlaid
on the same examples. For example, if nodes are persons then one set of edges may
represent the “same address” relationship while another set may represent the “made
telephone call to” relationship.
Intuitively, the labels of neighbors are often correlated. Given a node x, we would
like to use the labels of the neighbors of x as features when predicting the label of x,
and vice versa. A principled approach to collective classification would find mutu-
ally consistent predicted labels for x and its neighbors. However, in general there is
no guarantee that mutually consistent labels are unique, and there is no general algo-
rithm for inferring them. Experimentally, a simple iterative algorithm often performs
as well as more sophisticated methods for collective classification.
Given a node x, let N (x) be the set of its neighbors. Let S(x) be the bag of
labels of nodes in N (x), where we allow “unknown” as a special label value. Let
g(x) be some representation of S(x), perhaps using aggregate operators. A classifier
is a function f (x, g(x)). A training set may include examples with known labels and
examples with unknown labels. Let L be the training examples with known labels.
The examples that are used in training are the set
\[ E = \bigcup_{x \in L} N(x). \]
Given a trained classifier f (x, g(x)), the algorithm for classifying test examples is
the following.
The algorithm above is purely heuristic, but it is sensible and often effective in prac-
tice.
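One common form of such an iterative procedure is sketched below in Python. This is an assumption about what a simple iterative algorithm looks like, not a transcription of the algorithm in these notes: the function names, the choice of label counts as the representation g(x), the stopping rule, and the majority-vote stand-in for the trained classifier f are all hypothetical.

```python
from collections import Counter

UNKNOWN = "unknown"

def neighbor_label_counts(node, neighbors, labels):
    """g(x): counts of each label among N(x), with 'unknown' as a label value."""
    return Counter(labels.get(nb, UNKNOWN) for nb in neighbors[node])

def iterative_classification(test_nodes, neighbors, known_labels, f, max_rounds=10):
    """Repeatedly relabel test nodes using current neighbor labels as features."""
    labels = dict(known_labels)
    for _ in range(max_rounds):
        changed = False
        for x in test_nodes:
            new = f(x, neighbor_label_counts(x, neighbors, labels))
            if labels.get(x, UNKNOWN) != new:
                labels[x] = new
                changed = True
        if not changed:                   # predictions are mutually consistent
            break
    return labels

def majority_vote(x, g):
    """Stand-in for a trained classifier f(x, g(x)): most common known label."""
    known = {label: n for label, n in g.items() if label != UNKNOWN}
    return max(known, key=known.get) if known else UNKNOWN

# Hypothetical chain a - b - c where only the label of a is known.
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
result = iterative_classification(["b", "c"], neighbors, {"a": "spam"}, majority_vote)
print(result)
```

The loop stops when a full pass changes no label, or after a fixed number of rounds, since convergence is not guaranteed in general.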
11.6. OTHER TOPICS 111
Consider the task of predicting which pairs of nodes in a social network are linked.
Specifically, suppose you have a training set of pairs for which edges are known to
exist, or known not to exist. You need to make predictions for the remaining edges.
Let the adjacency matrix A have dimension n × n. You have trained a matrix
factorization A = UV, where U has dimension n × k for some k ≪ n. Because A
is symmetric, V is the transpose of U .
Let ui be the row of U that represents node i. Let ⟨j, k⟩ be a pair for which you
need to predict whether an edge exists. Consider these two possible ways to make
this prediction:
Explain which of these two approaches is best, and why. (Do not mention any other
approaches, which are outside the scope of this question.)
Assignment due on June 1, 2010
The purpose of this assignment is to apply and evaluate methods to predict the pres-
ence of links that are unknown in a social network. Use either the Cora dataset or
the Terrorists dataset. Both datasets are available at
http://www.cs.umd.edu/~sen/lbc-proj/LBC.html.
In the Cora dataset, each node represents an academic paper. Each paper has a
label, where the seven label values are different subareas of machine learning. Each
paper is also represented by a bag-of-words vector of length 1433. The network
structure is that each paper is cited by, or cites, at least one other paper in the dataset.
The Terrorists dataset is more complicated. For explanations see the paper Entity
and relationship labeling in affiliation networks by Zhao, Sen, and Getoor from the
2006 ICML Workshop on Statistical Network Analysis, available at http://www.
mindswap.org/papers/2006/RelClzPIT.pdf. According to Table 1 and
Section 6 of this paper, there are 917 edges that connect 435 terrorists. There is
information about each terrorist, and also about each edge. The edges are labeled
with types.
For either dataset, your task is to pretend that some edges are unknown, and then
to predict these, based on the known edges and on the nodes. Suppose that you are
doing ten-fold cross-validation on the Terrorists dataset. Then you would pretend
that 92 actual edges are unknown. Based on the 917 − 92 = 825 known edges, and
on all 435 nodes, you would predict a score for each of the 435 · 434/2 potential
edges. The higher the score of the 92 held-out edges, the better.
You should use logistic regression to predict the score of each potential edge.
When predicting edges, you may use features obtained from (i) the network structure
of the edges known to be present and absent, (ii) properties of the nodes, and/or
(iii) properties of pairs of nodes. For (i), use a method for converting the network
structure into a vector of fixed length of feature values for each node, as discussed in
class. Using one or more of the feature types (i), (ii), and (iii) yields seven alternative
sets of features. Do experiments comparing at least two of these.
Each experiment should use logistic regression and ten-fold cross-validation.
Think carefully about exactly what information is legitimately included in training
folds. Also, avoid the mistakes discussed in Section 5.7 above.
Chapter 12
Interactive experimentation
Quiz
The following text is from a New York Times article dated May 30, 2009.
Mr. Herman had run 27 ads on the Web for his client Vespa, the
scooter company. Some were rectangular, some square. And the text
varied: One tagline said, “Smart looks. Smarter purchase,” and dis-
played a $0 down, 0 percent interest offer. Another read, “Pure fun. And
function,” and promoted a free T-shirt.
Vespa’s goal was to find out whether a financial offer would attract
customers, and Mr. Herman’s data concluded that it did. The $0 down
offer attracted 71 percent more responses from one group of Web surfers
than the average of all the Vespa ads, while the T-shirt offer drew 29
percent fewer.
(a) What basic principle of the scientific method did Mr. Herman not follow,
according to the description above?
(b) Suppose that it is true that the financial offer does attract the highest number of
purchasers. What is an important reason why the other offer might still be preferable
for the company?
(c) Explain the writing mistake in the phrase “Mr. Herman’s data concluded that
it did.” Note that the error is relatively high-level; it is not a spelling or grammar
mistake.