We'll use:
WEKA (Waikato Environment for Knowledge Analysis)
Free (GPLed) Java package with GUI
Online at www.cs.waikato.ac.nz/ml/weka
Witten and Frank, 2000. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations.
R packages
E.g. rpart, class, tree, nnet, cclust, deal, GeneSOM, knnTree, mlbench, randomForest, subselect
Numeric prediction
Regression trees, model trees
Classification
Methods for predicting a discrete response
One kind of supervised learning
Note: in biological and other sciences, classification has long had a different meaning, referring to cluster analysis
Applications include:
Identifying good prospects for specific marketing or sales efforts
Cross-selling, up-selling: when to offer products
Customers likely to be especially profitable
Customers likely to defect
Weather/Game-Playing Data
Small dataset
14 instances
5 attributes
Outlook - nominal
Temperature - numeric
Humidity - numeric
Windy - nominal
Play - nominal
Whether or not a certain game would be played
This is what we want to understand and predict
@data
%
% 14 instances
%
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
rainy, 70, 96, false, yes
rainy, 68, 80, false, yes
rainy, 65, 70, true, no
overcast, 64, 65, true, yes
sunny, 72, 95, false, no
sunny, 69, 70, false, yes
rainy, 75, 80, false, yes
sunny, 75, 70, true, yes
overcast, 72, 90, true, yes
overcast, 81, 75, false, yes
rainy, 71, 91, true, no
Credit Data
Want to predict credit risk
Data available at the UCI machine learning data repository:
http://www.ics.uci.edu/~mlearn/MLRepository.html
Classification Algorithms
Many methods available in WEKA
0R, 1R, NaiveBayes, DecisionTable, ID3, PRISM, Instance-based learner (IB1, IBk), C4.5 (J48), PART, Support vector machine (SMO)
Usually train on part of the data, test on the rest
0R Algorithm
Simplest method: the zero-rule, or 0R
Predict the most common category
Class ZeroR in WEKA
Too simple for practical use, but a useful baseline for evaluating performance of more complex methods
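A minimal sketch of the idea in base R (not WEKA's ZeroR class itself):

# 0R: memorise the most common category and predict it for every case;
# assumes 'y' is a factor of training responses.
zero_r <- function(y) {
  counts <- table(y)
  majority <- names(counts)[which.max(counts)]
  function(n_new) rep(majority, n_new)   # same prediction for all new cases
}

# With the weather data's Play attribute (9 yes, 5 no):
play <- factor(c(rep("yes", 9), rep("no", 5)))
predict_0r <- zero_r(play)
predict_0r(3)   # "yes" "yes" "yes"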
1R Algorithm
Look at the error rate for each single predictor on the training dataset, and choose the best predictor
Called OneR in WEKA
Must group numerical predictor values for this method
Common method is to split at each change in the response
Collapse buckets until each contains at least 6 instances
1R Algorithm (continued)
Biased towards predictors with more categories
These can result in over-fitting to the training data
Error rate is often only a few percentage points higher than that of more sophisticated methods (e.g. decision trees)
The rules produced are much simpler and more easily understood
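A rough base-R sketch of 1R for nominal predictors (numeric attributes would first need grouping as described above; the weather data frame in the final comment is hypothetical):

# 1R: for each attribute, predict the majority class within each of its
# values; keep the attribute with the lowest training error.
one_r <- function(x, y) {    # x: data frame of factors, y: factor response
  errors <- sapply(x, function(attr) {
    tab <- table(attr, y)
    sum(tab) - sum(apply(tab, 1, max))   # misclassified training cases
  })
  best <- names(which.min(errors))
  tab <- table(x[[best]], y)
  rules <- apply(tab, 1, function(counts) names(which.max(counts)))
  list(attribute = best, rules = rules, errors = errors)
}
# e.g. one_r(weather[c("outlook", "windy")], weather$play)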
Naive Bayes
Response value with highest probability is predicted
Numeric attributes are assumed to follow a normal distribution within each response value
Contribution to probability calculated from normal density function
Instead can use kernel density estimate, or simply discretise the numerical attributes
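As an illustration, the normal-density contribution of the Temperature attribute, with values and class counts taken from the weather data above:

# Naive Bayes contribution of Temperature, assuming a normal
# distribution within each response value.
temp_yes <- c(83, 70, 68, 64, 69, 75, 75, 72, 81)   # play = yes
temp_no  <- c(85, 80, 65, 72, 71)                   # play = no

x <- 66   # a new case's temperature
lik_yes <- dnorm(x, mean(temp_yes), sd(temp_yes))   # density under "yes"
lik_no  <- dnorm(x, mean(temp_no),  sd(temp_no))    # density under "no"

# combine with the class priors (9 yes, 5 no) and normalise:
post <- c(yes = 9/14 * lik_yes, no = 5/14 * lik_no)
post / sum(post)   # predict the response value with highest probability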
Decision Trees
Classification rules can be expressed in a tree structure
Move from the top of the tree, down through various nodes, to the leaves
At each node, a decision is made using a simple test based on attribute values
The leaf you reach holds the appropriate predicted value
If x=1 and y=1 then class = a
If z=1 and w=1 then class = a
Otherwise class = b
For accurate predictions, want leaf nodes to be as pure as possible
Choose the attribute that maximises the average purity of the daughter nodes
The measure of purity used is the entropy of the node
This is the amount of information needed to specify the value of an instance in that node, measured in bits
entropy(p_1, p_2, \ldots, p_n) = -\sum_{i=1}^{n} p_i \log_2 p_i
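In R, taking 0 log 0 = 0:

# entropy of a node, given its vector of class proportions
entropy <- function(p) {
  p <- p[p > 0]
  -sum(p * log2(p))
}
entropy(c(9/14, 5/14))   # 0.940 bits: the weather data's root node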
Weather Example
First node from outlook split is for sunny, with entropy -2/5 log2(2/5) - 3/5 log2(3/5) = 0.971
Average entropy of nodes from outlook split is
5/14 x 0.971 + 4/14 x 0 + 5/14 x 0.971 = 0.693
Entropy of root node is 0.940 bits
Gain of 0.247 bits
Other splits yield:
Gain(temperature) = 0.029 bits
Gain(humidity) = 0.152 bits
Gain(windy) = 0.048 bits
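These figures can be reproduced with the entropy() function defined earlier:

root     <- entropy(c(9/14, 5/14))   # 0.940 bits
sunny    <- entropy(c(2/5, 3/5))     # 0.971 bits
overcast <- entropy(c(4/4))          # 0 bits
rainy    <- entropy(c(3/5, 2/5))     # 0.971 bits
after    <- 5/14 * sunny + 4/14 * overcast + 5/14 * rainy   # 0.693 bits
root - after                         # Gain(outlook) = 0.247 bits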
Several more improvements have been made to handle numeric attributes (via univariate splits), missing values and noisy data (via pruning)
Resulting algorithm known as C4.5
Described by Quinlan (1993)
Widely used (as is the commercial version C5.0)
WEKA has a version called J4.8
Classification Trees
Described (along with regression trees) in:
L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone, 1984. Classification and Regression Trees.
CART also incorporates methods for pruning, missing values and numeric attributes
Multivariate splits are possible, as well as univariate: split on a linear combination \sum_j c_j x_j > d
CART typically uses the Gini measure of node purity to determine best splits
This is of the form \sum_j p_j (1 - p_j)
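For comparison with entropy, a one-line R version of this measure:

# Gini measure of node purity, given a vector of class proportions
gini <- function(p) sum(p * (1 - p))
gini(c(9/14, 5/14))   # 0.459 for the weather data's root node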
Regression Trees
Trees can also be used to predict numeric attributes
Predict using average value of the response in the appropriate node
Implemented in CART and C4.5 frameworks
Another approach that often works well is to fit the tree, remove all training cases that are not correctly predicted, and refit the tree on the reduced dataset
Typically gives a smaller tree
This usually works almost as well on the training data
But generalises better, e.g. works better on test data
=== Run information ===
Scheme:       weka.classifiers.j48.J48 -C 0.25 -M 2
Relation:     german_credit
Instances:    1000
Attributes:   21
Number of Leaves  : 103
Size of the tree  : 140
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances       739     73.9 %
Incorrectly Classified Instances     261     26.1 %
Kappa statistic                     0.3153
Mean absolute error                 0.3241
Root mean squared error             0.4604
Relative absolute error            77.134  %
Root relative squared error       100.4589 %
Total Number of Instances           1000

=== Detailed Accuracy By Class ===
TP Rate  FP Rate  Precision  Recall  F-Measure  Class
 0.883    0.597     0.775     0.883    0.826    good
 0.403    0.117     0.596     0.403    0.481    bad

=== Confusion Matrix ===
   a   b   <-- classified as
 618  82 |  a = good
 179 121 |  b = bad
Cross-Validation
Due to over-fitting, cannot estimate prediction error directly on the training dataset
Cross-validation is a simple and widely used method for estimating prediction error
Simple approach:
Set aside a test dataset
Train learner on the remainder (the training dataset)
Estimate prediction error by using the resulting prediction model on the test dataset
This is only feasible where there is enough data to set aside a test dataset and still have enough to reliably train the learning algorithm
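A minimal R sketch of this holdout approach, using the car.test.frame data shipped with the rpart package purely as an example:

library(rpart)
set.seed(1)
n <- nrow(car.test.frame)
test_idx <- sample(n, round(n / 3))       # set aside a third as test data
fit  <- rpart(Mileage ~ Weight, data = car.test.frame[-test_idx, ])
pred <- predict(fit, car.test.frame[test_idx, ])
mean((pred - car.test.frame$Mileage[test_idx])^2)   # estimated prediction error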
k-fold Cross-Validation
For smaller datasets, use k-fold cross-validation
Split dataset into k roughly equal parts
[Diagram: dataset divided into k parts (Tr Tr Test Tr Tr Tr Tr Tr Tr), with one part held out as the test set]
For each part, train on the other k-1 parts and use this part as the test dataset
Do this for each of the k parts, and average the resulting prediction errors
This method measures the prediction error when training the learner on a fraction (k-1)/k of the data
If k is small, this will overestimate the prediction error
k=10 is usually enough
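A sketch of k-fold cross-validation in R, again using rpart's car.test.frame as the example:

library(rpart)
set.seed(1)
k <- 10
n <- nrow(car.test.frame)
fold <- sample(rep(1:k, length.out = n))   # assign each case to a fold
errors <- sapply(1:k, function(i) {
  fit  <- rpart(Mileage ~ Weight, data = car.test.frame[fold != i, ])
  pred <- predict(fit, car.test.frame[fold == i, ])
  mean((pred - car.test.frame$Mileage[fold == i])^2)
})
mean(errors)   # average prediction error over the k folds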
Regression Tree Example
Call: rpart(formula = Mileage ~ Weight, data = car.test.frame)
n = 60

          CP nsplit rel error    xerror       xstd
1 0.59534912      0 1.0000000 1.0322233 0.17981796
2 0.13452819      1 0.4046509 0.6081645 0.11371656
3 0.01282843      2 0.2701227 0.4557341 0.09178782
4 0.01000000      3 0.2572943 0.4659556 0.09134201
Node number 1: 60 observations, complexity param=0.5953491
  mean=24.58333, MSE=22.57639
  left son=2 (45 obs) right son=3 (15 obs)
  Primary splits:
      Weight < 2567.5 to the right, improve=0.5953491, (0 missing)

Node number 2: 45 observations, complexity param=0.1345282
  mean=22.46667, MSE=8.026667
  left son=4 (22 obs) right son=5 (23 obs)
  Primary splits:
      Weight < 3087.5 to the right, improve=0.5045118, (0 missing)

(continued on next page)
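For reference, commands along these lines produce output like the above (the xerror column will vary, since it comes from cross-validation):

library(rpart)
fit <- rpart(Mileage ~ Weight, data = car.test.frame)
printcp(fit)           # the CP table shown above
summary(fit)           # the node-by-node details
plotcp(fit)            # cross-validated error against cp
plot(fit); text(fit)   # draw the tree with node labels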
[Figure: plot of cross-validated error against tree size, with complexity parameter (cp) axis values Inf, 0.28, 0.042, 0.011]
[Figure: fitted regression tree; root node 24.58 (n=60) splits into 22.47 (n=45) and 30.93 (n=15); the 22.47 node splits into 20.41 (n=22) and 24.43 (n=23)]
Classification Methods
Project the attribute space into decision regions
Decision trees: piecewise constant approximation
Logistic regression: linear log-odds approximation
Discriminant analysis and neural nets: linear & non-linear separators
They can be useful tools for learning from examples to find patterns in data and predict outputs
However on their own, they tend to overfit the training data
Meta-learning tools are needed to choose the best fit
ANNs have been applied to data editing and imputation, but not widely
Bagging
Take B bootstrap samples of the training data (sampling cases with replacement)
Fit the model/learner on each bootstrap sample
The bagged estimate is the average prediction from all these B models
E.g. for a tree learner, the bagged estimate is the average prediction from the resulting B trees
Note that this is not a tree
In general, bagging a model or learner does not produce a model or learner of the same form
Bagging reduces the variance of unstable procedures like regression trees, and can greatly improve prediction accuracy
However it does not always work for poor 0-1 predictors
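A minimal R sketch of bagging a regression tree (car.test.frame from the rpart package again serves as the example; B = 50 is arbitrary):

library(rpart)
set.seed(1)
B <- 50
n <- nrow(car.test.frame)
trees <- lapply(1:B, function(b) {
  idx <- sample(n, n, replace = TRUE)      # bootstrap sample of the cases
  rpart(Mileage ~ Weight, data = car.test.frame[idx, ])
})
# bagged estimate: average the B trees' predictions (not itself a tree)
preds  <- sapply(trees, predict, newdata = car.test.frame)
bagged <- rowMeans(preds)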
Boosting
Predict using a weighted majority vote: sign(\sum_m \alpha_m G_m(x)), where G_m(x) is the prediction from model m and \alpha_m is its weight
Then adjust weights for incorrectly classified cases by multiplying them by exp(\alpha_m), and repeat
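A rough sketch of this scheme (AdaBoost.M1 with rpart stumps; the function names are ours, and a two-class factor response with 0 < err < 1 at each step is assumed):

library(rpart)
ada_boost <- function(formula, data, M = 10) {
  n <- nrow(data)
  w <- rep(1/n, n)                       # start with equal case weights
  y <- model.response(model.frame(formula, data))
  models <- vector("list", M)
  alpha  <- numeric(M)
  for (m in 1:M) {
    data$.w <- w   # rpart looks weights up in 'data'; assumes the formula
                   # lists its predictors explicitly (not y ~ .)
    fit  <- rpart(formula, data, weights = .w,
                  control = rpart.control(maxdepth = 1))   # a stump
    pred <- predict(fit, data, type = "class")
    miss <- as.numeric(pred != y)
    err  <- sum(w * miss) / sum(w)
    alpha[m] <- log((1 - err) / err)     # model weight
    w <- w * exp(alpha[m] * miss)        # up-weight misclassified cases
    models[[m]] <- fit
  }
  list(models = models, alpha = alpha)
}

# weighted majority vote, sign(sum_m alpha_m Gm(x)):
ada_predict <- function(ab, newdata) {
  lev <- levels(predict(ab$models[[1]], newdata, type = "class"))
  votes <- sapply(seq_along(ab$models), function(m) {
    pm <- predict(ab$models[[m]], newdata, type = "class")
    ab$alpha[m] * ifelse(pm == lev[1], 1, -1)
  })
  ifelse(rowSums(votes) > 0, lev[1], lev[2])
}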
Association Rules
Data on n purchase baskets in the form (id, item1, item2, ..., itemk)
For example, purchases from a supermarket
May be useful for product placement decisions or cross-selling recommendations
We say there is an association rule i1 -> i2 if:
i1 and i2 occur together in at least s% of the n baskets (the support)
At least c% of the baskets containing item i1 also contain i2 (the confidence)
The confidence criterion ensures that the rule holds in a large enough proportion of the antecedent cases to be interesting
The support criterion should be large enough that the resulting rules have practical importance
Also helps to ensure reliability of the conclusions
Association rules
The support/confidence approach is widely used
Efficiently implemented in the Apriori algorithm
First identify item sets with sufficient support
Then turn each item set into sets of rules with sufficient confidence
This method was originally developed in the database community, so there has been a focus on efficient methods for large databases
Large means up to around 100 million instances, and about ten thousand binary attributes
However this approach can find a vast number of rules, and it can be difficult to make sense of these
One useful extension is to identify only the rules with high enough lift (or odds ratio)
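Support, confidence and lift are straightforward to compute directly in R; the two-item basket matrix below is made up for illustration:

# rows = baskets, columns = items (TRUE if the basket contains the item)
baskets <- matrix(c(1,1, 1,1, 1,0, 0,1, 0,0) == 1, ncol = 2, byrow = TRUE,
                  dimnames = list(NULL, c("bread", "butter")))
i1 <- baskets[, "bread"]
i2 <- baskets[, "butter"]

support    <- mean(i1 & i2)          # % of baskets containing both items
confidence <- mean(i2[i1])           # % of i1 baskets that also contain i2
lift       <- confidence / mean(i2)  # confidence relative to i2's base rate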
Association rules predict the value of an arbitrary attribute (or combination of attributes)
E.g.:
If temperature=cool then humidity=normal
If humidity=normal and play=no then windy=true
If temperature=high and humidity=high then play=no
Clustering EM Algorithm
Assume that the data is from a mixture of normal distributions
I.e. one normal component for each cluster
Log-likelihood:
l(\theta; X) = \sum_{i=1}^{n} \log [ p \phi_1(x_i) + (1 - p) \phi_2(x_i) ]
where \phi_j is the normal density of cluster j and p is the mixing proportion
Clustering EM Algorithm
Think of the data as being augmented by a latent 0/1 variable d_i indicating membership of cluster 1
If the values of this variable were known, the log-likelihood would be:
l(\theta; X, D) = \sum_{i=1}^{n} \log [ d_i \phi_1(x_i) + (1 - d_i) \phi_2(x_i) ]
Starting with initial values for the parameters, calculate the expected value of d_i
Then substitute this into the above log-likelihood and maximise to obtain new parameter values
This will have increased the log-likelihood
Clustering EM Algorithm
Resulting estimates may only be a local maximum
Run several times with different starting points to find global maximum (hopefully)
With parameter estimates, can calculate segment membership probabilities for each case:
P(D_i = 1 | \theta, x_i) = p \phi_1(x_i) / [ p \phi_1(x_i) + (1 - p) \phi_2(x_i) ]
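A sketch of the full EM loop for a two-component univariate normal mixture in R (starting values are crude, and the mixing proportion p is updated along with the means and standard deviations):

em_mixture <- function(x, iters = 50) {
  p  <- 0.5                             # mixing proportion
  mu <- range(x)                        # crude starting means
  s  <- rep(sd(x), 2)
  for (it in 1:iters) {
    # E-step: expected cluster-1 membership for each case
    d1 <- p * dnorm(x, mu[1], s[1])
    d2 <- (1 - p) * dnorm(x, mu[2], s[2])
    g  <- d1 / (d1 + d2)
    # M-step: maximise the augmented log-likelihood
    p  <- mean(g)
    mu <- c(sum(g * x) / sum(g), sum((1 - g) * x) / sum(1 - g))
    s  <- c(sqrt(sum(g * (x - mu[1])^2) / sum(g)),
            sqrt(sum((1 - g) * (x - mu[2])^2) / sum(1 - g)))
  }
  list(p = p, mean = mu, sd = s, membership = g)
}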
Clustering EM Algorithm
Extending to more latent classes is easy
Information criteria such as AIC and BIC are often used to decide how many are appropriate
Extending to multiple attributes is easy if we assume they are independent, at least conditioning on segment membership
It is possible to introduce associations, but this can rapidly increase the number of parameters required
Nominal attributes can be accommodated by allowing different discrete distributions in each latent class, and assuming conditional independence between attributes
Can extend this approach to handle joint clustering and prediction models, as mentioned in the MVA lectures
Probabilistic/EM
Multi-Resolution kd-Tree for EM [Moore99]
Scalable EM [BRF98b]
CF Kernel Density Estimation [ZRL99]