
Machine Learning Tutorial

CB, GS, REC

Section 5

API for Weka

Machine Learning Tutorial for the UKP lab, June 10, 2011

Weka API
Series of experiments are laborious in the WEKA GUI
The API is simple and easy to use to design complex workflows,
e.g. grid search / simulated annealing over the classifier hyperparameter space
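A minimal sketch of such a grid search over a single J48 hyperparameter, selecting by 10-fold cross-validated accuracy; the candidate values and the method name are illustrative, and trainData is loaded as shown on the following slides:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;

static J48 tuneJ48(Instances trainData) throws Exception {
    float[] candidates = { 0.1f, 0.25f, 0.5f };   // illustrative grid
    J48 best = null;
    double bestAcc = -1.0;
    for (float c : candidates) {
        J48 j48 = new J48();
        j48.setConfidenceFactor(c);
        Evaluation eval = new Evaluation(trainData);
        eval.crossValidateModel(j48, trainData, 10, new Random(1));   // 10-fold CV
        if (eval.pctCorrect() > bestAcc) {                            // keep the best setting
            bestAcc = eval.pctCorrect();
            best = j48;
        }
    }
    best.buildClassifier(trainData);   // retrain the best setting on all training data
    return best;
}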

Major concepts
ARFF file
Instances / Instance
Classifier


ARFF file format


Header exactly specifying the parameters of the dataset
CSV representation of instances
Sparse representation is also handled
Example:
@relation MYDATASET
@attribute att1 real
@attribute att2 {value1,value2}
@attribute classlabel {positive,negative}
@data
0.1,value1,positive
0.0,value2,negative
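For illustration, the same two instances in the sparse representation (each entry is an attribute index followed by its value; zero numeric values may simply be omitted):

@data
{0 0.1, 1 value1, 2 positive}
{1 value2, 2 negative}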

Reading an ARFF file


into an Instances object

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.Reader;
import weka.core.Instances;

Instances trainData = null;
Reader reader = new BufferedReader(new FileReader(new File(file)));
try {
    trainData = new Instances(reader);
    trainData.setClassIndex(trainData.numAttributes() - 1);   // last attribute is the class
} finally {
    reader.close();
}
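Alternatively, newer Weka versions provide converter utilities that load ARFF (and other formats) directly; a minimal sketch:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

DataSource source = new DataSource(file);                  // file name as a String
Instances trainData = source.getDataSet();
trainData.setClassIndex(trainData.numAttributes() - 1);    // last attribute is the class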

Training a classifier


import weka.classifiers.Classifier;
import weka.core.Instance;
import weka.core.Instances;

Instances trainData = ...;   // from the previous slide
Instances testData = ...;

Classifier cl = createClassifier();   // initialize a classifier, see next slide
cl.buildClassifier(trainData);        // perform the training process

for (int i = 0; i < testData.numInstances(); i++) {
    Instance inst = testData.instance(i);                     // grab a single test instance
    try {
        // classification
        int value = (int) cl.classifyInstance(inst);              // offset of the predicted nominal value
        String label = testData.classAttribute().value(value);    // the corresponding label
        int realValue = (int) inst.classValue();                   // gold label, in case it was in the ARFF
        // class distribution
        double[] distr = cl.distributionForInstance(inst);         // class posteriors in an array
    } catch (Exception e) {
        e.printStackTrace();
    }
}

Creating / Initializing a classifier


public AbstractClassifier createClassifier() throws Exception {
    J48 j48 = new J48();
    Configuration config = Configuration.getInstance();   // project-specific configuration helper
    j48.setUnpruned(config.getBooleanProperty(Configuration.USE_UNPRUNED_TREE));
    if (!j48.getUnpruned()) {
        j48.setReducedErrorPruning(
            config.getBooleanProperty(Configuration.REDUCED_ERROR_PRUNING));
    }
    j48.setConfidenceFactor(
        config.getFloatProperty(Configuration.CONFIDENCE_FACTOR, j48.getConfidenceFactor()));
    j48.setMinNumObj(config.getIntProperty(Configuration.MIN_NUMOBJ, j48.getMinNumObj()));
    j48.setNumFolds(config.getIntProperty(Configuration.NUM_FOLDS, j48.getNumFolds()));
    return j48;
}

Also: setOptions(java.lang.String[] options); forName(java.lang.String classifierName, java.lang.String[] options);
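For example, an equivalently configured J48 can be created from its option strings; the option values below are illustrative, and depending on the Weka version forName lives on Classifier or AbstractClassifier:

import weka.classifiers.AbstractClassifier;
import weka.classifiers.Classifier;

// -C sets the confidence factor, -M the minimum number of instances per leaf
Classifier cl = AbstractClassifier.forName("weka.classifiers.trees.J48",
        new String[] { "-C", "0.25", "-M", "2" });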

Weka API
Very simple, self-explanatory code
Clear architecture / structure
weka.classifiers
weka.classifiers.bayes (BayesNet, Naive Bayes, ...)
weka.classifiers.functions (Neural Net, Linear Regression, Logistic Regression/MaxEnt, SVM, ...)
weka.classifiers.lazy (Nearest Neighbor, ...)
weka.classifiers.rules (JRip, ...)
weka.classifiers.trees (C4.5, Random Forest, ...)
weka.classifiers.meta (Boosting, Bagging, Attribute Selection, Voting, ...)
weka.clusterers

Quick prototyping, testing of many algorithms
Not all state of the art
Not the most efficient (e.g. logistic regression is slow)
Ideal for learning / teaching / starting up

Machine Learning Tutorial


CB, GS, REC

Section 6

Machine Learning: further topics

Machine Learning Tutorial for the UKP lab, June 10, 2011

Classification
Until now: classification
finite set of (nominal) class labels
classification units / instances were
tokens, token sequences, sentences, documents

Error is measured via the percentage of correct predictions


cf. accuracy, error rate, etc.
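With the Weka API these numbers come from the Evaluation class; a minimal sketch, where cl is a classifier trained as in Section 5 and trainData/testData are loaded from ARFF:

import weka.classifiers.Evaluation;

Evaluation eval = new Evaluation(trainData);   // header and priors from the training data
eval.evaluateModel(cl, testData);              // classify the whole test set
double accuracy = eval.pctCorrect();           // percentage of correct predictions
double errorRate = eval.errorRate();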

Methods worth considering / trying


maximum entropy models (in Weka: logistic regression)
decision trees (for easier tasks)
boosted decision trees
conditional random fields
support vector machines


Regression
Regression:
approximate a real-valued target variable
also called function learning
error is measured as the difference between the predicted and the observed values
usually based on real-valued features

Less typical problem setting for NLP
Methods worth considering / trying
linear regression
support vector machines
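A minimal regression sketch with the Weka API, assuming an ARFF file whose class attribute is numeric (data loading as in Section 5):

import weka.classifiers.Evaluation;
import weka.classifiers.functions.LinearRegression;

LinearRegression lr = new LinearRegression();
lr.buildClassifier(trainData);                                   // class attribute must be numeric
double predicted = lr.classifyInstance(testData.instance(0));    // real-valued prediction

Evaluation eval = new Evaluation(trainData);
eval.evaluateModel(lr, testData);
double rmse = eval.rootMeanSquaredError();                       // predicted vs. observed difference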


Ranking
Preference learning
instead of classification, try to predict a total order over a set of possible labels (e.g. all possible actions at a time)
research area of the KE group here

Subset ranking / Learning to rank


instead of classification, try to predict a (partial) order of a set of instances (e.g. query-document pairs)
more relevant in NLP and especially IR

Error is calculated according to some ranking measure


cf. P@k, MAP, NDCG
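As a reminder, one common formulation of these measures, where rel_i denotes the (graded) relevance of the item at rank i and IDCG@k is DCG@k under the ideal ordering:

P@k = \frac{|\{\text{relevant items in the top } k\}|}{k}, \qquad
\mathrm{DCG}@k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)}, \qquad
\mathrm{NDCG}@k = \frac{\mathrm{DCG}@k}{\mathrm{IDCG}@k}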

Methods worth considering / trying


(rank)SVM
boosted decision/regression trees

Semi-supervised learning
Exploit labeled + unlabeled data to improve models (or likewise, to get a similar model with less labeled data)
Examples
in SVM, maximize the margin (distance from the decision boundary) taking into account unlabeled points
use unlabeled data to calculate feature statistics
use automatically labeled data to extend the training set

Two different paradigms


Inductive setting: learn a model that applies to new examples
use some labeled + unlabeled data and evaluate on unseen data
more general, less powerful
Transductive setting: learn a model that predicts a predefined test set accurately
use some labeled data + unlabeled test data and evaluate on the unlabeled test set
more powerful
entails the need to retrain before predicting further new data
cf. Niklas's thesis

Semi-supervised learning
Self-training
train a model
predicted instances that meet a predefined selection criterion (e.g. p(+) > 0.95) are added to the training pool, and then retrain (see the code sketch below)

Co-training
train two different models / the same model on two independent representations (e.g. spam filtering based on text and on links)
predicted instances that meet a predefined selection criterion are added to the training pool of the other model, and then retrain both

Active learning
train a model on a small initial set
instances that meet a predefined selection criterion (e.g. the model shows high uncertainty, p(+) ≈ p(-)) are sent for human labeling, and then retrain
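A minimal self-training sketch with the Weka API, as referenced above; the 0.95 threshold, the number of rounds and the createClassifier() helper (from Section 5) are illustrative:

import weka.classifiers.Classifier;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.Utils;

static Classifier selfTrain(Instances labeled, Instances unlabeled) throws Exception {
    Classifier cl = null;
    for (int round = 0; round < 5; round++) {
        cl = createClassifier();
        cl.buildClassifier(labeled);                          // retrain on the current pool
        Instances remaining = new Instances(unlabeled, 0);    // empty copy with the same header
        for (int i = 0; i < unlabeled.numInstances(); i++) {
            Instance inst = unlabeled.instance(i);
            double[] distr = cl.distributionForInstance(inst);   // class posteriors
            int best = Utils.maxIndex(distr);
            if (distr[best] > 0.95) {                            // confident prediction
                Instance copy = (Instance) inst.copy();
                copy.setClassValue(best);
                labeled.add(copy);                               // grow the training pool
            } else {
                remaining.add(inst);
            }
        }
        unlabeled = remaining;
    }
    return cl;
}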

Semi-sup. learning (generate training data)


Bootstrapping (generate training data)
start with an initial small seed set
instances that meet a predefined selection criterion (e.g. contextual similarity to the seed) are added to the training pool, and then retry

Distant supervision
start with an assumption of positive / negative membership (e.g. for pairs in a knowledge base you know the label; look for texts containing that pair)
generate potential positive/negative instances based on the assumption, and then train a model

Train on errors
having labeled data for an associated task, train on its errors (which are partly due to the lack of knowledge about your current problem)
e.g. disease and associated symptom codes are never added to the same document; learn D/S relationships from D and S labels/classifiers

Domain adaptation
When crossing domains, the texts (feature and/or label distributions) can change
this degrades ML performance (on the target domain with a small training set, compared to the source domain with a large training set)
try to tackle this domain impact to have OK performance in (almost) unseen domains

Pivot features that are frequent and robust across domains


e.g. "good" is a positive sentiment word in all domains

Structural correspondence learning


align source/target-specific features through their similarities to pivot features
can exploit target-specific knowledge through correspondences to source-specific features

Easy domain adaptation


use three copies of all features: source-only, target-only, and a shared source+target version
can learn general (pivot patterns) and also target- (source-) specific knowledge
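A minimal sketch of this feature augmentation (cf. Daumé's "frustratingly easy" domain adaptation), working on plain feature vectors; the method name and vector layout are illustrative:

// expand a feature vector into [general | source-specific | target-specific] copies
static double[] augment(double[] x, boolean fromSource) {
    double[] out = new double[3 * x.length];
    System.arraycopy(x, 0, out, 0, x.length);          // general copy, filled for every instance
    int offset = fromSource ? x.length : 2 * x.length;
    System.arraycopy(x, 0, out, offset, x.length);     // domain-specific copy
    return out;                                        // the other domain's block stays zero
}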
