
Machine Learning Tutorial

CB, GS, REC

Section 5

API for Weka

Machine Learning Tutorial for the UKP lab, June 10, 2011

Weka API
Series of experiments are laborious in the WEKA GUI
The API is simple and easy to use to design complex workflows,
e.g. grid search / simulated annealing over the classifier hyperparameter space
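A minimal sketch of such a grid search over a single J48 hyperparameter, selecting by 10-fold cross-validated accuracy; the candidate values and the method name are illustrative, and trainData is loaded as shown on the following slides:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;

static J48 tuneJ48(Instances trainData) throws Exception {
    float[] candidates = { 0.1f, 0.25f, 0.5f };   // illustrative grid
    J48 best = null;
    double bestAcc = -1.0;
    for (float c : candidates) {
        J48 j48 = new J48();
        j48.setConfidenceFactor(c);
        Evaluation eval = new Evaluation(trainData);
        eval.crossValidateModel(j48, trainData, 10, new Random(1));   // 10-fold CV
        if (eval.pctCorrect() > bestAcc) {                            // keep the best setting
            bestAcc = eval.pctCorrect();
            best = j48;
        }
    }
    best.buildClassifier(trainData);   // retrain the best setting on all training data
    return best;
}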

Major concepts
ARFF file
Instances / Instance
Classifier


ARFF file format


Header exactly specifying the parameters of the dataset
CSV representation of instances
Sparse representation is also handled
Example:
@relation MYDATASET
@attribute att1 real
@attribute att2 {value1,value2}
@attribute classlabel {positive,negative}
@data
0.1,value1,positive
0.0,value2,negative
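For illustration, the same two instances in the sparse representation (each entry is an attribute index followed by its value; zero numeric values may simply be omitted):

@data
{0 0.1, 1 value1, 2 positive}
{1 value2, 2 negative}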

Reading an ARFF file


into an Instances object

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.Reader;
import weka.core.Instances;

Instances trainData = null;
Reader reader = new BufferedReader(new FileReader(new File(file)));
try {
    trainData = new Instances(reader);
    trainData.setClassIndex(trainData.numAttributes() - 1);   // last attribute is the class
} finally {
    reader.close();
}
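Alternatively, newer Weka versions provide converter utilities that load ARFF (and other formats) directly; a minimal sketch:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

DataSource source = new DataSource(file);                  // file name as a String
Instances trainData = source.getDataSet();
trainData.setClassIndex(trainData.numAttributes() - 1);    // last attribute is the class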

Training a classifier


import weka.classifiers.Classifier;
import weka.core.Instance;
import weka.core.Instances;

Instances trainData = ...;   // from the previous slide
Instances testData = ...;

Classifier cl = createClassifier();   // initialize a classifier, see next slide
cl.buildClassifier(trainData);        // perform the training process

for (int i = 0; i < testData.numInstances(); i++) {
    Instance inst = testData.instance(i);                     // grab a single test instance
    try {
        // classification
        int value = (int) cl.classifyInstance(inst);              // offset of the predicted nominal value
        String label = testData.classAttribute().value(value);    // the corresponding label
        int realValue = (int) inst.classValue();                   // gold label, in case it was in the ARFF
        // class distribution
        double[] distr = cl.distributionForInstance(inst);         // class posteriors in an array
    } catch (Exception e) {
        e.printStackTrace();
    }
}

Creating / Initializing a classifier


public AbstractClassifier createClassifier() throws Exception {
    J48 j48 = new J48();
    Configuration config = Configuration.getInstance();   // project-specific configuration helper
    j48.setUnpruned(config.getBooleanProperty(Configuration.USE_UNPRUNED_TREE));
    if (!j48.getUnpruned()) {
        j48.setReducedErrorPruning(
            config.getBooleanProperty(Configuration.REDUCED_ERROR_PRUNING));
    }
    j48.setConfidenceFactor(
        config.getFloatProperty(Configuration.CONFIDENCE_FACTOR, j48.getConfidenceFactor()));
    j48.setMinNumObj(config.getIntProperty(Configuration.MIN_NUMOBJ, j48.getMinNumObj()));
    j48.setNumFolds(config.getIntProperty(Configuration.NUM_FOLDS, j48.getNumFolds()));
    return j48;
}

Also: setOptions(java.lang.String[] options); forName(java.lang.String classifierName, java.lang.String[] options);
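For example, an equivalently configured J48 can be created from its option strings; the option values below are illustrative, and depending on the Weka version forName lives on Classifier or AbstractClassifier:

import weka.classifiers.AbstractClassifier;
import weka.classifiers.Classifier;

// -C sets the confidence factor, -M the minimum number of instances per leaf
Classifier cl = AbstractClassifier.forName("weka.classifiers.trees.J48",
        new String[] { "-C", "0.25", "-M", "2" });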

Weka API
Very simple, self-explanatory code
Clear architecture / structure
weka.classifiers
weka.classifiers.bayes (BayesNet, Naive Bayes, ...)
weka.classifiers.functions (Neural Net, Linear Regression, Logistic Regression/MaxEnt, SVM, ...)
weka.classifiers.lazy (Nearest Neighbor, ...)
weka.classifiers.rules (JRip, ...)
weka.classifiers.trees (C4.5, Random Forest, ...)
weka.classifiers.meta (Boosting, Bagging, Attribute Selection, Voting, ...)
weka.clusterers

Quick prototyping, testing of many algorithms
Not all state of the art
Not the most efficient (e.g. logistic regression is slow)
Ideal for learning / teaching / starting up

Machine Learning Tutorial


CB, GS, REC

Section 6

Machine Learning: further topics

Machine Learning Tutorial for the UKP lab, June 10, 2011

Classification
Until now: classification
finite set of (nominal) class labels
classification units / instances were
tokens, token sequences, sentences, documents

Error is measured via the percentage of correct predictions


cf. accuracy, error rate, etc.
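With the Weka API these numbers come from the Evaluation class; a minimal sketch, where cl is a classifier trained as in Section 5 and trainData/testData are loaded from ARFF:

import weka.classifiers.Evaluation;

Evaluation eval = new Evaluation(trainData);   // header and priors from the training data
eval.evaluateModel(cl, testData);              // classify the whole test set
double accuracy = eval.pctCorrect();           // percentage of correct predictions
double errorRate = eval.errorRate();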

Methods worth considering / trying


maximum entropy models (in Weka: logistic regression)
decision trees (for easier tasks)
boosted decision trees
conditional random fields
support vector machines


Regression
Regression:
approximate a real-valued target variable
also called function learning
error is measured as the difference between the predicted and the observed values
usually based on real-valued features

Less typical problem setting for NLP
Methods worth considering / trying
linear regression
support vector machines
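A minimal regression sketch with the Weka API, assuming an ARFF file whose class attribute is numeric (data loading as in Section 5):

import weka.classifiers.Evaluation;
import weka.classifiers.functions.LinearRegression;

LinearRegression lr = new LinearRegression();
lr.buildClassifier(trainData);                                   // class attribute must be numeric
double predicted = lr.classifyInstance(testData.instance(0));    // real-valued prediction

Evaluation eval = new Evaluation(trainData);
eval.evaluateModel(lr, testData);
double rmse = eval.rootMeanSquaredError();                       // predicted vs. observed difference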


Ranking
Preference learning
instead of classification, try to predict a total order over a set of possible labels (e.g. all possible actions at a time)
research area of the KE group here

Subset ranking / Learning to rank


instead of classification, try to predict a (partial) order of a set of instances (e.g. query-document pairs)
more relevant in NLP and especially IR

Error is calculated according to some ranking measure


cf. P@k, MAP, NDCG
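As a reminder, one common formulation of these measures, where rel_i denotes the (graded) relevance of the item at rank i and IDCG@k is DCG@k under the ideal ordering:

P@k = \frac{|\{\text{relevant items in the top } k\}|}{k}, \qquad
\mathrm{DCG}@k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)}, \qquad
\mathrm{NDCG}@k = \frac{\mathrm{DCG}@k}{\mathrm{IDCG}@k}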

Methods worth considering / trying


(rank)SVM
boosted decision/regression trees

Semi-supervised learning
Exploit labeled + unlabeled data to improve models (or likewise, to get a similar model with less labeled data)
Examples
in SVM, maximize the margin (distance from the decision boundary) taking into account unlabeled points
use unlabeled data to calculate feature statistics
use automatically labeled data to extend the training set

Two different paradigms


Inductive setting: learn a model that applies to new examples
use some labeled + unlabeled data and evaluate on unseen data
more general, less powerful
Transductive setting: learn a model that predicts a predefined test set accurately
use some labeled data + unlabeled test data and evaluate on the unlabeled test set
more powerful
entails the need to retrain before predicting further new data
cf. Niklas's thesis

Semi-supervised learning
Self-training
train a model
predicted instances that meet a predefined selection criterion (e.g. p(+) > 0.95) are added to the training pool, and then retrain (see the code sketch below)

Co-training
train two different models / the same model on two independent representations (e.g. spam filtering based on text and on links)
predicted instances that meet a predefined selection criterion are added to the training pool of the other model, and then retrain both

Active learning
train a model on a small initial set
instances that meet a predefined selection criterion (e.g. the model shows high uncertainty, p(+) ≈ p(-)) are sent for human labeling, and then retrain
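A minimal self-training sketch with the Weka API, as referenced above; the 0.95 threshold, the number of rounds and the createClassifier() helper (from Section 5) are illustrative:

import weka.classifiers.Classifier;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.Utils;

static Classifier selfTrain(Instances labeled, Instances unlabeled) throws Exception {
    Classifier cl = null;
    for (int round = 0; round < 5; round++) {
        cl = createClassifier();
        cl.buildClassifier(labeled);                          // retrain on the current pool
        Instances remaining = new Instances(unlabeled, 0);    // empty copy with the same header
        for (int i = 0; i < unlabeled.numInstances(); i++) {
            Instance inst = unlabeled.instance(i);
            double[] distr = cl.distributionForInstance(inst);   // class posteriors
            int best = Utils.maxIndex(distr);
            if (distr[best] > 0.95) {                            // confident prediction
                Instance copy = (Instance) inst.copy();
                copy.setClassValue(best);
                labeled.add(copy);                               // grow the training pool
            } else {
                remaining.add(inst);
            }
        }
        unlabeled = remaining;
    }
    return cl;
}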

Semi-sup. learning (generate training data)


Bootstrapping (generate training data)
start with an initial small seed set
instances that meet a predefined selection criterion (e.g. contextual similarity to the seed) are added to the training pool, and then retry

Distant supervision
start with an assumption of positive / negative membership (e.g. for pairs in a knowledge base you know the label; look for texts containing that pair)
generate potential positive/negative instances based on the assumption, and then train a model

Train on errors
having labeled data for an associated task, train on its errors (which are partly due to the lack of knowledge about your current problem)
e.g. disease and associated symptom codes are never added to the same document; learn D/S relationships from D and S labels/classifiers

Domain adaptation
When crossing domains, the texts (feature and/or label distributions) can change
this degrades ML performance (on the target domain with a small training set, compared to the source domain with a large training set)
try to tackle this domain impact to have OK performance in (almost) unseen domains

Pivot features that are frequent and robust across domains


e.g. "good" is a positive sentiment word in all domains

Structural correspondence learning


align source/target-specific features through their similarities to pivot features
can exploit target-specific knowledge through correspondences to source-specific features

Easy domain adaptation


use three copies of all features: source-only, target-only, and a shared source+target version
can learn general (pivot patterns) and also target- (source-) specific knowledge
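A minimal sketch of this feature augmentation (cf. Daumé's "frustratingly easy" domain adaptation), working on plain feature vectors; the method name and vector layout are illustrative:

// expand a feature vector into [general | source-specific | target-specific] copies
static double[] augment(double[] x, boolean fromSource) {
    double[] out = new double[3 * x.length];
    System.arraycopy(x, 0, out, 0, x.length);          // general copy, filled for every instance
    int offset = fromSource ? x.length : 2 * x.length;
    System.arraycopy(x, 0, out, offset, x.length);     // domain-specific copy
    return out;                                        // the other domain's block stays zero
}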
