
Fall 2012 IBM Jeopardy!

Great Minds Challenge
Alex Bonomo, Scot Fang

Intro

Given a set of candidate answers, each with a corresponding question ID, 341 different features, and a true/false label, our job was to train a classification algorithm so that, when given a set of unlabelled candidate answers, we could accurately predict whether each one was correct. Our general strategy was as follows:

1. Cluster candidate answers, with each cluster representing a question class.
2. Train a classifier for each cluster to accurately predict whether or not a candidate is correct.

Our methodologies and results are described in the following sections.

Clustering to Question Classes

IBM's Jeopardy! literature explains that there are 10 question classes: final Jeopardy!, etymology, translation, puzzle, multiple choice, date, number, no focus, useless LAT, and default. Although a single question can only come from one class, we allow its candidate answers to span multiple classes, as they may be an attempt to answer the question in the form of another question class.

Because the data set is so large and cannot fit into memory, we first use Correlation-based Feature Selection (CFS) (Hall, M. A. (1999). Correlation-based Feature Selection for Machine Learning. PhD thesis, University of Waikato). The idea is to find the set of features that are highly correlated with the class, yet uncorrelated with each other. Another perspective on CFS is that we are looking for a subset of features that can be used to assign a question to a class; thus CFS finds a subset of discriminatory features. In our case, we ran CFS to find correlation with the question ID, because we want to divide the question IDs into their respective classes. Running CFS on the full training set gave us a subset of 7 features.

We used these 7 features to reduce the size of the training set and trained a clusterer using Expectation Maximization (EM) over this subset of 7 features. We clustered the training set into 10 clusters because we know there are 10 question classes from the supplied papers. After clustering with EM, every candidate answer is assigned a cluster probability distribution, rather than simply being assigned to a single cluster. In the last section we show how this cluster probability distribution weights our classification scores.

Classification - Train Classifier per Question Class

Our general strategy was to train a different classifier for each cluster, the logic being that each cluster represented a different question category, and the significance of certain features can be radically different across question categories. Therefore a different classification model was trained for each cluster. When training a single classifier on a specified cluster, we only train over data instances that have the highest likelihood of belonging to that cluster. Thus each classifier is specialized to a unique cluster.
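A minimal sketch of this clustering pipeline, written against Weka's Java API, is shown below. It is an illustration rather than our exact configuration: the file name train.arff, the assumption that the question ID is attribute 0, and the ARFF format are all hypothetical, and resetting the class attribute to the true/false label before classifier training is not shown.

import weka.attributeSelection.CfsSubsetEval;
import weka.attributeSelection.GreedyStepwise;
import weka.clusterers.EM;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.attribute.AttributeSelection;

public class ClusterQuestions {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("train.arff");   // hypothetical file name
        data.setClassIndex(0);                             // assume attribute 0 is the question ID

        // CFS: keep features correlated with the question ID but uncorrelated
        // with each other (this gave us a subset of 7 features).
        AttributeSelection cfs = new AttributeSelection();
        cfs.setEvaluator(new CfsSubsetEval());
        cfs.setSearch(new GreedyStepwise());
        cfs.setInputFormat(data);
        Instances reduced = Filter.useFilter(data, cfs);

        // EM clustering into 10 clusters (one per question class). The question
        // ID column is removed so the clusterer only sees the selected features.
        Instances clusterData = new Instances(reduced);
        int qidIndex = clusterData.classIndex();
        clusterData.setClassIndex(-1);
        clusterData.deleteAttributeAt(qidIndex);

        EM em = new EM();
        em.setNumClusters(10);
        em.buildClusterer(clusterData);

        // Every candidate answer gets a probability distribution over the clusters.
        double[] clusterProbs = em.distributionForInstance(clusterData.instance(0));

        // Per-cluster training sets: each instance goes to its most likely cluster.
        Instances[] perCluster = new Instances[10];
        for (int k = 0; k < 10; k++) {
            perCluster[k] = new Instances(data, 0);        // empty copy with all 341 features
        }
        for (int i = 0; i < clusterData.numInstances(); i++) {
            int k = em.clusterInstance(clusterData.instance(i));
            perCluster[k].add(data.instance(i));
        }
    }
}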

Classification - Algorithms Used

For the actual classifiers per cluster, we tried the following algorithms (a Weka configuration sketch follows the list):

1. Ridge Estimator Logistic Regression
le Cessie, S., van Houwelingen, J.C. (1992). Ridge Estimators in Logistic Regression. Applied Statistics. 41(1):191-201.

2. Additive Logit Boost with simple regression functions (SimpleLogistic in Weka)


Marc Sumner, Eibe Frank, Mark Hall: Speeding up Logistic Model Tree Induction. In: 9th European Conference on Principles and Practice of Knowledge Discovery in Databases, 675-683, 2005.

3. Additive Logit Boost with Decision Stumps, C4.5 Tree, REP Tree
Friedman, J., T. Hastie and R. Tibshirani (1998) Additive Logistic Regression: a Statistical View of Boosting

4. Raced Incremental Logit Boost:


This is a method that races committees of LogitBoost models trained on different chunk sizes of the dataset. It decides the best training chunk size and outputs an ensemble model based on the different LogitBoost models trained. The advantage of this method is that it can reduce the amount of memory loaded on your machine by reducing the training chunk size, while still training over all the data by producing different models per chunk.
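For concreteness, the sketch below shows how these four classifier types might be instantiated through Weka's Java API. Parameter settings are defaults rather than the ones we actually used, and class names correspond to Weka's functions, meta, and trees packages.

import weka.classifiers.Classifier;
import weka.classifiers.functions.Logistic;
import weka.classifiers.functions.SimpleLogistic;
import weka.classifiers.meta.LogitBoost;
import weka.classifiers.meta.RacedIncrementalLogitBoost;
import weka.classifiers.trees.DecisionStump;
import weka.classifiers.trees.J48;
import weka.classifiers.trees.REPTree;

public class BaseClassifiers {
    // Candidate algorithms; one classifier is trained per cluster.
    public static Classifier[] candidates() {
        // 1. Ridge-estimator logistic regression (le Cessie & van Houwelingen).
        Logistic ridge = new Logistic();

        // 2. Additive logistic regression with simple regression functions.
        SimpleLogistic simple = new SimpleLogistic();

        // 3. Additive LogitBoost with different weak learners.
        LogitBoost stumps = new LogitBoost();
        stumps.setClassifier(new DecisionStump());   // decision stumps

        LogitBoost c45 = new LogitBoost();
        c45.setClassifier(new J48());                // C4.5 tree

        LogitBoost rep = new LogitBoost();
        rep.setClassifier(new REPTree());            // reduced-error pruning tree

        // 4. Raced incremental LogitBoost: races committees trained on
        //    different chunk sizes and keeps the best-performing ensemble.
        RacedIncrementalLogitBoost raced = new RacedIncrementalLogitBoost();

        return new Classifier[] { ridge, simple, stumps, c45, rep, raced };
    }
}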

Boosting Performance: Additive boosting with decision stumps far outperformed boosting with tree structures. We suspect the added complexity of a tree structure may have overfit the training data; more investigation is needed. LogitBoost, Ridge Estimator, and Raced Incremental LogitBoost all performed about the same.

Classification - Other Strategies

Initially, we tried to address the negative bias in the training set by using a cost-sensitive classifier, which penalized false negatives more than false positives. However, after evaluating our models on Jeopardy!'s scoring scheme, we realized that reducing the false positive rate contributed to a better score even when the false negative rate rose above 40%. Thus, the scoring scheme encouraged a negative bias by penalizing false positives harshly. We retrained our classifiers with equal cost for false positives and false negatives over the negatively biased dataset. These new classifiers received higher scores.

Another strategy we tried was first obtaining the top 100, 50, and 5 candidate answers per question ID, then training a second set of classifiers on this data. The idea was to first get rid of the weak negatives in order to train stronger, more sensitive classifiers. Using the top 5 candidates per QID performed terribly, likely due to overfitting. Using the top 100 and top 50 performed reasonably well but, surprisingly, was unable to outperform some of our other techniques. In retrospect, the first set of classifiers should have penalized false negatives more harshly, with the second set treating them equally to weed out false positives.

Classification - Outputting a Final Decision

To evaluate a set of candidate answers given a question ID, we take the following steps (a sketch of this procedure follows the list):

1. Assign a cluster probability distribution to the instance using our trained EM model.
2. Run each cluster's unique classifier on the instance.
3. Assign a final classification score to the instance, which is the sum of all clusters' classification scores weighted by the probability that the instance belongs to that cluster.
4. Look through all instances of the QID and find the top score.
5. If the top score is above the decision threshold (a parameter of our model), we assign true to that candidate answer; otherwise we assign false.
6. We assign false to all candidate answers with lower scores than the top score.
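A minimal sketch of this decision procedure is shown below. It assumes the trained EM model and per-cluster classifiers from the earlier sketches, and that index 1 of a classifier's class distribution corresponds to the "true" label; those details are illustrative. For brevity the same Instance object is passed to both the clusterer and the classifiers, whereas in practice each model would receive the candidate in the attribute format it was trained on.

import java.util.List;
import weka.classifiers.Classifier;
import weka.clusterers.EM;
import weka.core.Instance;

public class FinalDecision {

    // Steps 1-3: cluster-weighted classification score for one candidate answer.
    static double score(Instance candidate, EM em, Classifier[] clusterClassifiers)
            throws Exception {
        double[] clusterProbs = em.distributionForInstance(candidate);      // step 1
        double finalScore = 0.0;
        for (int k = 0; k < clusterClassifiers.length; k++) {
            // Step 2: probability of the "true" class from cluster k's classifier.
            double pTrue = clusterClassifiers[k].distributionForInstance(candidate)[1];
            finalScore += clusterProbs[k] * pTrue;                          // step 3
        }
        return finalScore;
    }

    // Steps 4-6: pick the top-scoring candidate for a QID and apply the threshold.
    static boolean[] decide(List<Instance> candidatesForQid, EM em,
                            Classifier[] clusterClassifiers, double threshold)
            throws Exception {
        double[] scores = new double[candidatesForQid.size()];
        int best = 0;
        for (int i = 0; i < scores.length; i++) {
            scores[i] = score(candidatesForQid.get(i), em, clusterClassifiers);
            if (scores[i] > scores[best]) best = i;                         // step 4
        }
        boolean[] labels = new boolean[scores.length];                      // step 6: default false
        labels[best] = scores[best] > threshold;                            // step 5
        return labels;
    }
}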

The decision threshold parameter was determined empirically by evaluating on test sets; essentially, it tunes the ratio of false positives to false negatives.
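A rough sketch of that tuning loop, building on the FinalDecision sketch above, is given below. The function evaluateJeopardyScore() is a hypothetical stand-in for the competition's scoring scheme, which is not reproduced here.

import java.util.List;
import weka.classifiers.Classifier;
import weka.clusterers.EM;
import weka.core.Instance;

public class ThresholdTuner {

    // Hypothetical placeholder for the Jeopardy! scoring scheme.
    static double evaluateJeopardyScore(List<Instance> candidates, boolean[] predicted) {
        return 0.0;
    }

    // Sweep candidate thresholds on a held-out set of questions and keep the one
    // with the best score; higher thresholds trade false positives for false negatives.
    static double tuneThreshold(List<List<Instance>> heldOutQuestions, EM em,
                                Classifier[] clusterClassifiers) throws Exception {
        double bestThreshold = 0.0;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (double t = 0.0; t <= 1.0; t += 0.01) {
            double total = 0.0;
            for (List<Instance> candidates : heldOutQuestions) {
                boolean[] labels = FinalDecision.decide(candidates, em, clusterClassifiers, t);
                total += evaluateJeopardyScore(candidates, labels);
            }
            if (total > bestScore) {
                bestScore = total;
                bestThreshold = t;
            }
        }
        return bestThreshold;
    }
}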
