Team Members: Ruixuan Ren, Yunzhe Yang, Shenli Yuan
Mentor: Bryan McCann

Motivation
As international students, we inevitably need to travel frequently and have to deal with all the associated expenses, of which airfare is one of the most significant. We therefore became interested in a model that can predict airfare.

Data Source
The data, provided by Professor Maria Gini [1], were originally collected from daily price quotes on a major travel search web site over the period February 22, 2011 to June 23, 2011.

Data Features
- Departure week begin
- Weekday of departure
- Price quote date
- Weekday of the price quote
- Number of days between quote and departure
- Number of stops in the itinerary

Models

Linear regression
Both unweighted and weighted linear regressions were attempted. For weighted linear regression, three bandwidth values (0.8, 2, 10) were used for comparison.

Naive Bayes
The multinomial event model of Naive Bayes with Laplace smoothing was applied. The target variable is the discretized relative price (each price divided by the overall minimum price). We used equal-width intervals for discretization.

Softmax regression
Softmax regression was applied with the same discretization method as that used for Naive Bayes.

Support Vector Machine (SVM)
SVM was also used with the same discretization method and achieved similar accuracy. However, when the data is discretized into more than two bins, the error increases significantly.

SVM regression and logistic regression were also tried; their results, however, were not satisfactory and were therefore discarded.

Training Error

Model                                        Error
Linear regression, unweighted                0.3733
Linear regression, weighted, bandwidth 0.8   0.3989
Linear regression, weighted, bandwidth 2     0.3774
Linear regression, weighted, bandwidth 10    0.3733
Naive Bayes                                  0.2694
Softmax regression                           0.2316
SVM (two bins)                               0.1939

Diagnostics
Learning curves were used to investigate the price prediction problem. We picked Naive Bayes and SVM to investigate further.

Figure 1: Naive Bayes Learning Curve

Figure 2: SVM Learning Curve

The learning curves show that both models have high bias. We therefore tried adding features to our models, which resulted in smaller training errors. In the future, other features, such as the number of available seats and the departure time of day, need to be considered.

Reference
[1] Groves and Gini, "A Regression Model for Predicting Optimal Purchase Timing for Airline Tickets," 2011.
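
Illustration: discretizing the relative price
The Naive Bayes, softmax, and SVM models all use a target built by dividing each price by the overall minimum price and splitting the result into equal intervals. The following is a minimal sketch of one way that step could look, not the code used for the poster. The function name, the choice of NumPy, the decision to span the bins from the smallest to the largest relative price, and the sample prices are all assumptions made for illustration.

```python
import numpy as np

def discretize_relative_price(prices, n_bins):
    """Bin relative prices (price / overall minimum) into equal-width intervals.

    Hypothetical helper: the bin range [min, max] of the relative price is an
    assumption, since the poster only states that equal intervals were used.
    """
    prices = np.asarray(prices, dtype=float)
    relative = prices / prices.min()  # smallest relative price is exactly 1.0
    # n_bins equal-width intervals need n_bins + 1 edges.
    edges = np.linspace(relative.min(), relative.max(), n_bins + 1)
    # Passing only the interior edges keeps labels in 0 .. n_bins - 1,
    # so the maximum value lands in the last bin instead of an overflow bin.
    labels = np.digitize(relative, edges[1:-1])
    return relative, edges, labels

# Made-up quotes for demonstration only (not taken from the dataset).
prices = [200.0, 250.0, 300.0, 400.0]
relative, edges, labels = discretize_relative_price(prices, n_bins=2)
print(relative, edges, labels)
```

With two bins, the labels form a binary target like the one in the "SVM (two bins)" row; a larger n_bins gives the multi-class target that the other classifiers would consume.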