
Stanford CS229 Machine Learning Final Project

Prediction of Airline Ticket Price


Team Members: Ruixuan Ren, Yunzhe Yang, Shenli Yuan
Mentor: Bryan McCann
Motivation
As international students, we inevitably need to travel frequently and have to deal with all the associated expenses, of which airfare is one of the most significant. We therefore became interested in a model that can predict airfare.

Data Source
The data, provided by Professor Maria Gini [1], were originally collected from daily price quotes on a major travel search web site over the period February 22, 2011 to June 23, 2011.

Data Features
Departure week begin
Weekday of departure
Price quote date
Weekday of the price quote
Number of days between quote and departure
Number of stops in the itinerary

Models

Linear regression
Both unweighted and weighted linear regressions were attempted. For weighted linear regression, three bandwidth values (0.8, 2, 10) were used for comparison.

Naive Bayes
The multinomial event model of Naive Bayes with Laplace smoothing was applied. The target variable is the discretized relative price (each price divided by the overall minimum price). We used equal intervals for discretization.

Softmax regression
Softmax regression was applied with the same discretization method as that used for Naive Bayes.

Support Vector Machine (SVM)
SVM was also used with the same discretization method, producing a similar accuracy. However, when the data are discretized into more than two bins, the error increases significantly.

SVM regression and logistic regression were also tried; the results, however, were not satisfactory, so these models were discarded.

Training Error

Model                                        Error
Linear regression, unweighted                0.3733
Linear regression, weighted, bandwidth 0.8   0.3989
Linear regression, weighted, bandwidth 2     0.3774
Linear regression, weighted, bandwidth 10    0.3733
Naive Bayes                                  0.2694
Softmax regression                           0.2316
SVM (two bins)                               0.1939

Diagnostics
Learning curves were used to investigate the price prediction problem. We picked Naive Bayes and SVM to investigate further.

Figure 1: Naive Bayes Learning Curve

Figure 2: SVM Learning Curve

The learning curves in Figures 1 and 2 show that both models have high bias. We therefore tried adding features to our models, which resulted in smaller training errors. In the future, other features, such as the number of available seats and the departure time of day, need to be considered.

Reference
[1] Groves and Gini, A Regression Model For Predicting Optimal Purchase Timing For Airline Tickets, 2011
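
As a rough illustration of the equal-interval discretization of relative price described under Models, the sketch below divides each price by the overall minimum and assigns it to one of `n_bins` equal-width bins. The function name `discretize_relative_price`, the use of NumPy, and the choice to take the minimum over the whole input array are assumptions made for this example; the poster does not specify how the bin edges were computed.

```python
import numpy as np

def discretize_relative_price(prices, n_bins=2):
    """Hypothetical helper: relative price -> equal-interval class labels."""
    prices = np.asarray(prices, dtype=float)
    # Relative price: each price over the overall minimum price
    # (assumed here to be the minimum of the array passed in).
    relative = prices / prices.min()
    # Equal-width bin edges spanning the observed range of relative prices.
    edges = np.linspace(relative.min(), relative.max(), n_bins + 1)
    # Use only the interior edges so that labels fall in 0 .. n_bins - 1.
    labels = np.digitize(relative, edges[1:-1])
    return relative, labels

prices = [200.0, 250.0, 300.0, 400.0]
relative, labels = discretize_relative_price(prices, n_bins=2)
print(relative)  # [1.   1.25 1.5  2.  ]
print(labels)    # [0 0 1 1]
```

The resulting labels could then serve as the class variable for Naive Bayes, softmax regression, or SVM; with `n_bins=2` this corresponds to the two-bin setting reported for SVM.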
