Abstract
RMS Titanic was a British passenger liner that sank in the North Atlantic Ocean after colliding with an iceberg. Although there was some element of luck involved in surviving the sinking, some groups of people, such as women, children, and the upper class, were more likely to survive than others. In this project, three kinds of machine learning techniques, a modified gender-based model, random forest, and support vector machines (SVM), are applied to predict each passenger's likelihood of survival. For the modified gender-based model, class, fare, port, ticket, and family size are considered together with gender to reach a maximum accuracy of 0.78469; for the random forest algorithm, the combined influence of passenger class, ticket, fare, gender, port, age fill, family size, and age*class gives a maximum accuracy of 0.77990; for support vector machines, only strong indicators of survival (passenger class, gender, and port) are taken into account, which leads to a maximum accuracy of 0.76077. Our best maximum accuracy gives a 2% increase over other models, and we also conclude that more features do not necessarily give better results.
1. Introduction
The sinking of the RMS Titanic is one of the most
infamous shipwrecks in history. One of the reasons that
the shipwreck led to such loss of life was that there were
not enough lifeboats for the passengers and crew.
Although there was some element of luck involved in
surviving the sinking, some groups of people were more
likely to survive than others, such as women, children,
and the upper-class.
The task of this project is to predict whether a given passenger survived the sinking of the Titanic, based on various attributes, including age, gender, family size, ticket, the fare they paid, and other information, using the tools of machine learning. Solutions are evaluated by comparing the percentage of correct answers on a test dataset.
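To make the evaluation metric concrete, accuracy here is simply the fraction of matching labels. A minimal sketch (the function name and arrays are illustrative, not from the project's code):

import numpy as np

def accuracy(predicted, actual):
    """Percentage of correct answers: mean of elementwise agreement."""
    predicted = np.asarray(predicted)
    actual = np.asarray(actual)
    return (predicted == actual).mean()

print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75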
2. Background
Recently, many studies have been conducted on this problem in order to compare and contrast different machine learning techniques. As an important aspect of prediction, the trade-off between different features has been addressed in several of these studies, which are summarized below.
Nguyen et al. [3] created a supervised learning model, which used Random Forests from the R randomForest package and the RandomForestClassifier in Python's scikit-learn library.
3. Approach
3.1 Random Forest [9]
3.1.1. Introduction
The common element is that for the kth tree, a random vector Θ_k is generated, independent of the past random vectors Θ_1, ..., Θ_{k-1} but with the same distribution, and a tree is grown using the training set and Θ_k, resulting in a classifier h(x, Θ_k), where x is an input vector. For instance, in bagging the random vector Θ is generated as the counts in N boxes resulting from N darts thrown at random at the boxes, where N is the number of examples in the training set. In random split selection, Θ consists of a number of independent random integers between 1 and K. The nature and dimensionality of Θ depend on its use in tree construction.
After a large number of trees are generated, they vote for the most popular class. We call these procedures random forests.
A random forest is a classifier consisting of a collection of tree-structured classifiers {h(x, Θ_k), k = 1, ...}, where the {Θ_k} are independent identically distributed random vectors and each tree casts a unit vote for the most popular class at input x.
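To make the Θ_k-as-bootstrap idea and the voting step concrete, here is a minimal sketch of bagging with majority vote (not the project's code; X and y are assumed to be NumPy arrays of features and 0/1 labels):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_forest(X, y, n_trees=100, seed=0):
    """Grow n_trees trees, each on a bootstrap sample of the training set.

    The bootstrap counts play the role of the random vector Theta_k:
    each tree sees an independent, identically distributed resampling
    of the N training examples (N darts thrown at N boxes).
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)  # bootstrap indices
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def forest_predict(trees, X):
    """Each tree casts a unit vote; the most popular class wins."""
    votes = np.stack([t.predict(X) for t in trees])  # shape (n_trees, n_samples)
    return (votes.mean(axis=0) >= 0.5).astype(int)   # majority vote for 0/1 labels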
In other words, a random forest is an ensemble of decision trees which outputs a prediction value, in this case survival. Each decision tree is constructed using a random subset of the training data. After the forest has been trained, each test row can be passed through it to output a prediction. The Python implementation used here requires floats for the input variables, so all strings need to be converted and any missing data needs to be filled, as in the sketch below.
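A minimal sketch of that preprocessing and fitting step with scikit-learn's RandomForestClassifier; the file name train.csv and the column names follow the Kaggle data dictionary and are assumptions, not the project's exact code:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

train = pd.read_csv("train.csv")  # assumed Kaggle file name

# Strings -> numbers: the classifier only accepts numeric input.
train["Sex"] = train["Sex"].map({"female": 0, "male": 1})
train["Embarked"] = train["Embarked"].map({"C": 0, "Q": 1, "S": 2})

# Fill missing values before fitting.
train["Age"] = train["Age"].fillna(train["Age"].median())
train["Embarked"] = train["Embarked"].fillna(train["Embarked"].mode()[0])

features = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(train[features], train["Survived"])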
4. Dataset
The historical data has been split into two groups, a 'training set' and a 'test set'. The original dataset was provided by kaggle.com: we had 891 rows of labelled training data, and using that, we had to submit survival prediction labels for 418 rows of test data.
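Before any feature work, it helps to confirm the split sizes and how complete each column is. A minimal sketch, assuming the Kaggle file names train.csv and test.csv:

import pandas as pd

train = pd.read_csv("train.csv")  # 891 labelled rows (assumed file name)
test = pd.read_csv("test.csv")    # 418 rows to predict
print(train.shape, test.shape)

# Fraction of non-missing values per column; Cabin is populated for
# only about a quarter of the rows, which is why it is dropped below.
print(train.notna().mean().sort_values())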
The fields in the dataset are:

Variable   Description
survival   Survival (0 = No; 1 = Yes)
pclass     Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
name       Name
sex        Sex
age        Age
sibsp      Number of Siblings/Spouses Aboard
parch      Number of Parents/Children Aboard
ticket     Ticket Number
fare       Passenger Fare
cabin      Cabin
embarked   Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

Analysis:
- Cabin: populated for only about 1/4 of the rows; too much missing data, so it was dropped.
- Name: divided by title (Mr./Mrs./Miss/Dr./Master/Col/Rev); survival rates by title are all close to 50%.
- Sex, embarked: not numbers; transferred to numbers, as sketched below.
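A sketch of how the title split and the numeric transfer might look in pandas; the regex and the integer mappings are illustrative assumptions, not the project's exact code:

import pandas as pd

train = pd.read_csv("train.csv")  # assumed Kaggle file name

# Pull the title out of names like "Braund, Mr. Owen Harris".
train["Title"] = train["Name"].str.extract(r",\s*([^.]+)\.", expand=False)
print(train.groupby("Title")["Survived"].mean())  # survival rate per title

# Sex and Embarked are strings; transfer them to small integers.
train["Sex"] = train["Sex"].map({"female": 0, "male": 1})
train["Embarked"] = train["Embarked"].map({"C": 0, "Q": 1, "S": 2})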
5. Evaluation
Model                   Lam, Tang [5]   Yang [11]   This work
Modified Gender Based   X               X           0.78469
Random Forest           X               0.8134      0.7799
SVM                     0.77033         0.7799      0.76077
Decision Tree           0.7943          0.7946      X

Table 5.1: Comparison of results (X = not reported).
Figure: SVM with all the features that were used for Random Forest (failure).
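For contrast with the failed all-features attempt, here is a minimal sketch of the reduced-feature SVM the abstract describes (passenger class, gender, and port only), assuming scikit-learn's SVC and the Kaggle column names; this is not the project's svm.py:

import pandas as pd
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

train = pd.read_csv("train.csv")  # assumed Kaggle file name

# Encode the three strong indicators of survival as numbers.
train["Sex"] = train["Sex"].map({"female": 0, "male": 1})
train["Embarked"] = train["Embarked"].map({"C": 0, "Q": 1, "S": 2}).fillna(2)

features = ["Pclass", "Sex", "Embarked"]
clf = SVC(kernel="rbf")  # default RBF kernel; the project's kernel choice is not stated
scores = cross_val_score(clf, train[features], train["Survived"], cv=5)
print(scores.mean())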
6. Conclusion
Our results show that a higher accuracy can be obtained, improving about 2% over the results of other models. In addition, during the course of our study, we found that using more features in the models does not necessarily produce better results.
7. Team roles
The team has three members, Kunal Vyas, Zeshi Zheng, and Lin Li, who worked together to apply different machine learning methods to the prediction task:
Kunal Vyas - Problem analysis, SVM and random forest code implementation, dataset analysis, visualization, testing and debugging, writing.
Zeshi Zheng - Dataset analysis, improved gender model and random forest code implementations, visualization, algorithm development, testing and debugging, writing.
Lin Li - Problem analysis, random forest code implementations, visualization, algorithm development, writing.
Filename                               Description                                   Author
TitanicIter1.py (258 lines)            Main program, gender part                     Zeshi Zheng
TitanicIter2.py (200 lines)            Fare part, class part                         Zeshi Zheng
svm.py (230 lines)                     SVM implementation                            Kunal Vyas
Randomforest.py (220 lines)            Experiments on the random forest algorithm    Kunal Vyas
Titanic6NewFeaturesRF.py (146 lines)   Using the random forest algorithm             Lin Li, Zeshi Zheng

Table 1. Summary of Code
References
[1] http://www.anesi.com/titanic.htm
[2] https://www.kaggle.com/c/titanic
[3] https://seelio.com/w/dgd/titanic-machine-learning-from-disaster
[4] https://rstudio-pubs-static.s3.amazonaws.com/25401_74410a27a4cf41a3b257e42c50927d35.html
[5] http://cs229.stanford.edu/proj2012/LamTang-TitanicMachineLearningFromDisaster.pdf
[6] A. Ng. CS229 Notes. Stanford University, 2012.
[7] Cortes, Corinna; and Vapnik, Vladimir N. "Support-Vector Networks", Machine Learning, 20, 1995.
[8] Stuart J. Russell, Peter Norvig. Artificial Intelligence: A Modern Approach, Pearson Education, 2003, pp. 697-702.
[9] Breiman, L. 2001. Random forests. Machine Learning 45:5-32.
[10] http://en.wikipedia.org/wiki/Support_vector_machine
[11] http://murphy.wot.eecs.northwestern.edu/~xto633/xiaodong/fullreport.pdf