
Predicting Survivors on the Titanic using SAP Predictive Analytics
evtechnologies.com/predicting-survivors-titanic-using-sap/

Ahmed Sherif

4/26/2016

While Jack Dawson was a fictional character portrayed by Leonardo DiCaprio in the movie Titanic, there was an
actual J. Dawson aboard the Titanic. But this J stood for Joseph, not Jack. Over the years, data has been
collected on many of the people who were aboard the Titanic, and we know quite a bit about them, including
whether or not they survived the sinking. Below are some of the details that have been gathered about many of
those aboard the Titanic.
Name
Sex
Age
# of Siblings aboard
# of Parents/Children aboard
Ticket #
Passenger Fare
Passenger Class
Cabin
Port of Embarkation
Survived (Yes/No)
Some of this detail is just general information about the passengers. But other indicators relate to their social
status aboard the Titanic. If you've seen the movie (most people have, since it was the second-highest-grossing
film of all time), you know that many of those who didn't make it were trapped inside the ship and were not able
to escape on boats or rafts. Since we have this data available, let's use Automated Analytics within the SAP
Predictive Analytics 2.5 tool to see if we can build a prediction model that, based on the indicators above,
assigns a probability of survival to each person.
We have data on 1,309 unique individuals, so let's use 1,000 of them to train a predictive model and then apply
that model to the remaining 309 to see how close we come to predicting their outcome. Since this is a binary
outcome where Survival is measured as either a Yes or No response, a logistic regression model is a natural
fit. Since we will be using Automated Analytics, the tool will make that determination for us automatically.
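The 1,000/309 split described above can be sketched in a few lines of Python. This is only an illustration: the article does not say how the rows were divided, so the random shuffle, the `split_train_test` helper, and the seed are all assumptions, and the list of integers merely stands in for the passenger records.

```python
import random


def split_train_test(rows, n_train=1000, seed=42):
    """Shuffle a copy of rows and split it into (train, test).

    The shuffle and the seed are assumptions for illustration; the
    article does not describe how its 1,000 training rows were chosen.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


# Stand-in for the 1,309 passenger records (hypothetical row IDs).
passengers = list(range(1309))
train, test = split_train_test(passengers)
print(len(train), len(test))  # 1000 309
```

Fixing the seed keeps the split reproducible, so the same 309 rows are held out every time the sketch is run.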
Let's get started!
Step 1: Create a Classification/Regression model within the Automated Analytics Module in SAP Predictive
Analytics 2.5


Step 2: Analyze the data to make sure all of the columns are appropriately valued

Step 3: Apply the target and predictor variables to assign roles within the model. For purposes of this model,
Survived will be our target variable since we are looking to see how well we can predict the survival outcome
based on everything else. (Please note that KxIndex is just a generated index variable produced by the tool to
provide uniqueness to each row of data and will be discarded from the final model. In addition, I removed boat,
home destination, and ID as they were also insignificant to the model outcome.)
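Outside the tool, the role assignment in Step 3 amounts to picking one target column and filtering out the excluded ones. The sketch below assumes hypothetical column names (`survived`, `boat`, `home.dest`, `id`); the names in the actual data source may differ, and `assign_roles` is an illustrative helper, not part of SAP Predictive Analytics.

```python
# Column names below are assumptions for illustration only.
TARGET = "survived"
EXCLUDED = {"KxIndex", "boat", "home.dest", "id"}


def assign_roles(columns):
    """Return (target, predictors) for a list of column names."""
    if TARGET not in columns:
        raise ValueError(f"target column {TARGET!r} not found")
    predictors = [c for c in columns if c != TARGET and c not in EXCLUDED]
    return TARGET, predictors


columns = ["KxIndex", "id", "pclass", "survived", "sex", "age", "fare", "boat"]
target, predictors = assign_roles(columns)
print(target, predictors)  # survived ['pclass', 'sex', 'age', 'fare']
```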

Step 4: After selecting all the explanatory variables as well as the target variable, go ahead and generate the
model. (Please note that the probability threshold was kept at the default setting of 50%, meaning a probability
of less than 0.5 indicates No Survival and a probability of 0.5 or greater indicates Yes Survival. This setting
can be adjusted in the advanced settings before model generation if the threshold needs to be reevaluated.)
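The 50% threshold rule can be written as a small function. This is a minimal sketch of the rule as described in the note above, not the tool's internal implementation; the `classify` name and its `threshold` parameter are introduced here for illustration.

```python
def classify(probability, threshold=0.5):
    """Map a survival probability to 1 (Yes Survival) or 0 (No Survival).

    A probability equal to the threshold counts as survival, matching
    the "0.5 or greater" rule in the text.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return 1 if probability >= threshold else 0


print(classify(0.49))                 # 0
print(classify(0.50))                 # 1
print(classify(0.60, threshold=0.7))  # 0
```

Raising the threshold makes the model more conservative about predicting survival; lowering it does the opposite.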
Step 5: Evaluate Results

The model produced some general information for us to consider.


The model found that ~56% of the passengers in the training data had not survived, while ~44% had survived.
Additionally, the model achieved about 80% predictive power (KI) on the training data, with a 91% chance of
reproducing the same results given similar datasets.
Finally, the model kept 9 of the 14 predictor variables initially selected as contributing significantly to
the Survived response and discarded the remaining variables.


The most significant contributor was the sex of those aboard, with women more likely to survive than men. The
next two contributors were the fare passengers paid to get on board and the passenger class they were in.
Clearly there is a direct relationship between how much you paid for a ticket and the class you were assigned
to. These two indicators suggest that social class may in fact have been a factor in who was given priority for
rescue during the Titanic's fatal end.
Final Step: Apply Training Model on Test Results

We will now apply the model built from the training set to the 309 rows of data we stored in the test data set
to see how well we can predict the actual probability of survival. For the output, we want to select the option to
view probability. For our purposes, each probability is then mapped back to a prediction of 0 or 1:
Less than 50% = 0
50% or Greater = 1


You can view the output inside the tool or export it to a data source. I always like to export it to an Excel
spreadsheet to quickly calculate my scores. The variable that we are looking for is proba_rr_survived. If we
round it to a whole number, we can compare that result to the original Survived response. After some
calculations in Excel for the rows I tested, I got 245 correct matches out of 309. That rate comes out to
79.3%, almost identical to our KI score above of 79.8%.
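The Excel check can be reproduced in Python as a rough sketch. The `accuracy` helper and the four sample rows are made up for illustration; only the 245-out-of-309 figure comes from the article. The comparison uses an explicit `>= threshold` test rather than Python's built-in `round`, because `round(0.5)` returns 0 in Python, whereas the rule above treats 50% as a 1.

```python
def accuracy(probabilities, actuals, threshold=0.5):
    """Share of rows where the thresholded probability matches the actual outcome."""
    if len(probabilities) != len(actuals):
        raise ValueError("inputs must have the same length")
    if not actuals:
        raise ValueError("inputs must not be empty")
    correct = sum(
        1
        for p, a in zip(probabilities, actuals)
        if (1 if p >= threshold else 0) == a
    )
    return correct / len(actuals)


# Made-up sample rows: values of proba_rr_survived and the actual outcome.
sample_probabilities = [0.9, 0.2, 0.5, 0.4]
sample_actuals = [1, 0, 1, 1]
print(accuracy(sample_probabilities, sample_actuals))  # 0.75

# The figure reported in the article: 245 correct matches out of 309.
reported_rate = 245 / 309
print(f"{reported_rate:.1%}")  # 79.3%
```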
In conclusion, we found that a model built on the data gathered on about 1,300 of those on the Titanic was able
to correctly predict the survival outcome for roughly 80% of those we tested it on. We found that Sex, Fare, and
Class were the highest individual predictors contributing to the model score.
FYI, the actual Joseph Dawson did also die on the Titanic.
To learn more about SAP Predictive Analytics, please register and join our 4-part webinar series called
I Love Predictive.
Resources
Learn more about logistic regression algorithms


SAP Predictive Analytics Tutorials for Automated Analytics and Expert Analytics
Titanic Data Set
