
Titanic

Machine Learning from Disaster


M.A.D.-Python team: Dylan Kenny, Matthew Kiggans, Aleksandr Smirnov

Louisiana State University, MATH 4020, Professor Peter Wolenski

The sinking of the RMS Titanic is one of the most infamous
shipwrecks in history. On April 15, 1912, during her maiden voyage,
the Titanic sank after colliding with an iceberg, killing 1502 out of
2224 passengers and crew. This sensational tragedy shocked the
international community and led to better safety regulations for
ships.

Introduction

The goal of the project was to predict the survival of passengers based on a set of
data. We used the Kaggle competition "Titanic: Machine Learning from Disaster" (see
https://www.kaggle.com/c/titanic/data) to retrieve the necessary data and evaluate
the accuracy of our predictions. The historical data has been split into two groups, a
'training set' and a 'test set'. For the training set, we are provided with the outcome
(whether or not a passenger survived). We used this set to build our model and
generate predictions for the test set.
For each passenger in the test set, we had to predict whether or not they survived
the sinking. Our score was the percentage of correct predictions.
In our work, we learned
• the programming language Python and its libraries NumPy (to perform matrix
operations) and SciKit-Learn (to apply machine learning algorithms)
• several machine learning algorithms (decision tree, random forests, extra
trees, linear regression)
• feature engineering techniques
We used
• the online integrated development environment Cloud 9 (https://c9.io)
• Python 2.7.6 with the libraries numpy, sklearn, and matplotlib
• Microsoft Excel

Work Plan

1. Learn the programming language Python
2. Learn Shannon entropy and write Python code to compute it (a minimal
sketch is given after this list)
3. Get familiar with the Kaggle project and try using Pivot Tables in Microsoft
Excel to analyze the data
4. Learn to use the SciKit-Learn library in Python, including
   a. Building decision trees
   b. Building random forests
   c. Building ExtraTrees
   d. Using the linear regression algorithm
5. Perform feature engineering, apply machine learning algorithms, and
analyze the results
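
As an illustration of step 2, here is a minimal sketch of a Shannon entropy
computation in Python; the function name and the example labels are our own, not
the project's original code:

```python
import math

def shannon_entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    if len(counts) <= 1:
        return 0.0  # an empty or pure sample has zero entropy
    # H = -sum over classes of p * log2(p)
    return -sum((float(c) / n) * math.log(float(c) / n, 2)
                for c in counts.values())

print(shannon_entropy([0, 0, 1, 1]))  # 1.0: maximal uncertainty for two classes
print(shannon_entropy([0, 0, 0, 0]))  # 0.0: a pure sample
```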

Training and Test Data

Training and test data come in CSV files and contain the following fields (a minimal
loading sketch follows the list):
• Passenger ID
• Passenger class
• Name
• Sex
• Age
• Number of the passenger's siblings and spouses on board
• Number of the passenger's parents and children on board
• Ticket
• Fare
• Cabin
• City where the passenger embarked
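
A minimal sketch of loading the training file with Python's standard csv module;
the file name train.csv and the column names follow Kaggle's conventions:

```python
import csv

# train.csv is the Kaggle training file; DictReader keys follow its header row
with open('train.csv') as f:
    reader = csv.DictReader(f)
    rows = list(reader)

print(rows[0]['Name'])      # e.g. "Braund, Mr. Owen Harris"
print(rows[0]['Survived'])  # '0' or '1' (present in the training set only)
```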

Feature Engineering

Since the data can have missing fields, incomplete fields, or fields containing hidden
information, a crucial step in building any prediction system is feature engineering.
For instance, the fields Age, Fare, and Embarked in the training and test data had
missing values that had to be filled in. The field Name, while useless by itself,
contained the passenger's title (Mr., Mrs., etc.); we also used the passenger's
surname to distinguish families on board the Titanic. Below is the list of all changes
that have been made to the data.
Extracting Title from Name

The field Name in the training and test data has the form "Braund, Mr. Owen
Harris". Since the name is unique to each passenger, it is not useful for our
prediction system by itself. However, a passenger's title can be extracted from his
or her name. We found 10 titles:
Index  Title   Number of occurrences
0      Col.    4
1      Dr.     8
2      Lady    4
3      Master  61
4      Miss    262
5      Mr.     757
6      Mrs.    198
7      Ms.     2
8      Rev.    8
9      Sir     5

We can see that a title may indicate the passenger's sex (Mr. vs. Mrs.), class (Lady
vs. Mrs.), age (Master vs. Mr.), or profession (Col., Dr., and Rev.).
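
Since every name has the form "Surname, Title. Given names", the title can be
pulled out with a regular expression. A minimal sketch of the idea (the function
name is our own):

```python
import re

def extract_title(name):
    """Pull the title out of a name like 'Braund, Mr. Owen Harris'."""
    match = re.search(r',\s*([^.]+)\.', name)  # text between the comma and the period
    return match.group(1).strip() if match else ''

print(extract_title('Braund, Mr. Owen Harris'))                       # Mr
print(extract_title('Futrelle, Mrs. Jacques Heath (Lily May Peel)'))  # Mrs
```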
Calculating Family Size

It seems advantageous to calculate family size as follows:

Family_Size = Parents_Children + Siblings_Spouses + 1

Extracting Deck from Cabin

The field Cabin in the training and test data has the form "C85" or "C125", where
the letter refers to the deck. We found 8 deck labels: A, B, C, D, E, F, G, T. We see the
deck label as a refinement of the passenger class field, since decks A and B were
intended for first-class passengers, etc.
Extracting Ticket_Code from Ticket

The field Ticket in the training and test data has the form "A/5 21171". Although
we could not understand the meaning of the letters in front of the numbers in the
field Ticket, we extracted those letters and used them in our prediction system (a
sketch of both this and the deck extraction follows the table). We found the
following letters:
Index  Ticket Code  Number of occurrences
0      No Code      961
1      A            42
2      C            77
3      F            13
4      L            1
5      P            98
6      S            98
7      W            19
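
Both extractions reduce to taking the leading letter of a string. A minimal sketch,
with our own (assumed) handling of missing cabins and unlettered tickets:

```python
def extract_deck(cabin):
    """First letter of the cabin field, e.g. 'C85' -> 'C'."""
    return cabin[0] if cabin else 'Unknown'

def extract_ticket_code(ticket):
    """Leading letter of the ticket if it has one, e.g. 'A/5 21171' -> 'A'."""
    return ticket[0] if ticket and ticket[0].isalpha() else 'No Code'

print(extract_deck('C85'))               # C
print(extract_ticket_code('A/5 21171'))  # A
print(extract_ticket_code('347082'))     # No Code
```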

Filling in missing values in the fields Fare, Embarked, and Age

Since the number of missing values was small, we used the median of all Fare values
to fill in the missing Fare fields, and the letter 'S' (the most frequent value) for the
field Embarked.
In the training and test data, there was a significant number of missing ages. To fill
those in, we used a linear regression algorithm to predict ages based on all other
fields except Passenger_ID and Survived.
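
A minimal sketch of this age imputation with SciKit-Learn's LinearRegression; the
function and array layout are our own illustration, assuming the other fields have
already been converted to numbers:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fill_missing_ages(X, age):
    """Predict missing ages from the other (already numeric) fields.

    X   -- 2D array of numeric features, without Passenger_ID and Survived
    age -- 1D array of ages, with np.nan where the age is missing
    """
    known = ~np.isnan(age)
    model = LinearRegression()
    model.fit(X[known], age[known])            # train on passengers with a known age
    filled = age.copy()
    filled[~known] = model.predict(X[~known])  # predict the missing ones
    return filled
```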
Importance of fields

The decision tree algorithm in the SciKit-Learn library allows us to evaluate the
importance of each field used for prediction. Below is a chart displaying the
importance of each field.

[Chart: importance of each field]

We can see that the field Sex is the most important one for prediction, followed by
Title, Fare, Age, Class, Deck, Family_Size, etc.
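
In SciKit-Learn, such importances are exposed by the fitted tree's
feature_importances_ attribute. A minimal sketch with placeholder data (the
random arrays stand in for the real engineered features):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Placeholder stand-ins for the prepared training data (assumptions):
field_names = ['Sex', 'Title', 'Fare', 'Age', 'Class', 'Deck', 'Family_Size']
X = np.random.rand(100, len(field_names))  # engineered features would go here
y = np.random.randint(0, 2, 100)           # survival labels (0 = died, 1 = survived)

tree = DecisionTreeClassifier(criterion='entropy', random_state=0)
tree.fit(X, y)

# feature_importances_ sums to 1; a higher value means the field drives more splits
for name, imp in sorted(zip(field_names, tree.feature_importances_),
                        key=lambda pair: pair[1], reverse=True):
    print('%-12s %.3f' % (name, imp))
```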

Decision Trees

Our prediction system is based on growing decision trees to predict the survival
status. A typical decision tree is pictured below.

[Diagram: a typical decision tree]

The basic algorithm for growing a decision tree:

1. Start at the root node, taking it as the parent node
2. Split the parent node on the field X[i] that minimizes the sum of the child
nodes' uncertainty (maximizes the information gain)
3. Assign the training samples to the new child nodes
4. Stop if the leaf nodes are pure or an early stopping criterion is satisfied;
otherwise, repeat steps 2 and 3 for each new child node

Stopping rules:
1. The leaf nodes are pure
2. A maximal node depth is reached
3. Splitting a node does not lead to an information gain

In order to measure uncertainty and information gain, we used the formula

$$IG(D_p) = I(D_p) - \frac{N_{left}}{N_p}\, I(D_{left}) - \frac{N_{right}}{N_p}\, I(D_{right})$$

where
• $IG$: information gain
• $I$: impurity (uncertainty measure)
• $N_p$, $N_{left}$, $N_{right}$: the numbers of samples in the parent, left child, and
right child nodes
• $D_p$, $D_{left}$, $D_{right}$: the training subsets of the parent, left child, and right
child nodes

For the uncertainty measure, we used entropy, defined by

$$I(p_1, p_2) = -p_1 \log_2 p_1 - p_2 \log_2 p_2,$$

and the Gini index, defined by

$$I(p_1, p_2) = 2 p_1 p_2.$$
The graphs of both measures are given below.

[Graph: entropy and Gini index as functions of $p_1$]
We can see on the graph that when the probability of an event is 0 or 1, the
uncertainty measure equals 0, while when the probability of an event is close to ½,
the uncertainty measure is maximal.
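
The impurity measures and the information-gain formula above translate directly
into code. A minimal sketch for the binary (survived or died) case; the function
names are our own:

```python
import math

def entropy(p1):
    """Entropy of a binary node whose class probabilities are p1 and 1 - p1."""
    if p1 in (0.0, 1.0):
        return 0.0  # a pure node has no uncertainty
    p2 = 1.0 - p1
    return -p1 * math.log(p1, 2) - p2 * math.log(p2, 2)

def gini(p1):
    """Gini index 2 * p1 * p2 of a binary node."""
    return 2.0 * p1 * (1.0 - p1)

def information_gain(impurity, p_parent, p_left, p_right, n_left, n_right):
    """IG(D_p) = I(D_p) - (N_left/N_p) I(D_left) - (N_right/N_p) I(D_right)."""
    n_parent = n_left + n_right
    return (impurity(p_parent)
            - (float(n_left) / n_parent) * impurity(p_left)
            - (float(n_right) / n_parent) * impurity(p_right))

# A perfectly pure split of a 50/50 parent gains the full impurity of the parent:
print(information_gain(entropy, 0.5, 0.0, 1.0, 50, 50))  # 1.0
print(information_gain(gini, 0.5, 0.0, 1.0, 50, 50))     # 0.5
```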

Random Forest and ExtraTrees

One common issue with all machine learning algorithms is overfitting. For a
decision tree, it means growing too large a tree (with low bias but high variance),
so that it loses its ability to generalize the data and to predict the output. In order
to deal with overfitting, we can grow several decision trees and take the average of
their predictions. The SciKit-Learn library provides two such algorithms: Random
Forest and ExtraTrees.
In a Random Forest, we grow N decision trees, each on a randomly selected subset
of the data and M randomly selected fields, where M = √(total number of fields).
In ExtraTrees, in addition to the randomness of the data subsets and of the fields,
the splits of the nodes are chosen randomly.
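
Both ensembles are available in SciKit-Learn. A minimal sketch with placeholder
data; the parameter values are illustrative, not the ones we actually used:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

# Placeholder training data (assumptions); in the project, X held the
# engineered fields and y the Survived column.
X = np.random.rand(200, 9)
y = np.random.randint(0, 2, 200)

# max_features='sqrt' implements the M = sqrt(total number of fields) rule
forest = RandomForestClassifier(n_estimators=100, max_features='sqrt',
                                random_state=0)
extra = ExtraTreesClassifier(n_estimators=100, max_features='sqrt',
                             random_state=0)

forest.fit(X, y)
extra.fit(X, y)
predictions = forest.predict(X)  # each tree votes; the majority wins
```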

Conclusion

As a result of our work, we gained valuable experience in building prediction
systems and achieved our best score on Kaggle: 80.383% correct predictions (on
the Kaggle leaderboard, this corresponds to positions 477-881 out of 3911
participants).
• We performed feature engineering:
  • changed alphabetic values to numeric values,
  • calculated family size,
  • extracted the title from the name, the deck label from the cabin, and the
    ticket code from the ticket,
  • used a linear regression algorithm to fill in missing ages.
• We used several prediction algorithms in Python:
  • decision tree,
  • random forests,
  • extra trees.
• We achieved our best score of 80.383% correct predictions.

