0% found this document useful (0 votes)

89 views

Titanic Machine Learning From Disaster: M.A.D.-Python Team: Dylan Kenny, Matthew Kiggans, Aleksandr Smirnov

The document discusses building a machine learning model to predict if Titanic passengers survived based on their characteristics. Key steps included: 1. Analyzing the training data to engineer useful features like title, family size and filling in missing values. 2. Evaluating features' importance, with sex, title and fare being most predictive. 3. Growing decision trees to make predictions, using techniques like random forests and extra trees to prevent overfitting. 4. The model achieved a score of 80.383% correct predictions on Kaggle's test set.

Uploaded by

varsha

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

89 views

Titanic Machine Learning From Disaster: M.A.D.-Python Team: Dylan Kenny, Matthew Kiggans, Aleksandr Smirnov

Uploaded by

varsha

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 11

Titanic

Machine Learning from Disaster

M.A.D.-Python team: Dylan Kenny, Matthew Kiggans, Aleksandr Smirnov

Louisiana State University, MATH 4020, Professor Peter Wolenski

1
The sinking of the RMS Titanic is one of the most infamous
shipwrecks in history. On April 15, 1912, during her maiden voyage,
the Titanic sank after colliding with an iceberg, killing 1502 out of
2224 passengers and crew. This sensational tragedy shocked the
international community and led to better safety regulations for
ships.

Introduction

The goal of the project was to predict the survival of passengers based off a set of
data. We used Kaggle competition "Titanic: Machine Learning from Disaster" (see
https://www.kaggle.com/c/titanic/data) to retrieve necessary data and evaluate
accuracy of our predictions. The historical data has been split into two groups, a
'training set' and a 'test set'. For the training set, we are provided with the outcome
(whether or not a passenger survived). We used this set to build our model to
generate predictions for the test set.
For each passenger in the test set, we had to predict whether or not they survived
the sinking. Our score was the percentage of correctly predictions.
In our work, we learned
 Programming language Python and its libraries NumPy (to perform matrix
operations) and SciKit-Learn (to apply machine learning algorithms)
 Several machine learning algorithms (decision tree, random forests, extra
trees, linear regression)
 Feature Engineering techniques
We used
 Online integrated development environment Cloud 9 (https://c9.io)
 Python 2.7.6 with the libraries numpy, sklearn, and matplotlib
 Microsoft Excel

2
Work Plan

1. Learn programming language Python

2. Learn Shennon Entropy and write Python code to compute Shennon
Entropy
3. Get familiar with Kaggle project and try using Pivot Tables in Microsoft
Excel to analyze the data.
4. Learn to use SciKit-Learn library in Python, including
a. Building decision tree
b. Building Random Forests
c. Building ExtraTrees
d. Using Linear Regression algorithm
5. Performing Feature Engineering, applying machine learning algorithms, and
analyzing results

3
Training and Test Data

Training and Test data come in CSV file and contain the following fields:
 Passenger ID
 Passenger Class
 Name
 Sex
 Age
 Number of passenger's siblings and spouses on board
 Number of passenger's parents and children on board
 Ticket
 Fare
 Cabin
 City where passenger embarked

4
Feature Engineering

Since the data can have missing fields, incomplete fields, or fields containing hidden
information, a crucial step in building any prediction system is Feature Engineering.
For instance, the fields Age, Fare, and Embarked in the training and test data, had
missing values that had to be filled in. The field Name while being useless itself,
contained passenger's Title (Mr., Mrs., etc.), we also used passenger's surname to
distinguish families on board of Titanic. Below is the list of all changes that has been
made to the data.
Extracting Title from Name

The field Name in the training and test data has the form "Braund, Mr. Owen
Harris". Since name is unique for each passenger, it is not useful for our prediction
system. However, a passenger's title can be extracted from his or her name. We
found 10 titles:
Index Title Number of occurrences
0 Col. 4
1 Dr. 8
2 Lady 4
3 Master 61
4 Miss 262
5 Mr. 757
6 Mrs. 198
7 Ms. 2
8 Rev. 8
9 Sir 5

We can see that title may indicate passenger's sex (Mr. vs Mrs.), class (Lady vs
Mrs.), age (Master vs Mr.), profession (Col., Dr., and Rev.).
Calculating Family Size

It seems advantageous to calculate family size as follows

Family_Size = Parents_Children + Siblings_Spouses + 1

5
Extracting Deck from Cabin

The field Cabin in the training and test data has the form "C85", "C125", where C
refers to the deck label. We found 8 deck labels: A, B, C, D, E, F, G, T. We see deck
label as a refinement of the passenger's class field since the decks A and B were
intended for passengers of the first class, etc.
Extracting Ticket_Code from Ticket

The field Ticket in the training and test data has the form "A/5 21171". Although
we couldn't understand meaning of letters in front of numbers in the field Ticket,
we extracted those letters and used them in our prediction system. We found the
following letters
Index Ticket Code Number of occurrences
0 No Code 961
1 A 42
2 C 77
3 F 13
4 L 1
5 P 98
6 S 98
7 W 19

Filling in missing values in the fields Fare, Embarked, and Age

Since the number of missing values was small, we used median of all Fare values to
fill in missing Fare fields, and the letter 'S' (most frequent value) for the field
Embarked.
In the training and test data, there was significant amount of missing Ages. To fill in
those, we used Linear Regression algorithm to predict Ages based on all other fields
except Passenger_ID and Survived.
Importance of fields

6
Decision Trees algorithm in the library SciKit-Learn allows to evaluate importance
of each field used for prediction. Below is the chart displaying importance of each
field.

We can see that the field Sex is the most important one for prediction, followed by
Title, Fare, Age, Class, Deck, Family_Size, etc.

7
Decision Trees

Our prediction system is based on growing Decision Trees to predict the survival
status. A typical Decision Tree is pictured below

The basic algorithm for growing Decision Tree:

1. Start at the root node as parent node
2. Split the parent node based on field X[i] to minimize the sum of child nodes
uncertainty (maximize information gain)
3. Assign training samples to new child nodes
4. Stop if leave nodes are pure or early stopping criteria is satisfied, otherwise
repeat step 1 and 2 for each new child node

Stopping Rules:
1. The leaf nodes are pure
2. A maximal node depth is reached
3. Splitting a node does not lead to an information gain

8
In order to measure uncertainty and information gain, we used the formula
𝑁𝑙𝑒𝑓𝑡 𝑁𝑟𝑖𝑔ℎ𝑡
𝐼𝐺(𝐷𝑝 ) = 𝐼(𝐷𝑝 ) − 𝐼(𝐷𝑙𝑒𝑓𝑡 ) − 𝐼(𝐷𝑟𝑖𝑔ℎ𝑡 )
𝑁𝑝 𝑁𝑝
where
 𝐼𝐺 : Information Gain
 𝐼 : Impurity (Uncertainty Measure)
 𝑁𝑝 , 𝑁𝑙𝑒𝑓𝑡 , 𝑁𝑟𝑖𝑔ℎ𝑡 : number of samples in the parent, the left child, and the
right child nodes
 𝐷𝑝 , 𝐷𝑙𝑒𝑓𝑡 , 𝐷𝑟𝑖𝑔ℎ𝑡 : training subset of the parent, the left child, and the right
child nodes
For Uncertainty Measure, we used Entropy defined by
𝐼(𝑝1 , 𝑝2 ) = −𝑝1 log 2 𝑝1 − 𝑝2 log 2 𝑝2
and GINI index defined by
𝐼(𝑝1 , 𝑝2 ) = 2𝑝1 𝑝2
The graphs of both measures are given below

Entropy

GINI

𝑝1

9
We can see on the graph that when probability of an event is 0 or 1, then the
uncertainty measure equals to 0, while if probability of an event is close to ½, then
the uncertainty measure is maximum.

Random Forest and ExtraTrees

One common issue with all machine learning algorithms is Overfitting. For Decision
Tree, it means growing too large tree (with strong bias, small variation) so it loses
its ability to generalize the data and to predict the output. In order to deal with
overfitting, we can grow several decision trees and take the average of their
predictions. The library SciKit-Learn provides to such algorithm Random Forest and
ExtraTrees.
In Random Forest, we grow N decision trees based on randomly selected subset of
the data and randomly selected M fields, where 𝑀 = √𝑡𝑜𝑡𝑎𝑙 # 𝑜𝑓 𝑓𝑖𝑒𝑙𝑑𝑠.
In ExtraTrees, in addition to randomness of subsets of the data and of field, splits
of nodes are chosen randomly.

10
Conclusion

As a result of our work, we gained valuable experience of building prediction

systems and achieved our best score on Kaggle: 80.383% of correct predictions (in
Kaggle leaderboard, it corresponds to positions 477 - 881 out of 3911 participants).
• We performed featured engineering techniques
• Changed alphabetic values to numeric
• Calculated family size
• Extracted title from name and deck label from ticket number
• Used linear regression algorithm to fill in missing ages
• We used several prediction algorithms in python
• Decision tree
• Random forests
• Extra trees
• We achieved our best score 80.383% correct predictions

Applied Generative AI For Beginners Practical Knowledge 1703207445
93% (14)
Applied Generative AI For Beginners Practical Knowledge 1703207445
221 pages
Medicinal Plant Identification Using Machine Learning".In
No ratings yet
Medicinal Plant Identification Using Machine Learning".In
27 pages
Individual Asignment Ucs551
70% (10)
Individual Asignment Ucs551
15 pages
Predictive Modeling of Titanic Survivors
No ratings yet
Predictive Modeling of Titanic Survivors
12 pages
Titanic Report ml report
No ratings yet
Titanic Report ml report
14 pages
Predicting_Titanic_Survivors_by_Using_Machine_Lear
No ratings yet
Predicting_Titanic_Survivors_by_Using_Machine_Lear
8 pages
Titanic (4)
No ratings yet
Titanic (4)
3 pages
Titanic (5)
No ratings yet
Titanic (5)
3 pages
Machine Learning
100% (1)
Machine Learning
62 pages
Thesis Slide
No ratings yet
Thesis Slide
24 pages
Titanic Survival Prediction Using Machine Learning
No ratings yet
Titanic Survival Prediction Using Machine Learning
34 pages
Rouse Final
No ratings yet
Rouse Final
8 pages
LP3 - ML Mini-Project Report Format Shreeyas
No ratings yet
LP3 - ML Mini-Project Report Format Shreeyas
13 pages
Jntuk ML RECORD Full
No ratings yet
Jntuk ML RECORD Full
46 pages
4.1.3.5 Lab - Decision Tree Classification
No ratings yet
4.1.3.5 Lab - Decision Tree Classification
11 pages
ML Mini Project 2
No ratings yet
ML Mini Project 2
26 pages
TITANIC SURVIVAL PREDICTION USING ML MINIPROJECT
No ratings yet
TITANIC SURVIVAL PREDICTION USING ML MINIPROJECT
21 pages
CEP Final
No ratings yet
CEP Final
11 pages
Machine learning with Titanic dataset tutorial
No ratings yet
Machine learning with Titanic dataset tutorial
7 pages
PredictingTitanicSurvivorsusing by Applying Exploratory Data Anyltics and ML
No ratings yet
PredictingTitanicSurvivorsusing by Applying Exploratory Data Anyltics and ML
7 pages
iml project (1) (1)
No ratings yet
iml project (1) (1)
13 pages
LamTang TitanicMachineLearningFromDisaster
No ratings yet
LamTang TitanicMachineLearningFromDisaster
5 pages
Titanic Survival Prediction Using Machine Learning
No ratings yet
Titanic Survival Prediction Using Machine Learning
7 pages
Titanic Akshaya
No ratings yet
Titanic Akshaya
12 pages
AD3461_ML Lab Manual
No ratings yet
AD3461_ML Lab Manual
54 pages
Machine Learninf File Final
No ratings yet
Machine Learninf File Final
45 pages
Titanic
No ratings yet
Titanic
1 page
Titanic Disaster Using Machine Learning
No ratings yet
Titanic Disaster Using Machine Learning
7 pages
Aim: Predicting The Survival of Titanic Passengers
No ratings yet
Aim: Predicting The Survival of Titanic Passengers
20 pages
Neural Network Project
No ratings yet
Neural Network Project
4 pages
Maneesha Nidigonda Minor Project .Ipynb
No ratings yet
Maneesha Nidigonda Minor Project .Ipynb
35 pages
Report TSP
No ratings yet
Report TSP
13 pages
Lab Manual
No ratings yet
Lab Manual
55 pages
The Implication of Statistical Analysis and Feature Engineering For Model Building Using Machine Learning Algorithms
No ratings yet
The Implication of Statistical Analysis and Feature Engineering For Model Building Using Machine Learning Algorithms
11 pages
Exploratory Data Analysis of Titanic Survival Prediction Using Machine Learning Techniques
No ratings yet
Exploratory Data Analysis of Titanic Survival Prediction Using Machine Learning Techniques
5 pages
Titanic Survival Prediction
No ratings yet
Titanic Survival Prediction
14 pages
Titanic Eda
No ratings yet
Titanic Eda
14 pages
VI Sem Machine Learning CS 601 PDF
No ratings yet
VI Sem Machine Learning CS 601 PDF
28 pages
MLWP LAB Experiment's
No ratings yet
MLWP LAB Experiment's
11 pages
My ML Lab Manual
No ratings yet
My ML Lab Manual
21 pages
IT ML Lab
No ratings yet
IT ML Lab
35 pages
Using Titanic Dataset for Comprehensive Machine Learning Model Training
No ratings yet
Using Titanic Dataset for Comprehensive Machine Learning Model Training
3 pages
Oomd
No ratings yet
Oomd
11 pages
ML Unit 3 Part 3
No ratings yet
ML Unit 3 Part 3
33 pages
AD3461-Machine Learning Lab Manual
No ratings yet
AD3461-Machine Learning Lab Manual
26 pages
ML Report-1
No ratings yet
ML Report-1
13 pages
Titanic Survival
No ratings yet
Titanic Survival
13 pages
1.1 Loading The Data: Survival by Sex
No ratings yet
1.1 Loading The Data: Survival by Sex
6 pages
Machine Learning Lab
No ratings yet
Machine Learning Lab
33 pages
VI Sem Machine Learning CS 601
No ratings yet
VI Sem Machine Learning CS 601
28 pages
Set Sail: Read - CSV Read - CSV Train Read - CSV Test Train Test
No ratings yet
Set Sail: Read - CSV Read - CSV Train Read - CSV Test Train Test
2 pages
Supervised Learning Classification Algorithms Comparison
No ratings yet
Supervised Learning Classification Algorithms Comparison
6 pages
What Are Decision Trees?
No ratings yet
What Are Decision Trees?
9 pages
my_project_1_AI
No ratings yet
my_project_1_AI
3 pages
ML Lab Manual - Ex No. 1 To 9
No ratings yet
ML Lab Manual - Ex No. 1 To 9
26 pages
Ds 6
No ratings yet
Ds 6
24 pages
ML-Lab
No ratings yet
ML-Lab
26 pages
MCA- Project Documentation Guidelines 2024-2025
No ratings yet
MCA- Project Documentation Guidelines 2024-2025
26 pages
Chap 18 B
No ratings yet
Chap 18 B
22 pages
Titanic Survival Prediction
No ratings yet
Titanic Survival Prediction
14 pages
Math for Deep Learning: What You Need to Know to Understand Neural Networks
From Everand
Math for Deep Learning: What You Need to Know to Understand Neural Networks
Ronald T. Kneusel
No ratings yet
Python for Data Science: Data Science Mastery by Nikhil Khan, #1
From Everand
Python for Data Science: Data Science Mastery by Nikhil Khan, #1
Nikhil Khan
No ratings yet
Handling Missing Value in Decision Tree Algorithm PDF
No ratings yet
Handling Missing Value in Decision Tree Algorithm PDF
6 pages
Blog - Google Cloud AI
No ratings yet
Blog - Google Cloud AI
1 page
Stock Sentiment Analysis
No ratings yet
Stock Sentiment Analysis
3 pages
Simplilearn Deep Learning
No ratings yet
Simplilearn Deep Learning
6 pages
Naukri AnkitPandey (3y 5m)
No ratings yet
Naukri AnkitPandey (3y 5m)
2 pages
Artificial Intelligence in Manufacturing - State of The Art, Perspectives, and Future Directions
No ratings yet
Artificial Intelligence in Manufacturing - State of The Art, Perspectives, and Future Directions
27 pages
YOLOV10 AND SAM 2.1 FOR ENHANCED MRI SEGMENTATION AND IMPROVED NEUROLOGICAL DISEASE DIAGNOSIS
No ratings yet
YOLOV10 AND SAM 2.1 FOR ENHANCED MRI SEGMENTATION AND IMPROVED NEUROLOGICAL DISEASE DIAGNOSIS
30 pages
Lecture Notes For Chapter 4 Artificial Neural Networks Introduction To Data Mining, 2 Edition
No ratings yet
Lecture Notes For Chapter 4 Artificial Neural Networks Introduction To Data Mining, 2 Edition
20 pages
Software Defect Prediction Using Machine Learning
No ratings yet
Software Defect Prediction Using Machine Learning
5 pages
The Digitalisation of Supply Chain A Review
No ratings yet
The Digitalisation of Supply Chain A Review
11 pages
Investment Banking Summer Internship Project
No ratings yet
Investment Banking Summer Internship Project
9 pages
3-Polynomial Regression Using Python
No ratings yet
3-Polynomial Regression Using Python
14 pages
1 s2.0 S0888327023007963 Main
No ratings yet
1 s2.0 S0888327023007963 Main
25 pages
Machine Learning
No ratings yet
Machine Learning
9 pages
Multivariate Time Series Classification With WEASE
No ratings yet
Multivariate Time Series Classification With WEASE
12 pages
Exercises Classificatiwqeon
No ratings yet
Exercises Classificatiwqeon
7 pages
A Practical Guide To Fast Fine-Tuning Your LLMs With Unsloth _ by Dr. Ashish Bamania _ Apr, 2025 _ AI Advances
No ratings yet
A Practical Guide To Fast Fine-Tuning Your LLMs With Unsloth _ by Dr. Ashish Bamania _ Apr, 2025 _ AI Advances
27 pages
Predicting The Stages of Diabetic Retinopathy Using Deep Learning
No ratings yet
Predicting The Stages of Diabetic Retinopathy Using Deep Learning
6 pages
A Market Approach To Forecasting: Background, Theory and Practice
No ratings yet
A Market Approach To Forecasting: Background, Theory and Practice
220 pages
Introduction To Maximum Entropy Models: Adwait Ratnaparkhi Yahoo! Labs
No ratings yet
Introduction To Maximum Entropy Models: Adwait Ratnaparkhi Yahoo! Labs
46 pages
Aiml FPP
No ratings yet
Aiml FPP
227 pages
Mindscapetravellers Riddhi Tushita Anandita1
No ratings yet
Mindscapetravellers Riddhi Tushita Anandita1
8 pages
ETICAS - Auditoria Externa Del Sistema VioGen - 20220308
No ratings yet
ETICAS - Auditoria Externa Del Sistema VioGen - 20220308
48 pages
Machine Learning new
No ratings yet
Machine Learning new
41 pages
Unit 1 BD PDF
No ratings yet
Unit 1 BD PDF
26 pages
Human Emotion Detectionusing Machine Learning Techniques
No ratings yet
Human Emotion Detectionusing Machine Learning Techniques
8 pages
AI For Business module 1 questions
No ratings yet
AI For Business module 1 questions
19 pages
Imarticus DS & A
No ratings yet
Imarticus DS & A
18 pages

Titanic Machine Learning From Disaster: M.A.D.-Python Team: Dylan Kenny, Matthew Kiggans, Aleksandr Smirnov

Uploaded by

Titanic Machine Learning From Disaster: M.A.D.-Python Team: Dylan Kenny, Matthew Kiggans, Aleksandr Smirnov

Uploaded by

Titanic

Machine Learning from Disaster

Louisiana State University, MATH 4020, Professor Peter Wolenski

1. Learn programming language Python

It seems advantageous to calculate family size as follows

Filling in missing values in the fields Fare, Embarked, and Age

The basic algorithm for growing Decision Tree:

Random Forest and ExtraTrees

As a result of our work, we gained valuable experience of building prediction

You might also like