Titanic Survivor Prediction
by Machine Learning
Ding Li 2018.05
online store: costumejewelry1.com

April 15, 1912: Titanic Sank

History
  People: 2,224
    Crew: 908 (Survivors 212, Victims 696)
    Passengers: 1,316 (Survivors 498, Victims 818)

Kaggle Machine Learning (ML) Project
  Passengers: 1,309
    Train: 891 (Survivors 342, Victims 549), used for modeling
    Test: 418 (Survivors: who? Victims: who?), used for prediction

Train (example record)
  PassengerId  195
  Survived     1
  Pclass       1
  Name*        Brown, Mrs. James Joseph (Margaret Tobin)
  Sex          female
  Age          44
  SibSp        0
  ParCh        0
  Ticket**     PC 17610
  Fare         27.7208
  Cabin***     B4
  Embarked     C
  (Margaret Brown was played by Kathy Bates in the movie Titanic.)

Test (example record)
  PassengerId  972
  Survived     (need to predict)
  Pclass       3
  Name*        Boulos, Master. Akar
  Sex          male
  Age          6
  SibSp        1
  ParCh        1
  Ticket**     2678
  Fare         15.2458
  Cabin***     (empty)
  Embarked     C

Goal
  PassengerId  Survived
  892          1/0
  893          1/0
  ……           ……
  1309         1/0

*   Title can be extracted from Name.
**  Ticket is not informative; not used.
*** Cabin is mostly missing; not used.
Age, Fare: missing data replaced with the median value.
Embarked: missing data replaced with the mode value.
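
The preprocessing notes above (title from Name, median for Age and Fare, mode for Embarked) could be sketched roughly as follows. This is a minimal standard-library illustration, not the project's actual code: the third row is made up to show the gap filling, and the helper names `extract_title` and `fill_missing` are hypothetical.

```python
import re
from statistics import median, mode

# Two rows come from the slide; the third is a made-up row with missing values.
passengers = [
    {"Name": "Brown, Mrs. James Joseph (Margaret Tobin)", "Age": 44, "Fare": 27.7208, "Embarked": "C"},
    {"Name": "Boulos, Master. Akar", "Age": 6, "Fare": 15.2458, "Embarked": "C"},
    {"Name": "Doe, Mr. John", "Age": None, "Fare": None, "Embarked": None},
]

def extract_title(name):
    """Return the honorific between the comma and the first period, e.g. 'Mrs'."""
    match = re.search(r",\s*([^.]+)\.", name)
    return match.group(1).strip() if match else "Unknown"

def fill_missing(rows, column, strategy):
    """Replace None in `column` with the median or mode of the observed values."""
    observed = [row[column] for row in rows if row[column] is not None]
    fill_value = median(observed) if strategy == "median" else mode(observed)
    for row in rows:
        if row[column] is None:
            row[column] = fill_value

for row in passengers:
    row["Title"] = extract_title(row["Name"])

fill_missing(passengers, "Age", "median")
fill_missing(passengers, "Fare", "median")
fill_missing(passengers, "Embarked", "mode")

print([row["Title"] for row in passengers])
print(passengers[2])
```

On the real data the same idea would be applied column by column to the full train and test sets.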

Embarked from
  S: Southampton   C: Cherbourg   Q: Queenstown

SibSp: # of Siblings or Spouse
ParCh: # of Parents or Children

Family Size = SibSp + ParCh + 1
Is Alone = 1 if Family Size = 1
           0 if Family Size > 1
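
The two derived features can be written as a small function. This is a sketch using the slide's field names (`SibSp`, `ParCh`); the function name `add_family_features` and the output keys `FamilySize` and `IsAlone` are assumptions made for illustration.

```python
def add_family_features(passenger):
    """Return a copy of the record with FamilySize and IsAlone added."""
    # Family Size = SibSp + ParCh + 1 (the passenger counts as one member).
    family_size = passenger["SibSp"] + passenger["ParCh"] + 1
    # Is Alone = 1 when the passenger travels without relatives, else 0.
    is_alone = 1 if family_size == 1 else 0
    return {**passenger, "FamilySize": family_size, "IsAlone": is_alone}

# SibSp/ParCh values for the two example passengers shown on the earlier slide.
brown = add_family_features({"PassengerId": 195, "SibSp": 0, "ParCh": 0})
boulos = add_family_features({"PassengerId": 972, "SibSp": 1, "ParCh": 1})

print(brown["FamilySize"], brown["IsAlone"])    # 1 1
print(boulos["FamilySize"], boulos["IsAlone"])  # 3 0
```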
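
Correlations between feature pairs
  Survived vs. Sex           -0.54
  Survived vs. P Class       -0.34
  Survived vs. Fare bin       0.3
  Survived vs. Embarked      -0.17
  Fare bin vs. P Class       -0.63
  Fare bin vs. Family Size    0.47
  Age bin vs. P Class        -0.36
  Age bin vs. Title code      0.32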

Models
Coding in Python: Sklearn, Xgboost

Cross Validation (CV)
  Model                    Train Accuracy Mean   Test Accuracy Mean
  XGBClassifier            85.6%                 82.9%
  SVC                      83.7%                 82.6%
  RandomForestClassifier   89.0%                 82.2%
  DecisionTreeClassifier   89.5%                 82.0%
  KNeighborsClassifier     85.0%                 81.4%
  RidgeClassifierCV        79.7%                 79.4%
  LogisticRegressionCV     79.7%                 79.1%
  GaussianNB               79.5%                 78.1%
  SGDClassifier            73.5%                 73.2%
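
A comparison of mean train and test accuracy under cross validation might look like the sketch below. It assumes scikit-learn is installed and uses synthetic data from `make_classification` as a stand-in for the prepared Titanic features, so its numbers will not match the table. The choice of three models, 5 folds, and default hyperparameters is an assumption, not the project's actual setup.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 891 rows, 8 numeric features.
X, y = make_classification(n_samples=891, n_features=8, n_informative=4, random_state=0)

models = {
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "GaussianNB": GaussianNB(),
}

results = {}
for name, model in models.items():
    # return_train_score=True reports accuracy on the training folds as well.
    scores = cross_validate(model, X, y, cv=5, return_train_score=True)
    results[name] = (scores["train_score"].mean(), scores["test_score"].mean())

# Sort by mean test accuracy, best first, as in the table.
for name, (train_acc, test_acc) in sorted(results.items(), key=lambda kv: -kv[1][1]):
    print(f"{name:<24} train {train_acc:.1%}  test {test_acc:.1%}")
```

A large gap between the train and test columns, as an unpruned decision tree typically shows, is the usual sign of overfitting.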
Steps are easy to Interpret
Very complicated logic
Tree is too deep
Prone to overfitting

Hand-made tree: Observation, Model, Accuracy

No split
  Observation: Train 891: Survivors 342 (38%), Victims 549 (62%)
  Model: all 891 predicted as Victims
  Accuracy: 549 / 891 = 62%

Split by Sex
  Observation: Train 891
    Male 577 (65%): Survivors 109 (19%), Victims 468 (81%)
    Female 314 (35%): Survivors 233 (74%), Victims 81 (26%)
  Model: Male 577 predicted as Victims, Female 314 predicted as Survivors
  Accuracy: (468 + 233) / 891 = 701 / 891 = 79%

With one more layer, the hand-made tree can reach 82% accuracy.
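
The accuracy arithmetic for the hand-made rules can be checked in a few lines. This sketch takes the per-sex counts as 109/468 for males and 233/81 for females, the values consistent with the stated totals of 342 survivors and 549 victims among the 891 training passengers. The variable names are illustrative.

```python
# (survivors, victims) per group in the training set.
counts = {"male": (109, 468), "female": (233, 81)}
total = sum(survivors + victims for survivors, victims in counts.values())

# Baseline model: predict "victim" for every passenger.
baseline_acc = sum(victims for _, victims in counts.values()) / total

# One-split tree: predict the majority outcome within each sex
# (male: victim, female: survivor).
sex_rule_acc = sum(max(survivors, victims) for survivors, victims in counts.values()) / total

print(f"baseline {baseline_acc:.1%}, sex rule {sex_rule_acc:.1%}")
# baseline 61.6%, sex rule 78.7%
```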

Before tuning:
  Training Score = 89.5%
  Test Score = 82.05%
After tuning (best max_depth = 4):
  Training Score = 89.4%
  Test Score = 87.4%
Tuning alleviates the overfitting.
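
One common way to tune `max_depth` for a decision tree is a cross-validated grid search. The sketch below assumes scikit-learn and again uses synthetic stand-in data, so the selected depth and scores will differ from the slide. The candidate grid and the 5-fold setting are assumptions, and the original project may have tuned differently.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 891 rows, 8 numeric features.
X, y = make_classification(n_samples=891, n_features=8, n_informative=4, random_state=0)

# Try several depth limits; None means the tree grows until its leaves are pure.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 6, 8, 10, None]},
    cv=5,
    return_train_score=True,
)
search.fit(X, y)

best = search.best_index_
print("best max_depth:", search.best_params_["max_depth"])
print(f"train {search.cv_results_['mean_train_score'][best]:.1%}, "
      f"test {search.cv_results_['mean_test_score'][best]:.1%}")
```

Limiting the depth trades a little training accuracy for a simpler tree, which is how tuning can narrow the gap between training and test scores.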
• Kaggle is a convenient platform to study and practice machine learning.
• Python code can be executed directly on the host server from the browser.
• Numerous datasets are provided on the site, including training and test data.
• Once the prediction file is submitted, a score will be returned to evaluate your model.
• Many developers share runnable code with detailed explanations.
• Applying artificial intelligence blindly, without human intelligence, is dangerous.
• Some ML models can be too complicated, leading to overfitting.
• The performance of some ML models can be worse than that of a simple hand-made model.
• Combining AI and human logic can make the analytical process enjoyable and reliable.
Python code of the project on Kaggle: https://www.kaggle.com/dingli/titanic-survivor-prediction-machine-learning

