How to Approach Data Science Problems from Start to End
Polong Lin
Data Scientist
IBM Analytics, Emerging Technologies
@polonglin
@bigdatau
台灣資料科學年會
What is Big Data University (BDU)?
• Free online courses
• Data Science & Data Engineering
• A community initiative led by IBM
• Certificates and Badges
• > 450,000 users
“5-5-5 Rule”
Course: Lesson 1 → Lesson 2 → Lesson 3 → Lesson 4 → Lesson 5 → Final Exam → Certificate/Badge
Each lesson: 5 videos
Lab Exercises
Learn hands-on. Exercises in the cloud.
DataScientistWorkbench.com
Data Science Methodology
1. Business Understanding
2. Analytic Approach
3. Data Requirements
4. Data Collection
5. Data Understanding
6. Data Preparation
7. Modelling
8. Evaluation
9. Deployment
10. Feedback
Prediction, Interpretation, Justification, Testing
Flight delays
“Polong will fly from San Francisco to New York for a meeting at 3:00pm on Friday, July 22.”
Can Polong anticipate whether his flight will be delayed?
San Francisco → New York
1. Business understanding
• Every project begins with business understanding.
• What is the project objective?
• What are we trying to do? What is our goal?
1. Formulate a clear question
2. Define problem and solution requirements
Flight delays: Create a solution that helps users predict whether a flight on a given day will be delayed or not delayed.
2. Analytic approach
• Identify suitable statistical/machine learning technique(s), for example:
• Linear regression
• Logistic regression
• Clustering
• Decision trees
• Principal component analysis
• Text analysis
• SVM/SVR
• Neural networks
• Dimension reduction
Flight delays: Using the departure and arrival airports, date, carrier, etc., we could predict flight [DELAY] or [NO-DELAY] using logistic regression.
3. Data Requirements
• What data is required?
• What format?
4. Data Collection
• Collect the data
5. Data Understanding
• What does the data look like?
• What are initial insights?
• Can we visualize the data?
• Are we missing anything?
Flight Data
• Flight data
• Open data available
• All domestic US flights per year
• CSV format
• Which airports are busiest?
• Which flights are most delayed?
• Which airports are best/worst?
We will only look at data from 2007 (seven million flights).
http://stat-computing.org/dataexpo/2009/the-data.html
[Figure: Departure Delay (min)]
[Figure: Which airports are busiest?]
[Figure: Which flights are most likely to be delayed?]
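The data understanding questions above can be explored with a few lines of code. The following is a minimal sketch, assuming the flight CSV has columns named `Origin` and `DepDelay` and marks missing values as `NA`; those names, and the tiny inline sample, are illustrative assumptions rather than real flight records.

```python
import csv
import io
from collections import Counter

# Made-up sample rows in an ASSUMED column layout; not real flight data.
SAMPLE_CSV = """Year,Month,DayofMonth,DayOfWeek,CRSDepTime,UniqueCarrier,DepDelay,Origin,Dest
2007,1,1,1,0715,WN,3,SFO,JFK
2007,1,1,1,1830,AA,42,SFO,LAX
2007,1,2,2,0905,UA,NA,ORD,SFO
2007,1,2,2,1600,WN,20,ORD,JFK
2007,1,3,3,1210,AA,-5,SFO,ORD
"""


def busiest_airports(rows, top_n=3):
    """Count departures per origin airport, most frequent first."""
    counts = Counter(row["Origin"] for row in rows)
    return counts.most_common(top_n)


def mean_departure_delay(rows):
    """Average departure delay in minutes, skipping missing values."""
    delays = [float(r["DepDelay"]) for r in rows if r["DepDelay"] not in ("", "NA")]
    return sum(delays) / len(delays) if delays else None


# For the full file, replace io.StringIO(SAMPLE_CSV) with open("2007.csv", newline="")
# (the file name is hypothetical).
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(busiest_airports(rows))
print(mean_departure_delay(rows))
```

Reading rows as dictionaries keeps the sketch independent of column order, which helps when the exact layout of the file is not yet known.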
6. Data preparation
Data preparation typically includes:
• Data cleaning
• Merging data
• Transforming data
• Feature engineering
• Text analysis
Flights are classified as “delayed” if >15 min late.
• Delayed? [True or False]
Does the time of day of departure predict delays?
• Hour
[Figure: Which day of the week and time of departure is worst?]
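The two engineered features above (a Delayed flag and the departure Hour) can be sketched as follows. This assumes that "late" refers to the departure delay in minutes and that the scheduled departure time is stored as an hhmm value; the field names `DepDelay`, `CRSDepTime`, and `DayOfWeek` are assumptions, and the sample rows are made up.

```python
from collections import defaultdict


def to_hour(crs_dep_time):
    """Turn an hhmm departure time such as '1830' or '715' into an hour (0-23)."""
    return int(crs_dep_time) // 100 % 24


def is_delayed(dep_delay_minutes, threshold=15):
    """A flight counts as delayed only if it is MORE than `threshold` minutes late."""
    return dep_delay_minutes > threshold


def prepare(rows):
    """Drop rows with a missing delay and derive the Hour and Delayed features."""
    prepared = []
    for row in rows:
        if row["DepDelay"] in ("", "NA"):  # assumed missing-value markers
            continue
        prepared.append({
            "DayOfWeek": int(row["DayOfWeek"]),
            "Hour": to_hour(row["CRSDepTime"]),
            "Delayed": is_delayed(float(row["DepDelay"])),
        })
    return prepared


def delay_rate_by(prepared, key):
    """Share of delayed flights for each value of `key` (e.g. 'Hour')."""
    totals, delayed = defaultdict(int), defaultdict(int)
    for record in prepared:
        totals[record[key]] += 1
        delayed[record[key]] += record["Delayed"]
    return {value: delayed[value] / totals[value] for value in sorted(totals)}


# Made-up rows for illustration only.
raw_rows = [
    {"DayOfWeek": "1", "CRSDepTime": "0715", "DepDelay": "3"},
    {"DayOfWeek": "1", "CRSDepTime": "1830", "DepDelay": "42"},
    {"DayOfWeek": "5", "CRSDepTime": "1805", "DepDelay": "16"},
    {"DayOfWeek": "5", "CRSDepTime": "905", "DepDelay": "15"},
    {"DayOfWeek": "5", "CRSDepTime": "1210", "DepDelay": "NA"},
]
prepared = prepare(raw_rows)
print(delay_rate_by(prepared, "Hour"))
print(delay_rate_by(prepared, "DayOfWeek"))
```

Grouping the delay rate by `Hour` or `DayOfWeek` is one simple way to approach the question of which day and departure time is worst.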
7. Modelling
• Modelling is a highly iterative process
• Multiple models may be used and tested
Logistic Regression
Using inputs:
• Year
• Month
• Day of Month
• Hour of departure
• Distance
• Destination airport
Predict:
• Delay (True/False)
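To make the idea concrete, here is a from-scratch sketch of logistic regression trained with plain gradient descent. It is not the talk's actual model: it uses only two of the listed inputs (hour of departure and distance, both rescaled to roughly 0-1), the training rows are made up, and the learning rate and epoch count are arbitrary choices. In practice a library implementation would normally be used.

```python
import math


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, features):
    """Estimated probability that a flight with these features is delayed."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def fit_logistic(X, y, learning_rate=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_features = len(X[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for features, label in zip(X, y):
            error = predict_proba(weights, bias, features) - label
            for j, x in enumerate(features):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - learning_rate * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(X)
    return weights, bias


# Made-up training data: [hour / 24, distance / 3000]; label 1 = delayed.
X = [[7 / 24, 0.8], [8 / 24, 0.2], [10 / 24, 0.5],
     [17 / 24, 0.3], [19 / 24, 0.9], [21 / 24, 0.6]]
y = [0, 0, 0, 1, 1, 1]

weights, bias = fit_logistic(X, y)
evening = predict_proba(weights, bias, [20 / 24, 0.5])
morning = predict_proba(weights, bias, [6 / 24, 0.5])
print(f"P(delay | 8pm departure) = {evening:.2f}")
print(f"P(delay | 6am departure) = {morning:.2f}")
```

Categorical inputs such as the destination airport would need to be encoded as numbers (for example one indicator column per airport) before they could be added to the feature list.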
8. Evaluation
How accurately does our model predict delays?
• Does the model performance meet our business goals?
• Do we need to refine our model?
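One common way to answer "how accurately does the model predict delays" is to compare predictions with actual outcomes on held-out flights. The sketch below computes accuracy, precision, and recall from a confusion matrix; the choice of these three metrics and the example labels are assumptions for illustration, not results from the talk.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for boolean labels."""
    pairs = list(zip(actual, predicted))
    return {
        "tp": sum(1 for a, p in pairs if a and p),
        "fp": sum(1 for a, p in pairs if not a and p),
        "fn": sum(1 for a, p in pairs if a and not p),
        "tn": sum(1 for a, p in pairs if not a and not p),
    }


def evaluate(actual, predicted):
    """Return accuracy, precision, and recall for the 'delayed' class."""
    c = confusion_counts(actual, predicted)
    total = c["tp"] + c["fp"] + c["fn"] + c["tn"]
    predicted_delayed = c["tp"] + c["fp"]
    actually_delayed = c["tp"] + c["fn"]
    return {
        "accuracy": (c["tp"] + c["tn"]) / total,
        "precision": c["tp"] / predicted_delayed if predicted_delayed else 0.0,
        "recall": c["tp"] / actually_delayed if actually_delayed else 0.0,
    }


# Made-up labels: True means the flight was (or was predicted to be) delayed.
actual = [True, False, True, False, False, True, False, False]
predicted = [True, False, False, False, True, True, False, False]
metrics = evaluate(actual, predicted)
print(metrics)
```

Because most flights are not delayed, accuracy alone can look good for a model that never predicts a delay, so precision and recall are worth checking against the business goal.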
Deployment and feedback
9. Deployment
• Once finalized, the model is deployed into a production environment.
• May be in a limited / test environment until model is proven
• Involves additional groups, skills, and technologies
  • Solution owner
  • Marketing
  • Application developers and designers
  • IT administration
10. Feedback
• Feedback to assess model performance
• Gathering and analysis of feedback for assessment of the model’s performance and impact
• Iterative process for model refinement and redeployment
• Accelerate through automated processes
Prediction, Interpretation, Justification, Testing
Creating	a	prototype
Case-study & Demo: Food
Can we use ingredients to predict what cuisine a recipe belongs to?
What cuisine is this?
[Image; visible text: “4 minute BLT Beast”]
What cuisine is this?
Ingredients: Rice, Seaweed, Wasabi, Soy sauce
http://allrecipes.com/recipe/189477/california-roll-sushi/
How are we able to tell what kind of cuisine some food dish is, even if we’ve never seen it before?
Image credits: Schellack at English Wikipedia; https://www.flickr.com/photos/10559879@N00/4004745542
1. Business Understanding / Research
A. Based on the ingredients alone, can we predict what cuisine a food dish belongs to?
B. Which cuisines are similar to each other based on their ingredients?
Food and ingredients
Japanese, American, British, Indian, Chinese, French, Italian, Vietnamese, Canadian
2. Analytic Approach: Decision trees
A. Based on the ingredients alone, can we predict what cuisine a food dish belongs to?
ALL CUISINES
  Rice?
    NO → NON-ASIAN FOOD
    YES → ASIAN FOOD
      Wasabi?
        NO → NOT JAPANESE
        YES → JAPANESE
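The Rice?/Wasabi? tree can be written directly as nested conditions. This is a hand-coded sketch of that two-question tree only; a real decision tree would learn its splits from the recipe data, and the function name and return strings here are illustrative.

```python
def classify_cuisine(ingredients):
    """Walk the two-question tree: first 'Rice?', then 'Wasabi?'."""
    items = {ingredient.strip().lower() for ingredient in ingredients}
    if "rice" not in items:
        return "NON-ASIAN FOOD"
    # Rice is present, so the dish falls under ASIAN FOOD; ask the next question.
    if "wasabi" not in items:
        return "ASIAN FOOD, NOT JAPANESE"
    return "JAPANESE"


print(classify_cuisine(["Rice", "Seaweed", "Wasabi", "Soy sauce"]))
print(classify_cuisine(["Rice", "Soy sauce", "Ginger"]))
print(classify_cuisine(["Bacon", "Lettuce", "Tomato"]))
```

Each `if` corresponds to one internal node of the tree, and each `return` to a leaf, which is why decision trees are comparatively easy to interpret and justify.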
2. Analytic Approach: K-means clustering
B. Which cuisines are similar to each other based on their ingredients?
Group similar cuisines together into k clusters.
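A minimal k-means sketch follows. It assumes each cuisine is described by the fraction of its recipes that use a given ingredient; the four ingredients and all numbers below are made up for illustration, and the centroids are simply initialised from the first k points rather than chosen at random.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a list of vectors."""
    return [sum(column) / len(points) for column in zip(*points)]


def k_means(points, k, iterations=20):
    """Alternate between assigning points to the nearest centroid and re-centering."""
    centroids = [list(p) for p in points[:k]]  # simplistic initialisation
    assignments = [0] * len(points)
    for _ in range(iterations):
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = mean_point(members)
    return assignments, centroids


# Made-up usage fractions for [rice, soy sauce, butter, olive oil].
cuisines = {
    "Japanese": [0.8, 0.9, 0.1, 0.0],
    "French": [0.1, 0.0, 0.9, 0.4],
    "Chinese": [0.9, 0.8, 0.0, 0.1],
    "Italian": [0.2, 0.0, 0.5, 0.9],
    "Vietnamese": [0.9, 0.6, 0.0, 0.0],
}
assignments, centroids = k_means(list(cuisines.values()), k=2)
labels = dict(zip(cuisines, assignments))
print(labels)
```

The number of clusters k has to be chosen up front, so in practice several values are usually tried and compared.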
Data Collection: Web scrape
• www.allrecipes.com
• www.epicurious.com
• www.menupan.com
Data scraped by Yong-Yeol Ahn
http://yongyeol.com/
Data Understanding
Polong Lin(林伯龍)/how to approach data science problems from start to end
