Skills to Master Machine Learning and
Data Science Project Cycle &
Strategies to Avoid Common Pitfalls
MING LI, AMAZON
KDD 2019 WORKSHOP “INITIATIVE FOR ANALYTICS AND DATA SCIENCE STANDARDS”
It is rare to see machine learning and data
science presentations that focus on the
end-to-end project cycle. It is even rarer to
see common project pitfalls discussed.
In this talk, we cover both!
End-to-End Machine Learning and
Data Science Project Cycle
Overview
• Model exploration and development
• Model training, validation, testing
• Model selection
Breakdown
o Python + sklearn
o R & various libs
o Keras + DL libs
o Spark MLlib
o Traditional machine
learning (ML) methods
o Deep learning methods
o General model training
and tuning process
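The training, validation, testing and model selection steps above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the workflow of any specific library named on the slide; the helper names (`split_three_ways`, `fit_mean`, `fit_line`, `select_model`) and the toy data are assumptions made for the example.

```python
import random
from statistics import mean

def split_three_ways(rows, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle once, then cut into train / validation / test partitions."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_frac)
    n_val = int(len(rows) * val_frac)
    test = rows[:n_test]
    val = rows[n_test:n_test + n_val]
    train = rows[n_test + n_val:]
    return train, val, test

def fit_mean(train):
    """Baseline candidate: always predict the training mean of y."""
    y_bar = mean(y for _, y in train)
    return lambda x: y_bar

def fit_line(train):
    """Second candidate: ordinary least squares for one feature (closed form)."""
    x_bar = mean(x for x, _ in train)
    y_bar = mean(y for _, y in train)
    sxx = sum((x - x_bar) ** 2 for x, _ in train)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in train)
    slope = sxy / sxx
    return lambda x: y_bar + slope * (x - x_bar)

def mse(model, rows):
    """Mean squared error of a fitted model on (x, y) rows."""
    return mean((model(x) - y) ** 2 for x, y in rows)

def select_model(candidates, train, val):
    """Fit every candidate on train; keep the one with the lowest validation error."""
    fitted = {name: fit(train) for name, fit in candidates.items()}
    best = min(fitted, key=lambda name: mse(fitted[name], val))
    return best, fitted[best]

# Toy data (hypothetical): y is roughly 3x + 1 with a little noise.
rng = random.Random(42)
data = [(x, 3.0 * x + 1.0 + rng.gauss(0, 0.5)) for x in range(100)]

train, val, test = split_three_ways(data)
best_name, best_model = select_model({"mean": fit_mean, "line": fit_line}, train, val)
test_mse = mse(best_model, test)  # the test set is used once, after selection
```

The point of the three-way split is that the validation set drives the choice between candidates, while the test set is held back so the final error estimate is not biased by that choice.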
• Data science formulation
• Data quality and availability
• Data preprocessing
• Feature engineering
Breakdown
o Statistical/ML thinking
o Idea abstraction
o Data generation process
o Data understanding
o Common preprocessing and
feature engineering ideas
o SQL
o Python: pandas
o R: tidyverse
o PySpark / SparkR
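As one example of the common preprocessing and feature engineering ideas listed above, the sketch below fills missing values with the median, keeps a flag that records where a value was missing, and derives a ratio feature. It uses plain Python dictionaries rather than pandas or tidyverse; the function names and the `orders` records are hypothetical.

```python
from statistics import median

def impute_with_flag(rows, column):
    """Fill missing values in `column` with the median and record where that happened."""
    observed = [r[column] for r in rows if r.get(column) is not None]
    fill = median(observed)
    out = []
    for r in rows:
        missing = r.get(column) is None
        new = dict(r)
        new[column] = fill if missing else r[column]
        new[f"{column}_missing"] = int(missing)  # keep the missingness signal as a feature
        out.append(new)
    return out

def add_ratio(rows, numerator, denominator, name):
    """Derived feature: ratio of two columns, None when the denominator is zero."""
    out = []
    for r in rows:
        new = dict(r)
        new[name] = r[numerator] / r[denominator] if r[denominator] else None
        out.append(new)
    return out

# Hypothetical records; order 2 has no recorded amount.
orders = [
    {"order_id": 1, "amount": 20.0, "items": 2},
    {"order_id": 2, "amount": None, "items": 1},
    {"order_id": 3, "amount": 60.0, "items": 3},
]
clean = add_ratio(impute_with_flag(orders, "amount"), "amount", "items", "amount_per_item")
```

In a real project the fill value would normally be computed on the training partition only and then reused for validation, test and production data, so that no information leaks from held-out rows.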
• Business problem definition and understanding
• Quantifying business value and defining key metrics
• Computation resources assessment
• Key milestones and timeline
• Data security, privacy and legal review
Breakdown
o Business insight
o Collaboration
o Teamwork
o Planning
o Prototype
• A/B testing in production system
• Model deployment in production environment
• Exception management
• Performance monitoring
Breakdown
o Production system
knowledge
o Launch plan
o Anomaly detection
o Dashboard
o Tableau
o R-Shiny
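For the A/B testing item above, one common way to read the result is a two-proportion z-test on a conversion-style metric. The sketch below is an assumption about how such a check might look, not the method used in the talk; the function name and the example counts are hypothetical.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates between arms A and B."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the standard normal
    return z, p_value

# Hypothetical experiment: control converts 200 of 2000, new model 260 of 2000.
z, p_value = two_proportion_z_test(200, 2000, 260, 2000)
```

A small p-value only says the difference is unlikely to be noise; whether the lift is large enough to matter is still a business decision tied to the key metrics agreed at the planning stage.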
• Model tuning and re-training
• Model update and add-on
• Model failure and retirement
• …
Breakdown
o Quantify business
dynamics
o Consider different
scenarios
o Plan and set criteria
o Auto ML
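"Plan and set criteria" can be made concrete by agreeing in advance on a measurable trigger for re-training. One option, shown here as an illustrative sketch, is the population stability index (PSI) between a feature's distribution at training time and its recent distribution in production. The bin edges, the 0.2 threshold (a frequently quoted rule of thumb) and the function names are assumptions, not values from the talk.

```python
import math

def psi(expected, actual, edges):
    """Population stability index between two samples over fixed, ascending bin edges."""
    def shares(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[sum(v >= e for e in edges)] += 1  # index of the bin that holds v
        # Small floor so that empty bins do not produce log(0).
        return [max(c / len(values), 1e-6) for c in counts]
    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

def needs_retraining(expected, actual, edges, threshold=0.2):
    """Assumed criterion: re-train when the drift score exceeds the agreed threshold."""
    return psi(expected, actual, edges) > threshold

# Hypothetical feature values at training time and after a shift in production.
training_values = list(range(100))
recent_values = [v + 50 for v in range(100)]
edges = [25, 50, 75]
```

Here `needs_retraining(training_values, recent_values, edges)` flags the shifted sample, while comparing the training sample with itself gives a PSI of zero.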
Project Cycle Breakdown
Planning
• Business problem definition and understanding
• Quantifying business value and defining key metrics
• Computation resources assessment
• Key milestones and timeline
• Data security, privacy and legal review
Formulation
• Data science formulation
• Data quality and availability
• Data preprocessing
• Feature engineering
Modeling
• Model exploration and development
• Model training, validation, testing
• Model selection
Production
• A/B testing in production system
• Model deployment in production environment
• Exception management
• Performance monitoring
Post Production
• Model tuning and re-training
• Model update and add-on
• Model failure and retirement
• Business teams
o Operation team
o Business analyst team
o Insight and reporting team
• Technology team
o Database and data warehouse team
o Data engineering team
o Infrastructure team
o Core machine learning team
o Software development team
o Visualization dashboard team
o Production implementation
• Project management team
• Program management team
• Product management team
• Senior leadership team
• Leaders across organizations
Cross Team Collaboration
Agile-Style Project Management
Team Work
Common Pitfalls in Machine Learning and
Data Science Projects
Project Planning Stage
Solving the wrong problem
o Vague description of business needs
o Misalignment across many teams (Scientist, Developer, Operation, Project Managers etc.)
o The scientist team is not actively participating in the problem formulation process
Too optimistic about the timeline
o Project managers may not have past experience with ML and data science projects
o Many ML method-specific uncertainties are not accounted for at the planning stage
o ML and data science projects are fundamentally different from each other and from software development
projects (such as online vs. offline model, batch model, real-time training, re-training etc.)
Overpromising on business value
o Unrealistically high expectations (e.g., advertisement vs. actual product)
o Many assumptions about the project usually turn out not to be true
o Similar projects from other teams/companies are not evaluated thoroughly to set realistic expectations for timeline
and outcome
Problem Formulation Stage
Too optimistic about standard statistical and ML methods
o Extra effort is needed to abstract the business problem into a set of analytics problems
o Standard methods are usually not enough to solve the business problems
Too optimistic about data availability and quality
o “Big data” does not guarantee good, relevant data; it is usually big and messy
o Ideal data for the business problem is almost never available
o Unexpected effort to bring in the right data
o Underestimating the effort to evaluate data quality
Too optimistic about needed effort on data preprocessing
o Table or column descriptions are not detailed enough
o Lack of in-depth understanding of the dataset
o Underestimating data preprocessing (such as dealing with missing data)
o Underestimating the effort for feature engineering
o Mismatch between different data sources (such as online vs offline, different tables etc.)
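Several of the data pitfalls above can be surfaced early with a quick audit before any modeling starts. The sketch below checks two things: which keys appear in one source but not the other (for example offline training tables vs. online logs), and what share of rows is missing each column. The function names and the sample rows are hypothetical.

```python
def reconcile_keys(offline_rows, online_rows, key):
    """Compare the key sets of two sources and report where they disagree."""
    offline_keys = {r[key] for r in offline_rows}
    online_keys = {r[key] for r in online_rows}
    return {
        "only_offline": sorted(offline_keys - online_keys),
        "only_online": sorted(online_keys - offline_keys),
        "in_both": sorted(offline_keys & online_keys),
    }

def missing_rate(rows, columns):
    """Share of rows where each column is None or absent."""
    return {c: sum(r.get(c) is None for r in rows) / len(rows) for c in columns}

# Hypothetical sources keyed by user_id.
offline = [
    {"user_id": 1, "age": 30},
    {"user_id": 2, "age": None},
    {"user_id": 3, "age": 41},
]
online = [{"user_id": 2}, {"user_id": 3}, {"user_id": 4}]

key_report = reconcile_keys(offline, online, "user_id")
age_missing = missing_rate(offline, ["age"])
```

A report like this is cheap to produce and gives the team a concrete basis for estimating the preprocessing effort instead of assuming the data is ready.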
Modeling Stage
Unrepresentative data (such as biased data or data that does not reflect
what will happen in production)
Too optimistic that model selection and hyperparameter
tuning will reach the desired performance
Overfitting and obsession with complicated models (heavy models
may lead to poor production performance)
Take too long to fail
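Two of the modeling pitfalls above lend themselves to a simple guard. Holding out the most recent period, instead of a random sample, gives an evaluation that is closer to what production will look like, and comparing training and holdout scores gives an early overfitting signal. This is a sketch under assumptions: the function names, the `max_gap` tolerance and the example events are all hypothetical.

```python
from datetime import date

def time_based_split(rows, time_key, cutoff):
    """Train on rows strictly before `cutoff`; evaluate on rows at or after it."""
    train = [r for r in rows if r[time_key] < cutoff]
    holdout = [r for r in rows if r[time_key] >= cutoff]
    return train, holdout

def looks_overfit(train_score, holdout_score, max_gap=0.05):
    """Flag a model whose training score exceeds its holdout score by more than `max_gap`."""
    return (train_score - holdout_score) > max_gap

# Hypothetical monthly events from January to June 2019.
events = [{"day": date(2019, month, 1), "label": month % 2} for month in range(1, 7)]
train_rows, holdout_rows = time_based_split(events, "day", date(2019, 5, 1))
```

Agreeing on the holdout period and the acceptable gap before modeling starts also helps a project fail fast, because there is a clear point at which an approach is judged not to be working.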
Productionization Stage
Bad production performance
o Lack of a shadow-mode dry run
o Lack of needed A/B testing
o Data availability and stability issues in real time
o Lack of exception management for issues such as timeouts and missing data
Failure to scale in real-time applications
o Computation capacity limitations
o Real-time data storage and processing limitations
o Latency constraints
o Not enough engineering resources (e.g., SDE, DE) during implementation
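The shadow-mode dry run and the exception management items above can be illustrated with a small wrapper around the model call. This is a sketch, not a production design: the pool size, the latency budget, the default prediction and every function name here are assumptions made for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Shared worker pool (hypothetical sizing). A timed-out call keeps running in its worker.
_POOL = ThreadPoolExecutor(max_workers=4)

def predict_with_fallback(model, features, required, default, timeout_s=0.2):
    """Return (prediction, status) and never raise into the caller."""
    if any(features.get(name) is None for name in required):
        return default, "missing_data"
    future = _POOL.submit(model, features)
    try:
        return future.result(timeout=timeout_s), "ok"
    except FutureTimeout:
        return default, "timeout"
    except Exception:
        return default, "model_error"

def serve_with_shadow(live_model, shadow_model, features, shadow_log):
    """Answer with the live model; record what the shadow model would have said."""
    live = live_model(features)
    entry = {"features": dict(features), "live": live}
    try:
        entry["shadow"] = shadow_model(features)
    except Exception as exc:  # a shadow failure must not affect the live answer
        entry["shadow_error"] = repr(exc)
    shadow_log.append(entry)
    return live

# Hypothetical models used to exercise the wrappers.
def fast_model(features):
    return 2.0 * features["x"]

def slow_model(features):
    time.sleep(1.0)  # simulates a model that exceeds the latency budget
    return 2.0 * features["x"]

def broken_model(features):
    raise ValueError("bad input")
```

Returning a status next to the prediction makes it possible to count timeouts, missing inputs and model errors separately, which feeds directly into the monitoring and notification needs of the post-production stage.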
Post Production Stage
Missing necessary checkups
o Lack of model monitoring for key metrics
o Lack of exception notification
o Lack of model failure/timeout notification
o Online features not stored for future analysis
Production performance degradation
o Not aware of the dynamic nature of the business problem
o Not aware of changing input data quality and availability
o Lack of a model tuning and re-training plan
o Lack of a model retirement or replacement plan
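Monitoring a key metric with a notification rule, as called for above, does not need heavy tooling to get started. The sketch below keeps a rolling window of a metric and returns an alert message when the rolling mean falls below an agreed floor. The class name, the window size, the baseline and the tolerance are hypothetical choices for illustration.

```python
from collections import deque

class MetricMonitor:
    """Rolling-window check of one key metric against a baseline."""

    def __init__(self, baseline, tolerance, window=50):
        self.baseline = baseline
        self.tolerance = tolerance
        self.values = deque(maxlen=window)

    def record(self, value):
        """Add one observation; return an alert message, or None if all is well."""
        self.values.append(value)
        if len(self.values) < self.values.maxlen:
            return None  # wait for a full window before judging
        rolling = sum(self.values) / len(self.values)
        floor = self.baseline - self.tolerance
        if rolling < floor:
            return f"rolling mean {rolling:.3f} below floor {floor:.3f}"
        return None

# Hypothetical daily accuracy readings: stable at first, then a drop.
monitor = MetricMonitor(baseline=0.90, tolerance=0.05, window=5)
early_alerts = [monitor.record(0.90) for _ in range(5)]
late_alert = monitor.record(0.50)
```

In practice the alert would be routed to whoever owns the model, and the same readings would be stored so that degradation can be analyzed later and used to decide between re-training and retirement.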
Strategies to Avoid Pitfalls
Be aware of these pitfalls
Proactively discuss these pitfalls across teams
Be prepared when they happen
Have an end-to-end project cycle mindset
Ensure the needed engineering resources are available
True collaboration among teams
Data Science Family Career Path
Skills Career Title Education
If you are good at all
three areas, you
become a "full-stack"
data scientist!
Each career title has
its own growth path
and promotion cycle.
Thank you!

Editor's Notes

  • #15: At the problem formulation stage, the biggest pitfall is “solving the wrong problem.” It sounds funny, but it is sadly true. One of the main reasons is that data scientists are not involved in the initial discussion. Overpromising on business value is another big problem at this stage; often the data scientist’s data-driven, fact-based voice is not heard.
  • #16: At the modeling stage, the pitfalls are familiar to statisticians, such as unrepresentative data. Most of the time, the ideal data is not available for a specific business problem, but very biased data will definitely lead to model failure.