WHY SO MANY DATA SCIENCE PROJECTS FAIL?
Ethan Ram / Aug. 2018
DATA SCIENCE PROJECTS FAIL…
• Between 70% and 80% of corporate business intelligence projects fail (Gartner)
• 55% of big data projects are never finished (Infochimps)
• Only 13% of organizations achieve full-scale production for their in-house big-data implementations (Qubole)
• And the results…
9/3/2018 Why So Many Data Science Projects Fail 2
“I HATE THIS JOB!”
Top of the list of developers who said they are looking for a new job*:
• ML specialists - 14.3%
• Data scientists - 13.2%
* 2018 Stack Overflow survey based on 64,000 developers’ answers
DATA SCIENCE APPLICATION LIFECYCLE
Business objective and plan > Build dataset > Model data and validate > Implement application > Deploy > Monitor, measure & optimize
We’ll look at some common failures in each step and suggest better approaches.
BUSINESS OBJECTIVE FAILURES
• First day success
• No false-positives
• 100% accuracy
• No business value expected
• Expecting that the ML itself would be the product
• Not defining the deliverable
CAN YOU AFFORD A FALSE POSITIVE?
• Google “fixed” its “racist” algorithm by removing gorillas from its image-labeling tech
BE REALISTIC!
• Very few businesses have a core product that is AI/ML/data based
• Most use those tools to improve their bottom lines with existing products
TYPES OF DELIVERABLES
1. Descriptive analysis (offline report)
2. Dashboard (real-time system)
3. Automated decision-making system (“self driving” system)
4. Dataset with specific qualities (to be used by another ML). Define: leverage, friction to impact and cleanness
5. Methodology (dataset >> model)
6. Framework (API/SDK to build methodologies)
7. Proof-of-concept (prove a viable methodology)
PLANNING FAILURES
• Missing diversity in the team
• In many projects 80% of the work is working on the dataset!
• It’s a *research* project!
• Short time to delivery
[Figure: Data Science Diagram, Drew Conway; visible label: Engineering]
YOLO V3 NETWORK ARCHITECTURE
DATA INVENTORY FAILURES
• Too little data to build on
• Dataset is dirty
• Missing data from the field
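One way to put a number on "dataset is dirty" is the cleanness quality from the editor's notes (the percentage of errors in the dataset). The sketch below is only an illustration under stated assumptions: the function name `cleanness_report`, the choice of "missing required field" and "exact duplicate row" as the only error types, and the sample address rows are all hypothetical, not taken from the talk.

```python
def cleanness_report(rows, required_fields):
    """Count simple data-quality problems in a list of dict records.

    Assumption: an "error" here is either an exact duplicate of an earlier
    row or a row with an empty/missing required field. Real projects will
    need domain-specific checks on top of this.
    """
    total = len(rows)
    seen = set()
    missing = 0
    duplicates = 0
    for row in rows:
        # A hashable fingerprint of the row, independent of key order.
        key = tuple(sorted(row.items()))
        if key in seen:
            duplicates += 1
            continue
        seen.add(key)
        if any(row.get(field) in (None, "") for field in required_fields):
            missing += 1
    bad = missing + duplicates
    return {
        "total": total,
        "missing": missing,
        "duplicates": duplicates,
        "error_rate": bad / total if total else 0.0,
    }


# Hypothetical address list collected from a web form.
rows = [
    {"street": "1 Main St", "city": "Haifa"},
    {"street": "1 Main St", "city": "Haifa"},   # exact duplicate
    {"street": "", "city": "Tel Aviv"},         # missing street
    {"street": "5 Oak Ave", "city": "Eilat"},
]
report = cleanness_report(rows, required_fields=["street", "city"])
print(report)
```

Running a report like this before modeling gives the team an early, shared estimate of how much dataset work the project really involves.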
DIRTY DATASET: NEGATIVE INFLUENCE
Dataset includes negative influence examples
Resulting classification (with confidence)
DATA MODELING FAILURES
You need to be able to understand the result!
• Jumping to conclusions on what the data is
• Assuming it works based on a small sample
• Feedback-loop in results
• Missing cross validation
• Choosing algorithms that are too heavy for the application
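To make the "missing cross validation" point concrete, here is a minimal k-fold sketch in plain Python. It is an illustration, not the talk's method: the helper names (`k_fold_indices`, `cross_validate`), the toy threshold classifier and the synthetic data are all assumptions chosen to keep the example self-contained.

```python
import random
from statistics import mean


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(xs, ys, fit, predict, k=5, seed=0):
    """Return one accuracy score per fold, each measured on held-out data."""
    scores = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        correct = sum(predict(model, xs[i]) == ys[i] for i in held_out)
        scores.append(correct / len(held_out))
    return scores


# Toy model (assumption): a threshold halfway between the class means.
def fit_threshold(xs, ys):
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return (mean(pos) + mean(neg)) / 2


def predict_threshold(threshold, x):
    return 1 if x > threshold else 0


# Synthetic data: values 0..19, labeled 1 for the upper half.
xs = [float(i) for i in range(20)]
ys = [0] * 10 + [1] * 10
scores = cross_validate(xs, ys, fit_threshold, predict_threshold, k=5)
print(scores, mean(scores))
```

The spread between fold scores is as informative as the mean: a model that only looks good on one small sample tends to show large fold-to-fold variation.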
ML ALGORITHMS [PARTIAL] MAP
Application > Class > Algorithms
• Supervised learning
    • Classification: Linear classifiers / Fisher's discriminant; Support vector machines / Least squares; Quadratic classifiers; Kernel estimation / K-nearest neighbor; Boosting; Decision trees; Random forests; Neural networks; Learning vector quantization
    • Regression: Linear Regression; Logistic Regression; CART; Naïve Bayes
    • Ensemble: Bagging with Random Forests; Boosting with XGBoost
• Unsupervised learning
    • Association: Apriori
    • Clustering: K-means; Mean-Shift; Density-Based Spatial; EM-GMM; Agglomerative Hierarchical
    • Dimensionality Reduction
        • Feature selection: Variance Thresholds; Correlation Thresholds; Genetic Algorithms (GA); Stepwise Search
        • Feature extraction: PCA; Linear Discriminant Analysis (LDA); Autoencoders
• Reinforcement learning
    • Exploration: Criterion of optimality; Brute force; Value function; Direct policy search
APPLICATION FAILURES
• Requesting the data science team to build the application…
• Not testing to scale
• Switching from monitoring to automatic action-taking too fast
• Missing safeguards on output
• Not preparing for attack!
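Two of the failures above, missing safeguards on output and switching to automatic action-taking too fast, can be addressed by a thin guard layer between the model and the action. The following is a hedged sketch only: `Decision`, `guard_output`, the three action names and the default confidence threshold are hypothetical names and values invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    action: str   # "apply", "review" (send to a human), or "log_only"
    value: float
    reason: str


def guard_output(predicted, confidence, *, lower, upper,
                 min_confidence=0.9, monitor_only=True):
    """Decide what to do with a model output before anything acts on it.

    Assumptions: the output is a single number with a known safe range
    [lower, upper], and the model reports a confidence in [0, 1].
    """
    if monitor_only:
        # Default mode: record what the model would have done, act on nothing.
        return Decision("log_only", predicted, "monitor-only mode")
    if confidence < min_confidence:
        return Decision("review", predicted, "low confidence")
    if not (lower <= predicted <= upper):
        clamped = min(max(predicted, lower), upper)
        return Decision("review", clamped, "out of allowed range")
    return Decision("apply", predicted, "within limits")


# Hypothetical usage: a suggested discount percentage, allowed range 0..100.
print(guard_output(12.0, 0.99, lower=0, upper=100))
print(guard_output(250.0, 0.99, lower=0, upper=100, monitor_only=False))
```

Making `monitor_only=True` the default means automatic action has to be switched on deliberately, after the logged decisions have been reviewed for long enough.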
DIRECT ATTACK EXAMPLE

SYNTHESIZED ADVERSARIAL EXAMPLE
“WE HAVEN’T SEEN ANYTHING LIKE THIS BEFORE…”
MONITOR > MEASURE > OPTIMIZE FAILURES
• Assuming it just works…
    • Not having a long enough beta
    • Missing feedback from real users
• Missing KPIs
    • Measure business success
    • Find false-positives
• Missing A-B testing built-in
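Built-in A-B testing needs some way to decide whether a KPI difference between two variants is more than noise. A common choice, used here purely as an illustration, is a two-proportion z-test on conversion counts. The function name and the sample numbers are assumptions; the talk does not prescribe a specific test.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates.

    conv_a / n_a: conversions and users in the control group (A).
    conv_b / n_b: conversions and users in the variant group (B).
    Returns (z, p_value).
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both groups converted at exactly 0% or 100%: no evidence either way.
        return 0.0, 1.0
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical KPI: 100/1000 conversions on A versus 150/1000 on B.
z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

The test assumes the two groups do not influence each other; the feedback-loop example in the editor's notes (a 'like' button that reorders the list for everyone) is exactly the kind of setup that breaks this assumption.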
"Right now, a lot of our AI
systems make decisions in
ways that people don't really
understand… And I don't
think that… we want to end
up with systems that people
don't understand how they're
making decisions."
• ZUCKERBERG at Senate hearing 10-Apr-18
DATA SCIENCE APPLICATION LIFECYCLE
Business objective and plan > Build dataset > Model data and validate > Implement application > Deploy > Monitor, measure & optimize
• Q&A


Editor's Notes

  • #6: Expect magic to happen! YOLO (You Only Look Once) is a lightweight real-time object detector that can detect objects in a video stream. It took 5 years to get to this version. Dan Ariely: "Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it..."
  • #7: Precision vs. accuracy. False positive vs. false negative… Think of automatic cancer prediction: it can reduce the false negatives of a human professional, such as a radiologist.
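The difference between accuracy and precision in the note above is easy to see with numbers. The sketch below uses the standard definitions of accuracy, precision and recall; the `ConfusionCounts` class and the screening counts are hypothetical values made up for illustration, not data from the talk.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    tp: int  # true positives
    fp: int  # false positives
    tn: int  # true negatives
    fn: int  # false negatives


def accuracy(c):
    return (c.tp + c.tn) / (c.tp + c.fp + c.tn + c.fn)


def precision(c):
    # Of everything flagged positive, how much really was positive?
    return c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0


def recall(c):
    # Of everything really positive, how much did we catch?
    return c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0


# Hypothetical screening run: 1,000 cases, 10 of them truly positive.
counts = ConfusionCounts(tp=8, fp=40, tn=950, fn=2)
print(f"accuracy  = {accuracy(counts):.3f}")
print(f"precision = {precision(counts):.3f}")
print(f"recall    = {recall(counts):.3f}")
```

With these made-up counts the accuracy is high while most positive flags are false alarms, which is why the cost of a false positive versus a false negative has to be part of the business objective.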
  • #9: Outputs of data science:
    1. Descriptive analysis (a report): a clear answer to a clear question, such as which platform to release the new product on, Android or iOS. This is usually an offline system.
    2. Dashboard: helps a human decide and take action continuously, or again and again. This is usually an online/real-time system.
    3. Automated decision making: based on the dashboard, take automatic action (a “self driving” system).
    4. Dataset: data that is then used by another algorithm. For example, a cleaned-up list of addresses that were given by users on a form, or a dataset used for training and benchmarking object extraction from images, such as the COCO dataset or the ImageNet dataset. If your dataset is no good you will never get good results. Qualities of a dataset:
       Leverage: the potential of the dataset, i.e. what it can be used for.
       Friction to impact: the additional work needed on the dataset to get a significant impact.
       Cleanness: the percentage of errors in the dataset that may sabotage the learning process.
    5. Methodology: the system/algorithm that is used to take a dataset and create a model that can then be used to answer a question. For example, how to estimate national poll results based on a sample questionnaire of 500 people >> a "bias correction" system. Another example: a recommendation system.
    6. Framework: an API or SDK that is used to build (code) methodologies. For example, Google's AI framework, TensorFlow. The framework should assist in lowering the friction to impact.
    7. Proof-of-concept: it does not deliver the business impact, but it shows that the methodology is viable. Used for "fail-fast" or as a first milestone in a larger project.
  • #10: Computer science >> computer engineering. Math & stats: many times done by physicists. Need people who can do the data tagging. It’s a research project! Data science is more than machine learning.
    The importance of diversity in a data science team: based on the diagram, it is clear that it is very hard to find people who are able to cover all of the above, especially for a team that is meant to answer questions from a diverse set of domains. Some like offline analysis; some like real-time systems. Some are about processes; some are about tools. Some are very good in one domain but have zero knowledge about anything else, etc.
    A good data scientist should have excellent proficiency in one side _and_ at least some understanding of the other two sides.
  • #11: Example of how complicated an ML project can be… YOLO (You Only Look Once) is a lightweight real-time object detector that can detect objects in a video stream. It took 5 years to get to this version.
  • #12: Google Translate’s Maori dataset is too small, leading to some funny mistakes. Better not train your model on these cat pictures… A statistician would know this, but a computer systems engineer would not. Internal politics: would the engineer get access to the transactions database?
  • #14: Feedback-loop in results => need to understand causality. E.g. testing the size of a 'like' button: clicking 'like' on the big button brings the item to the top of the list for everyone, so it also affects the control group's clicks. You must make sure the observational inference matches causality. You need to be able to understand the result.
  • #15: … Boosting Consider changing to Yoav’s chart - - give examples
  • #16: Building an application with a good UX is outside the scope of a data science team. Tay is an AI-based chat bot created by Microsoft and “unleashed” on Twitter in 2016. It soon absorbed whatever the people talking with it said as the truth.
  • #17: URME Personal Surveillance Identity Prosthetic, by http://www.urmesurveillance.com/. Kerckhoffs-Shannon principle: “one ought to design systems under the assumption that the enemy will immediately gain full familiarity with them”. Don’t rely on the privacy of the model, because one day or another it will be leaked. You should not base your code entirely on open-source algorithms. You should not base your model on open datasets.
  • #18: Generative Adversarial Networks (GANs) are sometimes used to fool the original network. In the example: projected gradient descent. Synthesizing adversarial examples for neural networks is surprisingly easy: small, carefully-crafted perturbations to inputs can cause neural networks to misclassify inputs in arbitrarily chosen ways. Given that adversarial examples transfer to the physical world and can be made extremely robust, this is a real security concern.
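The idea of a small, carefully-crafted perturbation can be shown on a much simpler model than a neural network. The sketch below is a deliberate simplification and an assumption on my part: it uses a single signed-gradient step (in the style of the fast gradient sign method) against a hand-written logistic classifier, not the projected gradient descent attack on a neural network mentioned in the note. The weights, input and `epsilon` are made-up values.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, x):
    """Probability of class 1 under a logistic model."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def sign(v):
    return (v > 0) - (v < 0)


def adversarial_step(weights, bias, x, true_label, epsilon):
    """Move each input feature by at most epsilon to increase the loss.

    For a logistic model with cross-entropy loss, the gradient of the loss
    with respect to the input is (p - y) * w.
    """
    p = predict_proba(weights, bias, x)
    grad = [(p - true_label) * w for w in weights]
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad)]


# Hypothetical model and input.
weights, bias = [2.0, -1.5, 0.5], -0.2
x = [0.3, 0.1, 0.4]
epsilon = 0.2

x_adv = adversarial_step(weights, bias, x, true_label=1, epsilon=epsilon)
print("original   :", round(predict_proba(weights, bias, x), 3))
print("adversarial:", round(predict_proba(weights, bias, x_adv), 3))
```

In this toy setup the predicted class flips even though no feature moved by more than `epsilon`, and the attacker needed the model's weights to do it, which is the situation the Kerckhoffs-Shannon principle says to plan for.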
  • #19: In the GIF: Tesla Model S adaptive cruise control, 1 second before it crashes into a van parked on the roadside (May 2016).
  • #20: KPI: Key Performance Indicator
  • #21: Generative Adversarial Networks (GANs) are sometimes used to fool the original network >> they can also be used to understand how the neural network works.
  • #25: Original map (interactive): http://scikit-learn.org/stable/tutorial/machine_learning_map/
  • #26: Types of data, on 2 axes: Is it qualitative (e.g. a questionnaire) or quantitative (e.g. sales transaction logs)? Is it our own data (e.g. logs) or 3rd-party data (e.g. a Wikipedia dataset)?