Applied data science in the industry:
How to build a data science project in a
corporate setting
BEST PRACTICES AND A REAL-WORLD EXAMPLE
Soraya Yama
Wednesday, June 26, 2019
WIMLDS Montreal #3: Business & AI
How to guarantee the success of your
data science project in industry?
Challenges and solutions when building data science projects in industry or in a corporate
environment
• How to generate insights for better business decision making, which is what drives data science projects?
• How to work side by side with the business?
• How to build a reliable and understandable analysis flow/solution/product?
• How to properly communicate results and key elements?
Data science in industry vs in research

Industry: Faster pace than academia, with quick iterations.
Research: Experiments are easier in a lab.

Industry: If an analysis does not produce results quickly, drop it and/or redesign it.
Research: Follow best practices to get approvals after peer reviews.

Industry: Simple solutions are preferred over novel complex ones, which are hard to understand and hard to trust.
Research: Let's go for the fancy cool new algorithms!!!

Industry: Limited time and resources, so you need to balance research excellence with business needs.
Research: Research is expected to take a lot of time.

Industry: Not everyone you work with understands data science, so you need to convince decision makers to use the insights to drive decisions.
Research: Peers understand data science and the importance of research.

Industry: The team might not be data-driven or analytics-minded.
Research: You will most likely have more than one analyst in the team.

Industry: Explain statistical concepts in layman's terms.
Research: Your peers are more likely to understand the statistical jargon you use.

Industry: You won't do data science only; you might need to learn new skills (data engineering, a new programming language, new packages, etc.).
Research: It is less likely that you do data engineering or architecture while being a data scientist.

Industry: Rejecting a hypothesis is equally interesting.
Research: Rejecting a hypothesis can be looked at as a failure.
Focus on industry-specific projects
Challenges faced
1. Sometimes problems are not well defined
2. Sometimes data is not available or not in a usable format
3. Sometimes tools or data analysis platforms are not available
4. Which models to use? Which algorithms are more suitable for the analysis and the
infrastructure?
5. Sometimes clients or business lines will not understand your analysis or the methods used
6. How to build your data science flow and what to avoid?
7. How to present results in a way business stakeholders understand?
1. Sometimes problems are not well
defined
Data science is a science, therefore it follows the scientific method
In the scientific method, the process starts with a question to be asked or a
problem to be identified
In data science, the process also starts with a problem to solve
This requires a proper understanding of the business context
Sometimes sitting with the business and helping to formalize the problem is key
How to build a data science project in a corporate setting, by Soraya Christina, Senior Data Scientist at Morgan Stanley
2. Sometimes data is not available or not
in a usable format
Which data sources to use?
◦ data lake, data warehouse, database, raw data to be imported such as images, sound files, spreadsheets or flat files
How to collect the data?
◦ import data, create a data pipeline
Who to work with for the data acquisition?
◦ data engineers, database system managers etc.
How to convince teams you need this data?
◦ explain the use case, have your manager support you
How to maintain this new data acquisition?
◦ is it a one-off data acquisition or a recurrent feed?
Where to store the data?
◦ big data storage, file system, cluster like Hadoop?
If it’s a data stream, how to build it?
◦ Kafka, AWS, Flume etc.
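As a small illustration of the "import data" step for a flat file, the sketch below reads a CSV, checks that the expected columns exist, and sets aside rows that cannot be converted instead of failing silently. The column names, the `SCHEMA` mapping and the sample rows are hypothetical and only serve the example; a real feed would come from one of the sources listed above.

```python
import csv
import io

# Hypothetical schema for illustration: column name -> converter.
SCHEMA = {"system_id": str, "cpu_load": float, "error_count": int}


def load_rows(handle, schema=SCHEMA):
    """Parse a CSV stream and return (clean_rows, rejected_rows)."""
    reader = csv.DictReader(handle)
    missing = set(schema) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    clean, rejected = [], []
    for line_no, raw in enumerate(reader, start=2):  # line 1 is the header
        try:
            clean.append({col: cast(raw[col]) for col, cast in schema.items()})
        except (TypeError, ValueError):
            # Keep the line number so the data owner can be told what to fix.
            rejected.append((line_no, raw))
    return clean, rejected


# Made-up sample data; the second data row has an unusable cpu_load value.
SAMPLE = """system_id,cpu_load,error_count
app-01,0.42,3
app-02,not_a_number,1
app-03,0.91,0
"""

rows, bad = load_rows(io.StringIO(SAMPLE))
print(len(rows), "usable rows;", len(bad), "rejected")
```

Returning the rejected rows alongside the clean ones makes it easier to go back to the data engineers or database managers with concrete examples of what is not in a usable format.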
3. Sometimes tools or data analysis
platforms are not available
• Identify which tools or platforms are well suited to the problem and which ones are
available or easy to get
• Request them / install them
• Work on the data infrastructure
Questions to ask
E.g.
• Can I solve this specific use case using a Python script in an IDE?
• Am I looking at big data, in which case I might need a distributed system like Spark?
• Shall I store the data in a filesystem or on HDFS?
• The team is using R, but can I productionize a script written in R?
• There is a vendor product I am asked to use, but is it suitable for the use case?
4. Which models to use? Which algorithms
are more suitable for the analysis and the
infrastructure?
• KNN is a weak learner
• Decision trees work best for detecting non-linear interactions (so they should not be used for time series)
• Random forests can work with large labelled or unlabelled data sets
• Ordinary Least Squares should not be used as-is on a high-dimensional data set (number of variables > number of observations): the fit is not unique, so prefer a regularized regression such as ridge or lasso
• Stratified sampling is better than simple random sampling for classification problems, especially when classes are imbalanced
• Etc.
• Ask yourself the right questions before jumping ahead and using the fanciest model you
can think of. Business might not understand it.
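To make the stratified sampling point concrete, here is a minimal sketch in plain Python that splits a data set so each class keeps roughly the same share in the train and test parts. The function name `stratified_split`, the 95/5 toy labels and the 20% test fraction are assumptions made for the example, not something taken from the slides.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), sampling each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # At least one test example per class, otherwise proportional.
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy imbalanced data set (made up): 95 negatives, 5 positives.
labels = [0] * 95 + [1] * 5
train_idx, test_idx = stratified_split(labels, test_fraction=0.2, seed=42)
test_counts = Counter(labels[i] for i in test_idx)
print(test_counts)
```

With these toy labels the test part always holds 19 negatives and 1 positive, whereas a simple random 20% sample could easily contain no positive at all.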
5. Sometimes clients or business lines will not
understand your analysis or the methods used
Start small: use a data sample to build your case
Build a prototype (proof of concept) and show them how they can leverage data analysis
Do not use statistical jargon; use layman's terms to communicate your idea
Sell your idea
Make it simple enough to understand, efficient enough to implement, interesting enough to use
Real-world example
Signal analysis followed by a stock price behaviour prediction using a convolutional neural
network
Data points to be investigated will be labelled 1.
All other cases will be labelled 0.
Detecting ratio anomalies using a traditional statistical detection method and
Isolation Forest (a tree-based method for anomaly detection)
Cons (-):
Processing time is very long, especially with millions of rows, so the data needs to be distributed
Isolation Forest exists in scikit-learn (sklearn), but has not yet been fully implemented in MLlib
Pros (+):
Isolation Forest is efficient when handling big data
Very accurate detection compared to traditional methods
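The slides do not say which traditional statistical method was used. As one possible baseline, the sketch below flags ratios with a modified z-score built on the median and the median absolute deviation (MAD), and borrows the 1/0 labelling convention from the previous example. The function name, the sample ratios and the 3.5 cut-off are assumptions for illustration; this is not Isolation Forest and not the author's implementation.

```python
from statistics import median


def label_anomalies(ratios, threshold=3.5):
    """Label each ratio 1 (to investigate) or 0, using a modified z-score."""
    center = median(ratios)
    mad = median(abs(r - center) for r in ratios)
    if mad == 0:
        # Simplification: with no spread around the median, flag nothing.
        return [0] * len(ratios)
    # 0.6745 rescales the MAD so scores are comparable to standard z-scores.
    return [1 if 0.6745 * abs(r - center) / mad > threshold else 0
            for r in ratios]


# Made-up ratios: most sit near 1.0, two are clearly off.
ratios = [1.02, 0.98, 1.01, 0.99, 1.00, 1.03, 0.97, 4.80, 1.01, 0.10]
labels = label_anomalies(ratios)
print(labels)
```

On this sample only 4.80 and 0.10 are labelled 1. A median-based score is used here because a mean and standard deviation would themselves be pulled by the very outliers being searched for.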
6. How to build your data science flow
and what to avoid?
Your analysis code has to be understandable and reproducible (structured and testable)
If you are using a data analysis flow, your flow has to be structured
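One way to read "structured and testable" is sketched below: each step of the flow is a small function with a single job, the random seed is an explicit parameter so runs can be repeated, and a couple of plain assertions act as tests. The step names (`load`, `clean`, `summarize`) and the synthetic data are hypothetical.

```python
import random
from statistics import mean


def load(seed, n=100):
    """Stand-in for a data load: seeded so every run sees the same values."""
    rng = random.Random(seed)
    return [rng.gauss(50.0, 10.0) for _ in range(n)]


def clean(values, low=0.0, high=100.0):
    """Keep only values inside the plausible range."""
    return [v for v in values if low <= v <= high]


def summarize(values):
    """Reduce the cleaned data to the figures that get reported."""
    return {"n": len(values), "mean": round(mean(values), 2)}


def run_flow(seed=2019):
    """The whole flow as one explicit chain of small steps."""
    return summarize(clean(load(seed)))


def test_clean_drops_out_of_range():
    assert clean([-5.0, 10.0, 250.0]) == [10.0]


def test_flow_is_reproducible():
    assert run_flow(seed=1) == run_flow(seed=1)


test_clean_drops_out_of_range()
test_flow_is_reproducible()
print(run_flow())
```

Because every step takes its inputs as arguments and returns a value, any of them can be swapped or tested in isolation without rerunning the whole analysis.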
7. How to present results in a way
business stakeholders understand?
Making complex concepts easy for business lines to understand
• Sometimes a graph is worth a thousand words
• Reports or dashboards have to be clear, ideally with one insight per view (do not overload the
page)
• Show the results in a way that makes them easy to interpret
A real-world example of a real-time
failure prediction using Spark
System failure real-time predictions using:
• Source system metrics
• Kafka for data streaming
• Spark for the predictions
• HDFS to store data
• JavaScript/jQuery or a vendor product for the front end
[Architecture diagram with components: Source systems, Kafka Stream, Spark Streaming, Spark ML, Front End, HDFS]
Offline training – Online testing using
Spark
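The offline training / online testing idea can be shown without a cluster. In the plain-Python sketch below, a very simple model (mean and standard deviation of a metric) is fitted once on historical data, then applied to events one at a time as they arrive. The generator stands in for a Kafka topic and the threshold model stands in for a Spark ML model; the metric name `cpu_load`, the sample values and the 3-sigma rule are all assumptions made for illustration.

```python
from statistics import mean, stdev


def train_offline(history):
    """Offline step: fit a very simple model on past metric values."""
    return {"mean": mean(history), "std": stdev(history)}


def score_online(model, stream, n_sigmas=3.0):
    """Online step: score each incoming event as it arrives."""
    for event in stream:
        distance = abs(event["cpu_load"] - model["mean"])
        at_risk = distance > n_sigmas * model["std"]
        yield {**event, "failure_risk": int(at_risk)}


# Hypothetical historical metrics used for the offline fit.
history = [0.40, 0.42, 0.38, 0.41, 0.39, 0.43, 0.40, 0.37]
model = train_offline(history)


def metric_stream():
    """Stand-in for a streaming source: yields events lazily."""
    yield {"system_id": "app-01", "cpu_load": 0.41}
    yield {"system_id": "app-02", "cpu_load": 0.97}
    yield {"system_id": "app-01", "cpu_load": 0.39}


scored = list(score_online(model, metric_stream()))
for row in scored:
    print(row)
```

Only the second event is flagged here. Keeping the fitted model as a separate object from the scoring loop mirrors the split between a batch training job and a streaming job that only reads the model.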
APPENDIX
Data science tools magic quadrant January 2019
