AC&AI EMEA Masterclass
Data Science 101
Wednesday 2nd December 2020
Ben Keen
Shahzia Holtom
Introductions
aka.ms/benkeen
AGENDA
What is a Data Scientist?
Data
AI Ethics & Responsibility
MLOps
What is a Data Scientist?
Data Impact
What makes a Data Scientist?
Scientist
Scientist
Strong understanding of scientific method & hypothesis testing
Asks clarifying questions and remains sceptical and objective
Strong critical thinking, root cause analysis, and research skills
Bases decisions on data and statistical analysis
What makes a Data Scientist?
Scientist
Engineer
Statistics
Machine Learning
Data Storage
Visualisation
Optimisation
Data Processing
Data Manipulation
Programming
Data Lakes
Azure Storage
SQL Server
MySQL
PostgreSQL
Oracle DB
Azure Data Warehouse
HDFS
MongoDB
Neo4j
Azure Cosmos DB
Cassandra
Word2Vec
SQLite
Spark/Databricks
Azure Data Factory
Airflow
Kubernetes
Azure Event Hub
Azure Service Bus
Kafka
Hadoop
Logstash/Elasticsearch
NiFi
Docker
Swarm
Python
statsmodels
scipy
Scikit-learn
PyTorch
spark.ml
SAS
R
ggplot2
TensorFlow
Keras
Scala
Perl
MATLAB
Node.js
M
VBA
JavaScript
Julia
Jupyter
Weka
Azure Machine Learning
MLFlow
SPSS
Bayesian Statistics
ONNX
XGBoost
Continuous Distributions
PMCC/Spearman’s Rank
Monte Carlo Methods
χ2
Probability Theory
Skewness/Kurtosis
Hypothesis Testing
Covariance
matplotlib
Power BI
D3.js
Highcharts
plotly
sankeymatic
Tableau
seaborn
Bokeh
React-vis
Dash
CanvasJS
Chart.js
Excel
ISOMAP
PIL
ScraPy / BS4
LibROSA
Flink
lifetimes
Bonsai
dplyr
NumPy
pandas
Powershell
Bash
NLTK
spaCy
OpenCV
Gensim
Azure Cognitive Services
pytz
Dijkstra
Gradient Descent
Ant Colony Optimisation
Particle Swarm Optimisation
Evolutionary Algorithms
Mixed-integer linear programming
Differential Calculus
Simulated Annealing
Least Squares
DAX
Tools for the job
Artificial Intelligence
Machine Learning
Deep Learning
Artificial Intelligence
The ability for machines to mimic human behaviours. See “Computing Machinery and Intelligence”, Turing, 1950.
Machine Learning
The application of mathematical and statistical techniques that learn parameters from data rather than being explicitly programmed.
Deep Learning
Subset of machine learning in which neural networks with many layers are used to learn highly non-linear relationships from large amounts of training data.
What makes a Data Scientist?
Scientist
Engineer
Business Analyst
A Simple Example
38 2
4 556
A Simple Example
Business Context: Machinery failure costs £500,000 but maintenance costs £1,000
Total Cost: £1,004,000
38 2
4 556
Yes
No
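A Simple Example
[Figure: samples plotted in predicted “Yes” and “No” regions; the “No” region is marked “Expensive”]
A Simple Example
[Figure: samples plotted in predicted “Yes” and “No” regions]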
A Simple Example
Yes
No
New Total Cost: £60,000
New Accuracy = 90%
40 0
60 490
Another Simple Example
What makes a Data Scientist?
Scientist
Engineer
Business Analyst
Types of Data Scientist
ML Engineer
• Operationalisation of models
• Focus on MLOps, Automated Tests, CI/CD, ETL
Applied DS
• Focus on A/B Testing, Modelling and Experimentation
• View to contributing to a product
• Uses Tried & Tested Techniques
Research DS
• Experimentation with view to expand community knowledge and understanding of algorithms
• Uses novel techniques
Full Stack DS
• Generalist
• Works across modelling, ETL, operationalisation and app development
• May be less focused on depth of modelling understanding
Data Vis. Expert
• Focus on storytelling with data
• Wizard with graphing libraries, including D3.js
Data
Data
What data do you need?
Data
How much data do you need?
Data
How much data do you need?
How do we know this is a cat?
We have 140 million neurons in V1
And we have V2, V3, V4, V5 and V6
Data
How much data do you need?
88 239 33 178 38 122
208 115 215 36 119 203
229 65 52 64 4 23
92 114 26 29 155 183
101 142 222 54 187 109
45 6 95 67 35 212
93 103 142 57 207 117
174 228 201 24 101 176
100 9 141 241 144 37
8 34 198 125 138 246
178 126 255 108 161 128
How do you get a computer to recognise this
as a cat
Data
How much data do you need?
? ?
?
?
?
Data
How much data do you need?
Garbage in…
…Garbage out
Data
How much data do you need?
Horses
Goats
?
Data
How much data do you need?
? ?
Data
How much data do you need?
? ?
Bias – AI ethics and responsibility
MLOps
Value realization is only possible through Continuous Delivery
Data Science solutions need to be integrated with People, Process and Products
Pilot
PoC
Experiment
PoV
MVP
Traditional DS Delivery
Data Science: “I have a model for you…”
Ops: “How do I deploy, manage, monitor…”
Wall Of Confusion
Data Drift
Model Decay
Stale Models
Concept Drift
What is DevOps?
“DevOps is the union of people, process, and products to enable continuous delivery of value.”
Continuous Delivery: Plan & Track, Develop, Build & Test, Deploy, Operate, Monitor & Learn
People
• Collaborate early and often
• Cross-disciplinary teams
• Share common goals and metrics
• Shared responsibility
Process
• Agile Principles
• Streamline feedback
• Delivering value faster
Products
MLOps
The ability to continuously integrate, automatically test, build, deploy and monitor Machine Learning artifacts such as Data & Training pipelines and models.
Data Science is a Team Effort
Architects
Change Management
Data Engineers
Data Scientists
Project Management
App Developers
UX Designers
Conclusions
• Data Scientists are Scientists, Engineers and Business Analysts
• We work best with data of high volume, veracity and variety
• We need to keep in mind ethical considerations and act responsibly when designing systems
• MLOps is paramount for delivering customer value
• Data Science is a team effort
• Data Science is about turning data into impact
Q&A
Thank you

Data science 101 Masterclass


Editor's Notes

  • #3: aka.ms/benkeen is a short URL to my LinkedIn profile
  • #4: So today we’re going to cover a range of topics in our Data Science 101 Masterclass. We’ll start with what a data scientist is, what the job entails and what should be expected of a data scientist. Then we’ll talk about data – The Economist described data as “The New Oil” in 2017 and that’s up for some debate, I’m not so sure I agree, but here we’ll be taking a look at a couple of the questions about data that data scientists face most commonly. We’ll touch a bit on AI Ethics and Responsibility which, of course, could be an entire masterclass in itself. Finally we’ll cover an important emerging topic in data science – MLOps for operationalising data science.
  • #5: Data science is not about making awesome visualisations, complicated models, or writing lots of code. Data scientists turn data into impact. The job is to solve real problems using data.
  • #6: First and foremost, data scientists are scientists – It’s right there in the name. What does this mean? [See next slide]
  • #7: I don’t come from a mathematics or computer science background myself - my PhD is in molecular genetics – but the skills required of a scientist are the same, whether I’m creating a predictive model as a data scientist now or I’m doing X-ray crystallography on proteins as a biochemist in my old life. [Read Slide Text]
  • #8: Data Scientists are also engineers – they design systems to fulfil functional objectives. In the next couple of slides, we’ll explore the tools data scientists use to do this.
  • #9: These are the tools for the job – just as with any other type of engineer, good data scientists will know when to use which tools for which tasks and not shoehorn things in just because they like them. These are the tools data scientists use to create the systems that fulfil those functional objectives. As a side note, this is not an exhaustive list. No single data scientist knows all of this in-depth and we’ll come back to that in a little bit (See types of data scientist). In the next slide I’m going to focus a little more down into this machine learning section as it’s probably the one that’s most associated with data science.
  • #10: Artificial intelligence is the ability for machines to mimic human behaviours – including things like recognition, behaviour, reasoning etc. Machine learning is a subset of artificial intelligence: it is the application of maths and stats techniques that learn parameters from data. Artificial intelligence is not just machine learning – there are also rules-based systems, in which knowledge is explicitly coded from rules understood by experts. These can be powerful, as in classical computer vision and NLP syntax and semantics, and can often be combined with machine learning techniques. Deep learning is a subset of machine learning, in which we use neural nets with many hidden layers to learn highly non-linear relationships. Deep learning is often used for things like computer vision, natural language processing, speech/sound recognition, bioinformatics (genetic data) and time series. Just as machine learning is not all of AI, deep learning is not all of machine learning. For work where data is tabular and there are fewer samples, deep learning is often not the best modelling technique, and you’ll find a lot of Kaggle competition winners use other techniques such as gradient boosted decision trees. And sometimes, where relationships are less complex, a simple linear or logistic regression will do just as well. Again, to reiterate, this is just one tool in the data science job.
  • #11: Finally a data scientist is a business analyst. Over the next few slides we’ll see why this is so important to the role of a data scientist.
  • #12: Let’s take a simple example – We’ve been brought in to a predictive maintenance engagement and have been given data to go and train a model with no real context. We go away and train a classifier model that’s 99% accurate – happy days, it all looks good, we deploy it but the customer is not happy.
  • #13: Given some business context, we find out that machinery failure costs £500,000 but maintenance costs £1,000. We predicted 4 cases in which maintenance was required when it wasn’t – that’s a cost of £4,000. However, we also predicted 2 cases in which maintenance wasn’t required when it actually was, and those machines failed – that’s a cost of £1,000,000. So now let’s take a look at how a data scientist could have dealt with this given this business context.
  • #14: Our prediction is yes the machine needs maintenance in the upper light blue rectangle and no it doesn’t need maintenance in the lower light red rectangle. The shape of the markers indicates whether the machine actually needs maintenance or not, blue circles indicate that actually the machine does need maintenance and red crosses mean it doesn’t. (Transition 1) Predicting “No” but actually needing maintenance is *very* expensive.
  • #15: We have 2 options. Option 1 – We can change our model or parameters and re-train. In this example the sigmoid shown is shifted to the left, which moves a number of middling values up into the “yes” rectangle. Option 2 – We can change our threshold, or “decision boundary”. Before, we had our decision boundary at ~50%; shifting it down also moves those middling values into the “yes” rectangle. Doing either of these has the effect of removing our false negatives but introducing more false positives.
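Option 2 can be illustrated with a minimal sketch. The sensor readings, labels and the two thresholds below are hypothetical values chosen for illustration, not data from the talk; the point is only that lowering the threshold on the sigmoid output trades false negatives for false positives.

```python
import math


def sigmoid(z):
    """Map a raw score onto a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical (raw_score, needs_maintenance) pairs; 1 means maintenance was needed.
samples = [(-3.0, 0), (-2.0, 0), (-1.0, 0), (-0.5, 1),
           (-0.2, 0), (0.3, 1), (1.0, 1), (2.5, 1)]


def count_errors(threshold):
    """Return (false_negatives, false_positives) at the given decision threshold."""
    false_negatives = 0
    false_positives = 0
    for score, label in samples:
        predicted = 1 if sigmoid(score) >= threshold else 0
        if predicted == 0 and label == 1:
            false_negatives += 1
        elif predicted == 1 and label == 0:
            false_positives += 1
    return false_negatives, false_positives


for threshold in (0.5, 0.3):
    fn, fp = count_errors(threshold)
    print(f"threshold={threshold}: false negatives={fn}, false positives={fp}")
```

With these made-up samples, moving the threshold from 0.5 to 0.3 removes the single missed machine at the price of one unnecessary maintenance visit.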
  • #16: Making sure we don’t miss required maintenance here has reduced the cost by £944,000, from £1,004,000 to £60,000. This is a simple and extreme example but the point stands that data scientists need to tie technical decision making with business value understanding.
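The cost arithmetic in this example can be checked with a short sketch. It assumes, as the talk does, that only misclassifications are costed, and it reads the two matrices from the slides as [[true positives, false negatives], [false positives, true negatives]], a layout inferred from the speaker notes rather than stated on the slides.

```python
# Costs quoted in the business context.
FAILURE_COST = 500_000      # a missed failure (false negative)
MAINTENANCE_COST = 1_000    # an unnecessary maintenance visit (false positive)


def error_cost(matrix):
    """Total cost of the false negatives and false positives in a 2x2 matrix."""
    (_tp, fn), (fp, _tn) = matrix
    return fn * FAILURE_COST + fp * MAINTENANCE_COST


def accuracy(matrix):
    """Fraction of all predictions that were correct."""
    (tp, fn), (fp, tn) = matrix
    return (tp + tn) / (tp + fn + fp + tn)


# Matrices as printed on the slides (layout assumed, see above).
original = [[38, 2], [4, 556]]
adjusted = [[40, 0], [60, 490]]

for name, matrix in (("original", original), ("adjusted", adjusted)):
    print(f"{name}: accuracy={accuracy(matrix):.1%}, "
          f"error cost=£{error_cost(matrix):,}")

print(f"saving: £{error_cost(original) - error_cost(adjusted):,}")
```

The less accurate model is the cheaper one by a wide margin, which is the reason accuracy alone is a poor objective here.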
  • #17: The previous example showed a classification, but how about a regression, where we’re predicting continuous variables and want to look at predictive maintenance from the perspective of remaining useful life? Let’s look at a simple linear regression. Say we want to look at the performance of machines over time and we are just given time and performance metrics. We come to the conclusion that the machines’ performance improves over time. Again our customer is not so happy. (Transition 1) Now we do some business analysis, and find out that there are actually different groups of machines represented within this data and our original conclusion was wrong. Taken together, these examples show that understanding the business context in which the data resides is incredibly important to doing data science, and without it we can cause more harm than good. It’s so important in data science projects to have data scientists engaged early on, so that things like this don’t happen and we don’t lose trust.
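The reversal described in this note can be reproduced with a small sketch. The two machine groups and their readings are hypothetical numbers picked to show the effect; the slope function is an ordinary least-squares fit written out by hand so the example has no dependencies.

```python
def slope(points):
    """Ordinary least-squares slope of y against x for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx


# Hypothetical (time, performance) readings for two groups of machines.
group_a = [(1, 10), (2, 9), (3, 8)]     # older model, lower baseline
group_b = [(6, 20), (7, 19), (8, 18)]   # newer model, higher baseline

pooled = group_a + group_b

print(f"pooled slope:  {slope(pooled):+.2f}")   # positive: looks like improvement
print(f"group A slope: {slope(group_a):+.2f}")  # negative: performance degrades
print(f"group B slope: {slope(group_b):+.2f}")  # negative: performance degrades
```

Fitting one line through all the points suggests performance improves over time, while each group on its own degrades; only the group label, which comes from business analysis, reveals this.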
  • #18: So a data scientist is not just some subset of these roles, a data scientist encompasses all 3 roles.
  • #19: All of these people are data scientists and all of them will have significant overlap in skills with the others.
  • #20: Now that we know who our data scientist is, what their tools are and what their objective is – let’s talk about data. No “Data Science 101” talk would be complete without a discussion about data. These are possibly the 2 questions we get asked most: What data do you need? How much data do you need? As consultants, our diplomatic answer is always – it depends.
  • #21: This question – what data do you need? – is highly use case dependent and requires BA Workshops. Cast your mind back a few slides, to where the regression was positive without information on the machine groups and became negative once we had it. Without business analysis we wouldn’t know that we needed that group data. This is why it’s so important to have data scientists on board early in the engagement, so we can cover off some of these requirements. Keeping in mind that the machine learning aim is to mimic human behaviour, if something is tough for an SME to identify or predict, it is most likely hard to train an ML model to do it. There are, of course, as with everything, exceptions to this rule. For those exceptions we should ask: is this consulting, or is this a research task that we should be taking on?
  • #22: How much data do you need? Commonly people will just think about volume, they want a number. I can get you 100 data points and that will all be fine. However, we need to get people out of that thinking – we need to combine volume, with veracity (reliability) and variety. The data sample we’re given needs to be representative of the population and only with all 3 will we get this.
  • #23: Let’s take a computer vision problem – the computer vision examples almost always have cats. How do we know this is a cat? We have 140 million interconnected neurons in our primary visual cortex, with 10s of billions of connections between them. That’s just V1; we have V2, V3, V4 and V5 for fine tuning. We are very good at learning pattern recognition as a result.
  • #24: This is an easy task for us but incredibly difficult for a computer. Computers just see pixel values. We can’t just write a program to recognise cats – there are too many options to encode and too many edge cases. So we use machine learning to recognise patterns. I won’t go into the details of convolutional neural networks but, as you can imagine, the transformation to go from these numbers here to the label of a cat is highly non-linear and complex. There’s no y = mx + c linear regression here. This kind of pattern recognition is going to need many, many examples to determine how to get from a picture of a cat to a label of cat. The InceptionV3 image recognition model from Google has just shy of 24 million parameters it needs to get right.
  • #25: Image data might need a more complex transformation, but if we have tabular data we still need enough of it to reason about distributions. Let’s say we’re trying to classify our blue circles and red crosses, and we have a new sample – the green question mark. (Transition 1) This data alone could be a sample of any number of distributions. Any of these decision boundaries might be valid – we need more volume to get a better idea of the distribution. (Transition 2) When we have more data, all of these distributions include those first 6 points but we can see how wildly different they are: our top example would classify this as a red cross but the next 2 would be blue circles.
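A minimal sketch of the point about volume: with only six points, two quite different decision boundaries can both fit the training data perfectly and still disagree on a new sample. The coordinates and the two hand-written boundaries are hypothetical, chosen only to make the disagreement visible.

```python
# Six hypothetical training points: (x, y, label).
training = [
    (1.0, 1.0, "circle"), (2.0, 2.0, "circle"), (1.5, 2.5, "circle"),
    (-1.0, -1.0, "cross"), (-2.0, -1.5, "cross"), (-1.5, -2.0, "cross"),
]


def vertical_boundary(x, y):
    """Classify by which side of the line x = 0 the point falls on."""
    return "circle" if x > 0 else "cross"


def horizontal_boundary(x, y):
    """Classify by which side of the line y = 0 the point falls on."""
    return "circle" if y > 0 else "cross"


def training_accuracy(classifier):
    """Fraction of the six training points the classifier labels correctly."""
    correct = sum(classifier(x, y) == label for x, y, label in training)
    return correct / len(training)


new_sample = (2.0, -2.0)  # the "green question mark"

for classifier in (vertical_boundary, horizontal_boundary):
    print(f"{classifier.__name__}: "
          f"training accuracy={training_accuracy(classifier):.0%}, "
          f"new sample -> {classifier(*new_sample)}")
```

Both boundaries score 100% on the training points, so the training data alone cannot say which one is right; more data is what narrows the choice.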
  • #26: If you feed a model with incorrect or unreliable data, the results you get will be unreliable or incorrect. Depending on the accuracy you need, for each incorrect data point you feed in, you might need 5, 10, 50, 100… similar correct examples to drown the effects of it out.
  • #27: Let’s train a model on these pictures of goats and these pictures of horses (I didn’t use cats!) (Transition 1) Now, given this picture of a horse, what do we think our model would predict? It’s never seen a side profile of a horse – it’s got 4 legs and a relatively rectangular body – it’s a goat. We could have hundreds of thousands of images like this, but we will still predict this is a goat. You may also have heard about the model that distinguished dogs from wolves not by the animal itself but by the background: if there was snow, it was a wolf. Sample training data needs to be representative of the potential scoring population. If you’re doing crack detection, for example, it’s better to have a thousand different images of cracks than a thousand images of the same crack. There is another conversation on how we go about getting this labelled data but it’s perhaps a topic for another day.
  • #28: This isn’t just true of image data – let’s take a look at some more tabular data. We have two classes of data – represented by blue circles and red crosses. We have a new sample represented by a green question mark and we want to know how to classify this. (Transition 1) If we have a lot of data but all our data is from the same two clusters of data – we may get a separation that looks something like this – this is something an SVC might do to classify these. But notice that the green question mark is not near either of these clusters – we’ve trained a model the best we can based on the knowledge we have of this training data but this training data is clearly not representative of the population.
  • #29: Now we have sparser data: there are far fewer points, and they’re actually just a different sample from the same population, but this data is a better representation of the population. (Transition 1) Now we’re a little more certain about where this green question mark should be placed. Although previously it would have been in the blue area, now it’s on the red side of our decision boundary. So variety of data is just as important as volume of data, if not more so.
  • #30: As part of this discussion around the types of data required of data scientists and why our samples need to be representative of the population, we also need to talk about AI ethics and responsibility, which will ultimately fall on the data scientists that design this system. Bias has different meanings in ML and stats – but here I’m talking about bias in which a model is skewed based on the data it is fed relative to accepted legal or moral principles. If you train a model to determine best candidates based on your historical successful hires, but all you’ve hired in the past is men, that model is going to be skewed to hiring men. This is an example where your training sample isn’t necessarily representative of the population. Similarly if you train a model mostly on data you’ve collected from white men, it’s going to perform better on white men. Again, you need a representative sample of the population. The Facebook example here is similar but also highlights another key issue. Even if you remove the actual labels that indicate protected classes like race, sex and religion, we need to consider proxies that might indicate these classes – hormone levels in medical records, certain words in CVs, sports activities or post codes. There are a number of techniques for reducing this kind of bias in models – including pre-processing, in-processing and post-processing algorithms – and class balancing algorithms like SMOTE can help augment under-represented classes. Ultimately, though, I think the best way for data scientists to tackle this kind of bias is to have a good domain understanding and a good understanding of the data they are using, to try and ensure they are not discriminating against any class.
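Before reaching for any of the mitigation techniques mentioned in this note, it helps to measure whether a model treats groups differently. The sketch below compares selection rates between groups, a check sometimes called the demographic parity difference. It is one simple audit under stated assumptions, not a technique named in the talk: the group labels and decisions are hypothetical, and a real audit would look at more than one metric.

```python
from collections import defaultdict

# Hypothetical (group, model_decision) pairs; 1 means the candidate was shortlisted.
decisions = [
    ("A", 1), ("A", 1), ("A", 0), ("A", 1),
    ("B", 0), ("B", 1), ("B", 0), ("B", 0),
]


def selection_rates(records):
    """Return the fraction of positive decisions for each group."""
    totals = defaultdict(int)
    selected = defaultdict(int)
    for group, decision in records:
        totals[group] += 1
        selected[group] += decision
    return {group: selected[group] / totals[group] for group in totals}


rates = selection_rates(decisions)
gap = max(rates.values()) - min(rates.values())

for group, rate in sorted(rates.items()):
    print(f"group {group}: selection rate {rate:.0%}")
print(f"selection rate gap: {gap:.0%}")
```

A large gap does not prove discrimination on its own, but it flags where domain understanding and a closer look at proxies in the data are needed.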
  • #31: The final thing I want to discuss is MLOps, as it’s one of the most important aspects of a data scientist’s role. Whilst experimentation, proofs of concept and proofs of value are important, value realisation for our customers is only possible through continuous delivery. For successful value realisation, data science solutions need to be integrated with people, process and products.
  • #32: Historically there has been a disconnect between data scientists and other developers: a model is made, perhaps through data science experimentation, and then thrown over a wall to developers to deploy. With this approach, model requirements are often poorly understood by developers and changes can be difficult, resulting in model drift.
  • #33: Modern data science delivery integrates machine learning and DevOps, using tools designed for continuous integration and continuous delivery. Although the goal is to enable each of the technical delivery tasks shown in the top right here, there is a focus on people, process and products in order to make this a success. (Transition 1) People: Project managers, architects, data engineers, data scientists, developers, and testers should all be involved in a use case from an early stage. Data scientists do not have one KPI, such as an RMSE value, while developers have a different goal, such as API response times; the whole team shares common goals. (Transition 2) Process: We follow agile methodology principles, working in short sprints of 2 or 3 weeks and tracking the story points and burndown of each sprint to use in planning the next sprints. We also hold retrospectives to feed back to each other what’s going well, what’s not going well and what actions we can take to maintain velocity. (Transition 3) Products: We should use products that enhance our productivity, not products shoehorned in because they are an individual’s favourite tool. Knowledge sharing through wikis and Teams is encouraged across the team so that everyone can contribute to a range of tasks.
  • #34: MLOps is the integration of machine learning into DevOps processes and, as we see on screen, it is the ability to continuously integrate, automatically test, build, deploy and monitor machine learning artifacts such as data and training pipelines and models. The aim of data science modelling is not necessarily to be right today but to be less wrong each day through an iterative feedback cycle. Monitoring models is therefore of paramount importance, as are the principles of Kaizen (a Japanese term that means “continuous improvement”), which will be familiar to those who have done Scrum or operations management training.
  • #35: Data science isn’t something that happens in a silo. It is a team effort among people who share common goals. Not all projects are going to need all of these resources, but all projects require the concerted effort of a team to make them a success. We want to work with others in order to make sure our engagements are successful.
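A footnote to the slide 30 notes on class balancing: the core idea behind SMOTE-style oversampling can be sketched in a few lines of plain Python. This is a simplified illustration, not the SMOTE reference implementation and not part of the original deck. The function name `smote_like_oversample`, the toy data and the choice of `k` are all assumptions made for the example. Each synthetic point is placed at a random position on the line between a minority-class sample and one of its nearest minority-class neighbours.

```python
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours.

    Simplified sketch of the idea behind SMOTE (hypothetical helper);
    needs at least two minority points.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared distance to the base point.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


# Hypothetical toy data: 8 majority points and 3 minority points.
majority = [(float(x), float(y)) for x in range(4) for y in range(2)]
minority = [(10.0, 10.0), (11.0, 10.0), (10.0, 12.0)]

new_points = smote_like_oversample(minority, n_new=len(majority) - len(minority))
balanced_minority = minority + new_points
print(len(majority), len(balanced_minority))
```

Balancing class counts like this only addresses under-representation in the training sample. It does nothing about proxies for protected classes, which is why the notes still put domain understanding first.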