Data science 101 Masterclass


The document outlines a masterclass on Data Science, covering the roles, skills, and responsibilities of data scientists as well as the importance of MLOps and AI ethics. It emphasizes that data scientists must possess a strong combination of scientific, engineering, and analytical skills while also integrating data science solutions within organizational processes. The session underscores the collaborative nature of data science, advocating for continuous delivery and responsible deployment of machine learning models.

1. AC&AI EMEA Masterclass Data Science 101 Wednesday 2nd December 2020 Ben Keen Shahzia Holtom

2. Introductions aka.ms/benkeen

3. AGENDA What is a Data Scientist? Data AI Ethics & Responsibility MLOps

4. What is a Data Scientist? Data Impact

5. What makes a Data Scientist? Scientist

6. Scientist Strong understanding of scientific method & hypothesis testing Asks clarifying questions and remains sceptical and objective Strong critical thinking, root cause analysis, and research skills Bases decisions on data and statistical analysis

7. What makes a Data Scientist? Scientist Engineer

8. Statistics Machine Learning Data Storage Visualisation Optimisation Data Processing Data Manipulation Programming Data Lakes Azure Storage SQL Server MySQL PostgreSQL Oracle DB Azure Data Warehouse HDFS MongoDB Neo4j Azure Cosmos DB Cassandra Word2Vec SQLite Spark/Databricks Azure Data Factory Airflow Kubernetes Azure Event Hub Azure Service Bus Kafka Hadoop Logstash/Elasticsearch NiFi Docker Swarm Python statsmodels scipy Scikit-learn PyTorch spark.ml SAS R ggplot2 TensorFlow Keras Scala Perl MATLAB Node.js M VBA JavaScript Julia Jupyter Weka Azure Machine Learning MLFlow SPSS Bayesian Statistics ONNX XGBoost Continuous Distributions PMCC/Spearman’s Rank Monte Carlo Methods χ² Probability Theory Skewness/Kurtosis Hypothesis Testing Covariance matplotlib Power BI D3.js Highcharts plotly sankeymatic Tableau seaborn Bokeh React-vis Dash CanvasJS Chart.js Excel ISOMAP PIL ScraPy / BS4 LibROSA Flink lifetimes Bonsai dplyr NumPy pandas Powershell Bash NLTK spaCy OpenCV Gensim Azure Cognitive Services pytz Dijkstra Gradient Descent Ant Colony Optimisation Particle Swarm Optimisation Evolutionary Algorithms Mixed-integer linear programming Differential Calculus Simulated Annealing Least Squares DAX

9. Tools for the job Artificial Intelligence Machine Learning Deep Learning Artificial Intelligence The ability for machines to mimic human behaviours. See “Computing Machinery and Intelligence”, Turing, 1950. Machine Learning The application of mathematical and statistical techniques that learn parameters from data rather than being explicitly programmed. Deep Learning Subset of machine learning in which neural networks with many layers are used to learn highly non-linear relationships from large amounts of training data.

10. What makes a Data Scientist? Scientist Engineer Business Analyst

11. A Simple Example 38 2 4 556

12. A Simple Example Business Context: Machinery failure costs £500,000 but maintenance costs £1,000 Total Cost: £1,004,000 38 2 4 556

13. Yes No A Simple Example Expensive

14. Yes No Yes No A Simple Example Yes No

15. Yes No Yes No A Simple Example Yes No New Total Cost: £60,000 New Accuracy = 90% 40 0 60 490

16. Another Simple Example

17. What makes a Data Scientist? Scientist Engineer Business Analyst

18. Types of Data Scientist ML Engineer • Operationalisation of models • Focus on MLOps, Automated Tests, CI/CD, ETL Applied DS • Focus on A/B Testing, Modelling and Experimentation • View to contributing to a product • Uses Tried & Tested Techniques Research DS • Experimentation with view to expand community knowledge and understanding of algorithms • Uses novel techniques Full Stack DS • Generalist • Works across modelling, ETL, operationalisation and app development • May be less focused on depth of modelling understanding Data Vis. Expert • Focus on storytelling with data • Wizard with graphing libraries, including D3.js

19. Data

20. Data What data do you need?

21. Data How much data do you need?

22. Data How much data do you need? How do we know this is a cat? We have 140 million neurons in V1 And we have V2, V3, V4, V5 and V6

23. Data How much data do you need? 88 239 33 178 38 122 208 115 215 36 119 203 229 65 52 64 4 23 92 114 26 29 155 183 101 142 222 54 187 109 45 6 95 67 35 212 93 103 142 57 207 117 174 228 201 24 101 176 100 9 141 241 144 37 8 34 198 125 138 246 178 126 255 108 161 128 How do you get a computer to recognise this as a cat

24. Data How much data do you need? ? ? ? ? ?

25. Data How much data do you need? Garbage in… …Garbage out

26. Data How much data do you need? Horses Goats ?

27. Data How much data do you need? ? ?

28. Data How much data do you need? ? ?

29. Bias – AI ethics and responsibility

30. Value realization is only possible through Continuous Delivery MLOps Data Science solutions need to be integrated with People, Process and Products Pilot PoC Experiment PoV MVP

31. I have a model for you… How do I deploy, manage, monitor… Wall Of Confusion Data Science Ops Data Drift Model Decay Stale Models Concept Drift Traditional DS Delivery

32. “DevOps is the union of people, process, and products to enable continuous delivery of value.” Build & Test Continuous Delivery Deploy Operate Monitor & Learn Plan & Track Develop People • Collaborate early and often • Cross-disciplinary teams • Share common goals and metrics • Shared responsibility Process • Agile Principles • Streamline feedback • Delivering value faster Products What is DevOps?

33. The ability to continuously integrate, automatically test, build, deploy and monitor Machine Learning artifacts such as Data & Training pipelines and models. MLOps

34. Data Science is a Team Effort Architects Change Management Data Engineers Data Scientists Project Management App Developers UX Designers

35. Conclusions • Data Scientists are Scientists, Engineers and Business Analysts • We work best with data of high volume, veracity and variety • We need to keep in mind ethical considerations and act responsibly when designing systems • MLOps is paramount for delivering customer value • Data Science is a team effort • Data Science is about turning data into impact

36. Q&A

37. Thank you

Editor's Notes

#3: aka.ms/benkeen is a short URL to my LinkedIn profile
#4: So today we’re going to cover a range of topics in our Data Science 101 Masterclass. We’ll start with what a data scientist is, what the job entails and what should be expected of a data scientist. Then we’ll talk about data – The Economist described data as “The New Oil” in 2017 and that’s up for debate (I’m not so sure I agree), but here we’ll be taking a look at a couple of the questions about data that data scientists face most commonly. We’ll touch a bit on AI Ethics and Responsibility which, of course, could be an entire masterclass in itself. Finally we’ll cover an important emerging topic in data science – MLOps for operationalising data science.
#5: Data science is not about making awesome visualisations, complicated models, or writing lots of code. Data scientists turn data into impact. The job is to solve real problems using data.
#6: First and foremost, data scientists are scientists – It’s right there in the name. What does this mean? [See next slide]
#7: I don’t come from a mathematics or computer science background myself - my PhD is in molecular genetics – but the skills required of a scientist are the same, whether I’m creating a predictive model as a data scientist now or I’m doing X-ray crystallography on proteins as a biochemist in my old life. [Read Slide Text]
#8: Data Scientists are also engineers – they design systems to fulfil functional objectives. In the next couple of slides, we’ll explore the tools data scientists use to do this.
#9: These are the tools for the job – just as with any other type of engineer, good data scientists will know when to use which tools for which tasks and not shoehorn things in just because they like them. These are the tools data scientists use to create the systems that fulfil those functional objectives. As a side note, this is not an exhaustive list. No single data scientist knows all of this in-depth and we’ll come back to that in a little bit (See types of data scientist). In the next slide I’m going to focus a little more down into this machine learning section as it’s probably the one that’s most associated with data science.
#10: Artificial intelligence is the ability for machines to mimic human behaviours – including things like recognition, behaviour, reasoning etc. Machine learning is a subset of artificial intelligence. Machine learning is the application of maths and stats techniques that learn parameters from data. Artificial intelligence is not just machine learning – AI also includes rules-based systems, in which knowledge is explicitly coded from rules understood by experts. These can be powerful, as in classical computer vision and NLP syntax and semantics, and can often be combined with machine learning techniques. Deep learning is a subset of machine learning, in which we use neural nets with many hidden layers to learn highly non-linear relationships. Deep learning is often used for things like computer vision, natural language processing, speech/sound recognition, bioinformatics (genetic data) and time series. Just as machine learning is not all of AI, deep learning is not all of machine learning – for work where data is tabular and there are fewer samples, deep learning is often not the best modelling technique, and you’ll find a lot of Kaggle competition winners use other techniques such as gradient boosted decision trees. And sometimes, where relationships are less complex, a simple linear or logistic regression will do just as well. Again – to re-iterate – this is just a tool in the data science job.
#11: Finally a data scientist is a business analyst. Over the next few slides we’ll see why this is so important to the role of a data scientist.
#12: Let’s take a simple example – We’ve been brought in to a predictive maintenance engagement and have been given data to go and train a model with no real context. We go away and train a classifier model that’s 99% accurate – happy days, it all looks good, we deploy it but the customer is not happy.
#13: Given some business context – we find out that machinery failure costs £500,000 but maintenance costs £1,000. We predicted that maintenance was required in 4 cases where it wasn’t – that’s a cost of £4,000. However, we predicted that maintenance wasn’t required in 2 cases where it was, and those 2 failures cost £1,000,000. So now let’s take a look at how a data scientist could have dealt with this given this business context.
#14: Our prediction is yes the machine needs maintenance in the upper light blue rectangle and no it doesn’t need maintenance in the lower light red rectangle. The shape of the markers indicates whether the machine actually needs maintenance or not, blue circles indicate that actually the machine does need maintenance and red crosses mean it doesn’t. (Transition 1) Predicting “No” but actually needing maintenance is *very* expensive.
#15: We have 2 options. Option 1 – We can change our model or parameters and re-train. In this example the sigmoid shown is shifted to the left; this has moved a number of middling values up into the “yes” rectangle. Option 2 – We can change our threshold, or “decision boundary” – before, we had our decision boundary at ~50%; shifting it down also moves those middling values into the “yes” rectangle. Doing either of these has the effect of removing our false negatives but introducing more false positives.
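Option 2 can be sketched in a few lines. This is a minimal illustration only: the scores and labels below are made up for the sketch and are not the data behind the slides, and `count_errors` is a hypothetical helper.

```python
import math


def sigmoid(x):
    """Map a raw model score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical (raw score, machine actually needs maintenance) pairs.
samples = [
    (-3.0, False), (-2.0, False), (-1.0, False), (-0.5, True),
    (-0.2, False), (0.5, True), (2.0, True), (3.0, True),
]


def count_errors(threshold):
    """Return (false negatives, false positives) at a decision threshold."""
    false_negatives = 0
    false_positives = 0
    for score, needs_maintenance in samples:
        predicted_yes = sigmoid(score) >= threshold
        if needs_maintenance and not predicted_yes:
            false_negatives += 1
        elif predicted_yes and not needs_maintenance:
            false_positives += 1
    return false_negatives, false_positives


print(count_errors(0.5))  # one missed failure, no unnecessary maintenance
print(count_errors(0.3))  # no missed failures, one unnecessary maintenance
```

Lowering the threshold trades the expensive error (a missed failure) for the cheap one (an unnecessary maintenance visit), which is the trade the business context favours.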
#16: Making sure we don’t miss required maintenance here has reduced the cost by nearly £1,000,000. This is a simple and extreme example but the point stands that data scientists need to tie technical decision making with business value understanding.
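The arithmetic behind the two totals can be checked with a short sketch. It assumes the four numbers on each slide are read as true positives, false negatives, false positives and true negatives, in that order; the function names are illustrative.

```python
# Costs from the business context in the example (in pounds).
FAILURE_COST = 500_000     # a missed failure (false negative)
MAINTENANCE_COST = 1_000   # an unnecessary maintenance visit (false positive)


def error_cost(false_negatives, false_positives):
    """Cost of the model's mistakes only, as in the slides."""
    return false_negatives * FAILURE_COST + false_positives * MAINTENANCE_COST


def accuracy(tp, fn, fp, tn):
    """Share of all predictions that were correct."""
    return (tp + tn) / (tp + fn + fp + tn)


# Counts as shown on the slides; the (tp, fn, fp, tn) reading is an assumption.
original = dict(tp=38, fn=2, fp=4, tn=556)
adjusted = dict(tp=40, fn=0, fp=60, tn=490)

original_cost = error_cost(original["fn"], original["fp"])
adjusted_cost = error_cost(adjusted["fn"], adjusted["fp"])

print(original_cost, accuracy(**original))
print(adjusted_cost, accuracy(**adjusted))
print(original_cost - adjusted_cost)
```

The second model is the less accurate of the two, yet its mistakes cost far less, which is why accuracy alone is a poor objective here.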
#17: The previous example showed a classification – but how about a regression, where we’re predicting continuous variables and want to look at predictive maintenance from a perspective of remaining useful life? Let’s look at a simple linear regression. Say we want to look at the performance of machines over time and we are just given time and performance metrics. We come to the conclusion that the machines’ performance improves over time. Again our customer is not so happy. (Transition 1) Now we do some business analysis, and find out that actually there are different groups of machines represented within this data and our original conclusion was wrong. Taken together, this shows that understanding the business context in which the data resides is incredibly important to doing data science, and without it we can cause more harm than good. It’s so important in data science projects to have data scientists engaged early on, so that things like this don’t happen and we don’t lose trust.
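The effect described here, where the pooled trend points one way and the trend within each group points the other, can be reproduced with a tiny sketch. The numbers are invented for illustration and the group names are hypothetical.

```python
def slope(points):
    """Least-squares slope of y on x for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in points)
    variance = sum((x - mean_x) ** 2 for x, _ in points)
    return covariance / variance


# Hypothetical (time, performance) readings for two groups of machines.
group_a = [(1, 5), (2, 4), (3, 3)]
group_b = [(6, 10), (7, 9), (8, 8)]

pooled_slope = slope(group_a + group_b)   # positive: "performance improves"
group_a_slope = slope(group_a)            # negative within the group
group_b_slope = slope(group_b)            # negative within the group

print(pooled_slope, group_a_slope, group_b_slope)
```

Without the group label, the regression has no way of telling these two stories apart; that label only turns up through business analysis.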
#18: So a data scientist is not just some subset of these roles, a data scientist encompasses all 3 roles.
#19: All of these people are data scientists and all of them will have significant overlap in skills with the others.
#20: Now that we know who a data scientist is, what their tools are and what their objective is – let’s talk about data. No “Data Science 101” talk would be complete without a discussion about data. These are possibly the 2 questions we get asked most: - What data do you need? - How much data do you need? As consultants, our diplomatic answer is always – it depends.
#21: This question – what data do you need? – is highly use case dependent and requires BA workshops. Cast your mind back to a few slides ago, where we showed the regression was positive without information on the machine groups, and then it became negative. Without business analysis we wouldn’t know that we needed that group data. This is why it’s so important to have data scientists on board early on in the engagement, so we can cover off some of these requirements. Keeping in mind that the aim of machine learning is to mimic human behaviour, if something is tough for an SME to identify or predict, it is most likely hard to train an ML model to do it. There are, of course, as with everything, exceptions to this rule. Exceptions to the rule – is this consulting or is this a research task that we should be taking on?
#22: How much data do you need? Commonly people will just think about volume, they want a number. I can get you 100 data points and that will all be fine. However, we need to get people out of that thinking – we need to combine volume, with veracity (reliability) and variety. The data sample we’re given needs to be representative of the population and only with all 3 will we get this.
#23: Let’s take a computer vision problem – the computer vision examples almost always have cats. How do we know this is a cat? We have 140 million interconnected neurons in our primary visual cortex, with 10s of billions of connections between them. That’s just V1; we have V2, V3, V4 and V5 for fine tuning. We are very good at learning pattern recognition as a result.
#24: This is an easy task for us but incredibly difficult for a computer. Computers just see pixel values. We can’t just write a program to recognise cats – there are too many options to encode and too many edge cases. So we use machine learning to recognise patterns. I won’t go into the details of convolutional neural networks but, as you can imagine, the transformation to go from these numbers here to the label of a cat is highly non-linear and complex. There’s no y = mx + c linear regression here. This kind of pattern recognition is going to need many, many examples to determine how to get from picture of cat to a label of cat. The InceptionV3 image recognition model from Google has just shy of 24 million parameters it needs to get right.
#25: Image data might have more complex transformation but if we have tabular data, we still need enough data in order to reason about distributions of data. Let’s say we’re trying to classify our blue circles and red crosses. And we have a new sample – green question mark. (Transition 1) This data alone could be a sample of any number of distributions. Any of these decision boundaries might be valid – we need more volume to get a better idea of distribution (Transition 2) When we have more data, all of these distributions include those first 6 points but we can see how wildly different they are, our top example would classify this as a red cross but the next 2 would be blue circles
#26: If you feed a model with incorrect or unreliable data, the results you get will be unreliable or incorrect. Depending on the accuracy you need, for each incorrect data point you feed in, you might need 5, 10, 50, 100… similar correct examples to drown the effects of it out.
#27: Let’s train a model on these pictures of goats and these pictures of horses (I didn’t use cats!) (Transition 1) Now, given this picture of a horse, what do we think our model would predict? It’s never seen a side profile of a horse – it’s got 4 legs and a relatively rectangular body – it’s a goat. We could have hundreds of thousands of images like this, but we will still predict this is a goat. You may also have heard about the model that told dogs from wolves not from the animal itself but from the background: if there was snow – it was a wolf. Sample training data needs to be representative of the potential scoring population. If you’re doing crack detection for example, it’s better to have a thousand different images of cracks than a thousand images of the same crack. There is another conversation on how we go about getting this labelled data but it’s perhaps a topic for another day.
#28: This isn’t just true of image data – let’s take a look at some more tabular data. We have two classes of data – represented by blue circles and red crosses. We have a new sample represented by a green question mark and we want to know how to classify this. (Transition 1) If we have a lot of data but all our data is from the same two clusters of data – we may get a separation that looks something like this – this is something an SVC might do to classify these. But notice that the green question mark is not near either of these clusters – we’ve trained a model the best we can based on the knowledge we have of this training data but this training data is clearly not representative of the population.
#29: Now we have sparser data; there are far fewer points, and they’re actually just a different sample from the same population, but this data is a better representation of the population. (Transition 1) Now we’re a little more certain about where this green question mark should be placed. Although previously it would have been in the blue area, now it’s on the red side of our decision boundary. So variety of data is just as important as volume of data, if not more so.
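A rough sketch of the same effect follows. It swaps the SVC mentioned above for a much simpler nearest-centroid classifier so that it runs with no dependencies; the points, the query and the helper names are all invented for illustration.

```python
def centroid(points):
    """Mean position of a list of (x, y) points."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def squared_distance(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def classify(query, blue_points, red_points):
    """Assign the query to the class whose centroid is nearer."""
    d_blue = squared_distance(query, centroid(blue_points))
    d_red = squared_distance(query, centroid(red_points))
    return "blue" if d_blue < d_red else "red"


question_mark = (3.5, 8.0)

# Plenty of points, but every one comes from the same two tight clusters.
clustered_blue = [(2.0, 2.0), (2.2, 1.8), (1.8, 2.2), (2.1, 2.1), (1.9, 1.9)]
clustered_red = [(6.0, 2.0), (6.2, 1.8), (5.8, 2.2), (6.1, 2.1), (5.9, 1.9)]

# Fewer points, but spread across more of the population.
varied_blue = [(1.0, 1.0), (2.0, 0.0), (3.0, 2.0)]
varied_red = [(6.0, 2.0), (4.0, 7.0), (5.0, 9.0)]

clustered_prediction = classify(question_mark, clustered_blue, clustered_red)
varied_prediction = classify(question_mark, varied_blue, varied_red)

print(clustered_prediction, varied_prediction)
```

The larger but narrower sample puts the question mark on the blue side; the smaller but more varied sample puts it on the red side.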
#30: As part of this discussion around the types of data required of data scientists and why our samples need to be representative of the population, we also need to talk about AI ethics and responsibility, which will ultimately fall on the data scientists that design this system. Bias has different meanings in ML and stats – but here I’m talking about bias in which a model is skewed based on the data it is fed relative to accepted legal or moral principles. If you train a model to determine best candidates based on your historical successful hires, but all you’ve hired in the past is men, that model is going to be skewed to hiring men. This is an example where your training sample isn’t necessarily representative of the population. Similarly if you train a model mostly on data you’ve collected from white men, it’s going to perform better on white men. Again, you need a representative sample of the population. The Facebook example here is similar but also highlights another key issue. Even if you remove the actual labels that indicate protected classes like race, sex and religion, we need to consider proxies that might indicate these classes – hormone levels in medical records, certain words in CVs, sports activities or post codes. There are a number of techniques for reducing this kind of bias in models – including pre-processing, in-processing and post-processing algorithms – and class balancing algorithms like SMOTE can help augment under-represented classes. Ultimately, though, I think the best way for data scientists to tackle this kind of bias is to have a good domain understanding and a good understanding of the data they are using, to try and ensure they are not discriminating against any class.
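One simple check that follows from this is to look at model performance per group rather than only overall. The sketch below is one possible way to do that, not a complete fairness audit; the records, group names and `accuracy_by_group` helper are hypothetical.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """records: iterable of (group, actual_label, predicted_label) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, actual, predicted in records:
        total[group] += 1
        correct[group] += int(actual == predicted)
    return {group: correct[group] / total[group] for group in total}


# Made-up evaluation records: group "A" dominates the sample, as it would
# if the training and test data were collected mostly from one group.
records = (
    [("A", 1, 1)] * 7 + [("A", 1, 0)] * 1
    + [("B", 1, 1)] * 2 + [("B", 1, 0)] * 2
)

per_group = accuracy_by_group(records)
overall = sum(int(actual == predicted) for _, actual, predicted in records) / len(records)

print(overall)     # looks acceptable on its own
print(per_group)   # hides a much weaker result for the smaller group
```

A gap like this does not say why the model underperforms for one group, only that the question needs asking.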
#31: The final thing I want to discuss is MLOps as it’s one of the most important aspects of a data scientist’s role. Whilst experimentation, proof of concepts and proof of values are important – value realization for our customers is only possible through continuous delivery. For successful value realisation, data science solutions need to be integrated with people, process and products.
#32: Historically there has been a disconnect between data scientists and other developers, in which a model is made, perhaps through data science experimentation and then thrown over a wall to developers to deploy. Model requirements in this manner are often poorly understood by developers and changes can be difficult, resulting in model drift.
#33: Modern data science delivery integrates machine learning and DevOps, using tools designed for continuous integration and continuous delivery. Although the goal is to enable each of the technical delivery tasks shown in the top right here, there is a focus on people, process and products in order to make this a success. (Transition 1) People: Project managers, architects, data engineers, data scientists, developers, and testers should all be involved in a use case from an early stage. There is not a specific KPI for data scientists such as an RMSE value, and a different goal for developers such as API response times – the whole team shares common goals. (Transition 2) Process: We follow agile methodology principles of short sprints of 2 or 3 weeks, tracking the story points and burndown of a sprint to use in planning for the next sprints as well as retrospectives to feedback to each other what’s going well, what’s not going well and what actions we can take to maintain velocity. (Transition 3) Products: We should use products that enhance our productivity, not products shoehorned in because they are an individual’s favourite tool. Knowledge sharing through wikis and Teams is encouraged across the team so that we can make sure everyone can contribute to a range of tasks.
#34: MLOps is the integration of machine learning into DevOps processes and, as we see on screen, it is the ability to continuously integrate, automatically test, build, deploy and monitor Machine Learning artifacts such as Data & Training pipelines and models. The aim of data science modelling is not necessarily to be right today but to be less wrong each day through an iterative feedback cycle. Monitoring models and, for those that have done scrum principles or operations management training, the principles of Kaizen (a Japanese term that means “continuous improvement”) are therefore of paramount importance.
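As one small example of what monitoring can mean in practice, the sketch below flags when a live feature has moved away from what the model saw in training. It is a deliberately crude check under stated assumptions: the readings are invented, and both `mean_shift_in_std` and the threshold of 2 standard deviations are illustrative choices rather than a recommended standard.

```python
import statistics


def mean_shift_in_std(training_values, live_values):
    """How far the live mean sits from the training mean, in training std devs."""
    train_mean = statistics.mean(training_values)
    train_std = statistics.stdev(training_values)
    return abs(statistics.mean(live_values) - train_mean) / train_std


DRIFT_THRESHOLD = 2.0  # hypothetical alerting threshold

# Made-up sensor readings for a single feature.
training = [10.0, 11.0, 9.0, 10.5, 9.5, 10.0]
live_stable = [10.2, 9.8, 10.1, 9.9]
live_drifted = [13.0, 13.5, 12.8, 13.2]

stable_alert = mean_shift_in_std(training, live_stable) > DRIFT_THRESHOLD
drifted_alert = mean_shift_in_std(training, live_drifted) > DRIFT_THRESHOLD

print(stable_alert, drifted_alert)
```

Run on a schedule inside a pipeline, a check like this is one input to the feedback cycle: an alert prompts investigation and possibly retraining.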
#35: Data science isn’t something that happens in a silo. This is a team effort among a team that share common goals. Not all projects are going to need all of these resources but all projects require the concerted effort of a team to make them a success. We want to work with others in order to make sure our engagements are successful.
