Data Preparation and the Importance of How Machines Learn
Rebecca Vickery, Data Scientist, Holiday Extras
Machine learning
Source: Google images
Simple ML workflow
Get data >> baseline model >> model selection >> model tuning >> predict
Simple ML workflow
Get data >>
Features/Inputs | What we want to predict (target)
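A minimal "get data" sketch, assuming scikit-learn's built-in iris data as a stand-in for the toy data set used in the talk (the real data set is not named here):

```python
# Get data: X holds the features/inputs, y is what we want to predict.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold back a test set so later models can be scored on unseen data.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
```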
Simple ML workflow
Baseline model >>
Accuracy score (perfect = 1.0): 0.44
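The speaker notes describe the baseline as a "dumb" model that predicts everything as the most frequent class. A sketch using scikit-learn's DummyClassifier, reusing the train/test split from the sketch above:

```python
# Baseline model: always predict the most frequently occurring class.
# Any "clever" model should comfortably beat this score.
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
print(accuracy_score(y_test, baseline.predict(X_test)))  # e.g. 0.44 in the talk
```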
Simple ML workflow
Model selection >>
Best model = Random Forest
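A sketch of the model selection step. The talk does not list the candidate models, so the ones below are illustrative; the idea is to compare cross-validation scores and keep the best performer (Random Forest here):

```python
# Compare a few candidate classifiers with 5-fold cross-validation.
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(),
    "random_forest": RandomForestClassifier(random_state=42),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X_train, y_train, cv=5)
    print(name, scores.mean())  # keep the model with the best mean score
```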
Simple ML workflow
Hyperparameter optimisation >>
Best score = 1.0
Best Params = {'max_depth': 5, 'min_samples_leaf': 1, 'min_samples_split': 10, 'n_estimators': 500}
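A sketch of the tuning step using scikit-learn's GridSearchCV. The grid below is an assumption that simply includes the winning values shown on the slide:

```python
# Hyperparameter optimisation: exhaustive search over a small parameter grid.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "max_depth": [3, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "min_samples_split": [2, 5, 10],
    "n_estimators": [100, 500],
}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_score_)   # 1.0 on the toy data in the talk
print(search.best_params_)  # e.g. {'max_depth': 5, 'min_samples_leaf': 1, ...}
```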
Source: Google images
What happens when we have this data set?
What happens when we have this data set?
Source: thedailybeast.com
Actual ML workflow
Get data >> data preparation >> feature engineering >> baseline model >> model selection >> model tuning >> predict
Label encoding
Problem
4 is bigger than 1, so there must be a relationship between these rows
1 = neutered male
2 = spayed female
3 = intact male
Source: flaticon.com
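To illustrate the problem, a small sketch using scikit-learn's LabelEncoder on made-up animal shelter style values (the values are assumptions, not the talk's exact data):

```python
# Label encoding maps each category to an arbitrary integer; the model may
# read a numeric ordering into values where none exists.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

sex = pd.Series(["Neutered Male", "Spayed Female", "Intact Male", "Neutered Male"])
print(LabelEncoder().fit_transform(sex))  # [1 2 0 1] -- the sizes mean nothing
```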
Solution: One hot encoding
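A sketch of one hot encoding with pandas get_dummies on the same made-up values; each category becomes its own 0/1 column, so no artificial ordering is introduced:

```python
# One hot encoding: one binary column per category.
import pandas as pd

sex = pd.Series(["Neutered Male", "Spayed Female", "Intact Male"])
print(pd.get_dummies(sex))  # columns: Intact Male, Neutered Male, Spayed Female
```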
Ordinal data
Problem: one hot encoding won't work for all variables
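For ordinal features such as "age upon outcome" (the example in the speaker notes), mapping to a numeric scale preserves the ordering that one hot encoding would throw away. An illustrative sketch; the helper and values are assumptions:

```python
# Convert age strings with a natural order into a single numeric column (days).
import pandas as pd

def age_to_days(age: str) -> int:
    value, unit = age.split()
    factor = {"day": 1, "week": 7, "month": 30, "year": 365}[unit.rstrip("s")]
    return int(value) * factor

ages = pd.Series(["2 years", "3 months", "10 days"])
print(ages.map(age_to_days))  # 730, 90, 10
```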
366 different unique values = 366 new features
Solution: Feature engineering?
Single colour vs. multi colour
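An illustrative sketch of this engineered feature: flag whether the colour string names more than one colour (the "/" separator is an assumption about how multi-colour values are written):

```python
# Collapse the 366 raw colour values into a single/multi colour flag.
import pandas as pd

colours = pd.Series(["Black", "Brown Tabby/White", "Tan", "Black/White"])
is_multi_colour = colours.str.contains("/").astype(int)
print(is_multi_colour)  # 0, 1, 0, 1
```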
Problem: We will lose a lot of information
Source: thetelegraph.com
Solution: Weight of evidence
For each colour (e.g. Tan):
WOE = log( (pi / p) / (ni / n) )
pi = number of times Tan appears in the positive class (1)
p = total number of samples in the positive class (1)
ni = number of times Tan appears in the negative class (0)
n = total number of samples in the negative class (0)
Solution: Weight of evidence
Output is a positive or negative number: positive when the value is associated with the positive class, negative when it is associated with the negative class
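A sketch implementing the weight of evidence formula above for a single value; the colour and outcome series below are made-up examples:

```python
# Weight of evidence for one category of a feature against a binary target.
import numpy as np
import pandas as pd

def weight_of_evidence(feature: pd.Series, target: pd.Series, value: str) -> float:
    positive, negative = target == 1, target == 0
    p_i = ((feature == value) & positive).sum()  # value's count in the positive class
    n_i = ((feature == value) & negative).sum()  # value's count in the negative class
    p, n = positive.sum(), negative.sum()        # class totals
    # A real implementation would smooth zero counts to avoid division by zero.
    return float(np.log((p_i / p) / (n_i / n)))

colour = pd.Series(["Tan", "Black", "Tan", "Black", "Tan"])
outcome = pd.Series([1, 0, 1, 0, 0])
print(weight_of_evidence(colour, outcome, "Tan"))  # positive: Tan leans towards outcome 1
```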
Solution(s): WOE is one of many solutions for this
Problem(s): experimenting with all these techniques and combinations by hand is time consuming
Source: Photo by Louis Reed on Unsplash
Solution: scikit-learn pipelines
pip install category_encoders
Pipeline example
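The pipeline code shown on the slide is not reproduced here; below is a hedged sketch of the same idea, chaining category_encoders encoders and a model so that preprocessing is applied as part of model fitting. The column names mimic the animal shelter data and are assumptions:

```python
# Chain preprocessing and a classifier so they are fitted together.
import category_encoders as ce
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

preprocess = ColumnTransformer(
    transformers=[
        ("onehot", ce.OneHotEncoder(), ["sex_upon_outcome"]),  # low-cardinality categories
        ("woe", ce.WOEEncoder(), ["color"]),                   # high-cardinality categories
    ],
    remainder="passthrough",  # keep already-numeric columns as they are
)

model = Pipeline([
    ("preprocess", preprocess),
    ("classifier", RandomForestClassifier(random_state=42)),
])

# model.fit(X_train, y_train) would apply the encoders and fit the classifier in one step.
```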
Less time, but still some work to do
“There are only two Machine Learning approaches that win competitions: Handcrafted & Neural Networks.”
Anthony Goldbloom, CEO & Founder, Kaggle
Thanks for listening
Find me at...

Editor's Notes

  • #4: Before I get into talking more about ML, I will recap what it is. For the purposes of this talk I am only going to cover supervised ML. In supervised ML we provide the algorithm with labelled examples. The algorithm learns a mapping, and uses this mapping (pattern/rules) to predict unlabelled data.
  • #5: This mapping of inputs to outputs is based on the algorithm performing a large number of mathematical computations very quickly. The maths behind most machine learning models is based on linear algebra and calculus. I am not going to talk in detail about the maths; I think there is only one very simple equation in the talk. But the fact that in ML machines learn using maths is important to this talk.
  • #6: So to build up this picture of how machines learn, I'll walk through what a simple ML workflow looks like. I'll be showing a few code examples. This is all Python based, as I mainly work in Python, and I will be using the open source ML library scikit-learn. Is anyone familiar?
  • #8: Create a dumb model. In this case I have created a model that predicts everything as the most frequently occurring class. This is so that you know that the improvements you make or cleverer models you create really are clever and aren’t doing anything stupid.
  • #11: This is a basic workflow on toy data, but in the real world things are never that simple. For example, I have yet to ever see a 100% accuracy score on real data. The other thing to note is that because this is a toy data set, all the features were numerical. In real life this is very rarely the case. You will most often be dealing with different data types.
  • #12: So what happens when we have a data set like this? This data is taken from Kaggle, a machine learning competition website. It contains attributes of various pets that have been placed in an animal shelter. The goal of this competition is to predict whether the animal will have a successful outcome or not. All the data in this is categorical or string apart from the target variable. Introduce Kaggle.
  • #13: So this is what happens when we try to run ML on this: we get an error, because the algorithms only understand numbers. So we need to translate all these features into numbers, a language that the machine can understand and learn from.
  • #14: And this is not a simple process, because as much as ML is often hyped as being a wonder tool, if humans don't think very carefully about the data that they are feeding the machine then machines can be really dumb. Machines can learn patterns in data but they cannot think beyond the data and the format that data comes in, so humans need to do the thinking. This is why this joke is often made when talking about data science, but it is actually very true.
  • #15: https://www.thedailybeast.com/why-doctors-arent-afraid-of-better-more-efficient-ai-diagnosing-cancer. Machines can learn but they can't think… yet. Humans need to do the thinking; humans supply the context.
  • #16: In reality, this is what a real ML workflow looks like. The data preparation part, especially when we are dealing with something like the animal shelter data set, is one of the most time-consuming parts.
  • #17: So how do we do this conversion? One solution is called label encoding. Talk through what it is.
  • #18: The problem is that all the machine sees is numbers and the relationships between them. It does not have any context beyond that. It doesn’t know that this is a mapping to something meaningful from a human perspective. So we need a better way to represent these numbers.
  • #19: One solution is known as one hot encoding. Explain.
  • #20: But one hot encoding can't just be used for all categorical features. For example, with the age upon outcome there is a relationship between the rows: 1 year is smaller than 2 years. It is important that the machine is able to capture this context too. So with this feature it makes sense to map it to its equivalent numerical representation, in this case into days.
  • #21: One hot encoding also doesn't work for features with high cardinality, or in other words a large number of unique values in the feature. For example, color has 366 different values. If we did one hot encoding we would create 366 new columns, which would make the dataset extremely wide and sparse. This is a problem because it can add a lot of noise to the data and lead to overfitting.
  • #22: One solution to this is to engineer new intuitive features. So again this goes back to the importance of humans doing the thinking, and why, when people list out the skills data scientists need, they list domain knowledge. You can also use data analysis to try to understand some relationships in the data to derive these features, and you will also need to do some trial and error. So one example of a feature we could engineer would be single colour pets vs multi colour pets. My cats - stuff. Maybe there is a relationship there.
  • #23: The problem with this approach is that, even with the best will in the world, the most intuition or domain knowledge, and lots of data analysis, by doing only feature engineering you lose one of the advantages of ML: used correctly, it can pick up on hidden patterns that a human could not see. With feature engineering alone we may miss things like this. Has anyone heard about this? Black cats are finding it much harder to find homes because they don't photograph well.
  • #24: There are solutions to this. We can use maths to calculate features that attempt to capture the patterns in these features. One example of this is weight of evidence. The output is a positive or negative number; if the number is positive then the appearance of Tan is a positive influence on the positive outcome. The machine can learn this relationship if it is present across enough samples.
  • #26: There are many different methods to compute these.
  • #27: In machine learning we need to experiment with different techniques and combinations of techniques to find the optimal solution for the data set. If we have to manually code all these solutions then this can be very time consuming.
  • #28: Fortunately there are two solutions for this. Scikit-learn has a feature called pipelines. Pipelines allow you to chain together steps. These steps can be a number of things, but the important one in this example is the preprocessing steps. We can chain together all the different methods for preprocessing, apply them to the desired columns, and then when we perform model fitting the preprocessing is applied at the same time.
  • #29: To code something like weight of evidence is a lot of code, a lot of logic to work through making sure calculations are correct and so on. If we had to repeat this to try all these different encoders it would take a very long time. Fortunately there is a Python library called category encoders that does this work for us.
  • #30: Let's look at an example. Talk through.
  • #31: So this makes this process quite a lot easier but there is still a lot of work to do on feature engineering.
  • #32: The CEO of Kaggle once said this, having observed winning and losing solutions in Kaggle competitions. By handcrafted he is talking about feature engineering. One competition had a dataset containing a number of features about cars at auction, for example mileage, age, make, model, colour, and the task was to predict which ones would be a good buy and which would be a lemon. The winning entry grouped the cars into unusual colours (so not commonly occurring) and usual colours. This turned out to be one of the most predictive features and won them the competition.