Machine Learning
Methods and Analysis
Smita Agrawal
Popular AI Methodologies
What is Machine Learning?
✓ Machine learning is simply a set of computer programs that can teach themselves to grow and adapt when exposed to new data
✓ Machine learning is broadly categorised into Supervised, Unsupervised and Reinforcement Learning techniques
Machine Learning
  Supervised Learning
    Classification: Churn Prediction, Fraud Detection, Image Classification
    Regression: Market Mix Modelling, ARPU Forecasting, Life/Age Expectancy, Advertising Popularity Prediction
  Unsupervised Learning
    Dimensionality Reduction: Meaningful Compression, Feature Selection
    Clustering: Customer Profiling, Targeted Marketing, Recommender Systems
  Reinforcement Learning
    Real-time Decisions, Robot Navigation, Self-Learning Models
The crux of Machine Learning lies in “History Repeats Itself!”
How to be Machine Learning enabled?
A Superficial View of steps
1. Data Curation
Data curation involves gathering data across relevant attributes that distinguish the cases of interest and thereby help us learn more about why something is happening
2. Processing Data
Using measures such as the mean, median or mode to process the curated data so that any outliers or anomalies can be addressed
3. Resampling
Resampling the data so that it is free of biased or over-represented characteristics, with records selected such that all the characteristics are balanced
4. Variable Selection
Not all the variables in the curated data really distinguish the characteristics. Variable selection is a critical step that retains only the variables that do.
5. Predictive Model
A suitable machine learning model can now be applied to the data obtained from step 4 to learn and store the patterns observed
6. Generate Predictions
This is the most exciting step, as it is the stage where we can predict, with a certain confidence, whether a data point belongs to a characteristic
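Steps 2 and 3 can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the slides mention mean, median or mode but do not say how outliers are detected, so the interquartile-range rule used here is an assumption, as are the function names and the choice of random oversampling to balance the classes.

```python
import random
import statistics


def replace_outliers_with_median(values, k=1.5):
    """Step 2 sketch: replace missing values and outliers with the median.

    Outliers are flagged with the IQR rule (an assumption, not from the slides).
    """
    present = [v for v in values if v is not None]
    median = statistics.median(present)
    q1, _, q3 = statistics.quantiles(present, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [median if v is None or not (low <= v <= high) else v
            for v in values]


def oversample_to_balance(rows, label_of, seed=0):
    """Step 3 sketch: duplicate minority-class rows at random until every
    class has as many rows as the largest class."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(label_of(row), []).append(row)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced
```

Oversampling is only one way to balance the data; undersampling the majority class is an equally valid reading of the resampling step.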
Data is the oil for our machine learning engine. However, oil alone will not ignite the engine!
Data oil combined with the right levers will ignite the engine and enable us to achieve our imperfectly perfect predictions!
Data Drift
✓ The emergence of big data has created tremendous opportunities for businesses to gain real-time insights
✓ Businesses can make more informed decisions by leveraging data from the exploding number of digital systems
✓ However, as is often the case with disruptive technologies, the innovations behind big data have created a critical problem, one that we call Data Drift
✓ Data drift creates serious challenges to fully harnessing the insights available from big data
Data drift is defined as:
The unpredictable, unannounced and unending mutation of data characteristics caused by the operation, maintenance and modernization of the systems that produce the data
Some Severe Impacts:
✓ Erodes data fidelity
✓ Erodes operational reliability
✓ Ultimately erodes the productivity of your data scientists and engineers
✓ Increases your costs
✓ Delays time to analysis
✓ Decreases the productivity and agility of your data engineers
✓ Leads to poor decision-making by data scientists and the line of business
Source: StreamSets
Fail-Safe Mechanism - Prepping Validation for Automated Predictions
1. Ensure that the variable data types in the validation data match those in the historical data
2. Match all the unique values of categorical variables against the historical data to check for any new values that have come in with the daily data
3. Measure the mean and most frequent value of numerical variables in both the historical and the validation data
4. Any new values observed in categorical variables in the validation data are handled as part of the fail-safe mechanism
5. No predictions are generated for these records: the predictive model does not know the new value, because there were no instances of it in the history when the model was built, and scoring it would cause a fatal error leading to no predictions
6. Generate a report of all the excluded records by variable to further analyse the trend drifts and enhance the model
7. Ongoing Data Quality Check: check whether there is a behavioural shift of more than ±5% in the modal value across variables, to ensure that no part of the data curation stage was broken, which could have a consequent impact on the predictions
8. The variation should be measured against the historical data’s modal values
9. This should be followed as a real-time (daily) practice, even while the model is being updated automatically
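The checks above could be sketched as follows, using plain Python dictionaries as records. All function names are hypothetical, and the ±5% rule is read here as a relative change in the modal value of a numerical variable compared with history, which is one possible interpretation of step 7 rather than a confirmed definition.

```python
from collections import Counter


def column_types(rows):
    """Map each column name to the set of Python types seen in it."""
    types = {}
    for row in rows:
        for name, value in row.items():
            types.setdefault(name, set()).add(type(value))
    return types


def type_mismatches(history, validation):
    """Check 1: columns whose validation types were never seen in history."""
    hist_types = column_types(history)
    return sorted(name for name, seen in column_types(validation).items()
                  if not seen <= hist_types.get(name, set()))


def split_unseen(history, validation, categorical):
    """Checks 2, 4, 5, 6: hold back rows whose categorical values are absent
    from history, and count the excluded rows per variable for the report."""
    known = {col: {row[col] for row in history} for col in categorical}
    scorable, report = [], Counter()
    for row in validation:
        new_cols = [col for col in categorical if row[col] not in known[col]]
        if new_cols:
            report.update(new_cols)
        else:
            scorable.append(row)
    return scorable, dict(report)


def mode(rows, col):
    """Most frequent value of a column."""
    return Counter(row[col] for row in rows).most_common(1)[0][0]


def modal_shift_exceeds(history, validation, col, tolerance=0.05):
    """Checks 7-8: relative change in the modal value versus history.

    Assumes the modal values of the column are numeric.
    """
    old, new = mode(history, col), mode(validation, col)
    if old == 0:
        return new != 0
    return abs(new - old) / abs(old) > tolerance
```

Only the rows returned as scorable would be passed to the model; the per-variable counts feed the exclusion report described in step 6.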
Some Common Data Drift Measures
01 Population Stability Index
02 Kolmogorov-Smirnov Statistic
03 Kullback-Leibler Divergence
04 Histogram Intersection
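For reference, the four measures listed above can be computed with a few lines of standard-library Python. This is a minimal sketch using the textbook definitions; the helper names, the binning helper and the small `eps` floor for empty bins are assumptions made for illustration, not something specified in the slides.

```python
import math
from bisect import bisect_right


def bin_shares(values, edges, eps=1e-6):
    """Share of values falling in each bin defined by sorted edges.

    eps keeps empty bins away from log(0) in PSI and KL divergence.
    """
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect_right(edges, v)] += 1
    return [max(c / len(values), eps) for c in counts]


def psi(expected, actual):
    """Population Stability Index: sum((a - e) * ln(a / e)) over bins."""
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) = sum(p * ln(p / q))."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def histogram_intersection(p, q):
    """Overlap of two normalised histograms; 1.0 means identical."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the two empirical cumulative distribution functions."""
    a, b = sorted(sample_a), sorted(sample_b)
    return max(abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
               for x in a + b)
```

PSI, KL divergence and histogram intersection work on binned shares, so the historical and validation data must be binned with the same edges; the KS statistic works directly on the raw samples.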
Smita Agrawal
Thank you!!!

Data drift and machine learning
