Guiding through a typical Machine Learning Pipeline
ML Pipeline
The standard machine learning pipeline is derived from the CRISP-DM model.
[Figure: pipeline flowchart. Datasets → (1) Data Retrieval → (2) Data Preparation & Feature Engineering (Data Processing & Wrangling; Feature Extraction & Engineering; Feature Scaling & Selection) → (3) Modeling (ML Algorithm) → (4) Model Evaluation & Tuning → Satisfactory performance? If yes: (5) Deployment & Monitoring; if no: back to an earlier step]
Source: Practical Machine Learning with Python
Data Retrieval
Raw Data Set
Data retrieval is mainly data collection, extraction, and acquisition from various data sources and data stores.
Data Sources or Formats, e.g.:
• CSV
• JSON
• XML
• SQL
• SQLite
• Web Scraping (DOM, HTML)
Data Descriptions:
• Numeric
• Text
• Categorical (Nominal, Ordinal)
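As a minimal sketch of retrieval from three of the formats listed above, the following uses only the Python standard library. The sample data, table name, and helper names are hypothetical and stand in for real files or databases.

```python
import csv
import io
import json
import sqlite3

# Hypothetical in-memory samples standing in for real files or databases.
CSV_TEXT = "id,price,size\n1,9.5,small\n2,12.0,large\n"
JSON_TEXT = '[{"id": 1, "price": 9.5, "size": "small"}]'


def rows_from_csv(text):
    """Parse CSV text into a list of dicts (all values arrive as strings)."""
    return list(csv.DictReader(io.StringIO(text)))


def rows_from_json(text):
    """Parse a JSON array of objects into a list of dicts (types preserved)."""
    return json.loads(text)


def rows_from_sqlite(conn, query):
    """Run a query and return each row as a dict keyed by column name."""
    conn.row_factory = sqlite3.Row
    return [dict(row) for row in conn.execute(query)]


# Build a small in-memory SQLite database to query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER, price REAL, size TEXT)")
conn.execute("INSERT INTO items VALUES (1, 9.5, 'small')")

csv_rows = rows_from_csv(CSV_TEXT)
json_rows = rows_from_json(JSON_TEXT)
sql_rows = rows_from_sqlite(conn, "SELECT id, price, size FROM items")
```

Note that CSV delivers every value as text, so numeric columns still need typecasting in the preparation step, while JSON and SQL keep their numeric types.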
“More data beats clever algorithms, but better data beats more data.”
Peter Norvig
Data Preparation & Feature Engineering
[Figure: example dataset annotated with data outcome labels, dataset features, and a feature set with categorical variables]
• In this step the data is pre-processed by cleaning, wrangling (munging), and manipulation as needed.
• Initial exploratory data analysis is also carried out.
• Data Wrangling
• Data Understanding
• Filtering
• Typecasting
• Data Transformation
• Imputing Missing Values
• Handling Duplicates
• Handling Categorical Data
• Normalizing Values
• String Manipulations
• Data Summarization
• Data Visualization
• Feature Engineering, Scaling, Selection
• Dimensionality Reduction
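A few of the wrangling methods listed above (handling duplicates, typecasting, imputing missing values, handling categorical data, normalizing values) can be sketched in plain Python. The records, column names, and helper functions are hypothetical, and mean imputation and min-max scaling are only one possible choice for each step.

```python
# Hypothetical raw records; None marks a missing value.
raw = [
    {"age": "34", "city": "Cologne", "income": 52000.0},
    {"age": "51", "city": "Berlin", "income": None},
    {"age": "34", "city": "Cologne", "income": 52000.0},  # exact duplicate
    {"age": "27", "city": "Berlin", "income": 38000.0},
]


def drop_duplicates(rows):
    """Keep the first occurrence of each identical record."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def typecast_age(rows):
    """Convert the string-typed 'age' column to int."""
    return [{**row, "age": int(row["age"])} for row in rows]


def impute_mean(rows, column):
    """Replace missing values in a numeric column with the column mean."""
    present = [row[column] for row in rows if row[column] is not None]
    mean = sum(present) / len(present)
    return [{**row, column: mean if row[column] is None else row[column]}
            for row in rows]


def one_hot(rows, column):
    """Replace a nominal column with one 0/1 indicator column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for category in categories:
            new_row[f"{column}={category}"] = int(row[column] == category)
        encoded.append(new_row)
    return encoded


def min_max_scale(rows, column):
    """Rescale a numeric column linearly to the range [0, 1]."""
    values = [row[column] for row in rows]
    low, high = min(values), max(values)
    span = (high - low) or 1.0  # avoid division by zero for constant columns
    return [{**row, column: (row[column] - low) / span} for row in rows]


clean = drop_duplicates(raw)
clean = typecast_age(clean)
clean = impute_mean(clean, "income")
clean = one_hot(clean, "city")
clean = min_max_scale(clean, "income")
```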
Modelling Procedure
Modeling
In modeling, data features are fed to an ML method or algorithm to train a model, typically by optimizing a specific cost function, with the objective of reducing errors and generalizing the representations learned from the data.
Model Types
• Linear models
  • Logistic Regression
  • Naïve Bayes
  • Support Vector Machines
• Non-parametric models
  • K-Nearest Neighbors
• Tree-based models
  • Decision tree
• Ensemble methods
  • Random forests
  • Gradient Boosted Machines
• Neural Networks
  • Densely connected neural networks (DNN)
  • Convolutional neural networks (CNN)
  • Recurrent neural networks (RNN)
Regression models
• Simple linear regression
• Multiple linear regression
• Non-linear regression
Clustering models
• Partition-based clustering
• Hierarchical clustering
• Density-based clustering
Classification models
• Binary classification
• Multi-class classification
• Multi-label classification
Activation function
Initializing parameters
Cost function, metric definition
Train with number of epochs
Evaluate model with test data
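The procedure steps above can be illustrated with a single sigmoid unit trained by batch gradient descent. This is a minimal sketch, not the method of the source book: the toy data, learning rate, epoch count, and all function names are assumptions chosen for illustration.

```python
import math
import random


def sigmoid(z):
    """Activation function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, x):
    """Weighted sum of the features passed through the activation."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def log_loss(weights, bias, X, y):
    """Cost function: mean binary cross-entropy."""
    eps = 1e-12  # keeps log() away from zero
    total = 0.0
    for x, target in zip(X, y):
        p = min(max(predict_proba(weights, bias, x), eps), 1 - eps)
        total -= target * math.log(p) + (1 - target) * math.log(1 - p)
    return total / len(X)


def accuracy(weights, bias, X, y):
    """Metric: share of correct predictions at a 0.5 threshold."""
    hits = sum((predict_proba(weights, bias, x) >= 0.5) == bool(t)
               for x, t in zip(X, y))
    return hits / len(X)


def train(X, y, epochs=200, lr=0.5, seed=0):
    rng = random.Random(seed)
    # Initialize parameters with small random values.
    weights = [rng.uniform(-0.1, 0.1) for _ in X[0]]
    bias = 0.0
    history = []
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, target in zip(X, y):
            error = predict_proba(weights, bias, x) - target
            for j, xj in enumerate(x):
                grad_w[j] += error * xj
            grad_b += error
        # One gradient step per epoch over the whole training set.
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(X)
        history.append(log_loss(weights, bias, X, y))
    return weights, bias, history


# Hypothetical toy data: class 1 sits near (1, 1), class 0 near (0, 0).
X_train = [[0.0, 0.0], [0.2, 0.1], [0.9, 0.8], [1.0, 1.0], [0.1, 0.3], [0.8, 0.9]]
y_train = [0, 0, 1, 1, 0, 1]
X_test = [[0.05, 0.1], [0.95, 0.9]]
y_test = [0, 1]

weights, bias, history = train(X_train, y_train)
test_accuracy = accuracy(weights, bias, X_test, y_test)
```

`history` holds the training cost after each epoch, and `test_accuracy` is computed only on data the model did not see during training.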
Evaluation & Tuning Methods [1]
Models have various parameters that are tuned in a process called hyper-parameter optimization to get models with the best results.
[Figures: 3-fold cross-validation; ROC curve for binary and multi-class model evaluation]
Classification models can be evaluated and tested on validation datasets (k-fold cross-validation) based on metrics like:
• Accuracy
• Confusion matrix, ROC
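A minimal sketch of k-fold cross-validation with accuracy and a confusion matrix follows. It splits the data in order without shuffling, and the majority-class "model" is a hypothetical stand-in for a real classifier; both are simplifying assumptions.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def majority_class_fit(y):
    """Hypothetical stand-in model: the most frequent training label."""
    return max(set(y), key=y.count)


y = [0, 1, 1, 0, 1, 1, 0, 1, 1]
scores = []
for train_idx, val_idx in k_fold_indices(len(y), k=3):
    majority = majority_class_fit([y[i] for i in train_idx])
    y_val = [y[i] for i in val_idx]
    scores.append(accuracy(y_val, [majority] * len(y_val)))
mean_accuracy = sum(scores) / len(scores)

cm = confusion_matrix([0, 1, 1, 0], [0, 1, 0, 0], labels=[0, 1])
```

Averaging the per-fold scores gives a more stable estimate than a single train/validation split, because every sample is used for validation once.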
Regression models can be evaluated by:
• Coefficient of Determination, R²
• Mean Squared Error
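Both regression metrics follow directly from their standard definitions; the sample values below are hypothetical.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot.

    1.0 is a perfect fit; 0.0 is no better than always predicting the mean.
    """
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
mse = mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
```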
Clustering Models can be validated by:
• Homogeneity
• Completeness
• V-measure (harmonic mean of homogeneity and completeness)
• Silhouette Coefficient
• Calinski-Harabasz Index
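The three label-based measures can be sketched from their entropy definitions: homogeneity is 1 − H(C|K)/H(C), completeness is 1 − H(K|C)/H(K), and the V-measure is their harmonic mean, where C are the true classes and K the cluster assignments. The helper names are hypothetical, and the sketch assumes ground-truth class labels are available.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy of a label sequence (natural log)."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def conditional_entropy(targets, given):
    """H(targets | given), computed from the joint label counts."""
    n = len(targets)
    joint = Counter(zip(given, targets))
    given_counts = Counter(given)
    return -sum((c / n) * math.log(c / given_counts[g])
                for (g, _), c in joint.items())


def homogeneity_completeness_v(classes, clusters):
    """Return (homogeneity, completeness, V-measure) for two labelings."""
    h_c, h_k = entropy(classes), entropy(clusters)
    homogeneity = (1.0 if h_c == 0
                   else 1.0 - conditional_entropy(classes, clusters) / h_c)
    completeness = (1.0 if h_k == 0
                    else 1.0 - conditional_entropy(clusters, classes) / h_k)
    total = homogeneity + completeness
    v_measure = 0.0 if total == 0 else 2 * homogeneity * completeness / total
    return homogeneity, completeness, v_measure


# Cluster ids differ from class ids, but the grouping is identical.
perfect = homogeneity_completeness_v([0, 0, 1, 1], [1, 1, 0, 0])
# Everything lumped into one cluster: complete, but not homogeneous.
merged = homogeneity_completeness_v([0, 0, 1, 1], [0, 0, 0, 0])
```

The Silhouette Coefficient and the Calinski-Harabasz Index are not covered here because they are computed from the feature vectors rather than from labels.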
Evaluation & Tuning Methods [2]
Bias Variance Trade-Off
• The goal is to find the best balance between bias and variance errors.
• Bias error is the difference between the expected prediction of the model estimator and the true value. It typically reflects simplifying assumptions that keep the model from capturing the underlying data and patterns.
• Variance error arises from the model's sensitivity to outliers and random noise in the training data.
[Figure: Bias Variance Trade-Off]
Underfitting
• Underfitting is seen as a parameter setup resulting in low variance and high bias.
Overfitting
• Overfitting is seen as a parameter setup resulting in high variance and low bias.
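The two extremes can be made concrete with a small sketch on hypothetical noisy data. A model that always predicts the training mean underfits (high bias), while a 1-nearest-neighbor model memorizes the training set (high variance). Both models and the data generator are illustrative assumptions.

```python
import random

rng = random.Random(42)


def noisy_sample(n):
    """Hypothetical data: y = 2x plus Gaussian noise."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    return [(x, 2 * x + rng.gauss(0, 2)) for x in xs]


train, test = noisy_sample(30), noisy_sample(30)


def mean_model(train):
    """High bias, low variance: ignores x and predicts the training mean."""
    mean = sum(y for _, y in train) / len(train)
    return lambda x: mean


def nearest_neighbor_model(train):
    """Low bias, high variance: copies the label of the closest training point."""
    return lambda x: min(train, key=lambda point: abs(point[0] - x))[1]


def mse(model, data):
    """Mean squared error of a model on (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)


# Map each model name to its (training error, test error).
results = {}
for name, fit in [("mean", mean_model), ("1-NN", nearest_neighbor_model)]:
    model = fit(train)
    results[name] = (mse(model, train), mse(model, test))
```

The 1-NN model reaches zero training error because every training point is its own nearest neighbor, so a gap between its training and test error points to overfitting. The mean model shows a large error on both sets, which is the signature of underfitting.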
Grid Search
The simplest hyper-parameter optimization method. It tries out every combination in a predefined grid of hyper-parameter settings to find the best one.
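A minimal grid search sketch follows. `validation_score` is a hypothetical stand-in for training a model with the given settings and scoring it on validation data, and the hyper-parameter names and values are assumptions.

```python
from itertools import product


def validation_score(learning_rate, depth):
    """Hypothetical stand-in for training a model and scoring it on validation data."""
    return -(learning_rate - 0.1) ** 2 - (depth - 4) ** 2


# Predefined grid: every combination of these values is evaluated.
grid = {"learning_rate": [0.01, 0.1, 1.0], "depth": [2, 4, 8]}

best_score, best_params = float("-inf"), None
for values in product(*grid.values()):
    params = dict(zip(grid.keys(), values))
    score = validation_score(**params)
    if score > best_score:
        best_score, best_params = score, params
```

The number of evaluations is the product of the list lengths (here 3 × 3 = 9), which grows quickly as more hyper-parameters are added.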
Randomized Search
A modification of grid search that evaluates a random selection of hyper-parameter settings, rather than every combination, to find the best one.
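A minimal randomized search sketch follows. It draws each setting from a range instead of a fixed grid, which is one common variant and an assumption here. `validation_score`, the ranges, and the trial count are hypothetical.

```python
import random


def validation_score(learning_rate, depth):
    """Hypothetical stand-in for training a model and scoring it on validation data."""
    return -(learning_rate - 0.1) ** 2 - (depth - 4) ** 2


def random_search(n_trials, seed=0):
    """Evaluate n_trials randomly drawn settings and keep the best one."""
    rng = random.Random(seed)
    best_score, best_params = float("-inf"), None
    for _ in range(n_trials):
        params = {
            "learning_rate": 10 ** rng.uniform(-3, 0),  # log-uniform in [0.001, 1]
            "depth": rng.randint(2, 8),                 # inclusive integer range
        }
        score = validation_score(**params)
        if score > best_score:
            best_score, best_params = score, params
    return best_score, best_params


best_score, best_params = random_search(n_trials=20)
```

The budget is fixed by `n_trials` regardless of how many hyper-parameters are searched, which is the main practical difference from an exhaustive grid.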
Deployment & Monitoring
Selected models are deployed in production and are constantly monitored based on their predictions and results.
Deployment Persistence
Model persistence is the simplest way of deploying a model. The final model is persisted on permanent media such as a hard drive. A new program must then route real-life data to the persisted model, which creates the predicted output.
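A minimal persistence sketch using the standard library's `pickle` module follows. The model here is a hypothetical dictionary of learned values, and the file name and `predict` helper are assumptions.

```python
import os
import pickle
import tempfile

# Hypothetical trained model: predicts 1 when the input exceeds the threshold.
model = {"kind": "threshold", "threshold": 0.5}


def predict(model, values):
    """Apply the persisted model to new, real-life data."""
    return [int(v > model["threshold"]) for v in values]


# Training program: write the final model to permanent media.
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

# Serving program (possibly started much later): load the model and predict.
with open(path, "rb") as f:
    restored = pickle.load(f)
predictions = predict(restored, [0.2, 0.9])
```

Unpickling can execute arbitrary code, so model files should only be loaded from trusted sources.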
Custom Development
Another option is to implement the model's prediction method separately. The output of training is then just the values of the learned parameters, which the separate implementation uses. This approach belongs to the software development domain.
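A sketch of this split, assuming a logistic model: the training side exports only the learned parameters as JSON, and the application side re-implements the prediction method independently. The parameter values and function names are hypothetical.

```python
import json
import math

# Training side: export only the learned parameters (hypothetical values).
exported = json.dumps({"weights": [1.5, -2.0], "bias": 0.25})


# Application side: an independent implementation of the prediction method.
def predict_proba(params, features):
    """Logistic prediction computed from the exported weights and bias."""
    z = sum(w * x for w, x in zip(params["weights"], features)) + params["bias"]
    return 1.0 / (1.0 + math.exp(-z))


params = json.loads(exported)
probability = predict_proba(params, [1.0, 0.5])
```

Because only plain numbers cross the boundary, the prediction code can be written in any language the production system uses.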
In-House Model Deployment
For data protection reasons, many enterprises do not want to expose the data on which their models are built and deployed. Such models can be integrated internally by putting web frameworks, APIs, or micro-services on top of the prediction models.
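As a rough sketch of an internal prediction service, the following wraps a model in a small WSGI application from the standard library. The parameters, request format, address, and port are all hypothetical, and a production setup would add authentication, input validation, and a proper application server.

```python
import io
import json
from wsgiref.simple_server import make_server

PARAMS = {"weights": [1.5, -2.0], "bias": 0.25}  # hypothetical learned parameters


def predict(features):
    """Linear score thresholded at zero."""
    score = sum(w * x for w, x in zip(PARAMS["weights"], features)) + PARAMS["bias"]
    return int(score > 0)


def app(environ, start_response):
    """WSGI app: POST a JSON body like {"features": [1.0, 0.5]}."""
    if environ.get("REQUEST_METHOD") != "POST":
        start_response("405 Method Not Allowed",
                       [("Content-Type", "application/json")])
        return [b'{"error": "use POST"}']
    length = int(environ.get("CONTENT_LENGTH") or 0)
    payload = json.loads(environ["wsgi.input"].read(length) or b"{}")
    body = json.dumps({"prediction": predict(payload.get("features", []))}).encode()
    start_response("200 OK", [("Content-Type", "application/json")])
    return [body]


def serve(host="127.0.0.1", port=8000):
    """Start a development server on an internal address (not called here)."""
    with make_server(host, port, app) as server:
        server.serve_forever()


def call(features):
    """Invoke the app in-process, without any network traffic."""
    raw = json.dumps({"features": features}).encode()
    environ = {
        "REQUEST_METHOD": "POST",
        "CONTENT_LENGTH": str(len(raw)),
        "wsgi.input": io.BytesIO(raw),
    }
    status = []
    chunks = app(environ, lambda s, headers: status.append(s))
    return status[0], json.loads(b"".join(chunks))


status, response = call([1.0, 0.5])
```

Calling `serve()` would expose the same `app` over HTTP on the local interface, so the data and the model stay inside the internal network.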
Model Deployment as a Service
The model is openly accessible and can be integrated via a cloud-based API request.
Contact
Michael Gerke
Detecon International GmbH
Sternengasse 14-16
50676 Cologne (Germany)
Phone: +49 221 91611138
Mobile: +49 160 6907433
Email: Michael.Gerke@detecon.com
Special Thanks to the author team:
• Dipanjan Sarkar
• Raghav Bali
• Tushar Sharma
