Practical Data Science
Implementation on AWS
Ding Li 2021.8
1. Analyze Datasets and Train ML Models using AutoML
Data Science and Cloud
Register Data with AWS Glue and Query Data with Athena
Data Visualization
Statistical Bias and SageMaker Clarify
Covariate drift: the distribution of the independent variables (the features) changes.
Prior probability drift: the distribution of the labels (the target variable) changes.
Concept drift: the relationship between the features and the labels changes. Concept drift,
also called concept shift, can occur when the definition of the label itself depends on
a particular feature such as age or geographical location.
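As a rough illustration of prior probability drift, the sketch below compares the label distribution of a training set with that of more recent data using the total variation distance. The two label lists and the function names are made up for this example; the slides do not prescribe a particular drift measure.

```python
from collections import Counter


def distribution(values):
    """Return the empirical distribution of a list of discrete values."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}


def total_variation_distance(p, q):
    """Half the sum of absolute differences between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Hypothetical labels: 30% positive at training time, 60% positive later.
train_labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
live_labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]

label_drift = total_variation_distance(
    distribution(train_labels), distribution(live_labels)
)
print(f"label drift (total variation distance): {label_drift:.2f}")
```

The same comparison applied to a feature column instead of the labels would give a simple check for covariate drift.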
Measure
Class Imbalance (CI)
• Measures the imbalance in the number of examples provided for different facet values.
• Example: does a particular product category have a disproportionately large number of reviews
compared with the other categories in the dataset?
Difference in Proportions of Labels (DPL)
• Measures the imbalance of positive outcomes between different facet values.
• Example: does a particular product category have disproportionately higher ratings than the other categories?
Amazon SageMaker Clarify
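The two metrics above can be computed by hand on a small dataset. The sketch below uses the usual definitions, CI = (n_a - n_d) / (n_a + n_d) and DPL = q_a - q_d, where n_a and n_d are the example counts for the two facet values and q_a and q_d are their proportions of positive labels. The review data and helper names are invented for illustration; this is not the SageMaker Clarify API.

```python
def class_imbalance(n_a, n_d):
    """CI in [-1, 1]; 0 means both facet values have the same number of examples."""
    return (n_a - n_d) / (n_a + n_d)


def difference_in_proportions_of_labels(pos_a, n_a, pos_d, n_d):
    """DPL in [-1, 1]; 0 means both facet values have the same positive rate."""
    return pos_a / n_a - pos_d / n_d


def facet_counts(rows, facet_value):
    """Count examples and positive labels inside and outside one facet value."""
    in_facet = [label for category, label in rows if category == facet_value]
    rest = [label for category, label in rows if category != facet_value]
    return len(in_facet), sum(in_facet), len(rest), sum(rest)


# Hypothetical reviews: (product category, 1 if the rating is positive else 0).
reviews = (
    [("Dresses", 1)] * 5 + [("Dresses", 0)] * 1
    + [("Shoes", 1)] * 1 + [("Shoes", 0)] * 1
)

n_a, pos_a, n_d, pos_d = facet_counts(reviews, "Dresses")
ci = class_imbalance(n_a, n_d)
dpl = difference_in_proportions_of_labels(pos_a, n_a, pos_d, n_d)
print(f"CI = {ci:.2f}, DPL = {dpl:.2f}")
```

Values far from zero suggest that one facet value is over-represented (CI) or receives positive labels more often (DPL).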
Feature Importance SHAP
Rank the individual features in the order of their importance and
contribution to the final model.
SHAP (SHapley Additive exPlanations)
A game-theoretic approach to explaining the output of any machine
learning model. It connects optimal credit allocation with local
explanations using the classic Shapley values from game theory and
their related extensions.
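To make the idea of Shapley values concrete, the sketch below computes them exactly for a tiny, made-up model by averaging each feature's marginal contribution over every feature ordering. This brute-force approach is only practical for a handful of features and is not how the SHAP library works internally; `toy_model`, the input, and the all-zero baseline are assumptions chosen for illustration.

```python
from itertools import permutations


def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline)."""
    n = len(x)
    totals = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        current = list(baseline)
        previous_output = model(current)
        for i in order:
            # Switch feature i from its baseline value to its actual value
            # and credit the change in output to that feature.
            current[i] = x[i]
            output = model(current)
            totals[i] += output - previous_output
            previous_output = output
    return [total / len(orders) for total in totals]


def toy_model(features):
    """Hypothetical model with one interaction term between x0 and x2."""
    x0, x1, x2 = features
    return 2 * x0 + 3 * x1 + x0 * x2


phi = shapley_values(toy_model, x=[1.0, 1.0, 1.0], baseline=[0.0, 0.0, 0.0])
print(phi)
```

The contributions sum to the difference between the model output for the input and for the baseline, and the interaction term is split evenly between the two features involved.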
New Data Flow
Import Data
Add Data Analysis
Feature Importance
• AutoML lets experts focus on the hard problems that AutoML cannot solve.
• AutoML reduces repetitive work, so experts can apply their domain knowledge to analyzing the results.
Automatic data pre-processing and feature engineering
• Automatic data pre-processing and feature engineering fills in missing data, provides statistical insights about the columns in your dataset, and
extracts information from non-numeric columns, such as date and time information from timestamps.
• Automatic ML model selection infers the type of prediction that best suits your data, such as binary classification, multi-class classification, or regression. SageMaker
Autopilot then explores high-performing algorithms such as gradient-boosted decision trees, feedforward deep neural networks, and logistic regression, and trains and optimizes hundreds of models based
on these algorithms to find the model that best fits your data.
• The model leaderboard lets you view the list of models ranked by metrics such as accuracy, precision, recall, and area under the curve (AUC), review model details such as the impact of features on
predictions, and deploy the model that is best suited to your use case.
Amazon SageMaker Built-in Algorithms
Explore the Use Case and Analyze the Dataset:
• AWS Data Wrangler
• AWS Glue
• Amazon Athena
• Matplotlib
• Seaborn
• Pandas
• NumPy
Data Bias and Feature Importance:
• Measure Pretraining Bias - Amazon SageMaker
• SHAP
Automated Machine Learning:
• Amazon SageMaker Autopilot
Built-in algorithms:
• Elastic Machine Learning Algorithms in Amazon SageMaker
• Word2Vec algorithm
• GloVe algorithm
• FastText algorithm
• Transformer architecture, "Attention Is All You Need"
• BlazingText algorithm
• ELMo algorithm
• GPT model architecture
• BERT model architecture
• Built-in algorithms
• Amazon SageMaker BlazingText
2. Build, Train, and Deploy ML Pipelines using BERT
• Dataset best fits the algorithm
• Improve ML model performance
Feature Engineering Steps
Feature Engineering Pipeline
Split Dataset
Feature Engineering
BERT Embedding
SageMaker Processing with scikit-learn
Parameters: code, inputs (ProcessingInput), outputs (ProcessingOutput)
Feature Store – Reuse the feature engineering results
Centralized, reusable, discoverable
Artifact
• The output of a step or task, which can be consumed by the next
step in a pipeline or deployed directly for consumption
SageMaker Pipelines
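The idea of artifacts flowing between steps can be sketched in plain Python. The example below is a conceptual illustration only and does not use the SageMaker Pipelines SDK; the `Artifact` class, the step functions, and the sample reviews are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Artifact:
    """Named output of a pipeline step."""
    name: str
    payload: Any


def run_pipeline(steps, initial):
    """Run steps in order; each step consumes the previous step's artifact."""
    artifacts = [initial]
    for step in steps:
        artifacts.append(step(artifacts[-1]))
    return artifacts


def process(raw):
    # Normalize the raw text so later steps see consistent input.
    cleaned = [text.strip().lower() for text in raw.payload]
    return Artifact("processed-data", cleaned)


def train(data):
    # Stand-in for training: the "model" is just the vocabulary it has seen.
    vocabulary = sorted({word for text in data.payload for word in text.split()})
    return Artifact("model", {"vocabulary": vocabulary})


def evaluate(model):
    return Artifact("evaluation", {"vocabulary_size": len(model.payload["vocabulary"])})


raw = Artifact("raw-reviews", ["  Great product ", "great PRICE"])
lineage = run_pipeline([process, train, evaluate], raw)
for artifact in lineage:
    print(artifact.name, artifact.payload)
```

Keeping every intermediate artifact, rather than only the final one, is what makes it possible to trace how a deployed model was produced.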
Feature Engineering and Feature Store:
• RoBERTa: A Robustly Optimized BERT Pretraining Approach
• Fundamental Techniques of Feature Engineering for Machine Learning
Train, Debug, and Profile a Machine Learning Model:
• PyTorch Hub
• TensorFlow Hub
• Hugging Face open-source NLP transformers library
• RoBERTa model
• Amazon SageMaker Model Training (Developer Guide)
• Amazon SageMaker Debugger: A system for real-time insights into machine learning model training
• The science behind SageMaker’s cost-saving Debugger
• Amazon SageMaker Debugger (Developer Guide)
• Amazon SageMaker Debugger (GitHub)
Deploy End-To-End Machine Learning Pipelines:
• A Chat with Andrew on MLOps: From Model-centric to Data-centric AI
3. Optimize ML Models and Deploy Human-in-the-Loop Pipelines
Advanced model training, tuning, and evaluation:
• Hyperband
• Bayesian Optimization
• Amazon SageMaker Automatic Model Tuning
Advanced model deployment and monitoring:
• A/B Testing
• Autoscaling
• Multi-armed bandit
• Batch Transform
• Inference Pipeline
• Model Monitor
Data labeling and human-in-the-loop pipelines:
• Towards Automated Data Quality Management for Machine Learning
• Amazon SageMaker Ground Truth Developer Guide
• Create high-quality instructions for Amazon SageMaker Ground Truth labeling jobs
• Amazon SageMaker Augmented AI (Amazon A2I) Developer Guide
