Practical data science

Practical Data Science
Implementation on AWS
Ding Li 2021.8

1. Analyze Datasets and Train ML Models using AutoML

Register Data with AWS Glue and Query Data with Athena

Statistical Bias and SageMaker Clarify
Covariate Drift: the distribution of the independent variables (the features) can change.
Prior Probability Drift: the distribution of the labels (the target variables) can change.
Concept Drift: the relationship between the features and the labels can change. Concept drift, also called concept shift, can happen when the definition of the label itself changes based on a particular feature such as age or geographical location.
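As a rough illustration of the first two drift types, the sketch below compares a training sample with a more recent sample. The function names, the choice of statistics (a mean shift for a feature, total variation distance for labels), and the sample values are assumptions made for this example, not something the slides prescribe. Concept drift is not covered here, because detecting it requires fresh labeled outcomes to compare against model predictions.

```python
from collections import Counter
from statistics import mean, stdev


def feature_mean_shift(train_values, new_values):
    """Covariate drift signal: shift of a feature's mean, in training standard deviations."""
    return (mean(new_values) - mean(train_values)) / stdev(train_values)


def label_distribution_shift(train_labels, new_labels):
    """Prior probability drift signal: total variation distance between label distributions.

    Returns 0.0 for identical distributions and 1.0 for completely disjoint ones.
    """
    train_counts = Counter(train_labels)
    new_counts = Counter(new_labels)
    classes = set(train_counts) | set(new_counts)
    return 0.5 * sum(
        abs(train_counts[c] / len(train_labels) - new_counts[c] / len(new_labels))
        for c in classes
    )


# Hypothetical data: customer ages and binary sentiment labels.
age_shift = feature_mean_shift([20, 30, 40], [40, 50, 60])
label_shift = label_distribution_shift([1, 1, 1, 0], [1, 0, 0, 0])
print(f"age mean shift: {age_shift:.2f} std, label shift: {label_shift:.2f}")
```

In practice a statistical test or a managed monitoring service would likely replace these hand-rolled statistics; the point is only that each drift type looks at a different distribution.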
Measure
Class Imbalance (CI)
• Measures the imbalance in the number of examples that are provided for different facet values.
• Does a particular product category have a disproportionately larger number of total reviews than any other category in the dataset?
Difference in Proportions of Labels (DPL)
• Measures the imbalance of positive outcomes between the different facet values.
• Does a particular product category have disproportionately higher ratings than other categories?
Amazon SageMaker Clarify
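
The two measures above can be sketched in plain Python. This is a minimal sketch, assuming the commonly documented definitions CI = (n_a - n_d) / (n_a + n_d) and DPL = q_a - q_d, where n_a and n_d are the example counts for the chosen facet value and for all other values, and q_a and q_d are their proportions of positive labels. The function names and the product-category data are hypothetical; this is not the SageMaker Clarify API.

```python
def class_imbalance(facet_values, facet_a):
    """CI in [-1, 1]: positive when facet_a has more examples than the rest."""
    n_a = sum(1 for value in facet_values if value == facet_a)
    n_d = len(facet_values) - n_a
    return (n_a - n_d) / (n_a + n_d)


def difference_in_proportions_of_labels(facet_values, labels, facet_a, positive_label=1):
    """DPL: proportion of positive labels for facet_a minus that of all other facet values."""
    labels_a = [label for value, label in zip(facet_values, labels) if value == facet_a]
    labels_d = [label for value, label in zip(facet_values, labels) if value != facet_a]
    q_a = sum(1 for label in labels_a if label == positive_label) / len(labels_a)
    q_d = sum(1 for label in labels_d if label == positive_label) / len(labels_d)
    return q_a - q_d


# Hypothetical review data: six "Dresses" reviews and two "Pants" reviews.
categories = ["Dresses"] * 6 + ["Pants"] * 2
sentiments = [1, 1, 1, 0, 0, 0, 1, 1]  # 1 = positive review

print(class_imbalance(categories, "Dresses"))
print(difference_in_proportions_of_labels(categories, sentiments, "Dresses"))
```

A CI near zero means the facet values are balanced in count, and a DPL near zero means the facet values receive positive labels at similar rates.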

Feature Importance SHAP
Rank the individual features in order of their importance and contribution to the final model.
SHAP (SHapley Additive exPlanations): GitHub, paper, YouTube
A game-theoretic approach to explaining the output of any machine learning model. It connects optimal credit allocation with local explanations using the classic Shapley values from game theory and their related extensions.
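To make the idea of credit allocation concrete, the sketch below computes exact Shapley values for one prediction by averaging each feature's marginal contribution over every feature ordering. It is an illustration under stated assumptions: "absent" features are replaced with a baseline value, the model and inputs are toy examples, and the brute-force enumeration only works for a handful of features. The SHAP library itself relies on faster approximations and model-specific algorithms rather than this loop.

```python
from itertools import permutations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley values of `instance` relative to `baseline` for the function `predict`."""
    n = len(instance)
    contributions = [0.0] * n
    for order in permutations(range(n)):
        current = list(baseline)           # start with every feature "absent"
        previous_output = predict(current)
        for i in order:
            current[i] = instance[i]       # reveal feature i
            output = predict(current)
            contributions[i] += output - previous_output
            previous_output = output
    return [c / factorial(n) for c in contributions]


def rank_features(values):
    """Feature indices sorted by absolute Shapley value, most important first."""
    return sorted(range(len(values)), key=lambda i: abs(values[i]), reverse=True)


def toy_model(x):
    """Hypothetical linear model used only for this example."""
    return 2 * x[0] + 3 * x[1] - x[2] + 1


values = shapley_values(toy_model, instance=[1, 2, 3], baseline=[0, 0, 0])
print(values, rank_features(values))
```

The values always sum to the difference between the prediction for the instance and the prediction for the baseline, which is what makes them an additive explanation.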
New Data Flow
Import Data
Add Data Analysis
Feature Importance

• AutoML lets experts focus on the hard problems that AutoML cannot solve.
• AutoML reduces repetitive work, so experts can apply their domain knowledge to analyzing the results.

Automatic data pre-processing and feature engineering
• Automatic data pre-processing and feature engineering: automatically fills in missing data, provides statistical insights about the columns in your dataset, and extracts information from non-numeric columns, such as date and time information from timestamps.
• Automatic ML model selection: automatically infers the type of prediction that best suits your data, such as binary classification, multi-class classification, or regression. SageMaker Autopilot then explores high-performing algorithms such as gradient-boosted decision trees, feedforward deep neural networks, and logistic regression, and trains and optimizes hundreds of models based on these algorithms to find the model that best fits your data.
• Model leaderboard: lets you view the list of models ranked by metrics such as accuracy, precision, recall, and area under the curve (AUC), review model details such as the impact of features on predictions, and deploy the model that is best suited to your use case.
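The kinds of automation listed above can be illustrated with a few hand-written helpers. This is only a sketch of the ideas (filling gaps, extracting timestamp parts, guessing the problem type); the function names, the median fill strategy, and the `max_classes` threshold are assumptions for illustration and do not describe how SageMaker Autopilot is implemented.

```python
from datetime import datetime
from statistics import median


def fill_missing_with_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def timestamp_features(timestamp):
    """Extract simple date and time features from an ISO 8601 timestamp string."""
    parsed = datetime.fromisoformat(timestamp)
    return {
        "year": parsed.year,
        "month": parsed.month,
        "weekday": parsed.weekday(),  # Monday = 0 ... Sunday = 6
        "hour": parsed.hour,
    }


def infer_problem_type(target_values, max_classes=20):
    """Heuristic guess of the problem type from the target column (assumed rule)."""
    distinct = set(target_values)
    if len(distinct) == 2:
        return "BinaryClassification"
    numeric = all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in distinct)
    if numeric and len(distinct) > max_classes:
        return "Regression"
    return "MulticlassClassification"


print(fill_missing_with_median([1.0, None, 3.0, 5.0]))
print(timestamp_features("2021-08-15T14:30:00"))
print(infer_problem_type([-1, 0, 1, 1, 0]))  # three sentiment classes
```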

Amazon SageMaker Built-in Algorithms

Explore the Use Case and Analyze the Dataset:
• AWS Data Wrangler
• AWS Glue
• Amazon Athena
• Matplotlib
• Seaborn
• Pandas
• NumPy
Data Bias and Feature Importance:
• Measure Pretraining Bias - Amazon SageMaker
• SHAP
Automated Machine Learning:
• Amazon SageMaker Autopilot
Built-in algorithms:
• Elastic Machine Learning Algorithms in Amazon SageMaker
• Word2Vec algorithm
• GloVe algorithm
• FastText algorithm
• Transformer architecture, "Attention Is All You Need"
• BlazingText algorithm
• ELMo algorithm
• GPT model architecture
• BERT model architecture
• Built-in algorithms
• Amazon SageMaker BlazingText

2. Build, Train, and Deploy ML Pipelines using BERT

• Dataset best fits the algorithm
• Improve ML model performance
Feature Engineering Steps
Feature Engineering Pipeline
Split Dataset
Feature Engineering
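
The "Split Dataset" step can be sketched as a shuffled three-way split into training, validation, and test sets. The 90/5/5 proportions, the fixed seed, and the function name are assumptions chosen for this example; the slides do not state which fractions were used.

```python
import random


def split_dataset(records, train_fraction=0.9, validation_fraction=0.05, seed=42):
    """Shuffle records and split them into train, validation, and test lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = int(len(shuffled) * train_fraction)
    n_validation = int(len(shuffled) * validation_fraction)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]  # whatever remains
    return train, validation, test


train, validation, test = split_dataset(range(100))
print(len(train), len(validation), len(test))
```

Splitting before feature engineering keeps statistics learned from the training set from leaking into the validation and test sets.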

BERT Embedding
SageMaker Processing with scikit-learn
Parameters: code, inputs (ProcessingInput), outputs (ProcessingOutput)
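
The kind of transformation a processing script might apply can be sketched as turning raw review text into fixed-length BERT-style inputs. This is a toy illustration: it splits on whitespace instead of using BERT's WordPiece tokenizer, and `toy_vocab` with its IDs is hypothetical. Only the special-token IDs match those used by the `bert-base-uncased` vocabulary.

```python
# Special-token IDs as used by the bert-base-uncased vocabulary.
PAD_ID, UNK_ID, CLS_ID, SEP_ID = 0, 100, 101, 102


def to_bert_features(text, vocab, max_seq_length=8):
    """Convert text into padded input IDs and an input mask of length max_seq_length."""
    # Toy whitespace tokenizer; leave room for the [CLS] and [SEP] tokens.
    tokens = text.lower().split()[: max_seq_length - 2]
    input_ids = [CLS_ID] + [vocab.get(token, UNK_ID) for token in tokens] + [SEP_ID]
    input_mask = [1] * len(input_ids)  # 1 marks real tokens, 0 marks padding
    padding = max_seq_length - len(input_ids)
    return {
        "input_ids": input_ids + [PAD_ID] * padding,
        "input_mask": input_mask + [0] * padding,
    }


toy_vocab = {"great": 1000, "dress": 1001}  # hypothetical word-to-ID mapping
print(to_bert_features("Great dress today", toy_vocab))
```

In a real processing job a pretrained tokenizer would replace the whitespace split, and the resulting features would be written to the job's output location.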

Feature Store – Reuse the feature engineering results
Centralized, reusable, discoverable

Artifact
• The output of a step or task, which can be consumed by the next step in a pipeline or deployed directly for consumption
SageMaker Pipelines
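
The way artifacts flow between steps can be sketched with a tiny in-memory pipeline runner. This is a conceptual illustration only: `run_pipeline`, the step names, and the toy "training" step are hypothetical and do not use or mirror the SageMaker Pipelines SDK.

```python
def run_pipeline(steps, initial_artifact):
    """Run steps in order; each step consumes the previous step's output artifact."""
    artifacts = {"input": initial_artifact}
    current = initial_artifact
    for name, step in steps:
        current = step(current)
        artifacts[name] = current  # keep every artifact so it can be inspected or reused
    return artifacts


def process(reviews):
    """Toy processing step: normalize the text."""
    return [review.lower() for review in reviews]


def train(processed_reviews):
    """Toy training step: the "model" artifact is just the sorted vocabulary."""
    return {"vocabulary": sorted({word for r in processed_reviews for word in r.split()})}


artifacts = run_pipeline([("process", process), ("train", train)], ["Great dress", "great fit"])
print(artifacts["train"])
```

Recording each step's output by name is what lets a later step, or a deployment, pick up exactly the artifact it needs.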

Feature Engineering and Feature Store:
• RoBERTa: A Robustly Optimized BERT Pretraining Approach
• Fundamental Techniques of Feature Engineering for Machine Learning
Train, Debug, and Profile a Machine Learning Model:
• PyTorch Hub
• TensorFlow Hub
• Hugging Face open-source NLP transformers library
• RoBERTa model
• Amazon SageMaker Model Training (Developer Guide)
• Amazon SageMaker Debugger: A system for real-time insights into machine learning model training
• The science behind SageMaker’s cost-saving Debugger
• Amazon SageMaker Debugger (Developer Guide)
• Amazon SageMaker Debugger (GitHub)
Deploy End-To-End Machine Learning Pipelines:
• A Chat with Andrew on MLOps: From Model-centric to Data-centric AI

3. Optimize ML Models and Deploy Human-in-the-Loop Pipelines

Advanced model training, tuning, and evaluation:
• Hyperband
• Bayesian Optimization
• Amazon SageMaker Automatic Model Tuning
Advanced model deployment, and monitoring:
• A/B Testing
• Autoscaling
• Multi-armed bandit
• Batch Transform
• Inference Pipeline
• Model Monitor
Data labeling and human-in-the-loop pipelines:
• Towards Automated Data Quality Management for Machine Learning
• Amazon SageMaker Ground Truth Developer Guide
• Create high-quality instructions for Amazon SageMaker Ground Truth labeling jobs
• Amazon SageMaker Augmented AI (Amazon A2I) Developer Guide
