© 2019, Amazon Web Services, Inc. or its Affiliates.
My Nguyen – Solutions Architect – Amazon Web Services Vietnam
AWS’s philosophy on
designing
MLOps platform
Dec 2020
Agenda
• What is MLOps?
• DevOps vs MLOps
• DevOps practices inheritance
• Machine learning development lifecycle
• Unique driving factors to MLOps
• Personas
• Unique challenges faced by ML workloads
• MLOps practices on Amazon SageMaker
• Complete separation of steps (and their environments)
• Versioning & tracking
• Pipeline automation
• Continuous improvement
• Demo
• QnA
What is MLOps?
Operationalizing machine learning workloads
DevOps vs MLOps
Notes: Technology is just a piece of the overall picture
DevOps practices inheritance
• Communication & collaboration
• Continuous integration
• Continuous delivery/deployment
• Microservices design
• Infrastructure-as-code & configuration-as-code
• Continuous monitoring & logging
Machine learning development lifecycle
Unique driving factors to MLOps
Personas
• Business stakeholder
• Data scientist
• Domain expert
• Data engineer
• Security engineer
• Machine learning/DevOps engineer
• Software engineer
All with different skillsets & priorities
Unique challenges
• Data:
• The need to utilize production data in development activities
• Dependencies on data pipelines
• Longer experiment lifecycles
• Output of model artifacts:
• Independent lifecycles between model and integrated applications/systems
• Monitoring & tracking of experiments and models
• Unique metrics for performance evaluation
MLOps practices on Amazon SageMaker
Complete separation of steps
Data processing → Explore & Build → Train & Validate → Deploy → Monitor
Versioning & tracking of every step
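A concrete way to think about versioning & tracking of every step is to make each training run addressable by its exact inputs: the data snapshot, the code version, the container image, and the hyperparameters. The sketch below is illustrative only (in SageMaker this role is played by features such as SageMaker Experiments); the URIs and field names are hypothetical.

```python
from dataclasses import dataclass, field
import hashlib
import json
import time

@dataclass
class RunRecord:
    """One training run, traceable back to its inputs."""
    data_uri: str          # e.g. the S3 prefix of the exact training snapshot
    code_version: str      # e.g. a git commit SHA
    image_uri: str         # the exact container image used
    hyperparameters: dict
    metrics: dict = field(default_factory=dict)
    timestamp: float = field(default_factory=time.time)

    def run_id(self) -> str:
        # Deterministic ID derived only from the inputs, so two runs with
        # identical inputs get the same ID on purpose (reproducibility check)
        payload = json.dumps(
            [self.data_uri, self.code_version, self.image_uri, self.hyperparameters],
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode()).hexdigest()[:12]

record = RunRecord(
    data_uri="s3://my-bucket/train/v3/",                      # hypothetical
    code_version="9f2c1ab",                                   # hypothetical
    image_uri="1234.dkr.ecr.us-east-1.amazonaws.com/train:1.0",  # hypothetical
    hyperparameters={"max_depth": 6, "eta": 0.3},
)
record.metrics["validation:auc"] = 0.91
print(record.run_id())
```

Because the ID excludes the timestamp and metrics, it answers the lineage question directly: two records with the same ID were produced from the same inputs.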
Pipeline automation
Metaflow · Apache Airflow · AWS Step Functions · Kubeflow · Flyte
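Whatever tool you pick, pipeline automation ultimately means expressing the workflow as code a scheduler can execute, retry, and trace. A minimal, framework-free sketch of that idea (the step functions are hypothetical stand-ins; a real deployment would use one of the orchestrators above):

```python
# Minimal pipeline-as-code sketch: steps declared as (name, function) pairs,
# executed in order, with each step's output feeding the next and every
# intermediate result recorded for traceability.
from typing import Callable

Step = tuple[str, Callable]

def run_pipeline(steps: list[Step], payload):
    trace = []
    for name, fn in steps:
        payload = fn(payload)
        trace.append((name, payload))  # lineage: what each step produced
    return payload, trace

# Hypothetical stand-ins for real pipeline tasks
def process(data):
    return [x * 2 for x in data]

def train(data):
    return {"model": sum(data)}

def deploy(model):
    return f"endpoint serving {model['model']}"

result, trace = run_pipeline(
    [("process", process), ("train", train), ("deploy", deploy)],
    [1, 2, 3],
)
print(result)  # endpoint serving 12
```

Real orchestrators add what this sketch omits: scheduling, retries, conditional branches, and distributed execution, which is exactly why the tool choice should follow your team's skill set.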
SageMaker workflow
The notebook: An entry-point / studio / IDE
[Diagram: data scientists explore and interact via the notebook, which orchestrates the SageMaker container runtime, container images in Elastic Container Registry (ECR), and data in Simple Storage Service (S3)]
SageMaker workflow
Prepare data and script; find or build container image(s)
[Diagram: training data prepared in S3, custom code in the notebook, and a training image with framework code in ECR]
SageMaker workflow
Run a training job to create a model artifact
[Diagram: the training job pulls the training image (framework code) from ECR, downloads custom code and training data from S3, and uploads the resulting model.tar.gz back to S3]
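The "custom code" a training job runs is typically a plain script. A minimal sketch of such an entry point, assuming the `SM_CHANNEL_TRAIN` and `SM_MODEL_DIR` environment variables that the SageMaker training toolkit sets inside the container (the fallbacks are the conventional in-container paths; the "training" itself is a hypothetical stand-in):

```python
# Sketch of a script-mode training entry point. The container runtime
# downloads the channel data from S3 before the script starts, and uploads
# whatever the script writes to the model directory as model.tar.gz, so the
# script itself just reads and writes local folders.
import json
import os

def train(train_dir: str, model_dir: str) -> dict:
    # Toy "training": count the records in every file of the train channel
    n_records = 0
    for name in sorted(os.listdir(train_dir)):
        with open(os.path.join(train_dir, name)) as f:
            n_records += sum(1 for _ in f)
    model = {"n_records": n_records}  # stand-in for a real model artifact
    os.makedirs(model_dir, exist_ok=True)
    with open(os.path.join(model_dir, "model.json"), "w") as f:
        json.dump(model, f)
    return model

if __name__ == "__main__":
    train(
        os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train"),
        os.environ.get("SM_MODEL_DIR", "/opt/ml/model"),
    )
```

The point of the contract is that the same script runs unchanged on a laptop (with local folders) and inside a SageMaker training job (with S3-backed folders).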
SageMaker workflow
Deploy the model to a real-time inference endpoint
[Diagram: the inference endpoint pulls an inference image from ECR, loads model.tar.gz and custom code from S3, and serves inference requests]
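Custom inference code usually just overrides a few hooks rather than implementing a model server. A sketch following the hook convention used by the SageMaker framework containers, where the server calls `model_fn` once at startup and then `input_fn` → `predict_fn` → `output_fn` per request (the JSON "model" here is a hypothetical toy):

```python
# Hook-style inference handlers: the model server owns HTTP and batching;
# custom code only defines loading, de/serialization, and prediction.
import json
import os

def model_fn(model_dir: str):
    # Called once when the endpoint starts: load the extracted model.tar.gz
    with open(os.path.join(model_dir, "model.json")) as f:
        return json.load(f)

def input_fn(request_body: str, content_type: str = "application/json"):
    # Deserialize the request payload into model-ready data
    assert content_type == "application/json"
    return json.loads(request_body)

def predict_fn(data, model):
    # Toy prediction: scale each input by a "weight" stored in the model
    return [x * model["weight"] for x in data["instances"]]

def output_fn(prediction, accept: str = "application/json") -> str:
    # Serialize the prediction back into the response body
    return json.dumps({"predictions": prediction})
```

Keeping only these four functions custom is what lets the same artifact be served behind an endpoint or a batch transform without rewriting server code.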
SageMaker workflow
(…Or run a batch transform job)
[Diagram: the transform job loads the inference image, model.tar.gz, and custom code, reads input data from S3, and writes results back to S3]
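Conceptually, a batch transform job runs the same inference code over stored files instead of live requests, then shuts the resources down. A framework-free sketch of that contract, assuming the "<input-name>.out" result naming that batch transform uses (the directory layout and predict function are hypothetical; the managed service handles the S3 download/upload itself):

```python
# Batch transform contract sketch: every record of every input file is fed
# through the prediction function, and one "<input-name>.out" file is written
# per input file. No persistent endpoint remains afterwards.
import os

def batch_transform(input_dir: str, output_dir: str, predict) -> list[str]:
    os.makedirs(output_dir, exist_ok=True)
    written = []
    for name in sorted(os.listdir(input_dir)):
        with open(os.path.join(input_dir, name)) as f:
            results = [predict(line.strip()) for line in f if line.strip()]
        out_path = os.path.join(output_dir, name + ".out")
        with open(out_path, "w") as f:
            f.write("\n".join(str(r) for r in results))
        written.append(out_path)
    return written
```

Because the per-record logic is the same as the endpoint's, batch vs real-time becomes purely a deployment choice, not a code change.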
SageMaker workflow
[Diagram: end-to-end view: the training job produces model.tar.gz, which the endpoint/transformer loads to serve inference requests; container images live in ECR, data and artifacts in S3]
Continuous improvement
SageMaker capabilities: Notebooks · Processing · Training · Experiments · Autopilot · Debugger · Hyperparameter Tuning · Hosting Services · Batch Transform · Model Monitor · Ground Truth · Amazon Augmented AI
SageMaker Studio, the first fully integrated development environment for machine learning
Demo
Transformation from local notebook to SageMaker workflow
The bigger picture
QnA
References:
https://d1.awsstatic.com/whitepapers/architecture/wellarchitected-Machine-Learning-Lens.pdf
https://github.com/aws-samples/aws-stepfunctions-byoc-mlops-using-data-science-sdk
https://github.com/apac-ml-tfc/sagemaker-workshop-101
Thank you!
My Nguyen - https://www.linkedin.com/in/mynguyen6512/

Grokking Techtalk #40: AWS’s philosophy on designing MLOps platform


Editor's Notes

  • #5: Builds on the same foundation (DevOps), but is younger and less mature
  • #7: Also pipeline-as-code & policy-as-code
  • #10: Different skillset & priorities
  • #11: Also pipeline-as-code & policy-as-code
  • #12: Code version control · shared environments · IDE (Jupyter Notebook/Lab) · infrastructure as code · self-service environments · SaaS
  • #13: Most importantly: training & processing. Separation of source, environments, etc. · security · experiment lifecycles · pricing · efficiency
  • #14: Reproducibility is hard · end-to-end traceability · dashboard ->
  • #15: Netflix built Metaflow, Lyft built Flyte; Kubeflow and Apache Airflow are also fairly popular. An important factor: your team's skill set, and what you can enforce.

Metaflow: built by Netflix, a huge customer of AWS. In production since 2018; made open source by Netflix & AWS in 2019. What is it? Basic concepts of Metaflow; deploying to AWS is easy.

Flyte: a Kubernetes-native distributed workflow orchestrator used at Lyft for data science, pricing, fraud detection, locations, ETA, and more. Enables highly concurrent, scalable workflows for ML and data processing. Core concepts of Flyte: task, DAG, workflow, control-flow specification. The actual task can be in any language; tasks are executed as containers. Flyte provisions the necessary resources dynamically, executes tasks as Docker containers, and de-provisions resources when tasks are complete to control costs. Supports execution across hundreds of machines, e.g. production model training.

Apache Airflow: Amazon SageMaker integrates with Apache Airflow 1.10.1. If you use Airflow, you can use SageMaker workflows in Apache Airflow; more details at https://sagemaker.readthedocs.io/en/stable/using_workflow.html

Kubernetes/Kubeflow: many customers want the fully managed capabilities of Amazon SageMaker for machine learning, but also want platform and infrastructure teams to continue using Kubernetes for orchestration and managing pipelines. SageMaker addresses this by letting Kubernetes users train and deploy models in SageMaker using SageMaker-Kubeflow operators and pipelines. With operators and pipelines, Kubernetes users can access fully managed SageMaker ML tools and engines natively from Kubeflow. This eliminates the need to manually manage and optimize ML infrastructure in Kubernetes while still preserving control of overall orchestration through Kubernetes, so you get the benefits of a fully managed ML service without migrating workloads. You can install the SageMaker Operator for Kubernetes using the provided Helm chart; once installed, Kubernetes users can natively invoke SageMaker features like model training, hyperparameter tuning, and batch transform jobs, and can set up model serving using SageMaker model hosting services. See https://sagemaker.readthedocs.io/en/stable/amazon_sagemaker_operators_for_kubernetes.html#what-is-an-operator and https://eksworkshop.com/advanced/420_kubeflow/pipelines/

AWS Step Functions: we see customers build serverless ML workflows using AWS Step Functions. The open-source Step Functions Data Science SDK for SageMaker creates workflows that pre-process data (e.g. using AWS Glue) and train/deploy models using SageMaker; SageMaker functionality like model training, HPO, and endpoint creation is accessible. Use the SDK to create and visualize the workflows, and scale them without having to worry about infrastructure. See https://aws.amazon.com/about-aws/whats-new/2019/11/introducing-aws-step-functions-data-science-sdk-amazon-sagemaker/

Many good tools exist, and you can run any of the tools we saw earlier on AWS. Remember: tools are meant to make your life easier. Don't get fixated on the tools; work backwards from the problem you are trying to solve. Think about your existing software engineering workflows and tools, and ask which tools will best augment what you already have and which your people are most comfortable with. The AWS approach is to use the tools that work for you.
  • #16: It's easy to think of SageMaker as just a notebook. The key thing to remember is that the notebook UI we see a lot in the demos is only part of the SageMaker platform, and an optional part at that! The notebook is the front-end environment in which we experiment with our data and code; keep that instance a low-cost resource.

The value of separation: when we're ready to train or deploy a model, we spin up separate, dedicated infrastructure in the SageMaker container runtime, which gives us lots of flexibility to choose resources cost-effectively and to pay only for what we need. All managed.

The orchestration SageMaker provides is closely integrated with two other services. The images defining our containers need to be stored in Amazon ECR (there's currently no integration for external registries like DockerHub; if you have a particular technology in mind, the service team would appreciate the feedback!). And the preferred storage platform, not just for input data but also for model artifacts and other workflow outputs, is Amazon S3. Why? It's the most integrated and arguably most mature storage service, with storage tiers, security models, and high durability: everything you need for a data lake.

Recapping: four things. So let's look at how that end-to-end process works.
  • #17: To start with I have: the data I want to train on, prepared and loaded to S3 (pre-processed from the notebook, though other services like Glue or Processing Jobs are also options); the training script I'd like to run (e.g. defining the neural network shape and fitting routine), sitting on the notebook instance where I'm working, with minimum code; and one of the pre-built SageMaker framework container images in Amazon ECR, maybe TensorFlow, PyTorch, or MXNet. Repeatable, controlled, reproducible.
  • #18: So what's happening when we start a training job by calling estimator.fit() in those earlier examples? We're going to see a lot of arrows here; the cool thing to remember is that all of the arrows are things SageMaker does for you, not things you need to do yourself!

First, assuming you provide a custom code script (or folder of code), the SageMaker SDK zips it up and uploads it to a new location in S3. So you can't forget to check your working version into git, and you won't lose track of the version that worked well in the middle of your experiments: the results are traceable to the code that created them.

Next, SageMaker spins up whatever infrastructure you asked for in the fit() request and pulls down the Docker image to run on it. SageMaker also starts downloading your source data from S3 into the container: no messing about with S3 API calls in your script; your code can read it from a folder, just as if you were running locally, with environment parameters telling it where.

As the container fires up, the framework application does a load of helpful prep, including one particularly important thing: it installs any additional inline dependencies specified for your custom code, then starts it up and passes in the parameters of the training job. Your code runs, prints status to the console, and saves the trained model to disk just like you normally would. SageMaker takes care of zipping and uploading that final model to S3, along with other output mechanisms like sending the logs to CloudWatch and collecting metrics. You pay only for what the job uses.

So the benefit we've gained is that our custom code can be quite simple: load a CSV from file, fit a random forest, save it to file, etc. We can even specify additional dependencies via a requirements.txt file, and SageMaker plus the framework container will orchestrate these overhead tasks to give us this nice lineage-traceable workflow, with all of the cool features we talked about earlier and no extra code complexity on our part.
  • #19: When it's time to deploy that model to an inference endpoint, we simply reference: our model artifact tarball from S3; an inference container (which might be the same image as for training, or a different one, because the run-time dependencies may be optimized differently); and maybe some custom code again. This time it's just a few helper functions we might want to customize in the built-in inference flow, such as how to de/serialize requests and responses (e.g. a custom JSON input format), or how the model file(s) are loaded from disk into memory if the process differs from the standard one.

As in training, SageMaker handles the creation of infrastructure and the loading of these components for us. If we used the estimator pattern from the high-level SageMaker SDK, all we need to call is a single estimator.deploy(…) function to make it happen. Again the intent is that any custom code needed can be small: just a few optional functions for serialization, model loading, etc., rather than writing and maintaining a model server or integrations with TorchServe, TensorFlow Serving, and the like.
  • #20: Not today, but… In SageMaker, batch transform jobs function pretty much identically to real-time inference endpoints from a user-code point of view: the batch transform engine handles reading your source data from S3, feeding it through your model, storing the results back to S3, and shutting down the resources again as soon as the job is done. Pay only for what the job uses.
  • #21: Mechanism: what's easiest for the different personas? There's a skill-set dependency and a learning curve. So that's our overview picture for framework containers: you write pretty minimal code, just as you usually would when experimenting in your notebook. But instead of running that code locally, which can make things like infrastructure optimization, experiment tracking, and inference deployment tricky, SageMaker provides streamlined, high-level APIs to trigger containerized training and inference jobs (or deploy endpoints) on separate infrastructure. At the fundamental level the system is super flexible, because you can bring fully custom container images and model artifact tarballs; and the framework container images together with the SageMaker SDK library (for your notebook) enable this higher-level, container-plus-custom-code workflow. Same content as this morning, just a different drawing: it solves the problems around experimenting, tracking, etc.
  • #22: Also lessons learned & best practices
  • #24: The Repeatable stage is generally focused on applying automation as the number of machine learning workloads running in production increases. At this stage many of the activities in building, training, and deploying machine learning models are automated. The introduction of automation reduces manual hand-offs between teams and the operational overhead of previously manual/ad-hoc tasks. The ability to orchestrate machine learning workflows into automated pipelines also depends on having a data strategy and automated data-processing tasks. Key capabilities: queue management (manage, schedule, and prioritize tasks); resource management (access to horizontally scalable compute that can scale with workflow task requirements); workflow operators (error handling, retry, and conditional-logic functions); workflow logs (centralized logs and configuration parameters for execution-level and task-level logs).

The Reliable stage builds on the automation from the Repeatable stage, but aims to balance automation with practices that increase quality, enable end-to-end traceability, increase reliability through automatic rollbacks, increase visibility into development and operational health, and ensure repeatability. At this stage the MLOps practices of infrastructure-as-code/configuration-as-code, continuous integration, continuous delivery/deployment, and continuous monitoring are introduced.