Apply MLOps At Scale
Keven(Qi) Wang
Linkedin: https://www.linkedin.com/in/kevenqiwang/
Medium: https://medium.com/@kevenwang_33862
Lead AI Architect @ H&M
Agenda
AI journey @H&M
Quick facts and use cases
Reference Architecture gen1
ML process and ML training
Reference Architecture gen2
MLOps and Operationalize AI
AI journey @H&M
Quick facts and use cases
General Information
74 markets
5000+ stores
177,000 employees
More than
Over
Sales including VAT SEK 210 billion (2018)
E-commerce in 51 markets
Our Journey
2016
Exploration
Run initial PoCs
Test AA appetite &
applicability
2017
Initiation
Industrialize early use cases
Defining organization and
capability needs
Establishing the IT / data
environment
2018
Establish AA & AI
function
Roll-out & hand over of
successful pilots
Establishing AA-WoW,
team, governance
2019
AA Leader
Increasingly data &
algo-driven retail business
Analytical support
across entire value chain
Strong internal AA teams
Engage in partnership with
strong AI players
2022
AI Leader of the Fashion
Industry
Lead the frontier of AI at scale in
delivering customer value
Global leader in developing
talent pools and supporting
AI hubs and networks
AI-powered tools and capabilities
supporting core processes and business
decisions in all functions
World leading ecosystem of cutting edge
AI partners
Today
Algo library, IT platform, Business Impact
H&M use cases
Analytics and Data Platform
Logistics
Production
Sales
Marketing
Design / Buying
Assortment quantification
Fashion Forecast
Allocation Markdown Online
Markdown Store
Personalized Promotions,
Recommendations &
Journeys
Movebox
Knowledge &
Best Practice
AI exploration
and Research
Rapid Dev
enablement
AI platform
AI @ H&M quick facts
100+ co-located
FTEs
Growing # of
colleagues
30+ different
nationalities
Several
nationalities
Combined
teams
Sprints
Standups
Product
mgmt.
Epics
Algo
Cloud
New ways of
working
Consultants
HAAL
Azure Databricks
Reference Architecture gen1
ML process and ML training
Starting point – fragmented architecture
ML Process and Tooling
Model Deployment
Model training
Data
acquisition
Data
preparation
Feature
Engineering
Model training
Model
repository
Unseen data
acquisition
Data
preparation
Transform
data into
feature
Model
prediction Results
Deployment orchestration
Data storage
Training orchestration
Data Lake Store
Model and data versioning
Automated, e2e feedback loop
e2e monitoring
Interactive model development
Kubernetes
Container
Registry
Triggering
CI Orchestrator
Model
repository
Azure Databricks
PyCharm
1 Code commit
2 Code static check, unit test, packaging
3.1 Push to DBFS
3.2 Trigger pipeline
4.1 Job execution
4.2 Log model info
4.3 Commit model
5.1 Fetch model
5.2 Build container image
6 Push image
7 Auto deploy
Automated model training pipeline 1
Scenario 1
• Geo location l1
• Product type p1
• Time t1
Scenario 2
• Geo location l2
• Product type p2
• Time t2
Scenario 3
• Geo location l3
• Product type p3
• Time t3
Scenario i
• Geo location li
• Product type pi
• Time ti
Scenario set
Per scenario: Source data, Prep data, Feature engine…, Train, Optimize
Databricks Cluster
Databricks Cluster
Databricks Cluster
VM
VM
Container
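The fan-out on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each scenario is taken to be one (geo location, product type, time) tuple that runs the same steps independently, a thread pool stands in for the separate Databricks clusters, VMs and containers, and every name below (including `feature_step`, a placeholder for the truncated "Feature engine…" label) is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Scenario:
    geo_location: str
    product_type: str
    time: str


# Step labels follow the slide; "feature_step" is a placeholder because the
# slide label is truncated ("Feature engine…").
STEPS = ("source_data", "prep_data", "feature_step", "train", "optimize")


def run_scenario(scenario: Scenario) -> List[str]:
    """Run every step in order for one scenario and return a log of what ran."""
    key = f"{scenario.geo_location},{scenario.product_type},{scenario.time}"
    return [f"{step}({key})" for step in STEPS]


# Hypothetical scenario set mirroring the slide's l1/p1/t1 naming.
scenario_set = [
    Scenario("l1", "p1", "t1"),
    Scenario("l2", "p2", "t2"),
    Scenario("l3", "p3", "t3"),
]

# The thread pool is a stand-in for independent compute targets.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = dict(zip(scenario_set, pool.map(run_scenario, scenario_set)))
```

Because the scenarios share no state, the number of parallel runs is limited only by how much compute is available.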
Automated model training pipeline 2
Scenario set
Scenario task 1 (repeated per task): Source data, Prep data, Feature engine…, Train, Optimize
DAG
Scenario set
Scenario 1, Scenario 2, Scenario 3, Scenario i (each): Source data, Prep data, Feature engine…, Train, Optimize
Databricks Cluster
Databricks Cluster
Databricks Cluster
Azure Kubernetes Service
Container Registry
Airflow Logs
Airflow dags
Persistent Volume
Airflow Webserver
Airflow Scheduler
Kubernetes Pod
Azure File share
Airflow MetaDB
Trick for Airflow dependency challenge
Actual
python method
Little trick:
python_callable
Call the function
without importing the
module
For more detail, check this blog post:
https://medium.com/@kevenwang_33862/machine-learning-in-production-2-large-scale-ml-training-889cde94f26d
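The slide does not show the code behind this trick, so the following is only one plausible reading, not the author's implementation. It assumes the goal is to keep the DAG file free of heavy ML imports by resolving the function at call time. `lazy_callable` is a hypothetical helper, and a standard-library function stands in for the real training method so the sketch runs anywhere.

```python
import importlib
from typing import Any, Callable


def lazy_callable(module_name: str, function_name: str) -> Callable[..., Any]:
    """Return a callable that imports `module_name` only when it is invoked."""

    def _call(*args: Any, **kwargs: Any) -> Any:
        # The import happens here, at execution time, not when the DAG file
        # is parsed.
        module = importlib.import_module(module_name)
        return getattr(module, function_name)(*args, **kwargs)

    _call.__name__ = function_name
    return _call


# Stand-in for a heavy training module: statistics.mean keeps this runnable.
train = lazy_callable("statistics", "mean")
result = train([1, 2, 3])
```

In Airflow, a wrapper like `train` could be passed as the `python_callable` argument of a `PythonOperator`. The environment that finally executes the task still needs the module installed.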
Evolve to scale and industrialize across H&M
Make AI available
for product teams
across H&M Group
Facilitate scalability and
specialization
Continue to build world-class AI
products, engines and core
components
Proven the value,
use case by use case
Now: to reach the next level we
need to industrialize and
scale AI across H&M
Reference Architecture gen2
MLOps and Operationalize AI
Version compatibility
Reproducibility
Approval process
Model format
Experiment strategy
Feedback loop
Model traceability
Model metadata
Deployment strategy
MLOps
Scalability
MLOps tech stack
Model development - Interactive vs. Automated
▪ AI product lifecycle
▪ Notebook and Python modules
▪ Container as first-class citizen
▪ Airflow vs. Kubeflow
Model serving – deployment strategy
Router Model 1.1
Router
(canary)
Model 1.1
Model 1.2
Router
(shadow)
Model 1.1
Model 1.2
Router
Model A1
Model A2
Model A3
Router
Model A1
Model A2
Model A3
Reward
System
Release Strategies
Experiment Strategy: A/B test
Experiment Strategy: Multi-armed Bandit
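The routers named on this slide differ mainly in how they pick a model per request. The sketch below is a rough illustration, not the serving stack used at H&M. It assumes models are plain functions, uses epsilon-greedy as one simple bandit policy, and treats a fixed-split A/B test as the same mechanism as the canary router. All class and function names are hypothetical.

```python
import random
from typing import Callable, Dict, List, Tuple

Model = Callable[[float], float]


class CanaryRouter:
    """Send a fixed fraction of requests to the new model version."""

    def __init__(self, stable: Model, canary: Model, canary_fraction: float, seed: int = 0):
        self.stable, self.canary = stable, canary
        self.canary_fraction = canary_fraction
        self._rng = random.Random(seed)

    def predict(self, x: float) -> float:
        use_canary = self._rng.random() < self.canary_fraction
        return (self.canary if use_canary else self.stable)(x)


class ShadowRouter:
    """Answer with the live model; run the shadow model only for comparison."""

    def __init__(self, live: Model, shadow: Model):
        self.live, self.shadow = live, shadow
        self.shadow_log: List[Tuple[float, float]] = []

    def predict(self, x: float) -> float:
        self.shadow_log.append((x, self.shadow(x)))  # recorded, never returned
        return self.live(x)


class EpsilonGreedyRouter:
    """Bandit: usually pick the best-rewarded model, sometimes explore."""

    def __init__(self, models: Dict[str, Model], epsilon: float = 0.1, seed: int = 0):
        self.models = models
        self.epsilon = epsilon
        self._rng = random.Random(seed)
        self.counts = {name: 0 for name in models}
        self.mean_reward = {name: 0.0 for name in models}

    def choose(self) -> str:
        names = sorted(self.models)
        if self._rng.random() < self.epsilon:
            return self._rng.choice(names)  # explore
        return max(names, key=lambda n: self.mean_reward[n])  # exploit

    def reward(self, name: str, value: float) -> None:
        # Incremental mean of the rewards reported by the reward system.
        self.counts[name] += 1
        self.mean_reward[name] += (value - self.mean_reward[name]) / self.counts[name]


def model_1_1(x: float) -> float:
    return x + 1.0


def model_1_2(x: float) -> float:
    return x + 2.0
```

The shadow router is the safest release step because the new version never answers a request, while the bandit needs a reward signal fed back after each prediction.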
Model serving – Inference Graph
Router 1
(Multi-armed
Bandit)
Router 2
(A/B test)
Model B1
Model B2
Model A1
Model A2
Model A3
Input
Transformer
Output
Transformer
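An inference graph chains these pieces into one request path. The slide does not spell out how the two routers are wired together, so this sketch shows only a single assumed path: input transformer, then an A/B router, then the chosen model, then the output transformer. The feature names `price` and `stock`, the hash-based split and all function names are hypothetical.

```python
import hashlib
from typing import Callable, Dict, List

Model = Callable[[List[float]], float]


def input_transformer(raw: Dict[str, float]) -> List[float]:
    """Turn a raw request into an ordered feature vector."""
    return [raw["price"], raw["stock"]]


def output_transformer(score: float, model_name: str) -> Dict[str, object]:
    """Shape the raw score into the response returned to the caller."""
    return {"model": model_name, "score": round(score, 2)}


def ab_router(request_id: str, models: Dict[str, Model]) -> str:
    """Deterministic A/B split: hash the request id into one bucket per model."""
    names = sorted(models)
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    return names[int(digest, 16) % len(names)]


def predict(request_id: str, raw: Dict[str, float], models: Dict[str, Model]) -> Dict[str, object]:
    """Run one request through the whole graph."""
    features = input_transformer(raw)
    chosen = ab_router(request_id, models)
    return output_transformer(models[chosen](features), chosen)


# Toy stand-ins for Model B1 and Model B2.
models_b: Dict[str, Model] = {"B1": sum, "B2": max}
response = predict("request-42", {"price": 10.0, "stock": 3.0}, models_b)
```

Hashing the request id keeps the assignment stable, so the same request always reaches the same model variant.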
Model management and lifecycle
Model Development
Back Test
Model Approval
Staging
Production
PR pipeline
Back test pipeline
Training
CI pipeline
CD – Staging pipeline
CD – prod pipeline
CI/CD pipeline
develop feature
Pull Req
Infra as code (#dev, #stage, #prod)
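The lifecycle on this slide can be read as a chain of gated promotions. The sketch below assumes the stages run in the order model development, back test, model approval, staging, production, and that a model moves forward only when the gate for its current stage passes. The `ModelVersion` record and `promote` helper are hypothetical and do not describe a specific tool.

```python
from dataclasses import dataclass, field
from typing import List

# Assumed stage order, taken from the stage labels on the slide.
STAGES = ("model_development", "back_test", "model_approval", "staging", "production")


@dataclass
class ModelVersion:
    name: str
    version: str
    stage: str = STAGES[0]
    history: List[str] = field(default_factory=lambda: [STAGES[0]])


def promote(model: ModelVersion, gate_passed: bool) -> ModelVersion:
    """Move the model one stage forward, but only if the current gate passed."""
    if not gate_passed:
        raise ValueError(f"{model.name} {model.version} blocked at {model.stage}")
    index = STAGES.index(model.stage)
    if index == len(STAGES) - 1:
        raise ValueError(f"{model.name} {model.version} is already in production")
    model.stage = STAGES[index + 1]
    model.history.append(model.stage)  # keeps the path traceable
    return model


# Hypothetical model that passes every gate on the way to production.
forecast = ModelVersion("forecast", "1.2")
for _ in range(len(STAGES) - 1):
    promote(forecast, gate_passed=True)
```

Keeping the history on the model record is one simple way to get the traceability that the lifecycle calls for.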
Take away
▪ Problem, Process and Architecture
▪ Platform approach
▪ Leverage cloud native service
Feedback
Your feedback is important to us.
Don’t forget to rate
and review the sessions.
