Kostiantyn Bokhan, PhD
CD4ML based on Azure and Kubeflow
Agenda
1. Introduction to CD4ML
2. Kubeflow
3. Use cases of Kubeflow
4. Installing Kubeflow on Azure - tips and tricks
Introduction to CD4ML
Continuous Delivery for Machine Learning (CD4ML) is a software engineering
approach in which a cross-functional team produces machine learning
applications based on code, data, and models in small and safe increments
that can be reproduced and reliably released at any time, in short adaptation
cycles.
Danilo Sato, Arif Wider, Christoph Windheuser. Continuous Delivery for Machine Learning: https://martinfowler.com/articles/cd4ml.html
Introduction to CD4ML
MLOps: Continuous delivery and automation pipelines in machine learning
MLOps is an ML engineering culture and practice that aims at unifying ML
system development (Dev) and ML system operation (Ops)
Google: https://cloud.google.com/solutions/machine-learning/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning
Introduction to CD4ML
MDLC: Model Development Life Cycle
DLC: Data Life Cycle
Introduction to CD4ML
Based on https://martinfowler.com/articles/cd4ml.html
[Diagram: the CD4ML end-to-end process]
Stages: Data preparation, Model Building, Model Evaluation, Productionize Model, Testing, Deployment, Monitoring and Observability, Experimentation
Code: Labeling code, Training code, Evaluating code, Test code, Application code
Model: model, Candidate models, Chosen models, Productionized models
Data: raw data, training data, test data, validation / test data, production data, metrics
Code and model in production
Introduction to CD4ML
https://martinfowler.com/articles/cd4ml.html
Introduction to CD4ML
Continuous ML
CD4ML
Incremental (continual) ML
Auto ML
Introduction to CD4ML
Pachyderm: an end-to-end model versioning framework that helps create reproducible pipeline definitions, with each processing step packaged in a Docker container.
Amazon SageMaker: a fully managed machine learning service. Developers can quickly and easily build and train machine learning models, and then directly deploy them into a production-ready hosted environment.
MLflow: an open source platform to manage the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry. MLflow currently offers four components.
Kubeflow: a project dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable and scalable.
AzureML: empowers developers with a wide range of productive experiences for building, training, and deploying machine learning models faster. It accelerates time to market and fosters team collaboration with MLOps (DevOps for machine learning), on a secure, trusted platform designed for responsible ML.
Other: Lightbend, Streamlio’s Community Edition, Polyaxon, MFlow, Dataiku, Domino Data Science Platform, ParallelM MCenter, Seldon, MLeap
Kubeflow
Kubeflow Pipelines: a platform for building and deploying portable and scalable end-to-end ML workflows, based on containers.
Katib: automated tuning of your machine learning (ML) model’s hyperparameters and architecture, and more generally a way to implement AutoML.
The Jupyter Notebook: an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.
Fairing: Kubeflow Fairing streamlines the process of building, training, and deploying machine learning (ML) training jobs in a hybrid cloud environment.
Kale: a Python package that aims to automatically deploy a general-purpose Jupyter Notebook as a running Kubeflow Pipelines instance, without requiring the use of the KFP DSL.
Metadata: the goal of the Metadata project is to track and manage metadata of machine learning workflows in Kubeflow.
Kubeflow
TFJob: a Kubernetes custom resource that you can use to run TensorFlow training jobs on Kubernetes, including distributed jobs.
Seldon: an open source platform for deploying machine learning models on a Kubernetes cluster.
PyTorch jobs: you can create and manage PyTorch jobs like other built-in resources in Kubernetes.
The NVIDIA TensorRT Inference Server: provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or gRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server.
TensorFlow Serving: a flexible, high-performance serving system for machine learning models, designed for production environments.
Kubeflow
https://www.kubeflow.org/docs/started/kubeflow-overview/
Use cases - the project background
AI for a Worldwide Logistic Platform
Use cases - the project background
ML tasks: Object detection, OCR, Language modeling, NLP, Anomaly detection, Document matching, Template matching, Pattern recognition, Segmentation, Classification
Applications: Mobile apps, IoT apps, SaaS
Use cases - the project background
Goals of implementing CD4ML
● Integrated infrastructure for AI experiments based on the Jupyter Notebook service
● Automation of all stages of deep learning development at scale:
○ Preprocessing, dataset preparation, augmentation
○ Model training and verification
○ Leveraging AutoML:
■ Neural architecture search based on AutoKeras
■ Training several models simultaneously
○ Optimization of model hyperparameters, which is the most frequent task
● Tracking and analysis of the results obtained, model versioning and metadata tracking
● Model as a service and continuous delivery of models
Use cases
Experiments with Jupyter Notebook
[Diagram labels: Dev, Staging, Prod, Mobile, Labeling]
Use cases
Model building, training and validation
Use cases
CI/CD based on Kubeflow
[Diagram labels: Dev, Staging, Prod, Mobile, Labeling]
Use cases
CI/CD based on Kubeflow for embedded (Jetson Nano)
[Diagram labels: Dev, Staging, Prod, Labeling]
Use cases
CI/CD based on Kubeflow for mobile (Android & iOS)
[Diagram labels: Dev, Staging, Labeling]
Use cases
Mobile inference testing
Use cases
Debug data analytics
Use cases
Labeling quality and contracts testing
Deploying Kubeflow on Azure - issues
Istio
● Istio is outdated
● istioctl is not supported
KFServing
● KFServing is outdated
● TensorFlow 2 is not supported
● It is impossible to override the TensorFlow version
Knative
● Knative is outdated
● The embedded Knative does not support recent versions of Istio
Uninstalling Kubeflow
● The Istio deployment can be deleted
● Kubeflow can’t be uninstalled properly
Installing Kubeflow on Azure - tips and tricks
Deployment stages
1. Creating AKS 1.16
2. Creating & linking ACR
3. Installing Istio 1.5
4. Deploying KNative 0.18
5. Installing Kubeflow 1.1.0
6. Deploying kfserving 0.4.0
7. Deploying other components
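The version pins in the stages above can be checked before a deployment starts. What follows is a minimal Python sketch, assuming the upper bounds stated on these slides (Kubernetes 1.16 and Istio 1.5 for Kubeflow 1.1.0). The `MAX_TESTED` table and both function names are hypothetical and not part of any Kubeflow tooling.

```python
"""Pre-flight check of component versions against the limits named on the slides."""

# Upper bounds taken from the slides; treat them as assumptions for Kubeflow 1.1.0 only.
MAX_TESTED = {
    "kubernetes": (1, 16),
    "istio": (1, 5),
}


def parse_version(text):
    """Turn '1.16.15' or 'v0.18.0' into a tuple of ints, e.g. (1, 16, 15)."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


def check_versions(installed):
    """Return a list of warnings for components newer than the tested bound."""
    warnings = []
    for name, version_text in installed.items():
        bound = MAX_TESTED.get(name)
        if bound is None:
            continue  # no limit known for this component
        # Compare only as many version fields as the bound defines (major, minor).
        major_minor = parse_version(version_text)[: len(bound)]
        if major_minor > bound:
            limit = ".".join(str(part) for part in bound)
            warnings.append(f"{name} {version_text} is newer than {limit}")
    return warnings


if __name__ == "__main__":
    for line in check_versions({"kubernetes": "1.16.15", "istio": "1.6.0"}):
        print(line)
```

Comparing only the major and minor fields means any 1.16.x patch release of Kubernetes passes the check, which matches how the limits are phrased on the slides.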
Installing Kubeflow on Azure - tips and tricks
Creating AKS
• Kubeflow is not fully tested with Kubernetes versions > 1.16
• Node pool name examples:
  npdevcpu - only for CPU tasks: nodeSelector."agentpool"=npdevcpu
  npdevstorage - only for storage services, e.g. Rook, MinIO: nodeSelector."agentpool"=npdevstorage
az aks create --resource-group aigroup \
  --name aicluster \
  --node-count 3 \
  --vm-set-type VirtualMachineScaleSets \
  --nodepool-name npdevcpu \
  --load-balancer-sku standard \
  --kubernetes-version 1.16.15 \
  --node-vm-size Standard_DS3_v2 \
  --generate-ssh-keys \
  --service-principal "XXXXX" \
  --client-secret "XXXXX"
Installing Kubeflow on Azure - tips and tricks
Adding a GPU node pool and installing NVIDIA drivers
npdevgpu - only for GPU tasks: nodeSelector."agentpool"=npdevgpu
nvidia-device-plugin-ds.yaml can be found in the Azure AKS documentation
> az aks nodepool add \
    --cluster-name aicluster \
    --name npdevgpu \
    --resource-group aigroup \
    --node-count 3 \
    --node-vm-size Standard_NC6
> kubectl create namespace gpu-resources
> kubectl apply -f nvidia-device-plugin-ds.yaml
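Once the GPU pool and the device plugin are in place, a workload reaches the GPU nodes through a node selector and a GPU resource limit. Here is a minimal Python sketch that builds such a Pod manifest as JSON, assuming the `npdevgpu` pool name from this slide. The pod name, the image and the function name are hypothetical.

```python
import json


def gpu_pod_manifest(name, image, node_pool="npdevgpu", gpus=1):
    """Build a Pod manifest pinned to a node pool and requesting NVIDIA GPUs."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            # AKS labels every node with its pool name under the "agentpool" key.
            "nodeSelector": {"agentpool": node_pool},
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # The NVIDIA device plugin exposes GPUs as the nvidia.com/gpu resource.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical pod name and image, for illustration only.
    manifest = gpu_pod_manifest("gpu-smoke-test", "example.azurecr.io/train:latest")
    print(json.dumps(manifest, indent=2))
```

Because kubectl accepts JSON as well as YAML, the printed manifest can be piped to `kubectl apply -f -`.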
Installing Kubeflow on Azure - tips and tricks
Creating an ACR and linking it with the AKS
Note: if you are not a subscription owner, you can’t link the ACR with your AKS
# Creating an ACR
az acr create --resource-group aigroup \
  --name aiclusterRegistry \
  --sku Premium
# Creating token
az acr token create -n MyToken -r aiclusterRegistry \
  --scope-map _repositories_admin
# assumes ACR Admin Account is enabled
ACR_NAME=aiclusterRegistry.azurecr.io
ACR_UNAME=tokenname
ACR_PASSWD=tokenpassword
# Creating the secret
kubectl -n yournamespace create secret \
  docker-registry acr-secret \
  --docker-server=$ACR_NAME \
  --docker-username=$ACR_UNAME \
  --docker-password=$ACR_PASSWD \
  --docker-email=ignorethis@email.com
# Patching default serviceaccount
kubectl -n yournamespace patch serviceaccount default \
  -p '{"imagePullSecrets": [{"name": "acr-secret"}]}'
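The `kubectl create secret docker-registry` command stores the credentials in a Secret of type `kubernetes.io/dockerconfigjson`. The Python sketch below builds the JSON payload such a secret holds, which can help when checking a pull secret by hand. The function name is hypothetical, and the credentials are the placeholders from the slide.

```python
import base64
import json


def docker_config_json(server, username, password, email=""):
    """Return the JSON payload stored under `.dockerconfigjson` in the pull secret."""
    # The "auth" field is the base64 encoding of "username:password".
    auth = base64.b64encode(f"{username}:{password}".encode()).decode()
    entry = {"username": username, "password": password, "auth": auth}
    if email:
        entry["email"] = email
    return json.dumps({"auths": {server: entry}})


if __name__ == "__main__":
    # Placeholder values from the slide, not real credentials.
    print(docker_config_json("aiclusterRegistry.azurecr.io", "tokenname", "tokenpassword"))
```

Decoding the `auth` field of an existing secret and comparing it with the token name and password is a quick way to rule out a credential mismatch when image pulls fail.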
Installing Kubeflow on Azure - tips and tricks
Installing Istio
Note: Kubeflow does not support Istio versions > 1.5. The Istio configuration should take the Knative requirements for Istio into account
Istio can be installed with:
• istioctl tool
• helm
• Istio operator
# creating a namespace
kubectl create namespace istio-system --save-config
# installing istio
istioctl manifest apply --set profile=default \
  --set components.policy.enabled=true \
  --set addonComponents.kiali.enabled=true \
  --set addonComponents.grafana.enabled=true \
  --set addonComponents.tracing.enabled=true \
  --set values.global.defaultNodeSelector."agentpool"=npdevcpu \
  --set values.global.useMCP=false \
  --set values.global.proxy.autoInject=disabled
Installing Kubeflow on Azure - tips and tricks
Installing KNative
Note: the KNative requirements for Istio are outdated because the Istio config parameters have changed
kubectl apply \
  --filename https://github.com/knative/serving/releases/download/v0.18.0/serving-crds.yaml
kubectl apply \
  --filename https://github.com/knative/serving/releases/download/v0.18.0/serving-core.yaml
kubectl apply \
  --filename https://github.com/knative/net-istio/releases/download/v0.18.0/release.yaml
# Optional, please refer to the installation guide
kubectl apply \
  --filename https://github.com/knative/serving/releases/download/v0.18.0/serving-default-domain.yaml
Installing Kubeflow on Azure - tips and tricks
Installing Kubeflow
Note: because some embedded components are installed separately, they should be removed from the Kubeflow manifest, kfctl_k8s_istio.v1.1.0.yaml:
• istio-stack
• knative
• kfserving
Important! The folder {clustername} created by kfctl should be kept, because it is needed for uninstalling and reconfiguring Kubeflow
...
applications:
...
- kustomizeConfig:
    repoRef:
      name: manifests
      path: application/v3
  name: application
- kustomizeConfig:
    repoRef:
      name: manifests
      path: stacks/kubernetes/application/istio-1-3-1-stack
  name: istio-stack
- kustomizeConfig:
    repoRef:
      name: manifests
      path: stacks/kubernetes/application/cluster-local-gateway-1-3-1
  name: cluster-local-gateway
...
# Installing kubeflow
kfctl apply -V -f kfctl_k8s_istio_fixed.v1.1.0.yaml
# Deleting kubeflow
kfctl delete -V -f kfctl_k8s_istio_fixed.v1.1.0.yaml
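Removing the separately installed components from the manifest amounts to filtering the `applications` list by name. Below is a minimal Python sketch of that step, working on an already parsed list of entries (for instance one loaded with a YAML library, which is not shown here). It assumes that the bullet names on the slide match the `name` fields in the manifest; the function name is hypothetical.

```python
def drop_applications(applications, excluded):
    """Return the application entries whose `name` is not in `excluded`."""
    return [app for app in applications if app.get("name") not in excluded]


# Components installed separately, as listed on the slide (assumed to match `name`).
EXCLUDED = {"istio-stack", "knative", "kfserving"}

# The three entries shown in the manifest fragment above.
applications = [
    {
        "kustomizeConfig": {"repoRef": {"name": "manifests", "path": "application/v3"}},
        "name": "application",
    },
    {
        "kustomizeConfig": {
            "repoRef": {
                "name": "manifests",
                "path": "stacks/kubernetes/application/istio-1-3-1-stack",
            }
        },
        "name": "istio-stack",
    },
    {
        "kustomizeConfig": {
            "repoRef": {
                "name": "manifests",
                "path": "stacks/kubernetes/application/cluster-local-gateway-1-3-1",
            }
        },
        "name": "cluster-local-gateway",
    },
]

if __name__ == "__main__":
    for app in drop_applications(applications, EXCLUDED):
        print(app["name"])
```

The filter returns a new list and leaves the input untouched, so the original manifest content stays available for comparison before the fixed file is written.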
Installing Kubeflow on Azure - tips and tricks
kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80
Go to http://localhost:8080
Installing Kubeflow on Azure - tips and tricks
Kfserving example - kfserving-tenzorflow-2.yaml
apiVersion: "serving.kubeflow.org/v1alpha2"
kind: "InferenceService"
metadata:
  name: "mnist"
spec:
  default:
    predictor:
      tensorflow:
        runtimeVersion: 2.3.0
        storageUri: "https://modelstorage.blob.core.windows.net/mnist/"
> kubectl -n mnist apply -f kfserving-tenzorflow-2.yaml
Installing Kubeflow on Azure - tips and tricks
Kfserving example - kfserving-tenzorflow-2.yaml
> kubectl label namespace mnist knative-eventing-injection=enabled
> kubectl label namespace mnist istio-injection=enabled
> kubectl label namespace mnist serving.kubeflow.org/inferenceservice=enabled
> kubectl label namespace mnist katib-metricscollector-injection=enabled
> kubectl -n mnist apply -f kfserving-tenzorflow-2.yaml
Installing Kubeflow on Azure - tips and tricks
> curl -v http://mnist-predictor-default.mnist.1.1.1.1.xip.io/v1/models/mnist/metadata
{
  "model_spec": {
    "name": "mnist",
    "signature_name": "",
    "version": "1"
  },
  "metadata": {"signature_def": {
    "signature_def": {
      "serving_default": {
        "inputs": {
          "inputs": {
            "dtype": "DT_STRING",
            "tensor_shape": {
              "dim": [],
              "unknown_rank": true
            },
            "name": "tf_example:0"
          }
        },
...
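The metadata response can also be inspected from code. The following is a minimal Python sketch, assuming the nested `metadata` / `signature_def` / `signature_def` layout shown in the response above. Both function names are hypothetical, and `SAMPLE` is a hand-trimmed stand-in that contains only the fields visible on the slide, not a full server response.

```python
import json


def metadata_url(host, model):
    """Build the REST metadata URL for a served model."""
    return f"http://{host}/v1/models/{model}/metadata"


def input_dtypes(metadata, signature="serving_default"):
    """Map each input name of a signature to its dtype string."""
    signature_defs = metadata["metadata"]["signature_def"]["signature_def"]
    inputs = signature_defs[signature]["inputs"]
    return {name: spec["dtype"] for name, spec in inputs.items()}


# Hand-trimmed sample limited to the fields shown on the slide.
SAMPLE = json.loads("""
{
  "model_spec": {"name": "mnist", "signature_name": "", "version": "1"},
  "metadata": {"signature_def": {"signature_def": {"serving_default": {
    "inputs": {"inputs": {
      "dtype": "DT_STRING",
      "tensor_shape": {"dim": [], "unknown_rank": true},
      "name": "tf_example:0"
    }}
  }}}}
}
""")

if __name__ == "__main__":
    print(metadata_url("mnist-predictor-default.mnist.1.1.1.1.xip.io", "mnist"))
    print(input_dtypes(SAMPLE))
```

Checking the input dtype before sending prediction requests helps catch a model exported with an unexpected serving signature.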
ANY QUESTIONS?

Kostiantyn Bokhan, N-iX. CD4ML based on Azure and Kubeflow

  • 1. Kostiantyn Bokhan, PhD CD4ML based on Azure and Kubeflow 1
  • 2. Agenda 1. Introduction to CD4ML 2. Kubeflow 3. Use cases of kubeflow 4. Installing Kubeflow on Azure - tips and tricks 2
  • 3. Introduction to CD4ML Continuous Delivery for Machine Learning (CD4ML) is a software engineering approach in which a cross-functional team produces machine learning applications based on code, data, and models in small and safe increments that can be reproduced and reliably released at any time, in short adaptation cycles. Danilo Sato, Arif Wider, Christoph Windheuser. Continuous Delivery for Machine Learning: - https://martinfowler.com/articles/cd4ml.html 3
  • 4. Introduction to CD4ML MLOps: Continuous delivery and automation pipelines in machine learning MLOps is an ML engineering culture and practice that aims at unifying ML system development (Dev) and ML system operation (Ops) Google: - https://cloud.google.com/solutions/machine-learning/mlops-continuous-delivery-and-automation-pipelines- in-machine-learning 4
  • 7. Introduction to CD4ML MDLC Model Development Life Cycle DLC Data Life Cycle 7
  • 8. Introduction to CD4ML Based on https://martinfowler.com/articles/cd4ml.html Data preparation Model Building Model Evaluation Productionize Model Testing Deployment Monitoring and Observability Experimentation Labeling code Training code Evaluating code Test code Application code Candidate models Chosen models Productionized models model training data test data production data validation / test data metrics code and model in production CodeModelData raw data 8
  • 10. Introduction to CD4ML Continuous ML CD4ML Incremental (continual) ML Auto ML 10
  • 11. Introduction to CD4ML Pachyderm is an end to end model versioning framework to help create reproducible pipeline definitions, with each processing step packaged in a Docker container Pachyderm Amazon SageMaker is a fully managed machine learning service. Developers can quickly and easily build and train machine learning models, and then directly deploy them into a production-ready hosted environment. Amazon SageMaker MLFlow is an open source platform to manage the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry. MLflow currently offers four components MLFlow The Kubeflow project is dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable and scalable. Kubeflow AzureML - empower developers with a wide range of productive experiences for building, training, and deploying machine learning models faster. Accelerate time to market and foster team collaboration with industry-leading MLOps—DevOps for machine learning. Innovate on a secure, trusted platform, designed for responsible ML AzureML Lightbend, Streamlio’s Community Edition, Polyaxon, MFlow, Daitaku, Domino Data Science Platform, ParallelM MCenter, Seldon, MLeap Other 11
  • 12. Kubeflow Kubeflow Pipelines is a platform for building and deploying portable and scalable end-to-end ML workflows, based on containers. Kubeflow Pipelines Use Katib for automated tuning of your machine learning (ML) model’s hyperparameters and architecture as well as implementing AutoML at all. Katib The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more The Jupyter Notebook Kubeflow Fairing streamlines the process of building, training, and deploying machine learning (ML) training jobs in a hybrid cloud environment. Fairing Kale is a Python package that aims at automatically deploy a general purpose Jupyter Notebook as a running Kubeflow Pipelines instance, without requiring the use the specific KFP DSL Kale The goal of the Metadata project is to track and manage metadata of machine learning workflows in Kubeflow. Metadata 12
  • 13. Kubeflow TFJob is a Kubernetes custom resource that you can use to run TensorFlow training jobs on Kubernetes including distributed jobs. TFJob Seldon is an open source platform for deploying machine learning models on a Kubernetes cluster. Seldon You can create and manage PyTorch jobs like other built-in resources in Kubernetes PyTorch jobs The NVIDIA TensorRT Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. The NVIDIA TensorRT Inference Server TensorFlow Serving is a flexible, high-performance serving system for machine learning models, designed for production environments. TensorFlow Serving 13
  • 15. Uses cases - the project background AI for a Worldwide Logistic Platform
  • 16. Uses cases - the project background Object detection OCR Language modeling NLP Anomaly detection Document matching Template matchingPattern recogntion Segmentation Classification Mobile Apps IOT Apps SaaS
  • 17. Uses cases - the project background Goals of implementing CD4ML 17 ● Integrated Infrastructure for AI experiments based on Jupyter Notebook service ● Automatization of all stages of deep machine learning development in scale: ○ Preprocessing, Dataset preparation, Augmentation ○ Model training and verification ○ Leveraging Automl: ■ Neural architecture search based on AutoKears ■ Training several models simultaneously ○ Optimization of model hyperparameters that is the most frequent task ● Tracking and analysis the results obtained, model versioning and metadata tracking ● Model as a service and Model continuous delivery
  • 19. Uses cases Model building, training and validation 19 Mb
  • 20. Uses cases CI / CD based on the Kubeflow Dev Prod Staging Mobile Labeling 20 Mb
  • 21. Uses cases CI / CD based on the Kubeflow for embedded (Jetson Nano) Dev Prod Staging Labeling 21 Mb
  • 22. Uses cases CI / CD based on the Kubeflow for mobile (Android & iOs) Dev Staging Labeling 22 Mb
  • 25. Uses cases Debug data analytics 25
  • 26. Uses cases Labeling quality and contracts testing 26 Labeling
  • 27. Deployment Kubeflow on Azure - issues Istio KFserving Knative Uninstalling of Kubeflow ● Istio is outdated ● istioctl is not supported ● KFserving is outdated ● Tensorflow 2 is not supported ● It is impossible to override version of tensorflow ● Knative is outdated ● Embedded Knative is not support fresh versions of istio ● Istio deployment can be deleted ● Kubeflow can’t be uninstalled properly 27
  • 28. Installing Kubeflow on Azure - tips and tricks Deployment stages 28 Creating AKS 1.16 Creating & linking ACR Installing Istio 1.5 Deploying KNative 0.18 Installing Kubeflow 1.1.0 Deploying kfserving 0.4.0 Deploying other components
  • 29. Installing Kubeflow on Azure - tips and tricks Creating AKS • Kubeflow is not fully tested with kubernetes versions > 1.16 • nodepool-name examples: npdevcpu - only for CPU tasks: nodeSelector."agentpool"=npdevcpu npdevstorage: only for storage services, e.g. Rook, Minio etc nodeSelector."agentpool"=npdevstorage 29 az aks create --resource-group aigroup --name aicluster --node-count 3 --vm-set-type VirtualMachineScaleSets --nodepool-name npdevcpu --load-balancer-sku standard --kubernetes-version 1.16.15 --node-vm-size Standard_DS3_v2 --generate-ssh-keys --service-principal "XXXXX" --client-secret "XXXXX"
  • 30. Installing Kubeflow on Azure - tips and tricks Adding GPU node pool and install Nvidia drivers npdevgpu - only for GPU tasks: nodeSelector."agentpool"=npdevgpu nvidia-device-plugin-ds.yaml is can be found in the Azure AKS dcumentation 30 > az aks nodepool add --cluster-name aicluster --name npdevgpu --resource-group aigroup --node-count 3 --node-vm-size Standard_NC6 > kubectl create namespace gpu-resources > kubectl apply -f nvidia-device-plugin-ds.yaml
  • 31. Installing Kubeflow on Azure - tips and tricks Creating an ACR and linking with the AKS Note: if you are not a subscription owner you can’t link the ACR with your AKS 31 # assumes ACR Admin Account is enabled ACR_NAME=aiclusterRegistry.azurecr.io ACR_UNAME=tokenname ACR_PASSWD=tokenpassword # Creating the secret kubectl -n yournamespace create secret docker-registry acr-secret --docker-server=$ACR_NAME --docker-username=$ACR_UNAME --docker-password=$ACR_PASSWD --docker-email=ignorethis@email.com # Patching default serviceaccount kubectl -n yournamespace patch serviceaccount default -p '{"imagePullSecrets": [{"name": "acr-secret"}]}' # Creating an ACR az acr create --resource-group aigroup --name aiclusterRegistry --sku Premium # Creating token az acr token create -n MyToken -r aiclusterRegistry --scope-map _repositories_admin
  • 32. Installing Kubeflow on Azure - tips and tricks
    Installing Istio
    Note: Kubeflow does not support Istio versions later than 1.5. The Istio configuration should take the KNative requirements for Istio into account.
    Istio can be installed with:
    • istioctl tool
    • helm
    • Istio operator
    # creating a namespace
    kubectl create namespace istio-system --save-config
    # installing istio
    istioctl manifest apply \
        --set profile=default \
        --set components.policy.enabled=true \
        --set addonComponents.kiali.enabled=true \
        --set addonComponents.grafana.enabled=true \
        --set addonComponents.tracing.enabled=true \
        --set values.global.defaultNodeSelector."agentpool"=npdevcpu \
        --set values.global.useMCP=false \
        --set values.global.proxy.autoInject=disabled
  • 33. Installing Kubeflow on Azure - tips and tricks
    Installing KNative
    Note: the KNative requirements for Istio are outdated because Istio's configuration parameters have changed.
    kubectl apply --filename https://github.com/knative/serving/releases/download/v0.18.0/serving-crds.yaml
    kubectl apply --filename https://github.com/knative/serving/releases/download/v0.18.0/serving-core.yaml
    kubectl apply --filename https://github.com/knative/net-istio/releases/download/v0.18.0/release.yaml
    # Optional; please refer to the installation guide
    kubectl apply --filename https://github.com/knative/serving/releases/download/v0.18.0/serving-default-domain.yaml
  • 34. Installing Kubeflow on Azure - tips and tricks
    Installing Kubeflow
    Note: because some embedded components are installed separately, they should be removed from the Kubeflow manifest kfctl_k8s_istio.v1.1.0.yaml:
    • istio-stack
    • knative
    • kfserving
    Important! The folder {clustername} created by kfctl should be kept; it is needed for uninstalling and reconfiguring.
    ...
    applications:
    ...
    - kustomizeConfig:
        repoRef:
          name: manifests
          path: application/v3
      name: application
    - kustomizeConfig:
        repoRef:
          name: manifests
          path: stacks/kubernetes/application/istio-1-3-1-stack
      name: istio-stack
    - kustomizeConfig:
        repoRef:
          name: manifests
          path: stacks/kubernetes/application/cluster-local-gateway-1-3-1
      name: cluster-local-gateway
    ...
    # Installing kubeflow
    kfctl apply -V -f kfctl_k8s_istio_fixed.v1.1.0.yaml
    # Deleting kubeflow
    kfctl delete -V -f kfctl_k8s_istio_fixed.v1.1.0.yaml
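Removing the separately installed components comes down to dropping entries from the manifest's `applications` list by name. A minimal sketch of that filtering step, assuming the list has already been parsed into Python objects (for example with a YAML parser, which is not shown). `drop_applications` is a hypothetical helper, not part of kfctl, and the entry names are the ones listed on the slide.

```python
from copy import deepcopy

# Components the slide says are installed separately.
SEPARATELY_INSTALLED = {"istio-stack", "knative", "kfserving"}


def drop_applications(applications: list, names: set) -> list:
    """Return a copy of an `applications` list without the named entries."""
    return [deepcopy(app) for app in applications if app.get("name") not in names]


if __name__ == "__main__":
    # Excerpt of the applications list shown on the slide.
    applications = [
        {"name": "application",
         "kustomizeConfig": {"repoRef": {
             "name": "manifests",
             "path": "application/v3"}}},
        {"name": "istio-stack",
         "kustomizeConfig": {"repoRef": {
             "name": "manifests",
             "path": "stacks/kubernetes/application/istio-1-3-1-stack"}}},
        {"name": "cluster-local-gateway",
         "kustomizeConfig": {"repoRef": {
             "name": "manifests",
             "path": "stacks/kubernetes/application/cluster-local-gateway-1-3-1"}}},
    ]
    kept = drop_applications(applications, SEPARATELY_INSTALLED)
    print([app["name"] for app in kept])
```

The original list is left untouched, so the unmodified manifest can still be compared against the fixed one before it is written out as kfctl_k8s_istio_fixed.v1.1.0.yaml.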
  • 35. Installing Kubeflow on Azure - tips and tricks
    kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80
    Go to http://localhost:8080
  • 36. Installing Kubeflow on Azure - tips and tricks
    Kfserving example - kfserving-tenzorflow-2.yaml
    apiVersion: "serving.kubeflow.org/v1alpha2"
    kind: "InferenceService"
    metadata:
      name: "mnist"
    spec:
      default:
        predictor:
          tensorflow:
            runtimeVersion: 2.3.0
            storageUri: "https://modelstorage.blob.core.windows.net/mnist/"
    > kubectl -n mnist apply -f kfserving-tenzorflow-2.yaml
  • 37. Installing Kubeflow on Azure - tips and tricks
    Kfserving example - kfserving-tenzorflow-2.yaml
    > kubectl label namespace mnist knative-eventing-injection=enabled
    > kubectl label namespace mnist istio-injection=enabled
    > kubectl label namespace mnist serving.kubeflow.org/inferenceservice=enabled
    > kubectl label namespace mnist katib-metricscollector-injection=enabled
    > kubectl -n mnist apply -f kfserving-tenzorflow-2.yaml
  • 38. Installing Kubeflow on Azure - tips and tricks
    > curl -v http://mnist-predictor-default.mnist.1.1.1.1.xip.io/v1/models/mnist/metadata
    {
      "model_spec": {
        "name": "mnist",
        "signature_name": "",
        "version": "1"
      },
      "metadata": {
        "signature_def": {
          "signature_def": {
            "serving_default": {
              "inputs": {
                "inputs": {
                  "dtype": "DT_STRING",
                  "tensor_shape": {
                    "dim": [],
                    "unknown_rank": true
                  },
                  "name": "tf_example:0"
                }
              },
              ...
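The metadata call can also be handled from a script. The sketch below builds the URL from the host pattern seen in the curl call (`<model>-predictor-default.<namespace>.<ingress IP>.xip.io`) and extracts the input names and dtypes from a response of the shape shown above. Both function names are hypothetical, the host pattern is assumed to generalise from this one example, and the HTTP request itself is left out so the example runs offline.

```python
import json


def metadata_url(model: str, namespace: str, ingress_ip: str) -> str:
    """Build the metadata URL following the host pattern of the curl call."""
    host = f"{model}-predictor-default.{namespace}.{ingress_ip}.xip.io"
    return f"http://{host}/v1/models/{model}/metadata"


def input_signature(metadata_json: str, signature: str = "serving_default") -> dict:
    """Map each input name of a signature to its dtype."""
    doc = json.loads(metadata_json)
    inputs = doc["metadata"]["signature_def"]["signature_def"][signature]["inputs"]
    return {name: spec["dtype"] for name, spec in inputs.items()}


if __name__ == "__main__":
    # Trimmed response containing only the fields visible on the slide.
    sample = json.dumps({
        "model_spec": {"name": "mnist", "signature_name": "", "version": "1"},
        "metadata": {"signature_def": {"signature_def": {"serving_default": {
            "inputs": {"inputs": {
                "dtype": "DT_STRING",
                "tensor_shape": {"dim": [], "unknown_rank": True},
                "name": "tf_example:0",
            }}
        }}}},
    })
    print(metadata_url("mnist", "mnist", "1.1.1.1"))
    print(input_signature(sample))
```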