Machine Learning on
Kubernetes
13 Dec 2017
Anirudh Ramanathan
Software Engineer on Kubernetes
Twitter: @anirudh4444
Disclaimer
I’m not a Machine Learning expert.
I work on infrastructure and distributed systems for a
living.
Kubernetes a year ago...
● Was used primarily for stateless workloads
● Needed an understanding of several core concepts to operate
● Applications had to be written to fit into core controller abstractions
Kubernetes today...
● Has abstractions to support Stateful applications and now data
processing and machine learning.
● Has a wide range of extension points including ones that allow API
extensions and custom controllers.
● Has support for building higher level abstractions and APIs to hide
infrastructure & operational complexity.
What’s changed?
● Workload controller abstractions moving to GA/stable.
● Custom Resource Definitions & Aggregated API Servers
● Kubernetes Operators
● Community support for external frameworks
● Work on scheduling and resource management (ongoing)
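The operator pattern listed above comes down to a control loop that compares desired state with observed state and acts on the difference. Here is a minimal sketch in plain Python with no Kubernetes client involved. The function name `reconcile` and the pod naming scheme are made up for illustration.

```python
from typing import List, Tuple


def reconcile(desired_replicas: int, running: List[str],
              prefix: str = "worker") -> Tuple[List[str], List[str]]:
    """Return (pods_to_create, pods_to_delete) needed to reach the desired count.

    Hypothetical helper: a real controller would watch the API server and
    issue create/delete calls instead of returning lists.
    """
    if desired_replicas < 0:
        raise ValueError("desired_replicas must be non-negative")

    current = sorted(running)
    if len(current) > desired_replicas:
        # Too many pods: mark the surplus at the end of the sorted list.
        return [], current[desired_replicas:]

    # Too few pods (or exactly enough): pick unused names until the count fits.
    to_create: List[str] = []
    taken = set(current)
    index = 0
    while len(current) + len(to_create) < desired_replicas:
        name = f"{prefix}-{index}"
        if name not in taken:
            to_create.append(name)
            taken.add(name)
        index += 1
    return to_create, []


if __name__ == "__main__":
    print(reconcile(3, ["worker-0"]))
    print(reconcile(1, ["worker-0", "worker-1"]))
```

Running the same function repeatedly against the latest observed state is what makes the loop self-healing: once the counts match, it returns two empty lists.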
Machine Learning
Solving problems without explicitly knowing
how to create solutions
Machine Learning Infrastructure
TFX: A TensorFlow-Based Production-Scale Machine Learning Platform (KDD 2017)
Kubeflow
https://github.com/google/kubeflow/
Our goal is not to recreate other services, but to provide a straightforward
way for spinning up best of breed OSS solutions.
● A JupyterHub to create & manage interactive Jupyter notebooks
● A Tensorflow Training Controller that can be configured to use CPUs
or GPUs, and adjusted to the size of a cluster with a single setting
● A TF Serving container
JupyterHub
● A single hub & proxy for managing interactive sessions
● Can run entirely within Kubernetes - notebooks are backed by
Kubernetes pods
● Can request required resources - CPUs, GPUs, etc.
● Has pluggable authentication (OAuth, KDC, etc.)
Made possible by: https://github.com/jupyterhub/kubespawner
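Since each notebook is backed by a pod, a resource request ends up as the `resources` section of a container spec. The sketch below builds that section as a plain dictionary. The function name is hypothetical, and the GPU resource name `nvidia.com/gpu` is an assumption that depends on how GPUs are exposed in the cluster.

```python
from typing import Dict


def notebook_resources(cpu: str, memory: str, gpus: int = 0) -> Dict[str, Dict[str, str]]:
    """Build the `resources` section for a notebook container (illustrative only)."""
    if gpus < 0:
        raise ValueError("gpus must be non-negative")

    requests = {"cpu": cpu, "memory": memory}
    limits = {"cpu": cpu, "memory": memory}
    if gpus > 0:
        # Assumed resource name; extended resources such as GPUs are set as limits.
        limits["nvidia.com/gpu"] = str(gpus)
    return {"requests": requests, "limits": limits}


if __name__ == "__main__":
    print(notebook_resources("2", "4Gi", gpus=1))
```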
TensorFlow Training Controller
● A Kubernetes “operator” to help run distributed/non-distributed TF
training.
● Exposes an API through a CustomResourceDefinition
● Controller manages complexity of distributed training using
TensorFlow.
Made possible by: https://github.com/tensorflow/k8s
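Part of the complexity in distributed training is telling every replica where its peers are. Distributed TensorFlow can read this from a `TF_CONFIG` environment variable that holds a JSON cluster description. The sketch below builds such a value. The host naming scheme, the port, and the function name are assumptions for illustration, and not necessarily what the tensorflow/k8s controller generates.

```python
import json
from typing import Dict, List


def build_tf_config(job_name: str, workers: int, ps: int,
                    task_type: str, task_index: int, port: int = 2222) -> str:
    """Return a TF_CONFIG JSON string for one replica of a training job."""
    cluster: Dict[str, List[str]] = {
        "worker": [f"{job_name}-worker-{i}:{port}" for i in range(workers)],
        "ps": [f"{job_name}-ps-{i}:{port}" for i in range(ps)],
    }
    # Drop roles that have no replicas (e.g. non-distributed training).
    cluster = {role: hosts for role, hosts in cluster.items() if hosts}

    if task_type not in cluster or not 0 <= task_index < len(cluster[task_type]):
        raise ValueError(f"no replica {task_type}/{task_index} in this job")

    return json.dumps(
        {"cluster": cluster, "task": {"type": task_type, "index": task_index}},
        sort_keys=True,
    )


if __name__ == "__main__":
    # Each replica gets the same cluster section but its own task section.
    print(build_tf_config("mnist", workers=2, ps=1, task_type="worker", task_index=1))
```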
TensorFlow Serving
● A Kubernetes Deployment that can serve saved models
● Because it is a Deployment, its replicas can be scaled up or down.
Future work:
● Custom metrics & Autoscaling
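To make the Deployment idea concrete, the sketch below builds a Deployment manifest as a Python dictionary, with `replicas` as the only scaling knob. The image name, model path, and function name are placeholders. The `tensorflow_model_server` flags are the ones I understand the server to accept, so check them against the version you run.

```python
import json
from typing import Any, Dict


def serving_deployment(name: str, model_path: str, replicas: int = 1,
                       image: str = "example.invalid/tf-serving:latest") -> Dict[str, Any]:
    """Build an apps/v1 Deployment manifest for a model server (illustrative only)."""
    labels = {"app": name}
    container = {
        "name": "serving",
        "image": image,  # placeholder image, not a published artifact
        "command": ["tensorflow_model_server"],
        "args": [
            "--port=8500",
            f"--model_name={name}",
            f"--model_base_path={model_path}",
        ],
        "ports": [{"containerPort": 8500}],
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            # The selector must match the pod template labels.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [container]},
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(serving_deployment("mnist", "/models/mnist", replicas=3), indent=2))
```

Scaling then means changing one integer and re-applying the manifest. Autoscaling on custom metrics would adjust that same field automatically.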
But there were so many stages!
● Clearly there are many other challenges faced by people building
Machine Learning infrastructure.
● How do I preprocess data?
● How do I describe my pipeline?
● How do I orchestrate my pipeline?
● We have some ideas.
Apache Spark
● Spark on Kubernetes has been an ongoing effort since Dec 2016.
● It is being upstreamed into Spark and is expected to land in Spark 2.3
(due sometime in January 2018).
● The changes make Spark itself aware of a new Kubernetes Scheduler
that can directly run Spark applications for the user.
Apache Spark
[Diagram: Spark Core, Kubernetes Scheduler Backend, and Kubernetes Cluster, connected by arrows labeled “configuration”, “add executors”, and “rm executors”]
Apache Spark
Kubernetes Scheduler for Spark
● Spark 2.3 will support
○ Running Java/Scala jobs
○ Static allocation of executors
○ Some dependency management
● Our fork (github.com/apache-spark-on-k8s/spark) has several
additional features which we’re slowly upstreaming.
○ It’s being run by several organizations right now.
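With the Kubernetes scheduler backend, an application is submitted by pointing `spark-submit` at the Kubernetes API server. The sketch below assembles such a command line with a static number of executors. The option names follow the upstream Spark documentation as I understand it, so verify them against your Spark version. The server address, image, and jar path are placeholders.

```python
from typing import List


def spark_submit_args(api_server: str, app_name: str, main_class: str,
                      image: str, executors: int, app_jar: str) -> List[str]:
    """Assemble a spark-submit argv for cluster mode on Kubernetes (illustrative only)."""
    if executors < 1:
        raise ValueError("static allocation needs at least one executor")
    return [
        "spark-submit",
        "--master", f"k8s://{api_server}",
        "--deploy-mode", "cluster",
        "--name", app_name,
        "--class", main_class,
        # Static allocation: a fixed executor count chosen at submit time.
        "--conf", f"spark.executor.instances={executors}",
        "--conf", f"spark.kubernetes.container.image={image}",
        app_jar,
    ]


if __name__ == "__main__":
    args = spark_submit_args(
        api_server="https://example.invalid:6443",
        app_name="spark-pi",
        main_class="org.apache.spark.examples.SparkPi",
        image="example.invalid/spark:latest",
        executors=5,
        app_jar="local:///opt/spark/examples/jars/examples.jar",
    )
    print(" ".join(args))
```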
Apache Airflow
● A DAG scheduler.
● Has a rich ecosystem of “operators” to allow interacting with different
applications.
● The community is working on a Kubernetes-native executor for Airflow.
● It is currently in the process of being upstreamed.
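What “DAG scheduler” means can be shown without Airflow at all: given tasks and their dependencies, produce an order in which every task runs after the tasks it depends on. This is a toy sketch using Python's standard library, not Airflow's implementation. The task names are made up.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Dict, List, Set


def run_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Return task names ordered so that each task follows its upstream tasks.

    `dependencies` maps a task to the set of tasks it depends on.
    Raises graphlib.CycleError if the graph is not acyclic.
    """
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    pipeline = {
        "preprocess": set(),
        "train": {"preprocess"},
        "serve": {"train"},
    }
    print(run_order(pipeline))
```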
Apache Airflow
The operators can specify various Kubernetes executor constraints within each DAG step.
For example:
    # BashOperator comes from Airflow; `dag` is assumed to be an existing DAG object.
    BashOperator(
        task_id='account-test',
        bash_command='run-something.sh',
        dag=dag,
        executor_config={
            'request_memory': '128Mi',
            'limit_memory': '128Mi',
            'image': 'airflow/scipy:1.1.5',
        },
    )
Putting it all together
[Diagram components: HDFS or GCS/S3; Spark; Airflow Pipeline; JupyterHub; TensorFlow; Other ML Frameworks]
Get Involved
Kubeflow
● Slack Channel (See https://github.com/google/kubeflow for joining instructions)
● Twitter (http://twitter.com/kubeflow)
● Mailing List (https://groups.google.com/forum/#!forum/kubeflow-discuss)
SIG Big Data
● Slack Channel (https://kubernetes.slack.com/messages/sig-big-data)
● Mailing list (https://groups.google.com/forum/#!forum/kubernetes-sig-big-data)
● Weekly meeting (https://github.com/kubernetes/community/tree/master/sig-big-data)
Questions?
