Machine Learning on Kubernetes
13 Dec 2017
Anirudh Ramanathan
Software Engineer on Kubernetes
Twitter: @anirudh4444

Disclaimer
I’m not a Machine Learning expert.
I work on infrastructure and distributed systems for a
living.

Kubernetes a year ago...
● Was used primarily for stateless workloads
● Needed an understanding of several core concepts to operate
● Applications had to be written to fit into core controller abstractions

Kubernetes today...
● Has abstractions to support stateful applications and, more recently,
data processing and machine learning.
● Has a wide range of extension points including ones that allow API
extensions and custom controllers.
● Has support for building higher level abstractions and APIs to hide
infrastructure & operational complexity.

What’s changed?
● Workload controller abstractions moving to GA/stable.
● Custom Resource Definitions & Aggregated API Servers
● Kubernetes Operators
● Community support for external frameworks
● Work on scheduling and resource management (ongoing)
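The custom resource and operator items above share one idea: a controller compares the desired state declared in a resource with the state it observes, then acts to close the gap. The following is a toy, in-memory sketch of that reconcile step. The function name, its arguments, and the pod naming scheme are hypothetical and do not come from any Kubernetes client library.

```python
from typing import List, Tuple


def reconcile(name: str, desired_replicas: int,
              observed_pods: List[str]) -> List[Tuple[str, str]]:
    """Return the (action, pod_name) steps needed to reach the desired state."""
    # Hypothetical naming scheme: <name>-0, <name>-1, ...
    expected = [f"{name}-{i}" for i in range(desired_replicas)]
    actions = []
    # Create pods that should exist but were not observed.
    for pod in expected:
        if pod not in observed_pods:
            actions.append(("create", pod))
    # Delete pods that were observed but are no longer wanted.
    for pod in observed_pods:
        if pod not in expected:
            actions.append(("delete", pod))
    return actions


print(reconcile("trainer", 3, ["trainer-0", "trainer-5"]))
```

A real operator would run this comparison continuously against the API server; the sketch only shows the decision made on each pass.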

Machine Learning
Solving problems without explicitly knowing
how to create solutions

Machine Learning Infrastructure
TFX: A TensorFlow-Based Production-Scale Machine Learning Platform (KDD 2017)

Kubeflow
https://github.com/google/kubeflow/
Our goal is not to recreate other services, but to provide a straightforward
way for spinning up best of breed OSS solutions.
● A JupyterHub to create & manage interactive Jupyter notebooks
● A TensorFlow Training Controller that can be configured to use CPUs
or GPUs, and adjusted to the size of a cluster with a single setting
● A TF Serving container

JupyterHub
● A single hub & proxy for managing interactive sessions
● Can run entirely within Kubernetes - notebooks are backed by
Kubernetes pods
● Can request required resources - CPUs, GPUs, etc.
● Has pluggable authentication (OAuth, KDC, etc.)
Made possible by: https://github.com/jupyterhub/kubespawner
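To illustrate notebooks backed by pods that request CPUs and GPUs, the sketch below builds the kind of Pod manifest a spawner could submit for one user. It is plain Python that produces a dictionary in the Kubernetes Pod schema. The function name, image name, labels, and default values are hypothetical, and `nvidia.com/gpu` is assumed as the GPU resource name (the actual name depends on the cluster's device plugin). This is not kubespawner's own code.

```python
import json


def notebook_pod(user, image="example/notebook:latest",
                 cpu="1", memory="2Gi", gpus=0):
    """Build a Pod manifest (as a dict) for one user's notebook server."""
    limits = {"cpu": cpu, "memory": memory}
    if gpus:
        # Assumed GPU resource name; varies with the device plugin in use.
        limits["nvidia.com/gpu"] = str(gpus)
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": f"jupyter-{user}",
                     "labels": {"app": "jupyterhub"}},
        "spec": {
            "containers": [{
                "name": "notebook",
                "image": image,
                "resources": {"limits": limits},
            }]
        },
    }


print(json.dumps(notebook_pod("alice", gpus=1), indent=2))
```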

TensorFlow Training Controller
● A Kubernetes “operator” that helps run distributed or non-distributed
TensorFlow training.
● Exposes an API through a CustomResourceDefinition
● The controller manages the complexity of distributed training with
TensorFlow.
Made possible by: https://github.com/tensorflow/k8s
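One piece of complexity such a controller can hide is telling each replica where its peers are. TensorFlow's distributed runtime reads this from a `TF_CONFIG` environment variable holding a cluster spec and the task's own role. The sketch below builds that value in plain Python. The host naming scheme, the port, and the function name are hypothetical, and this is not the code from tensorflow/k8s.

```python
import json


def tf_config(job, num_workers, num_ps, task_type, task_index, port=2222):
    """Return the TF_CONFIG JSON string for one replica of a training job."""
    # Hypothetical DNS names: <job>-<role>-<index>:<port>
    cluster = {
        "worker": [f"{job}-worker-{i}:{port}" for i in range(num_workers)],
        "ps": [f"{job}-ps-{i}:{port}" for i in range(num_ps)],
    }
    if task_type not in cluster:
        raise ValueError(f"unknown task type: {task_type}")
    if not 0 <= task_index < len(cluster[task_type]):
        raise ValueError("task_index out of range for this role")
    return json.dumps(
        {"cluster": cluster,
         "task": {"type": task_type, "index": task_index}},
        sort_keys=True,
    )


# Each replica gets the same cluster spec but a different "task" entry.
print(tf_config("mnist", num_workers=2, num_ps=1,
                task_type="worker", task_index=1))
```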

TensorFlow Serving
● A Kubernetes Deployment that can serve saved models
● Because it is a Deployment, its replicas can be scaled.
Future work:
● Custom metrics & Autoscaling

But there were so many stages!
● Clearly there are many other challenges faced by people building
Machine Learning infrastructure.
● How do I preprocess data?
● How do I describe my pipeline?
● How do I orchestrate my pipeline?
● We have some ideas.

Apache Spark
● Spark on Kubernetes has been an ongoing effort since Dec 2016.
● It is being upstreamed into Spark and expected to land in Spark 2.3
(due sometime in January).
● The changes make Spark itself aware of a new Kubernetes Scheduler
that can directly run Spark applications for the user.

Apache Spark
[Diagram: Spark Core, the Kubernetes Scheduler Backend, and the Kubernetes
cluster, with arrows labeled "configuration", "add executors", and "rm executors".]

Apache Spark
Kubernetes Scheduler for Spark
● Spark 2.3 will support
○ Running Java/Scala jobs
○ Static allocation of executors
○ Some dependency management
● Our fork (github.com/apache-spark-on-k8s/spark) has several
additional features which we’re slowly upstreaming.
○ It’s being run by several organizations right now.
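Submitting a job to the Kubernetes scheduler backend goes through `spark-submit` with a `k8s://` master URL. The sketch below only assembles the argument list in Python and does not run anything. The API server address, image name, and jar path are placeholders. The `spark.kubernetes.container.image` property name is taken from the upstream Spark 2.3 documentation and may differ in the fork. Setting `spark.executor.instances` fixes the executor count, which corresponds to static allocation.

```python
def spark_submit_args(api_server, image, main_class, app_jar, executors=2):
    """Build a spark-submit command line targeting a Kubernetes cluster."""
    return [
        "spark-submit",
        "--master", f"k8s://{api_server}",
        "--deploy-mode", "cluster",
        "--class", main_class,
        # Static allocation: a fixed number of executors.
        "--conf", f"spark.executor.instances={executors}",
        "--conf", f"spark.kubernetes.container.image={image}",
        app_jar,
    ]


# All values below are placeholders for illustration.
args = spark_submit_args(
    api_server="https://example-apiserver:6443",
    image="example/spark:2.3.0",
    main_class="org.apache.spark.examples.SparkPi",
    app_jar="local:///path/to/examples.jar",
    executors=5,
)
print(" ".join(args))
```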

Apache Airflow
● A DAG scheduler.
● Has a rich ecosystem of “operators” to allow interacting with different
applications.
● The community is working on a Kubernetes-native executor for Airflow.
● It is currently in the process of being upstreamed.
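What a DAG scheduler does at its core can be sketched with a topological sort: a task becomes runnable only once everything it depends on has finished. The following is a minimal stand-alone illustration using Kahn's algorithm, not Airflow's implementation. The task names and the function name are hypothetical.

```python
from collections import deque


def execution_order(deps):
    """deps maps each task to the set of tasks it depends on."""
    tasks = set(deps)
    for upstream in deps.values():
        tasks |= set(upstream)
    remaining = {t: set(deps.get(t, ())) for t in tasks}
    # Start with tasks that have no dependencies (sorted for determinism).
    ready = deque(sorted(t for t, d in remaining.items() if not d))
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        # Finishing a task may unblock the tasks that were waiting on it.
        for other in sorted(remaining):
            if task in remaining[other]:
                remaining[other].remove(task)
                if not remaining[other]:
                    ready.append(other)
    if len(order) != len(tasks):
        raise ValueError("cycle detected: not a DAG")
    return order


pipeline = {
    "preprocess": {"extract"},
    "train": {"preprocess"},
    "evaluate": {"train"},
    "report": {"train"},
}
print(execution_order(pipeline))
```

A production scheduler would run independent tasks such as `evaluate` and `report` in parallel; the sketch only shows a valid ordering.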

Apache Airflow
Operators can specify Kubernetes executor constraints for each DAG step.
For example:
BashOperator(
    task_id='account-test',
    bash_command='run-something.sh',
    dag=dag,  # a DAG object defined elsewhere
    executor_config={
        'request_memory': '128Mi',
        'limit_memory': '128Mi',
        'image': 'airflow/scipy:1.1.5',
    },
)

Putting it all together
[Diagram components: HDFS or GCS/S3, Spark, Airflow Pipeline, JupyterHub,
TensorFlow, Other ML Frameworks]

Get Involved
Kubeflow
● Slack Channel (See https://github.com/google/kubeflow for joining instructions)
● Twitter (http://twitter.com/kubeflow)
● Mailing List (https://groups.google.com/forum/#!forum/kubeflow-discuss)
SIG Big Data
● Slack Channel (https://kubernetes.slack.com/messages/sig-big-data)
● Mailing list (https://groups.google.com/forum/#!forum/kubernetes-sig-big-data)
● Weekly meeting (https://github.com/kubernetes/community/tree/master/sig-big-data)
