Overview and Kubernetes integration
Jacob Tomlinson
Dask developer
Senior Software Engineer at NVIDIA
Dask’s Features
Overview
General purpose Python library for parallelism
Scales existing libraries, like Numpy, Pandas, and Scikit-Learn
Flexible enough to build complex and custom systems
Accessible for beginners, secure and trusted for institutions
PyData Community adoption
“Once Dask was working properly with
NumPy, it became clear that there was
huge demand for a lightweight parallelism
solution for Pandas DataFrames and
machine learning tools, such as
Scikit-Learn.
Dask then evolved quickly to support
these other projects where appropriate.”
Matthew Rocklin
Dask Creator
Source https://coiled.io/blog/history-dask/
Image from Jake VanderPlas’ keynote, PyCon 2017
Deferring Python execution
import dask

@dask.delayed
def inc(x):
    return x + 1

@dask.delayed
def double(x):
    return x * 2

@dask.delayed
def add(x, y):
    return x + y

data = [1, 2, 3, 4, 5]
output = []
for x in data:
    a = inc(x)
    b = double(x)
    c = add(a, b)
    output.append(c)

total = dask.delayed(sum)(output)
Dask allows users to construct
custom graphs with the delayed and
futures APIs.
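Building the graph does not execute anything; calling compute on the final delayed object triggers the work. A small sketch continuing the example above:

# Trigger execution of the whole graph
total.compute()    # returns 50 for the data above

# Optionally render the task graph (requires graphviz)
total.visualize()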
Distributed Task Graphs
Constructing tasks in a DAG allows
tasks to be executed by a selection of
schedulers.
The distributed scheduler allows a
DAG to be shared by many workers
running over many machines to
spread out work.
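For illustration, the same graph from the previous slide can be handed to different schedulers; a minimal sketch (the scheduler address is a placeholder):

from dask.distributed import Client

# Single-machine schedulers: run the graph with threads or processes
total.compute(scheduler="threads")
total.compute(scheduler="processes")

# Distributed scheduler: workers on many machines share the same DAG
client = Client("tcp://scheduler-address:8786")  # placeholder address
total.compute()  # now runs on the distributed cluster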
Out-of-core computation
Dask’s data structures are chunked or
partitioned allowing them to be swapped
in and out of memory.
Operations run on chunks independently
and only communicate intermediate
results when necessary.
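A minimal sketch of a partitioned Dask DataFrame, where each partition is loaded only when needed (the filenames are placeholders):

import dask.dataframe as dd

# Each CSV becomes one or more partitions; blocksize controls chunking
df = dd.read_csv("data-*.csv", blocksize="64MB")

# Partitions are processed independently; results combined at the end
df.groupby("x").y.mean().compute()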
Dask’s distributed scheduler
“For the first year of Dask’s life it was
focused on single-machine parallelism.
But inevitably, Dask was used on
problems that didn’t fit on a single
machine. This led us to develop a
distributed-memory scheduler for Dask
that supported the same API as the
existing single-machine scheduler.
For Dask users this was like magic.
Suddenly their existing workloads on
50GB datasets could run comfortably on
5TB (and then 50TB a bit later).”
Matthew Rocklin
Dask Creator
Source https://coiled.io/blog/history-dask/
Scheduler Dashboard
# Connect a Dask client
>>> from dask.distributed import Client
>>> client = Client(cluster)

# Do some computation
>>> import dask.array as da
>>> arr = da.random.random((10_000, 1_000, 1_000),
...                        chunks=(1000, 1000, 100))
>>> result = arr.mean().compute()
Dashboard
Dask’s dashboard gives you key
insights into how your cluster is
performing.
You can view it in a browser or directly within JupyterLab to see how your graphs are executing.
You can also use the built-in profiler to understand where the slow parts of your code are.
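A minimal sketch of locating the dashboard from an existing client (the exact address depends on where the scheduler is running):

>>> client.dashboard_link
'http://127.0.0.1:8787/status'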
Elastic scaling
Dask’s adaptive scaling allows a Dask
scheduler to request additional workers
via whatever resource manager you are
using (Kubernetes, Cloud, etc).
This allows computations to burst out onto
more machines and complete the overall
graph in less time.
This is particularly effective when you
have multiple people running interactive
and embarrassingly parallel workloads on
shared resources.
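A minimal sketch of turning on adaptive scaling for a cluster object (the bounds shown are illustrative):

# Let the scheduler request between 2 and 20 workers as load changes
cluster.adapt(minimum=2, maximum=20)

# Or pin the cluster to a fixed size instead
cluster.scale(10)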
Dask accelerates the existing Python ecosystem
Built alongside the current community
import numpy as np
x = np.ones((1000, 1000))
x + x.T - x.mean(axis=0)

import pandas as pd
df = pd.read_csv("file.csv")
df.groupby("x").y.mean()

from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(data, labels)
Numpy Pandas Scikit-Learn
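For comparison, a minimal sketch of the Dask equivalents, which keep the same interfaces (dask-ml supplies the scikit-learn-style estimator; data and labels are placeholders as above):

import dask.array as da
x = da.ones((1000, 1000), chunks=(250, 250))
(x + x.T - x.mean(axis=0)).compute()

import dask.dataframe as dd
df = dd.read_csv("file.csv")
df.groupby("x").y.mean().compute()

from dask_ml.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(data, labels)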
RAPIDS
https://github.com/rapidsai
Jacob Tomlinson
Cloud Deployment Lead
RAPIDS
Minor Code Changes for Major Benefits
Abstracting Accelerated Compute through Familiar Interfaces
CPU (pandas, scikit-learn, NetworkX):

In [1]: import pandas as pd
In [2]: df = pd.read_csv('filepath')

In [1]: from sklearn.ensemble import RandomForestClassifier
In [2]: clf = RandomForestClassifier(n_estimators=100, max_depth=8, random_state=0)
In [3]: clf.fit(x, y)

In [1]: import networkx as nx
In [2]: page_rank = nx.pagerank(graph)

GPU (cuDF, cuML, cuGraph):

In [1]: import cudf
In [2]: df = cudf.read_csv('filepath')

In [1]: from cuml.ensemble import RandomForestClassifier
In [2]: cuclf = RandomForestClassifier(n_estimators=100, max_depth=8, random_state=0)
In [3]: cuclf.fit(x, y)

In [1]: import cugraph
In [2]: page_rank = cugraph.pagerank(graph)

Average speed-ups: 150x (cuDF), 250x (cuML), 50x (cuGraph)
Lightning-Fast End-to-End Performance
Reducing Data Science Processes from Hours to Seconds
*CPU approximate to n1-highmem-8 (8 vCPUs, 52GB memory) on Google Cloud Platform. TCO calculations based on Cloud instance costs.
16 A100s Provide More Power than 100 CPU Nodes
20x More Cost-Effective than Similar CPU Configuration
70x Faster Performance than Similar CPU Configuration
RAPIDS on Kubernetes
Unified Cloud Deployments
[Diagram: a Kubernetes cluster with the GPU Operator exposing GPUs across the nodes]
Deploying Dask
A brief history and some context
Creating a Cluster manually
LocalCluster
● Convenience class to create
subprocesses
● Inspects local system and
creates workers to maximise
hardware use
● Has helper methods for
managing the cluster
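A minimal sketch of creating and managing a LocalCluster (the worker count passed to scale is illustrative):

from dask.distributed import LocalCluster, Client

cluster = LocalCluster()   # inspects local CPUs/memory and sizes workers
client = Client(cluster)

cluster.scale(4)           # helper method to resize the cluster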
dask-jobqueue
● Convenience class to create
HPC Dask Clusters
● Intended to be used from the
head node of an HPC
● Scheduler runs in subprocess
on the head node
● Workers are submitted as HPC
jobs to the queue
● Assumes network connectivity
between all nodes and head
node
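A minimal sketch using the SLURM backend (queue name and resources are placeholders; PBS, SGE and other schedulers have analogous classes):

from dask_jobqueue import SLURMCluster
from dask.distributed import Client

# Scheduler runs locally on the head node; workers are submitted as jobs
cluster = SLURMCluster(cores=8, memory="32GB", queue="general")
cluster.scale(jobs=10)     # submit 10 worker jobs to the queue

client = Client(cluster)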
dask-kubernetes (classic)
● Convenience class to create
Kubernetes Dask Clusters
● Intended to be used from within
the Kubernetes cluster
● Scheduler runs as subprocess
in user Pod
● Workers are created as Pods
(via service account auth)
● Assumes network connectivity
between all Pod IPs
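A minimal sketch of the classic API (the import path and pod options may vary between dask-kubernetes versions):

from dask_kubernetes.classic import KubeCluster, make_pod_spec
from dask.distributed import Client

# Worker Pods are created from this template using service account auth
pod_spec = make_pod_spec(image="ghcr.io/dask/dask:latest",
                         memory_limit="4G", cpu_limit=1)
cluster = KubeCluster(pod_spec)
cluster.scale(3)

client = Client(cluster)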
Helm Chart
● Chart deploys a Dask Cluster
and a Jupyter service
● Scheduler, Workers and
Jupyter are all Deployments
● Jupyter is preconfigured to
connect to the Dask cluster
● Dask worker Deployment
presents a scaling challenge
due to semi-stateful nature of
Dask Workers
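A minimal sketch of deploying the chart (the release name is arbitrary; values can be overridden to configure Jupyter and the workers):

$ helm repo add dask https://helm.dask.org
$ helm repo update
$ helm install my-dask dask/dask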
dask-gateway
● Dask cluster provisioning
service
● Has multiple backends
including HPC, Kubernetes and
Hadoop
● All Dask traffic is proxied via a
single ingress
● Users are abstracted away from the underlying platform
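A minimal sketch of requesting a cluster from a running gateway (the address is a placeholder provided by whoever operates the gateway):

from dask_gateway import Gateway

gateway = Gateway("https://dask-gateway.example.com")
cluster = gateway.new_cluster()   # provisioned on the configured backend
cluster.scale(4)
client = cluster.get_client()     # traffic is proxied through the gateway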
Dask Operator
Kubernetes Native
Built with kopf
Dask is a Python community so it made
sense to build the controller in Python too.
We also evaluated the Operator
Framework for Golang but using it would
hugely reduce the number of active Dask
maintainers who could contribute.
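For illustration only, a hypothetical kopf handler showing the general shape of such a controller (this is not the actual Dask operator code):

import kopf

@kopf.on.create("kubernetes.dask.org", "v1", "daskclusters")
def on_daskcluster_create(spec, name, namespace, logger, **kwargs):
    # React to a new DaskCluster resource, e.g. by creating scheduler
    # and worker Pods based on the spec
    logger.info(f"DaskCluster {name} created in namespace {namespace}")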
# cluster.yaml
apiVersion: kubernetes.dask.org/v1
kind: DaskCluster
metadata:
  name: simple-cluster
spec:
  worker:
    replicas: 3
    spec:
      containers:
        - name: worker
          image: "ghcr.io/dask/dask:latest"
          imagePullPolicy: "IfNotPresent"
          args:
            - dask-worker
            - --name
            - $(DASK_WORKER_NAME)
  scheduler:
    spec:
      containers:
        - name: scheduler
          image: "ghcr.io/dask/dask:latest"
          imagePullPolicy: "IfNotPresent"
          args:
            - dask-scheduler
          ports:
            - name: tcp-comm
              containerPort: 8786
              protocol: TCP
            - name: http-dashboard
              containerPort: 8787
              protocol: TCP
          readinessProbe:
            httpGet:
              port: http-dashboard
              path: /health
            initialDelaySeconds: 5
…
The Dask Operator has four custom
resource types that you can create via
kubectl.
● DaskCluster to create whole clusters.
● DaskWorkerGroup to create
additional groups of workers with
various configurations (high memory,
GPUs, etc).
● DaskJob to run end-to-end tasks like
a Kubernetes Job but with an
adjacent DaskCluster.
● DaskAutoscaler behaves like an HPA but interacts with the Dask scheduler to make scaling decisions.
Create Dask Clusters
with kubectl
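A minimal sketch of creating the cluster from the manifest above (the output shown is illustrative):

$ kubectl apply -f cluster.yaml
daskcluster.kubernetes.dask.org/simple-cluster created

$ kubectl get daskclusters
NAME             AGE
simple-cluster   10s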
DaskJob
● Inspired by Kubeflow
PyTorchJob, et al
● DaskJob contains a Pod spec
to run the workload and a
nested DaskCluster resource
● Workload Pod is preconfigured to connect to the DaskCluster
● Users can submit a batch job
with attached autoscaling Dask
Cluster via kubectl
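A hedged sketch of what a DaskJob manifest can look like (the image and script names are placeholders; see the dask-kubernetes documentation for the authoritative schema):

apiVersion: kubernetes.dask.org/v1
kind: DaskJob
metadata:
  name: simple-job
spec:
  job:
    spec:
      containers:
        - name: job
          image: "ghcr.io/dask/dask:latest"
          args: ["python", "my_script.py"]   # connects to the attached cluster
  cluster:
    spec:
      worker:
        replicas: 2
        spec:
          containers:
            - name: worker
              image: "ghcr.io/dask/dask:latest"
              args:
                - dask-worker
      scheduler:
        spec:
          # same shape as the DaskCluster scheduler spec shown earlier
          …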
Create Dask Clusters with Python
# Install dask-kubernetes
$ pip install dask-kubernetes

# Launch a cluster
>>> from dask_kubernetes.operator import KubeCluster
>>> cluster = KubeCluster(name="demo")

# List the DaskCluster custom resource that
# was created for us under the hood
$ kubectl get daskclusters
NAME   AGE
demo   6m3s
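The returned cluster object supports the usual cluster-manager helpers; a small sketch (the numbers are illustrative):

>>> from dask.distributed import Client
>>> client = Client(cluster)

>>> cluster.scale(5)                      # fixed number of workers
>>> cluster.adapt(minimum=1, maximum=10)  # or let the operator autoscale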
Flyte
Integration success
Read Documentation: docs.dask.org
See Examples: examples.dask.org
Engage Community: github.com/dask