Machine Learning
With Apache Spark
On Kubernetes
Erik Erlandson, Red Hat, Inc.
eje@redhat.com
@ManyAngled
ML In An Application Context
[Figure: components of an ML system in an application: configuration, data collection, feature extraction, process management, analysis tools, monitoring, serving infrastructure, machine resource management, data verification.]
(Adapted from Sculley et al., “Hidden Technical Debt in Machine Learning Systems.” NIPS 2015)
ML Workflow
[Figure: workflow stages: codifying problem and metrics; data collection and cleaning; feature engineering; model training and tuning; model validation; model deployment; monitoring, validation.]
ML Workflow
[Figure: the workflow stages with data collection and cleaning called out, illustrated by a block of placeholder text standing in for raw collected data.]
ML Workflow
[Figure: the workflow stages with data collection and cleaning and feature engineering called out. Feature engineering is shown as a function from a raw data item to a numeric vector:]
f( ) = 0.67 0.57 0.84 0.08 0.42 0.01
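The slide only shows that feature engineering is some function f from a raw item to a vector of numbers. As a minimal sketch of what such a function can look like, the Scala below hashes the words of a text into a fixed number of buckets and returns normalized counts. The name `features`, the choice of six buckets, and the hashing scheme itself are assumptions for illustration; they are not the method used in the talk.

```scala
object FeatureSketch {
  // Hypothetical feature extractor (not from the talk): maps a text to a
  // fixed-length numeric vector of normalized hashed word counts.
  def features(text: String, numFeatures: Int = 6): Vector[Double] = {
    // Split on non-word characters and drop empty tokens.
    val words = text.toLowerCase.split("\\W+").filter(_.nonEmpty)
    val counts = Array.fill(numFeatures)(0.0)
    for (w <- words) {
      // floorMod keeps the bucket index non-negative for negative hash codes.
      val bucket = Math.floorMod(w.hashCode, numFeatures)
      counts(bucket) += 1.0
    }
    // Normalize so the entries sum to 1 for any non-empty text.
    val total = math.max(words.length, 1).toDouble
    counts.map(_ / total).toVector
  }

  def main(args: Array[String]): Unit = {
    val v = features("Spark runs on Kubernetes and Spark scales")
    println(v.mkString(" "))
  }
}
```

Every input, whatever its length, comes out as a vector of the same size, which is what the later training stage needs.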
[Figure: data platform overview. Roles: data scientists, application developers, data engineers. Sources: events, databases, file and object storage. Operations: federate, transform, archive, train models. Consumers: management, web and mobile, reporting, developer UI.]
Spark’s Compute Model
Logical View: the App sees one collection: 0 1 2 3 4 5 6 7 8
Physical View: a Driver and three Executors, each Executor holding one partition:
Executor: 0 1 2
Executor: 3 4 5
Executor: 6 7 8
The function λx: x * 2 is copied from the Driver to every Executor and applied to each partition:
Executor: 0 2 4
Executor: 6 8 10
Executor: 12 14 16
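The logical and physical views above can be mimicked in plain Scala without a cluster. This is only a sketch of the idea, not Spark code: `partition` and `mapPartitions` are hypothetical helpers written for this example, and the "executors" are just nested vectors in one process.

```scala
object ComputeModelSketch {
  // Logical view: one collection, as the application sees it.
  val logical: Vector[Int] = (0 to 8).toVector

  // Physical view: the same data split into contiguous partitions,
  // one per (simulated) executor.
  def partition[A](data: Vector[A], numExecutors: Int): Vector[Vector[A]] = {
    val size = math.max(1, math.ceil(data.length.toDouble / numExecutors).toInt)
    data.grouped(size).toVector
  }

  // The same function is handed to every partition, and each partition
  // is transformed independently of the others.
  def mapPartitions[A, B](parts: Vector[Vector[A]])(f: A => B): Vector[Vector[B]] =
    parts.map(_.map(f))

  def main(args: Array[String]): Unit = {
    val parts = partition(logical, 3)
    val doubled = mapPartitions(parts)(x => x * 2)
    doubled.foreach(p => println(p.mkString(" ")))
    // prints:
    // 0 2 4
    // 6 8 10
    // 12 14 16
  }
}
```

In Spark itself the corresponding call would look roughly like `sc.parallelize(0 to 8, 3).map(_ * 2)`, with the driver shipping the function to the executors instead of a local `map`.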
Spark on Kubernetes
[Figure: the same computation, with the Driver and each of the three Executors running in its own pod: Driver Pod, Executor Pod, Executor Pod, Executor Pod.]
Spark Structured Streaming
records.show(5)
+----------+---------+
|   user_id|wordcount|
+----------+---------+
|6458791872|       12|
|7699035787|        5|
|2509155359|        9|
|9914782373|       18|
|7816616846|       12|
+----------+---------+
records.groupBy($"user_id")
  .agg(avg($"wordcount").alias("avg"))
  .orderBy($"avg".desc)
  .show(5)
+----------+----+
|   user_id| avg|
+----------+----+
|9438801796|42.0|
|0837938601|41.0|
|0004926696|40.0|
|7439949213|39.0|
|2505585758|39.0|
+----------+----+
Spark Structured Streaming
val r =
  records.groupBy($"user_id")
    .agg(avg($"wordcount").alias("avg"))
    .orderBy($"avg".desc)
val query = r.writeStream //…
+----------+----+
|   user_id| avg|
+----------+----+
|9438801796|42.0|
|0837938601|41.0|
|0004926696|40.0|
|7439949213|39.0|
|2505585758|39.0|
+----------+----+
val query = records
  .writeStream //…
+----------+---------+
|   user_id|wordcount|
+----------+---------+
|6458791872|       12|
|7699035787|        5|
|2509155359|        9|
|9914782373|       18|
|7816616846|       12|
+----------+---------+
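The streaming version of the query keeps its result table up to date as new records arrive. The plain-Scala sketch below is a conceptual model of that behavior, not Spark's implementation: it keeps a running sum and count per user, folds each incoming batch into that state, and recomputes the averages sorted from highest to lowest. `Record`, `State`, `update`, and `result` are names made up for this example, and the two batches are invented sample data.

```scala
object StreamingAvgSketch {
  final case class Record(userId: String, wordcount: Int)

  // Running state: for each user, (sum of wordcounts, number of records seen).
  type State = Map[String, (Long, Long)]

  // Fold one batch of newly arrived records into the running state.
  def update(state: State, batch: Seq[Record]): State =
    batch.foldLeft(state) { (s, r) =>
      val (sum, n) = s.getOrElse(r.userId, (0L, 0L))
      s.updated(r.userId, (sum + r.wordcount, n + 1))
    }

  // Current result table: average wordcount per user, highest average first.
  def result(state: State): Seq[(String, Double)] =
    state.toSeq
      .map { case (id, (sum, n)) => (id, sum.toDouble / n) }
      .sortWith(_._2 > _._2)

  def main(args: Array[String]): Unit = {
    // Invented sample batches, reusing user ids from the table above.
    val batches = Seq(
      Seq(Record("6458791872", 12), Record("7699035787", 5)),
      Seq(Record("6458791872", 18), Record("2509155359", 9))
    )
    batches.foldLeft(Map.empty[String, (Long, Long)]) { (state, batch) =>
      val next = update(state, batch)
      // Re-emit the whole result table after every batch.
      println(result(next).map { case (id, avg) => s"$id $avg" }.mkString(", "))
      next
    }
  }
}
```

Re-emitting the whole table after each batch corresponds to what Structured Streaming calls complete output mode, which is the mode a sorted aggregate such as `r` needs.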
Spark Structured Streaming
[Figure: a DataFrame with columns user_id and wordcount at the center. Input labels: SQL DB, CSV, Kafka, ... Interface labels: SQL, DataFrame API, User Defined.]
Open Data Hub

Machine learning with Apache Spark on Kubernetes | DevNation Tech Talk