Machine learning with Apache Spark on Kubernetes | DevNation Tech Talk

Machine Learning With Apache Spark On Kubernetes
Erik Erlandson, Red Hat, Inc.
eje@redhat.com
@ManyAngled

ML In An Application Context
[Figure: the components surrounding ML in a production system: configuration, data collection, feature extraction, process management, analysis tools, monitoring, serving infrastructure, machine resource management, data verification.]
(Adapted from Sculley et al., “Hidden Technical Debt in Machine Learning Systems.” NIPS 2015)

ML Workflow
[Figure: the workflow stages]
codifying problem and metrics
data collection and cleaning
feature engineering
model training and tuning
model validation
model deployment
monitoring, validation

ML Workflow: data collection and cleaning
[Figure: the same workflow stages with "data collection and cleaning" highlighted, next to a block of placeholder text standing in for raw input data.]

ML Workflow: feature engineering
[Figure: the same workflow stages with "feature engineering" highlighted. A function f maps a raw record to a numeric feature vector:]
f( ) = 0.67 0.57 0.84 0.08 0.42 0.01
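The feature engineering step can be sketched in plain Scala. This is only an illustration of the idea of a function f that turns a raw record into a fixed-length numeric vector; the `features` function, the hashing scheme, and the bucket count of 6 are assumptions made for this example and are not taken from the talk.

```scala
object FeatureSketch {
  // Hypothetical feature function: hashes each word of the text into one of
  // `buckets` slots and returns the fraction of words that landed in each slot.
  def features(text: String, buckets: Int = 6): Vector[Double] = {
    val words = text.toLowerCase.split("\\W+").filter(_.nonEmpty)
    val counts = Array.fill(buckets)(0)
    words.foreach { w =>
      // floorMod keeps the index non-negative even for negative hash codes.
      val idx = Math.floorMod(w.hashCode, buckets)
      counts(idx) += 1
    }
    // Avoid division by zero for empty input.
    val total = math.max(words.length, 1).toDouble
    counts.map(_ / total).toVector
  }

  def main(args: Array[String]): Unit =
    println(features("Spark runs on Kubernetes and Spark scales"))
}
```

Every value in the result lies between 0 and 1, and for non-empty text the values sum to 1, which gives vectors of the same general shape as the one on the slide.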

[Figure: roles and data flow in an ML platform. Roles: data scientists, application developers, data engineers. Elements: events, databases, file and object storage, transform (three stages), federate, train models, archive, management, web and mobile, reporting, developer UI.]

Spark’s Compute Model
Logical View: the App sees one collection: 0 1 2 3 4 5 6 7 8
Physical View: a Driver and three Executors, each holding one partition:
Executor: 0 1 2
Executor: 3 4 5
Executor: 6 7 8
The Driver sends the function λx: x * 2 to every Executor, and each Executor applies it to its own partition:
Executor: 0 2 4
Executor: 6 8 10
Executor: 12 14 16
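The logical and physical views can be mimicked with ordinary Scala collections. The sketch below does not use Spark at all: `partition` and `runOnExecutors` are hypothetical helpers invented here to show how one logical collection is split into per-executor partitions and how the same function is applied to each partition independently.

```scala
object ComputeModelSketch {
  // Logical view: the application sees one collection.
  val logical: Vector[Int] = (0 to 8).toVector

  // Physical view: split the data into contiguous partitions, one per executor.
  def partition[A](data: Vector[A], executors: Int): Vector[Vector[A]] = {
    val size = math.max(1, math.ceil(data.length.toDouble / executors).toInt)
    data.grouped(size).toVector
  }

  // The "driver" hands the same function to every "executor";
  // each one applies it only to its own partition.
  def runOnExecutors[A, B](partitions: Vector[Vector[A]])(f: A => B): Vector[Vector[B]] =
    partitions.map(_.map(f))

  def main(args: Array[String]): Unit = {
    val parts = partition(logical, executors = 3)
    val doubled = runOnExecutors(parts)(x => x * 2)
    doubled.zipWithIndex.foreach { case (p, i) =>
      println(s"Executor $i: ${p.mkString(" ")}")
    }
  }
}
```

Running `main` prints `Executor 0: 0 2 4`, `Executor 1: 6 8 10`, and `Executor 2: 12 14 16`, matching the slide. In Spark itself this roughly corresponds to `sc.parallelize(0 to 8, 3).map(_ * 2)`, where the partitions live on separate executor processes rather than in one JVM.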

Spark on Kubernetes
[Figure: the same Driver and three Executors, each running in its own pod: Driver Pod, Executor Pod, Executor Pod, Executor Pod.]
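As a rough illustration of how the driver-pod and executor-pod layout is requested, the sketch below assembles `--conf` arguments for `spark-submit` from standard Spark configuration properties. The API server address and container image are placeholders, the executor count of 3 simply mirrors the slide, and the `submitArgs` helper is hypothetical; none of these values come from the talk.

```scala
object K8sSubmitSketch {
  // Standard Spark properties; the values in angle brackets are placeholders
  // to be replaced with a real API server address and container image.
  val conf: Seq[(String, String)] = Seq(
    "spark.master" -> "k8s://https://<api-server-host>:<port>",
    "spark.submit.deployMode" -> "cluster",           // driver runs in its own pod
    "spark.executor.instances" -> "3",                // three executor pods, as on the slide
    "spark.kubernetes.container.image" -> "<your-spark-image>"
  )

  // Hypothetical helper: turns the settings into spark-submit arguments.
  def submitArgs(settings: Seq[(String, String)]): Seq[String] =
    settings.flatMap { case (k, v) => Seq("--conf", s"$k=$v") }

  def main(args: Array[String]): Unit =
    println(submitArgs(conf).mkString(" "))
}
```

With settings along these lines, Spark asks Kubernetes for one pod for the driver and one pod per executor.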

Spark Structured Streaming
records.show(5)
+----------+---------+
|   user_id|wordcount|
+----------+---------+
|6458791872|       12|
|7699035787|        5|
|2509155359|        9|
|9914782373|       18|
|7816616846|       12|
+----------+---------+

records.groupBy($"user_id")
  .agg(avg($"wordcount").alias("avg"))
  .orderBy($"avg".desc)
  .show(5)
+----------+----+
|   user_id| avg|
+----------+----+
|9438801796|42.0|
|0837938601|41.0|
|0004926696|40.0|
|7439949213|39.0|
|2505585758|39.0|
+----------+----+
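To make clear what the aggregation computes, here is the same group, average, and sort-descending logic on a local Scala collection, without Spark. The `Record` case class, the `avgByUser` function, and the sample rows are all hypothetical and exist only for this sketch.

```scala
object AvgWordcountSketch {
  final case class Record(userId: String, wordcount: Int)

  // Local equivalent of groupBy(user_id), agg(avg(wordcount)), orderBy(avg desc).
  def avgByUser(records: Seq[Record]): Seq[(String, Double)] =
    records
      .groupBy(_.userId)
      .map { case (id, rs) => id -> rs.map(_.wordcount).sum.toDouble / rs.size }
      .toSeq
      .sortBy { case (_, avg) => -avg }

  def main(args: Array[String]): Unit = {
    // Hypothetical sample rows, not taken from the talk.
    val sample = Seq(Record("u1", 12), Record("u2", 5), Record("u1", 18))
    avgByUser(sample).take(5).foreach { case (id, avg) => println(s"$id $avg") }
  }
}
```

For the sample rows, user `u1` averages 15.0 over two records and `u2` averages 5.0, so `u1` is listed first. The Spark version expresses the same computation but runs it across partitions on the executors.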

val r =
val query = r.writeStream //…
+----------+----+
|   user_id| avg|
+----------+----+
|9438801796|42.0|
|0837938601|41.0|
|0004926696|40.0|
|7439949213|39.0|
|2505585758|39.0|
+----------+----+

val query = records
  .writeStream //…
+----------+---------+
|   user_id|wordcount|
+----------+---------+
|6458791872|       12|
|7699035787|        5|
|2509155359|        9|
|9914782373|       18|
|7816616846|       12|
+----------+---------+

DataFrame
+----------+---------+
|   user_id|wordcount|
+----------+---------+
|6458791872|       12|
|7699035787|        5|
|2509155359|        9|
|9914782373|       18|
|7816616846|       12|
+----------+---------+
[Figure: the DataFrame sits between its data sources (SQL DB, CSV, Kafka, ...) and the ways to work with it (SQL, DataFrame API, User Defined).]
