Meetup Kubernetes Rhein-Neckar

Data-aware scheduling
Spark on Kubernetes with HDFS
Johannes M. Scheuermann
Karlsruhe, 18.10.2017

IT Engineering & Operations @ inovex
〉 Software-Defined Datacenters
〉 Infrastructure as Code
〉 Cloud technologies
〉 High Availability & Scalability
〉 Wants an IBM Z-Frame
〉 @johscheuer

• Why data-aware scheduling
• Data-aware scheduling for non-Big-Data applications
• Data-aware scheduler
• Big Data on Kubernetes
• Spark on Kubernetes
• HDFS on Kubernetes
Agenda

• I’m not a scheduling expert
• Concept is/was a PoC
• Share learnings/ideas
• Get feedback from the community
Spoiler(s) :)

Data-aware scheduling for non Big-Data
• Databases
• (large) image processing
• Video encoding
• (Web)-Cache

• Distributed (parallel) POSIX file system
• Any workload with high performance (incl. throughput, databases, small files)
• Can be deployed in containers, on kubelet hosts.
• Linearly scalable performance.
• Fully fault-tolerant, split-brain safe
Quobyte – What is Quobyte

• Metadata servers make placement decisions against policies
• on file level
• tiering, isolation, …
• keep stripes of files on disks of the same machine => enable local read
• allow preferring writes to local storage servers => enable local write
• Locality information can be retrieved per file
• that’s where the scheduler hooks in
Quobyte - Placement
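
A minimal sketch of how per-file locality information could be turned into a host ranking. It does not use the Quobyte API: the `locations` dictionary stands in for whatever a per-file lookup returns, and the function name, paths and host names are hypothetical.

```python
from collections import Counter


def rank_hosts_by_locality(file_locations, wanted_files):
    """Count, per host, how many of the wanted files have data stored there."""
    hits = Counter()
    for path in wanted_files:
        # A host counts once per file, even if it appears several times.
        for host in set(file_locations.get(path, [])):
            hits[host] += 1
    # Most local files first; ties are broken by host name for stable output.
    return sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical locality data (assumption, not real Quobyte output).
locations = {
    "/videos/a.mp4": ["storage-1", "storage-2"],
    "/videos/b.mp4": ["storage-2"],
    "/videos/c.mp4": ["storage-3", "storage-2"],
}
ranking = rank_hosts_by_locality(locations, ["/videos/a.mp4", "/videos/b.mp4"])
print(ranking)
```

Counting files is the simplest possible score; weighting by file size or stripe count would be a natural refinement.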

• Specify wanted Data
• Lookup Data Placement
• Remapping if Storage runs in Containers
• Schedule Pod
Scheduling data-aware (file-based)
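
The four steps above can be sketched as follows. This is an illustration under assumptions, not the PoC's actual code: `lookup_placement` and `container_to_node` are hypothetical inputs, and the last step expresses the result as a preferred node affinity in the Pod spec, whereas a custom scheduler could also bind the Pod to the node directly.

```python
def pick_node(wanted_files, lookup_placement, container_to_node):
    """Return the node that holds most of the wanted files, or None."""
    score = {}
    for path in wanted_files:                       # 1. specify wanted data
        storages = lookup_placement(path)           # 2. look up data placement
        # 3. remap: a containerized storage server reports its own name,
        #    so translate it to the Kubernetes node it runs on.
        nodes = {container_to_node.get(s, s) for s in storages}
        for node in nodes:
            score[node] = score.get(node, 0) + 1
    if not score:
        return None
    # Highest score wins; ties go to the alphabetically first node.
    return max(sorted(score), key=score.get)


def pod_with_node_preference(name, image, node):
    """4. schedule Pod: build a Pod manifest that prefers the chosen node."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [{"name": name, "image": image}],
            "affinity": {"nodeAffinity": {
                "preferredDuringSchedulingIgnoredDuringExecution": [{
                    "weight": 100,
                    "preference": {"matchExpressions": [{
                        "key": "kubernetes.io/hostname",
                        "operator": "In",
                        "values": [node],
                    }]},
                }],
            }},
        },
    }


# Hypothetical placement data and container-to-node mapping.
placement = {
    "/data/a": ["quobyte-data-0"],
    "/data/b": ["quobyte-data-0", "quobyte-data-1"],
}
mapping = {"quobyte-data-0": "node-a", "quobyte-data-1": "node-b"}

node = pick_node(["/data/a", "/data/b"], lambda p: placement.get(p, []), mapping)
pod = pod_with_node_preference("encoder", "example/encoder:latest", node)
print(node)
```

A preferred (soft) affinity keeps the Pod schedulable when the data-local node is full; a hard requirement would trade availability for locality.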

Scheduler Architecture (4000ft)

Scheduler Architecture (1000ft)

Scheduler Architecture (containerized)

(Spark) Big-Data on Kubernetes

• https://github.com/apache-spark-on-k8s/spark
• Not the “faked” Spark on Kubernetes
• Still in development
• Still not in the official Apache Spark project
• Current: v2.2.0-0.4.0
• Alpha/Beta ?!
Spark on Kubernetes

Spark on Kubernetes
[Diagram. Spark Core; cluster managers: Mesos, YARN, Standalone, Kubernetes; libraries: Streaming, MLlib, Spark SQL, GraphX.]

• Integrates with Kubernetes
• RBAC
• Resource Quotas
• Audit logging
• Etc.
• Only cluster-mode
Spark on Kubernetes
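
A hedged sketch of what a cluster-mode submission looks like, built as an argument list. Only standard `spark-submit` options are used; the API server address, the jar path and the helper name are placeholders, and the project's image settings would have to be passed through `extra_conf` according to its own documentation.

```python
import shlex


def spark_submit_args(api_server, main_class, app_jar, executors=2, extra_conf=None):
    """Build a spark-submit command line for Kubernetes in cluster mode."""
    args = [
        "bin/spark-submit",
        "--deploy-mode", "cluster",          # the only supported mode
        "--master", f"k8s://{api_server}",   # Kubernetes API server as master
        "--class", main_class,
        "--conf", f"spark.executor.instances={executors}",
    ]
    # Additional settings (for example driver/executor images) go here.
    for key, value in sorted((extra_conf or {}).items()):
        args += ["--conf", f"{key}={value}"]
    args.append(app_jar)
    return args


# Placeholder values (assumptions for illustration only).
args = spark_submit_args(
    "https://k8s.example.com:6443",
    "org.apache.spark.examples.SparkPi",
    "local:///opt/spark/examples/jars/spark-examples.jar",
    executors=5,
)
print(" ".join(shlex.quote(a) for a in args))
```

Because the driver itself runs as a Pod, the application jar is referenced from inside the image (`local://`) in this sketch rather than uploaded from the submitting machine.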

Spark on Kubernetes (cluster-mode)

Spark + HDFS on Kubernetes
[Diagram. Components: Driver 1, Driver 2, Executor 1, Executor 2.1, Executor 2.2, HDFS NN, HDFS DN 1, HDFS DN 2. Networks: Pod Network, Host Network.]

• Rack-locality
• Node preferences
• Priority-based scheduling (K8s 1.8 alpha)
• NameNode HA
• Kerberos support
Missing pieces

• Good starting point
• Good integration
• Still some points open
• Work for better integration (more general)
• Play with it!
Conclusions

We are hiring!
www.inovexperts.com

• https://issues.apache.org/jira/browse/SPARK-18278
• https://www.youtube.com/watch?v=0xRHONrWwvU&feature=youtu.be
• https://www.youtube.com/watch?v=DxCDxi08HWo&feature=youtu.be
Further reading

inovex GmbH
Johannes.scheuermann@inovex.de
CC BY-NC-ND inovex.de +JohannesScheuermann
github.com/johscheuer
@johscheuer youtube.com/inovexGmbH
