Data-aware scheduling
Spark on Kubernetes with HDFS
Johannes M. Scheuermann
Karlsruhe, 18.10.2017
Johannes M. Scheuermann
IT Engineering & Operations @ inovex
〉 Software-Defined Datacenters
〉 Infrastructure as Code
〉 Cloud technologies
〉 High Availability & Scalability
〉 Wants an IBM Z-Frame
〉 @johscheuer
Data-aware scheduling
• Why data-aware scheduling
• Data-aware scheduling for non-Big-Data applications
• Data-aware scheduler
• Big Data on Kubernetes
• Spark on Kubernetes
• HDFS on Kubernetes
Agenda
• I’m not a scheduling expert
• Concept is/was a PoC
• Share learnings/ideas
• Get feedback from the community
Spoiler(s) :)
Data-locality
Why data-locality?
Data-aware scheduling for non-Big-Data
• Databases
• (large) image processing
• Video encoding
• (Web)-Cache
• Distributed (parallel) POSIX file system
• Any workload with high performance (incl. throughput, databases, small files)
• Can be deployed in containers, on kubelet hosts.
• Linearly scalable performance.
• Fully fault-tolerant, split-brain safe
Quobyte – What is Quobyte
Quobyte - Architecture
• Metadata servers make placement decisions against policies
• on file level
• tiering, isolation, …
• keep stripes of files on disks of the same machine => enable local read
• allow preferring writes to local storage servers => enable local write
• Locality information can be retrieved per file
• that’s where the scheduler hooks in
Quobyte - Placement
Running multiple schedulers
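Kubernetes lets a pod name the scheduler that should place it through the `spec.schedulerName` field; pods that omit it are handled by the default scheduler. The following is a minimal sketch of a pod manifest that opts in to a second scheduler. The helper `pod_manifest`, the pod name, the image, and the scheduler name `data-aware-scheduler` are all hypothetical and are not taken from the talk.

```python
import json


def pod_manifest(name, image, scheduler_name):
    """Build a minimal Pod manifest that requests a specific scheduler."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Pods without this field are placed by "default-scheduler".
            "schedulerName": scheduler_name,
            "containers": [{"name": name, "image": image}],
        },
    }


# Hypothetical names, for illustration only.
manifest = pod_manifest("encoder", "example/encoder:latest", "data-aware-scheduler")
print(json.dumps(manifest, indent=2))
```

Because each pod carries its scheduler name, a custom scheduler can run next to the default one and only look at the pods addressed to it.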
• Specify the wanted data
• Look up the data placement
• Remap if the storage runs in containers
• Schedule the pod
Scheduling data-aware (file-based)
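The four steps above can be sketched as a small node-selection function. This is a minimal sketch under stated assumptions, not the scheduler from the talk: the function name `pick_node`, the shape of the dictionaries, and the sample paths, storage pods, and node names are hypothetical. A real scheduler would obtain the placement from the storage system and then bind the pod through the Kubernetes API.

```python
from collections import Counter


def pick_node(wanted_files, placement, storage_pod_to_node, schedulable_nodes):
    """Return the schedulable node that holds most of the wanted files, or None."""
    schedulable = set(schedulable_nodes)
    votes = Counter()
    for path in wanted_files:  # step 1: the data the pod asks for
        # Step 2: look up which storage servers hold this file.
        servers = placement.get(path, [])
        # Step 3: a storage server running as a pod is remapped to its host node.
        nodes = {storage_pod_to_node.get(server, server) for server in servers}
        for node in nodes & schedulable:
            votes[node] += 1  # each file counts once per node
    if not votes:
        return None  # no locality information: leave the choice to a fallback
    # Step 4: most local files wins; ties are broken by node name.
    return min(votes, key=lambda node: (-votes[node], node))


# Hypothetical sample data.
placement = {
    "/data/a.mp4": ["storage-pod-1"],
    "/data/b.mp4": ["storage-pod-1", "storage-pod-2"],
}
pod_to_node = {"storage-pod-1": "node-a", "storage-pod-2": "node-b"}
nodes = ["node-a", "node-b", "node-c"]

chosen = pick_node(["/data/a.mp4", "/data/b.mp4"], placement, pod_to_node, nodes)
print(chosen)
```

Counting files per node is only one possible scoring rule; weighting by file size or stripe count would be a natural variation.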
Scheduler Architecture (4000ft)
Scheduler Architecture (1000ft)
Scheduler Architecture (containerized)
Benchmarks
(Spark) Big-Data on Kubernetes
• https://github.com/apache-spark-on-k8s/spark
• Not the “faked” Spark on Kubernetes
• Still in development
• Still not in the official Apache Spark project
• Current: v2.2.0-0.4.0
• Alpha/Beta ?!
Spark on Kubernetes
Spark on Kubernetes
Spark Core
Mesos | YARN | Standalone | Kubernetes
Streaming | MLlib | SparkSQL | GraphX
• Integrates with Kubernetes
• RBAC
• Resource Quotas
• Audit logging
• Etc.
• Only cluster-mode
Spark on Kubernetes
Spark on Kubernetes (cluster-mode)
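In cluster mode the driver itself runs as a pod inside the cluster, so a job is submitted against the Kubernetes API server with a `k8s://` master URL. The sketch below only assembles such a `spark-submit` command line. The helper `spark_submit_args`, the API server address, the class name, and the jar path are hypothetical, and the container image settings that the fork also needs are left to `extra_conf` because their property names depend on the release in use.

```python
import shlex


def spark_submit_args(api_server, main_class, app_jar, executors=2, extra_conf=None):
    """Assemble a spark-submit command line for Kubernetes cluster mode."""
    args = [
        "bin/spark-submit",
        "--deploy-mode", "cluster",           # the driver runs as a pod in the cluster
        "--master", f"k8s://{api_server}",    # Kubernetes API server as the master
        "--class", main_class,
        "--conf", f"spark.executor.instances={executors}",
    ]
    # Release-specific settings (for example container images) are passed through.
    for key, value in sorted((extra_conf or {}).items()):
        args += ["--conf", f"{key}={value}"]
    args.append(app_jar)
    return args


# Hypothetical values, for illustration only.
args = spark_submit_args(
    "https://k8s.example:6443", "org.example.Main", "local:///opt/app/app.jar"
)
print(" ".join(shlex.quote(arg) for arg in args))
```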
Spark + HDFS on Kubernetes
[Diagram: Driver 1, Driver 2, Executor 1, Executor 2.1, Executor 2.2, HDFS NameNode (NN), HDFS DataNodes (DN 1, DN 2); legend: pod network, host network]
Demo
• Rack-locality
• Node preferences
• Priority-based scheduling (K8s 1.8 alpha)
• NameNode HA
• Kerberos support
Missing pieces
Conclusions
• Good starting point
• Good integration
• Still some points open
• Work for better integration (more general)
• Play with it!
We are hiring!
www.inovexperts.com
Q&A
• https://issues.apache.org/jira/browse/SPARK-18278
• https://www.youtube.com/watch?v=0xRHONrWwvU&feature=youtu.be
• https://www.youtube.com/watch?v=DxCDxi08HWo&feature=youtu.be
Further reading
Johannes M. Scheuermann
inovex GmbH
Johannes.scheuermann@inovex.de
CC BY-NC-ND
inovex.de
+JohannesScheuermann
github.com/johscheuer
@johscheuer
youtube.com/inovexGmbH

Meetup Kubernetes Rhein-Neckar