1© Cloudera, Inc. All rights reserved.
Spark Operations
Kostas Sakellis
Me
• Software Engineer at Cloudera
• Contributor to Apache Spark
• Before that, contributed to Cloudera Manager
Building a proof of concept!
Courtesy of: http://www.nefloridadesign.com/mbimages/6.jpg
Example
sc.textFile("hdfs://data/u.item", 4)
  .map(Movie(_))
  .filter(_.month.equals("Nov"))
  .collect()
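The slides never show how Movie is defined, so the following is only a minimal sketch of what the example might rely on. It assumes pipe-delimited lines of the form id|title|dd-MMM-yyyy|…; the Movie fields, the parsing logic, and the sample lines are all hypothetical. A plain Scala Seq stands in for the RDD that sc.textFile would return, so the sketch runs without a cluster.

```scala
// Hypothetical record type: the slides never show Movie's definition.
final case class Movie(id: Int, title: String, month: String)

object Movie {
  // Assumed input format: "id|title|dd-MMM-yyyy|..." (pipe-delimited).
  // No handling of malformed lines; a real job would need it.
  def apply(line: String): Movie = {
    val fields = line.split('|')
    val month = fields(2).split('-')(1) // "22-Nov-1995" -> "Nov"
    Movie(fields(0).toInt, fields(1), month)
  }
}

object MovieExample {
  // Made-up sample lines standing in for what sc.textFile(...) would yield.
  val lines: Seq[String] = Seq(
    "1|Movie A|01-Jan-1995|",
    "2|Movie B|22-Nov-1995|",
    "3|Movie C|10-Nov-1996|"
  )

  // Same shape as the slide: parse every line, keep November releases.
  def novemberMovies(input: Seq[String]): Seq[Movie] =
    input.map(Movie(_)).filter(_.month.equals("Nov"))

  def main(args: Array[String]): Unit =
    novemberMovies(lines).foreach(println)
}
```

On a cluster, the same map and filter calls would be applied to the RDD instead of the local Seq, and collect() would bring the matching records back to the driver.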
Partitions
sc.textFile("hdfs://data/u.item", 4)
  .map(Movie(_))
  .filter(_.month.equals("Nov"))
  .collect()
[Diagram: the file on HDFS split into Partitions 1-4]
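The second argument to textFile is a minimum number of partitions, so Spark may create more than four. As a rough illustration of what splitting into partitions means, the sketch below cuts an in-memory list into contiguous, near-equal slices. It is a toy stand-in: the object and function names are hypothetical, and real HDFS input is split by the Hadoop input format rather than by this arithmetic.

```scala
object PartitionSketch {
  // Cut `items` into `numSlices` contiguous, near-equal chunks.
  def slice[A](items: Vector[A], numSlices: Int): Vector[Vector[A]] = {
    require(numSlices > 0, "need at least one partition")
    (0 until numSlices).toVector.map { i =>
      val start = ((i.toLong * items.length) / numSlices).toInt
      val end = (((i + 1).toLong * items.length) / numSlices).toInt
      items.slice(start, end)
    }
  }

  def main(args: Array[String]): Unit = {
    val lines = (1 to 10).map(n => s"line-$n").toVector
    slice(lines, 4).zipWithIndex.foreach { case (part, idx) =>
      println(s"Partition ${idx + 1}: ${part.mkString(", ")}")
    }
  }
}
```

Each slice can be processed independently, which is why the number of partitions bounds the parallelism of a stage.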
RDDs
sc.textFile("hdfs://data/u.item", 4)
  .map(Movie(_))
  .filter(_.month.equals("Nov"))
  .collect()
[Diagram, built up over four slides: HDFS feeds a chain of three RDDs, each with Partitions 1-4, ending in Collect]
RDD Lineage
[Diagram: the same code and chain of RDDs (HDFS, three RDDs with Partitions 1-4, Collect), with the chain labeled "Lineage"]
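To make the idea of lineage concrete, here is a toy model in plain Scala. It is not Spark's implementation: ToyRdd, Source, Mapped, and Filtered are hypothetical names introduced only for this sketch. Each dataset records its parent and how to compute one partition from it, so a lost partition can be rebuilt by re-running the chain instead of being read from a replica.

```scala
// Toy model, not Spark's classes: each dataset knows how to compute one
// partition from its parent, so nothing is stored between steps.
sealed trait ToyRdd[A] {
  def numPartitions: Int
  def compute(partition: Int): Iterator[A]
  def lineage: List[String]

  def map[B](f: A => B): ToyRdd[B] = Mapped(this, f)
  def filter(p: A => Boolean): ToyRdd[A] = Filtered(this, p)
  // Action: evaluates every partition and gathers the results.
  def collect(): Vector[A] =
    (0 until numPartitions).toVector.flatMap(i => compute(i))
}

final case class Source[A](parts: Vector[Vector[A]]) extends ToyRdd[A] {
  def numPartitions: Int = parts.length
  def compute(partition: Int): Iterator[A] = parts(partition).iterator
  def lineage: List[String] = List("Source")
}

final case class Mapped[A, B](parent: ToyRdd[A], f: A => B) extends ToyRdd[B] {
  def numPartitions: Int = parent.numPartitions
  def compute(partition: Int): Iterator[B] = parent.compute(partition).map(f)
  def lineage: List[String] = "Mapped" :: parent.lineage
}

final case class Filtered[A](parent: ToyRdd[A], p: A => Boolean) extends ToyRdd[A] {
  def numPartitions: Int = parent.numPartitions
  def compute(partition: Int): Iterator[A] = parent.compute(partition).filter(p)
  def lineage: List[String] = "Filtered" :: parent.lineage
}

object LineageSketch {
  def main(args: Array[String]): Unit = {
    val rdd = Source(Vector(Vector(1, 2), Vector(3, 4))).map(_ * 10).filter(_ > 10)
    println(rdd.lineage.mkString(" <- ")) // Filtered <- Mapped <- Source
    // A lost partition is rebuilt by re-running its lineage, not from a replica.
    println(rdd.compute(1).toList)        // List(30, 40)
    println(rdd.collect())                // Vector(20, 30, 40)
  }
}
```

Nothing is computed until collect() or compute() is called, which mirrors how transformations only describe work while an action triggers it.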
Task
• A pipelined set of transformations on a single thread
[Diagram: the same RDD chain (HDFS, three RDDs with Partitions 1-4, Collect)]
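One way to picture a task is as a single pass over one partition in which the map and filter steps are fused. The sketch below uses plain Scala iterators and futures to imitate that. The names TaskSketch, runTask, and collect are hypothetical, the thread pool only plays the role of executors, and the string-length logic is a stand-in for the Movie parsing and month filter.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object TaskSketch {
  // One "task": map and filter are fused into a single pass over one
  // partition, so no intermediate collection is materialized.
  def runTask(partition: Vector[String]): Vector[Int] =
    partition.iterator
      .map(_.length)       // stand-in for .map(Movie(_))
      .filter(_ % 2 == 0)  // stand-in for .filter(_.month.equals("Nov"))
      .toVector

  // One task per partition; a thread pool plays the role of the executors.
  def collect(partitions: Vector[Vector[String]]): Vector[Int] = {
    val tasks = partitions.map(p => Future(runTask(p)))
    Await.result(Future.sequence(tasks), 30.seconds).flatten
  }

  def main(args: Array[String]): Unit = {
    val partitions =
      Vector(Vector("aa", "bbb"), Vector("cccc"), Vector("d"), Vector("eeeeee"))
    println(collect(partitions)) // Vector(2, 4, 6)
  }
}
```

Because each task touches only its own partition, four partitions give at most four tasks running at once for this stage.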
Spark Architecture
Spark System Architecture
Deployments
• Spark supports pluggable Cluster Managers
• local, Standalone, YARN and Mesos
• In early 2014, CDH 4.x with Spark 0.9 only supported Standalone
• CDH 5.x includes Spark on YARN support
Standalone
[Diagram: a Client, a Master, and two Workers; remaining labels: Process, App Master, Process]
Standalone
• On cluster
./sbin/start-master.sh
./sbin/start-slave.sh <master-spark-URL>
• Submit job
spark-submit --master <master-spark-URL> …
YARN Architecture
[Diagram: a Client, a Resource Manager, and two Node Managers hosting three Containers; remaining labels: Process, App Master, Process]
Spark on YARN Architecture
[Diagram, shown over two slides: a Client, a Resource Manager, and two Node Managers hosting three Containers; remaining labels: Process, App Master, Process]
Spark on YARN
• Submit job
spark-submit --master yarn-client …
• Cluster mode
spark-submit --master yarn-cluster …
• Spark shell only works in client mode!
Customers often have shared infrastructure
Courtesy of: https://radioglobalistic.files.wordpress.com/2011/02/lagos-traffic.jpg
Multi-tenancy
• Cluster utilization is the top metric
• Target: 70-80% utilization
• Mixed workloads from mixed customers
• We recommend YARN
• Built-in resource manager
Underutilized Clusters
Courtesy of: http://media.nbclosangeles.com/images/1200*675/60-freeway-repair-dec16-2-empty.JPG
Dynamic Allocation
• Spark applications scale the number of executors based on load
• Removes need for: --num-executors
• Idle executors get killed
• First supported in CDH 5.4
• Ideal for:
• Long ETL jobs with large shuffles
• Shell applications: Hive and Spark shell
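As a rough sketch of how dynamic allocation is typically switched on, the snippet below renders a handful of Spark properties as spark-submit --conf arguments. The property names come from the Spark configuration documentation, but the values (2 and 50 executors) are illustrative assumptions rather than recommendations, and the object and helper names are hypothetical.

```scala
object DynamicAllocationFlags {
  // Property names are from the Spark configuration docs; the values are
  // illustrative assumptions, not recommendations.
  val settings: Seq[(String, String)] = Seq(
    "spark.dynamicAllocation.enabled" -> "true",
    // The external shuffle service keeps shuffle files readable after an
    // idle executor is removed.
    "spark.shuffle.service.enabled" -> "true",
    "spark.dynamicAllocation.minExecutors" -> "2",
    "spark.dynamicAllocation.maxExecutors" -> "50"
  )

  // Render the settings as spark-submit arguments.
  def toArgs(conf: Seq[(String, String)]): Seq[String] =
    conf.flatMap { case (key, value) => Seq("--conf", s"$key=$value") }

  def main(args: Array[String]): Unit =
    println(("spark-submit" +: toArgs(settings)).mkString(" "))
}
```

With these set, --num-executors is no longer needed; the minimum and maximum bound how far the application can scale down or up.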
Dynamic Allocation Limitations
• Still required to specify cores
• --executor-cores
• Memory
• --executor-memory
• Includes JVM overhead
• Need to do the math yourself
• Our customers still get it wrong!
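The math in question can be sketched as follows. On YARN, the container requested for an executor is the executor memory plus an off-heap overhead. The sketch assumes that overhead is 10% of the executor memory with a 384 MB floor; both numbers have differed between Spark releases, so treat them as assumptions and check the documentation for your version. YARN also rounds requests up to its allocation increment, which this sketch ignores. All names here are hypothetical.

```scala
object ExecutorMemoryMath {
  // Assumed defaults for the overhead added on top of the executor heap:
  // a fraction of the executor memory with a fixed floor. Both values have
  // changed between Spark releases, so check the docs for your version.
  val OverheadFraction = 0.10
  val MinOverheadMb = 384

  def overheadMb(executorMemoryMb: Int): Int =
    math.max(MinOverheadMb, (executorMemoryMb * OverheadFraction).toInt)

  // Memory YARN must grant per executor container.
  def containerMb(executorMemoryMb: Int): Int =
    executorMemoryMb + overheadMb(executorMemoryMb)

  // How many executors fit in the memory a NodeManager offers.
  def executorsPerNode(nodeMemoryMb: Int, executorMemoryMb: Int): Int =
    nodeMemoryMb / containerMb(executorMemoryMb)

  def main(args: Array[String]): Unit = {
    println(containerMb(2048)) // 2048 + 384 = 2432
    println(containerMb(8192)) // 8192 + 819 = 9011
    println(executorsPerNode(nodeMemoryMb = 65536, executorMemoryMb = 8192)) // 7
  }
}
```

Under these assumptions, asking for 8 GB executors on a node that offers 64 GB to YARN yields seven executors, not eight, which is the kind of surprise the slide warns about.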
The Future of Dynamic Allocation
• Only “task size” needed: --task-size
• Eliminates
• --executor-cores
• --num-executors
• --executor-memory
• Leads to better cluster utilization
Security, now it's getting serious.
Courtesy of: https://www.iti.illinois.edu/sites/default/files/Cybersecurity_image.jpg
Authentication
• Kerberos – the necessary evil
• Ubiquitous amongst other services
• YARN, HDFS, Hive, HBase, etc.
• Spark utilizes delegation tokens
Encryption
• Control plane: SPARK-6028 (Replace with netty)
• File distribution: Replace with netty
• Block Manager: Spark 1.4
• User UI / REST API: SPARK-2750 (SSL)
• Data-at-rest (shuffle files): SPARK-5682
Authorization
• Enterprises have sensitive data
• Beyond HDFS file permissions
• Partial access to data
• Column level granularity
• Apache Sentry
• HDFS-Sentry synchronization plugin
• Record Service
• Column level security for Spark!
Thank you
We’re Hiring!

Editor's Notes

  • #2: Let's talk about the issues we have seen from our customers as they try to get Spark into production.
  • #3: In scope: a focus on operational issues, not on building the code itself. Based on experience from our customer support tickets.
  • #4: Spark makes building a proof of concept with a subset of data relatively easy. But then things go wrong. Plug for my talk at Hadoop Summit.
  • #5-#6: Let's start with an example program in Spark.
  • #7: The collect() call launches a job.
  • #8: A partition is a chunk of data somewhere. It could be on the Hadoop Distributed File System (HDFS) or cached in Spark. The number of partitions defines the degree of parallelism.
  • #9-#13: An RDD describes a way of generating input and output partitions. RDDs are immutable, which is very important. RDDs can depend on other RDDs: most have a single parent, while joins have multiple parents. Spark uses lineage rather than replication for fault tolerance. See https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
  • #15: Let's review the general Spark architecture.
  • #16: The driver is where the DAG scheduler lives. It drives the show and is a single point of failure. Executors communicate with the driver and run the tasks it creates; think of an executor as a ThreadPoolExecutor in Java. Cluster managers are pluggable: YARN, Mesos, standalone.
  • #32: Control plane; file distribution; Block Manager; user UI / REST API; data-at-rest (shuffle files).