Apache Spark Operations

© Cloudera, Inc. All rights reserved.
Spark Operations
Kostas Sakellis

Me
• Software Engineer at Cloudera
• Contributor to Apache Spark
• Before that, contributed to Cloudera Manager

Building a proof of concept!
Courtesy of: http://www.nefloridadesign.com/mbimages/6.jpg

Example
sc.textFile("hdfs://data/u.item", 4)
.map(Movie(_))
.filter(_.month.equals("Nov"))
.collect()
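
To make the example concrete, here is a minimal sketch of what Movie might look like, run on a local collection instead of an RDD so it needs no cluster. The pipe-delimited layout (id|title|release date as dd-Mon-yyyy) is an assumption about u.item; the Movie fields, the MovieExample object, and the sample lines are illustrative, not taken from the talk.

```scala
// Hypothetical Movie type: the slide only shows that Movie(line) exists
// and that the result exposes a `month` field.
case class Movie(title: String, month: String)

object Movie {
  // Assumed record layout: id|title|dd-Mon-yyyy|...
  def apply(line: String): Movie = {
    val fields = line.split('|')
    val dateParts = fields(2).split('-')
    // A missing or malformed date yields an empty month.
    val month = if (dateParts.length == 3) dateParts(1) else ""
    Movie(fields(1), month)
  }
}

object MovieExample {
  // Stand-in for sc.textFile(...): a few hand-written sample lines.
  val lines: Seq[String] = Seq(
    "1|Movie A|01-Jan-1995",
    "2|Movie B|22-Nov-1995",
    "3|Movie C|10-Nov-1996"
  )

  // Same shape as the slide's pipeline, applied to a local Seq.
  def novemberMovies(input: Seq[String]): Seq[Movie] =
    input.map(Movie(_)).filter(_.month.equals("Nov"))

  def main(args: Array[String]): Unit =
    novemberMovies(lines).foreach(m => println(m.title))
}
```

On a real cluster the Seq would be replaced by the RDD returned from sc.textFile, and collect() would bring the filtered movies back to the driver.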

Partitions
[Figure: the HDFS file read as four partitions, Partition 1 to Partition 4.]

RDDs
[Figure, built up over four slides: HDFS feeding a chain of three RDDs, each with Partition 1 to Partition 4, ending in Collect.]

RDD Lineage
[Figure: HDFS feeding a chain of three RDDs, each with Partition 1 to Partition 4, ending in Collect; the chain is labeled Lineage.]
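
Lineage means each RDD remembers its parent and the transformation that produced it, so any partition can be recomputed from the source. A toy model in plain Scala may help; it is not Spark's implementation, and Dataset, Source, and Mapped are names invented for this sketch.

```scala
object LineageSketch {
  // A toy dataset that knows how to compute a partition and can
  // describe the chain of steps that leads back to its source.
  sealed trait Dataset[A] {
    def compute(partition: Int): Seq[A]
    def lineage: List[String]
    def map[B](label: String)(f: A => B): Dataset[B] =
      new Mapped(this, label, f)
  }

  // The root of the chain: data already split into partitions.
  final class Source[A](name: String, partitions: Vector[Seq[A]])
      extends Dataset[A] {
    def compute(partition: Int): Seq[A] = partitions(partition)
    def lineage: List[String] = List(name)
  }

  // A derived dataset: holds no data, only its parent and a function.
  final class Mapped[A, B](parent: Dataset[A], label: String, f: A => B)
      extends Dataset[B] {
    def compute(partition: Int): Seq[B] = parent.compute(partition).map(f)
    def lineage: List[String] = label :: parent.lineage
  }

  def main(args: Array[String]): Unit = {
    val source = new Source("HDFS", Vector(Seq(1, 2), Seq(3, 4)))
    val derived = source.map("addOne")(_ + 1).map("double")(_ * 2)
    println(derived.lineage.mkString(" <- "))
    println(derived.compute(1))
  }
}
```

Because a derived dataset stores only its parent and a function, a single partition can be rebuilt by replaying the chain for that partition alone.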

Task
[Figure: HDFS, three RDDs with Partition 1 to Partition 4 each, and Collect.]
• A pipelined set of transformations on a single thread
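
Pipelining can be illustrated without Spark: Scala iterators are lazy, so chained map and filter calls push one element through every step before touching the next, which is roughly how a task walks a single partition. This is an analogy on local data, not Spark's scheduler code; PipelineSketch, runTask, and the trace log are invented for the illustration.

```scala
import scala.collection.mutable.ArrayBuffer

object PipelineSketch {
  // Runs a map-then-filter pipeline over one "partition" and records
  // the order in which the two steps see each element.
  def runTask(partition: Seq[Int]): (List[Int], List[String]) = {
    val trace = ArrayBuffer.empty[String]
    val result = partition.iterator
      .map { x => trace += s"map($x)"; x * 10 }
      .filter { y => trace += s"filter($y)"; y > 10 }
      .toList // like collect(): forces the lazy pipeline to run
    (result, trace.toList)
  }

  def main(args: Array[String]): Unit = {
    val (result, trace) = runTask(Seq(1, 2, 3))
    println(trace.mkString(" "))
    println(result)
  }
}
```

The trace interleaves map and filter per element instead of finishing all maps first, so no intermediate collection is materialized between the two steps.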

Spark Architecture

Spark System Architecture

Deployments
• Spark supports pluggable Cluster Managers
• local, Standalone, YARN and Mesos
• In early 2014, CDH 4.x with Spark 0.9 only supported Standalone
• CDH 5.x includes Spark on YARN support

Standalone
[Figure labels: Client, Master, Worker (two), Process (two), App Master.]

Standalone
• On cluster
./sbin/start-master.sh
./sbin/start-slave.sh <master-spark-URL>
• Submit job
spark-submit --master <master-spark-URL> …

YARN Architecture
[Figure labels: Client, Resource Manager, Node Manager (two), Container (three), Process (two), App Master.]

Spark on YARN Architecture
[Figure labels: Client, Resource Manager, Node Manager (two), Container (three), Process (two), App Master.]

Spark on YARN
• Submit job (client mode)
spark-submit --master yarn-client …
• Cluster mode
spark-submit --master yarn-cluster …
• Spark shell only works in client mode!
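
The rules on this slide can be written down as a small helper that picks the --master value and rejects the one combination that is not allowed. This is only a sketch: YarnSubmit, DeployMode, and command are hypothetical names, and the master strings are the ones shown above.

```scala
object YarnSubmit {
  sealed trait DeployMode
  case object ClientMode extends DeployMode
  case object ClusterMode extends DeployMode

  // --master values as they appear on the slide.
  def masterFor(mode: DeployMode): String = mode match {
    case ClientMode  => "yarn-client"
    case ClusterMode => "yarn-cluster"
  }

  // Builds the command line, or returns Left with a reason when the
  // tool and deploy mode cannot be combined.
  def command(tool: String, mode: DeployMode): Either[String, Seq[String]] =
    if (tool == "spark-shell" && mode == ClusterMode)
      Left("spark-shell only works in client mode")
    else
      Right(Seq(tool, "--master", masterFor(mode)))

  def main(args: Array[String]): Unit = {
    println(command("spark-submit", ClusterMode))
    println(command("spark-shell", ClusterMode))
  }
}
```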

Customers often have shared infrastructure
Courtesy of: https://radioglobalistic.files.wordpress.com/2011/02/lagos-traffic.jpg

Multi-tenancy
• Cluster utilization is the top metric
• Target: 70-80% utilization
• Mixed workloads from mixed customers
• We recommend YARN
• Built-in resource manager

Underutilized Clusters
Courtesy of: http://media.nbclosangeles.com/images/1200*675/60-freeway-repair-dec16-2-empty.JPG

Dynamic Allocation
• Spark applications scale the number of executors based on load
• Removes need for: --num-executors
• Idle executors get killed
• First supported in CDH 5.4
• Ideal for:
• Long ETL jobs with large shuffles
• Shell applications: Hive and Spark shell
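
A sketch of the settings usually involved in turning dynamic allocation on, rendered as spark-submit --conf arguments. The property names are standard Spark configuration keys; the executor bounds are placeholder values, the "60s" time format assumes a Spark version that accepts time suffixes, and DynamicAllocationConf is a hypothetical helper, not part of Spark.

```scala
object DynamicAllocationConf {
  // Placeholder values: tune the bounds and timeout for your cluster.
  val settings: Seq[(String, String)] = Seq(
    "spark.dynamicAllocation.enabled" -> "true",
    // The external shuffle service keeps shuffle files available
    // after an idle executor is removed.
    "spark.shuffle.service.enabled" -> "true",
    "spark.dynamicAllocation.minExecutors" -> "2",
    "spark.dynamicAllocation.maxExecutors" -> "50",
    "spark.dynamicAllocation.executorIdleTimeout" -> "60s"
  )

  // Turns key/value pairs into repeated "--conf key=value" arguments.
  def toSubmitArgs(conf: Seq[(String, String)]): Seq[String] =
    conf.flatMap { case (key, value) => Seq("--conf", s"$key=$value") }

  def main(args: Array[String]): Unit =
    println(("spark-submit" +: toSubmitArgs(settings)).mkString(" "))
}
```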

Dynamic Allocation Limitations
• Still required to specify cores
• --executor-cores
• Memory
• --executor-memory
• Includes JVM overhead
• Need to do the math yourself
• Our customers still get it wrong!
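
The math in question, sketched under assumptions: on YARN each executor's container has to hold the --executor-memory heap plus an off-heap overhead. The 10% factor and 384 MB floor below follow commonly cited defaults for spark.yarn.executor.memoryOverhead, but they differ between Spark versions, so treat them as parameters to check against your release. ExecutorSizing and the node size in the example are made up, and YARN's rounding to its allocation increment is not modeled.

```scala
object ExecutorSizing {
  // Overhead in MB: a fraction of the heap, with a fixed minimum.
  // Both defaults are assumptions; verify them for your Spark version.
  def overheadMb(heapMb: Int, factor: Double = 0.10, floorMb: Int = 384): Int =
    math.max(floorMb, (heapMb * factor).toInt)

  // Memory YARN must grant for one executor container.
  def containerMb(heapMb: Int): Int = heapMb + overheadMb(heapMb)

  // How many such executors fit in the memory one NodeManager offers.
  def executorsPerNode(nodeMb: Int, heapMb: Int): Int =
    nodeMb / containerMb(heapMb)

  def main(args: Array[String]): Unit = {
    val heapMb = 8192
    val nodeMb = 65536
    println(s"container size: ${containerMb(heapMb)} MB")
    println(s"executors per node: ${executorsPerNode(nodeMb, heapMb)}")
  }
}
```

With an 8192 MB heap the container comes to 9011 MB, so a node offering 65536 MB fits 7 executors, not the 8 that dividing by the heap size alone would suggest.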

The Future of Dynamic Allocation
• Only “task size” needed: --task-size
• Eliminates
• --executor-cores
• --num-executors
• --executor-memory
• Leads to better cluster utilization

Security, now it’s getting serious.
Courtesy of: https://www.iti.illinois.edu/sites/default/files/Cybersecurity_image.jpg

Authentication
• Kerberos – the necessary evil
• Ubiquitous amongst other services
• YARN, HDFS, Hive, HBase, etc.
• Spark utilizes delegation tokens

Encryption
• Control plane: SPARK-6028 (replace with netty)
• File distribution: replace with netty
• Block Manager: Spark 1.4
• User UI / REST API: SPARK-2750 (SSL)
• Data-at-rest (shuffle files): SPARK-5682

Authorization
• Enterprises have sensitive data
• Beyond HDFS file permissions
• Partial access to data
• Column level granularity
• Apache Sentry
• HDFS-Sentry synchronization plugin
• Record Service
• Column level security for Spark!

Thank you
We’re Hiring!
