Introduction to Spark 
Glenn K. Lockwood 
July 2014 
SAN DIEGO SUPERCOMPUTER CENTER
Outline 
I. Hadoop/MapReduce Recap and Limitations 
II. Complex Workflows and RDDs 
III. The Spark Framework 
IV. Spark on Gordon 
V. Practical Limitations of Spark 
SAN DIEGO SUPERCOMPUTER CENTER
Map/Reduce Parallelism 
[Diagram: the input data is split into blocks, and each block is processed in parallel by an independent task (task 0 through task 5)] 
SAN DIEGO SUPERCOMPUTER CENTER
Magic of HDFS 
SAN DIEGO SUPERCOMPUTER CENTER
Hadoop Workflow 
SAN DIEGO SUPERCOMPUTER CENTER
MapReduce Disk Spill 
1. Map – convert raw input into key/value pairs. Output to local disk ("spill") 
2. Shuffle/Sort – All reducers retrieve all spilled records from all mappers over network 
3. Reduce – For each unique key, do something with all the corresponding values. Output to HDFS 
[Diagram: three Map tasks spill to local disk, the Shuffle/Sort stage moves records over the network, and three Reduce tasks write output to HDFS] 
SAN DIEGO SUPERCOMPUTER CENTER
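To make the three phases concrete, here is an illustrative sketch in plain Python (not from the original slides, and purely a toy word count run in one process; in Hadoop the shuffle happens over the network and the spills live on local disk): 

# Toy sketch of the three MapReduce phases for a word count 
def map_phase(lines): 
    # Map: convert raw input into (key, value) pairs 
    for line in lines: 
        for word in line.split(): 
            yield (word, 1) 

def shuffle_sort(pairs): 
    # Shuffle/Sort: gather all values for each key 
    groups = {} 
    for key, value in pairs: 
        groups.setdefault(key, []).append(value) 
    return groups 

def reduce_phase(groups): 
    # Reduce: do something (here, sum) with all values for each unique key 
    return dict((key, sum(values)) for key, values in groups.items()) 

counts = reduce_phase(shuffle_sort(map_phase(['call me ishmael', 'call me maybe'])))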
MapReduce: Two Fundamental Limitations 
1. MapReduce prescribes workflow. 
• You map, then you reduce. 
• You cannot reduce, then map... 
• ...or anything else. See first point. 
2. Full* data dump to disk between workflow steps. 
• Mappers deliver output on local disk (mapred.local.dir) 
• Reducers pull input over network from other nodes' local disks 
• Output goes right back to local disks via HDFS 
* Combiners do local reductions to prevent a full, unreduced dump of data to local disk 
[Diagram: Map tasks feeding Reduce tasks through the Shuffle/Sort stage] 
SAN DIEGO SUPERCOMPUTER CENTER
Beyond MapReduce 
• What if workflow could be arbitrary in length? 
• map-map-reduce 
• reduce-map-reduce 
• What if higher-level map/reduce operations 
could be applied? 
• sampling or filtering of a large dataset 
• mean and variance of a dataset 
• sum/subtract all elements of a dataset 
• SQL JOIN operator 
SAN DIEGO SUPERCOMPUTER CENTER
Beyond MapReduce: Complex 
Workflows 
• What if workflow could be arbitrary in length? 
• map-map-reduce 
• reduce-map-reduce 
How can you do this without flushing intermediate 
results to disk after every operation? 
• What if higher-level map/reduce operations 
could be applied? 
• sampling or filtering of a large dataset 
• mean and variance of a dataset 
• sum/subtract all elements of a dataset 
• SQL JOIN operator 
How can you ensure fault tolerance for all of these 
baked-in operations? 
SAN DIEGO SUPERCOMPUTER CENTER
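As a concrete illustration of an arbitrary-length workflow (not from the original slides), a chained pipeline in PySpark might look like the sketch below, assuming an existing SparkContext named sc (introduced later in the deck) and a hypothetical input path; no intermediate result is flushed to disk between steps: 

# Hypothetical map-filter-map-reduce chain 
rdd = sc.textFile('hdfs://master.ibnet0/user/glock/mobydick.txt') 
total = ( rdd.flatMap(lambda line: line.split())   # map: lines -> words 
             .filter(lambda word: len(word) > 3)   # higher-level filtering of the dataset 
             .map(lambda word: len(word))          # map again 
             .reduce(lambda a, b: a + b) )         # reduce to a single value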
MapReduce Fault Tolerance 
Mapper Failure: 
1. Re-run map task and spill to disk 
2. Block until finished 
3. Reducers proceed as normal 
Reducer Failure: 
1. Re-fetch spills from all mappers' disks 
2. Re-run reducer task 
[Diagram: three Map tasks feeding three Reduce tasks] 
SAN DIEGO SUPERCOMPUTER CENTER
Performing Complex Workflows 
How can you do complex workflows without 
flushing intermediate results to disk after every 
operation? 
1. Cache intermediate results in-memory 
2. Allow users to specify persistence in memory and 
partitioning of dataset across nodes 
How can you ensure fault tolerance? 
1. Coarse-grained atomicity via partitions (transform 
chunks of data, not record-by-record) 
2. Use transaction logging--forget replication 
SAN DIEGO SUPERCOMPUTER CENTER
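A hedged PySpark sketch of these two ideas, in-memory caching and user-specified partitioning, is shown below; the input path and partition count are assumptions, not from the slides: 

from pyspark import StorageLevel 

# Build a key/value RDD, spread it across the cluster, and pin it in memory 
pairs = sc.textFile('hdfs://master.ibnet0/user/glock/events.txt') \
          .map(lambda line: (line.split()[0], line)) 
pairs = pairs.partitionBy(64)                  # user-specified partitioning across nodes 
pairs.persist(StorageLevel.MEMORY_ONLY)        # keep intermediate results in memory 
pairs.count()                                  # first action materializes the cached RDD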
Resilient Distributed Dataset (RDD) 
• Comprised of distributed, atomic partitions of elements 
• Apply transformations to generate new RDDs 
• RDDs are immutable (read-only) 
• RDDs can only be created from persistent storage (e.g., 
HDFS, POSIX, S3) or by transforming other RDDs 
# Create an RDD from a file on HDFS 
text = sc.textFile('hdfs://master.ibnet0/user/glock/mobydick.txt') 
# Transform the RDD of lines into an RDD of words (one word per element) 
words = text.flatMap( lambda line: line.split() ) 
# Transform the RDD of words into an RDD of key/value pairs 
keyvals = words.map( lambda word: (word, 1) ) 
sc is a SparkContext object that describes our Spark cluster 
lambda declares a "lambda function" in Python (an anonymous function in Perl and other languages) 
SAN DIEGO SUPERCOMPUTER CENTER
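A reduceByKey step (not shown on the slide) would complete this word count; keyvals is the RDD of (word, 1) pairs built above: 

# Transform the RDD of (word, 1) pairs into an RDD of (word, count) pairs 
counts = keyvals.reduceByKey(lambda a, b: a + b) 
# counts is still just an RDD definition; nothing executes until an action is called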
Potential RDD Workflow 
SAN DIEGO SUPERCOMPUTER CENTER
RDD Transformation vs. Action 
• Transformations are lazy: nothing actually happens when 
this code is evaluated 
• RDDs are computed only when an action is called on 
them, e.g., 
• Calculate statistics over the elements of an RDD (count, mean) 
• Save the RDD to a file (saveAsTextFile) 
• Reduce elements of an RDD into a single object or value (reduce) 
• Allows you to define partitioning/caching behavior after 
defining the RDD but before calculating its contents 
SAN DIEGO SUPERCOMPUTER CENTER
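For example (a minimal sketch along the lines of the word count above, with a hypothetical length threshold), nothing runs until the final line: 

# Transformations only build a recipe... 
words = sc.textFile('hdfs://master.ibnet0/user/glock/mobydick.txt') \
          .flatMap(lambda line: line.split()) 
longwords = words.filter(lambda word: len(word) > 7)   # still nothing computed 
# ...the action below is what actually triggers execution across the cluster 
print longwords.count()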
RDD Transformation vs. Action 
• Must insert an action here to get pipeline to execute. 
• Actions create files or objects: 
# The saveAsTextFile action dumps the contents of an RDD to disk 
>>> rdd.saveAsTextFile('hdfs://master.ibnet0/user/glock/output.txt') 
# The count action returns the number of elements in an RDD 
>>> num_elements = rdd.count(); num_elements; type(num_elements) 
215136 
<type 'int'> 
SAN DIEGO SUPERCOMPUTER CENTER
Resiliency: The 'R' in 'RDD' 
• No replication of in-memory data 
• Restrict transformations to coarse granularity 
• Partition-level operations simplify data lineage 
SAN DIEGO SUPERCOMPUTER CENTER
Resiliency: The 'R' in 'RDD' 
• Reconstruct missing data from its lineage 
• Data in RDDs are deterministic since partitions 
are immutable and atomic 
SAN DIEGO SUPERCOMPUTER CENTER
Resiliency: The 'R' in 'RDD' 
• Long lineages or complex interactions 
(reductions, shuffles) can be checkpointed 
• RDD immutability → nonblocking (background) checkpointing 
SAN DIEGO SUPERCOMPUTER CENTER
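A hedged sketch of explicit checkpointing in PySpark follows; the checkpoint directory and the RDD name rdd are assumptions for illustration: 

# Truncate a long lineage by checkpointing to persistent storage 
sc.setCheckpointDir('hdfs://master.ibnet0/user/glock/checkpoints') 
rdd = rdd.cache()          # keep the RDD in memory for reuse 
rdd.checkpoint()           # mark it for checkpointing; materialized at the next action 
rdd.count()                # action: computes the RDD, caches it, and writes the checkpoint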
Introduction to Spark 
SPARK: AN IMPLEMENTATION 
OF RDDS 
SAN DIEGO SUPERCOMPUTER CENTER
Spark Framework 
• Master/worker Model 
• Spark Master is analogous to Hadoop Jobtracker (MRv1) 
or Application Master (MRv2) 
• Spark Worker is analogous to Hadoop Tasktracker 
• Relies on "3rd party" storage for RDD generation 
(hdfs://, s3n://, file://, http://) 
• Spark clusters take three forms: 
• Standalone mode - workers communicate directly with 
master via spark://master:7077 URI 
• Mesos - mesos://master:5050 URI 
• YARN - no HA; complicated job launch 
SAN DIEGO SUPERCOMPUTER CENTER
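In standalone mode, a PySpark driver points its SparkContext at the master's spark:// URI, roughly as in the sketch below (the hostname, port, and application name are assumptions): 

from pyspark import SparkConf, SparkContext 

conf = SparkConf().setMaster('spark://master.ibnet0:7077').setAppName('example') 
sc = SparkContext(conf=conf) 
rdd = sc.textFile('hdfs://master.ibnet0:54310/user/glock/mydata.txt')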
Spark on Gordon: Configuration 
1. Standalone mode is the simplest configuration 
and execution model (similar to MRv1) 
2. Leverage existing HDFS support in myHadoop 
for storage 
3. Combine #1 and #2 to extend myHadoop to 
support Spark: 
$ export HADOOP_CONF_DIR=/home/glock/hadoop.conf 
$ myhadoop-configure.sh 
... 
myHadoop: Enabling experimental Spark support 
myHadoop: Using SPARK_CONF_DIR=/home/glock/hadoop.conf/spark 
myHadoop: 
To use Spark, you will want to type the following commands:" 
source /home/glock/hadoop.conf/spark/spark-env.sh 
myspark start 
SAN DIEGO SUPERCOMPUTER CENTER
Spark on Gordon: Storage 
• Spark can use HDFS 
$ start-dfs.sh # after you run myhadoop-configure.sh, of course 
... 
$ pyspark 
>>> mydata = sc.textFile('hdfs://localhost:54310/user/glock/mydata.txt') 
>>> mydata.count() 
982394 
• Spark can use POSIX file systems too 
$ pyspark 
>>> mydata = sc.textFile('file:///oasis/scratch/glock/temp_project/mydata.txt') 
>>> mydata.count() 
982394 
• S3 Native (s3n://) and HTTP (http://) also work 
• file:// input will be served in chunks to Spark 
workers via the Spark driver's built-in httpd 
SAN DIEGO SUPERCOMPUTER CENTER
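For completeness, an s3n:// read looks the same as the HDFS and POSIX examples above; the bucket name below is hypothetical, and the AWS credentials must already be available to the underlying Hadoop configuration: 

>>> mydata = sc.textFile('s3n://my-bucket/mydata.txt') 
>>> mydata.count()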
Spark on Gordon: Running 
Spark treats several languages as first-class 
citizens: 
Feature Scala Java Python 
Interactive YES NO YES 
Shark (SQL) YES YES YES 
Streaming YES YES NO 
MLlib YES YES YES 
GraphX YES YES NO 
R is a second-class citizen; basic RDD API is 
available outside of CRAN 
(http://amplab-extras.github.io/SparkR-pkg/) 
SAN DIEGO SUPERCOMPUTER CENTER
myHadoop/Spark on Gordon (1/2) 
#!/bin/bash 
#PBS -l nodes=2:ppn=16:native:flash 
#PBS -l walltime=00:30:00 
#PBS -q normal 
### Environment setup for Hadoop 
export MODULEPATH=/home/glock/apps/modulefiles:$MODULEPATH 
module load hadoop/2.2.0 
export HADOOP_CONF_DIR=$HOME/mycluster.conf 
myhadoop-configure.sh 
### Start HDFS. Starting YARN isn't necessary since Spark will be running in 
### standalone mode on our cluster. 
start-dfs.sh 
### Load in the necessary Spark environment variables 
source $HADOOP_CONF_DIR/spark/spark-env.sh 
### Start the Spark masters and workers. Do NOT use the start-all.sh provided 
### by Spark, as they do not correctly honor $SPARK_CONF_DIR 
myspark start 
SAN DIEGO SUPERCOMPUTER CENTER
myHadoop/Spark on Gordon (2/2) 
### Run our example problem. 
### Step 1. Load data into HDFS (Hadoop 2.x does not make the user's HDFS home 
### dir by default which is different from Hadoop 1.x!) 
hdfs dfs -mkdir -p /user/$USER 
hdfs dfs -put /home/glock/hadoop/run/gutenberg.txt /user/$USER/gutenberg.txt 
### Step 2. Run our Python Spark job. Note that Spark implicitly requires 
### Python 2.6 (some features, like MLLib, require 2.7) 
module load python scipy 
/home/glock/hadoop/run/wordcount-spark.py 
### Step 3. Copy output back out 
hdfs dfs -get /user/$USER/output.dir $PBS_O_WORKDIR/ 
### Shut down Spark and HDFS 
myspark stop 
stop-dfs.sh 
### Clean up 
myhadoop-cleanup.sh 
SAN DIEGO SUPERCOMPUTER CENTER 
Wordcount submit script and Python code online: 
https://github.com/glennklockwood/sparktutorial
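The wordcount-spark.py script itself lives in the repository above; a minimal sketch of what such a script might contain is shown below. The HDFS URI, application name, and output path are assumptions pieced together from the job script, and the master URI is assumed to come from the environment set up by spark-env.sh: 

#!/usr/bin/env python 
# Hypothetical sketch of wordcount-spark.py; see the GitHub repository for the real script 
from pyspark import SparkConf, SparkContext 

conf = SparkConf().setAppName('wordcount') 
sc = SparkContext(conf=conf) 

text = sc.textFile('hdfs://localhost:54310/user/glock/gutenberg.txt') 
counts = ( text.flatMap(lambda line: line.split()) 
               .map(lambda word: (word, 1)) 
               .reduceByKey(lambda a, b: a + b) ) 
counts.saveAsTextFile('hdfs://localhost:54310/user/glock/output.dir')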
Introduction to Spark 
PRACTICAL LIMITATIONS 
SAN DIEGO SUPERCOMPUTER CENTER
Major Problems with Spark 
1. Still smells like a CS project 
2. Debugging is a dark art 
3. Not battle-tested at scale 
SAN DIEGO SUPERCOMPUTER CENTER
#1: Spark Smells Like CS 
• Components are constantly breaking 
• Graph.partitionBy broken in 1.0.0 (SPARK-1931) 
• Some components never worked 
• SPARK_CONF_DIR (start-all.sh) doesn't work (SPARK-2058) 
• stop-master.sh doesn't work 
• Spark with YARN will break with large data sets (SPARK-2398) 
• spark-submit for standalone mode doesn't work (SPARK-2260) 
SAN DIEGO SUPERCOMPUTER CENTER
#1: Spark Smells Like CS 
• Really obvious usability issues: 
>>> data = sc.textFile('file:///oasis/scratch/glock/temp_project/gutenberg.txt') 
>>> data.saveAsTextFile('hdfs://gcn-8-42.ibnet0:54310/user/glock/output.dir') 
14/04/30 16:23:07 ERROR Executor: Exception in task ID 19 
scala.MatchError: 0 (of class java.lang.Integer) 
at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:110) 
at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:153) 
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96) 
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241) 
... 
at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49) 
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178) 
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) 
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) 
at java.lang.Thread.run(Thread.java:722) 
Read an RDD, then write it out = unhandled exception with 
cryptic Scala errors from Python (SPARK-1690) 
SAN DIEGO SUPERCOMPUTER CENTER
#2: Debugging is a Dark Art 
>>> data.saveAsTextFile('hdfs://s12ib:54310/user/glock/gutenberg.out') 
Traceback (most recent call last): 
File "<stdin>", line 1, in <module> 
File "/N/u/glock/apps/spark-0.9.0/python/pyspark/rdd.py", line 682, in 
saveAsTextFile 
keyed._jrdd.map(self.ctx._jvm.BytesToString()).saveAsTextFile(path) 
File "/N/u/glock/apps/spark-0.9.0/python/lib/py4j-0.8.1- 
src.zip/py4j/java_gateway.py", line 537, in __call__ 
File "/N/u/glock/apps/spark-0.9.0/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py", 
line 300, in get_return_value 
py4j.protocol.Py4JJavaError: An error occurred while calling o23.saveAsTextFile. 
: org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot communicate with 
client version 4 
at org.apache.hadoop.ipc.Client.call(Client.java:1070) 
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225) 
at $Proxy7.getProtocolVersion(Unknown Source) 
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396) 
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379) 
Cause: Spark built against Hadoop 2 DFS trying to access data 
on Hadoop 1 DFS 
SAN DIEGO SUPERCOMPUTER CENTER
#2: Debugging is a Dark Art 
>>> data.count() 
14/04/30 16:15:11 ERROR Executor: Exception in task ID 12 
org.apache.spark.api.python.PythonException: Traceback (most recent call last): 
File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop1/python/pyspark/worker.py", 
serializer.dump_stream(func(split_index, iterator), outfile) 
File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop1/python/pyspark/serializers. 
self.serializer.dump_stream(self._batched(iterator), stream) 
File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop1/python/pyspark/serializers. 
for obj in iterator: 
File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop1/python/pyspark/serializers. 
for item in iterator: 
File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop1/python/pyspark/rdd.py", lin 
if acc is None: 
TypeError: an integer is required 
at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:131) 
at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:153) 
... 
Cause: Master was using Python 2.6, but workers were only 
able to find Python 2.4 
SAN DIEGO SUPERCOMPUTER CENTER
#2: Debugging is a Dark Art 
>>> data.saveAsTextFile('hdfs://user/glock/output.dir/') 
14/04/30 17:53:20 WARN scheduler.TaskSetManager: Loss was due to org.apache.spark.api.p 
org.apache.spark.api.python.PythonException: Traceback (most recent call last): 
File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/worker.py", 
serializer.dump_stream(func(split_index, iterator), outfile) 
File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/serializers. 
for obj in iterator: 
File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/rdd.py", lin 
if not isinstance(x, basestring): 
SystemError: unknown opcode 
at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:131) 
at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:153) 
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96) 
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241) 
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232) 
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) 
... 
Cause: Master was using Python 2.6, but workers were only 
able to find Python 2.4 
SAN DIEGO SUPERCOMPUTER CENTER
#2: Spark Debugging Tips 
• $SPARK_LOG_DIR/app-* contains master/worker 
logs with failure information 
• Try to find the salient error amidst the stack traces 
• Google that error--odds are, it is a known issue 
• Stick any required environment variables ($PATH, 
$PYTHONPATH, $JAVA_HOME) in 
$SPARK_CONF_DIR/spark-env.sh to rule out 
these problems 
• All else fails, look at Spark source code 
SAN DIEGO SUPERCOMPUTER CENTER
#3: Spark Isn't Battle Tested 
• Companies (Cloudera, SAP, etc) jumping on the 
Spark bandwagon with disclaimers about scaling 
• Spark does not handle multitenancy well at all. 
Wait scheduling is considered the best way to achieve 
memory/disk data locality 
• Largest Spark clusters ~ hundreds of nodes 
SAN DIEGO SUPERCOMPUTER CENTER
Spark Take-Aways 
SAN DIEGO SUPERCOMPUTER CENTER 
• FACTS 
• Data is represented as resilient distributed datasets 
(RDDs) which remain in-memory and read-only 
• RDDs are comprised of elements 
• Elements are distributed across physical nodes in user-defined 
groups called partitions 
• RDDs are subject to transformations and actions 
• Fault tolerance achieved by lineage, not replication 
• Opinions 
• Spark is still in its infancy but its progress is promising 
• Good for evaluating--good for Gordon, Comet
Introduction to Spark 
PAGERANK EXAMPLE 
(INCOMPLETE) 
SAN DIEGO SUPERCOMPUTER CENTER
Lazy Evaluation + In-Memory Caching = 
Optimized JOIN Operations 
Start every webpage with a rank R = 1.0 
1. For each webpage that links to N neighbor webpages, 
have it "contribute" R/N to each of its N neighbors 
2. Then, for each webpage, set its rank R to (0.15 + 
0.85 * contributions) 
3. Repeat 
insert flow diagram here 
SAN DIEGO SUPERCOMPUTER CENTER
Lazy Evaluation + In-Memory Caching = 
Optimized JOIN Operations 
from operator import add  # needed by reduceByKey below

lines = sc.textFile('hdfs://master.ibnet0:54310/user/glock/links.txt') 

# Load key/value pairs of (url, link), eliminate duplicates, and partition them such 
# that all common keys are kept together. Then retain this RDD in memory. 
links = lines.map(lambda urls: urls.split()).distinct().groupByKey().cache() 

# Create a new RDD of key/value pairs of (url, rank) and initialize all ranks to 1.0 
ranks = links.map(lambda (url, neighbors): (url, 1.0)) 

# Calculate and update URL rank (computeContribs is defined in the editor's notes below) 
for iteration in range(10): 
    # Calculate URL contributions to their neighbors 
    contribs = links.join(ranks).flatMap( 
        lambda (url, (urls, rank)): computeContribs(urls, rank)) 
    # Recalculate URL ranks based on neighbor contributions 
    ranks = contribs.reduceByKey(add).mapValues(lambda rank: 0.15 + 0.85*rank) 

# Print all URLs and their ranks 
for (link, rank) in ranks.collect(): 
    print '%s has rank %s' % (link, rank) 
SAN DIEGO SUPERCOMPUTER CENTER
Editor's Notes 
• #39: groupByKey: group the values for each key in the RDD into a single sequence 
mapValues: apply map function to all values of key/value pairs without modifying keys (or their partitioning) 
collect: return a list containing all elements of the RDD 

def computeContribs(urls, rank): 
    """Calculates URL contributions to the rank of other URLs.""" 
    num_urls = len(urls) 
    for url in urls: 
        yield (url, rank / num_urls)