Introduction to Spark 
Glenn K. Lockwood 
July 2014 
SAN DIEGO SUPERCOMPUTER CENTER
Outline 
I. Hadoop/MapReduce Recap and Limitations 
II. Complex Workflows and RDDs 
III. The Spark Framework 
IV. Spark on Gordon 
V. Practical Limitations of Spark 
SAN DIEGO SUPERCOMPUTER CENTER
Map/Reduce Parallelism 
[Diagram: the input data is split into blocks, and each block is processed in parallel by an independent task (task 0 through task 5)] 
SAN DIEGO SUPERCOMPUTER CENTER
Magic of HDFS 
SAN DIEGO SUPERCOMPUTER CENTER
Hadoop Workflow 
SAN DIEGO SUPERCOMPUTER CENTER
MapReduce Disk Spill 
1. Map – convert raw input into key/value pairs. Output to local disk ("spill") 
2. Shuffle/Sort – All reducers retrieve all spilled records from all mappers over network 
3. Reduce – For each unique key, do something with all the corresponding values. Output to HDFS 
[Diagram: three Map tasks spill to local disk, the Shuffle/Sort stage moves records over the network, and three Reduce tasks write output to HDFS] 
SAN DIEGO SUPERCOMPUTER CENTER
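To make the three phases concrete, here is an illustrative sketch in plain Python (not from the original slides, and purely a toy word count run in one process; in Hadoop the shuffle happens over the network and the spills live on local disk): 

# Toy sketch of the three MapReduce phases for a word count 
def map_phase(lines): 
    # Map: convert raw input into (key, value) pairs 
    for line in lines: 
        for word in line.split(): 
            yield (word, 1) 

def shuffle_sort(pairs): 
    # Shuffle/Sort: gather all values for each key 
    groups = {} 
    for key, value in pairs: 
        groups.setdefault(key, []).append(value) 
    return groups 

def reduce_phase(groups): 
    # Reduce: do something (here, sum) with all values for each unique key 
    return dict((key, sum(values)) for key, values in groups.items()) 

counts = reduce_phase(shuffle_sort(map_phase(['call me ishmael', 'call me maybe'])))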
MapReduce: Two Fundamental Limitations 
1. MapReduce prescribes workflow. 
• You map, then you reduce. 
• You cannot reduce, then map... 
• ...or anything else. See first point. 
2. Full* data dump to disk between workflow steps. 
• Mappers deliver output on local disk (mapred.local.dir) 
• Reducers pull input over network from other nodes' local disks 
• Output goes right back to local disks via HDFS 
* Combiners do local reductions to prevent a full, unreduced dump of data to local disk 
[Diagram: Map tasks feeding Reduce tasks through the Shuffle/Sort stage] 
SAN DIEGO SUPERCOMPUTER CENTER
Beyond MapReduce 
• What if workflow could be arbitrary in length? 
• map-map-reduce 
• reduce-map-reduce 
• What if higher-level map/reduce operations 
could be applied? 
• sampling or filtering of a large dataset 
• mean and variance of a dataset 
• sum/subtract all elements of a dataset 
• SQL JOIN operator 
SAN DIEGO SUPERCOMPUTER CENTER
Beyond MapReduce: Complex 
Workflows 
• What if workflow could be arbitrary in length? 
• map-map-reduce 
• reduce-map-reduce 
How can you do this without flushing intermediate 
results to disk after every operation? 
• What if higher-level map/reduce operations 
could be applied? 
• sampling or filtering of a large dataset 
• mean and variance of a dataset 
• sum/subtract all elements of a dataset 
• SQL JOIN operator 
How can you ensure fault tolerance for all of these 
baked-in operations? 
SAN DIEGO SUPERCOMPUTER CENTER
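As a concrete illustration of an arbitrary-length workflow (not from the original slides), a chained pipeline in PySpark might look like the sketch below, assuming an existing SparkContext named sc (introduced later in the deck) and a hypothetical input path; no intermediate result is flushed to disk between steps: 

# Hypothetical map-filter-map-reduce chain 
rdd = sc.textFile('hdfs://master.ibnet0/user/glock/mobydick.txt') 
total = ( rdd.flatMap(lambda line: line.split())   # map: lines -> words 
             .filter(lambda word: len(word) > 3)   # higher-level filtering of the dataset 
             .map(lambda word: len(word))          # map again 
             .reduce(lambda a, b: a + b) )         # reduce to a single value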
MapReduce Fault Tolerance 
Mapper Failure: 
1. Re-run map task and spill to disk 
2. Block until finished 
3. Reducers proceed as normal 
Reducer Failure: 
1. Re-fetch spills from all mappers' disks 
2. Re-run reducer task 
[Diagram: three Map tasks feeding three Reduce tasks] 
SAN DIEGO SUPERCOMPUTER CENTER
Performing Complex Workflows 
How can you do complex workflows without 
flushing intermediate results to disk after every 
operation? 
1. Cache intermediate results in-memory 
2. Allow users to specify persistence in memory and 
partitioning of dataset across nodes 
How can you ensure fault tolerance? 
1. Coarse-grained atomicity via partitions (transform 
chunks of data, not record-by-record) 
2. Use transaction logging--forget replication 
SAN DIEGO SUPERCOMPUTER CENTER
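A hedged PySpark sketch of these two ideas, in-memory caching and user-specified partitioning, is shown below; the input path and partition count are assumptions, not from the slides: 

from pyspark import StorageLevel 

# Build a key/value RDD, spread it across the cluster, and pin it in memory 
pairs = sc.textFile('hdfs://master.ibnet0/user/glock/events.txt') \
          .map(lambda line: (line.split()[0], line)) 
pairs = pairs.partitionBy(64)                  # user-specified partitioning across nodes 
pairs.persist(StorageLevel.MEMORY_ONLY)        # keep intermediate results in memory 
pairs.count()                                  # first action materializes the cached RDD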
Resilient Distributed Dataset (RDD) 
• Comprised of distributed, atomic partitions of elements 
• Apply transformations to generate new RDDs 
• RDDs are immutable (read-only) 
• RDDs can only be created from persistent storage (e.g., 
HDFS, POSIX, S3) or by transforming other RDDs 
# Create an RDD from a file on HDFS 
text = sc.textFile('hdfs://master.ibnet0/user/glock/mobydick.txt') 
# Transform the RDD of lines into an RDD of words (one word per element) 
words = text.flatMap( lambda line: line.split() ) 
# Transform the RDD of words into an RDD of key/value pairs 
keyvals = words.map( lambda word: (word, 1) ) 
sc is a SparkContext object that describes our Spark cluster 
lambda declares a "lambda function" in Python (an anonymous function in Perl and other languages) 
SAN DIEGO SUPERCOMPUTER CENTER
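A reduceByKey step (not shown on the slide) would complete this word count; keyvals is the RDD of (word, 1) pairs built above: 

# Transform the RDD of (word, 1) pairs into an RDD of (word, count) pairs 
counts = keyvals.reduceByKey(lambda a, b: a + b) 
# counts is still just an RDD definition; nothing executes until an action is called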
Potential RDD Workflow 
SAN DIEGO SUPERCOMPUTER CENTER
RDD Transformation vs. Action 
• Transformations are lazy: nothing actually happens when 
this code is evaluated 
• RDDs are computed only when an action is called on 
them, e.g., 
• Calculate statistics over the elements of an RDD (count, mean) 
• Save the RDD to a file (saveAsTextFile) 
• Reduce elements of an RDD into a single object or value (reduce) 
• Allows you to define partitioning/caching behavior after 
defining the RDD but before calculating its contents 
SAN DIEGO SUPERCOMPUTER CENTER
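For example (a minimal sketch along the lines of the word count above, with a hypothetical length threshold), nothing runs until the final line: 

# Transformations only build a recipe... 
words = sc.textFile('hdfs://master.ibnet0/user/glock/mobydick.txt') \
          .flatMap(lambda line: line.split()) 
longwords = words.filter(lambda word: len(word) > 7)   # still nothing computed 
# ...the action below is what actually triggers execution across the cluster 
print longwords.count()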
RDD Transformation vs. Action 
• Must insert an action here to get pipeline to execute. 
• Actions create files or objects: 
# The saveAsTextFile action dumps the contents of an RDD to disk 
>>> rdd.saveAsTextFile('hdfs://master.ibnet0/user/glock/output.txt') 
# The count action returns the number of elements in an RDD 
>>> num_elements = rdd.count(); num_elements; type(num_elements) 
215136 
<type 'int'> 
SAN DIEGO SUPERCOMPUTER CENTER
Resiliency: The 'R' in 'RDD' 
• No replication of in-memory data 
• Restrict transformations to coarse granularity 
• Partition-level operations simplify data lineage 
SAN DIEGO SUPERCOMPUTER CENTER
Resiliency: The 'R' in 'RDD' 
• Reconstruct missing data from its lineage 
• Data in RDDs are deterministic since partitions 
are immutable and atomic 
SAN DIEGO SUPERCOMPUTER CENTER
Resiliency: The 'R' in 'RDD' 
• Long lineages or complex interactions 
(reductions, shuffles) can be checkpointed 
• RDD immutability → nonblocking (background) checkpointing 
SAN DIEGO SUPERCOMPUTER CENTER
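A hedged sketch of explicit checkpointing in PySpark follows; the checkpoint directory and the RDD name rdd are assumptions for illustration: 

# Truncate a long lineage by checkpointing to persistent storage 
sc.setCheckpointDir('hdfs://master.ibnet0/user/glock/checkpoints') 
rdd = rdd.cache()          # keep the RDD in memory for reuse 
rdd.checkpoint()           # mark it for checkpointing; materialized at the next action 
rdd.count()                # action: computes the RDD, caches it, and writes the checkpoint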
Introduction to Spark 
SPARK: AN IMPLEMENTATION 
OF RDDS 
SAN DIEGO SUPERCOMPUTER CENTER
Spark Framework 
• Master/worker Model 
• Spark Master is analogous to Hadoop Jobtracker (MRv1) 
or Application Master (MRv2) 
• Spark Worker is analogous to Hadoop Tasktracker 
• Relies on "3rd party" storage for RDD generation 
(hdfs://, s3n://, file://, http://) 
• Spark clusters take three forms: 
• Standalone mode - workers communicate directly with 
master via spark://master:7077 URI 
• Mesos - mesos://master:5050 URI 
• YARN - no HA; complicated job launch 
SAN DIEGO SUPERCOMPUTER CENTER
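In standalone mode, a PySpark driver points its SparkContext at the master's spark:// URI, roughly as in the sketch below (the hostname, port, and application name are assumptions): 

from pyspark import SparkConf, SparkContext 

conf = SparkConf().setMaster('spark://master.ibnet0:7077').setAppName('example') 
sc = SparkContext(conf=conf) 
rdd = sc.textFile('hdfs://master.ibnet0:54310/user/glock/mydata.txt')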
Spark on Gordon: Configuration 
1. Standalone mode is the simplest configuration 
and execution model (similar to MRv1) 
2. Leverage existing HDFS support in myHadoop 
for storage 
3. Combine #1 and #2 to extend myHadoop to 
support Spark: 
$ export HADOOP_CONF_DIR=/home/glock/hadoop.conf 
$ myhadoop-configure.sh 
... 
myHadoop: Enabling experimental Spark support 
myHadoop: Using SPARK_CONF_DIR=/home/glock/hadoop.conf/spark 
myHadoop: 
To use Spark, you will want to type the following commands:" 
source /home/glock/hadoop.conf/spark/spark-env.sh 
myspark start 
SAN DIEGO SUPERCOMPUTER CENTER
Spark on Gordon: Storage 
• Spark can use HDFS 
$ start-dfs.sh # after you run myhadoop-configure.sh, of course 
... 
$ pyspark 
>>> mydata = sc.textFile('hdfs://localhost:54310/user/glock/mydata.txt') 
>>> mydata.count() 
982394 
• Spark can use POSIX file systems too 
$ pyspark 
>>> mydata = sc.textFile('file:///oasis/scratch/glock/temp_project/mydata.txt') 
>>> mydata.count() 
982394 
• S3 Native (s3n://) and HTTP (http://) also work 
• file:// input will be served in chunks to Spark 
workers via the Spark driver's built-in httpd 
SAN DIEGO SUPERCOMPUTER CENTER
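For completeness, an s3n:// read looks the same as the HDFS and POSIX examples above; the bucket name below is hypothetical, and the AWS credentials must already be available to the underlying Hadoop configuration: 

>>> mydata = sc.textFile('s3n://my-bucket/mydata.txt') 
>>> mydata.count()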
Spark on Gordon: Running 
Spark treats several languages as first-class 
citizens: 
Feature Scala Java Python 
Interactive YES NO YES 
Shark (SQL) YES YES YES 
Streaming YES YES NO 
MLlib YES YES YES 
GraphX YES YES NO 
R is a second-class citizen; basic RDD API is 
available outside of CRAN 
(http://amplab-extras.github.io/SparkR-pkg/) 
SAN DIEGO SUPERCOMPUTER CENTER
myHadoop/Spark on Gordon (1/2) 
#!/bin/bash 
#PBS -l nodes=2:ppn=16:native:flash 
#PBS -l walltime=00:30:00 
#PBS -q normal 
### Environment setup for Hadoop 
export MODULEPATH=/home/glock/apps/modulefiles:$MODULEPATH 
module load hadoop/2.2.0 
export HADOOP_CONF_DIR=$HOME/mycluster.conf 
myhadoop-configure.sh 
### Start HDFS. Starting YARN isn't necessary since Spark will be running in 
### standalone mode on our cluster. 
start-dfs.sh 
### Load in the necessary Spark environment variables 
source $HADOOP_CONF_DIR/spark/spark-env.sh 
### Start the Spark masters and workers. Do NOT use the start-all.sh provided 
### by Spark, as they do not correctly honor $SPARK_CONF_DIR 
myspark start 
SAN DIEGO SUPERCOMPUTER CENTER
myHadoop/Spark on Gordon (2/2) 
### Run our example problem. 
### Step 1. Load data into HDFS (Hadoop 2.x does not make the user's HDFS home 
### dir by default which is different from Hadoop 1.x!) 
hdfs dfs -mkdir -p /user/$USER 
hdfs dfs -put /home/glock/hadoop/run/gutenberg.txt /user/$USER/gutenberg.txt 
### Step 2. Run our Python Spark job. Note that Spark implicitly requires 
### Python 2.6 (some features, like MLLib, require 2.7) 
module load python scipy 
/home/glock/hadoop/run/wordcount-spark.py 
### Step 3. Copy output back out 
hdfs dfs -get /user/$USER/output.dir $PBS_O_WORKDIR/ 
### Shut down Spark and HDFS 
myspark stop 
stop-dfs.sh 
### Clean up 
myhadoop-cleanup.sh 
SAN DIEGO SUPERCOMPUTER CENTER 
Wordcount submit script and Python code online: 
https://github.com/glennklockwood/sparktutorial
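The wordcount-spark.py script itself lives in the repository above; a minimal sketch of what such a script might contain is shown below. The HDFS URI, application name, and output path are assumptions pieced together from the job script, and the master URI is assumed to come from the environment set up by spark-env.sh: 

#!/usr/bin/env python 
# Hypothetical sketch of wordcount-spark.py; see the GitHub repository for the real script 
from pyspark import SparkConf, SparkContext 

conf = SparkConf().setAppName('wordcount') 
sc = SparkContext(conf=conf) 

text = sc.textFile('hdfs://localhost:54310/user/glock/gutenberg.txt') 
counts = ( text.flatMap(lambda line: line.split()) 
               .map(lambda word: (word, 1)) 
               .reduceByKey(lambda a, b: a + b) ) 
counts.saveAsTextFile('hdfs://localhost:54310/user/glock/output.dir')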
Introduction to Spark 
PRACTICAL LIMITATIONS 
SAN DIEGO SUPERCOMPUTER CENTER
Major Problems with Spark 
1. Still smells like a CS project 
2. Debugging is a dark art 
3. Not battle-tested at scale 
SAN DIEGO SUPERCOMPUTER CENTER
#1: Spark Smells Like CS 
• Components are constantly breaking 
• Graph.partitionBy broken in 1.0.0 (SPARK-1931) 
• Some components never worked 
• SPARK_CONF_DIR (start-all.sh) doesn't work (SPARK-2058) 
• stop-master.sh doesn't work 
• Spark with YARN will break with large data sets (SPARK-2398) 
• spark-submit for standalone mode doesn't work (SPARK-2260) 
SAN DIEGO SUPERCOMPUTER CENTER
#1: Spark Smells Like CS 
• Really obvious usability issues: 
>>> data = sc.textFile('file:///oasis/scratch/glock/temp_project/gutenberg.txt') 
>>> data.saveAsTextFile('hdfs://gcn-8-42.ibnet0:54310/user/glock/output.dir') 
14/04/30 16:23:07 ERROR Executor: Exception in task ID 19 
scala.MatchError: 0 (of class java.lang.Integer) 
at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:110) 
at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:153) 
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96) 
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241) 
... 
at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49) 
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178) 
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) 
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) 
at java.lang.Thread.run(Thread.java:722) 
Read an RDD, then write it out = unhandled exception with 
cryptic Scala errors from Python (SPARK-1690) 
SAN DIEGO SUPERCOMPUTER CENTER
#2: Debugging is a Dark Art 
>>> data.saveAsTextFile('hdfs://s12ib:54310/user/glock/gutenberg.out') 
Traceback (most recent call last): 
File "<stdin>", line 1, in <module> 
File "/N/u/glock/apps/spark-0.9.0/python/pyspark/rdd.py", line 682, in 
saveAsTextFile 
keyed._jrdd.map(self.ctx._jvm.BytesToString()).saveAsTextFile(path) 
File "/N/u/glock/apps/spark-0.9.0/python/lib/py4j-0.8.1- 
src.zip/py4j/java_gateway.py", line 537, in __call__ 
File "/N/u/glock/apps/spark-0.9.0/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py", 
line 300, in get_return_value 
py4j.protocol.Py4JJavaError: An error occurred while calling o23.saveAsTextFile. 
: org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot communicate with 
client version 4 
at org.apache.hadoop.ipc.Client.call(Client.java:1070) 
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225) 
at $Proxy7.getProtocolVersion(Unknown Source) 
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396) 
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379) 
Cause: Spark built against Hadoop 2 DFS trying to access data 
on Hadoop 1 DFS 
SAN DIEGO SUPERCOMPUTER CENTER
#2: Debugging is a Dark Art 
>>> data.count() 
14/04/30 16:15:11 ERROR Executor: Exception in task ID 12 
org.apache.spark.api.python.PythonException: Traceback (most recent call last): 
File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop1/python/pyspark/worker.py", 
serializer.dump_stream(func(split_index, iterator), outfile) 
File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop1/python/pyspark/serializers. 
self.serializer.dump_stream(self._batched(iterator), stream) 
File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop1/python/pyspark/serializers. 
for obj in iterator: 
File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop1/python/pyspark/serializers. 
for item in iterator: 
File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop1/python/pyspark/rdd.py", lin 
if acc is None: 
TypeError: an integer is required 
at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:131) 
at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:153) 
... 
Cause: Master was using Python 2.6, but workers were only 
able to find Python 2.4 
SAN DIEGO SUPERCOMPUTER CENTER
#2: Debugging is a Dark Art 
>>> data.saveAsTextFile('hdfs://user/glock/output.dir/') 
14/04/30 17:53:20 WARN scheduler.TaskSetManager: Loss was due to org.apache.spark.api.p 
org.apache.spark.api.python.PythonException: Traceback (most recent call last): 
File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/worker.py", 
serializer.dump_stream(func(split_index, iterator), outfile) 
File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/serializers. 
for obj in iterator: 
File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/rdd.py", lin 
if not isinstance(x, basestring): 
SystemError: unknown opcode 
at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:131) 
at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:153) 
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96) 
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241) 
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232) 
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) 
... 
Cause: Master was using Python 2.6, but workers were only 
able to find Python 2.4 
SAN DIEGO SUPERCOMPUTER CENTER
#2: Spark Debugging Tips 
• $SPARK_LOG_DIR/app-* contains master/worker 
logs with failure information 
• Try to find the salient error amidst the stack traces 
• Google that error--odds are, it is a known issue 
• Stick any required environment variables ($PATH, 
$PYTHONPATH, $JAVA_HOME) in 
$SPARK_CONF_DIR/spark-env.sh to rule out 
these problems 
• All else fails, look at Spark source code 
SAN DIEGO SUPERCOMPUTER CENTER
#3: Spark Isn't Battle Tested 
• Companies (Cloudera, SAP, etc) jumping on the 
Spark bandwagon with disclaimers about scaling 
• Spark does not handle multitenancy well at all. 
Wait scheduling is considered the best way to achieve 
memory/disk data locality 
• Largest Spark clusters ~ hundreds of nodes 
SAN DIEGO SUPERCOMPUTER CENTER
Spark Take-Aways 
SAN DIEGO SUPERCOMPUTER CENTER 
• FACTS 
• Data is represented as resilient distributed datasets 
(RDDs) which remain in-memory and read-only 
• RDDs are comprised of elements 
• Elements are distributed across physical nodes in user-defined 
groups called partitions 
• RDDs are subject to transformations and actions 
• Fault tolerance achieved by lineage, not replication 
• Opinions 
• Spark is still in its infancy but its progress is promising 
• Good for evaluating--good for Gordon, Comet
Introduction to Spark 
PAGERANK EXAMPLE 
(INCOMPLETE) 
SAN DIEGO SUPERCOMPUTER CENTER
Lazy Evaluation + In-Memory Caching = 
Optimized JOIN Operations 
Start every webpage with a rank R = 1.0 
1. For each webpage that links to N neighbor webpages, 
have it "contribute" R/N to each of its N neighbors 
2. Then, for each webpage, set its rank R to (0.15 + 
0.85 * contributions) 
3. Repeat 
insert flow diagram here 
SAN DIEGO SUPERCOMPUTER CENTER
Lazy Evaluation + In-Memory Caching = 
Optimized JOIN Operations 
from operator import add  # needed by reduceByKey below

lines = sc.textFile('hdfs://master.ibnet0:54310/user/glock/links.txt') 

# Load key/value pairs of (url, link), eliminate duplicates, and partition them such 
# that all common keys are kept together. Then retain this RDD in memory. 
links = lines.map(lambda urls: urls.split()).distinct().groupByKey().cache() 

# Create a new RDD of key/value pairs of (url, rank) and initialize all ranks to 1.0 
ranks = links.map(lambda (url, neighbors): (url, 1.0)) 

# Calculate and update URL rank (computeContribs is defined in the editor's notes below) 
for iteration in range(10): 
    # Calculate URL contributions to their neighbors 
    contribs = links.join(ranks).flatMap( 
        lambda (url, (urls, rank)): computeContribs(urls, rank)) 
    # Recalculate URL ranks based on neighbor contributions 
    ranks = contribs.reduceByKey(add).mapValues(lambda rank: 0.15 + 0.85*rank) 

# Print all URLs and their ranks 
for (link, rank) in ranks.collect(): 
    print '%s has rank %s' % (link, rank) 
SAN DIEGO SUPERCOMPUTER CENTER
Editor's Notes 
• #39: groupByKey: group the values for each key in the RDD into a single sequence 
mapValues: apply map function to all values of key/value pairs without modifying keys (or their partitioning) 
collect: return a list containing all elements of the RDD 

def computeContribs(urls, rank): 
    """Calculates URL contributions to the rank of other URLs.""" 
    num_urls = len(urls) 
    for url in urls: 
        yield (url, rank / num_urls)