1© Cloudera, Inc. All rights reserved.
Tips for Writing ETL Pipelines
with Spark
Imran Rashid|Cloudera, Apache Spark PMC
2© Cloudera, Inc. All rights reserved.
Outline
• Quick Refresher
• Tips for Pipelines
• Spark Performance
• Using the UI
• Understanding Stage Boundaries
• Baby photos
3© Cloudera, Inc. All rights reserved.
About Me
• Member of the Spark PMC
• User of Spark from v0.5 at Quantifind
• Built ETL pipelines, prototype to production
• Supported Data Scientists
• Now work on Spark full time at Cloudera
4© Cloudera, Inc. All rights reserved.
RDDs: Resilient Distributed Dataset
• Data is distributed into partitions spread across a cluster
• Each partition is processed independently and in parallel
• Logical view of the data – not materialized
Image from Dean Wampler, Typesafe
5© Cloudera, Inc. All rights reserved.
Expressive API
• map
• filter
• groupBy
• sort
• union
• join
• leftOuterJoin
• rightOuterJoin
• reduce
• count
• fold
• reduceByKey
• groupByKey
• cogroup
• cross
• zip
• sample
• take
• first
• partitionBy
• mapWith
• pipe
• save
• ...
6© Cloudera, Inc. All rights reserved.
Cheap!
• No serialization
• No IO
• Pipelined

Expensive!
• Serialize Data
• Write to disk
• Transfer over network
• Deserialize Data
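A minimal sketch of the distinction (parse and the LogLine fields are placeholders borrowed from the parsing examples later in the deck): map and filter stay on the cheap, pipelined side, while reduceByKey pays the full shuffle cost.

val cheap = sc.textFile("hdfs://…/logs")
  .map(line => parse(line))          // pipelined: runs partition by partition, no serialization or IO between steps
  .filter(_.level == "ERROR")        // still the same stage – nothing leaves the executor

val expensive = cheap
  .map(log => (log.source, 1))
  .reduceByKey(_ + _)                // shuffle: serialize, write to disk, transfer over network, deserialize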
7© Cloudera, Inc. All rights reserved.
Compare to MapReduce Word Count

Hadoop MapReduce:

public static class WordCountMapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

public static class WordCountReduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

Spark:

val spark = new SparkContext(master, appName, [sparkHome], [jars])
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
8© Cloudera, Inc. All rights reserved.
Useful Patterns
9© Cloudera, Inc. All rights reserved.
Pipelines get complicated
• Pipelines get messy
• Input data is messy
• Things go wrong
• Never fast enough
• Need stability for months to years
• Need Forecasting / Capacity Planning
(Diagram: pipeline components annotated with who added them and when – Alice one year ago, Bob 6 months ago, Connie 3 months ago, Derrick last month, Alice last week)
10© Cloudera, Inc. All rights reserved.
Design Goals
• Modularity
• Error Handling
• Understand where and how
11© Cloudera, Inc. All rights reserved.
Catching Errors (1)
sc.textFile(…).map{ line =>
//blows up with parse exception
parse(line)
}
sc.textFile(…).flatMap { line =>
//now we’re safe, right?
Try(parse(line)).toOption
}
How many errors?
1 record? 100 records?
90% of our data?
12© Cloudera, Inc. All rights reserved.
Catching Errors (2)
val parseErrors = sc.accumulator(0L)
val parsed = sc.textFile(…).flatMap { line =>
  Try(parse(line)) match {
    case Success(s) => Some(s)
    case Failure(f) =>
      parseErrors += 1
      None
  }
}
// parse errors is always 0
if (parseErrors.value > 500) fail(…)
// and what if we want to see those errors?
13© Cloudera, Inc. All rights reserved.
Catching Errors (3)
• Accumulators break the RDD abstraction
• You care about when an action has taken place
• Force action, or pass error handling on
• SparkListener to deal w/ failures
• https://gist.github.com/squito/2f7cc02c313e4c9e7df4#file-accumulatorlistener-scala

case class ParsedWithErrorCounts(val parsed: RDD[LogLine], errors: Accumulator[Long])

def parseCountErrors(path: String, sc: SparkContext): ParsedWithErrorCounts = {
  val parseErrorCounter = sc.accumulator(0L).setName("parseErrors")
  val parsed = sc.textFile(path).flatMap { line =>
    line match {
      case LogPattern(date, thread, level, source, msg) =>
        Some(LogLine(date, thread, level, source, msg))
      case _ =>
        parseErrorCounter += 1
        None
    }
  }
  ParsedWithErrorCounts(parsed, parseErrorCounter)
}
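A hedged usage sketch for parseCountErrors (the input path and the fail helper are placeholders from the earlier slides): the accumulator only holds a meaningful count after an action has materialized the parsed RDD.

val ParsedWithErrorCounts(parsed, errors) = parseCountErrors("hdfs://…/raw", sc)
parsed.cache()

// Force an action so the flatMap actually runs on the cluster.
val good = parsed.count()

// Only now is the error count populated and safe to check.
if (errors.value > 500) fail(s"${errors.value} parse errors vs $good parsed lines")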
14© Cloudera, Inc. All rights reserved.
Catching Errors (4)
• Accumulators can give you “multiple outputs”
• Create a sample of error records
• You can look at them for debugging
• WARNING: accumulators are not scalable

class ReservoirSample[T] {...}

class ReservoirSampleAccumulableParam[T] extends AccumulableParam[ReservoirSample[T], T] {...}

def parseCountErrors(path: String, sc: SparkContext): ParsedWithErrorCounts = {
  val parseErrors = sc.accumulable(new ReservoirSample[String](100))(…)
  val parsed = sc.textFile(path).flatMap { line =>
    line match {
      case LogPattern(date, thread, level, source, msg) =>
        Some(LogLine(date, thread, level, source, msg))
      case _ =>
        parseErrors += line
        None
    }
  }
  ParsedWithErrorCounts(parsed, parseErrors)
}
15© Cloudera, Inc. All rights reserved.
Catching Errors (5)
• What if instead, we just filter out each condition?
• Beware deep pipelines
• Eg. RDD.randomSplit
(Diagram: one Huge Raw Data input feeding a Filter and a FlatMap, producing the …parsed output plus separate Error 1 and Error 2 outputs)
16© Cloudera, Inc. All rights reserved.
Modularity with RDDs
• Who is caching what?
• What resources should each component use?
• What assumptions are made on inputs?
17© Cloudera, Inc. All rights reserved.
Win By Cheating
• Fastest way to shuffle a lot of data:
• Don’t shuffle
• Second fastest way to shuffle a lot of data:
• Shuffle a small amount of data
• ReduceByKey
• Approximate Algorithms
• Same as MapReduce
• Bloom filters, HyperLogLog, t-digest (see the sketch below)
• Joins with Narrow Dependencies
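One hedged example of the approximate-algorithm trade, reusing the parsed log lines from the later slides: Spark's built-in countApproxDistinct (HyperLogLog under the hood) versus an exact distinct count.

// Exact: every distinct key is shuffled before counting.
val exactSources = parsed.map(_.source).distinct().count()

// Approximate: each partition keeps only a small HyperLogLog sketch, so just the
// sketches cross the network; 0.01 is the target relative standard deviation.
val approxSources = parsed.map(_.source).countApproxDistinct(relativeSD = 0.01)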
18© Cloudera, Inc. All rights reserved.
ReduceByKey when Possible
• ReduceByKey allows a map-side combine
• Data is merged together before it's serialized & sent over the network
• GroupByKey transfers all the data
• Higher serialization and network transfer costs

parsed
  .map{ line => (line.level, 1) }
  .reduceByKey{ (a, b) => a + b }
  .collect()

parsed
  .map{ line => (line.level, 1) }
  .groupByKey.map{ case (word, counts) => (word, counts.sum) }
  .collect()
19© Cloudera, Inc. All rights reserved.
But I need groupBy
• Eg., incoming transaction logs from user
• 10 TB of historical data
• 50 GB of new data each day
(Diagram: Historical Logs plus Day 1, Day 2, and Day 3 logs merged into Grouped Logs)
20© Cloudera, Inc. All rights reserved.
Using Partitioners for Narrow Joins
• Sort the Historical Logs once
• Each day, sort the small new data
• Join – narrow dependency
• Write data to hdfs
• Day 2 – now what?
• SPARK-1061
• Read from hdfs
• “Remember” data was written with a partitioner
(Diagram: Wide Join vs. Narrow Join)
21© Cloudera, Inc. All rights reserved.
Assume Partitioned
• Day 2 – now what?
• SPARK-1061
• Read from hdfs
• “Remember” data was written with a partitioner
// Day 1
val myPartitioner = …
val historical = sc.hadoopFile("…/mergedLogs/2015/05/19", …)
  .partitionBy(myPartitioner)
val newData = sc.hadoopFile("…/newData/2015/05/20", …)
  .partitionBy(myPartitioner)
val grouped = historical.cogroup(newData)
grouped.saveAsHadoopFile("…/mergedLogs/2015/05/20")

// Day 2 – new spark context
val historical = sc.hadoopFile("…/mergedLogs/2015/05/20", …)
  .assumePartitionedBy(myPartitioner)
22© Cloudera, Inc. All rights reserved.
Recovering from Errors
• I write bugs
• You write bugs
• Spark has bugs
• The bugs might appear after 17 hours in stage 78 of your application
• Spark’s failure recovery might not help you
23© Cloudera, Inc. All rights reserved.
HDFS: It's not so bad
• DiskCachedRDD
• Before doing any work, check if it exists on disk
• If so, just load it
• If not, create it and write it to disk
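A minimal sketch of the idea, not the speaker's actual DiskCachedRDD: a helper that checks HDFS before recomputing. The path layout and the ObjectFile format are assumptions for illustration.

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// If a previous run already wrote this dataset, load it from HDFS; otherwise
// compute it and persist it so later stages (and re-runs) can start from disk.
def diskCached[T: ClassTag](sc: SparkContext, path: String)(compute: => RDD[T]): RDD[T] = {
  val fs = FileSystem.get(sc.hadoopConfiguration)
  if (fs.exists(new Path(path))) {
    sc.objectFile[T](path)
  } else {
    val rdd = compute
    rdd.saveAsObjectFile(path)
    rdd
  }
}

// Usage: the expensive parse + join only runs when its output isn't already on disk.
// val cleaned = diskCached(sc, "hdfs://…/checkpoints/cleaned") { parseAndJoin(rawLogs) }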
24© Cloudera, Inc. All rights reserved.
Partitions, Partitions, Partitions …
• Partitions should be small
• Max partition size is 2GB*
• Small partitions help deal w/ stragglers
• Small partitions avoid overhead – take a closer look at internals …
• Partitions should be big
• “For ML applications, the best setting to set the number of partitions to match the number of cores to reduce shuffle size.” Xiangrui Meng on user@
• Why? Take a closer look at internals …
25© Cloudera, Inc. All rights reserved.
Parameterize Partition Numbers
• Many transformations take a second parameter
• reduceByKey(…, nPartitions)
• sc.textFile(…, nPartitions)
• Both sides of shuffle matter!
• Shuffle read (aka “reduce”)
• Shuffle write (aka “map”) – controlled by previous stage
• As datasets change, you might need to change the numbers
• Make this a parameter to your application
• Yes, you may need to expose a LOT of parameters
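A hedged sketch of what that parameterization might look like; the argument positions and the parse helper are placeholders, not part of the original deck.

import scala.util.Try

// Take partition counts from the command line (or a config system) instead of
// hard-coding them, so they can be retuned as the data grows.
val Array(inputPath, inputPartitions, reducePartitions) = args.take(3)

val counts = sc.textFile(inputPath, inputPartitions.toInt)   // read side, i.e. the shuffle-write (“map”) stage
  .flatMap(line => Try(parse(line)).toOption)
  .map(log => (log.level, 1))
  .reduceByKey(_ + _, reducePartitions.toInt)                // shuffle-read (“reduce”) side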
26© Cloudera, Inc. All rights reserved.
Using the UI
27© Cloudera, Inc. All rights reserved.
Some Demos
• Collect a lot of data
• Slow tasks
• DAG visualization
• RDD names
28© Cloudera, Inc. All rights reserved.
Understanding Performance
29© Cloudera, Inc. All rights reserved.
What data and where is it going?
• Narrow Dependencies (aka “OneToOneDependency”)
• cheap
• Wide Dependencies (aka shuffles)
• how much is shuffled
• Is it skewed
• Driver bottleneck
30© Cloudera, Inc. All rights reserved.
Driver can be a bottleneck
Credit: Sandy Ryza, Cloudera
31© Cloudera, Inc. All rights reserved.
Driver can be a bottleneck
• rdd.collect()
  GOOD: Exploratory data analysis; merging a small set of results.
  BAD: Sequentially scan the entire data set on the driver. No parallelism, OOM on the driver.
• rdd.reduce()
  GOOD: Summarize the results from a small dataset.
  BAD: Big data structures, from lots of partitions.
• sc.accumulator()
  GOOD: Small data types, e.g., counters.
  BAD: Big data structures, from lots of partitions; a set of a million “most interesting” user ids from each partition.
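A hedged illustration of the collect() row above: bring back a bounded sample or a single summary value, and leave the full result on the cluster. The field names reuse the LogLine examples from earlier.

// BAD: materializes every record in the driver JVM – easy to OOM on big data.
// val everything = parsed.collect()

// Better: keep the heavy lifting distributed, return only what the driver needs.
val preview = parsed.take(100)                               // bounded sample for inspection
val errorCount = parsed.filter(_.level == "ERROR").count()   // a single number
parsed.saveAsTextFile("hdfs://…/output")                     // full result stays on the cluster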
32© Cloudera, Inc. All rights reserved.
Stage Boundaries
33© Cloudera, Inc. All rights reserved.
Stages are not MapReduce Steps!
(Diagram: a chain of MapReduce steps – each with its own Map, Shuffle, and Reduce – compared with a Spark job where Map, FlatMap, Filter, ReduceByKey (with its map-side combine), GroupByKey, and Collect pipeline together, and stage boundaries appear only at the shuffles)
34© Cloudera, Inc. All rights reserved.
I still get confused
(discussion in a code review, testing a large sortByKey)
WP: … then we wait for completion of stage 3 …
ME: hang on, stage 3? Why are there 3 stages? SortByKey does one extra pass to find the range of the keys, but that’s two stages
WP: The other stage is data generation
ME: That can’t be right. Data generation is pipelined, it’s just part of the first stage
…
ME: duh – the final sort is two stages – shuffle write then shuffle read
(Diagram: Stage 1 – InputRDD plus the sampling pass to find the range of keys; Stage 2 – ShuffleMap for the sort; Stage 3 – ShuffleRead for the sort. NB: the input is computed twice!)
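A hedged sketch of the scenario above (generateData is a placeholder for whatever produced the input): caching the input avoids the double computation the diagram points out.

// Stage 1: data generation pipelines straight into the sampling pass that
// sortByKey runs to build its range partitioner.
val input = generateData(sc).map(record => (record.key, record))
input.cache()   // without this, the input is recomputed for the shuffle write

// Stage 2: shuffle write for the sort; Stage 3: shuffle read producing the sorted output.
val sorted = input.sortByKey()
sorted.saveAsTextFile("hdfs://…/sorted")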
35© Cloudera, Inc. All rights reserved.
Tip grab bag
• Minimize data volume
• Compact formats: avro, parquet
• Kryo Serialization
• Require registration in development, but not in production (see the config sketch below)
• Look at data skew, key cardinality
• Tune your cluster
• Use the UI to tune your job
• Set names on all cached RDDs
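A hedged sketch of the Kryo and RDD-naming tips; the class names come from the earlier examples, and the settings are standard Spark configuration keys.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("log-etl")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Fail fast in dev if something falls back to default Java serialization;
  // relax this in production so an unregistered class doesn't kill the job.
  .set("spark.kryo.registrationRequired", "true")
  .registerKryoClasses(Array(classOf[LogLine], classOf[ReservoirSample[String]]))

val sc = new SparkContext(conf)

// Naming cached RDDs makes the Storage tab of the UI readable.
val parsed = parseCountErrors("hdfs://…/raw", sc).parsed
parsed.setName("parsedLogs").cache()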
36© Cloudera, Inc. All rights reserved.
More Resources
• Very active and friendly community
• http://spark.apache.org/community.html
• Dean Wampler’s self-paced spark workshop
• https://github.com/deanwampler/spark-workshop
• Tips for Better Spark Jobs
• http://www.slideshare.net/databricks/strata-sj-everyday-im-shuffling-tips-for-writing-better-spark-programs
• Tuning & Debugging Spark (with another explanation of internals)
• http://www.slideshare.net/pwendell/tuning-and-debugging-in-apache-spark
• Tuning Spark On Yarn
• http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/
37© Cloudera, Inc. All rights reserved.
Thank you
38© Cloudera, Inc. All rights reserved.
Cleaning Up Resources (Try 1)
39© Cloudera, Inc. All rights reserved.
Cleaning Up Resources (Try 2)
40© Cloudera, Inc. All rights reserved.
Cleaning Up Resources (Success)