1© Cloudera, Inc. All rights reserved.
Tips for Writing ETL Pipelines
with Spark
Imran Rashid|Cloudera, Apache Spark PMC
2© Cloudera, Inc. All rights reserved.
Outline
• Quick Refresher
• Tips for Pipelines
• Spark Performance
• Using the UI
• Understanding Stage Boundaries
• Baby photos
3© Cloudera, Inc. All rights reserved.
About Me
• Member of the Spark PMC
• User of Spark from v0.5 at Quantifind
• Built ETL pipelines, prototype to production
• Supported Data Scientists
• Now work on Spark full time at Cloudera
4© Cloudera, Inc. All rights reserved.
RDDs: Resilient Distributed Dataset
• Data is distributed into partitions spread across a cluster
• Each partition is processed independently and in parallel
• Logical view of the data – not materialized
Image from Dean Wampler, Typesafe
5© Cloudera, Inc. All rights reserved.
Expressive API
• map
• filter
• groupBy
• sort
• union
• join
• leftOuterJoin
• rightOuterJoin
• reduce
• count
• fold
• reduceByKey
• groupByKey
• cogroup
• cross
• zip
• sample
• take
• first
• partitionBy
• mapWith
• pipe
• save
• ...
6© Cloudera, Inc. All rights reserved.
Cheap!
• No serialization
• No IO
• Pipelined

Expensive!
• Serialize Data
• Write to disk
• Transfer over network
• Deserialize Data
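A minimal sketch of the distinction (parse and the LogLine fields are placeholders borrowed from the parsing examples later in the deck): map and filter stay on the cheap, pipelined side, while reduceByKey pays the full shuffle cost.

val cheap = sc.textFile("hdfs://…/logs")
  .map(line => parse(line))          // pipelined: runs partition by partition, no serialization or IO between steps
  .filter(_.level == "ERROR")        // still the same stage – nothing leaves the executor

val expensive = cheap
  .map(log => (log.source, 1))
  .reduceByKey(_ + _)                // shuffle: serialize, write to disk, transfer over network, deserialize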
7© Cloudera, Inc. All rights reserved.
Compare to MapReduce Word Count

Hadoop MapReduce:

public static class WordCountMapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

public static class WordCountReduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

Spark:

val spark = new SparkContext(master, appName, [sparkHome], [jars])
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
8© Cloudera, Inc. All rights reserved.
Useful Patterns
9© Cloudera, Inc. All rights reserved.
Pipelines get complicated
• Pipelines get messy
• Input data is messy
• Things go wrong
• Never fast enough
• Need stability for months to years
• Need Forecasting / Capacity Planning
(Diagram: pipeline components annotated with who added them and when – Alice one year ago, Bob 6 months ago, Connie 3 months ago, Derrick last month, Alice last week)
10© Cloudera, Inc. All rights reserved.
Design Goals
• Modularity
• Error Handling
• Understand where and how
11© Cloudera, Inc. All rights reserved.
Catching Errors (1)
sc.textFile(…).map{ line =>
//blows up with parse exception
parse(line)
}
sc.textFile(…).flatMap { line =>
//now we’re safe, right?
Try(parse(line)).toOption
}
How many errors?
1 record? 100 records?
90% of our data?
12© Cloudera, Inc. All rights reserved.
Catching Errors (2)
val parseErrors = sc.accumulator(0L)
val parsed = sc.textFile(…).flatMap { line =>
  Try(parse(line)) match {
    case Success(s) => Some(s)
    case Failure(f) =>
      parseErrors += 1
      None
  }
}
// parse errors is always 0
if (parseErrors.value > 500) fail(…)
// and what if we want to see those errors?
13© Cloudera, Inc. All rights reserved.
Catching Errors (3)
• Accumulators break the RDD abstraction
• You care about when an action has taken place
• Force action, or pass error handling on
• SparkListener to deal w/ failures
• https://gist.github.com/squito/2f7cc02c313e4c9e7df4#file-accumulatorlistener-scala

case class ParsedWithErrorCounts(val parsed: RDD[LogLine], errors: Accumulator[Long])

def parseCountErrors(path: String, sc: SparkContext): ParsedWithErrorCounts = {
  val parseErrorCounter = sc.accumulator(0L).setName("parseErrors")
  val parsed = sc.textFile(path).flatMap { line =>
    line match {
      case LogPattern(date, thread, level, source, msg) =>
        Some(LogLine(date, thread, level, source, msg))
      case _ =>
        parseErrorCounter += 1
        None
    }
  }
  ParsedWithErrorCounts(parsed, parseErrorCounter)
}
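A hedged usage sketch for parseCountErrors (the input path and the fail helper are placeholders from the earlier slides): the accumulator only holds a meaningful count after an action has materialized the parsed RDD.

val ParsedWithErrorCounts(parsed, errors) = parseCountErrors("hdfs://…/raw", sc)
parsed.cache()

// Force an action so the flatMap actually runs on the cluster.
val good = parsed.count()

// Only now is the error count populated and safe to check.
if (errors.value > 500) fail(s"${errors.value} parse errors vs $good parsed lines")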
14© Cloudera, Inc. All rights reserved.
Catching Errors (4)
• Accumulators can give you “multiple outputs”
• Create a sample of error records
• You can look at them for debugging
• WARNING: accumulators are not scalable

class ReservoirSample[T] {...}

class ReservoirSampleAccumulableParam[T] extends AccumulableParam[ReservoirSample[T], T] {...}

def parseCountErrors(path: String, sc: SparkContext): ParsedWithErrorCounts = {
  val parseErrors = sc.accumulable(new ReservoirSample[String](100))(…)
  val parsed = sc.textFile(path).flatMap { line =>
    line match {
      case LogPattern(date, thread, level, source, msg) =>
        Some(LogLine(date, thread, level, source, msg))
      case _ =>
        parseErrors += line
        None
    }
  }
  ParsedWithErrorCounts(parsed, parseErrors)
}
15© Cloudera, Inc. All rights reserved.
Catching Errors (5)
• What if instead, we just filter out each condition?
• Beware deep pipelines
• Eg. RDD.randomSplit
(Diagram: one Huge Raw Data input feeding a Filter and a FlatMap, producing the …parsed output plus separate Error 1 and Error 2 outputs)
16© Cloudera, Inc. All rights reserved.
Modularity with RDDs
• Who is caching what?
• What resources should each component use?
• What assumptions are made on inputs?
17© Cloudera, Inc. All rights reserved.
Win By Cheating
• Fastest way to shuffle a lot of data:
• Don’t shuffle
• Second fastest way to shuffle a lot of data:
• Shuffle a small amount of data
• ReduceByKey
• Approximate Algorithms
• Same as MapReduce
• Bloom filters, HyperLogLog, t-digest (see the sketch below)
• Joins with Narrow Dependencies
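One hedged example of the approximate-algorithm trade, reusing the parsed log lines from the later slides: Spark's built-in countApproxDistinct (HyperLogLog under the hood) versus an exact distinct count.

// Exact: every distinct key is shuffled before counting.
val exactSources = parsed.map(_.source).distinct().count()

// Approximate: each partition keeps only a small HyperLogLog sketch, so just the
// sketches cross the network; 0.01 is the target relative standard deviation.
val approxSources = parsed.map(_.source).countApproxDistinct(relativeSD = 0.01)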
18© Cloudera, Inc. All rights reserved.
ReduceByKey when Possible
• ReduceByKey allows a map-side combine
• Data is merged together before it's serialized & sent over the network
• GroupByKey transfers all the data
• Higher serialization and network transfer costs

parsed
  .map{ line => (line.level, 1) }
  .reduceByKey{ (a, b) => a + b }
  .collect()

parsed
  .map{ line => (line.level, 1) }
  .groupByKey.map{ case (word, counts) => (word, counts.sum) }
  .collect()
19© Cloudera, Inc. All rights reserved.
But I need groupBy
• Eg., incoming transaction logs from user
• 10 TB of historical data
• 50 GB of new data each day
(Diagram: Historical Logs plus Day 1, Day 2, and Day 3 logs merged into Grouped Logs)
20© Cloudera, Inc. All rights reserved.
Using Partitioners for Narrow Joins
• Sort the Historical Logs once
• Each day, sort the small new data
• Join – narrow dependency
• Write data to hdfs
• Day 2 – now what?
• SPARK-1061
• Read from hdfs
• “Remember” data was written with a partitioner
(Diagram: Wide Join vs. Narrow Join)
21© Cloudera, Inc. All rights reserved.
Assume Partitioned
• Day 2 – now what?
• SPARK-1061
• Read from hdfs
• “Remember” data was written with a partitioner
// Day 1
val myPartitioner = …
val historical = sc.hadoopFile("…/mergedLogs/2015/05/19", …)
  .partitionBy(myPartitioner)
val newData = sc.hadoopFile("…/newData/2015/05/20", …)
  .partitionBy(myPartitioner)
val grouped = historical.cogroup(newData)
grouped.saveAsHadoopFile("…/mergedLogs/2015/05/20")

// Day 2 – new spark context
val historical = sc.hadoopFile("…/mergedLogs/2015/05/20", …)
  .assumePartitionedBy(myPartitioner)
22© Cloudera, Inc. All rights reserved.
Recovering from Errors
• I write bugs
• You write bugs
• Spark has bugs
• The bugs might appear after 17 hours in stage 78 of your application
• Spark’s failure recovery might not help you
23© Cloudera, Inc. All rights reserved.
HDFS: It's not so bad
• DiskCachedRDD
• Before doing any work, check if it exists on disk
• If so, just load it
• If not, create it and write it to disk
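A minimal sketch of the idea, not the speaker's actual DiskCachedRDD: a helper that checks HDFS before recomputing. The path layout and the ObjectFile format are assumptions for illustration.

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// If a previous run already wrote this dataset, load it from HDFS; otherwise
// compute it and persist it so later stages (and re-runs) can start from disk.
def diskCached[T: ClassTag](sc: SparkContext, path: String)(compute: => RDD[T]): RDD[T] = {
  val fs = FileSystem.get(sc.hadoopConfiguration)
  if (fs.exists(new Path(path))) {
    sc.objectFile[T](path)
  } else {
    val rdd = compute
    rdd.saveAsObjectFile(path)
    rdd
  }
}

// Usage: the expensive parse + join only runs when its output isn't already on disk.
// val cleaned = diskCached(sc, "hdfs://…/checkpoints/cleaned") { parseAndJoin(rawLogs) }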
24© Cloudera, Inc. All rights reserved.
Partitions, Partitions, Partitions …
• Partitions should be small
• Max partition size is 2GB*
• Small partitions help deal w/ stragglers
• Small partitions avoid overhead – take a closer look at internals …
• Partitions should be big
• “For ML applications, the best setting to set the number of partitions to match the number of cores to reduce shuffle size.” Xiangrui Meng on user@
• Why? Take a closer look at internals …
25© Cloudera, Inc. All rights reserved.
Parameterize Partition Numbers
• Many transformations take a second parameter
• reduceByKey(…, nPartitions)
• sc.textFile(…, nPartitions)
• Both sides of shuffle matter!
• Shuffle read (aka “reduce”)
• Shuffle write (aka “map”) – controlled by previous stage
• As datasets change, you might need to change the numbers
• Make this a parameter to your application
• Yes, you may need to expose a LOT of parameters
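A hedged sketch of what that parameterization might look like; the argument positions and the parse helper are placeholders, not part of the original deck.

import scala.util.Try

// Take partition counts from the command line (or a config system) instead of
// hard-coding them, so they can be retuned as the data grows.
val Array(inputPath, inputPartitions, reducePartitions) = args.take(3)

val counts = sc.textFile(inputPath, inputPartitions.toInt)   // read side, i.e. the shuffle-write (“map”) stage
  .flatMap(line => Try(parse(line)).toOption)
  .map(log => (log.level, 1))
  .reduceByKey(_ + _, reducePartitions.toInt)                // shuffle-read (“reduce”) side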
26© Cloudera, Inc. All rights reserved.
Using the UI
27© Cloudera, Inc. All rights reserved.
Some Demos
• Collect a lot of data
• Slow tasks
• DAG visualization
• RDD names
28© Cloudera, Inc. All rights reserved.
Understanding Performance
29© Cloudera, Inc. All rights reserved.
What data and where is it going?
• Narrow Dependencies (aka “OneToOneDependency”)
• cheap
• Wide Dependencies (aka shuffles)
• how much is shuffled
• Is it skewed
• Driver bottleneck
30© Cloudera, Inc. All rights reserved.
Driver can be a bottleneck
Credit: Sandy Ryza, Cloudera
31© Cloudera, Inc. All rights reserved.
Driver can be a bottleneck
• rdd.collect()
  GOOD: Exploratory data analysis; merging a small set of results.
  BAD: Sequentially scan the entire data set on the driver. No parallelism, OOM on the driver.
• rdd.reduce()
  GOOD: Summarize the results from a small dataset.
  BAD: Big data structures, from lots of partitions.
• sc.accumulator()
  GOOD: Small data types, e.g., counters.
  BAD: Big data structures, from lots of partitions; a set of a million “most interesting” user ids from each partition.
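A hedged illustration of the collect() row above: bring back a bounded sample or a single summary value, and leave the full result on the cluster. The field names reuse the LogLine examples from earlier.

// BAD: materializes every record in the driver JVM – easy to OOM on big data.
// val everything = parsed.collect()

// Better: keep the heavy lifting distributed, return only what the driver needs.
val preview = parsed.take(100)                               // bounded sample for inspection
val errorCount = parsed.filter(_.level == "ERROR").count()   // a single number
parsed.saveAsTextFile("hdfs://…/output")                     // full result stays on the cluster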
32© Cloudera, Inc. All rights reserved.
Stage Boundaries
33© Cloudera, Inc. All rights reserved.
Stages are not MapReduce Steps!
(Diagram: a chain of MapReduce steps – each with its own Map, Shuffle, and Reduce – compared with a Spark job where Map, FlatMap, Filter, ReduceByKey (with its map-side combine), GroupByKey, and Collect pipeline together, and stage boundaries appear only at the shuffles)
34© Cloudera, Inc. All rights reserved.
I still get confused
(discussion in a code review, testing a large sortByKey)
WP: … then we wait for completion of stage 3 …
ME: hang on, stage 3? Why are there 3 stages? SortByKey does one extra pass to find the range of the keys, but that’s two stages
WP: The other stage is data generation
ME: That can’t be right. Data generation is pipelined, it’s just part of the first stage
…
ME: duh – the final sort is two stages – shuffle write then shuffle read
(Diagram: Stage 1 – InputRDD plus the sampling pass to find the range of keys; Stage 2 – ShuffleMap for the sort; Stage 3 – ShuffleRead for the sort. NB: the input is computed twice!)
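A hedged sketch of the scenario above (generateData is a placeholder for whatever produced the input): caching the input avoids the double computation the diagram points out.

// Stage 1: data generation pipelines straight into the sampling pass that
// sortByKey runs to build its range partitioner.
val input = generateData(sc).map(record => (record.key, record))
input.cache()   // without this, the input is recomputed for the shuffle write

// Stage 2: shuffle write for the sort; Stage 3: shuffle read producing the sorted output.
val sorted = input.sortByKey()
sorted.saveAsTextFile("hdfs://…/sorted")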
35© Cloudera, Inc. All rights reserved.
Tip grab bag
• Minimize data volume
• Compact formats: avro, parquet
• Kryo Serialization
• Require registration in development, but not in production (see the config sketch below)
• Look at data skew, key cardinality
• Tune your cluster
• Use the UI to tune your job
• Set names on all cached RDDs
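A hedged sketch of the Kryo and RDD-naming tips; the class names come from the earlier examples, and the settings are standard Spark configuration keys.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("log-etl")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Fail fast in dev if something falls back to default Java serialization;
  // relax this in production so an unregistered class doesn't kill the job.
  .set("spark.kryo.registrationRequired", "true")
  .registerKryoClasses(Array(classOf[LogLine], classOf[ReservoirSample[String]]))

val sc = new SparkContext(conf)

// Naming cached RDDs makes the Storage tab of the UI readable.
val parsed = parseCountErrors("hdfs://…/raw", sc).parsed
parsed.setName("parsedLogs").cache()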
36© Cloudera, Inc. All rights reserved.
More Resources
• Very active and friendly community
• http://spark.apache.org/community.html
• Dean Wampler’s self-paced spark workshop
• https://github.com/deanwampler/spark-workshop
• Tips for Better Spark Jobs
• http://www.slideshare.net/databricks/strata-sj-everyday-im-shuffling-tips-for-writing-better-spark-programs
• Tuning & Debugging Spark (with another explanation of internals)
• http://www.slideshare.net/pwendell/tuning-and-debugging-in-apache-spark
• Tuning Spark On Yarn
• http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/
37© Cloudera, Inc. All rights reserved.
Thank you
38© Cloudera, Inc. All rights reserved.
Cleaning Up Resources (Try 1)
39© Cloudera, Inc. All rights reserved.
Cleaning Up Resources (Try 2)
40© Cloudera, Inc. All rights reserved.
Cleaning Up Resources (Success)