Beyond Parallelize & Collect
(Effective testing of Spark Programs)
Now mostly “works”*
*See developer for details. Does not imply warranty. :p
Who am I?
My name is Holden Karau
Preferred pronouns are she/her
I’m a Software Engineer
currently IBM and previously Alpine, Databricks, Google, Foursquare & Amazon
co-author of Learning Spark & Fast Data Processing with Spark
@holdenkarau
Slide share: http://www.slideshare.net/hkarau
Linkedin: https://www.linkedin.com/in/holdenkarau
What is going to be covered:
What I think I might know about you
A bit about why you should test your programs
Using parallelize & collect for unit testing (quick skim)
Comparing datasets too large to fit in memory
Considerations for Streaming & SQL (DataFrames & Datasets)
Cute & scary pictures
I promise at least one panda and one cat
“Future Work”
Who I think you wonderful humans are?
Nice* people
Like silly pictures
Familiar with Apache Spark
If not, buy one of my books or watch Paco’s awesome video
Familiar with one of Scala, Java, or Python
If you know R well I’d love to chat though
Want to make better software
(or models, or w/e)
So why should you test?
Makes you a better person
Save $s
May help you avoid losing your employer all of their money
Or “users” if we were in the bay
AWS is expensive
Waiting for our jobs to fail is a pretty long dev cycle
This is really just to guilt trip you & give you flashbacks to your QA internships
So why should you test - continued
Results from: Testing with Spark survey http://bit.ly/holdenTestingSpark
So why should you test - continued
Results from: Testing with Spark survey http://bit.ly/holdenTestingSpark
Why don’t we test?
It’s hard
Faking data, setting up integration tests, urgh w/e
Our tests can get too slow
It takes a lot of time
and people always want everything done yesterday
or I just want to go home see my partner
etc.
Cat photo from http://galato901.deviantart.com/art/Cat-on-Work-Break-173043455
An artisanal Spark unit test
@transient private var _sc: SparkContext = _
def sc: SparkContext = _sc

override def beforeAll() {
  _sc = new SparkContext("local[4]", "test") // SparkContext also needs an app name
  super.beforeAll()
}

override def afterAll() {
  if (sc != null) {
    sc.stop()
  }
  System.clearProperty("spark.driver.port") // rebind issue
  _sc = null
  super.afterAll()
}
Photo by morinesque
And on to the actual test...
test("really simple transformation") {
val input = List("hi", "hi holden", "bye")
val expected = List(List("hi"), List("hi", "holden"), List("bye"))
assert(tokenize(sc.parallelize(input)).collect().toList === expected)
}
def tokenize(f: RDD[String]) = {
f.map(_.split(" ").toList)
}
Photo by morinesque
Wait, where were the batteries?
Photo by Jim Bauer
Let’s get batteries!
Spark unit testing
spark-testing-base - https://github.com/holdenk/spark-testing-base
sscheck - https://github.com/juanrh/sscheck
Integration testing
spark-integration-tests (Spark internals) - https://github.com/databricks/spark-integration-tests
Performance
spark-perf (also for Spark internals) - https://github.com/databricks/spark-perf
Spark job validation
Photo by Mike Mozart
A simple unit test re-visited (Scala)
class SampleRDDTest extends FunSuite with SharedSparkContext {
  test("really simple transformation") {
    val input = List("hi", "hi holden", "bye")
    val expected = List(List("hi"), List("hi", "holden"), List("bye"))
    assert(SampleRDD.tokenize(sc.parallelize(input)).collect().toList === expected)
  }
}
Ok but what about problems @ scale
Maybe our program works fine on our local-sized input
If we are using Spark, our actual workload is probably huge
How do we test workloads too large for a single machine?
we can’t just use parallelize and collect
Qfamily
Distributed “set” operations to the rescue*
Pretty close - already built into Spark (a quick sketch follows below)
Doesn’t do so well with floating points :(
damn floating points keep showing up everywhere :p
Doesn’t really handle duplicates very well
{“coffee”, “coffee”, “panda”} != {“panda”, “coffee”} but with set operations...
Matti Mattila
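
Hedging loudly, a minimal sketch of the idea (not any library’s API), using Spark’s built-in subtract:

// Treats both RDDs as sets: duplicate counts are ignored and exact
// equality on doubles stays fragile, per the caveats above.
def setLikeEqual[T: ClassTag](expected: RDD[T], result: RDD[T]): Boolean = {
  expected.subtract(result).isEmpty() && result.subtract(expected).isEmpty()
}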
Or use RDDComparisions:
def compareWithOrderSamePartitioner[T: ClassTag](
    expected: RDD[T], result: RDD[T]): Option[(T, T)] = {
  expected.zip(result).filter{case (x, y) => x != y}.take(1).headOption
}
Matti Mattila
Or use RDDComparisions:
def compare[T: ClassTag](expected: RDD[T], result: RDD[T]): Option[(T, Int, Int)] = {
  val expectedKeyed = expected.map(x => (x, 1)).reduceByKey(_ + _)
  val resultKeyed = result.map(x => (x, 1)).reduceByKey(_ + _)
  expectedKeyed.cogroup(resultKeyed).filter{case (_, (i1, i2)) =>
    i1.isEmpty || i2.isEmpty || i1.head != i2.head}.take(1).headOption.
    map{case (v, (i1, i2)) => (v, i1.headOption.getOrElse(0), i2.headOption.getOrElse(0))}
}
Matti Mattila
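
For concreteness, a hedged usage sketch of the compare helper above (the values are made up): it returns None when both RDDs hold the same elements with the same counts, regardless of order.

test("expected and result RDDs should match") {
  val expectedRDD = sc.parallelize(Seq("panda", "coffee", "coffee"))
  val resultRDD = sc.parallelize(Seq("coffee", "panda", "coffee"))
  // None means no (element, expected count, result count) mismatch was found
  assert(compare(expectedRDD, resultRDD).isEmpty)
}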
But where do we get the data for those tests?
If you have production data you can sample, you are lucky!
If possible you can try and save it in the same format as production
If our data is a bunch of Vectors or Doubles, Spark’s got tools :)
Coming up with good test data can take a long time
Lori Rielly
QuickCheck / ScalaCheck
QuickCheck generates test data under a set of constraints
The Scala version is ScalaCheck - supported by the two unit testing libraries for Spark
sscheck
Awesome people*, supports generating DStreams too!
spark-testing-base
Also Awesome people*, generates more pathological (e.g. empty partitions etc.) RDDs
tara hunt
With spark-testing-base
test("map should not change number of elements") {
forAll(RDDGenerator.genRDD[String](sc)){
rdd => rdd.map(_.length).count() == rdd.count()
}
}
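
A hedged aside: forAll on its own only builds a ScalaCheck Prop. Assuming the suite also mixes in ScalaTest’s org.scalatest.prop.Checkers, wrapping the property in check(...) makes a failing property actually fail the test:

check(forAll(RDDGenerator.genRDD[String](sc)){
  rdd => rdd.map(_.length).count() == rdd.count()
})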
Testing streaming….
Photo by Steve Jurvetson
// Setup our Stream:
class TestInputStream[T: ClassTag](@transient var sc: SparkContext,
    ssc_ : StreamingContext, input: Seq[Seq[T]], numPartitions: Int)
  extends FriendlyInputDStream[T](ssc_) {

  def start() {}
  def stop() {}

  def compute(validTime: Time): Option[RDD[T]] = {
    logInfo("Computing RDD for time " + validTime)
    val index = ((validTime - ourZeroTime) / slideDuration - 1).toInt
    val selectedInput = if (index < input.size) input(index) else Seq[T]()
    // lets us test cases where RDDs are not created
    if (selectedInput == null) {
      return None
    }
    val rdd = sc.makeRDD(selectedInput, numPartitions)
    logInfo("Created RDD " + rdd.id + " with " + selectedInput)
    Some(rdd)
  }
}
Artisanal Stream Testing Code
trait StreamingSuiteBase extends FunSuite with BeforeAndAfterAll with Logging
    with SharedSparkContext {

  // Name of the framework for Spark context
  def framework: String = this.getClass.getSimpleName

  // Master for Spark context
  def master: String = "local[4]"

  // Batch duration
  def batchDuration: Duration = Seconds(1)

  // Directory where the checkpoint data will be saved
  lazy val checkpointDir = {
    val dir = Utils.createTempDir()
    logDebug(s"checkpointDir: $dir")
    dir.toString
  }

  // Default after function for any streaming test suite. Override this
  // if you want to add your stuff to "after" (i.e., don't call after { } )
  override def afterAll() {
    System.clearProperty("spark.streaming.clock")
    super.afterAll()
  }
Photo by Steve Jurvetson
and continued….
/**
 * Create an input stream for the provided input sequence. This is done using
 * TestInputStream as queueStreams are not checkpointable.
 */
def createTestInputStream[T: ClassTag](sc: SparkContext, ssc_ : TestStreamingContext,
    input: Seq[Seq[T]]): TestInputStream[T] = {
  new TestInputStream(sc, ssc_, input, numInputPartitions)
}

// Default before function for any streaming test suite. Override this
// if you want to add your stuff to "before" (i.e., don't call before { } )
override def beforeAll() {
  if (useManualClock) {
    logInfo("Using manual clock")
    // We can specify our own clock
    conf.set("spark.streaming.clock", "org.apache.spark.streaming.util.TestManualClock")
  } else {
    logInfo("Using real clock")
    conf.set("spark.streaming.clock", "org.apache.spark.streaming.util.SystemClock")
  }
  super.beforeAll()
}

/**
 * Run a block of code with the given StreamingContext and automatically
 * stop the context when the block completes or when an exception is thrown.
 */
def withOutputAndStreamingContext[R](
    outputStreamSSC: (TestOutputStream[R], TestStreamingContext))
    (block: (TestOutputStream[R], TestStreamingContext) => Unit): Unit = {
  val outputStream = outputStreamSSC._1
  val ssc = outputStreamSSC._2
  try {
    block(outputStream, ssc)
  } finally {
    try {
      ssc.stop(stopSparkContext = false)
    } catch {
      case e: Exception =>
        logError("Error stopping StreamingContext", e)
    }
  }
}
} // closes StreamingSuiteBase
and now for the clock
/*
 * Allows us access to a manual clock. Note that the manual clock changed
 * between 1.1.1 and 1.3.
 */
class TestManualClock(var time: Long) extends Clock {
  def this() = this(0L)

  def getTime(): Long = getTimeMillis() // Compat
  def currentTime(): Long = getTimeMillis() // Compat

  def getTimeMillis(): Long =
    synchronized {
      time
    }

  def setTime(timeToSet: Long): Unit =
    synchronized {
      time = timeToSet
      notifyAll()
    }

  def advance(timeToAdd: Long): Unit =
    synchronized {
      time += timeToAdd
      notifyAll()
    }

  def addToTime(timeToAdd: Long): Unit = advance(timeToAdd) // Compat

  /**
   * @param targetTime block until the clock time is set or advanced to at least this time
   * @return current time reported by the clock when waiting finishes
   */
  def waitTillTime(targetTime: Long): Long =
    synchronized {
      while (time < targetTime) {
        wait(100)
      }
      getTimeMillis()
    }
}
Testing streaming the happy panda way
Creating test data is hard
ssc.queueStream works - unless you need checkpoints (1.4.1+)
Collecting the data locally is hard
foreachRDD & a var
figuring out when your test is “done”
Let’s abstract all that away into testOperation
We can hide all of that:
test("really simple transformation") {
val input = List(List("hi"), List("hi holden"), List("bye"))
val expected = List(List("hi"), List("hi", "holden"), List("bye"))
testOperation[String, String](input, tokenize _, expected, useSet = true)
}
Photo by An eye for my mind
What about DataFrames?
We can do the same as we did for RDDs (via .rdd)
Inside of Spark, validation looks like:
def checkAnswer(df: DataFrame, expectedAnswer: Seq[Row])
Sadly it’s not in a published package & local only
instead we expose (usage sketch below):
def equalDataFrames(expected: DataFrame, result: DataFrame)
def approxEqualDataFrames(e: DataFrame, r: DataFrame, tol: Double)
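
A hedged usage sketch of the approximate variant (the column names and values here are invented), for when floating point drift shows up:

test("scores should match within tolerance") {
  val sqlCtx = sqlContext
  import sqlCtx.implicits._
  val expected = sc.parallelize(List(("panda", 1.0), ("coffee", 2.0))).toDF("name", "score")
  // tiny drift, e.g. from a different aggregation order
  val result = sc.parallelize(List(("panda", 1.0 + 1e-7), ("coffee", 2.0))).toDF("name", "score")
  approxEqualDataFrames(expected, result, 1e-5)
}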
…. and Datasets
We can do the same as we did for RDDs (via .rdd)
Inside of Spark, validation looks like:
def checkAnswer(df: Dataset[T], expectedAnswer: T*)
Sadly it’s not in a published package & local only
instead we expose:
def equalDatasets(expected: Dataset[U], result: Dataset[V])
def approxEqualDatasets(e: Dataset[U], r: Dataset[V], tol: Double)
This is what it looks like:
test("dataframe should be equal to itself") {
  val sqlCtx = sqlContext
  import sqlCtx.implicits._ // Yah I know this is ugly
  val input = sc.parallelize(inputList).toDF
  equalDataFrames(input, input)
}
*This may or may not be easier.
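
And a hedged Dataset analogue of the same test, assuming Spark 1.6+ and a suite exposing equalDatasets:

test("dataset should be equal to itself") {
  val sqlCtx = sqlContext
  import sqlCtx.implicits._
  val input = sqlCtx.createDataset(List("panda", "coffee"))
  equalDatasets(input, input)
}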
Which has “built-in” large support :)
Photo by allison
Let’s talk about local mode
It’s way better than you would expect*
It does its best to try and catch serialization errors
It’s still not the same as running on a “real” cluster
Especially since, if local mode were all we needed, parallelize and collect might be fine
Photo by: Bev Sykes
Options beyond local mode:
Just point at your existing cluster (set master - see the sketch after this list)
Start one with your shell scripts & change the master
Really easy way to plug into existing integration testing
spark-docker - hack in our own tests
YarnMiniCluster
https://github.com/apache/spark/blob/master/yarn/src/test/scala/org/apache/spark/deploy/yarn/BaseYarnClusterSuite.scala
In Spark Testing Base extend SharedMiniCluster
Not recommended until after SPARK-10812 (e.g. 1.5.2+ or 1.6+)
Photo by Richard Masoner
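
As promised above, a minimal sketch of pointing a test run at an existing cluster instead of local mode; the master URL is a placeholder, not a real endpoint:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("spark://your-master-host:7077") // placeholder address
  .setAppName("integration-tests")
val sc = new SparkContext(conf)
// now run the same assertions you would in local mode, against a real cluster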
Validation
Validation can be really useful for catching errors before deploying a model
Our tests can’t catch everything
For now, checking file sizes & execution time seems like the most common best practice (from the survey)
Accumulators have some challenges (see SPARK-12469 for progress) but are an interesting option - see the sketch below
spark-validator is still in early stages and not ready for production use, but it is an interesting proof of concept
Photo by: Paul Schadler
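
A minimal sketch of the accumulator option, not spark-validator’s API; rawInput is an assumed RDD[String] and parseRecord a hypothetical parser returning an Option:

val rejected = sc.accumulator(0L, "rejected records")
val parsed = rawInput.flatMap { line =>
  val record = parseRecord(line) // hypothetical parser
  if (record.isEmpty) rejected += 1L
  record
}
parsed.cache()
val goodCount = parsed.count() // an action, so the accumulator is now populated
// fail before writing output if more than 1% of records were rejected
require(rejected.value <= 0.01 * (goodCount + rejected.value),
  s"Too many rejected records: ${rejected.value}")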
Related talks & blog posts
Testing Spark Best Practices (Spark Summit 2014)
Every Day I’m Shuffling (Strata 2015) & slides
Spark and Spark Streaming Unit Testing
Making Spark Unit Testing With Spark Testing Base
Learning Spark
Fast Data Processing with Spark (Out of Date)
Fast Data Processing with Spark (2nd edition)
Advanced Analytics with Spark
Coming soon: Spark in Action
Coming soon: High Performance Spark
And the next book…..
Still being written - sign up to be notified when it is available:
http://www.highperformancespark.com
https://twitter.com/highperfspark
Related packages
spark-testing-base: https://github.com/holdenk/spark-testing-base
sscheck: https://github.com/juanrh/sscheck
spark-validator: https://github.com/holdenk/spark-validator *ALPHA*
spark-perf: https://github.com/databricks/spark-perf
spark-integration-tests: https://github.com/databricks/spark-integration-tests
“Future Work”
Better ScalaCheck integration (ala sscheck)
Testing details in my next Spark book
Whatever* you all want
Testing with Spark survey: http://bit.ly/holdenTestingSpark
Semi-likely:
integration testing (for now see @cfriegly’s Spark + Docker setup):
https://github.com/fluxcapacitor/pipeline
Pretty unlikely:
*That I feel like doing, or you feel like making a pull request for.
Photo by bullet101
Cat wave photo by Quinn Dombrowski
k thnx bye!
If you want to fill out the survey:
http://bit.ly/holdenTestingSpark
Will update the results in a Strata presentation & tweet them eventually at @holdenkarau