Beyond Parallelize & Collect
(Effective testing of Spark Programs)
Now mostly “works”*
*See developer for details. Does not imply warranty. :p
Who am I?
My name is Holden Karau
Preferred pronouns are she/her
I’m a Software Engineer
currently IBM and previously Alpine, Databricks, Google, Foursquare & Amazon
co-author of Learning Spark & Fast Data Processing with Spark
@holdenkarau
Slide share: http://www.slideshare.net/hkarau
Linkedin: https://www.linkedin.com/in/holdenkarau
What is going to be covered:
What I think I might know about you
A bit about why you should test your programs
Using parallelize & collect for unit testing (quick skim)
Comparing datasets too large to fit in memory
Considerations for Streaming & SQL (DataFrames & Datasets)
Cute & scary pictures
I promise at least one panda and one cat
“Future Work”
Who I think you wonderful humans are?
Nice* people
Like silly pictures
Familiar with Apache Spark
If not, buy one of my books or watch Paco’s awesome video
Familiar with one of Scala, Java, or Python
If you know R well I’d love to chat though
Want to make better software
(or models, or w/e)
So why should you test?
Makes you a better person
Save $s
May help you avoid losing your employer all of their money
Or “users” if we were in the bay
AWS is expensive
Waiting for our jobs to fail is a pretty long dev cycle
This is really just to guilt trip you & give you flashbacks to your QA internships
So why should you test - continued
Results from: Testing with Spark survey http://bit.ly/holdenTestingSpark
So why should you test - continued
Results from: Testing with Spark survey http://bit.ly/holdenTestingSpark
Why don’t we test?
It’s hard
Faking data, setting up integration tests, urgh w/e
Our tests can get too slow
It takes a lot of time
and people always want everything done yesterday
or I just want to go home see my partner
etc.
Cat photo from http://galato901.deviantart.com/art/Cat-on-Work-Break-173043455
An artisanal Spark unit test
@transient private var _sc: SparkContext = _
def sc: SparkContext = _sc

override def beforeAll() {
  _sc = new SparkContext("local[4]", "test") // SparkContext also needs an app name
  super.beforeAll()
}

override def afterAll() {
  if (sc != null) {
    sc.stop()
  }
  System.clearProperty("spark.driver.port") // rebind issue
  _sc = null
  super.afterAll()
}
Photo by morinesque
And on to the actual test...
test("really simple transformation") {
val input = List("hi", "hi holden", "bye")
val expected = List(List("hi"), List("hi", "holden"), List("bye"))
assert(tokenize(sc.parallelize(input)).collect().toList === expected)
}
def tokenize(f: RDD[String]) = {
f.map(_.split(" ").toList)
}
Photo by morinesque
Wait, where were the batteries?
Photo by Jim Bauer
Let’s get batteries!
Spark unit testing
spark-testing-base - https://github.com/holdenk/spark-testing-base
sscheck - https://github.com/juanrh/sscheck
Integration testing
spark-integration-tests (Spark internals) - https://github.com/databricks/spark-integration-tests
Performance
spark-perf (also for Spark internals) - https://github.com/databricks/spark-perf
Spark job validation
Photo by Mike Mozart
A simple unit test re-visited (Scala)
class SampleRDDTest extends FunSuite with SharedSparkContext {
  test("really simple transformation") {
    val input = List("hi", "hi holden", "bye")
    val expected = List(List("hi"), List("hi", "holden"), List("bye"))
    assert(SampleRDD.tokenize(sc.parallelize(input)).collect().toList === expected)
  }
}
Ok but what about problems @ scale
Maybe our program works fine on our local-sized input
If we are using Spark, our actual workload is probably huge
How do we test workloads too large for a single machine?
we can’t just use parallelize and collect
Qfamily
Distributed “set” operations to the rescue*
Pretty close - already built into Spark (a quick sketch follows below)
Doesn’t do so well with floating points :(
damn floating points keep showing up everywhere :p
Doesn’t really handle duplicates very well
{“coffee”, “coffee”, “panda”} != {“panda”, “coffee”} but with set operations...
Matti Mattila
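
Hedging loudly, a minimal sketch of the idea (not any library’s API), using Spark’s built-in subtract:

// Treats both RDDs as sets: duplicate counts are ignored and exact
// equality on doubles stays fragile, per the caveats above.
def setLikeEqual[T: ClassTag](expected: RDD[T], result: RDD[T]): Boolean = {
  expected.subtract(result).isEmpty() && result.subtract(expected).isEmpty()
}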
Or use RDDComparisions:
def compareWithOrderSamePartitioner[T: ClassTag](
    expected: RDD[T], result: RDD[T]): Option[(T, T)] = {
  expected.zip(result).filter{case (x, y) => x != y}.take(1).headOption
}
Matti Mattila
Or use RDDComparisions:
def compare[T: ClassTag](expected: RDD[T], result: RDD[T]): Option[(T, Int, Int)] = {
  val expectedKeyed = expected.map(x => (x, 1)).reduceByKey(_ + _)
  val resultKeyed = result.map(x => (x, 1)).reduceByKey(_ + _)
  expectedKeyed.cogroup(resultKeyed).filter{case (_, (i1, i2)) =>
    i1.isEmpty || i2.isEmpty || i1.head != i2.head}.take(1).headOption.
    map{case (v, (i1, i2)) => (v, i1.headOption.getOrElse(0), i2.headOption.getOrElse(0))}
}
Matti Mattila
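
For concreteness, a hedged usage sketch of the compare helper above (the values are made up): it returns None when both RDDs hold the same elements with the same counts, regardless of order.

test("expected and result RDDs should match") {
  val expectedRDD = sc.parallelize(Seq("panda", "coffee", "coffee"))
  val resultRDD = sc.parallelize(Seq("coffee", "panda", "coffee"))
  // None means no (element, expected count, result count) mismatch was found
  assert(compare(expectedRDD, resultRDD).isEmpty)
}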
But where do we get the data for those tests?
If you have production data you can sample, you are lucky!
If possible you can try and save it in the same format as production
If our data is a bunch of Vectors or Doubles, Spark’s got tools :)
Coming up with good test data can take a long time
Lori Rielly
QuickCheck / ScalaCheck
QuickCheck generates test data under a set of constraints
The Scala version is ScalaCheck - supported by the two unit testing libraries for Spark
sscheck
Awesome people*, supports generating DStreams too!
spark-testing-base
Also Awesome people*, generates more pathological (e.g. empty partitions etc.) RDDs
tara hunt
With spark-testing-base
test("map should not change number of elements") {
forAll(RDDGenerator.genRDD[String](sc)){
rdd => rdd.map(_.length).count() == rdd.count()
}
}
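
A hedged aside: forAll on its own only builds a ScalaCheck Prop. Assuming the suite also mixes in ScalaTest’s org.scalatest.prop.Checkers, wrapping the property in check(...) makes a failing property actually fail the test:

check(forAll(RDDGenerator.genRDD[String](sc)){
  rdd => rdd.map(_.length).count() == rdd.count()
})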
Testing streaming….
Photo by Steve Jurvetson
// Setup our Stream:
class TestInputStream[T: ClassTag](@transient var sc: SparkContext,
    ssc_ : StreamingContext, input: Seq[Seq[T]], numPartitions: Int)
  extends FriendlyInputDStream[T](ssc_) {

  def start() {}
  def stop() {}

  def compute(validTime: Time): Option[RDD[T]] = {
    logInfo("Computing RDD for time " + validTime)
    val index = ((validTime - ourZeroTime) / slideDuration - 1).toInt
    val selectedInput = if (index < input.size) input(index) else Seq[T]()
    // lets us test cases where RDDs are not created
    if (selectedInput == null) {
      return None
    }
    val rdd = sc.makeRDD(selectedInput, numPartitions)
    logInfo("Created RDD " + rdd.id + " with " + selectedInput)
    Some(rdd)
  }
}
Artisanal Stream Testing Code
trait StreamingSuiteBase extends FunSuite with BeforeAndAfterAll with Logging
    with SharedSparkContext {

  // Name of the framework for Spark context
  def framework: String = this.getClass.getSimpleName

  // Master for Spark context
  def master: String = "local[4]"

  // Batch duration
  def batchDuration: Duration = Seconds(1)

  // Directory where the checkpoint data will be saved
  lazy val checkpointDir = {
    val dir = Utils.createTempDir()
    logDebug(s"checkpointDir: $dir")
    dir.toString
  }

  // Default after function for any streaming test suite. Override this
  // if you want to add your stuff to "after" (i.e., don't call after { } )
  override def afterAll() {
    System.clearProperty("spark.streaming.clock")
    super.afterAll()
  }
Photo by Steve Jurvetson
and continued….
/**
 * Create an input stream for the provided input sequence. This is done using
 * TestInputStream as queueStreams are not checkpointable.
 */
def createTestInputStream[T: ClassTag](sc: SparkContext, ssc_ : TestStreamingContext,
    input: Seq[Seq[T]]): TestInputStream[T] = {
  new TestInputStream(sc, ssc_, input, numInputPartitions)
}

// Default before function for any streaming test suite. Override this
// if you want to add your stuff to "before" (i.e., don't call before { } )
override def beforeAll() {
  if (useManualClock) {
    logInfo("Using manual clock")
    // We can specify our own clock
    conf.set("spark.streaming.clock", "org.apache.spark.streaming.util.TestManualClock")
  } else {
    logInfo("Using real clock")
    conf.set("spark.streaming.clock", "org.apache.spark.streaming.util.SystemClock")
  }
  super.beforeAll()
}

/**
 * Run a block of code with the given StreamingContext and automatically
 * stop the context when the block completes or when an exception is thrown.
 */
def withOutputAndStreamingContext[R](
    outputStreamSSC: (TestOutputStream[R], TestStreamingContext))
    (block: (TestOutputStream[R], TestStreamingContext) => Unit): Unit = {
  val outputStream = outputStreamSSC._1
  val ssc = outputStreamSSC._2
  try {
    block(outputStream, ssc)
  } finally {
    try {
      ssc.stop(stopSparkContext = false)
    } catch {
      case e: Exception =>
        logError("Error stopping StreamingContext", e)
    }
  }
}
} // closes StreamingSuiteBase
and now for the clock
/*
 * Allows us access to a manual clock. Note that the manual clock changed
 * between 1.1.1 and 1.3.
 */
class TestManualClock(var time: Long) extends Clock {
  def this() = this(0L)

  def getTime(): Long = getTimeMillis() // Compat
  def currentTime(): Long = getTimeMillis() // Compat

  def getTimeMillis(): Long =
    synchronized {
      time
    }

  def setTime(timeToSet: Long): Unit =
    synchronized {
      time = timeToSet
      notifyAll()
    }

  def advance(timeToAdd: Long): Unit =
    synchronized {
      time += timeToAdd
      notifyAll()
    }

  def addToTime(timeToAdd: Long): Unit = advance(timeToAdd) // Compat

  /**
   * @param targetTime block until the clock time is set or advanced to at least this time
   * @return current time reported by the clock when waiting finishes
   */
  def waitTillTime(targetTime: Long): Long =
    synchronized {
      while (time < targetTime) {
        wait(100)
      }
      getTimeMillis()
    }
}
Testing streaming the happy panda way
Creating test data is hard
ssc.queueStream works - unless you need checkpoints (1.4.1+)
Collecting the data locally is hard
foreachRDD & a var
figuring out when your test is “done”
Let’s abstract all that away into testOperation
We can hide all of that:
test("really simple transformation") {
val input = List(List("hi"), List("hi holden"), List("bye"))
val expected = List(List("hi"), List("hi", "holden"), List("bye"))
testOperation[String, String](input, tokenize _, expected, useSet = true)
}
Photo by An eye for my mind
What about DataFrames?
We can do the same as we did for RDDs (via .rdd)
Inside of Spark, validation looks like:
def checkAnswer(df: DataFrame, expectedAnswer: Seq[Row])
Sadly it’s not in a published package & local only
instead we expose (usage sketch below):
def equalDataFrames(expected: DataFrame, result: DataFrame)
def approxEqualDataFrames(e: DataFrame, r: DataFrame, tol: Double)
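
A hedged usage sketch of the approximate variant (the column names and values here are invented), for when floating point drift shows up:

test("scores should match within tolerance") {
  val sqlCtx = sqlContext
  import sqlCtx.implicits._
  val expected = sc.parallelize(List(("panda", 1.0), ("coffee", 2.0))).toDF("name", "score")
  // tiny drift, e.g. from a different aggregation order
  val result = sc.parallelize(List(("panda", 1.0 + 1e-7), ("coffee", 2.0))).toDF("name", "score")
  approxEqualDataFrames(expected, result, 1e-5)
}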
…. and Datasets
We can do the same as we did for RDDs (via .rdd)
Inside of Spark, validation looks like:
def checkAnswer(df: Dataset[T], expectedAnswer: T*)
Sadly it’s not in a published package & local only
instead we expose:
def equalDatasets(expected: Dataset[U], result: Dataset[V])
def approxEqualDatasets(e: Dataset[U], r: Dataset[V], tol: Double)
This is what it looks like:
test("dataframe should be equal to itself") {
  val sqlCtx = sqlContext
  import sqlCtx.implicits._ // Yah I know this is ugly
  val input = sc.parallelize(inputList).toDF
  equalDataFrames(input, input)
}
*This may or may not be easier.
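
And a hedged Dataset analogue of the same test, assuming Spark 1.6+ and a suite exposing equalDatasets:

test("dataset should be equal to itself") {
  val sqlCtx = sqlContext
  import sqlCtx.implicits._
  val input = sqlCtx.createDataset(List("panda", "coffee"))
  equalDatasets(input, input)
}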
Which has “built-in” large support :)
Photo by allison
Let’s talk about local mode
It’s way better than you would expect*
It does its best to try and catch serialization errors
It’s still not the same as running on a “real” cluster
Especially since, if local mode were all we needed, parallelize and collect might be fine
Photo by: Bev Sykes
Options beyond local mode:
Just point at your existing cluster (set master - see the sketch after this list)
Start one with your shell scripts & change the master
Really easy way to plug into existing integration testing
spark-docker - hack in our own tests
YarnMiniCluster
https://github.com/apache/spark/blob/master/yarn/src/test/scala/org/apache/spark/deploy/yarn/BaseYarnClusterSuite.scala
In Spark Testing Base extend SharedMiniCluster
Not recommended until after SPARK-10812 (e.g. 1.5.2+ or 1.6+)
Photo by Richard Masoner
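
As promised above, a minimal sketch of pointing a test run at an existing cluster instead of local mode; the master URL is a placeholder, not a real endpoint:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("spark://your-master-host:7077") // placeholder address
  .setAppName("integration-tests")
val sc = new SparkContext(conf)
// now run the same assertions you would in local mode, against a real cluster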
Validation
Validation can be really useful for catching errors before deploying a model
Our tests can’t catch everything
For now, checking file sizes & execution time seems like the most common best practice (from the survey)
Accumulators have some challenges (see SPARK-12469 for progress) but are an interesting option - see the sketch below
spark-validator is still in early stages and not ready for production use, but it is an interesting proof of concept
Photo by: Paul Schadler
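
A minimal sketch of the accumulator option, not spark-validator’s API; rawInput is an assumed RDD[String] and parseRecord a hypothetical parser returning an Option:

val rejected = sc.accumulator(0L, "rejected records")
val parsed = rawInput.flatMap { line =>
  val record = parseRecord(line) // hypothetical parser
  if (record.isEmpty) rejected += 1L
  record
}
parsed.cache()
val goodCount = parsed.count() // an action, so the accumulator is now populated
// fail before writing output if more than 1% of records were rejected
require(rejected.value <= 0.01 * (goodCount + rejected.value),
  s"Too many rejected records: ${rejected.value}")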
Related talks & blog posts
Testing Spark Best Practices (Spark Summit 2014)
Every Day I’m Shuffling (Strata 2015) & slides
Spark and Spark Streaming Unit Testing
Making Spark Unit Testing With Spark Testing Base
Learning Spark
Fast Data Processing with Spark (Out of Date)
Fast Data Processing with Spark (2nd edition)
Advanced Analytics with Spark
Coming soon: Spark in Action
Coming soon: High Performance Spark
And the next book…..
Still being written - sign up to be notified when it is available:
http://www.highperformancespark.com
https://twitter.com/highperfspark
Related packages
spark-testing-base: https://github.com/holdenk/spark-testing-base
sscheck: https://github.com/juanrh/sscheck
spark-validator: https://github.com/holdenk/spark-validator *ALPHA*
spark-perf: https://github.com/databricks/spark-perf
spark-integration-tests: https://github.com/databricks/spark-integration-tests
“Future Work”
Better ScalaCheck integration (ala sscheck)
Testing details in my next Spark book
Whatever* you all want
Testing with Spark survey: http://bit.ly/holdenTestingSpark
Semi-likely:
integration testing (for now see @cfriegly’s Spark + Docker setup):
https://github.com/fluxcapacitor/pipeline
Pretty unlikely:
*That I feel like doing, or you feel like making a pull request for.
Photo by bullet101
Cat wave photo by Quinn Dombrowski
k thnx bye!
If you want to fill out the survey:
http://bit.ly/holdenTestingSpark
Will update the results in a Strata presentation & tweet them eventually at @holdenkarau