Apache Spark: in and out
Ben Fradet - Tech lead
In and out
1. Intro
2. The different batch APIs
3. Real world examples
4. Addressing the API shortcomings
5. Running and configuring a Spark job on AWS EMR
6. Outro
Intro
The different batch APIs - minimal examples
val counts = rdd
  .map(line => (line.word, 1))
  .reduceByKey(_ + _)

val counts = df
  .groupBy("word")
  .count()

val counts = spark
  .sql("select word, count(*) " +
       "from words group by word")
id | word
1  | Scala
2  | Spark
3  | API
4  | Scala
val counts = ds
  .groupByKey(_.word)
  .count()
word  | count
Scala | 2
Spark | 1
API   | 1
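The four snippets above all compute the same word counts. A plain-Scala sketch of the two collection-like styles, using the slide's 4-row sample data (Scala collections stand in for Spark RDDs/Datasets here):

```scala
// Hypothetical sample mirroring the slide's table: four rows with a word column.
val words = List("Scala", "Spark", "API", "Scala")

// RDD style: map each word to (word, 1), then reduce pairwise by key,
// mirroring rdd.map(line => (line.word, 1)).reduceByKey(_ + _)
val rddStyle: Map[String, Int] =
  words.map(w => (w, 1)).groupMapReduce(_._1)(_._2)(_ + _)

// Dataset style: group by the word itself and count each group,
// mirroring ds.groupByKey(_.word).count()
val dsStyle: Map[String, Int] =
  words.groupBy(identity).map { case (w, ws) => (w, ws.size) }
```

Both produce `Map("Scala" -> 2, "Spark" -> 1, "API" -> 1)`, which is exactly the output table above.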
The different batch APIs - comparison
                   | RDD               | SQL      | DataFrame    | Dataset
API looks like     | Scala collections | SQL      | Scala / SQL  | Scala collections
In memory          | JVM objects       | Off heap | Off heap     | Off heap
Query optimization | ✗                 | ✔        | ✔            | ✔
Code generation    | ✗                 | ✔        | ✔            | ✔
Syntax errors      | Compile time      | Runtime  | Compile time | Compile time
Analysis errors    | N/A               | Runtime  | Runtime      | Compile time
Real world examples - EnrichJob
val input: RDD[_] = getInputRDD(inputPath)
val all: RDD[(_, List[ValidatedEnrichedEvent])] =
  input.map(e => (e, enrich(e))).cache()
val bad: RDD[Row] = all.flatMap(e => projectBads(e))
spark.createDataFrame(bad).write.text(badOutputPath)
val good: RDD[EnrichedEvent] = all.flatMap(e => projectGoods(e))
spark.createDataset(good).write.csv(goodOutputPath)
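The key idea in EnrichJob is one cached enrichment pass feeding two outputs. A minimal sketch with plain Scala collections in place of RDDs and `Either` in place of the `ValidatedEnrichedEvent` type; `RawEvent`, `EnrichedEvent`, and this `enrich` are hypothetical stand-ins:

```scala
case class RawEvent(payload: String)
case class EnrichedEvent(payload: String)

// Hypothetical enrichment: fails on empty payloads, upcases the rest.
def enrich(e: RawEvent): Either[String, EnrichedEvent] =
  if (e.payload.nonEmpty) Right(EnrichedEvent(e.payload.toUpperCase))
  else Left("empty payload")

val input = List(RawEvent("page_view"), RawEvent(""), RawEvent("click"))

// One enrichment pass feeds both outputs, like the cached `all` RDD above.
val all = input.map(e => (e, enrich(e)))

val bad  = all.collect { case (e, Left(err)) => (e, err) } // projectBads
val good = all.collect { case (_, Right(ev)) => ev }       // projectGoods
```

Without the `cache()` in the real job, Spark would recompute the enrichment once for the bad output and once for the good one.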
Real world examples - ShredJob
val input: RDD[String] = sc.textFile(inputPath)
val all: RDD[(String, List[ValidatedShreddedEvent])] =
  input.map(e => shred(e)).cache()
val bad: RDD[Row] = all.flatMap(e => projectBads(e))
val deduped: RDD[(ShreddedEvent, ValidatedBoolean)] = all
  .flatMap(e => projectGoods(e))
  .groupBy(e => (e.eventId, e.eventFingerprint))
  .flatMap { case (_, values) => values.take(1) }
  .map(e => (e, dedupeCrossBatch(e)))
// …
spark.createDataFrame(bad + dedupeF).write.text(badOutputPath)
spark.createDataFrame(dedupeS.map(_.event)).write.text(goodOutputPath)
dedupeS.map(_.shreds).write.text(shreddedTypesOutputPath)
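The in-batch dedup step above keeps one event per `(eventId, eventFingerprint)` pair. The same `groupBy` + `take(1)` shape on plain Scala collections (`ShreddedEvent` and the sample data are hypothetical):

```scala
case class ShreddedEvent(eventId: String, eventFingerprint: String, payload: String)

val events = List(
  ShreddedEvent("e1", "f1", "first"),
  ShreddedEvent("e1", "f1", "same id and fingerprint, dropped"),
  ShreddedEvent("e2", "f2", "second")
)

// Group on the natural dedup key, then keep one representative per group.
val deduped = events
  .groupBy(e => (e.eventId, e.eventFingerprint))
  .flatMap { case (_, values) => values.take(1) }
  .toList
```

In Spark this `groupBy` triggers a shuffle, which is why the job dedupes within a batch first and only then checks cross-batch duplicates.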
Addressing the API shortcomings - typelevel/frameless
// untyped -> runtime exception
ds.select(ds("not-word"))
val typedDS = TypedDataset.create(ds)
// typed -> doesn't compile
typedDS.select(typedDS('notWord))
val counts = ds.groupByKey(_.word).count()
case class WordCount(word: String, count: Long)
// checked at compile time
counts.as[WordCount]
id | word
1  | Scala
2  | Spark
3  | API
4  | Scala
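What frameless buys can be sketched without Spark at all: an untyped string key can only fail at runtime, while a typed field access fails at compile time.

```scala
import scala.util.Try

// Untyped: like ds.select(ds("not-word")) -- the typo survives compilation
// and only blows up when the lookup actually runs.
val row: Map[String, Any] = Map("word" -> "Scala", "count" -> 2L)
val lookup = Try(row("not-word"))

// Typed: like typedDS.select(typedDS('word)) -- a typo such as
// wc.notWord simply does not compile, so it can never reach production.
case class WordCount(word: String, count: Long)
val wc = WordCount("Scala", 2L)
```

This is the "analysis errors at compile time" column from the comparison table, recovered for the Dataset API.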
Addressing the API shortcomings - BenFradet/struct-type-encoder
case class MyCaseClass(a: Int, b: String, c: Double)
val inferred = spark
.read
.json("/some/dir/*.json")
.as[MyCaseClass]
val derived = spark
.read
.schema(StructTypeEncoder[MyCaseClass].encode)
.json("/some/dir/*.json")
.as[MyCaseClass]
Plus support for metadata and deeply-nested schemas!
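struct-type-encoder derives the `StructType` from the case class at compile time (via shapeless), so Spark can skip the schema-inference pass over the JSON files. This runtime sketch only illustrates the idea, reading field names off the same case class with the Scala 2.13+ `Product` API:

```scala
case class MyCaseClass(a: Int, b: String, c: Double)

// The schema's column names, recovered from the case class itself
// rather than inferred from the data.
val fieldNames = MyCaseClass(1, "x", 2.0).productElementNames.toList
```

With the inferred variant, Spark must read the input once just to guess the schema; the derived variant hands it over up front.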
Running and configuring a Spark job on AWS EMR
--configurations '[{
  "Classification": "yarn-site",
  "Properties": {
    "yarn.nodemanager.vmem-check-enabled": "false",
    "yarn.nodemanager.resource.memory-mb": "117760",
    "yarn.scheduler.maximum-allocation-mb": "117760"
  }
},{
  "Classification": "spark",
  "Properties": { "maximizeResourceAllocation": "false" }
},{
  "Classification": "spark-defaults",
  "Properties": {
    "spark.dynamicAllocation.enabled": "false",
    "spark.executor.instances": "4",
    "spark.yarn.executor.memoryOverhead": "3072",
    "spark.executor.memory": "20G",
    "spark.executor.cores": "3",
    "spark.yarn.driver.memoryOverhead": "3072",
    "spark.driver.memory": "20G",
    "spark.driver.cores": "3",
    "spark.default.parallelism": "48"
  }
}]'
aws emr create-cluster \
  --name "Snowplow Enrich Job" \
  --release-label emr-5.12.0 \
  --applications Name=Spark \
  --instance-type r4.4xlarge \
  --instance-count 1 \
  --steps '[{
    "Name": "Snowplow Spark Enrich",
    "Args": [...],
    "Jar": "s3://bucket/my-jar.jar",
    "ActionOnFailure": "CONTINUE",
    "MainClass": "EnrichJob",
    "Type": "CUSTOM_JAR",
    "Properties": "string"
  }]'
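A back-of-the-envelope check that the configuration above actually fits the node; every number below is copied from the settings, nothing else is assumed:

```scala
val yarnMemoryMb     = 117760      // yarn.nodemanager.resource.memory-mb
val executorMemoryMb = 20 * 1024   // spark.executor.memory = 20G
val overheadMb       = 3072        // spark.yarn.executor.memoryOverhead
val executors        = 4           // spark.executor.instances
val cores            = 3           // spark.executor.cores

val perContainerMb = executorMemoryMb + overheadMb  // 23552 MB per container
val executorsMb    = executors * perContainerMb     // 94208 MB for executors

// The driver container (same 20G + 3072 overhead) takes the remainder,
// filling the 117760 MB YARN allocation exactly.
val withDriverMb = executorsMb + perContainerMb

// spark.default.parallelism = executors * cores * 4 tasks per core
val parallelism = executors * cores * 4
```

This is why `maximizeResourceAllocation` and dynamic allocation are both switched off: the containers are sized by hand to pack the single r4.4xlarge exactly.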
Spark config cheat sheet
Thanks!
GitHub:
- github.com/BenFradet
- github.com/snowplow/snowplow
Twitter:
- @fradetben
- @snowplowdata
Contact us:
sales@snowplowanalytics.com
snowplowanalytics.com
