Apache Spark: in and out
Ben Fradet - Tech lead
In and out
1. Intro
2. The different batch APIs
3. Real world examples
4. Addressing the API shortcomings
5. Running and configuring a Spark job on AWS EMR
6. Outro
Intro
The different batch APIs - minimal examples
val counts = rdd
  .map(line => (line.word, 1))
  .reduceByKey(_ + _)

val counts = df
  .groupBy("word")
  .count()

val counts = spark
  .sql("select word, count(*) " +
       "from words group by word")
id | word
1  | Scala
2  | Spark
3  | API
4  | Scala
val counts = ds
  .groupByKey(_.word)
  .count()
word  | count
Scala | 2
Spark | 1
API   | 1
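The four snippets above all compute the same word counts. A plain-Scala sketch of the two collection-like styles, using the slide's 4-row sample data (Scala collections stand in for Spark RDDs/Datasets here):

```scala
// Hypothetical sample mirroring the slide's table: four rows with a word column.
val words = List("Scala", "Spark", "API", "Scala")

// RDD style: map each word to (word, 1), then reduce pairwise by key,
// mirroring rdd.map(line => (line.word, 1)).reduceByKey(_ + _)
val rddStyle: Map[String, Int] =
  words.map(w => (w, 1)).groupMapReduce(_._1)(_._2)(_ + _)

// Dataset style: group by the word itself and count each group,
// mirroring ds.groupByKey(_.word).count()
val dsStyle: Map[String, Int] =
  words.groupBy(identity).map { case (w, ws) => (w, ws.size) }
```

Both produce `Map("Scala" -> 2, "Spark" -> 1, "API" -> 1)`, which is exactly the output table above.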
The different batch APIs - comparison
                   | RDD               | SQL      | DataFrame    | Dataset
API looks like     | Scala collections | SQL      | Scala / SQL  | Scala collections
In memory          | JVM objects       | Off heap | Off heap     | Off heap
Query optimization | ✗                 | ✔        | ✔            | ✔
Code generation    | ✗                 | ✔        | ✔            | ✔
Syntax errors      | Compile time      | Runtime  | Compile time | Compile time
Analysis errors    | N/A               | Runtime  | Runtime      | Compile time
Real world examples - EnrichJob
val input: RDD[_] = getInputRDD(inputPath)
val all: RDD[(_, List[ValidatedEnrichedEvent])] =
  input.map(e => (e, enrich(e))).cache()
val bad: RDD[Row] = all.flatMap(e => projectBads(e))
spark.createDataFrame(bad).write.text(badOutputPath)
val good: RDD[EnrichedEvent] = all.flatMap(e => projectGoods(e))
spark.createDataset(good).write.csv(goodOutputPath)
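The key idea in EnrichJob is one cached enrichment pass feeding two outputs. A minimal sketch with plain Scala collections in place of RDDs and `Either` in place of the `ValidatedEnrichedEvent` type; `RawEvent`, `EnrichedEvent`, and this `enrich` are hypothetical stand-ins:

```scala
case class RawEvent(payload: String)
case class EnrichedEvent(payload: String)

// Hypothetical enrichment: fails on empty payloads, upcases the rest.
def enrich(e: RawEvent): Either[String, EnrichedEvent] =
  if (e.payload.nonEmpty) Right(EnrichedEvent(e.payload.toUpperCase))
  else Left("empty payload")

val input = List(RawEvent("page_view"), RawEvent(""), RawEvent("click"))

// One enrichment pass feeds both outputs, like the cached `all` RDD above.
val all = input.map(e => (e, enrich(e)))

val bad  = all.collect { case (e, Left(err)) => (e, err) } // projectBads
val good = all.collect { case (_, Right(ev)) => ev }       // projectGoods
```

Without the `cache()` in the real job, Spark would recompute the enrichment once for the bad output and once for the good one.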
Real world examples - ShredJob
val input: RDD[String] = sc.textFile(inputPath)
val all: RDD[(String, List[ValidatedShreddedEvent])] =
  input.map(e => shred(e)).cache()
val bad: RDD[Row] = all.flatMap(e => projectBads(e))
val deduped: RDD[(ShreddedEvent, ValidatedBoolean)] = all
  .flatMap(e => projectGoods(e))
  .groupBy(e => (e.eventId, e.eventFingerprint))
  .flatMap { case (_, values) => values.take(1) }
  .map(e => (e, dedupeCrossBatch(e)))
// …
spark.createDataFrame(bad + dedupeF).write.text(badOutputPath)
spark.createDataFrame(dedupeS.map(_.event)).write.text(goodOutputPath)
dedupeS.map(_.shreds).write.text(shreddedTypesOutputPath)
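The in-batch dedup step above keeps one event per `(eventId, eventFingerprint)` pair. The same `groupBy` + `take(1)` shape on plain Scala collections (`ShreddedEvent` and the sample data are hypothetical):

```scala
case class ShreddedEvent(eventId: String, eventFingerprint: String, payload: String)

val events = List(
  ShreddedEvent("e1", "f1", "first"),
  ShreddedEvent("e1", "f1", "same id and fingerprint, dropped"),
  ShreddedEvent("e2", "f2", "second")
)

// Group on the natural dedup key, then keep one representative per group.
val deduped = events
  .groupBy(e => (e.eventId, e.eventFingerprint))
  .flatMap { case (_, values) => values.take(1) }
  .toList
```

In Spark this `groupBy` triggers a shuffle, which is why the job dedupes within a batch first and only then checks cross-batch duplicates.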
Addressing the API shortcomings - typelevel/frameless
// untyped -> runtime exception
ds.select(ds("not-word"))
val typedDS = TypedDataset.create(ds)
// typed -> doesn't compile
typedDS.select(typedDS('notWord))
val counts = ds.groupByKey(_.word).count()
case class WordCount(word: String, count: Long)
// checked at compile time
counts.as[WordCount]
id | word
1  | Scala
2  | Spark
3  | API
4  | Scala
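What frameless buys can be sketched without Spark at all: an untyped string key can only fail at runtime, while a typed field access fails at compile time.

```scala
import scala.util.Try

// Untyped: like ds.select(ds("not-word")) -- the typo survives compilation
// and only blows up when the lookup actually runs.
val row: Map[String, Any] = Map("word" -> "Scala", "count" -> 2L)
val lookup = Try(row("not-word"))

// Typed: like typedDS.select(typedDS('word)) -- a typo such as
// wc.notWord simply does not compile, so it can never reach production.
case class WordCount(word: String, count: Long)
val wc = WordCount("Scala", 2L)
```

This is the "analysis errors at compile time" column from the comparison table, recovered for the Dataset API.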
Addressing the API shortcomings - BenFradet/struct-type-encoder
case class MyCaseClass(a: Int, b: String, c: Double)
val inferred = spark
.read
.json("/some/dir/*.json")
.as[MyCaseClass]
val derived = spark
.read
.schema(StructTypeEncoder[MyCaseClass].encode)
.json("/some/dir/*.json")
.as[MyCaseClass]
Plus support for metadata and deeply-nested schemas!
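struct-type-encoder derives the `StructType` from the case class at compile time (via shapeless), so Spark can skip the schema-inference pass over the JSON files. This runtime sketch only illustrates the idea, reading field names off the same case class with the Scala 2.13+ `Product` API:

```scala
case class MyCaseClass(a: Int, b: String, c: Double)

// The schema's column names, recovered from the case class itself
// rather than inferred from the data.
val fieldNames = MyCaseClass(1, "x", 2.0).productElementNames.toList
```

With the inferred variant, Spark must read the input once just to guess the schema; the derived variant hands it over up front.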
Running and configuring a Spark job on AWS EMR
--configurations '[{
  "Classification": "yarn-site",
  "Properties": {
    "yarn.nodemanager.vmem-check-enabled": "false",
    "yarn.nodemanager.resource.memory-mb": "117760",
    "yarn.scheduler.maximum-allocation-mb": "117760"
  }
},{
  "Classification": "spark",
  "Properties": { "maximizeResourceAllocation": "false" }
},{
  "Classification": "spark-defaults",
  "Properties": {
    "spark.dynamicAllocation.enabled": "false",
    "spark.executor.instances": "4",
    "spark.yarn.executor.memoryOverhead": "3072",
    "spark.executor.memory": "20G",
    "spark.executor.cores": "3",
    "spark.yarn.driver.memoryOverhead": "3072",
    "spark.driver.memory": "20G",
    "spark.driver.cores": "3",
    "spark.default.parallelism": "48"
  }
}]'
aws emr create-cluster \
  --name "Snowplow Enrich Job" \
  --release-label emr-5.12.0 \
  --applications Name=Spark \
  --instance-type r4.4xlarge \
  --instance-count 1 \
  --steps '[{
    "Name": "Snowplow Spark Enrich",
    "Args": [...],
    "Jar": "s3://bucket/my-jar.jar",
    "ActionOnFailure": "CONTINUE",
    "MainClass": "EnrichJob",
    "Type": "CUSTOM_JAR",
    "Properties": "string"
  }]'
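A back-of-the-envelope check that the configuration above actually fits the node; every number below is copied from the settings, nothing else is assumed:

```scala
val yarnMemoryMb     = 117760      // yarn.nodemanager.resource.memory-mb
val executorMemoryMb = 20 * 1024   // spark.executor.memory = 20G
val overheadMb       = 3072        // spark.yarn.executor.memoryOverhead
val executors        = 4           // spark.executor.instances
val cores            = 3           // spark.executor.cores

val perContainerMb = executorMemoryMb + overheadMb  // 23552 MB per container
val executorsMb    = executors * perContainerMb     // 94208 MB for executors

// The driver container (same 20G + 3072 overhead) takes the remainder,
// filling the 117760 MB YARN allocation exactly.
val withDriverMb = executorsMb + perContainerMb

// spark.default.parallelism = executors * cores * 4 tasks per core
val parallelism = executors * cores * 4
```

This is why `maximizeResourceAllocation` and dynamic allocation are both switched off: the containers are sized by hand to pack the single r4.4xlarge exactly.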
Spark config cheat sheet
Thanks!
GitHub:
- github.com/BenFradet
- github.com/snowplow/snowplow
Twitter:
- @fradetben
- @snowplowdata
Contact us:
sales@snowplowanalytics.com
snowplowanalytics.com
