A Fast Intro to Spark
And a glance at BEAM
Lightning fast cluster computing*
Who am I?
● So this is kind of a long shot, but American TV gets everywhere so….
● I’m not a doctor but I did stay at an IHG property last night
● Which is like a fancy version of Holiday Inn Express
○ I’m honestly not sure if this makes me more or less qualified
● And I did get my IHG points restored
● Ok but for real
Who am I?
● My name is Holden Karau
● Preferred pronouns are she/her
● Developer Advocate at Google focused on OSS Big Data
● Apache Spark PMC
● Contributor to a lot of other projects (including BEAM)
● previously IBM, Alpine, Databricks, Google, Foursquare & Amazon
● co-author of High Performance Spark & Learning Spark (+ more)
● Twitter: @holdenkarau
● Slideshare http://www.slideshare.net/hkarau
● Linkedin https://www.linkedin.com/in/holdenkarau
● Github https://github.com/holdenk
● Related Spark Videos http://bit.ly/holdenSparkVideos
A super fast introduction to Spark and glance at BEAM
Who do I think you all are?
● Nice people*
● Getting started with Spark or BEAM
○ Or wondering if you need it
● Familiar-ish with Scala or Java or Python
Amanda
What we are going to explore together!
● What is Spark?
● Getting Spark setup locally
● Spark’s primary distributed collection
● Word count in Spark
● Spark SQL / DataFrames
Then a glance at BEAM
● What is BEAM & what’s its current state
● Streaming wordcount because of course
Some things that may color my views:
● I’m on the Spark PMC -- Spark’s success => I can probably make more $s
● My employer cares about BEAM (and Spark and other things)
● I work primarily in Python & Scala these days
● I like functional programming
● Probably some others I’m forgetting
On the other hand:
● I’ve worked on Spark for a long time and know a lot of its faults
● My goals are pretty flexible
● I have x86 assembly code tattooed on my back
What is Spark?
● General purpose distributed system
○ With a really nice API including Python :)
● Apache project (one of the most active)
● Much faster than Hadoop Map/Reduce
● Good when too big for a single machine
● Built on top of two abstractions for distributed data: RDDs & Datasets
The different pieces of Spark
Diagram of the Spark stack: Apache Spark core; SQL, DataFrames & Datasets; Structured Streaming; Spark ML; MLLib; Streaming; Bagel & GraphX; GraphFrames; with Scala, Java, Python & R language bindings.
Paul Hudson
Why people come to Spark:
“Well this MapReduce job is going to take 16 hours - how long could it take to learn Spark?”
dougwoods
Why people come to Spark:
“My DataFrame won’t fit in memory on my cluster anymore, let alone my MacBook Pro :( Maybe this Spark business will solve that...”
brownpau
Companion (optional!) notebook funtimes:
http://bit.ly/sparkDocs
http://bit.ly/sparkPyDocs
http://bit.ly/PySparkIntroExamples (has a notebook!)
● Did you know? You can run Spark on Dataproc, thereby giving my employer money. You can also run it elsewhere.
http://bit.ly/learningSparkExamples (lots of code files)
http://bit.ly/hkMLExample (has a notebook, ML focused)
David DeHetre
SparkContext: entry to the world
● Can be used to create RDDs from many input sources
○ Native collections, local & remote FS
○ Any Hadoop Data Source
● Also create counters & accumulators
● Automatically created in the shells (called sc)
● Specify master & app name when creating
○ Master can be local[*], spark:// , yarn, etc.
○ app name should be human readable and make sense
● etc.
Petful
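To make the master & app name bullets concrete, a minimal sketch of creating your own SparkContext outside the shells (assumes a local PySpark install; the app name is just an illustration):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[*]").setAppName("fast-intro-demo")
sc = SparkContext(conf=conf)

rdd = sc.parallelize(range(10))  # RDD from a native collection
print(rdd.count())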
RDDs: Spark’s Primary abstraction
RDD (Resilient Distributed Dataset)
● Distributed collection
● Recomputed on node failure
● Distributes data & work across the cluster
● Lazily evaluated (transformations & actions)
Helen Olney
Word count (in python)
lines = sc.textFile(src)
words = lines.flatMap(lambda x: x.split(" "))
word_count = (words.map(lambda x: (x, 1))
              .reduceByKey(lambda x, y: x+y))
word_count.saveAsTextFile("output")
Photo By: Will Keightley
Why laziness is cool (and not)
● Pipelining (can put maps, filter, flatMap together)
● Can do interesting optimizations by delaying work
● We use the DAG to recompute on failure
○ (writing data out to 3 disks on different machines is so last season)
○ Or: the DAG puts the R in Resilient RDD, except DAG doesn’t have an R :(
How it hurts:
● Debugging is confusing
● Re-using data - laziness only sees up to the first action
● Some people really hate immutability
Matthew Hurst
Word count (in python)
lines = sc.textFile(src)
words = lines.flatMap(lambda x: x.split(" "))
word_count = (words.map(lambda x: (x, 1))
              .reduceByKey(lambda x, y: x+y))
word_count.saveAsTextFile("output")
No data is read or processed until after this last line: saveAsTextFile is an “action”, which forces Spark to evaluate the RDD.
daniilr
RDD re-use - sadly not magic
● If we know we are going to re-use the RDD what should we do?
○ If it fits nicely in memory: cache it in memory
○ Or persist it at another storage level (sketch below)
■ MEMORY_ONLY, MEMORY_ONLY_SER, MEMORY_AND_DISK, MEMORY_AND_DISK_SER
○ Or checkpoint it
● Noisy clusters
○ _2 & checkpointing can help
● persist first for checkpointing
Richard Gillin
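A minimal PySpark sketch of these re-use options (uses the running examples’ sc and src; the checkpoint directory is illustrative):

from pyspark import StorageLevel

words = sc.textFile(src).flatMap(lambda x: x.split(" "))

words.cache()  # same as persist(StorageLevel.MEMORY_ONLY)
# or trade memory for recompute / IO cost:
# words.persist(StorageLevel.MEMORY_AND_DISK_SER)
# replicated storage (the _2 levels) helps on noisy clusters:
# words.persist(StorageLevel.MEMORY_AND_DISK_2)

# Checkpointing truncates the lineage; persist first so the data isn't
# computed twice (once for the checkpoint, once for the next action):
sc.setCheckpointDir("/tmp/checkpoints")
words.persist(StorageLevel.MEMORY_AND_DISK)
words.checkpoint()
words.count()  # the first action materializes the cache & checkpoint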
Some common transformations & actions
Transformations (lazy)
● map
● filter
● flatMap
● reduceByKey
● join
● cogroup
Actions (eager)
● count
● reduce
● collect
● take
● saveAsTextFile
● saveAsHadoop
● countByValue
Photo by Steve
Photo by Dan G
This can feel like magic* sometimes :)
Steven Saus
*I mean not good magic.
Magic has its limits: key-skew + black boxes
● There is a worse way to do WordCount
● We can use the seemingly safe thing called groupByKey
● Then compute the sum...
_torne
Bad word count RDD :(
words = rdd.flatMap(lambda x: x.split(" "))
wordPairs = words.map(lambda w: (w, 1))
grouped = wordPairs.groupByKey()
counted_words = grouped.mapValues(lambda counts: sum(counts))
counted_words.saveAsTextFile("boop")
Tomomi
Ford Pinto by Morven
ayphen
Why should we consider Datasets?
● Performance
○ Smart optimizer
○ More efficient storage
○ Faster serialization
● Simplicity
○ Windowed operations
○ Multi-column & multi-type aggregates
Rikki's Refuge
Why are Datasets so awesome?
● Easier to mix functional style and relational style
○ No more hive UDFs!
● Nice performance of Spark SQL with the flexibility of RDDs
○ Tungsten (better serialization)
○ Equivalent of Sortable trait
● Strongly typed
● The future (ML, Graph, etc.)
● Potential for better language interop
○ Something like Arrow has a much better chance with Datasets
○ Cross-platform libraries are easier to make & use
Will Folsom
What is the performance like?
Andrew Skudder
How is it so fast?
● Optimizer has more information (schema & operations)
● More efficient storage formats
● Faster serialization
● Some operations directly on serialized data formats
● non-JVM languages: does more computation in the JVM
Andrew Skudder
Word count w/Dataframes
from pyspark.sql import Row

df = spark.read.load(src)
# Returns an RDD
words = df.select("text").rdd.flatMap(lambda x: x.text.split(" "))
words_df = words.map(lambda x: Row(word=x, cnt=1)).toDF()
word_count = words_df.groupBy("word").sum()
word_count.write.format("parquet").save("wc.parquet")
Still have the double serialization here :(
Word count w/Datasets
val df = spark.read.load(src).select("text")
val ds = df.as[String]
// Returns a Dataset!
val words = ds.flatMap(x => x.split(" "))
val grouped = words.groupBy("value")
val word_count = grouped.agg(count("*") as "count")
word_count.write.format("parquet").save("wc")
Can’t push down filters from here
If it’s a simple type we don’t have to define a case class
What can the optimizer do now?
● Sort on the serialized data
● Understand the aggregate (“partial aggregates”) - see the explain() sketch below
○ Could sort of do this before but not as awesomely, and only if we used reduceByKey - not groupByKey
● Pack them bits nice and tight
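If you want to see the partial aggregation yourself, a quick hedged sketch (words_df here refers to the earlier DataFrame word count example):

words_df.groupBy("word").count().explain()
# The physical plan should show a HashAggregate with partial_count,
# an Exchange (the shuffle), and then a final HashAggregate.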
So what’s this new groupBy?
● No longer causes explosions like RDD groupBy
○ Able to introspect and pipeline the aggregation
● Returns a GroupedData (or GroupedDataset)
● Makes it easy to perform multiple aggregations
● Built in shortcuts for aggregates like avg, min, max
● Longer list at http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$
● Allows the optimizer to see what aggregates are being performed
Sherrie Thai
Computing some aggregates by age code:
df.groupBy("age").min("hours-per-week")
OR
import org.apache.spark.sql.catalyst.expressions.aggregate._
df.groupBy("age").agg(min("hours-per-week"))
Easily compute multiple aggregates:
df.groupBy("age").agg(min("hours-per-week"),
  avg("hours-per-week"),
  max("capital-gain"))
PhotoAtelier
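The same aggregates translate directly to PySpark; a small sketch assuming a DataFrame df with the same columns as in the slides:

from pyspark.sql.functions import avg, max, min

df.groupBy("age").agg(
    min("hours-per-week"),
    avg("hours-per-week"),
    max("capital-gain")).show()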
Using Datasets to mix functional & relational style:
val ds: Dataset[RawPanda] = ...
val happiness = ds.toDF().filter($"happy" === true).as[RawPanda].
select($"attributes"(0).as[Double]).
reduce((x, y) => x + y)
So what was that?
ds.toDF().filter($"happy" === true).as[RawPanda].
select($"attributes"(0).as[Double]).
reduce((x, y) => x + y)
convert a Dataset to a DataFrame to access more DataFrame functions (pre-2.0)
Convert the DataFrame back to a Dataset
A typed query (specifies the return type)
Traditional functional reduction: arbitrary Scala code :)
And functional style maps:
/**
 * Functional map + Dataset, sums the positive attributes for the pandas
 */
def funMap(ds: Dataset[RawPanda]): Dataset[Double] = {
  ds.map{rp => rp.attributes.filter(_ > 0).sum}
}
Chris Isherwood
But where do DataFrames explode?
● Iterative algorithms - large plans
○ Use your escape hatch to RDDs!
● Some push downs are sad pandas :(
● Default shuffle size is sometimes too small for big data (200 partitions) - see the sketch below
● Default partition size when reading in is also sad
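A hedged sketch of the usual workarounds (the numbers are illustrative, not recommendations - tune for your data; src is the running examples’ input path):

spark.conf.set("spark.sql.shuffle.partitions", "2000")  # default is 200

df = spark.read.load(src)
df = df.repartition(400)  # fix sad input partitioning up front

# Escape hatch for iterative algorithms with huge plans: drop to RDDs and back
rdd = df.rdd
df2 = spark.createDataFrame(rdd, df.schema)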
Our ever growing ecosystem:
http://mattturck.com/bigdata2017/
General purpose eating the world
● Operations overhead
● Moving data from System 1 to System 2 (sqoop and friends)
● We still have specialized tools, but being built on top of general frameworks
○ e.g. see mahout on Spark
○ Less closely tied things like Hive/Pig on Spark
○ TF.Transform etc.
Photo by D Coetzee
Even then, lots of general purpose tools:
Resulting in:
flink/
h2o/
hdfs/
integration/
mr/
spark/
viennacl/
And language silos (Scala, Python, Go, etc.!)
Photo by: photobom Photo: Fritz Schuman (ScalaDays CPH)
And cloud silos….
Photo By: Zechariah Judy
And the proliferation of pagers :(
Photo by: Hades2k
Mike Knell
What’s the state of non-JVM big data?
Most of the tools are built in the JVM, so how do we play together?
● Pickling, Strings, JSON, XML, oh my!
● Unix pipes
● Sockets
What about if we don’t want to copy the data all the time?
● Or standalone “pure”* re-implementations of everything
○ Reasonable option for things like Kafka where you would have the I/O regardless.
○ Also cool projects like dask -- but hard to talk to existing ecosystem
David Brown
Spark in Scala, how does PySpark work?
● Py4J + pickling + JSON and magic
○ This can be kind of slow sometimes
● Distributed collections are often collections of pickled objects
● Spark SQL (and DataFrames) avoid some of this
○ Sometimes we can make them go fast and compile them to the JVM
● Features aren’t automatically exposed, but exposing them is normally simple.
● SparkR depends on similar magic
kristin klein
So what does that look like?
Diagram: the Driver talks over py4j; Worker 1 … Worker K each communicate over a pipe.
The “future”*: faster interchange
● By future I mean availability starting in the next 3-6 months (with more improvements after).
○ Yes much of this code exists, it just isn’t released yet so I’m sure we’ll find all sorts of bugs and ways to improve.
○ Relatedly, you can help us test Spark 2.3 when we start the RC process to catch bugs early!
● Unifying our cross-language experience
○ And not just “normal” languages, CUDA counts yo
Tambako The Jaguar
Andrew Skudder
*Arrow: likely the future. I really hope so. Spark 2.3 and beyond!
**With early work happening to support GPUs / TF.
BEAM backends:
● BEAM nominally supports*
○ Dataflow
○ Flink*
○ Spark*
○ IBM Streams, etc.
● Goal of more than just lowest-common-denominator, think of it like a compiler**
*Supports as in early-stage, but we’re working on it (and we’d love your help!)
**But you know, in the same sense I compare Spark Streams to pandas coming down a wooden slide.
BEAM Languages
● JVM: Scala, Java, etc.
● non-JVM: Python w/Go and more coming
BEAM Beyond the JVM
● This part doesn’t work outside of Google’s hosted environment yet, so I’m going to be light on the details
● tl;dr : uses grpc / protobuf
● But exciting new plans (w/ some early code) to unify the runners and ease the support of different languages (called SDKs)
○ See https://beam.apache.org/contribute/portability/
What do the different APIs look like?
● Everyone's favourite: Streaming Word Count Example
● And then windowed wordcount!
● (And also a peek at TensorFlow in case anyone is trying to raise a Series A)
Spark wordcount (Python*) - “pure” relational
# Split the lines into words
words = lines.select(
# explode turns each item in an array into a separate row
explode(
split(lines.value, ' ')
).alias('word')
)
# Generate running word count
wordCounts = words.groupBy('word').count()
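For context, a hedged sketch of where lines comes from and how the query actually runs (socket source & console sink chosen purely for illustration, following the Structured Streaming guide):

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("StreamingWordCount").getOrCreate()

# Unbounded DataFrame of lines arriving on a local socket
lines = spark.readStream.format("socket") \
    .option("host", "localhost").option("port", 9999).load()

words = lines.select(explode(split(lines.value, ' ')).alias('word'))
wordCounts = words.groupBy('word').count()

# Start the query, printing the complete counts table as it updates
query = wordCounts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()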
BEAM wordcount (Java)
p.apply("ExtractWords", ParDo.of(new DoFn<String, String>() {
  @ProcessElement
  public void processElement(ProcessContext c) {
    for (String word : c.element().split("\\W+")) {
      c.output(word);
    }}}))
 .apply(Count.<String>perElement())
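And since Python keeps coming up, a rough sketch of the same extract-and-count steps with the BEAM Python SDK (file paths are illustrative):

import re
import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | "Read" >> beam.io.ReadFromText("input.txt")
     | "ExtractWords" >> beam.FlatMap(lambda line: re.split(r"\W+", line))
     | "DropEmpty" >> beam.Filter(bool)
     | "Count" >> beam.combiners.Count.PerElement()
     | "Format" >> beam.Map(lambda kv: "%s: %d" % kv)
     | "Write" >> beam.io.WriteToText("counts"))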
What about windowed word count?
Trish Hamme
Christer van der Meeren
What else might happen?
● One execution engine becomes super amazing at everything
● Instead of a compiler-like unifier we see something like streaming SQL become our unifier
○ Relatedly BEAM, Spark & Flink, and Kafka all have streaming SQL implementations.
● People realize their big data problem is actually three small data problems in a trench coat
bnilsen
And some upcoming talks:
● Jan
○ If there’s interest tomorrow: Office Hours? Tweet me @holdenkarau
○ LinuxConf AU - next week
○ Sydney Spark meetup 23rd
○ Data Day Texas - Nate will be there too!
● Feb
○ FOSDEM - One on testing one on scaling
○ JFokus in Stockholm - Adding deep learning to Spark
○ I disappear for a week and pretend computers work
● March
○ Strata San Jose - Big Data Beyond the JVM
● Learning Spark
● Fast Data Processing with Spark (Out of Date)
● Fast Data Processing with Spark (2nd edition)
● Advanced Analytics with Spark
● Coming soon: Spark in Action
● High Performance Spark
● Coming soon: Learning PySpark
High Performance Spark!
Available today!
You can buy it from that scrappy Seattle bookstore, Jeff Bezos needs another newspaper and I want a cup of coffee.
http://bit.ly/hkHighPerfSpark
Cat wave photo by Quinn Dombrowski
k thnx bye!
If you <3 testing & want to fill out the survey: http://bit.ly/holdenTestingSpark
Want to tell me (and or my boss) how I’m doing? http://bit.ly/holdenTalkFeedback
Want to e-mail me? Promise not to be creepy? Ok: holden@pigscanfly.ca