Apache Spark as a gateway
drug to functional programming
Concepts taught & broken
@holdenkarau
Holden:
● My name is Holden Karau
● Preferred pronouns are she/her
● Developer Advocate at Google
● Apache Spark PMC, Beam contributor
● previously IBM, Alpine, Databricks, Google, Foursquare & Amazon
● co-author of Learning Spark & High Performance Spark
● Twitter: @holdenkarau
● Slide share http://www.slideshare.net/hkarau
● Code review livestreams: https://www.twitch.tv/holdenkarau /
https://www.youtube.com/user/holdenkarau
● Spark Talk Videos http://bit.ly/holdenSparkVideos
Apache spark as a gateway drug to FP concepts taught and broken - Curry On 2018
Who is Boo?
● Boo uses she/her pronouns (as I told the Texas house committee)
● Best doge
● Lots of experience barking at computers to make them go faster
● Author of “Learning to Bark” & “High Performance Barking”
○ Currently out of print, discussing a reprint re-run with my wife
● On twitter @BooProgrammer
Why Google Cloud cares about Spark?
● Lots of data!
○ We mostly use different, although similar FP inspired, tools internally
● We have two hosted solutions for using Spark (dataproc & GKE)
○ I have a blog post on how to try out custom/new versions of Spark if you want to help us test the current RC - https://cloud.google.com/blog/big-data/2018/03/testing-future-apache-spark-releases-and-changes-on-google-kubernetes-engine-and-cloud-dataproc
Who do I think y’all are?
● Friendly[ish] people
● Don’t mind pictures of cats or stuffed animals
● Like functional programming
● Want to keep growing the functional programming community
Lori Erickson
What will be covered?
● What is Spark (super brief) & how it’s helped drive FP to enterprise
● Wordcount example (as required by license)
● What concepts Spark does a good job of teaching folks
● What concepts Spark does a so-so job of teaching folks
● Some examples
● Format is going to be happy-sad (repeating) so we can end on a downer
● But with lots of cat pictures I promise
What is Spark?
● General purpose distributed system
○ Built in Scala with an FP inspired API
● Apache project (one of the most active)
● Much faster than Hadoop Map/Reduce
● Good when too big for a single machine
● Built on top of two abstractions for distributed data: RDDs & Datasets
When we say distributed we mean...
Why people come to Spark:
“Well this MapReduce job is going to take 16 hours - how long could it take to learn Spark?”
dougwoods
Why people come to Spark:
“My DataFrame won’t fit in memory on my cluster anymore, let alone my MacBook Pro :( Maybe this Spark business will solve that...”
brownpau
Plus a little magic :)
Steven Saus
What is the “magic” of Spark?
● Automatically distributed functional programming :)
● DAG / “query plan” is the root of much of it
● Optimizer to combine steps
● Resiliency: recover from failures rather than protecting
from failures.
● “In-memory” + “spill-to-disk”
● Functional programming to build the DAG for “free”
● Select operations without deserialization
● The best way to trick people into learning functional
programming
Richard Gillin
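The “functional programming to build the DAG for free” point can be sketched in plain Python. This is a toy, not Spark’s actual API or internals: `map` and `filter` only record a plan, and nothing executes until an action walks it.

```python
class ToyRDD:
    """Toy stand-in for an RDD (illustration only, not Spark's API):
    transformations record a plan; nothing runs until an action."""

    def __init__(self, data, plan=()):
        self.data = data   # immutable source
        self.plan = plan   # recorded transformations: the "DAG"

    def map(self, f):
        # Returns a new ToyRDD with the step appended; no work happens here.
        return ToyRDD(self.data, self.plan + (("map", f),))

    def filter(self, p):
        return ToyRDD(self.data, self.plan + (("filter", p),))

    def collect(self):
        # Only the action walks the plan; a lost partition could be
        # recomputed the same way from the immutable source.
        out = list(self.data)
        for kind, f in self.plan:
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        return out

evens_doubled = ToyRDD(range(10)).filter(lambda x: x % 2 == 0).map(lambda x: x * 2)
print(evens_doubled.collect())  # [0, 4, 8, 12, 16]
```

Because each transformation returns a new value instead of mutating anything, the whole plan is data the scheduler can inspect, optimize, and replay on failure.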
The different pieces of Spark
(diagram) Apache Spark core; SQL, DataFrames & Datasets; Structured Streaming (Scala, Java, Python & R); Spark ML; Bagel & GraphX; MLLib; Streaming (Scala, Java, Python); GraphFrames
Paul Hudson
What Spark got right (for Scala/FP):
● Strong enforced[ish] requirement for immutable data
○ Uses recompute on failure, so immutability is a core part of the logic
● Functional operators (map, filter, flatMap, etc.)
● Lambdas for everyone!
○ Sometimes too many….
● Solved a “business need”
○ Even if that need was imaginary
● Made it hard to have side effects against external variables without being very
explicit & verbose
○ Even then discouraged strongly through lack of documentation :p
Stuart
What Spark got … less right (for Scala/FP):
● Serialization… complications
○ Makes people think closures are more limited than they can be
● Lots of Map[String, String] (equivalent) settings
○ Hey buddy can you spare a type checker?
● Hard to debug, could be confused with Scala hard to debug
○ Not completely unjustified sometimes
● New ML & SQL APIs without “any” types (initially)
● Heavy mutation focused API for Machine Learning
● Internals are…. very not functional programming best practices
indamage
Before the technical details: positioning
● “100x faster than hadoop” sounds nice
● wc(wc) < 17 also sounds nice if your current wordcount doesn’t fit on a page
● “Integrated solution” - sounds nice if you have to write your data to disk
between querying and training
● “Works with your existing JVM or Python code base.”
● Part of the Apache Software Foundation
○ e.g. you already run software from these folks, don’t worry about lock-in they’ve got it handled
● Like all positioning, each of these has some implied “*”s associated with them
● Focused on the benefits made possible with FP rather than the fact it was FP
● Led to a large commercial install base of folks being exposed to functional
programming
Hello World (Word count) - 1 of ?
val lines = sc.textFile("boop")
val words = lines.flatMap(line => line.split(" "))
val wordPairs = words.map(word => (word, 1))
val wordCounts = wordPairs.reduceByKey((c1, c2) => c1 + c2)
wordCounts.saveAsTextFile("snoop")
Photo By: Will
Keightley
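The same flatMap → map → reduceByKey shape runs locally in plain Python (a sketch for comparison, not Spark; `reduce_by_key` here is a hand-rolled stand-in for the RDD method):

```python
lines = ["boop the snoot", "boop boop"]

# flatMap: split each line into words, flattening the result
words = [w for line in lines for w in line.split(" ")]

# map: pair each word with a count of one
word_pairs = [(w, 1) for w in words]

# reduceByKey: fold the counts together per key
def reduce_by_key(pairs, f):
    acc = {}
    for k, v in pairs:
        acc[k] = f(acc[k], v) if k in acc else v
    return acc

word_counts = reduce_by_key(word_pairs, lambda c1, c2: c1 + c2)
print(word_counts)  # {'boop': 3, 'the': 1, 'snoot': 1}
```

Spark’s version is the same pipeline of pure functions, just partitioned across machines with a shuffle where the dict build happens here.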
Spark & Lazy Evaluation: True BFFs
● Allows Spark to build a graph of operations and use it for recomputing on
failure
● Fewer passes over the data without thinking*
● Except…. Debates around limiting it due to developer confusion with
debugging
● And library level rather than macro or compiler level means needing to…
kcxd
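The “fewer passes” win is the same one Python generators give: chained lazy stages stream each element through the whole pipeline once, instead of materializing one full intermediate per stage. A small instrumented sketch (the `tag` helper is made up for illustration):

```python
passes = []  # records which stage touched which value, in order

def tag(stage, it):
    """Wrap an iterator so we can see when each stage actually runs."""
    for x in it:
        passes.append((stage, x))
        yield x

data = range(3)
pipeline = tag("doubled", (x * 2 for x in tag("source", data)))

# Nothing has run yet: generators, like Spark transformations, are lazy.
assert passes == []

result = list(pipeline)  # the "action" pulls data through in one pass
assert result == [0, 2, 4]

# Stages interleave per element rather than doing one full pass per stage:
assert passes == [("source", 0), ("doubled", 0),
                  ("source", 1), ("doubled", 2),
                  ("source", 2), ("doubled", 4)]
```

Spark gets this fusion from its plan; generators get it from pull-based iteration. Either way, laziness buys the single pass “for free”.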
Spark & Immutability: Really good friends
● Many systems enforce some level of immutability but...
● Scala allows mutability (vars & friends), but Spark gets not so happy
● Execution model is recompute on failure, and mutation would break this
● Provides a clear benefit (e.g. “Instead of writing data out to 3 network disks
we just recompute on failure leading to 100x improvement”*)
Bartlomiej Mostek
Even if you try to use it, no mutation for you*
scala> var x = 0
x: Int = 0
scala> val rdd = sc.parallelize(1.to(100))
scala> val added = rdd.map{e => x += e}
scala> added.count()
res0: Long = 100
scala> x
res1: Int = 0
Jennifer C.
Well…. only local mutation
scala> val result = rdd.map{e => x += e; x}
scala> result.collect()
res1: Array[Int] = Array(1, 3, 6, 10, 15, 21, 28, 36, 45, 55, 66, 78, 91, 105, 120,
136, 153, 171, 190, 210, 231, 253, 276, 300, 325, 26, 53, 81, 110, 140, 171, 203,
236, 270, 305, 341, 378, 416, 455, 495, 536, 578, 621, 665, 710, 756, 803, 851,
900, 950, 51, 103, 156, 210, 265, 321, 378, 436, 495, 555, 616, 678, 741, 805,
870, 936, 1003, 1071, 1140, 1210, 1281, 1353, 1426, 1500, 1575, 76, 153, 231,
310, 390, 471, 553, 636, 720, 805, 891, 978, 1066, 1155, 1245, 1336, 1428, 1521,
1615, 1710, 1806, 1903, 2001, 2100, 2200)
Raita Futo
Well…. only very local mutation
scala> val rdd = sc.parallelize(1.to(100), 10)
scala> val result = rdd.map{e => x += e; x}
scala> result.collect()
res2: Array[Int] = Array(1, 3, 6, 10, 15, 21, 28, 36, 45, 55, 11, 23, 36, 50, 65, 81,
98, 116, 135, 155, 21, 43, 66, 90, 115, 141, 168, 196, 225, 255, 31, 63, 96, 130,
165, 201, 238, 276, 315, 355, 41, 83, 126, 170, 215, 261, 308, 356, 405, 455, 51,
103, 156, 210, 265, 321, 378, 436, 495, 555, 61, 123, 186, 250, 315, 381, 448,
516, 585, 655, 71, 143, 216, 290, 365, 441, 518, 596, 675, 755, 81, 163, 246, 330,
415, 501, 588, 676, 765, 855, 91, 183, 276, 370, 465, 561, 658, 756, 855, 955)
Susanne Nilsson
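The output above can be reproduced locally (a sketch, not Spark): each of the 10 partitions gets its own copy of `x`, so the running sum restarts at every partition boundary, which is exactly a per-partition prefix sum.

```python
from itertools import accumulate

data = list(range(1, 101))
num_partitions = 10
size = len(data) // num_partitions
partitions = [data[i * size:(i + 1) * size] for i in range(num_partitions)]

# Each "task" mutates its own copy of x, so the sum restarts per chunk,
# matching the Array(...) on the slide: 1, 3, 6, ... 55, then 11, 23, ...
result = [s for part in partitions for s in accumulate(part)]
print(result[:12])  # [1, 3, 6, 10, 15, 21, 28, 36, 45, 55, 11, 23]
```

With one partition you get the slide-22 pattern instead (one long running sum per core’s chunk); the mutation is real, it just never leaves the task.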
Spark & Immutability: A few small challenges
● Accumulators:
○ Accumulators “leak” on recompute and partial compute (another Spark trick)
● You _can_ mutate state inside of the worker or master; it just won’t propagate
○ No error messages here, just unexpected future “happiness”
Caroline
Spark & Lambdas: Everyone gets a lambda!
● Maps, flatMaps, filters, reducers oh my (and that’s just wordcount :p )
● Writing in Java 7?
○ You get a weird looking lambda!
Mark Jensen
Spark & Lambdas - closures & the dark side
● Python & Scala lambda serialization is (understandably) cautious
● Referencing a variable in the class brings the whole class with you
● Can create the impression closures are limited to serializable data
● ClosureCleaner.scala - “You weren’t using that class right?” & CloudPickle -
“Welllll…. What about if we stored this differently?”
● pre-Java 8 (new FunctionX<A, B, C>... ugh)
● SQL API custom aggregates don’t currently support lambdas :(
David Goehring
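The “brings the whole class with you” problem shows up with plain `pickle` too (a rough local analogue of what ClosureCleaner and cloudpickle fight; the `Processor` class is made up for illustration):

```python
import pickle
import operator
from functools import partial

class Processor:
    def __init__(self):
        self.multiplier = 2
        self.big_state = list(range(100_000))  # unrelated heavy field

    def scale(self, x):
        return x * self.multiplier

p = Processor()

# Shipping a bound method serializes the whole instance, big_state and all.
heavy = pickle.dumps(p.scale)

# The usual fix on the Spark side: copy the field into a local first,
# so only the value you actually use goes over the wire.
m = p.multiplier
light = pickle.dumps(partial(operator.mul, m))

print(len(heavy) > 100 * len(light))  # the closure dragged big_state along
```

This is why referencing `this.someField` in a Spark closure can blow up with a NotSerializableException (or just ship megabytes you never needed), while assigning the field to a local `val` first does not.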
Oh right: serialization :(
● Many things in JVM & Python are not well designed to be rehydrated on
another VM let alone another machine
● Space efficiency: well…. At least it’s not XML?
● Oh wait, we have to parse Python serialized data in the JVM? Ruh roh!
● Oh hey let’s make it configurable… oh wait...
Reaching non-“traditional” FP languages
● Spark works in many languages - Python, R, Java, etc.
○ Allows Spark to have a wider audience that doesn’t have the time (or tooling) to learn another
language
○ e.g. Numerical Scala libraries are… rough
● Meeting developers where they are
● Some overhead (often) pushes systemy folks towards learning Scala for
performance
● Sometimes we could do a better job of working with the existing FP tools
○ e.g. we could do more to make an RDD look like “normal” Python and maybe I’d stop typing
.flatMap on Python iterators
Reaching non-“traditional” FP languages
from operator import add

lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])
counts = lines.flatMap(lambda x: x.split(' ')) \
    .map(lambda x: (x, 1)) \
    .reduceByKey(add)
output = counts.collect()
for (word, count) in output:
    print("%s: %i" % (word, count))
photobom
Reaching non-“traditional” FP languages
Petful
ML: Enough getters and setters for the 90s!
● Took inspiration from scikit-learn
○ Which is a cool system, just not super functionally oriented
● Added hidden metadata, which got dropped in lots of places (not quite as bad
as global state buuuut….)
● Threw away compile time type information
● I really don’t have a + for this one in the FP teaching column
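The getter/setter complaint in one contrast (a toy sketch, not Spark ML’s API; both classes are hypothetical): the mutating style hides state changes, while an immutable “copy-with” style makes every configuration step a value you can share and reuse.

```python
from dataclasses import dataclass, replace

# Mutation-heavy style, roughly what the ML setters feel like:
class IndexerMut:
    def set_input_col(self, col):
        self.input_col = col   # mutates in place, returns self for chaining
        return self

# Immutable style: every "setter" returns a fresh value, old one untouched.
@dataclass(frozen=True)
class Indexer:
    input_col: str = ""
    output_col: str = ""

base = Indexer(input_col="category")
derived = replace(base, output_col="category-index")

assert base.output_col == ""            # original config unchanged
assert derived.input_col == "category"  # new value carries both settings
```

With the frozen version, two pipelines can safely share `base`; with the setter version, one pipeline tweaking a shared stage silently reconfigures the other.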
Basic Dataprep pipeline for “ML”
// Combines a list of double input features into a vector
val assembler = new VectorAssembler().setInputCols(Array("age",
"education-num")).setOutputCol("features")
// String indexer converts a set of strings into doubles
val indexer = new StringIndexer().setInputCol("category")
.setOutputCol("category-index")
// Can be used to combine pipeline components together
val pipeline = new Pipeline().setStages(Array(assembler, indexer))
Huang
Yun
Chung
Adding some ML (no longer cool -- DL)
// Specify model
val dt = new DecisionTreeClassifier()
.setLabelCol("category-index")
.setFeaturesCol("features")
// Add it to the pipeline
val pipeline_and_model = new Pipeline().setStages(
Array(assembler, indexer, dt))
val pipeline_model = pipeline_and_model.fit(df)
The internals… are fairly imperative
● Not everywhere and not without reason (at the time they were created)
● No macros for historic performance reasons
○ Code generation is done by concatenating strings of Java code
● This would matter less, but Spark is pretty aggressive at trying to hide things,
enough that some major projects end up having to peek a little bit at the code
● Can feel like “do what I say not what I do” sometimes
○ Especially in ML, lots of hacks to make things go a little bit faster using internal APIs
ivva
And folks need to access them :(
ivva
Let’s end on a happy-ish note
● Yes Spark isn’t perfect, but its APIs are teaching a generation of “big data”
developers & data scientists functional programming without calling it that (at
the start)
● Once we get them hooked we can show them cool things!
● If we can find other areas where we make tools that expose FP apis to more
non-FP folks (and not just for the sake of FP) we can make better software
● Or we can all go back to writing 90s style enterprise Java
Learning Spark
Fast Data Processing with Spark (Out of Date)
Fast Data Processing with Spark (2nd edition)
Advanced Analytics with Spark
Spark in Action
High Performance Spark
Learning PySpark
High Performance Spark!
Available today! A great second Spark book to read (but
please buy it first)
You can buy it from that scrappy Seattle bookstore, Jeff
Bezos needs another newspaper and I want a cup of
coffee.
http://bit.ly/hkHighPerfSpark
And some upcoming talks:
● July
○ OSCON Portland & meetup
● August
○ JupyterCon NYC
● September
○ Strata NYC
○ Strangeloop STL
● October
○ Spark Summit London
○ Reversim Tel Aviv
k thnx bye :)
If you care about Spark testing and
don’t hate surveys:
http://bit.ly/holdenTestingSpark
Will tweet results
“eventually” @holdenkarau
Pssst: Have feedback on the presentation? Give me a shout
(holden@pigscanfly.ca or http://bit.ly/holdenTalkFeedback ) if
you feel comfortable doing so :)
Feedback (if you are so inclined):
http://bit.ly/holdenTalkFeedback

More Related Content

PPTX
Powering Tensorflow with big data using Apache Beam, Flink, and Spark - OSCON...
PPTX
Simplifying training deep and serving learning models with big data in python...
PDF
Powering tensorflow with big data (apache spark, flink, and beam) dataworks...
PDF
Making the big data ecosystem work together with Python & Apache Arrow, Apach...
PDF
Accelerating Big Data beyond the JVM - Fosdem 2018
PDF
Sharing (or stealing) the jewels of python with big data &amp; the jvm (1)
PDF
Making the big data ecosystem work together with python apache arrow, spark,...
PDF
Big Data Beyond the JVM - Strata San Jose 2018
Powering Tensorflow with big data using Apache Beam, Flink, and Spark - OSCON...
Simplifying training deep and serving learning models with big data in python...
Powering tensorflow with big data (apache spark, flink, and beam) dataworks...
Making the big data ecosystem work together with Python & Apache Arrow, Apach...
Accelerating Big Data beyond the JVM - Fosdem 2018
Sharing (or stealing) the jewels of python with big data &amp; the jvm (1)
Making the big data ecosystem work together with python apache arrow, spark,...
Big Data Beyond the JVM - Strata San Jose 2018

What's hot (20)

PDF
Powering tensor flow with big data using apache beam, flink, and spark cern...
PDF
Intro - End to end ML with Kubeflow @ SignalConf 2018
PDF
Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...
PDF
Debugging PySpark - PyCon US 2018
PDF
Big data beyond the JVM - DDTX 2018
PDF
Using Spark ML on Spark Errors - What do the clusters tell us?
PDF
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
PDF
MobileConf 2021 Slides: Let's build macOS CLI Utilities using Swift
PDF
Functional Programming for Busy Object Oriented Programmers
PDF
A peek into Python's Metaclass and Bytecode from a Smalltalk User
PPTX
Go fundamentals
PDF
Building a Pipeline for State-of-the-Art Natural Language Processing Using Hu...
PPTX
Puppetizing Your Organization
PDF
Guglielmo iozzia - Google I/O extended dublin 2018
PDF
Devel::NYTProf v3 - 200908 (OUTDATED, see 201008)
PDF
Designing and coding for cloud-native applications using Python, Harjinder Mi...
PDF
Julia language: inside the corporation
PPTX
From Python to smartphones: neural nets @ Saint-Gobain, François Sausset
PPTX
Iron Sprog Tech Talk
PDF
Golang 101
Powering tensor flow with big data using apache beam, flink, and spark cern...
Intro - End to end ML with Kubeflow @ SignalConf 2018
Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...
Debugging PySpark - PyCon US 2018
Big data beyond the JVM - DDTX 2018
Using Spark ML on Spark Errors - What do the clusters tell us?
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + ...
MobileConf 2021 Slides: Let's build macOS CLI Utilities using Swift
Functional Programming for Busy Object Oriented Programmers
A peek into Python's Metaclass and Bytecode from a Smalltalk User
Go fundamentals
Building a Pipeline for State-of-the-Art Natural Language Processing Using Hu...
Puppetizing Your Organization
Guglielmo iozzia - Google I/O extended dublin 2018
Devel::NYTProf v3 - 200908 (OUTDATED, see 201008)
Designing and coding for cloud-native applications using Python, Harjinder Mi...
Julia language: inside the corporation
From Python to smartphones: neural nets @ Saint-Gobain, François Sausset
Iron Sprog Tech Talk
Golang 101
Ad

Similar to Apache spark as a gateway drug to FP concepts taught and broken - Curry On 2018 (20)

PDF
Keeping the fun in functional w/ Apache Spark @ Scala Days NYC
PDF
A fast introduction to PySpark with a quick look at Arrow based UDFs
PDF
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
PDF
A super fast introduction to Spark and glance at BEAM
PDF
An introduction into Spark ML plus how to go beyond when you get stuck
PDF
Are general purpose big data systems eating the world?
PDF
Beyond Wordcount with spark datasets (and scalaing) - Nide PDX Jan 2018
PDF
Getting started with Apache Spark in Python - PyLadies Toronto 2016
PDF
Improving PySpark performance: Spark Performance Beyond the JVM
PDF
Introduction to Spark ML Pipelines Workshop
PDF
Apache Spark Super Happy Funtimes - CHUG 2016
PDF
Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...
PDF
Introduction to Spark Datasets - Functional and relational together at last
PDF
Testing and validating spark programs - Strata SJ 2016
PDF
Debugging PySpark: Spark Summit East talk by Holden Karau
PDF
Debugging PySpark - Spark Summit East 2017
PDF
The magic of (data parallel) distributed systems and where it all breaks - Re...
PDF
Debugging Apache Spark - Scala & Python super happy fun times 2017
PDF
Beyond Shuffling and Streaming Preview - Salt Lake City Spark Meetup
PDF
Apache Spark: The Analytics Operating System
Keeping the fun in functional w/ Apache Spark @ Scala Days NYC
A fast introduction to PySpark with a quick look at Arrow based UDFs
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
A super fast introduction to Spark and glance at BEAM
An introduction into Spark ML plus how to go beyond when you get stuck
Are general purpose big data systems eating the world?
Beyond Wordcount with spark datasets (and scalaing) - Nide PDX Jan 2018
Getting started with Apache Spark in Python - PyLadies Toronto 2016
Improving PySpark performance: Spark Performance Beyond the JVM
Introduction to Spark ML Pipelines Workshop
Apache Spark Super Happy Funtimes - CHUG 2016
Kafka Summit SF 2017 - Streaming Processing in Python – 10 ways to avoid summ...
Introduction to Spark Datasets - Functional and relational together at last
Testing and validating spark programs - Strata SJ 2016
Debugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark - Spark Summit East 2017
The magic of (data parallel) distributed systems and where it all breaks - Re...
Debugging Apache Spark - Scala & Python super happy fun times 2017
Beyond Shuffling and Streaming Preview - Salt Lake City Spark Meetup
Apache Spark: The Analytics Operating System
Ad

Recently uploaded (20)

PDF
RPKI Status Update, presented by Makito Lay at IDNOG 10
PPTX
introduction about ICD -10 & ICD-11 ppt.pptx
PDF
The New Creative Director: How AI Tools for Social Media Content Creation Are...
PPTX
PptxGenJS_Demo_Chart_20250317130215833.pptx
PDF
APNIC Update, presented at PHNOG 2025 by Shane Hermoso
PPTX
CHE NAA, , b,mn,mblblblbljb jb jlb ,j , ,C PPT.pptx
PPTX
presentation_pfe-universite-molay-seltan.pptx
PPT
Design_with_Watersergyerge45hrbgre4top (1).ppt
PPTX
Introduction about ICD -10 and ICD11 on 5.8.25.pptx
PPTX
Digital Literacy And Online Safety on internet
PPTX
Module 1 - Cyber Law and Ethics 101.pptx
PPTX
June-4-Sermon-Powerpoint.pptx USE THIS FOR YOUR MOTIVATION
PPTX
Job_Card_System_Styled_lorem_ipsum_.pptx
PDF
The Internet -By the Numbers, Sri Lanka Edition
PPTX
522797556-Unit-2-Temperature-measurement-1-1.pptx
PPTX
Introduction to Information and Communication Technology
PDF
How to Ensure Data Integrity During Shopify Migration_ Best Practices for Sec...
PDF
Sims 4 Historia para lo sims 4 para jugar
PDF
Slides PDF The World Game (s) Eco Economic Epochs.pdf
PPTX
Internet___Basics___Styled_ presentation
RPKI Status Update, presented by Makito Lay at IDNOG 10
introduction about ICD -10 & ICD-11 ppt.pptx
The New Creative Director: How AI Tools for Social Media Content Creation Are...
PptxGenJS_Demo_Chart_20250317130215833.pptx
APNIC Update, presented at PHNOG 2025 by Shane Hermoso
CHE NAA, , b,mn,mblblblbljb jb jlb ,j , ,C PPT.pptx
presentation_pfe-universite-molay-seltan.pptx
Design_with_Watersergyerge45hrbgre4top (1).ppt
Introduction about ICD -10 and ICD11 on 5.8.25.pptx
Digital Literacy And Online Safety on internet
Module 1 - Cyber Law and Ethics 101.pptx
June-4-Sermon-Powerpoint.pptx USE THIS FOR YOUR MOTIVATION
Job_Card_System_Styled_lorem_ipsum_.pptx
The Internet -By the Numbers, Sri Lanka Edition
522797556-Unit-2-Temperature-measurement-1-1.pptx
Introduction to Information and Communication Technology
How to Ensure Data Integrity During Shopify Migration_ Best Practices for Sec...
Sims 4 Historia para lo sims 4 para jugar
Slides PDF The World Game (s) Eco Economic Epochs.pdf
Internet___Basics___Styled_ presentation

Apache spark as a gateway drug to FP concepts taught and broken - Curry On 2018

  • 1. Apache Spark as a gateway drug to functional programming Concepts taught & broken @holdenkarau
  • 2. Holden: ● My name is Holden Karau ● Prefered pronouns are she/her ● Developer Advocate at Google ● Apache Spark PMC, Beam contributor ● previously IBM, Alpine, Databricks, Google, Foursquare & Amazon ● co-author of Learning Spark & High Performance Spark ● Twitter: @holdenkarau ● Slide share http://www.slideshare.net/hkarau ● Code review livestreams: https://www.twitch.tv/holdenkarau / https://www.youtube.com/user/holdenkarau ● Spark Talk Videos http://bit.ly/holdenSparkVideos
  • 4. Who is Boo? ● Boo uses she/her pronouns (as I told the Texas house committee) ● Best doge ● Lot’s of experience barking at computers to make them go faster ● Author of “Learning to Bark” & “High Performance Barking” ○ Currently out of print, discussing a reprint re-run with my wife ● On twitter @BooProgrammer
  • 5. Why Google Cloud cares about Spark? ● Lots of data! ○ We mostly use different, although similar FP inspired, tools internally ● We have two hosted solutions for using Spark (dataproc & GKE) ○ I have a blog post on how to try out custom/new versions of Spark if you want to help us test the current RC - https://cloud.google.com/blog/big-data/2018/03/testing-future-apache-spark- releases-and-changes-on-google-kubernetes-engine-and-cloud-dataproc
  • 6. Who do I think y’all are? ● Friendly[ish] people ● Don’t mind pictures of cats or stuffed animals ● Like functional programming ● Want to keep growing the functional programming community Lori Erickson
  • 7. What will be covered? ● What is Spark (super brief) & how it’s helped drive FP to enterprise ● Wordcount example (as required by license) ● What concepts Spark does a good of teaching folks ● What concepts Spark does a so-so job of teaching folks ● Some examples ● Format is going to be happy-sad (repeating) so we can end on a downer ● But with lots of cat pictures I promise
  • 8. What is Spark? ● General purpose distributed system ○ Built in Scala with an FP inspired API ● Apache project (one of the most active) ● Must faster than Hadoop Map/Reduce ● Good when too big for a single machine ● Built on top of two abstractions for distributed data: RDDs & Datasets
  • 9. When we say distributed we mean...
  • 10. Why people come to Spark: Well this MapReduce job is going to take 16 hours - how long could it take to learn Spark? dougwoods
  • 11. Why people come to Spark: My DataFrame won’t fit in memory on my cluster anymore, let alone my MacBook Pro :( Maybe this Spark business will solve that... brownpau
  • 12. Plus a little magic :) Steven Saus
  • 13. What is the “magic” of Spark? ● Automatically distributed functional programming :) ● DAG / “query plan” is the root of much of it ● Optimizer to combine steps ● Resiliency: recover from failures rather than protecting from failures. ● “In-memory” + “spill-to-disk” ● Functional programming to build the DAG for “free” ● Select operations without deserialization ● The best way to trick people into learning functional programming Richard Gillin
  • 14. The different pieces of Spark Apache Spark SQL, DataFrames & Datasets Structured Streaming Scala, Java, Python, & R Spark ML bagel & Graph X MLLib Scala, Java, PythonStreaming Graph Frames Paul Hudson
  • 15. What Spark got right (for Scala/FP): ● Strong enforced[ish] requirement for immutable data ○ Use recompute for failure so a core part of the logic ● Functional operators (map, filter, flatMap, etc.) ● Lambdas for everyone! ○ Sometime too many…. ● Solved a “business need” ○ Even if that need was imaginary ● Made it hard to have side effects against external variables without being very explicit & verbose ○ Even then discouraged strongly through lack of documentation :p Stuart
  • 16. What Spark got … less right (for Scala/FP): ● Serialization… complications ○ Makes people think closures are more limited than they can be ● Lots of Map[String, String] (equivalent) settings ○ Hey buddy can you spare a type checker? ● Hard to debug, could be confused with Scala hard to debug ○ Not completely unjustified sometimes ● New ML & SQL APIs without “any” types (initially) ● Heavy mutation focused API for Machine Learning ● Internals are…. very not functional programming best practices indamage
  • 17. Before the technical details: positioning ● “100x faster than hadoop” sounds nice ● wc(wc) < 17 also sounds nice if you’re current wordcount doesn’t fit on a page ● “Integrated solution” - sounds nice if you have to write your data to disk between querying and training ● “Works with your existing JVM or Python code base.” ● Part of the Apache Software Foundation ○ e.g. you already run software from these folks, don’t worry about lock-in they’ve got it handled ● Like all positioning each of these have some implied “*”s associated with them ● Focused on the benefits made possible with FP rather than the fact it was FP ● Lead to large commercial install base of folks being exposed to functional programming
  • 18. Hello World (Word count) - 1 of ? val lines = sc.textFile("boop") val words = lines.flatMap(line => line.split(" ")) val wordPairs = words.map(word => (word, 1)) val wordCounts = wordPairs.reduceByKey(c1, c2: c1 + c2) wordCounts.saveAsTextFile("snoop") Photo By: Will Keightley
  • 19. Spark & Lazy Evaluation: True BFFs ● Allows Spark to build a graph of operations and use it for recomputing on failure ● Less passes over the data without thinking* ● Except…. Debates around limiting it due to developer confusion with debugging ● And library level rather than macro or compiler level means needing to kcxd
  • 20. Spark & Immutability: Really good friends ● Many systems enforce some level of immutability but... ● Scala allows mutability (vars & friends), but Spark get’s not so happy ● Execution model is recompute on failure, and mutation would break this ● Provides a clear benefit (e.g. “Instead of writing data out to 3 network disks we just recompute on failure leading to 100x improvement”*) Bartlomiej Mostek
  • 21. Even if you try to use, no mutation for you* scala> var x = 0 x: Int = 0 scala> val rdd = sc.parallelize(1.to(100)) scala> val added = rdd.map{e => x += e} scala> added.count() res0: Long = 100 scala> x res1: Int = 0 PROJennifer C.
  • 22. Well…. only local mutation scala> val result = rdd.map{e => x += e; x} scala> result.collect() res1: Array[Int] = Array(1, 3, 6, 10, 15, 21, 28, 36, 45, 55, 66, 78, 91, 105, 120, 136, 153, 171, 190, 210, 231, 253, 276, 300, 325, 26, 53, 81, 110, 140, 171, 203, 236, 270, 305, 341, 378, 416, 455, 495, 536, 578, 621, 665, 710, 756, 803, 851, 900, 950, 51, 103, 156, 210, 265, 321, 378, 436, 495, 555, 616, 678, 741, 805, 870, 936, 1003, 1071, 1140, 1210, 1281, 1353, 1426, 1500, 1575, 76, 153, 231, 310, 390, 471, 553, 636, 720, 805, 891, 978, 1066, 1155, 1245, 1336, 1428, 1521, 1615, 1710, 1806, 1903, 2001, 2100, 2200) Raita Futo
  • 23. Well…. only very local mutation scala> val rdd = sc.parallelize(1.to(100), 10) scala> val result = rdd.map{e => x += e; x} scala> result.collect() res2: Array[Int] = Array(1, 3, 6, 10, 15, 21, 28, 36, 45, 55, 11, 23, 36, 50, 65, 81, 98, 116, 135, 155, 21, 43, 66, 90, 115, 141, 168, 196, 225, 255, 31, 63, 96, 130, 165, 201, 238, 276, 315, 355, 41, 83, 126, 170, 215, 261, 308, 356, 405, 455, 51, 103, 156, 210, 265, 321, 378, 436, 495, 555, 61, 123, 186, 250, 315, 381, 448, 516, 585, 655, 71, 143, 216, 290, 365, 441, 518, 596, 675, 755, 81, 163, 246, 330, 415, 501, 588, 676, 765, 855, 91, 183, 276, 370, 465, 561, 658, 756, 855, 955) PROSusanne Nilsson
  • 24. Spark & Immutability: A few small challenges ● Accumulators: ○ Accumulators “leak” on recompute and partial compute (another Spark trick) ● You _can_ mutate state inside of the worker or master it just won’t propagate ○ No error messages here, just unexpected future “happiness” Caroline
  • 25. Spark & Lambdas: Everyone gets a lambda! ● Maps, flatMaps, filters, reducers oh my (and that’s just wordcount :p ) ● Writing in Java 7? ○ You get a weird looking lambda! Mark Jensen
  • 26. Spark & Lambdas - closures & the dark side ● Python & Scala lambda serialization is (understandably) cautious ● Referencing a variable in the class brings the whole class with you ● Can create the impression closures are limited to serializable data ● ClosureCleaner.scala - “You weren’t using that class right?” & CloudPickle - “Welllll…. What about if we stored this differently?” ● pre-Java 8 (new FunctionX<A, B, C>... ugh) ● SQL API custom aggregates don’t currently support lambdas :( David Goehring
  • 27. Oh right: serialization :( ● Many things in JVM & Python are not well designed to be rehydrated on another VM let alone another machine ● Space efficiency: well…. At least its not XML? ● Oh wait, we have to parse Python serialized data in the JVM? Ruh roh! ● Oh hey let’s make it configurable… oh wait...
  • 28. Reaching non-“traditional” FP languages ● Spark works in many languages - Python, R, Java, etc. ○ Allows Spark to have a wider audience that doesn’t have the time (or tooling) to learn another language ○ e.g. Numerical Scala libraries are… rough ● Meeting developers where they are ● Some overhead (often) pushes systemy folks towards learning Scala for performance ● Sometimes we could do a better job of working with the existing FP tools ○ e.g. we could do more to make an RDD look like “normal” Python and maybe I’d stop typing .flatMap on Python itrs
  • 29. Reaching non-“traditional” FP languages lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0]) counts = lines.flatMap(lambda x: x.split(' ')) .map(lambda x: (x, 1)) .reduceByKey(add) output = counts.collect() for (word, count) in output: print("%s: %i" % (word, count)) photobom
  • 31. ML: Enough getters and setters for the 90s! ● Took inspiration from ski-kit-learn ○ Which is a cool system, just not super functionally oriented ● Added hidden metadata, which got dropped in lots of places (not quite as bad as global state buuuut….) ● Threw away compile time type information ● I really don’t have a + for this one in the FP column teaching
  • 32. Basic Dataprep pipeline for “ML” // Combines a list of double input features into a vector val assembler = new VectorAssembler().setInputCols(Array("age", "education-num")).setOutputCol("features") // String indexer converts a set of strings into doubles val indexer = StringIndexer().setInputCol("category") .setOutputCol("category-index") // Can be used to combine pipeline components together val pipeline = Pipeline().setStages(Array(assembler, indexer)) Huang Yun Chung
  • 33. Adding some ML (no longer cool -- DL)
// Specify model
val dt = new DecisionTreeClassifier()
  .setLabelCol("category-index")
  .setFeaturesCol("features")
// Add it to the pipeline
val pipeline_and_model = new Pipeline().setStages(
  Array(assembler, indexer, dt))
val pipeline_model = pipeline_and_model.fit(df)
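Under the hood, applying a fitted Pipeline is close to plain left-to-right function composition. A hedged sketch with hypothetical stand-in stages (the names and the row-as-dict representation are made up for illustration):

```python
from functools import reduce

def assemble(row):
    # Hypothetical stand-in for VectorAssembler: pack columns into "features".
    row = dict(row)
    row["features"] = [row["age"], row["education-num"]]
    return row

def index_category(row):
    # Hypothetical stand-in for an already-fitted StringIndexer.
    mapping = {"low": 0.0, "high": 1.0}
    row = dict(row)
    row["category-index"] = mapping[row["category"]]
    return row

stages = [assemble, index_category]

def run_pipeline(row):
    # Applying the pipeline is just folding the row through the stages.
    return reduce(lambda acc, stage: stage(acc), stages, row)

out = run_pipeline({"age": 35, "education-num": 10, "category": "low"})
print(out["features"], out["category-index"])
```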
  • 34. The internals… are fairly imperative ● Not everywhere, and not without reason (at the time they were created) ● No macros for historic performance reasons ○ Code generation is done by concatenating strings of Java code ● This would matter less, but Spark is pretty aggressive at trying to hide things, enough that some major projects end up having to peek a little bit at the code ● Can feel like “do what I say, not what I do” sometimes ○ Especially in ML, lots of hacks to make things go a little bit faster using internal APIs ivva
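The “concatenating strings of Java code” point can be shown in miniature in Python (an analogy only, not Spark's actual codegen path): build the specialized source as text, compile it, then call it.

```python
# Build the "specialized" function source as one big string -- the same
# string-concatenation style of codegen the slide describes, in miniature.
src_parts = [
    "def specialized_sum(values):\n",
    "    total = 0\n",
    "    for v in values:\n",
    "        total += v\n",
    "    return total\n",
]
src = "".join(src_parts)

# Compile and load the generated code, then use it like any other function.
namespace = {}
exec(compile(src, "<generated>", "exec"), namespace)
result = namespace["specialized_sum"]([1, 2, 3, 4])
print(result)
```

Fast at runtime, but opaque: there is no compile-time checking of the generated text, which is part of why it can feel imperative from the outside.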
  • 35. And folks need to access them :( ivva
  • 36. Let’s end on a happy-ish note ● Yes, Spark isn’t perfect, but its APIs are teaching a generation of “big data” developers & data scientists functional programming without calling it that (at the start) ● Once we get them hooked we can show them cool things! ● If we can find other areas where we make tools that expose FP APIs to more non-FP folks (and not just for the sake of FP) we can make better software ● Or we can all go back to writing 90s style enterprise Java
  • 37. Learning Spark Fast Data Processing with Spark (Out of Date) Fast Data Processing with Spark (2nd edition) Advanced Analytics with Spark Spark in Action High Performance Spark Learning PySpark
  • 38. High Performance Spark! Available today! A great second Spark book to read (but please buy it first) You can buy it from that scrappy Seattle bookstore, Jeff Bezos needs another newspaper and I want a cup of coffee. http://bit.ly/hkHighPerfSpark
  • 39. And some upcoming talks: ● July ○ OSCON Portland & meetup ● August ○ JupyterCon NYC ● September ○ Strata NYC ○ Strangeloop STL ● October ○ Spark Summit London ○ Reversim Tel Aviv
  • 40. k thnx bye :) If you care about Spark testing and don’t hate surveys: http://bit.ly/holdenTestingSpark Will tweet results “eventually” @holdenkarau Pssst: Have feedback on the presentation? Give me a shout (holden@pigscanfly.ca or http://bit.ly/holdenTalkFeedback ) if you feel comfortable doing so :) Feedback (if you are so inclined): http://bit.ly/holdenTalkFeedback

Editor's Notes

  • #9: introduce spark rdds, purple blog diagrams
  • #15: https://www.flickr.com/photos/jon_a_ross/2679856182/in/photolist-55NXSW-4UZZHe-e1Ubar-8oA19X-4V2hrU-4UX6dT-4HpqVm-58CV9k-ardHmQ-72uLB3-6p6gqL-58gez2-hjhDoA-4MqZrU-8ZMidf-4NFd8N-4NFcMQ-9R6Dr6-55JQDr-rxeWPU-oDVKTS-arbcbX-arbbTp-aVNBqi-47TCvC-4NFctq-b4BE3p-7WcAGh-9w8FFR-6HYNpP-662zun-5LX51n-5BWeR2-oZc3Xk-ewax6c-7Z3vKE-e5W5AJ-bi3HtM-bEBTUZ-s1c3gw-qMbK5K-6heJzF-g6YbwT-aoRa8z-kNDkqL-YRwm-4BESNo-iRhKvk-ib7bUU-nmuxdF
  • #19: We can examine how RDD’s work in practice with the traditonal word count example. If you’ve taken another intro to big data class, or just worked with mapreduce you’ll notice that this is a lot less code than we normally have to do. https://www.flickr.com/photos/feverblue/1166368091/in/photolist-2M4WAF-HKi9Wb-bPUuHF-bAZRBu-bAZRJs-cPgNi9-cPgM5y-ecdXE4-qiy3fi-ece5nR-8DSTJT-ekXn2J-ekXkQA-ekRMVx-p5EdCT-424qe7-41ZhFv-cHJ84A-ekRLzc-cS67Hu-cS6sKq-cEcmt5-9ae42r-eoqvBm-HH9A1M-846gJQ-dk6J-nqxTYN-J1PZUH-5q2iLC-HVDkA7-g6xvD3-96MiqV-hbZene-46uZse-6SUYhp-dGaXzD-oUC1ic-DayAdv-aoPJSZ-73L8Lm-5qVCiG-5qVC5q-3Kyyzj-Bf5UzP-4CbKo9-9ae8KM-5CXLKZ-pjFpPT-eccLSR
  • #21: The 3rd cat hiding in the background is serialization
  • #22: The 3rd cat hiding in the background is serialization
  • #23: The 3rd cat hiding in the background is serialization
  • #24: The 3rd cat hiding in the background is serialization
  • #29: Red panda image is public domain https://www.flickr.com/photos/mathiasappel/25901445745/in/photolist-FsPFqM-8V1c5-EfF7ct-P6AMfB-oUV39B-EenZCo-oVbE3K-DDMMtE-295N7-dtVySZ-dtVzLP-7yXK47-6DEn2P-oY28m2-DnHaur-qtt3m9-DzTW6Q-E6JQmA-7yNdac-8gAHMU-8469Y8-pCUWfm-qLtvZk-E1BDYY-A5UiNX-bEw84A-yXwV8w-dAyief-z7erYS-BgXdnN-E2yyjM-fhXmyG-Dvyp78-qW9LEZ-qDHaZv-B9BzuN-nL11Wn-C4spLb-8wnAdN-6S1ECS-ntw1gj-7zbXAA-sanjgW-Ci72XP-oNYs5j-rBErd-awBvQG-4YDCuv-G82zkD-zwcfSu
  • #34: https://www.flickr.com/photos/wapiko57/6514540899/in/photolist-82QaA6-aVfJNM-oX8Dp7-aVEJ3F-qTG9ni-97uBZ7-97SVrH-qWFs4R-cgE8rJ-a9mSXv-qm83Bv-cUPhgC-988EVA-kUgwo-4sqj48-8e6MB6-apVrgH-3KAUyx-5F373J-qyD7E9-j17GZ-eakbAD-VrPk79-4GSqUt-9Kwe3v/