Accelerating
Big Data Beyond the JVM
Engine noises go here
@holdenkarau & @warre_n_peace
Rachel
- Rachel Warren → She/Her
- Data Scientist / Software Engineer at Salesforce Einstein
- Formerly at Alpine Data (with Holden)
- Lots of experience scaling Spark in different production environments
- The other half of the High Performance Spark team :)
- @warre_n_peace
- LinkedIn: https://www.linkedin.com/in/rachelbwarren/
- Slideshare: https://www.slideshare.net/RachelWarren4/
- Github: https://github.com/rachelwarren
Holden:
โ— My name is Holden Karau
โ— Prefered pronouns are she/her
โ— Developer Advocate at Google
โ— Apache Spark PMC :)
โ— previously IBM, Alpine, Databricks, Google, Foursquare & Amazon
โ— co-author of Learning Spark & High Performance Spark
โ— @holdenkarau
โ— Slide share http://www.slideshare.net/hkarau
โ— Linkedin https://www.linkedin.com/in/holdenkarau
โ— Github https://github.com/holdenk
โ— Spark Videos http://bit.ly/holdenSparkVideos
โ— Talk feedback: http://bit.ly/holdenTalkFeedback http://bit.ly/holdenTalkFeedback
Big Data Beyond the JVM - Strata San Jose 2018
Who I think you wonderful humans are?
● Nice enough people
● I'm sure you love pictures of cats
● Might know some of the different distributed systems talked about
● Possibly know some Python or R
● Trying to learn Python to become a deep learning startup
● Or are just tired of Scala/Java
Lori Erickson
What will be covered?
● A more detailed look at the current state of PySpark
● Why it isn't good enough and plans for the future
● Using Arrow for fast Python UDFs with Spark
● Dask
● Beam outside the JVM
● Our even less subtle attempts to get you to buy our new book
● Pictures of cats & stuffed animals
● tl;dr - We've* made some bad** choices historically, and projects like Arrow & friends can save us from some of these (yay!)
What's the state of non-JVM big data?
Most of the tools are built in the JVM, so how do we play together?
● Pickling, strings, JSON, XML, oh my!
● Unix pipes
● Sockets
What if we don't want to copy the data all the time? DataFrame API + Arrow
● Or standalone "pure"* re-implementations of everything
○ Reasonable option for things like Kafka where you would have the I/O regardless.
○ Also cool projects like Dask (pure Python) -- but hard to talk to the existing ecosystem
David Brown
PySpark:
● The Python interface to Spark
● Fairly mature, integrates well-ish into the ecosystem, less a Pythonrific API
● Has some serious performance hurdles from the design
● Same general technique used as the basis for the other non-JVM implementations in Spark
○ C#
○ R
○ Julia
○ Javascript - surprisingly different
Yes, we have wordcount! :p
lines = sc.textFile(src)
words = lines.flatMap(lambda x: x.split(" "))
word_count = (words.map(lambda x: (x, 1))
              .reduceByKey(lambda x, y: x + y))
word_count.saveAsTextFile(output)
No data is read or processed until after this line.
This is an "action" which forces Spark to evaluate the RDD.
These are still combined and executed in one Python executor.
Trish Hamme
A quick detour into PySpark's internals
(Py4J + pickling + JSON)
Spark in Scala, how does PySpark work?
● Py4J + pickling + JSON and magic
○ Py4J in the driver
○ Pipes to start the Python process from the Java exec
○ cloudpickle to serialize data between the JVM and Python executors (transmitted via sockets) - see the pickling sketch below
○ JSON for the DataFrame schema
● Data from a Spark worker is serialized and piped to a Python worker --> then piped back to the JVM
○ Multiple iterator-to-iterator transformations are still pipelined :)
○ So serialization happens only once per stage
● Spark SQL (and DataFrames) avoid some of this
kristin klein
So what does that look like?
(Diagram: the driver talks to the JVM via Py4J; each of the K JVM workers pipes data to and from its own Python worker process.)
So how does that impact PySpark?
● Double serialization cost makes everything more expensive
● Python worker startup takes a bit of extra time
● Python memory isn't controlled by the JVM - easy to go over container limits if deploying on YARN or similar (see the config sketch below)
● Error messages make ~0 sense
● Spark features aren't automatically exposed, but exposing them is normally simple
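For the memory point, the usual workaround is to reserve off-heap headroom for the Python workers. A minimal sketch (the property name is the Spark 2.3 spelling; older releases use spark.yarn.executor.memoryOverhead, and the value is purely illustrative):

from pyspark import SparkConf

# Leave room outside the JVM heap for the Python workers so YARN doesn't
# kill the container; size it to your UDFs' appetite.
conf = SparkConf().set("spark.executor.memoryOverhead", "2g")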
Our saviour from serialization: DataFrames
● For the most part keeps data in the JVM
○ Notable exception is UDFs written in Python
● Takes our Python calls and turns them into a query plan
● If we need more than the native operations in Spark's DataFrames we end up in a pickle ;)
● Be wary of distributed systems bringing claims of usability….
Andy Blackledge
So what are Spark DataFrames?
● More than SQL tables
● Not Pandas or R DataFrames
● Semi-structured (have schema information)
● Tabular
● Work on expressions as well as lambdas (see the sketch below)
○ e.g. df.filter(df.col("happy") == true) instead of rdd.filter(lambda x: x.happy == true)
● Not a subset of Spark "Datasets" - since the Dataset API isn't exposed in Python yet :(
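A minimal sketch of the difference (assuming a DataFrame df with a boolean column "happy"): the expression form stays in the JVM as part of the query plan, while the lambda form sends every row through a Python worker.

happy_df = df.filter(df["happy"] == True)          # expression: stays in the JVM
happy_rdd = df.rdd.filter(lambda row: row.happy)   # lambda: pays the Python round trip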
Quinn Dombrowski
Word count w/DataFrames
df = sqlCtx.read.load(src)
# Returns an RDD
words = df.select("text").rdd.flatMap(lambda x: x.text.split(" "))
words_df = words.map(lambda x: Row(word=x, cnt=1)).toDF()
word_count = words_df.groupBy("word").sum()
word_count.write.format("parquet").save("wc.parquet")
Still have the double serialization here :(
We can see the difference easily:
Andrew Skudder
*Vendor benchmark. Trust but verify.
*For a small price of your fun libraries. Bad idea.
That was a bad idea, buuut…..
● Work going on in Scala land to translate simple Scala into SQL expressions - needs the Dataset API
○ Maybe we can try similar approaches with Python?
● POC uses Jython for simple UDFs (e.g. 2.7 compat & no native libraries) - SPARK-15369
○ Early benchmarking w/word count: 5% slower than a native Scala UDF, close to 2x faster than regular Python
● Willing to share your Python UDFs for benchmarking? - http://bit.ly/pySparkUDF
*The future may or may not have better performance than today. But bun-bun the bunny has some lettuce so it's ok!
Big Data Beyond the JVM - Strata San Jose 2018
The "future"*: faster interchange
● By future I mean availability starting in the next 3-6 months (with more improvements after).
○ Yes, much of this code exists, it just isn't released yet, so I'm sure we'll find all sorts of bugs and ways to improve.
○ Relatedly, you can help us test Spark 2.3 when we start the RC process to catch bugs early!
● Unifying our cross-language experience
○ And not just "normal" languages, CUDA counts yo
Tambako The Jaguar
The present!!!!!!!!!
● Arrow support for UDF acceleration is in the latest Spark release
● It's not finished (e.g. UDAFs still make sadness)
Ren Kuo
Andrew Skudder
*Arrow: likely the future. I really hope so. Spark 2.3 and beyond!
What does the future look like?*
*Source: https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
*Vendor benchmark. Trust but verify.
What does the future look like - in code
@pandas_udf("integer", PandasUDFType.SCALAR)
def add_one(x):
return x + 1
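A minimal usage sketch (assuming a SparkSession named spark); the UDF receives a whole pandas Series per Arrow batch rather than one row at a time:

df = spark.range(0, 5).selectExpr("cast(id as int) as x")
df.select(add_one(df["x"]).alias("x_plus_one")).show()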
What does the future look like - in code
@pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
def normalize(pdf):
v = pdf.v
return pdf.assign(v=(v - v.mean()) / v.std())
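And a sketch of applying it (assuming a DataFrame df with columns "id" and "v"): each group arrives as a pandas DataFrame, and the returned frames are stitched back together.

df.groupby("id").apply(normalize).show()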
What does the future look like - in code
@pandas_udf("word string", PandasUDFType.GROUPED_MAP)
def special_tokenize(s):
if s.strings is not None:
return pandas.DataFrame(reduce(list.__add__, map(lambda x:
x.split(' '), s.strings)))
# This is a little ugly because currently the scalar transform
# doesn't allow flat map like behaviour only map like.
grouped = df.groupby("strings")
tokenized = grouped.apply(special_tokenize)
tokenized.show()
Word Count #3 $$$$$
Beyond wordcount: dependencies?
● Your machines probably already have pandas
○ But maybe an old version
● But they might not have "special_business_logic"
○ Very special business logic, no one wants to change Fortran code*.
● Option 1: Talk to your vendor**
● Option 2: Try some sketchy open source software from a hack day
● We're going to focus on option 2!
*Because it's perfect, it is Fortran after all.
** I don't like this option because the vendor I work for doesn't have an answer.
coffee_boat to the rescue*
# You can tell it's alpha because we're installing from GitHub
!pip install --upgrade git+https://github.com/nteract/coffee_boat.git
# Use the coffee boat
from coffee_boat import Captain
captain = Captain(accept_conda_license=True)
captain.add_pip_packages("pyarrow", "edtf")
captain.launch_ship()
sc = SparkContext(master="yarn")
# You can now use pyarrow & edtf
captain.add_pip_packages("yourmagic")
# You can now use your magic in transformations!
Hadoop "streaming" (Python/R)
● Unix pipes!
● Involves a data copy, formats get sad
● But the overhead of a Map/Reduce task is pretty high anyways... (a minimal mapper sketch follows)
Lisa Larsson
Kafka: re-implement all the things
● Multiple options for connecting to Kafka from outside of the JVM (yay!)
● They implement the protocol to talk to Kafka (yay!)
● This involves duplicated client work, and sometimes the clients can be slow (solution: FFI bindings to C instead of Java)
● Buuuut -- we can't access all of the cool Kafka business (like Kafka Streams), and features depend on client libraries implementing them (easy to slip below parity) - a tiny client sketch follows
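For example, a minimal sketch with kafka-python, one of those protocol re-implementations (it assumes a broker on localhost:9092 and a topic named "events"):

from kafka import KafkaConsumer

# kafka-python speaks the Kafka wire protocol directly from Python --
# there is no JVM client underneath.
consumer = KafkaConsumer("events", bootstrap_servers="localhost:9092")
for message in consumer:
    print(message.value)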
Smokey Combs
Dask: a new beginning?
● Pure* Python implementation
● Provides a real enough DataFrame interface for distributed data
● Also your standard-ish distributed collections
● Multiple backends
● Primary challenge: interacting with the rest of the big data ecosystem
○ Arrow & friends might make this better with time too, buuut….
● See https://dask.pydata.org/en/latest/ & http://dask.pydata.org/en/latest/spark.html (tiny example below)
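A tiny Dask sketch (assuming dask[dataframe] is installed and some CSVs with word and cnt columns live under data/); the API mirrors pandas, the work is split across partitions, and nothing runs until .compute():

import dask.dataframe as dd

df = dd.read_csv("data/*.csv")                # lazily builds a partitioned DataFrame
word_count = df.groupby("word")["cnt"].sum()  # still lazy
print(word_count.compute())                   # triggers the actual computation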
Lisa Zins
BEAM Beyond the JVM
● Non-JVM BEAM doesn't work outside of Google's environment yet, so I'm going to skip the details.
● tl;dr: uses gRPC / protobuf (a Python SDK sketch follows)
● But exciting new plans to unify the runners and ease the support of different languages (called SDKs)
○ See https://beam.apache.org/contribute/portability/
● If this is exciting, you can come join me in making BEAM work on Python 3
○ Yes, we still don't have that :(
○ But we're getting closer!
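For flavour, a minimal Beam Python SDK wordcount sketch (assuming apache_beam is installed; at the time of this talk it only ran on Python 2):

import apache_beam as beam

with beam.Pipeline() as p:
    counts = (p
              | beam.Create(["hello world", "hello beam"])
              | beam.FlatMap(lambda line: line.split(" "))
              | beam.combiners.Count.PerElement())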
Why now?
● There's been better formats/options for a long time
● JVM devs want to use libraries in other languages with lots of data
○ e.g. startup + Deep Learning + ? => profit
● Arrow has solved the chicken-egg problem by building not just the chicken & the egg, but also a hen house (small pyarrow sketch below)
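A small pyarrow sketch of the interchange that makes this work (assuming pyarrow and pandas are installed): the same columnar buffers can be handed across languages instead of re-serializing row by row.

import pandas as pd
import pyarrow as pa

pdf = pd.DataFrame({"word": ["cat", "dog"], "cnt": [3, 1]})
table = pa.Table.from_pandas(pdf)   # columnar Arrow representation
roundtrip = table.to_pandas()       # back to pandas without a row-by-row copy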
Andrew Mager
References
● Apache Arrow: https://arrow.apache.org/
● Brian (IBM) on initial Spark + Arrow: https://arrow.apache.org/blog/2017/07/26/spark-arrow/
● Li Jin (Two Sigma): https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
● Bill Maimone: https://blogs.nvidia.com/blog/2017/06/27/gpu-computation-visualization/
(Book covers: Learning Spark, Fast Data Processing with Spark (out of date), Fast Data Processing with Spark (2nd edition), Advanced Analytics with Spark, Spark in Action, High Performance Spark, Learning PySpark)
High Performance Spark!
You can buy it today! Or come to our book signing at 3:20 and maybe get a free copy.
Only one chapter on non-JVM stuff, I'm sorry.
Cats love it*
*Or at least the box it comes in. If buying for a cat, get print rather than e-book.
And some upcoming talks:
● March
○ PyData Ann Arbor March 13 - I just booked my flight 4 hours ago!
● April
○ Flink Forward
○ Dataworks Summit Berlin
○ Kafka Summit London
○ PyData London
● May
○ Strata London: https://conferences.oreilly.com/strata/strata-eu/public/schedule/detail/64759
k thnx bye :)
If you care about Spark testing and don't hate surveys: http://bit.ly/holdenTestingSpark
I need to give a testing talk in a few months, help a "friend" out.
Will tweet results "eventually" @holdenkarau
Do you want more realistic benchmarks? Share your UDFs! http://bit.ly/pySparkUDF
It's performance review season, so help a friend out and fill out this survey with your talk feedback: http://bit.ly/holdenTalkFeedback
Bonus Slides
Maybe you ask a question and we go here :)
We can do that w/Kafka Streams..
● Why bother learning from our mistakes?
● Or more seriously, the mistakes weren't that bad...
Our "special" business logic

def transform(input):
    """
    Transforms the supplied input.
    """
    return str(len(input))
Pargon
Let's pretend all the world is a string:

override def transform(value: String): String = {
  // WARNING: This may summon cuthuluhu
  dataOut.writeInt(value.getBytes.size)
  dataOut.write(value.getBytes)
  dataOut.flush()
  val resultSize = dataIn.readInt()
  val result = new Array[Byte](resultSize)
  dataIn.readFully(result)
  // Assume UTF8, what could go wrong? :p
  new String(result)
}
From https://github.com/holdenk/kafka-streams-python-cthulhu
Then make an instance to use it...

val testFuncFile = "kafka_streams_python_cthulhu/strlen.py"
stream.transformValues(
  PythonStringValueTransformerSupplier(testFuncFile))
// Or we could wrap this in the bridge but that's effort.
From https://github.com/holdenk/kafka-streams-python-cthulhu
Let's pretend all the world is a string:

def main(socket):
    while (True):
        input_length = _read_int(socket)
        data = socket.read(input_length)
        result = transform(data)
        resultBytes = result.encode()
        _write_int(len(resultBytes), socket)
        socket.write(resultBytes)
        socket.flush()
From https://github.com/holdenk/kafka-streams-python-cthulhu
What does that let us do?
● You can add a map stage with your data scientist's Python code in the middle
● You're limited to strings*
● Still missing the "driver side" integration (e.g. the interface requires someone to make a Scala class at some point)
What about things other than strings?
Use another system
● Like Spark! (oh wait) or BEAM* or FLINK*?
Write it in a format Python can understand:
● Pickling (from Java)
● JSON
● XML
Purely Python solutions
● Currently roll-your-own (but not that bad)
*These are also JVM-based solutions calling into Python. I'm not saying they will also summon Cthulhu, I'm just saying hang onto your souls.