Fulfilling Apache Arrow's Promises: Pandas on JVM memory without a copy

PyCon.DE Karlsruhe 2018
Uwe L. Korn

About me
• Senior Data Scientist at Blue Yonder (@BlueYonderTech)
• Apache {Arrow, Parquet} PMC
• Data Engineer and Architect with heavy focus around Pandas
• xhochy
• mail@uwekorn.com

What’s Apache Arrow?
• Published in February 2016
• Specification for in-memory columnar data layout
• No overhead for cross-system communication
• Designed for efficiency (exploit SIMD, cache locality, …)
• Exchange data without conversion between Python, C++, C(glib), Ruby,
Lua, R, JavaScript, Go, Rust, Matlab and the JVM
• Brought Parquet to Pandas and made PySpark fast (@pandas_udf)
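
To make "in-memory columnar data layout" concrete, here is a minimal plain-Python sketch of how one nullable int64 column can be held as two flat buffers: a contiguous values buffer plus a validity bitmap with one bit per slot (least-significant bit first, 1 meaning "valid"), which is the convention of the Arrow format. The helper names `build_int64_column` and `get` are made up for this illustration and are not part of any Arrow library.

```python
from array import array


def build_int64_column(values):
    """Pack ints/None into (values buffer, validity bitmap, null count)."""
    # Contiguous 64-bit slots; null slots get a placeholder value of 0.
    data = array("q", (0 if v is None else v for v in values))
    # One validity bit per slot, rounded up to whole bytes.
    bitmap = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v is not None:
            bitmap[i // 8] |= 1 << (i % 8)  # LSB-first numbering, 1 = valid
    null_count = sum(v is None for v in values)
    return data, bitmap, null_count


def get(data, bitmap, i):
    """Read slot i, consulting its validity bit before touching the value."""
    if (bitmap[i // 8] >> (i % 8)) & 1:
        return data[i]
    return None


data, bitmap, null_count = build_int64_column([1, None, 3, 4])
print(data.tolist(), bin(bitmap[0]), null_count)
```

Because the values stay in one typed, contiguous buffer, a null never changes the column's type, and any system that agrees on this layout can read the same bytes without converting them.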

February 2016: Birth of Apache Arrow
Just a goal…

Data Science Workflow in 2018
[Diagram: SQL Engine → pre-processing with pandas → Python machine learning model → probability density function (PDF)]

Looks simple?
• It isn’t.
• “Data” is a very heterogeneous landscape
• Most common setup:
• Java/Scala, i.e. JVM, for data processing
• Python for machine learning

Data Science Workflow in 2018
[Diagram: SQL Engine → JDBC Driver (JDBC rows) → JayDeBeApi (Python rows) → pre-processing with pandas → Python machine learning model]

org.apache.arrow.adapter.jdbc
• Retrieve JDBC results as Arrow RecordBatch / VectorSchemaRoot
• Do conversion of rows to columns in the JVM
• Data is stored “off-heap”, i.e.:
• not managed by the JVM
• native memory layout, same as in pyarrow
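
The row-to-column conversion mentioned above can be sketched in plain Python. This is only an illustration of the idea, not the adapter's code: `rows_to_columns`, the column names, and the sample rows are all hypothetical, and the standard-library `array` type stands in for a typed, contiguous column buffer.

```python
from array import array


def rows_to_columns(rows, names, typecodes):
    """Pivot row tuples into one typed, contiguous buffer per column."""
    columns = {name: array(code) for name, code in zip(names, typecodes)}
    for row in rows:
        for name, value in zip(names, row):
            columns[name].append(value)
    return columns


# Hypothetical result set, shaped like what a row-oriented driver hands back.
rows = [(1, 9.5), (2, 7.25), (3, 8.0)]
columns = rows_to_columns(rows, names=("id", "score"), typecodes=("q", "d"))
print(columns["id"].tolist(), columns["score"].tolist())
```

Doing this pivot once, on the JVM side, means the Python side never has to materialise one Python object per cell.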

Workflow in 2018 with Arrow
[Diagram: SQL Engine → JDBC Driver (JDBC rows) → org.apache.arrow.adapter.jdbc (Arrow) → ? → pre-processing with pandas → Python machine learning model]

So we’re done? No.
• We still only have Arrow data in the JVM
• Arrow and Pandas have a slightly different memory layout
• We have this today in PySpark
• It’s fast
• Still involves a copy over the network
• Arrow → pandas conversion is tuned but still a copy

pyarrow.jvm
• Access Arrow data created in the JVM from Python
• Involves no copy of the data
• Translation of the helper objects
• Actually passes memory addresses around
No copy between the JVM and Python!
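
What "passes memory addresses around" means can be shown with the standard library alone. The sketch below is not pyarrow.jvm's implementation; it only assumes a producer that owns a native buffer and a consumer that receives nothing but an integer address and a length. `view_int64` is a hypothetical helper, and a `ctypes` array stands in for memory allocated by another runtime.

```python
import ctypes


def view_int64(address, length):
    """Rebuild a typed array view from a raw address and element count."""
    # from_address wraps the existing memory; nothing is copied.
    return (ctypes.c_int64 * length).from_address(address)


# Producer side: a native buffer, standing in for off-heap memory.
producer = (ctypes.c_int64 * 4)(10, 20, 30, 40)
address = ctypes.addressof(producer)  # a plain integer, cheap to hand over

# Consumer side: only (address, length) crossed the boundary.
consumer = view_int64(address, len(producer))
print(list(consumer))

# Both names refer to the same bytes, so a write through one is seen by the other.
producer[0] = 99
print(consumer[0])
```

The view does not own the memory, so the producer has to keep the buffer alive for as long as the consumer uses it. Any real zero-copy bridge has to handle that lifetime question somehow.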

NumPy & the BlockManager

Pandas Shortcomings
• Limited to NumPy data types, otherwise object
• Columns are not separate, grouped by type
• Nullability is not type-safe (yet)
→ Arrow memory does not match Pandas memory
→ Copy 😢
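
One way to read the nullability point: without a separate validity mask, a missing integer has to be encoded as a sentinel such as NaN, which turns the whole column into floats. The plain-Python sketch below imitates that encoding (it does not call pandas; `to_float_with_nan` and the sample ids are made up) and shows that a large integer does not survive the round trip.

```python
import math


def to_float_with_nan(values):
    """Encode missing entries as NaN, which forces every value to a float."""
    return [math.nan if v is None else float(v) for v in values]


# Hypothetical id column with one missing entry.
ids = [2**53 + 1, None, 7]
encoded = to_float_with_nan(ids)

# Decoding back to integers: the first id has silently changed.
round_tripped = [None if math.isnan(x) else int(x) for x in encoded]
print(ids[0], round_tripped[0])
```

A 64-bit float has only 53 bits of mantissa, so integers above 2**53 cannot all be represented exactly. Keeping the values as integers and tracking nulls in a separate mask avoids both the type change and the precision loss.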

Pandas ExtensionArrays
• Introduced new interfaces in 0.23
• ExtensionDtype
• What type of scalars?
• ExtensionArray
• Implement basic array ops
• Pandas provides algorithms on top
• Still experimental; wait for 0.24


fletcher
• https://github.com/xhochy/fletcher
• Implements Extension{Array,Dtype} with Apache Arrow as storage
• Uses Numba to implement the necessary analytics on top
• Needs {pandas, Arrow, …} master
No copy between Apache Arrow and pandas!

Workflow in 2018 with Arrow
[Diagram: SQL Engine → JDBC Driver (JDBC rows) → org.apache.arrow.adapter.jdbc (Arrow) → pyarrow.jvm / fletcher → pre-processing with pandas → Python machine learning model]


Get Involved!

Apache Arrow
Cross language DataFrame library
• Website: https://arrow.apache.org/
• ML: dev@arrow.apache.org
• Issues & Tasks: https://issues.apache.org/jira/browse/ARROW
• Slack: https://apachearrowslackin.herokuapp.com/
• Github mirror: https://github.com/apache/arrow

Apache Parquet
Famous columnar file format
• Website: https://parquet.apache.org/
• ML: dev@parquet.apache.org
• Issues & Tasks: https://issues.apache.org/jira/browse/PARQUET
• Slack: https://parquet-slack-invite.herokuapp.com/
• C++ Github mirror: https://github.com/apache/parquet-cpp
