1
Fulfilling Apache Arrow's Promises:
Pandas on JVM memory without a copy
PyCon.DE Karlsruhe 2018
Uwe L. Korn
2
About me
• Senior Data Scientist at Blue Yonder (@BlueYonderTech)
• Apache {Arrow, Parquet} PMC
• Data Engineer and Architect with a heavy focus on Pandas
xhochy
mail@uwekorn.com
3
What’s Apache Arrow?
• Published in February 2016
• Specification for in-memory columnar data layout
• No overhead for cross-system communication
• Designed for efficiency (exploit SIMD, cache locality, ...)
• Exchange data without conversion between Python, C++, C (GLib), Ruby, Lua, R, JavaScript, Go, Rust, MATLAB and the JVM
• Brought Parquet to Pandas and made PySpark fast (@pandas_udf)
4
February 2016: Birth of Apache Arrow
Just a goal…
5
Data Science Workflow in 2018
[Diagram: SQL Engine, pre-processing with pandas and machine learning model (both in Python), probability density function (PDF)]
6
Looks simple?
• It isn’t.
• “Data” is a very heterogeneous landscape
• Most common setup:
• Java/Scala, i.e. JVM, for data processing
• Python for machine learning
7
Data Science Workflow in 2018
[Diagram: SQL Engine, JDBC Driver, JayDeBeApi, pre-processing with pandas and machine learning model (in Python); data crosses as JDBC rows, then as Python rows]
8
org.apache.arrow.adapter.jdbc
• Retrieve JDBC results as Arrow RecordBatch / VectorSchemaRoot
• Do conversion of rows to columns in the JVM
• Data is stored “off-heap”, i.e.:
• not managed by the JVM
• native memory layout, same as in pyarrow
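
The row-to-column conversion can be sketched in plain Python. This is only an illustration of the idea, not the adapter's code: the adapter does this in Java and writes into off-heap Arrow vectors, whereas the sketch below uses the standard-library array module as a stand-in for a contiguous typed buffer. The sample rows and the rows_to_columns helper are hypothetical.

```python
from array import array

# Hypothetical JDBC-style result set: one tuple per row (id, score).
rows = [
    (1, 9.5),
    (2, 7.25),
    (3, 8.0),
]


def rows_to_columns(rows):
    """Pivot row tuples into one contiguous typed buffer per column."""
    ids, scores = zip(*rows)
    return {
        "id": array("q", ids),        # 64-bit signed integers
        "score": array("d", scores),  # 64-bit floats
    }


columns = rows_to_columns(rows)
```

After the pivot, each column is a single homogeneous buffer, which is the shape a columnar consumer expects.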
9
Workflow in 2018 with Arrow
[Diagram: SQL Engine, JDBC Driver, org.apache.arrow.adapter.jdbc, pre-processing with pandas and machine learning model (in Python); data crosses as JDBC rows, then as Arrow; the remaining link into Python is marked "?"]
10
So we’re done? No.
• We still only have Arrow data in the JVM
• Arrow and Pandas have a slightly different memory layout
• We have this today in PySpark
• It’s fast
• Still involves a copy over the network
• Arrow → pandas conversion is tuned but still a copy
11
pyarrow.jvm
• Access Arrow data created in the JVM from Python
• Involves no copy of the data
• Translation of the helper objects
• Actually passes memory addresses around
No copy between the JVM and Python!
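
What "passing memory addresses around" means can be shown with a small analogy that uses only Python's ctypes module. It is not pyarrow.jvm itself: the producer and consumer names are hypothetical stand-ins for the JVM side and the Python side, and both live in one process here.

```python
import ctypes

# "Producer" side: allocate a buffer of four 64-bit integers.
# The producer must keep this object alive while the consumer uses it.
producer = (ctypes.c_int64 * 4)(10, 20, 30, 40)

# Only these two plain integers cross the boundary, not the data itself.
address = ctypes.addressof(producer)
length = len(producer)

# "Consumer" side: wrap the very same memory without copying it.
consumer = (ctypes.c_int64 * length).from_address(address)

# A write through one view is visible through the other,
# which shows that both refer to the same bytes.
producer[0] = 99
first_seen_by_consumer = consumer[0]
```

The cost of handing over the data is independent of its size, because only the address and the length are exchanged.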
NumPy & the BlockManager
Photo by Susan Holt Simpson on Unsplash
13
Pandas Shortcomings
• Limited to NumPy data types, otherwise object
• Columns are not separate, grouped by type
• Nullability is not type-safe (yet)
→ Arrow memory does not match Pandas memory
→ Copy 😢
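
Why the layouts differ for nullable data can be illustrated without either library. This is a sketch under stated assumptions: the float fallback mimics what a NumPy-backed integer column with missing values turns into, and the bitmap follows Arrow's convention of one validity bit per slot, least-significant bit first. All names are hypothetical, and the array module stands in for real buffers.

```python
from array import array

values = [1, None, 3]

# NumPy-style: there is no integer "missing" marker,
# so the whole column falls back to floats with NaN.
as_float = array("d", [float("nan") if v is None else float(v) for v in values])

# Arrow-style: keep 64-bit integers and track validity in a separate bitmap.
data = array("q", [0 if v is None else v for v in values])
bitmap = bytearray((len(values) + 7) // 8)
for i, v in enumerate(values):
    if v is not None:
        bitmap[i // 8] |= 1 << (i % 8)  # set bit i: slot i holds a value


def is_valid(i):
    """Return True if slot i holds a value, False if it is null."""
    return bool((bitmap[i // 8] >> (i % 8)) & 1)
```

The two representations hold the same logical column in different bytes, so moving from one to the other requires a conversion.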
14
Pandas ExtensionArrays
• Introduced new interfaces in 0.23
• ExtensionDtype
• What type of scalars?
• ExtensionArray
• Implement basic array ops
• Pandas provides algorithms on top
• Still experimental; wait for 0.24
15 Photo by Niklas Tidbury on Unsplash
16
fletcher
• https://github.com/xhochy/fletcher
• Implements Extension{Array,Dtype} with Apache Arrow as storage
• Uses Numba to implement the necessary analytics on top
• Needs {pandas, Arrow, …} master
No copy between Apache Arrow and pandas!
17
Workflow in 2018 with Arrow
[Diagram: SQL Engine, JDBC Driver, org.apache.arrow.adapter.jdbc, pre-processing with pandas and machine learning model (in Python); data crosses as JDBC rows, then as Arrow; pyarrow.jvm / fletcher bridge Arrow into pandas]
18
???
Does it work?
Make your
best decision
today.
blueyonder.ai/en/careers
Blue Yonder Analytics, Inc.
5048 Tennyson Parkway
Suite 250
Plano, Texas 75024
USA
21
22
Get Involved!
Apache Arrow
Cross-language DataFrame library
• Website: https://arrow.apache.org/
• ML: dev@arrow.apache.org
• Issues & Tasks: https://issues.apache.org/jira/browse/ARROW
• Slack: https://apachearrowslackin.herokuapp.com/
• GitHub mirror: https://github.com/apache/arrow
Apache Parquet
Famous columnar file format
• Website: https://parquet.apache.org/
• ML: dev@parquet.apache.org
• Issues & Tasks: https://issues.apache.org/jira/browse/PARQUET
• Slack: https://parquet-slack-invite.herokuapp.com/
• C++ GitHub mirror: https://github.com/apache/parquet-cpp
