Improving Pandas and PySpark interoperability with Apache Arrow
Li Jin
PyData NYC, November 2017
IMPORTANT LEGAL INFORMATION
• The information presented here is offered for informational purposes only and should not be used for any other purpose (including, without limitation, the making of investment decisions). Examples provided herein are for illustrative purposes only and are not necessarily based on actual data. Nothing herein constitutes: an offer to sell or the solicitation of any offer to buy any security or other interest; tax advice; or investment advice. This presentation shall remain the property of Two Sigma Investments, LP (“Two Sigma”) and Two Sigma reserves the right to require the return of this presentation at any time.
• Some of the images, logos or other material used herein may be protected by copyright and/or trademark. If so, such copyrights and/or trademarks are most likely owned by the entity that created the material and are used purely for identification and comment as fair use under international copyright and/or trademark laws. Use of such image, copyright or trademark does not imply any association with such organization (or endorsement of such organization) by Two Sigma, nor vice versa.
• Copyright © 2017 TWO SIGMA INVESTMENTS, LP. All rights reserved.
About Me
• Li Jin (@icexelloss)
• Software Engineer @ Two Sigma Investments
• Apache Arrow Committer
• Analytics Tools Smith
• Other Open Source Projects:
• Flint: A Time Series Library on Spark
• Cook: A Fair Scheduler on Mesos
This Talk
• PySpark Overview
• PySpark UDF: current state and limitations
• Apache Arrow Overview
• Improvements to PySpark UDF with Apache Arrow
• Future Roadmap
PySpark Overview
Apache Spark
• A tool for distributed data analysis
• An Apache project
• JVM-based, with a Python interface (PySpark)
• Functionality:
• Relational: join, group, aggregate …
• Stats and ML: Spark MLlib
• Streaming
• …
Why Spark
• Bigger data:
• Pandas: ~10 GB
• Spark: ~1000 GB
• Better parallelism:
• Pandas: single core
• Spark: hundreds of cores
PySpark Overview
• Python interface for Spark
• API front-end for built-in Spark functions
• df.withColumn('v2', df.v1 + 1)
• Translated to Java code, running in the JVM
• Interface for native Python code (user-defined functions)
• df.withColumn('v2', udf(lambda x: x + 1, 'double')(df.v1))
• Runs in the Python runtime
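The two withColumn calls above are the crux of the distinction. Below is a hedged, runnable sketch of both (the DataFrame setup is assumed, not from the slides); the float() cast is added so the row UDF's return value matches its declared 'double' type.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf

spark = SparkSession.builder.getOrCreate()
df = spark.range(0, 5).withColumnRenamed("id", "v1")

# Built-in function: the expression is translated to JVM code on the executor.
df.withColumn("v2", df.v1 + 1).show()

# Row UDF: the lambda runs in a Python worker process, one row at a time.
plus_one = udf(lambda x: float(x + 1), "double")
df.withColumn("v2", plus_one(df.v1)).show()
```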
PySpark UDF: Current State and Limitations
PySpark User-Defined Function (UDF)
• PySpark's interface for interacting with other Python libraries
• Types of UDFs:
• Row UDF
• Group UDF
Row UDF: Current
• Operates on a row-by-row basis
• Similar to the `map` operator
• Examples:
• String processing
• Timestamp processing
• Poor performance
• 1–2 orders of magnitude slower compared to alternatives (built-in Spark functions or vectorized operations)
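As a concrete illustration of the string and timestamp examples above, here is a hedged sketch (the data and column names are made up, not from the slides) of typical row UDFs; each lambda is invoked once per row in the Python worker.

```python
import datetime
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("Alice Smith", datetime.datetime(2017, 11, 1, 9, 30))],
    ["name", "ts"],
)

last_name = udf(lambda s: s.split()[-1], "string")        # string processing
to_month = udf(lambda t: t.strftime("%Y-%m"), "string")   # timestamp processing

df.select(last_name(df.name).alias("last"), to_month(df.ts).alias("month")).show()
```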
Group UDF: Current
• A UDF that operates on multiple rows
• Similar to `groupBy` followed by the `map` operator
• Example:
• Monthly weighted mean
• Not supported out of the box
• Poor performance
Group UDF: Example
• Normalize values within each group: (values - values.mean()) / values.std()
• With the current API, roughly 80% of the code is boilerplate, and it is slow (a hedged sketch of this older style follows below).
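The original slides show the "before" code as screenshots; below is a hedged reconstruction of that pre-2.3 style (not the exact code from the slides): drop down to the RDD API, group by key, rebuild a pandas DataFrame per group, and convert back. Only one line is the actual computation; the rest is boilerplate.

```python
import pandas as pd
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("a", 1.0), ("a", 2.0), ("b", 3.0), ("b", 5.0), ("b", 10.0)],
    ["id", "v"],
)

def normalize(group):
    key, rows = group
    pdf = pd.DataFrame(list(rows), columns=["id", "v"])     # rebuild a pandas frame (boilerplate)
    pdf["v"] = (pdf.v - pdf.v.mean()) / pdf.v.std()         # the actual logic
    return [Row(id=r.id, v=float(r.v)) for r in pdf.itertuples()]  # back to Rows (boilerplate)

result = df.rdd.map(lambda r: (r.id, r)).groupByKey().flatMap(normalize)
spark.createDataFrame(result, df.schema).show()
```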
UDF Issues
• Inefficient data movement between Java and Python (serialization / deserialization)
• Scalar computation model
Apache Arrow
• In-memory columnar format
• Building on the success of Parquet
• A standard from the start:
• Developers from 13+ major open source projects involved, including Calcite, Cassandra, Deeplearning4j, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, Storm, and R
• Benefits:
• Share the effort
• Create an ecosystem
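As a small, hedged illustration of the shared format (not from the slides; pyarrow and the sample data are assumptions), a pandas DataFrame can be moved into and out of Arrow's columnar representation:

```python
import pandas as pd
import pyarrow as pa

pdf = pd.DataFrame({"id": ["a", "b", "c"], "v": [1.0, 2.0, 3.0]})

table = pa.Table.from_pandas(pdf)     # pandas -> Arrow columnar table
print(table.schema)                   # the schema travels with the data
print(table.to_pandas().equals(pdf))  # Arrow -> pandas round trip: True
```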
High Performance Sharing & Interchange
[Diagram: data interchange between systems before Arrow vs. with Arrow as a shared in-memory format]
Columnar Data Format
persons = [{
name: ’Joe',
age: 18,
phones: [
‘555-111-1111’,
‘555-222-2222’
]
}, {
name: ’Jack',
age: 37,
phones: [ ‘555-333-3333’
]
}]
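A hedged sketch (not from the slides; pyarrow is an assumption) of how these nested records land in Arrow's columnar layout: the phones column becomes one contiguous list-of-string vector rather than per-record Python lists.

```python
import pyarrow as pa

persons = [
    {"name": "Joe", "age": 18, "phones": ["555-111-1111", "555-222-2222"]},
    {"name": "Jack", "age": 37, "phones": ["555-333-3333"]},
]

table = pa.Table.from_pylist(persons)   # built column-at-a-time, not row-at-a-time
print(table.schema)                     # name: string, age: int64, phones: list<item: string>
print(table.column("phones"))           # values stored contiguously with list offsets
```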
Record Batch Construction
An Arrow stream starts with a Schema, followed by optional Dictionary Batches and a sequence of Record Batches. For the example record { name: 'Joe', age: 18, phones: ['555-111-1111', '555-222-2222'] }, each record batch carries a data header (describing offsets into the data) plus, per column, a validity bitmap and its buffers: name (offset, data), age (data), phones (list offset, data).
• Each box (vector) is contiguous memory
• The entire record batch is contiguous on the wire
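To make the bitmap / offset / data buffers above concrete, here is a hedged pyarrow sketch (not from the slides) that prints the buffers backing a single string vector:

```python
import pyarrow as pa

names = pa.array(["Joe", "Jack"])
# For a string vector: [validity bitmap (None when there are no nulls),
#                       int32 offsets, UTF-8 data], each a contiguous buffer.
for buf in names.buffers():
    print(buf)
```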
In-Memory Columnar Format for Speed
• Maximize CPU throughput:
• Pipelining
• SIMD
• Cache locality
• Scatter/gather I/O
Results
• PySpark "toPandas" improvement: 53x speedup
• Streaming Arrow performance: 7.75 GB/s data movement
• Arrow Parquet C++ integration: 4 GB/s reads
• Pandas integration: 9.71 GB/s
Read more at http://arrow.apache.org/blog/
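The toPandas speedup comes from shipping columns as Arrow record batches instead of pickled rows. Here is a hedged sketch of that path in Spark 2.3 (the DataFrame itself is made up), where the Arrow conversion is opt-in:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.execution.arrow.enabled", "true")  # opt in (Spark 2.3)

df = spark.range(1000 * 1000).selectExpr("id", "rand() as v")
pdf = df.toPandas()   # data crosses the JVM/Python boundary as Arrow batches
print(pdf.shape)
```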
Improving PySpark UDF
Vectorizing Row UDF
How PySpark UDF works
[Diagram: the executor (JVM) pickles rows and streams them to a Python worker; the UDF maps Row -> Row; result rows are pickled and streamed back to the executor.]
Recap: Current issues with UDF
• Inefficient data movement (serialization / deserialization)
• Scalar computation model
Profiling `lambda x: x+1`
• Throughput: 8 Mb/s
• 91.8% of time spent in Ser/Deser
Vectorized UDF
[Diagram: the executor converts rows to Arrow record batches (RB) and streams them to the Python worker; the UDF maps pd.DataFrame -> pd.DataFrame; record batches are converted back to rows.]
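A hedged sketch of a vectorized (pandas) UDF using the Spark 2.3 API names (pandas_udf with PandasUDFType.SCALAR); the pre-release API shown in the talk differed slightly. The function receives and returns pandas Series, one Arrow batch at a time:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.getOrCreate()
df = spark.range(0, 1000).withColumnRenamed("id", "v1")

@pandas_udf("double", PandasUDFType.SCALAR)
def plus_one(v1):
    # v1 is a pandas Series covering a whole Arrow record batch.
    return (v1 + 1).astype("float64")

df.withColumn("v2", plus_one(df.v1)).show(3)
```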
Row UDF vs Vectorized UDF
• ~20x speedup (profiler overhead adjusted; actual runtime for the row UDF is 2s without profiling)
• Ser/Deser overhead removed
• Fewer system calls, faster I/O
Improving Group UDF
Introduce Group UDF
• Split-apply-combine:
• Break a problem into smaller pieces
• Operate on each piece independently
• Put all pieces back together
• A common pattern supported in SQL, Spark, Pandas, R …
Split-Apply-Combine (UDF)
• Split: groupBy
• Apply: UDF (pd.DataFrame -> pd.DataFrame)
• Combine: inherently done by Spark
Introduce groupBy().apply()
[Diagram: groupBy shuffles rows into groups; each group is passed to the UDF as a pd.DataFrame and a pd.DataFrame is returned.]
Previous Example
• (values - values.mean()) / values.std()
Group UDF: Before and After
[Slide shows the "before" and "after" code side by side; a hedged sketch of the "after" version follows below.]
For the updated API, see: https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
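This sketch of the "after" side uses the released Spark 2.3 names (pandas_udf with PandasUDFType.GROUPED_MAP and groupBy().apply()); the talk's pre-release spelling differed slightly, and the DataFrame here is made up:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("a", 1.0), ("a", 2.0), ("b", 3.0), ("b", 5.0), ("b", 10.0)],
    ["id", "v"],
)

@pandas_udf(df.schema, PandasUDFType.GROUPED_MAP)
def normalize(pdf):
    # pdf is the full pandas DataFrame for one group.
    return pdf.assign(v=(pdf.v - pdf.v.mean()) / pdf.v.std())

df.groupBy("id").apply(normalize).show()
```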
Performance
Reference: https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
Try It!
• Available in the upcoming Apache Spark 2.3 release
• Try it with Databricks Community Edition:
• https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
Future Roadmap
• Improving PySpark/Pandas interoperability (SPARK-22216)
• Working towards the Arrow 1.0 release
• More Arrow integration
Get involved
• dev@spark.apache.org
• dev@arrow.apache.org
Collaborators
Bryan Cutler
Hyukjin Kwon
Jeff Reback
Leif Walsh
Li Jin
Liang-Chi Hsieh
Reynold Xin
Takuya Ueshin
Wenchen Fan
Wes McKinney
Xiao Li
Questions