1
Connecting PyData to other Big Data
Landscapes using Arrow and Parquet
Uwe L. Korn, PyCon.DE 2017
2
About me
• Data Scientist & Architect at Blue Yonder (@BlueYonderTech)
• Apache {Arrow, Parquet} PMC
• Work in Python, Cython, C++11 and SQL
• Heavy Pandas user
xhochy
uwe@apache.org
3
Python is a good companion for a Data Scientist
…but there are other ecosystems out there.
4
Why do I care?
• Large set of files on a distributed filesystem
• Non-uniform schema
• Execute a query
• Only a subset is interesting
…not in Python
5
All are amazing but…
How to get my data out of Python and back in again?
Use Parquet!
…but there was no fast Parquet access 2 years ago.
6
A general problem
• Great interoperability inside ecosystems
• Often based on a common backend (e.g. NumPy)
• Poor integration to other systems
• CSV is your only resort
• "We need to talk!"
• Memory copy is about 10 GiB/s
• (De-)serialisation comes on top
7
Columnar Data
Image source: https://arrow.apache.org/img/simd.png ( https://arrow.apache.org/ )
8
Apache Parquet
9
About Parquet
1. Columnar on-disk storage format
2. Started in fall 2012 by Cloudera & Twitter
3. July 2013: 1.0 release
4. Top-level Apache project
5. Fall 2016: Python & C++ support
6. State-of-the-art format in the Hadoop ecosystem
• often used as the default I/O option
10
Why use Parquet?
1. Columnar format → vectorized operations
2. Efficient encodings and compressions → small size without the need for a fat CPU
3. Predicate push-down → bring computation to the I/O layer
4. Language-independent format → libs in Java / Scala / C++ / Python / …
Compression
1. Shrinks data size independent of its content
2. More CPU-intensive than encoding
3. Encoding + compression performs better than compression alone, at less CPU cost
4. LZO, Snappy, GZIP, Brotli → if in doubt, use Snappy
5. GZIP: 174 MiB (11%), Snappy: 216 MiB (14%)
Predicate pushdown
1. Only load used data
• skip columns that are not needed
• skip (chunks of) rows that are not relevant
2. Saves I/O as the data is not transferred
3. Saves CPU as the data is not decoded
Which products are sold in $?
File Structure
• File
  • RowGroup
    • Column Chunk
      • Page
• Statistics
Read & Write Parquet
14
https://arrow.apache.org/docs/python/parquet.html
Alternative Implementation: https://fastparquet.readthedocs.io/en/latest/
Read & Write Parquet
15
Pandas 0.21 will bring
pd.read_parquet(…)
df.to_parquet(…)
http://pandas.pydata.org/pandas-docs/version/0.21/io.html#io-parquet
16
Save in one, load in another ecosystem
…but always persist the intermediate.
17
Zero-Copy DataFrames
2.57s
Converting 1 million longs (8 MiB)
from Spark to PySpark
18
19
Apache Arrow
• Specification for in-memory columnar data layout
• No overhead for cross-system communication
• Designed for efficiency (exploit SIMD, cache locality, ..)
• Exchange data without conversion between Python, C++, C(glib),
Ruby, Lua, R, JavaScript and the JVM
• This brought Parquet to Pandas without any Python code in
parquet-cpp
20
Dissecting Arrow C++
• General zero-copy memory management
• jemalloc as the base allocator
• Columnar memory format & metadata
• Schema & DataType
• Columns & Table
21
Dissecting Arrow C++
• Structured data IPC (inter-process communication)
• used in Spark for JVM<->Python
• future extensions include: gRPC backend, shared memory communication, …
• Columnar in-memory analytics
• be the backbone of Pandas 2.0
0.05s
Converting 1 million longs
from Spark to PySpark, with Arrow
22
https://github.com/apache/spark/pull/15821#issuecomment-282175163
23
Apache Arrow – Real life improvement
Real life example!
Retrieve a dataset from an MPP database and analyze it in Pandas
1. Run a query in the DB
2. Pass it in columnar form to the DB driver
3. The ODBC layer transforms it into row-wise form
4. Pandas makes it columnar again
Ugly real-life solution: export as CSV, bypass ODBC
24
Better solution: Turbodbc with Arrow support
1. Retrieve columnar results
2. Pass them in a columnar fashion to Pandas
More systems in the future (without the ODBC overhead)
See also Michael’s talk tomorrow: Turbodbc: Turbocharged database
access for data scientists
Apache Arrow – Real life improvement
25
Ray
GPU Open Analytics Initiative
26
https://blogs.nvidia.com/blog/2017/09/22/gpu-data-frame/
27
Get Involved!
Apache Arrow: cross-language DataFrame library
• Website: https://arrow.apache.org/
• ML: dev@arrow.apache.org
• Issues & Tasks: https://issues.apache.org/jira/browse/ARROW
• Slack: https://apachearrowslackin.herokuapp.com/
• GitHub: https://github.com/apache/arrow
Apache Parquet: famous columnar file format
• Website: https://parquet.apache.org/
• ML: dev@parquet.apache.org
• Issues & Tasks: https://issues.apache.org/jira/browse/PARQUET
• Slack: https://parquet-slack-invite.herokuapp.com/
• GitHub: https://github.com/apache/parquet-cpp
Blue Yonder GmbH
Ohiostraße 8
76149 Karlsruhe
Germany
+49 721 383117 0
Blue Yonder Software Limited
19 Eastbourne Terrace
London, W2 6LG
United Kingdom
+44 20 3626 0360
Blue Yonder
Best decisions,
delivered daily
Blue Yonder Analytics, Inc.
5048 Tennyson Parkway
Suite 250
Plano, Texas 75024
USA
28
