Efficient and portable DataFrame
storage with Apache Parquet
Uwe L. Korn, PyData London 2017
About me
• Data Scientist at Blue Yonder (@BlueYonderTech)
• Apache {Arrow, Parquet} PMC
• Work in Python, Cython, C++11 and SQL
• Heavy Pandas user
• xhochy
• uwe@apache.org
Agenda
• History of Apache Parquet
• The format in detail
• Use it in Python
About Parquet
1. Columnar on-disk storage format
2. Started in fall 2012 by Cloudera & Twitter
3. July 2013: 1.0 release
4. 2015: top-level Apache project
5. Fall 2016: Python & C++ support
6. State-of-the-art format in the Hadoop ecosystem
• often used as the default I/O option
Why use Parquet?
1. Columnar format
—> enables vectorized operations
2. Efficient encodings and compressions
—> small file sizes without heavy CPU cost
3. Query push-down
—> bring computation to the I/O layer
4. Language-independent format
—> libraries in Java / Scala / C++ / Python / …
Who uses Parquet?
• Query engines: Hive, Impala, Drill, Presto, …
• Frameworks: Spark, MapReduce, …
• Pandas, Dask
File Structure
• File
  • RowGroup
    • Column Chunk
      • Page
• Statistics are stored per column chunk
• Know the data
• Exploit the knowledge
• Cheaper than universal compression
• Example dataset:
• NYC TLC Trip Record data for January 2016
• 1629 MiB as CSV
• columns: bool(1), datetime(2), float(12), int(4)
• Source: http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
Encodings — PLAIN
• Simply write the binary representation to disk
• Simple to read & write
• Performance limited by I/O throughput
• —> 1499 MiB
Encodings — RLE & Bit Packing
• Bit packing: use only as many bits as the values need
• Run-length encoding (RLE): store "378 times the value 12" as one pair instead of 378 copies
• Hybrid: dynamically choose whichever is smaller
• Used for definition & repetition levels
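To make the idea concrete, here is a toy run-length encoder in plain Python. This is a conceptual sketch only; Parquet's actual hybrid RLE/bit-packing format is a compact binary encoding, not a list of tuples:

```python
def rle_encode(values):
    """Collapse runs of repeated values into (count, value) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][1] == v:
            runs[-1][0] += 1          # extend the current run
        else:
            runs.append([1, v])       # start a new run
    return [(count, value) for count, value in runs]

def rle_decode(runs):
    """Expand (count, value) pairs back into the original sequence."""
    return [value for count, value in runs for _ in range(count)]

# 378 repetitions of 12 shrink to a single (378, 12) pair.
encoded = rle_encode([12] * 378 + [7, 7, 3])
# encoded == [(378, 12), (2, 7), (1, 3)]
```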
Encodings — Dictionary
• PLAIN_DICTIONARY / RLE_DICTIONARY
• each distinct value is assigned an integer code
• Dictionary: store a map of code —> value
• Data: store only the codes, with RLE applied on top
• —> 329 MiB (22% of the PLAIN size)
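A toy version of the same idea in plain Python (conceptual sketch only; the real encoding stores the dictionary in a dedicated page and bit-packs the codes):

```python
def dictionary_encode(values):
    """Replace each value by a small integer code plus a lookup table."""
    dictionary = []   # code -> value
    codes = []
    seen = {}         # value -> code
    for v in values:
        if v not in seen:
            seen[v] = len(dictionary)
            dictionary.append(v)
        codes.append(seen[v])
    return dictionary, codes

dictionary, codes = dictionary_encode(["NYC", "NYC", "BOS", "NYC"])
# dictionary == ["NYC", "BOS"], codes == [0, 0, 1, 0]
# Parquet then applies RLE/bit-packing to the codes, not the raw values.
```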
Compression
1. Shrinks data size independent of its content
2. More CPU-intensive than encoding
3. encoding + compression achieves better ratios than compression alone, at lower CPU cost
4. Codecs: LZO, Snappy, GZIP, Brotli
—> If in doubt: use Snappy
5. GZIP: 174 MiB (11%)
Snappy: 216 MiB (14%)
Query pushdown
1. Only load used data
1. skip columns that are not needed
2. skip (chunks of) rows that are not relevant
2. saves I/O load as the data is not transferred
3. saves CPU as the data is not decoded
Benchmarks (size)
Benchmarks (time)
Benchmarks (size vs time)
Read & Write Parquet
https://arrow.apache.org/docs/python/parquet.html
Alternative implementation: https://fastparquet.readthedocs.io/en/latest/
Apache Arrow?
• Specification for an in-memory columnar data layout
• No overhead for cross-system communication
• Designed for efficiency (exploits SIMD, cache locality, …)
• Exchange data without conversion between Python, C++, C (GLib), Ruby, Lua, R and the JVM
• This brought Parquet to Pandas without any Python code in parquet-cpp
• Just released: Arrow 0.3
Get Involved!
Apache Arrow — cross-language DataFrame library
• Website: https://arrow.apache.org/
• ML: dev@arrow.apache.org
• Issues & Tasks: https://issues.apache.org/jira/browse/ARROW
• Slack: https://apachearrowslackin.herokuapp.com/
• GitHub mirror: https://github.com/apache/arrow
Apache Parquet — famous columnar file format
• Website: https://parquet.apache.org/
• ML: dev@parquet.apache.org
• Issues & Tasks: https://issues.apache.org/jira/browse/PARQUET
• Slack: https://parquet-slack-invite.herokuapp.com/
• C++ GitHub mirror: https://github.com/apache/parquet-cpp
Blue Yonder — Best decisions, delivered daily

Blue Yonder GmbH
Ohiostraße 8
76149 Karlsruhe
Germany
+49 721 383117 0

Blue Yonder Software Limited
19 Eastbourne Terrace
London, W2 6LG
United Kingdom
+44 20 3626 0360

Blue Yonder Analytics, Inc.
5048 Tennyson Parkway
Suite 250
Plano, Texas 75024
USA