Parquet overview
Julien Le Dem
Twitter
http://parquet.github.com
Format

 •   Schema definition: for the binary representation.

 •   Layout: currently PAX; supports one file per column once Hadoop allows a block placement policy.

 •   Not Java-centric: encodings, compression codecs, etc. are enums, not Java class names, i.e. formally defined. Impala reads Parquet files.

 •   Footer: contains the column chunk offsets (see the sketch below).

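For illustration, a minimal sketch of reading the footer and printing each column chunk's offset using the parquet-mr Java API. The org.apache.parquet package names come from releases newer than this deck, and FooterDump is a hypothetical example class; treat the exact signatures as an approximation.

  // Hypothetical example: print each row group's column chunk offsets from the footer.
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.parquet.hadoop.ParquetFileReader;
  import org.apache.parquet.hadoop.metadata.BlockMetaData;
  import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
  import org.apache.parquet.hadoop.util.HadoopInputFile;

  public class FooterDump {
    public static void main(String[] args) throws Exception {
      Path path = new Path(args[0]);
      try (ParquetFileReader reader =
               ParquetFileReader.open(HadoopInputFile.fromPath(path, new Configuration()))) {
        for (BlockMetaData rowGroup : reader.getFooter().getBlocks()) {
          System.out.println("row group: " + rowGroup.getRowCount() + " rows");
          for (ColumnChunkMetaData chunk : rowGroup.getColumns()) {
            System.out.println("  column " + chunk.getPath()
                + " at offset " + chunk.getStartingPos()
                + ", " + chunk.getTotalSize() + " bytes");
          }
        }
      }
    }
  }

Because the footer carries these offsets, a reader can seek directly to the chunks of the columns it needs without scanning the rest of the file.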
Format

 •   Row group: a group of rows in columnar format.
     •   Max size buffered in memory while writing.
     •   One (or more) per split while reading.
     •   Roughly: 10 MB < row group < 1 GB (see the sizing sketch after this slide).

 •   Column chunk: the data for one column in a row group.
     •   Column chunks can be read independently for efficient scans.

 •   Page: the unit of compression within a column chunk.
     •   Should be big enough for compression to be efficient.
     •   Minimum size to read to access a single record (when index pages are available).
     •   Roughly: 8 KB < page < 100 KB.

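As a sketch of how these sizes are tuned in practice, parquet-mr exposes job-level setters on ParquetOutputFormat. The values below are arbitrary examples within the ranges above, not recommendations from the deck, and the API names come from later parquet-mr releases.

  // Hypothetical write-side tuning of row group and page sizes for a Hadoop job.
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.parquet.hadoop.ParquetOutputFormat;

  public class WriteTuning {
    public static void main(String[] args) throws Exception {
      Job job = Job.getInstance();
      // Row group ("block") size: how much is buffered in memory per writer.
      ParquetOutputFormat.setBlockSize(job, 256 * 1024 * 1024); // 256 MB
      // Page size: the unit of compression inside each column chunk.
      ParquetOutputFormat.setPageSize(job, 64 * 1024);          // 64 KB
    }
  }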
Dremel’s shredding/assembly

Schema:
message Document {
  required int64 DocId;
  optional group Links {
    repeated int64 Backward;
    repeated int64 Forward; }
  repeated group Name {
    repeated group Language {
      required string Code;
      optional string Country; }
    optional string Url; }}

Columns:
DocId
Links.Backward
Links.Forward
Name.Language.Code
Name.Language.Country
Name.Url

Reference: http://research.google.com/pubs/pub36632.html

• Each cell is encoded as a triplet: repetition level, definition level, value (worked example below).
• This allows reconstructing the nested records.
• Level values are bounded by the depth of the schema: they are stored in a compact form.

Example:                         Max repetition level    Max definition level

DocId                                     0                       0
Links.Backward                            1                       2
Links.Forward                             1                       2
Name.Language.Code                        2                       2
Name.Language.Country                     2                       3
Name.Url                                  1                       2

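A worked example, adapted from the Dremel paper's sample data, of how one Document record shreds into the Name.Language.Code column (max repetition level 2, max definition level 2):

  Record:
    DocId: 10
    Name
      Language
        Code: 'en-us'
      Language
        Code: 'en'
      Url: 'http://A'
    Name
      Url: 'http://B'

  Name.Language.Code triplets:
    r=0  d=2  value='en-us'   (first value in the record)
    r=2  d=2  value='en'      (repeated Language inside the same Name)
    r=1  d=1  value=NULL      (second Name is defined, but has no Language)

During assembly, the repetition level tells the reader at which repeated field to resume (a new Name vs. a new Language), and the definition level tells it how far down the path the optional/repeated fields were actually present.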
Abstractions

 •   Column layer:
     •   Iteration on triplets: repetition level, definition level, value (see the sketch below).
     •   Repetition level = 0 indicates a new record.
     •   Once dictionary encoding and other compact encodings are implemented, iteration can operate on encoded or decoded values.

 •   Record layer:
     •   Iteration on fully assembled records.
     •   Provides assembled records for any subset of the columns, so that only the columns actually accessed are loaded.

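A sketch of the column-layer triplet iteration using parquet-mr's column API. It assumes the first column is an int64 field (e.g. DocId); the class and method names are from later releases of the library and should be read as an approximation of the iteration described above, not as this deck's exact API.

  // Sketch: iterate the (repetition level, definition level, value) triplets of the
  // first column of each row group.
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.parquet.column.ColumnDescriptor;
  import org.apache.parquet.column.ColumnReader;
  import org.apache.parquet.column.impl.ColumnReadStoreImpl;
  import org.apache.parquet.column.page.PageReadStore;
  import org.apache.parquet.example.data.simple.convert.GroupRecordConverter;
  import org.apache.parquet.hadoop.ParquetFileReader;
  import org.apache.parquet.hadoop.util.HadoopInputFile;
  import org.apache.parquet.schema.MessageType;

  public class TripletScan {
    public static void main(String[] args) throws Exception {
      try (ParquetFileReader reader = ParquetFileReader.open(
               HadoopInputFile.fromPath(new Path(args[0]), new Configuration()))) {
        MessageType schema = reader.getFooter().getFileMetaData().getSchema();
        String createdBy = reader.getFooter().getFileMetaData().getCreatedBy();
        PageReadStore rowGroup;
        while ((rowGroup = reader.readNextRowGroup()) != null) {
          ColumnReadStoreImpl store = new ColumnReadStoreImpl(
              rowGroup, new GroupRecordConverter(schema).getRootConverter(), schema, createdBy);
          ColumnDescriptor column = schema.getColumns().get(0); // assumed int64, e.g. DocId
          ColumnReader triplets = store.getColumnReader(column);
          for (long i = 0; i < triplets.getTotalValueCount(); i++) {
            int r = triplets.getCurrentRepetitionLevel(); // r == 0 marks the start of a new record
            int d = triplets.getCurrentDefinitionLevel();
            String value = (d == column.getMaxDefinitionLevel())
                ? String.valueOf(triplets.getLong()) : "NULL";
            System.out.println("r=" + r + " d=" + d + " value=" + value);
            triplets.consume();
          }
        }
      }
    }
  }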
Extensibility

 •   Schema conversion:
     •   Hadoop does not have a notion of schema.
     •   However, Pig, Hive, Thrift, Avro, Protocol Buffers, etc. do.

 •   Record materialization:
     •   Pluggable record materialization layer (see the sketch below).
     •   No double conversion.
     •   SAX-style, event-based API.

 •   Encodings:
     •   Extensible encoding definitions.
     •   Planned: dictionary encoding, zigzag, RLE, ...

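The event-based API referred to here is parquet-mr's Converter/RecordMaterializer layer. Below is a minimal sketch: the reader pushes start/end events for groups and typed add* calls for values, and the materializer builds the target object directly, with no intermediate representation. LineMaterializer and its flat string output are hypothetical, and the org.apache.parquet package names are from later releases.

  // Sketch of a pluggable record materializer built on the SAX-style converter events.
  import org.apache.parquet.io.api.Binary;
  import org.apache.parquet.io.api.Converter;
  import org.apache.parquet.io.api.GroupConverter;
  import org.apache.parquet.io.api.PrimitiveConverter;
  import org.apache.parquet.io.api.RecordMaterializer;

  public class LineMaterializer extends RecordMaterializer<String> {
    private final StringBuilder current = new StringBuilder();

    private final GroupConverter root = new GroupConverter() {
      @Override public Converter getConverter(int fieldIndex) {
        // A real materializer would inspect the schema and return nested GroupConverters
        // for group fields; this flat sketch handles only top-level primitives.
        return new PrimitiveConverter() {
          @Override public void addLong(long value) { current.append(value).append(' '); }
          @Override public void addBinary(Binary value) {
            current.append(value.toStringUsingUTF8()).append(' ');
          }
        };
      }
      @Override public void start() { current.setLength(0); } // a new record begins
      @Override public void end() { }                         // the record is complete
    };

    @Override public String getCurrentRecord() { return current.toString(); }
    @Override public GroupConverter getRootConverter() { return root; }
  }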