How to use Parquet
as a basis for ETL and analytics
Julien Le Dem @J_
Analytics Data Pipeline tech lead, Data Platform
@ApacheParquet
Outline
- Instrumentation and data collection
- Storing data efficiently for analysis
- Openness and Interoperability
Instrumentation and data collection
Typical data flow
Instrumented services handle requests from (happy) users and apply mutations to mutable serving stores. Two paths feed data collection:
- Log collection: instrumentation events flow into a streaming log (Kafka, Scribe, Chukwa, ...) with a schema; the log feeds streaming analysis directly and is periodically consolidated into snapshots.
- Periodic snapshots: the mutable serving stores are pulled on a schedule.
Both paths land in storage (HDFS) in a query-efficient format: Parquet. From there the data is consumed by ad-hoc queries (Impala, Hive, Drill, ...), automated dashboards, batch computation (graph, machine learning, ...), and streaming computation (Storm, Samza, Spark Streaming, ...). The result: a happy data scientist.
Storing data for analysis
Producing a lot of data is easy
Producing a lot of derived data is even easier.
Solution: Compress all the things!
Scanning a lot of data is easy
1% completed
... but not necessarily fast.
Waiting is not productive. We want faster turnaround.
Compression, but not at the cost of reading speed.
Interoperability: not that easy
We need a storage format that is interoperable with all the tools we use and keeps our options open for the next big thing.
Enter Apache Parquet
Parquet design goals
- Interoperability
- Space efficiency
- Query efficiency
Efficiency
Columnar storage
Logical table representation (nested schema with columns a, b, c):
a1 b1 c1
a2 b2 c2
a3 b3 c3
a4 b4 c4
a5 b5 c5
Row layout: a1 b1 c1 a2 b2 c2 a3 b3 c3 a4 b4 c4 a5 b5 c5
Column layout: a1 a2 a3 a4 a5 b1 b2 b3 b4 b5 c1 c2 c3 c4 c5
Each column is then encoded separately (one encoded chunk per column).
Parquet nested representation
Schema (the Document example, borrowed from the Google Dremel paper):
Document
  DocId
  Links
    Backward
    Forward
  Name
    Language
      Code
      Country
    Url
Columns:
docid
links.backward
links.forward
name.language.code
name.language.country
name.url
https://blog.twitter.com/2013/dremel-made-simple-with-parquet
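The same schema written out in Parquet's schema syntax, as a sketch following the notation of the Dremel paper and the post linked above (field repetition is required, optional, or repeated):

message Document {
  required int64 DocId;
  optional group Links {
    repeated int64 Backward;
    repeated int64 Forward;
  }
  repeated group Name {
    repeated group Language {
      required string Code;
      optional string Country;
    }
    optional string Url;
  }
}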
Statistics for filter and query optimization
Vertical partitioning (projection push down) + horizontal partitioning (predicate push down) = read only the data you need!
Projection push down reads only the columns a query references; predicate push down uses the statistics stored with each column chunk to skip rows and row groups that cannot match the filter.
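A minimal sketch of predicate push down through parquet-mr's filter2 API, using the current org.apache.parquet package names (which postdate this deck) and reading through the Avro bindings. The flat addresses layout and the column names city and zip are assumptions for illustration:

import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.filter2.compat.FilterCompat;
import org.apache.parquet.filter2.predicate.FilterPredicate;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.io.api.Binary;
import static org.apache.parquet.filter2.predicate.FilterApi.*;

public class ZipFilterExample {
  public static void main(String[] args) throws Exception {
    // Row groups whose column statistics rule out zip == "94707" are skipped entirely.
    FilterPredicate zipFilter = eq(binaryColumn("zip"), Binary.fromString("94707"));
    try (ParquetReader<GenericRecord> reader =
             AvroParquetReader.<GenericRecord>builder(new Path("/my/data/addresses"))
                 .withFilter(FilterCompat.get(zipFilter))
                 .build()) {
      GenericRecord address;
      while ((address = reader.read()) != null) {
        System.out.println(address.get("city"));
      }
    }
  }
}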
Properties of efficient encodings
- Minimize CPU pipeline bubbles:
  highly predictable branching
  reduce data dependency
- Minimize CPU cache misses:
  reduce size of the working set
The right encoding for the right job
- Delta encodings:
  for sorted datasets or signals where the variation is less important than the absolute value (timestamps, auto-generated ids, metrics, ...). Focuses on avoiding branching.
- Prefix coding (delta encoding for strings):
  when dictionary encoding does not work.
- Dictionary encoding:
  small (60K) set of values (server IP, experiment id, ...).
- Run Length Encoding:
  repetitive data.
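A toy illustration of the idea behind delta encoding (not Parquet's actual encoder): for sorted, auto-generated ids, the deltas are small, repetitive values that a bit-packed or run length encoding then stores very compactly.

public class DeltaSketch {
  public static void main(String[] args) {
    long[] ids = {1001, 1002, 1003, 1004, 1006};   // sorted, auto-generated ids
    long[] deltas = new long[ids.length];
    deltas[0] = ids[0];                            // keep the first value as-is
    for (int i = 1; i < ids.length; i++) {
      deltas[i] = ids[i] - ids[i - 1];             // {1001, 1, 1, 1, 2}
    }
    // The deltas fit in a couple of bits each instead of 64, and the run of 1s
    // is exactly the repetitive data that run length encoding targets.
  }
}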
Interoperability
Interoperable
Model agnostic and language agnostic.
Object models (Avro, Thrift, Protocol Buffers, Pig Tuple, Hive SerDe, ...) plug in through converters (parquet-avro, parquet-thrift, parquet-proto, parquet-pig, parquet-hive, ...) that handle the assembly/striping of nested records on top of the column encoding.
The Parquet file format itself is independent of both: Java implementations and C++ query engines such as Impala execute queries against the same files.
Frameworks and libraries integrated with Parquet
Query engines: Hive, Impala, HAWQ, IBM Big SQL, Drill, Tajo, Pig, Presto
Frameworks: Spark, MapReduce, Cascading, Crunch, Scalding, Kite
Data models: Avro, Thrift, ProtocolBuffers, POJOs
Schema management
Schema in Hadoop
Hadoop does not define a standard notion of schema, but there are many available:
- Avro
- Thrift
- Protocol Buffers
- Pig
- Hive
- ...
And they are all different.
What they define
Schema:
- Structure of a record
- Constraints on the type
Row oriented binary format:
- How records are represented, one at a time
What they *do not* define
Column oriented binary format:
Parquet reuses the schema definitions and provides a common column oriented binary format.
Example: address book
An AddressBook contains a repeated field, addresses; each Address has street, city, state, zip, and an optional comment.
Protocol Buffers

message AddressBook {
  repeated group addresses = 1 {
    required string street = 2;
    required string city = 3;
    required string state = 4;
    required string zip = 5;
    optional string comment = 6;
  }
}

Fields have ids and can be optional, required or repeated.
Lists are repeated fields.

- Allows recursive definition
- Types: Group or primitive
- Binary format refers to field ids only => renaming fields does not impact the binary format
- Requires installing a native compiler separate from your build
Thrift

struct AddressBook {
  1: required list<Address> addresses;
}
struct Address {
  1: required string street;
  2: required string city;
  3: required string state;
  4: required string zip;
  5: optional string comment;
}

Fields have ids and can be optional or required.
Explicit collection types.

- No recursive definition
- Types: Struct, Map, List, Set, Union or primitive
- Binary format refers to field ids only => renaming fields does not impact the binary format
- Requires installing a native compiler separately from the build
Avro

{
  "type": "record",
  "name": "AddressBook",
  "fields": [{
    "name": "addresses",
    "type": {
      "type": "array",
      "items": {
        "type": "record",
        "name": "Address",
        "fields": [
          {"name": "street", "type": "string"},
          {"name": "city", "type": "string"},
          {"name": "state", "type": "string"},
          {"name": "zip", "type": "string"},
          {"name": "comment", "type": ["null", "string"]}
        ]
      }
    }
  }]
}

Explicit collection types.
null is a type; an optional field is a union with null.

- Allows recursive definition
- Types: Records, Arrays, Maps, Unions or primitive
- Binary format requires knowing the write-time schema
  ➡ more compact but not self descriptive
  ➡ renaming fields does not impact the binary format
- Generator in Java (well integrated in the build)
Write to Parquet
Write to Parquet with Map Reduce

Protocol Buffers:
job.setOutputFormatClass(ProtoParquetOutputFormat.class);
ProtoParquetOutputFormat.setProtobufClass(job, AddressBook.class);

Thrift:
job.setOutputFormatClass(ParquetThriftOutputFormat.class);
ParquetThriftOutputFormat.setThriftClass(job, AddressBook.class);

Avro:
job.setOutputFormatClass(AvroParquetOutputFormat.class);
AvroParquetOutputFormat.setSchema(job, AddressBook.SCHEMA$);
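Outside MapReduce (for tests or small jobs), parquet-avro also exposes a writer directly. A minimal, self-contained sketch, not from the original deck, using the current org.apache.parquet.avro builder API, a simplified flat Address schema, and an illustrative output path:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class WriteAddressesExample {
  // Flat Address record with just city and zip, for illustration only.
  private static final Schema SCHEMA = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"Address\",\"fields\":["
      + "{\"name\":\"city\",\"type\":\"string\"},"
      + "{\"name\":\"zip\",\"type\":\"string\"}]}");

  public static void main(String[] args) throws Exception {
    try (ParquetWriter<GenericRecord> writer =
             AvroParquetWriter.<GenericRecord>builder(new Path("/tmp/addresses.parquet"))
                 .withSchema(SCHEMA)
                 .build()) {
      GenericRecord address = new GenericData.Record(SCHEMA);
      address.put("city", "San Jose");
      address.put("zip", "94707");
      writer.write(address);
    }
  }
}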
Write to Parquet with Scalding

// define the Parquet source
case class AddressBookParquetSource(override implicit val dateRange: DateRange)
  extends HourlySuffixParquetThrift[AddressBook]("/my/data/address_book", dateRange)
// load and transform data
...
pipe.write(ParquetSource())
Write to Parquet with Pig

...
STORE mydata
  INTO 'my/data'
  USING parquet.pig.ParquetStorer();
Query engines
Scalding

loading:
new FixedPathParquetThrift[AddressBook]("my", "data") {
  val city = StringColumn("city")
  override val withFilter: Option[FilterPredicate] =
    Some(city === "San Jose")
}

operations:
p.map( (r) => r.a + r.b )
p.groupBy( (r) => r.c )
p.join
...
Pig

loading:
mydata = LOAD 'my/data' USING parquet.pig.ParquetLoader();

operations:
A = FOREACH mydata GENERATE a + b;
B = GROUP mydata BY c;
C = JOIN A BY a, B BY b;
Hive

loading:
create table parquet_table_name (x INT, y STRING)
ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
STORED AS
  INPUTFORMAT "parquet.hive.MapredParquetInputFormat"
  OUTPUTFORMAT "parquet.hive.MapredParquetOutputFormat";

operations:
SQL
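Hive 0.13 and later also accept a shorthand for the same table definition, so the SerDe and input/output format classes do not have to be spelled out:

create table parquet_table_name (x INT, y STRING) STORED AS PARQUET;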
Impala

loading:
create table parquet_table (x int, y string) stored as parquetfile;
insert into parquet_table select x, y from some_other_table;
select y from parquet_table where x between 70 and 100;

operations:
SQL
Drill

SELECT * FROM dfs.`/my/data`
Spark SQL

loading:
val address = sqlContext.parquetFile("/my/data/addresses")
address.registerTempTable("addresses")  // register so the SQL below can refer to it (registerAsTable before Spark 1.2)

operations:
val result = sqlContext
  .sql("SELECT city FROM addresses WHERE zip == 94707")
result.map((r) => ...)
Community
Parquet timeline
- Fall 2012: Twitter & Cloudera merge efforts to develop columnar formats
- March 2013: OSS announcement; Criteo signs on for Hive integration
- July 2013: 1.0 release. 18 contributors from more than 5 organizations.
- May 2014: Apache Incubator. 40+ contributors, 18 with 1000+ LOC. 26 incremental releases.
- Parquet 2.0 coming as an Apache release
Thank you to our contributors
(contributor lists from the open source announcement and the 1.0 release)
Get involved
Mailing lists:
- dev@parquet.incubator.apache.org
Parquet sync ups:
- Regular meetings on Google Hangout
Questions
Questions.foreach( answer(_) )
@ApacheParquet