Choosing an HDFS data storage format: Avro vs.
Parquet and more
Stephen O’Sullivan | @steveos
© 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
OUR PHILOSOPHY
• Prioritize for highest business value when using emerging technology
• Design with outcomes in mind
• Be agile: deliver initial results quickly, then adapt and iterate
• Collaborate constantly with our customers and partners
AGENDA
Introduction
Data formats
How to choose
Schema evolution
Summary
Questions
Introduction
Data formats
DATA FORMATS
• Storage formats
• What they do
Data format
• Storage Format
• Text
• Sequence File
• Avro
• Parquet
• Optimized Row Columnar (ORC)
Text
• More specifically text = csv, tsv, json records…
• Convenient format to use to exchange with other
applications or scripts that produce or read
delimited files
• Human readable and parsable
• Data storage is bulky and not as efficient to query
• Does not support block compression
Sequence File
• Provides a persistent data structure for binary key-
value pairs
• Row based
• Commonly used to transfer data between MapReduce jobs
• Can be used as an archive to pack small files in
Hadoop
• Supports splitting even when the data is compressed
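One way to picture why a file can stay splittable while its contents are compressed: periodic sync markers give a reader a safe place to start from any byte offset. The sketch below is a toy layout, not the real SequenceFile format. The marker value, the `SYNC_EVERY` interval, and the function names are all assumptions made for illustration, and each value is compressed on its own with zlib.

```python
import io
import struct
import zlib

SYNC = b"\xde\xad\xbe\xef" * 4  # hypothetical 16-byte marker, a toy stand-in
SYNC_EVERY = 3                  # write a marker before every 3rd record (illustrative)


def write_records(pairs):
    """Serialize (key, value) byte pairs with periodic sync markers."""
    out = io.BytesIO()
    for i, (key, value) in enumerate(pairs):
        if i % SYNC_EVERY == 0:
            out.write(SYNC)
        packed = zlib.compress(value)  # each value is compressed on its own
        out.write(struct.pack(">II", len(key), len(packed)))
        out.write(key)
        out.write(packed)
    return out.getvalue()


def read_split(data, start, end):
    """Return the records of every block whose marker starts in [start, end)."""
    records = []
    pos = data.find(SYNC, start)
    while pos != -1 and pos < min(end, len(data)):
        pos += len(SYNC)
        # Read length-prefixed records until the next marker or end of data.
        while pos < len(data) and not data.startswith(SYNC, pos):
            klen, vlen = struct.unpack_from(">II", data, pos)
            pos += 8
            key = data[pos:pos + klen]
            value = zlib.decompress(data[pos + klen:pos + klen + vlen])
            records.append((key, value))
            pos += klen + vlen
    return records


pairs = [(b"k%d" % i, b"value-%d" % i) for i in range(10)]
data = write_records(pairs)
middle = len(data) // 2
# Two readers process disjoint byte ranges and together recover every record.
first_half = read_split(data, 0, middle)
second_half = read_split(data, middle, len(data))
```

Each reader skips forward to the first marker in its range, so neither needs to know where the other's records begin.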
Avro
• Widely used as a serialization platform
• Row-based, offers a compact and fast binary
format
• Schema is stored in the file, so the data itself can be untagged
• Files support block compression and are splittable
• Supports schema evolution
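A minimal sketch of what "untagged" means in practice: the record bytes carry only values, and the schema says how to read them back. It follows Avro's binary encoding for strings (a zigzag variable-length long for the length, then UTF-8 bytes) but handles string fields only and leaves out the container file around the records. The two-field `SCHEMA` and the helper names are assumptions for illustration, not part of the talk.

```python
import json

# Hypothetical two-field schema in Avro's JSON notation (strings only).
SCHEMA = {
    "type": "record",
    "name": "episode",
    "fields": [
        {"name": "episode_no", "type": "string"},
        {"name": "episode_title", "type": "string"},
    ],
}


def encode_long(n):
    """Zigzag plus variable-length encoding, as Avro uses for int and long."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def decode_long(buf, pos):
    """Inverse of encode_long; returns (value, next position)."""
    shift = 0
    z = 0
    while True:
        byte = buf[pos]
        pos += 1
        z |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return (z >> 1) ^ -(z & 1), pos


def encode_record(record, schema):
    """Write field values in schema order: no field names, no type tags."""
    out = bytearray()
    for field in schema["fields"]:
        raw = record[field["name"]].encode("utf-8")
        out += encode_long(len(raw)) + raw
    return bytes(out)


def decode_record(buf, schema):
    """The schema alone says which bytes belong to which field."""
    record, pos = {}, 0
    for field in schema["fields"]:
        length, pos = decode_long(buf, pos)
        record[field["name"]] = buf[pos:pos + length].decode("utf-8")
        pos += length
    return record


row = {"episode_no": "51", "episode_title": "Spearhead from Space"}
binary = encode_record(row, SCHEMA)
as_json = json.dumps(row).encode("utf-8")
```

The JSON form repeats every field name in every record; the binary form pays for the schema once per file.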
Parquet
• Column-oriented binary file format
• Uses the record shredding and assembly algorithm
described in the Dremel paper
• Each data file contains the values for a set of rows
• Efficient in terms of disk I/O when specific columns
need to be queried
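The disk I/O point can be sketched with flat data: when values are grouped by column, a query on one column reads only that column's bytes. This toy uses JSON blobs as stand-ins for storage and ignores the Dremel repetition and definition levels that Parquet needs for nested data. The sample rows and function names are assumptions for illustration.

```python
import json

# Hypothetical rows; column names follow the Dr Who example in this deck.
rows = [
    {"episode_no": "51", "episode_title": "Spearhead from Space", "planet": "Earth"},
    {"episode_no": "58", "episode_title": "Colony in Space", "planet": "planet Uxarieus"},
    {"episode_no": "63", "episode_title": "The Mutants", "planet": "Solos"},
]


def row_layout(rows):
    """Row-oriented: one blob per record, all columns together."""
    return [json.dumps(list(r.values())).encode("utf-8") for r in rows]


def column_layout(rows):
    """Column-oriented: one blob per column, all rows together."""
    return {
        name: json.dumps([r[name] for r in rows]).encode("utf-8")
        for name in rows[0]
    }


def planets_from_rows(blobs):
    """Every record must be read in full to extract one column."""
    scanned = sum(len(b) for b in blobs)
    return [json.loads(b)[2] for b in blobs], scanned


def planets_from_columns(blobs):
    """Only the requested column's blob is read."""
    blob = blobs["planet"]
    return json.loads(blob), len(blob)


row_result, row_bytes = planets_from_rows(row_layout(rows))
col_result, col_bytes = planets_from_columns(column_layout(rows))
```

Both layouts return the same planets, but `col_bytes` is smaller than `row_bytes`, and the gap widens as the table gets wider.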
Optimized Row Columnar
• Considered the evolution of the RCFile
• Stores collections of rows and within the collection
the row data is stored in columnar format
• Introduces a lightweight indexing that enables
skipping of irrelevant blocks of rows
• Splittable: allows parallel processing of row
collections
• It comes with basic statistics on columns (min, max, sum, and count)
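A rough sketch of how per-block statistics let a reader skip irrelevant rows: if a block's max is at or below the filter threshold, none of its values can match, so the block is never scanned. The block size, the data, and the function names are assumptions for illustration. This is not ORC's actual index layout.

```python
def build_blocks(values, block_size):
    """Split a column into blocks and record basic statistics for each."""
    blocks = []
    for i in range(0, len(values), block_size):
        chunk = values[i:i + block_size]
        blocks.append({
            "min": min(chunk),
            "max": max(chunk),
            "sum": sum(chunk),
            "count": len(chunk),
            "values": chunk,
        })
    return blocks


def scan_greater_than(blocks, threshold):
    """Return matching values and how many blocks were actually scanned."""
    hits, scanned = [], 0
    for block in blocks:
        if block["max"] <= threshold:
            continue  # the statistics prove no value here can match
        scanned += 1
        hits.extend(v for v in block["values"] if v > threshold)
    return hits, scanned


blocks = build_blocks(list(range(100)), block_size=10)
hits, scanned = scan_greater_than(blocks, threshold=84)
```

With sorted data, only 2 of the 10 blocks are scanned here; on unsorted data the statistics help less, because most blocks span a wide range.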
How to choose
HOW TO CHOOSE
• ...for write
• ...for read
How to choose: For write
• Functional Requirements:
• What type of data do you have?
• Is the data format compatible with your
processing and querying tools?
• What are your file sizes?
• Do you have schemas that evolve over time?
How to choose: For write
• Speed Concerns
• Parquet and ORC usually need some additional parsing to format the data, which increases the overall write time
• Avro as a data serialization format: works well from
system to system, handles schema evolution (more
on that later)
• Text is bulky and inefficient but easily readable and
parsable
How to choose: For write
[Chart: write time in seconds. Panels: Narrow - Hortonworks (Hive 0.14); Wide - Hortonworks (Hive 0.14)]
Narrow: 10 million rows, 10 columns
Wide: 4 million rows, 1000 columns
How to choose: For write
[Chart: write time in seconds. Panels: Narrow - hive-1.1.0+cdh5.4.2; Wide - hive-1.1.0+cdh5.4.2]
Narrow: 10 million rows, 10 columns
Wide: 4 million rows, 1000 columns
How to choose: For write
[Chart: write time in seconds for Text, Avro, and Parquet. Panels: Narrow - Spark 1.3; Wide - Spark 1.3]
Narrow: 10 million rows, 10 columns
Wide: 4 million rows, 1000 columns
How to choose: For write
[Chart: file sizes in megabytes. Panels: File sizes for narrow dataset; File sizes for wide dataset]
How to choose: For write
• Use case
• Avro – Event data that can change over time
• Sequence File – Datasets shared between MR
jobs
• Text – Adding large amounts of data to HDFS
quickly
How to choose: For read
• Types of queries:
• Column-specific queries, or queries over a few groups of columns -> use a columnar format such as Parquet or ORC
• Compressing the file, regardless of format, improves query speed
• Text is really slow to read
• Parquet and ORC optimize read performance at
the expense of write performance
How to choose: For read
• Set up:
• Narrow dataset:
• 10 million rows, 10 columns
• Wide dataset:
• 4 million rows, 1000 columns
• Compression:
• Snappy, except for Avro, which uses Deflate
How to choose: For read
[Chart: query time in seconds, Narrow Dataset - Hortonworks Hive 0.14.0.2.2.4.2. Queries: Query 2 (5 conditions), Query 3 (10 conditions). Formats: Text, Avro, Parquet, Sequence, ORC]
How to choose: For read
[Chart: query time in seconds, Wide Dataset - Hortonworks Hive 0.14.0.2.2.4.2. Queries: Query 2 (5 conditions), Query 3 (10 conditions), Query 4 (20 conditions). Formats: Text, Avro, Parquet, Sequence, ORC]
How to choose: For read
[Chart: query time in seconds, Narrow Dataset - CDH hive-1.1.0+cdh5.4.2. Queries: Query 1 (0 conditions), Query 2 (5 conditions), Query 3 (10 conditions). Formats: Text, Avro, Parquet, Sequence, ORC]
How to choose: For read
[Chart: query time in seconds, Wide Dataset - CDH hive-1.1.0+cdh5.4.2. Queries: Query 1 (no conditions), Query 2 (5 conditions), Query 3 (10 conditions), Query 4 (20 conditions). Formats: Text, Avro, Parquet, Sequence, ORC]
How to choose: For read
[Chart: query time in seconds, Narrow Dataset - CDH Impala. Queries: Query 1 (0 conditions), Query 2 (5 conditions), Query 3 (10 conditions). Formats: Text, Avro, Parquet, Sequence]
How to choose: For read
[Chart: query time in seconds, Wide Dataset - CDH Impala. Queries: Query 1 (0 filters), Query 2 (5 filters), Query 3 (10 filters), Query 4 (20 filters). Formats: Text, Avro, Parquet, Sequence]
How to choose: For read
Ran 4 queries (using Impala) over 4 million rows (70GB raw) and 1000 columns (wide table)
[Chart: Query times for different data formats, in seconds. Queries: Query 1 (0 filters), Query 2 (5 filters), Query 3 (10 filters), Query 4 (20 filters). Series: Avro uncompressed, Avro Snappy, Avro Deflate, Parquet, Seq uncompressed, Seq Snappy, Text Snappy, Text uncompressed]
How to choose: For read
• Use case
• Avro – Query datasets that have changed over
time
• Parquet – Query a few columns on a wide
table
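The write and read use cases from the slides can be restated as a small lookup. This only encodes the pairings stated in the deck. The `USE_CASES` table and `suggest_format` are hypothetical names, and a real choice would also weigh tool compatibility and file sizes.

```python
# Use-case pairings as stated on the "Use case" slides; nothing here is measured.
USE_CASES = {
    "write": {
        "event data that can change over time": "Avro",
        "datasets shared between MapReduce jobs": "Sequence File",
        "adding large amounts of data to HDFS quickly": "Text",
    },
    "read": {
        "query datasets that have changed over time": "Avro",
        "query a few columns on a wide table": "Parquet",
    },
}


def suggest_format(phase, use_case):
    """Return the format the slides pair with this use case, or None."""
    return USE_CASES.get(phase, {}).get(use_case)


choice = suggest_format("read", "query a few columns on a wide table")
```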
Schema Evolution
SCHEMA EVOLUTION
• What is schema evolution?
• Data formats that evolve
• Examples
• Use cases
Schema evolution
• What is schema evolution?
• Adding columns
• Renaming columns
• Removing columns
• Why do we need it?
Schema evolution
• Data formats that can evolve
• Avro
• Parquet
• Can only add columns at the end
• ORC
• It’s coming (That’s what they tell me ;) )…
Schema evolution
• Avro Example
• The data – Dr Who episodes
• Original Dr Who & new Dr Who
• http://www.theguardian.com/news/datablog/2010/aug/20/doctor-who-time-travel-information-is-beautiful
• Avro schema for the original Dr Who
{"namespace": "drwho.avro",
"type": "record",
"name": "drwho",
"fields": [
{"name": "doctor_who_season", "type": "string"}, {"name": "doctor_actor", "type": "string"},
{"name": "episode_no", "type": "string"}, {"name": "episode_title", "type": "string"},
{"name": "date_from", "type": "string"}, {"name": "date_to", "type": "string"},
{"name": "estimated", "type": "string"}, {"name": "planet", "type": "string"},
{"name": "sub_location", "type": "string"}, {"name": "main_location", "type": "string"}
]}
Schema evolution
• Avro Example
• Original Dr Who data
doctor_who_season doctor_actor episode_no episode_title date_from date_to estimated planet sub_location main_location
3 Pertwee 51 Spearhead from Space 1970 1990 y Earth England London and other
3 Pertwee 55 Terror of the Autons 1971 1971 y Earth England Luigi Rossini's Circus
3 Pertwee 58 Colony in Space 1971 2472 planet Uxarieus
3 Pertwee 59 The Daemons 1971 1971 y Earth England Devil's End; Wiltshire
3 Pertwee 60 Day of the Daleks 1972 2100 Earth England Auderly House and environs
3 Pertwee 63 The Mutants 1972 2900 Solos
3 Pertwee 64 The Time Monster -2000 1972 Earth/ Atlantis
3 Pertwee 64 The Time Monster 1972 -2000 Earth/ Atlantis
3 Pertwee 66 Carnival of Monsters 1972 1928 n Indian Ocean; Planet Inter Minor Ocean; alien planet
3 Pertwee 67 Frontier in Space 1972 2540 n Planet Draconia; Orgon Planet alien planets
3 Pertwee 68 Planet of the Daleks 1972 2540 y Planet Spiridon Alien Planet
3 Pertwee 69 The Green Death 2540 1973 y Earth UK Llanfairfach; Wales
3 Pertwee 70 The Time Warrior 1973 1200 n Earth UK Wessex Castle
3 Pertwee 71 Invasion of The Dinosaurs 1200 1974 y Earth UK London
Schema evolution
• Avro Example
• Let's add, rename, and delete some columns
• Avro schema for the new Dr Who
{"namespace": "drwho.avro",
"type": "record",
"name": "drwho",
"fields": [
{"name": "drwho_season", "type": ["null","string"], "aliases": ["doctor_who_season"]},
{"name": "drwho_actor", "type": ["null","string"], "aliases": ["doctor_actor"]},
{"name": "episode_no", "type": ["null","string"]}, {"name": "episode_title", "type": ["null","string"]},
{"name": "date_from", "type": ["null","string"]}, {"name": "date_to", "type": ["null","string"]},
{"name": "estimated", "type": "string"}, {"name": "planet", "type": ["null","string"]},
{"name": "sub_location", "type": ["null","string"]}, {"name": "main_location", "type": ["null","string"]},
{"name": "hd", "type": "string", "default": "no"}
]}
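How a reader schema like this one maps old records onto new names can be sketched in a few lines. This is a simplified take on Avro's resolution rules: match by name, then by alias, then fall back to the default, and ignore fields the reader does not list. Real Avro resolves a writer schema against a reader schema instead of inspecting dict keys. `READER_SCHEMA` is trimmed and omits `estimated` to illustrate deletion (an assumption), and `resolve` is a hypothetical helper.

```python
# Trimmed, hypothetical reader schema based on the new Dr Who schema.
READER_SCHEMA = {
    "type": "record",
    "name": "drwho",
    "fields": [
        {"name": "drwho_season", "type": ["null", "string"],
         "aliases": ["doctor_who_season"]},
        {"name": "drwho_actor", "type": ["null", "string"],
         "aliases": ["doctor_actor"]},
        {"name": "episode_title", "type": ["null", "string"]},
        {"name": "hd", "type": "string", "default": "no"},
    ],
}


def resolve(writer_record, reader_schema):
    """Project a record written with the old names onto the reader schema."""
    resolved = {}
    for field in reader_schema["fields"]:
        candidates = [field["name"]] + field.get("aliases", [])
        for name in candidates:
            if name in writer_record:
                resolved[field["name"]] = writer_record[name]  # renamed or unchanged
                break
        else:
            if "default" not in field:
                raise ValueError("no value or default for field %r" % field["name"])
            resolved[field["name"]] = field["default"]  # added column
    return resolved  # writer fields the reader does not list are dropped


old_record = {
    "doctor_who_season": "3",
    "doctor_actor": "Pertwee",
    "episode_title": "Spearhead from Space",
    "estimated": "y",
}
new_record = resolve(old_record, READER_SCHEMA)
```

The old record picks up the renamed columns through the aliases, gains `hd` from its default, and loses `estimated` because the trimmed reader schema no longer lists it.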
Schema evolution
• Avro Example
• Original & New Dr Who data
drwho_season drwho_actor episode_no episode_title date_from date_to planet sub_location main_location hd
10 Tennant 201 New Earth 2006 5000000023 New Earth New … New York yes
10 Tennant 202 Tooth and claw 2006 1879 Earth Scotland Torchwood house; Near Balmoral yes
10 Tennant 203 school Reunion 2007 2007 Earth England Deffry Vale yes
10 Tennant 204 the Girl in the Fireplace 1727 1744 Earth France Paris yes
3 Pertwee 51 Spearhead from Space 1970 1990 Earth England London and other no
3 Pertwee 55 Terror of the Autons 1971 1971 Earth England Luigi Rossini's Circus no
3 Pertwee 58 Colony in Space 1971 2472 planet Uxarieus no
3 Pertwee 59 The Daemons 1971 1971 Earth England Devil's End; Wiltshire no
3 Pertwee 60 Day of the Daleks 1972 2100 Earth England Auderly House and environs no
Schema evolution
• Use cases
• New data added to an event stream
• Need to see historic data with new data (and
the schema has changed a lot)
• Business has changed the field/column name
Summary
QUESTIONS?
Yes, we’re hiring!
info@svds.com
THANK YOU
Stephen O’Sullivan
stephen@svds.com
@steveos
Demo code is here:
github.com/silicon-valley-data-science/stampedecon-2015

Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon 2015

  • 1. Choosing an HDFS data storage format: Avro vs. Parquet and more Stephen O’Sullivan | @steveos
  • 2. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience
  • 3. 3 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience • Prioritize for highest business value when using emerging technology • Design with outcomes in mind • Be agile: deliver initial results quickly, then adapt and iterate • Collaborate constantly with our customers and partners OUR PHILOSOPHY
  • 4. 4 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience AGENDA Introduction Data formats How to choose Schema evolution Summary Questions
  • 7. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience • Storage formats • What they do DATA FORMATS
  • 8. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience Data format • Storage Format • Text • Sequence File • Avro • Parquet • Optimized Row Columnar (ORC)
  • 9. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience Text • More specifically text = csv, tsv, json records… • Convenient format to use to exchange with other applications or scripts that produce or read delimited files • Human readable and parsable • Data stores is bulky and not as efficient to query • Do not support block compression
  • 10. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience Sequence File • Provides a persistent data structure for binary key- value pairs • Row based • Commonly used to transfer data between Map Reduce jobs • Can be used as an archive to pack small files in Hadoop • Support splitting even when the data is compressed
  • 11. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience Avro • Widely used as a serialization platform • Row-based, offers a compact and fast binary format • Schema is encoded on the file so the data can be untagged • Files support block compression and are splittable • Supports schema evolution
  • 12. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience Parquet • Column-oriented binary file format • Uses the record shredding and assembly algorithm described in the Dremel paper • Each data file contains the values for a set of rows • Efficient in terms of disk I/O when specific columns need to be queried
  • 13. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience Optimized Row Columnar • Considered the evolution of the RCFile • Stores collections of rows and within the collection the row data is stored in columnar format • Introduces a lightweight indexing that enables skipping of irrelevant blocks of rows • Splittable: allows parallel processing of row collections • It comes with basic statistics on columns (min ,max, sum, and count)
  • 15. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience • ..for write • ..for read HOW TO CHOOSE
  • 16. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience How to choose: For write • Functional Requirements: • What type of data do you have? • Is the data format compatible with your processing and querying tools? • What are your file sizes? • Do you have schemas that evolve over time?
  • 17. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience How to choose: For write • Speed Concerns • Parquet and ORC usually needs some additional parsing to format the data which increases the overall read time • Avro as a data serialization format: works well from system to system, handles schema evolution (more on that later) • Text is bulky and inefficient but easily readable and parsable
  • 18. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience How to choose: For write 0 20 40 60 80 100 120 140 160 TimeinSeconds Narrow – Hortonworks (Hive 0.14 ) 0 500 1000 1500 2000 2500 TimeinSeconds Wide – Hortonworks (Hive 0.14) Narrow: 10 million rows, 10 columns Wide: 4 million rows, 1000 columns
  • 19. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience How to choose: For write 0 10 20 30 40 50 60 70 TimeinSeconds Narrow - hive-1.1.0+cdh5.4.2 0 100 200 300 400 500 600 700 TimeinSeconds Wide - hive-1.1.0+cdh5.4.2 Narrow: 10 million rows, 10 columns Wide: 4 million rows, 1000 columns
  • 20. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience How to choose: For write Narrow: 10 million rows, 10 columns Wide: 4 million rows, 1000 columns 0 10 20 30 40 50 60 70 Text Avro Parquet TimeinSeconds Narrow - Spark 1.3 0 200 400 600 800 1000 1200 Text Avro Parquet TimeinSeconds Wide - Spark 1.3
  • 21. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience How to choose: For write 0 200 400 600 800 1000 1200 1400 Megabytes File sizes for narrow dataset 0 2000 4000 6000 8000 10000 12000 Megabytes File sizes for wide dataset
  • 22. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience How to choose: For write • Use case • Avro – Event data that can change over time • Sequence File – Datasets shared between MR jobs • Text – Adding large amounts of data to HDFS quickly
  • 23. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience How to choose: For read • Types of queries: • Column specific queries, or few groups of columns -> Use columnar format like Parquet or ORC • Compression of the file regardless the format increases query speed times • Text is really slow to read • Parquet and ORC optimize read performance at the expense of write performance
  • 24. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience How to choose: For read • Set up: • Narrow dataset: • 10 million rows, 10 columns • Wide dataset: • 4 million rows, 1000 columns • Compression: • Snappy, except for Avro which is deflate
  • 25. How to choose: For read [Chart: Narrow dataset, Hortonworks Hive 0.14.0.2.2.4.2. Query time in seconds for Query 2 (5 conditions) and Query 3 (10 conditions). Series: Text, Avro, Parquet, Sequence, ORC.]
  • 26. How to choose: For read [Chart: Wide dataset, Hortonworks Hive 0.14.0.2.2.4.2. Query time in seconds for Query 2 (5 conditions), Query 3 (10 conditions), and Query 4 (20 conditions). Series: Text, Avro, Parquet, Sequence, ORC.]
  • 27. How to choose: For read [Chart: Narrow dataset, CDH hive-1.1.0+cdh5.4.2. Query time in seconds for Query 1 (0 conditions), Query 2 (5 conditions), and Query 3 (10 conditions). Series: Text, Avro, Parquet, Sequence, ORC.]
  • 28. How to choose: For read [Chart: Wide dataset, CDH hive-1.1.0+cdh5.4.2. Query time in seconds for Query 1 (no conditions), Query 2 (5 conditions), Query 3 (10 conditions), and Query 4 (20 conditions). Series: Text, Avro, Parquet, Sequence, ORC.]
  • 29. How to choose: For read [Chart: Narrow dataset, CDH Impala. Query time in seconds for Query 1 (0 conditions), Query 2 (5 conditions), and Query 3 (10 conditions). Series: Text, Avro, Parquet, Sequence.]
  • 30. How to choose: For read [Chart: Wide dataset, CDH Impala. Query time in seconds for Query 1 (0 filters), Query 2 (5 filters), Query 3 (10 filters), and Query 4 (20 filters). Series: Text, Avro, Parquet, Sequence.]
  • 31. How to choose: For read • Ran 4 queries (using Impala) over 4 million rows (70 GB raw) and 1000 columns (wide table) [Chart: Query times for different data formats, in seconds, for Query 1 (0 filters), Query 2 (5 filters), Query 3 (10 filters), and Query 4 (20 filters). Series: Avro uncompressed, Avro Snappy, Avro Deflate, Parquet, Seq uncompressed, Seq Snappy, Text Snappy, Text uncompressed.]
  • 32. How to choose: For read • Use case • Avro – Query datasets that have changed over time • Parquet – Query a few columns on a wide table
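The "query a few columns on a wide table" case can be sketched in a few lines of plain Python. This is a toy model only, not how Parquet or ORC are actually implemented: real columnar files add row groups, encodings, and statistics. The row count and the five column positions are hypothetical; only the 1000-column width comes from the wide dataset described in the talk.

```python
# Toy model of row-oriented vs column-oriented storage.
# NUM_ROWS and WANTED are hypothetical values chosen for illustration.
NUM_ROWS = 100
NUM_COLS = 1000                  # same width as the talk's wide dataset
WANTED = [0, 10, 20, 30, 40]     # a query that touches 5 columns

# Row-oriented layout: one list per row.
rows = [[r * NUM_COLS + c for c in range(NUM_COLS)] for r in range(NUM_ROWS)]

# Column-oriented layout: one list per column (the transpose of `rows`).
columns = [list(col) for col in zip(*rows)]


def read_row_layout(rows, wanted):
    """Return the wanted columns plus the number of values that were read."""
    values_read = 0
    result = {c: [] for c in wanted}
    for row in rows:
        values_read += len(row)          # the whole row has to be read
        for c in wanted:
            result[c].append(row[c])
    return result, values_read


def read_column_layout(columns, wanted):
    """Return the wanted columns plus the number of values that were read."""
    values_read = 0
    result = {}
    for c in wanted:
        result[c] = list(columns[c])     # only the wanted columns are read
        values_read += len(columns[c])
    return result, values_read


row_result, row_cost = read_row_layout(rows, WANTED)
col_result, col_cost = read_column_layout(columns, WANTED)

print(row_result == col_result)  # True: both layouts return the same data
print(row_cost, col_cost)        # 100000 500
```

In this model the row layout reads every value in the table, while the column layout reads only 5 of the 1000 columns, which is the intuition behind recommending a columnar format for column-specific queries.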
  • 34. SCHEMA EVOLUTION • What is schema evolution? • Data formats that evolve • Examples • Use cases
  • 35. Schema evolution • What is schema evolution? • Adding columns • Renaming columns • Removing columns • Why do we need it?
  • 36. Schema evolution • Data formats that can evolve • Avro • Parquet • Can only add columns at the end • ORC • It’s coming (That’s what they tell me ;) )…
  • 37. Schema evolution • Avro Example • The data – Dr Who episodes • Original Dr Who & new Dr Who • http://www.theguardian.com/news/datablog/2010/aug/20/doctor-who-time-travel-information-is-beautiful • Avro schema for the original Dr Who
    {"namespace": "drwho.avro",
     "type": "record",
     "name": "drwho",
     "fields": [
       {"name": "doctor_who_season", "type": "string"},
       {"name": "doctor_actor", "type": "string"},
       {"name": "episode_no", "type": "string"},
       {"name": "episode_title", "type": "string"},
       {"name": "date_from", "type": "string"},
       {"name": "date_to", "type": "string"},
       {"name": "estimated", "type": "string"},
       {"name": "planet", "type": "string"},
       {"name": "sub_location", "type": "string"},
       {"name": "main_location", "type": "string"}
     ]}
  • 38. Schema evolution • Avro Example • Original Dr Who data
    doctor_who_season doctor_actor episode_no episode_title date_from date_to estimated planet sub_location main_location
    3 Pertwee 51 Spearhead from Space 1970 1990 y Earth England London and other
    3 Pertwee 55 Terror of the Autons 1971 1971 y Earth England Luigi Rossini's Circus
    3 Pertwee 58 Colony in Space 1971 2472 planet Uxarieus
    3 Pertwee 59 The Daemons 1971 1971 y Earth England Devil's End; Wiltshire
    3 Pertwee 60 Day of the Daleks 1972 2100 Earth England Auderly House and environs
    3 Pertwee 63 The Mutants 1972 2900 Solos
    3 Pertwee 64 The Time Monster -2000 1972 Earth/ Atlantis
    3 Pertwee 64 The Time Monster 1972 -2000 Earth/ Atlantis
    3 Pertwee 66 Carnival of Monsters 1972 1928 n Indian Ocean; Planet Inter Minor Ocean; alien planet
    3 Pertwee 67 Frontier in Space 1972 2540 n Planet Draconia; Orgon Planet alien planets
    3 Pertwee 68 Planet of the Daleks 1972 2540 y Planet Spiridon Alien Planet
    3 Pertwee 69 The Green Death 2540 1973 y Earth UK Llanfairfach; Wales
    3 Pertwee 70 The Time Warrior 1973 1200 n Earth UK Wessex Castle
    3 Pertwee 71 Invasion of The Dinosaurs 1200 1974 y Earth UK London
  • 39. Schema evolution • Avro Example • Let's add, rename, and delete some columns • Avro schema for the new Dr Who
    {"namespace": "drwho.avro",
     "type": "record",
     "name": "drwho",
     "fields": [
       {"name": "drwho_season", "type": ["null","string"], "aliases": ["doctor_who_season"]},
       {"name": "drwho_actor", "type": ["null","string"], "aliases": ["doctor_actor"]},
       {"name": "episode_no", "type": ["null","string"]},
       {"name": "episode_title", "type": ["null","string"]},
       {"name": "date_from", "type": ["null","string"]},
       {"name": "date_to", "type": ["null","string"]},
       {"name": "estimated", "type": "string"},
       {"name": "planet", "type": ["null","string"]},
       {"name": "sub_location", "type": ["null","string"]},
       {"name": "main_location", "type": ["null","string"]},
       {"name": "hd", "type": "string", "default": "no"}
     ]}
  • 40. Schema evolution • Avro Example • Original & New Dr Who data
    drwho_season drwho_actor episode_no episode_title date_from date_to planet sub_location main_location hd
    10 Tennant 201 New Earth 2006 5000000023 New Earth New … New York yes
    10 Tennant 202 Tooth and claw 2006 1879 Earth Scotland Torchwood house; Near Balmoral yes
    10 Tennant 203 school Reunion 2007 2007 Earth England Deffry Vale yes
    10 Tennant 204 the Girl in the Fireplace 1727 1744 Earth France Paris yes
    3 Pertwee 51 Spearhead from Space 1970 1990 Earth England London and other no
    3 Pertwee 55 Terror of the Autons 1971 1971 Earth England Luigi Rossini's Circus no
    3 Pertwee 58 Colony in Space 1971 2472 planet Uxarieus no
    3 Pertwee 59 The Daemons 1971 1971 Earth England Devil's End; Wiltshire no
    3 Pertwee 60 Day of the Daleks 1972 2100 Earth England Auderly House and environs no
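How the old records end up under the new column names can be sketched with a simplified version of Avro's schema resolution rules: a reader field is matched to the writer's data by name or alias, a field missing from the data takes the reader's default, and fields the reader does not declare are dropped. The sketch below is plain Python and does not use the Avro library. The `resolve` helper is hypothetical, and the reader schema is trimmed to four fields for brevity, so it is not the full schema from the talk.

```python
import json

# Reader schema: a trimmed, four-field copy of the "new Dr Who" schema.
# The trimming is an assumption made here to keep the example short.
READER_SCHEMA = json.loads("""
{"namespace": "drwho.avro", "type": "record", "name": "drwho",
 "fields": [
   {"name": "drwho_season", "type": ["null", "string"],
    "aliases": ["doctor_who_season"]},
   {"name": "drwho_actor", "type": ["null", "string"],
    "aliases": ["doctor_actor"]},
   {"name": "episode_title", "type": ["null", "string"]},
   {"name": "hd", "type": "string", "default": "no"}
 ]}
""")

_MISSING = object()  # sentinel, so that a stored None is not mistaken for "absent"


def resolve(old_record, reader_schema):
    """Project a record written with the old schema onto the reader schema.

    Hypothetical helper: a simplified model of name/alias/default resolution,
    not the Avro library's implementation.
    """
    resolved = {}
    for field in reader_schema["fields"]:
        value = _MISSING
        # Try the reader's field name first, then each alias (the old names).
        for name in [field["name"]] + field.get("aliases", []):
            if name in old_record:
                value = old_record[name]
                break
        if value is _MISSING:
            if "default" not in field:
                raise ValueError(f"no value or default for field {field['name']!r}")
            value = field["default"]     # newly added column
        resolved[field["name"]] = value
    return resolved                      # undeclared old fields are dropped


old = {
    "doctor_who_season": "3",
    "doctor_actor": "Pertwee",
    "episode_no": "51",
    "episode_title": "Spearhead from Space",
    "estimated": "y",
}
new = resolve(old, READER_SCHEMA)
print(new)
```

The renamed columns are found through their aliases, `hd` is filled with its default of "no", and the fields left out of the trimmed reader schema (`episode_no` and `estimated` here) do not appear in the result.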
  • 41. Schema evolution • Use cases • New data added to an event stream • Need to see historic data with new data (and the schema has changed a lot) • Business has changed the field/column name
  • 43. QUESTIONS?
  • 44. Yes, we’re hiring! info@svds.com THANK YOU Stephen O’Sullivan stephen@svds.com @steveos Demo code is here: github.com/silicon-valley-data-science/stampedecon-2015

Editor's Notes

  • #6: Description You have your Hadoop cluster, and you are ready to fill it up with data, but wait: Which format should you use to store your data? Should you store it in Plain Text, Sequence File, Avro, or Parquet? (And should you compress it?) This talk will take a closer look at some of the trade-offs, and will cover the How, Why, and When of choosing one format over another.
  • #10: Text files do not support block compression. Once they are compressed, they are no longer splittable, which increases the cost of reads.
  • #13: Each data file contains the values for a set of rows. Within a data file, the values from each column are organized so that they are adjacent, enabling good compression of those values.
  • #26: No results for Query 1 (a count with no conditions). This is because Stinger has metadata about the amount of data in the table (only when it’s an internal table).
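Note #13 says that storing each column's values next to each other enables good compression. A minimal way to see why is to run-length encode the same values in row order and in column order. Run-length encoding is used here only as a simple stand-in for a real codec, and the four rows are a small sample modeled on the Dr Who table (season, actor, planet), not the benchmark data.

```python
from itertools import groupby

# Small sample modeled on the Dr Who table: (season, actor, planet).
rows = [
    ("3", "Pertwee", "Earth"),
    ("3", "Pertwee", "Earth"),
    ("3", "Pertwee", "Earth"),
    ("3", "Pertwee", "Solos"),
]


def run_length_encode(values):
    """Collapse consecutive equal values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# Row-major order: values of one row are adjacent (row-based formats).
row_major = [value for row in rows for value in row]

# Column-major order: values of one column are adjacent (columnar formats).
column_major = [value for column in zip(*rows) for value in column]

row_runs = run_length_encode(row_major)
col_runs = run_length_encode(column_major)

print(len(row_runs), len(col_runs))  # 12 4
print(col_runs)  # [('3', 4), ('Pertwee', 4), ('Earth', 3), ('Solos', 1)]
```

In row order no two neighbouring values are equal, so nothing collapses. In column order the repeated season, actor, and planet values form long runs, which is the property a columnar format can exploit.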