Storage in Hadoop
PUNEET TRIPATHI
What are we covering
 Delimited formats –
> CSV and others
> Sequence Files
> Avro
> Columnar formats
>> RC / ORC / Parquet
 Compression/decompression codecs –
Gzip, Bzip2, Snappy, LZO
 Focus – Apache Parquet
File Formats – and why we need them
Common considerations for choosing a file format
• Which processing and query tools will you use?
• Does the data structure change over time?
• Compression and splittability
• Processing or query performance
• And storage space still can't be ignored!
Available storage formats
 Text/CSV
Ubiquitously parsable | Splittable | No metadata | No block compression
 JSON records – always try to avoid JSON
Each line is a JSON datum | Carries metadata | Schema evolution | Clumsy native SerDes | No (read: optional) block compression
 Sequence Files
Binary format | Structure similar to CSV | Block compression | No metadata
 Avro
Splittable | Metadata | Superb schema evolution | Block compression | Supported by almost all Hadoop tools | Looks like JSON
 RC
Record Columnar files | Columnar format | Compression | Query performance | No schema evolution (previous files must be rewritten) | Writes are unoptimized
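A quick stdlib-only sketch (with hypothetical data, not from the slides) of one difference listed above: JSON records repeat the field names in every row, while CSV stores them once in a header line, so JSON Lines files carry per-record metadata overhead.

```python
import csv
import io
import json

# Hypothetical sample records for illustration.
rows = [
    {"id": i, "name": f"user{i}", "score": i * 1.5}
    for i in range(1000)
]

# CSV: field names appear exactly once, in the header line.
csv_buf = io.StringIO()
writer = csv.DictWriter(csv_buf, fieldnames=["id", "name", "score"])
writer.writeheader()
writer.writerows(rows)
csv_size = len(csv_buf.getvalue().encode())

# JSON Lines: every record carries its own field names (per-row metadata).
jsonl = "\n".join(json.dumps(r) for r in rows)
jsonl_size = len(jsonl.encode())

print(csv_size, jsonl_size)  # the JSON Lines payload is noticeably larger
```

This overhead is one reason the deck suggests avoiding JSON for bulk storage; block compression can claw some of it back, but only if the codec and layout allow it.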
Available storage formats (continued)
 ORC
Optimized RC | Same benefits as RC, but faster | No schema evolution | Tool support can be a problem | ORC files compress to the smallest size (some benchmarks, including mine, claim so) | As performant as Parquet
 Parquet
Columnar | Superb compression | Query performance | Writes are unoptimized | Supports schema evolution | Widely supported across the Hadoop ecosystem, or support is being added | Spark supports it out of the box, and we use Spark
Moment of truth –
 There is no single file format that will do everything for you
 Consider the following when picking a format –
> Hadoop distribution
> Read/query requirements
> Interchange & extraction of data
 Different phases may need different storage formats –
> Parquet is best suited if your mart is query-heavy
> CSV for porting data to other data stores
 Always avoid XML and multi-line JSON documents; they are not splittable, and Hadoop cares about splittability intensely.
Codecs – and why we need them
Common considerations for choosing a codec
• Balance the processing capacity needed to compress and uncompress
• Compression-degree vs. speed tradeoff
• How soon will you query the data?
• Splittability – matters a lot in the context of Hadoop
Available codecs
• Gzip
Wrapper around zlib | Not splittable | But awesome compression | Supported out of the box | CPU intensive
• Bzip2
Similar to Gzip, except splittable | Even better compression ratio | Slowest of the lot
• LZO – Lempel–Ziv–Oberhumer
Modest compression ratio | An index can be built during compression | Splittable (with index) | Fast compression speed | Not CPU intensive | Works well with text files too
• Snappy
Same family as LZO | Shipped with Hadoop | Fastest decompression, with compression speed comparable to LZO | Compression ratio is poorer than the other codecs | Not CPU intensive
 Snappy often performs better than LZO; however, it is worth running tests to see if you detect a significant difference.
 Hadoop doesn't support ZIP out of the box.
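The Gzip/Bzip2 tradeoff above can be seen with Python's stdlib codecs alone (a minimal sketch on a hypothetical repetitive payload; absolute numbers will vary by machine and data):

```python
import bz2
import gzip
import time

# Hypothetical compressible payload, standing in for repetitive log/text data.
data = b"hadoop storage formats and codecs " * 20000

results = {}
for name, compress in [("gzip", gzip.compress), ("bzip2", bz2.compress)]:
    t0 = time.perf_counter()
    out = compress(data)
    # Record (compressed size in bytes, wall-clock seconds to compress).
    results[name] = (len(out), time.perf_counter() - t0)

# bzip2 typically compresses tighter but takes longer, as the slide notes.
print(results)
```

Running the same loop over a sample of your actual data is the cheapest way to do the "run tests to see if you detect a significant difference" advice from the slide.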
Codecs – performance comparison
 Space savings and CPU time comparison [Yahoo]
Focus – Parquet (if time permits)
Columnar Storage – overview
 Let's say we have a table with these observations:
 Reduced space consumption & better column-level compression
 Efficient encoding and decoding, by storing values of the same primitive type together
 Reading the same number of column field values for the same number of records requires a third of the I/O operations compared to row-wise storage (for a three-column table)
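The I/O claim above can be sketched in plain Python (hypothetical three-column data; real formats work on disk pages, but the proportion is the same):

```python
# Minimal sketch of row-wise vs. columnar layout for a 3-column table.
records = [(i, f"name{i}", i % 100) for i in range(10)]  # (id, name, score)

# Row-wise layout: the columns of each record are interleaved on disk,
# as in CSV, Sequence Files, or Avro.
row_store = [value for record in records for value in record]

# Columnar layout: each column's values are stored contiguously,
# as in RC, ORC, or Parquet.
col_store = {
    "id":    [r[0] for r in records],
    "name":  [r[1] for r in records],
    "score": [r[2] for r in records],
}

# Scanning just the "score" column touches only that column's values...
scanned_columnar = len(col_store["score"])
# ...while the row store forces a walk over every value of every record.
scanned_rowwise = len(row_store)
print(scanned_rowwise, scanned_columnar)  # 30 10
```

With three columns of similar width, a single-column scan reads a third of the data in the columnar layout, which is exactly the slide's ratio.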
Parquet – columnar file format
• Inspired by Google's Dremel, developed by Twitter and Cloudera
• Storage on disk –
• Supports nested data structures –
Image Source – Twitter’s Blog - https://blog.twitter.com/engineering/en_us/a/2013/dremel-made-simple-with-parquet.html
Parquet – specifications
• Supports primitive datatypes – Boolean, INT(32, 64, 96), Float, Double, Byte_Array
• The schema is defined in a Protocol Buffers-style syntax
> it has a root called message
> fields are required, optional & repeated
• Field types are either a group or a primitive type
• Each cell is encoded as a triplet – repetition level, definition level & value
• The structure of a record is captured by 2 ints – repetition level & definition level
• Definition levels capture which optional columns are defined (nullability)
• Repetition levels capture where a new list starts
[repeated fields are stored as lists]
Definition Level
message ExampleDefinitionLevel {
  optional group a {
    optional group b {
      optional string c;
    }
  }
}
one column: a.b.c
Image Source – Twitter’s Blog - https://blog.twitter.com/engineering/en_us/a/2013/dremel-made-simple-with-parquet.html
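For the schema above, the definition level of column a.b.c counts how many of the optional fields along the path are actually present, so it ranges from 0 (a is null) to 3 (c has a value). A small Python sketch of that rule (the dict-based records are hypothetical; Parquet writers compute this during encoding):

```python
# Definition levels for the one column a.b.c in the schema:
#   optional group a { optional group b { optional string c; } }
# All three fields are optional, so the maximum definition level is 3.

def definition_level(record):
    """Return the definition level of column a.b.c for one record."""
    level = 0
    a = record.get("a")
    if a is None:
        return level          # a itself is null
    level += 1
    b = a.get("b")
    if b is None:
        return level          # a is present, a.b is null
    level += 1
    if b.get("c") is None:
        return level          # a.b is present, a.b.c is null
    return level + 1          # the value is fully defined

examples = [
    {},                              # a missing      -> 0
    {"a": {}},                       # a.b missing    -> 1
    {"a": {"b": {}}},                # a.b.c missing  -> 2
    {"a": {"b": {"c": "value"}}},    # fully defined  -> 3
]
print([definition_level(r) for r in examples])  # [0, 1, 2, 3]
```

Storing the level rather than a per-field null flag is what lets Parquet reconstruct exactly which ancestor in the path was null, without writing any placeholder values for the missing branches.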
Thank You!