Nick Kabra
Hadoop Compression, various file formats and Querying
What is Data Compression: Data compression is storing data in a format that requires less space than
the original size.
Advantages: reduced storage needs, faster data transfer and less disk I/O.
Disadvantage: consumes CPU cycles for compression and decompression.
As a rule, the higher the compression ratio, the lower the compression speed; the two are inversely related.
Six compression codecs are commonly used with Hadoop: gzip, bzip2, LZO, LZ4, zlib and Snappy.
Compression algorithms operate by finding and eliminating redundancy and duplication in data. Thus, truly
random data can never be compressed. Compression strategies generally have three phases: a preprocessing
or transform phase, followed by duplicate elimination and finally a phase that focuses on bit reduction. The
algorithms used in each compression format vary and have a strong impact on the efficacy and speed of
compression.
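As a concrete illustration of how these codecs are exposed in Hadoop (this sketch is not part of this paper's experiment; the file names are placeholders), the listing below compresses a local file through Hadoop's codec API. Swapping GzipCodec for BZip2Codec, SnappyCodec, etc. changes the ratio/speed tradeoff without touching the rest of the code.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.util.ReflectionUtils;

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.InputStream;
    import java.io.OutputStream;

    public class CodecDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Instantiate a codec; any of the codec classes named above could be used here.
            CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);

            // "input.txt" is a placeholder; the output file gets the codec's extension (".gz").
            try (InputStream in = new FileInputStream("input.txt");
                 OutputStream out = codec.createOutputStream(
                         new FileOutputStream("input.txt" + codec.getDefaultExtension()))) {
                IOUtils.copyBytes(in, out, 4096, false); // stream the data through the compressor
            }
        }
    }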
Why should we compress data: Compression provides several benefits and some disadvantages:
1) MR jobs are almost always I/O bound; compressing data speeds up the I/O operations that are more often
than not the performance bottleneck.
2) With compression turned on you can do more with less, i.e. improve cluster utilization through space
savings and faster data transfers across the network, since less data is sent. This is particularly true as
Hadoop uses 3x data replication by default for fault tolerance.
3) As a user, you can improve your overall job performance and your jobs may take less time to complete.
Compression does not come for free though. These benefits come at the cost of:
1) Increased CPU utilization in compressing and decompressing data.
So, compression presents a tradeoff: storage savings, faster I/O and better use of network bandwidth, in
exchange for increased CPU load. But given the nature of Hadoop, using compression generally turns out
to be a good tradeoff to make.
How compression works in MapReduce:
In simplest terms, the pipeline has 5 compression points: the job input is compressed; the mapper
decompresses it; the mapper output is compressed; the reducer decompresses its input; and the reducer
output is compressed again (this is the final output). So, compression is integral to a MapReduce pipeline and
can have a significant impact on the job’s overall performance.
Map outputs destined for the reducers are sorted on their keys. The process of sorting and transferring data
to the reducer phase is called the shuffle. Map writes are buffered in memory and spilled to disk when the
buffer fills. (Data is partitioned before the spill according to the reducer it needs to go to.) Several spill
files may be generated as part of the above process; they are then merged on disk into a larger partition.
The data written to disk can again be compressed.
The reducer may need data from several map tasks on other nodes, so transferring this data compressed helps.
The reduce phase merges the output from map tasks either in memory or in a combination of memory and
disk, and feeds it to the reducer for further processing, after which the data is written to HDFS (and that
write can also be compressed).
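The sketch below is a minimal, assumed job driver (paths, job name and codec choices are placeholders, not prescriptions from this paper) showing where these two compression points live in the standard MapReduce API: the intermediate map output is compressed with a fast codec and the final reducer output with gzip.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.io.compress.SnappyCodec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CompressedJobDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Compress the intermediate map output with a fast codec (Snappy).
            conf.setBoolean("mapreduce.map.output.compress", true);
            conf.setClass("mapreduce.map.output.compress.codec",
                    SnappyCodec.class, CompressionCodec.class);

            Job job = Job.getInstance(conf, "compressed-pipeline-example");
            // Compressed input is decompressed transparently when its codec is recognized.
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // Compress the final reducer output for interchange/archival.
            FileOutputFormat.setCompressOutput(job, true);
            FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Mapper and reducer classes are omitted here; the point is only where the compression settings attach to the job.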
Factors which play a role in selecting a compression algorithm:
• Nature of the data set
• Chained jobs
• Data-storage efficiency requirements
• Frequency of compression vs. decompression
• Requirement for compatibility with a standard data format
• Splittability requirements (explained later)
• Size of the intermediate and final data
• Alternative implementations of compression libraries
Splittability and its importance: Since Hadoop stores and processes data in blocks, you must be able to
begin reading data at any point within a file in order to take full advantage of Hadoop’s distributed
processing. Hence, it is best if the blocks can be independently compressed. Snappy and LZO are commonly
used compression technologies that enable efficient block storage and processing. If a file format does not
support block compression then, if compressed, the file is rendered non-splittable. So when processed, the
decompressor must begin reading the file at its beginning in order to obtain any block within the file. For a
large file with many blocks, this could generate a substantial performance penalty. If a compression method
is splittable, every compressed input split can be extracted and processed independently.
The SequenceFile format (keys and values) was the first to handle splittability. Split capability can also
be added to block-oriented compression algorithms such as LZO, Snappy and LZ4.
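As a small, assumed illustration (the file names are made up), the sketch below asks Hadoop which codec applies to a file based on its extension, and whether that codec is splittable; bzip2 reports splittable, plain gzip does not, which is why a large .gz file becomes a single split.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;
    import org.apache.hadoop.io.compress.SplittableCompressionCodec;

    public class SplittabilityCheck {
        public static void main(String[] args) {
            CompressionCodecFactory factory = new CompressionCodecFactory(new Configuration());
            for (String name : args) {                    // e.g. data.bz2 data.gz data.deflate
                CompressionCodec codec = factory.getCodec(new Path(name));
                if (codec == null) {
                    System.out.println(name + ": no codec (uncompressed, always splittable)");
                } else {
                    // Codecs that can be split implement SplittableCompressionCodec (e.g. BZip2Codec).
                    boolean splittable = codec instanceof SplittableCompressionCodec;
                    System.out.println(name + ": " + codec.getClass().getSimpleName()
                            + ", splittable=" + splittable);
                }
            }
        }
    }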
Findings and results from the test experiment are shared in the Excel sheet:
System used:
16 core CPU
16 GB RAM
Ubuntu
File size=2.8 GB
Compression and Query Formats:
Size of file matters: If your files are smaller than the size of an HDFS block, then splittability and block
compression don’t matter. You may be able to store the data uncompressed or with a simple file
compression algorithm. Of course, small files are the exception in Hadoop and processing too many small
files can cause performance issues. Hadoop wants large, splittable files so that its massively distributed
engine can leverage data locality and parallel processing.
Large files in Hadoop consume a lot of disk -- especially when considering the standard 3x replication. So,
there is an economic incentive to compress data, i.e. store more data per byte of disk. There is also a
performance incentive, as disk I/O is expensive. If you can reduce the disk footprint through compression,
you can relieve I/O bottlenecks. As an example, I converted an uncompressed, 1.8 GB CSV file into the
following formats, achieving much smaller disk footprints.
Uncompressed CSV 1.8 GB
Avro 1.5 GB
Avro w/ Snappy Compression 750 MB
Parquet w/ Snappy Compression 300 MB
I then ran Impala and Hive queries against each of the file formats. As the files became smaller, the query
performance improved. The queries against Parquet were a couple orders of magnitude faster than
uncompressed CSV.
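A conversion like the one above can be scripted with the standard Avro Java API. The sketch below is a minimal, assumed example (the two-field schema, field names and file paths are invented for illustration) that writes CSV rows into a Snappy block-compressed Avro container.

    import org.apache.avro.Schema;
    import org.apache.avro.file.CodecFactory;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileReader;

    public class CsvToAvro {
        public static void main(String[] args) throws Exception {
            Schema schema = new Schema.Parser().parse(
                    "{\"type\":\"record\",\"name\":\"Row\",\"fields\":["
                  + "{\"name\":\"id\",\"type\":\"long\"},"
                  + "{\"name\":\"value\",\"type\":\"string\"}]}");

            try (BufferedReader csv = new BufferedReader(new FileReader("input.csv"));
                 DataFileWriter<GenericRecord> writer =
                         new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
                writer.setCodec(CodecFactory.snappyCodec());    // Snappy block compression
                writer.create(schema, new File("output.avro"));

                String line;
                while ((line = csv.readLine()) != null) {
                    // Assumes well-formed two-column rows of the form id,value.
                    String[] cols = line.split(",", 2);
                    GenericRecord rec = new GenericData.Record(schema);
                    rec.put("id", Long.parseLong(cols[0]));
                    rec.put("value", cols[1]);
                    writer.append(rec);
                }
            }
        }
    }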
What query/file format are you using: Hive (text), Avro, Parquet, ORC, or RC File?
If you want to use Avro, does the data processing language you’ve chosen include Avro readers and
writers? Likewise, suppose you’ve picked the Cloudera distribution because you like Impala. You should
probably know that Impala currently does not support ORC format. Conversely, if you’ve chosen
Hortonworks and Hive-Stinger, you probably want to avoid Parquet. Yes, it is expected that most of the
tools will end up supporting most of the popular formats, but double-check before you make any final
decisions.
If you have a large enough cluster you can rewrite all of your historical data to add a field, but this is often
not ideal. Being able to add a field and still read historical data may be preferred. If so, we should know
which file formats enable flexible and evolving schema.
Processing or query performance – What matters to you: There are three types of performance to
consider:
Write performance -- how fast can the data be written.
Partial read performance -- how fast can you read individual columns within a file.
Full read performance -- how fast can you read every data element in a file.
A columnar, compressed file format like Parquet or ORC may optimize partial and full read performance,
but they do so at the expense of write performance. Conversely, uncompressed CSV files are fast to write
but due to the lack of compression and column-orientation are slow for reads. You may end up with multiple
copies of your data each formatted for a different performance profile.
Comparison of the popular file formats:
1) Avro files: Avro files are quickly becoming the best multi-purpose storage format within Hadoop. Avro
files store metadata with the data but also allow specification of an independent schema for reading the file.
This makes Avro the epitome of schema evolution support since you can rename, add, delete and change
the data types of fields by defining new independent schema. Additionally, Avro files are splittable, support
block compression and enjoy broad, relatively mature, tool support within the Hadoop ecosystem.
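A minimal sketch of that schema evolution (the field names and the extra "source" field are assumed, not from this paper): a file written with an older two-field schema is read back through a newer reader schema that adds a field with a default value, with no rewrite of the existing file.

    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;

    import java.io.File;

    public class AvroEvolutionRead {
        public static void main(String[] args) throws Exception {
            // Newer reader schema: the extra "source" field gets its default value
            // for records written before the field existed.
            Schema readerSchema = new Schema.Parser().parse(
                    "{\"type\":\"record\",\"name\":\"Row\",\"fields\":["
                  + "{\"name\":\"id\",\"type\":\"long\"},"
                  + "{\"name\":\"value\",\"type\":\"string\"},"
                  + "{\"name\":\"source\",\"type\":\"string\",\"default\":\"unknown\"}]}");

            // Writer schema is taken from the file header, reader schema is the new one.
            GenericDatumReader<GenericRecord> datumReader =
                    new GenericDatumReader<>(null, readerSchema);
            try (DataFileReader<GenericRecord> reader =
                         new DataFileReader<>(new File("output.avro"), datumReader)) {
                for (GenericRecord rec : reader) {
                    System.out.println(rec.get("id") + " -> " + rec.get("source"));
                }
            }
        }
    }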
2) Sequence files: Sequence files store data in a binary format with a similar structure to CSV. Like CSV,
sequence files do not store metadata with the data so the only schema evolution option is appending new
fields. However, unlike CSV, sequence files do support block compression. Due to the complexity of
reading sequence files, they are often only used for “in flight” data such as intermediate data storage used
within a sequence of MapReduce jobs.
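A minimal sketch (key/value types, path and record contents are assumed) of writing such intermediate data as a block-compressed sequence file with the standard org.apache.hadoop.io.SequenceFile API; BLOCK compression packs batches of records together, which typically compresses better than per-record compression.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.SequenceFile.CompressionType;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.SnappyCodec;

    public class BlockCompressedSequenceFile {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path path = new Path("intermediate.seq");   // placeholder path

            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(path),
                    SequenceFile.Writer.keyClass(IntWritable.class),
                    SequenceFile.Writer.valueClass(Text.class),
                    SequenceFile.Writer.compression(CompressionType.BLOCK, new SnappyCodec()))) {
                for (int i = 0; i < 100; i++) {
                    writer.append(new IntWritable(i), new Text("record-" + i));
                }
            }
        }
    }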
3) RC files: RC Files or Record Columnar Files were the first columnar file format adopted in Hadoop. Like
columnar databases, the RC file enjoys significant compression and query performance benefits. However,
the current serdes for RC files in Hive and other tools do not support schema evolution. In order to add a
column to your data you must rewrite every pre-existing RC file. Also, although RC files are good for
query, writing an RC file requires more memory and computation than non-columnar file formats. They
are generally slower to write.
4) ORC files: ORC Files or Optimized RC Files were invented to optimize performance in Hive and are
primarily backed by Hortonworks. ORC files enjoy the same benefits and limitations as RC files, just done
better for Hadoop. This means ORC files compress better than RC files, enabling faster queries. However,
they still don’t support schema evolution. Some benchmarks indicate that ORC files compress to be the
smallest of all file formats in Hadoop. It is worthwhile to note that, at the time of this writing, Cloudera
Impala does not support ORC files.
5) Parquet files: Parquet Files are yet another columnar file format that originated from Hadoop creator Doug
Cutting’s Trevni project. Like RC and ORC, Parquet enjoys compression and query performance benefits,
and is generally slower to write than non-columnar file formats. However, unlike RC and ORC files Parquet
serdes support limited schema evolution. In Parquet, new columns can be added at the end of the structure.
At present, Hive and Impala are able to query newly added columns, but other tools in the ecosystem such
as Hadoop Pig may face challenges. Parquet is supported by Cloudera and optimized for Cloudera Impala.
Native Parquet support is rapidly being added for the rest of the Hadoop ecosystem.
One note on Parquet file support with Hive... It is very important that Parquet column names are lowercase.
If your Parquet file contains mixed case column names, Hive will not be able to read the column and will
return queries on the column with null values and not log any errors. Unlike Hive, Impala handles mixed
case column names. A truly perplexing problem when you encounter it!
Factors to consider for query file format:
Hadoop Distribution: Cloudera and Hortonworks support/favor different formats
Schema Evolution: Will the structure of your data evolve? In what way?
Processing Requirements: Will you be crunching the data and with what tools?
Read/Query Requirements: Will you be using SQL on Hadoop? Which engine?
Extract Requirements: Will you be extracting the data from Hadoop for import into an external database
engine or other platform?
Storage Requirements: Is data volume a significant factor? Will you get significantly more bang for your
storage buck through compression?
For MapReduce, some guidelines on which compression method to use where (a sketch of the chained-job
case follows below):
Mapper input: use a splittable algorithm such as bzip2, or use zlib with the RC File, ORC or Parquet format.
Mapper output: use LZO, LZ4 or Snappy, i.e. faster codecs, for intermediate data.
Reducer input: use LZO, LZ4 or Snappy, i.e. faster codecs, for intermediate data.
Reducer output: use a standard utility such as gzip or bzip2 for data interchange, and faster codecs for
chained jobs.
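The sketch below illustrates the chained-jobs guideline with assumed paths, job names and codec choices: the first job hands its data to the second as a Snappy block-compressed SequenceFile rather than gzipped text, so the next stage can split and decode it cheaply. Mapper and reducer classes are again omitted for brevity.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile.CompressionType;
    import org.apache.hadoop.io.compress.SnappyCodec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    public class ChainedJobs {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path raw = new Path(args[0]);
            Path intermediate = new Path(args[1]);
            Path finalOut = new Path(args[2]);

            // Job 1: write intermediate data as a splittable, block-compressed SequenceFile.
            Job first = Job.getInstance(conf, "stage-1");
            FileInputFormat.addInputPath(first, raw);
            first.setOutputFormatClass(SequenceFileOutputFormat.class);
            FileOutputFormat.setOutputPath(first, intermediate);
            FileOutputFormat.setCompressOutput(first, true);
            FileOutputFormat.setOutputCompressorClass(first, SnappyCodec.class);
            SequenceFileOutputFormat.setOutputCompressionType(first, CompressionType.BLOCK);
            if (!first.waitForCompletion(true)) System.exit(1);

            // Job 2: read the compressed intermediate data directly.
            Job second = Job.getInstance(conf, "stage-2");
            second.setInputFormatClass(SequenceFileInputFormat.class);
            FileInputFormat.addInputPath(second, intermediate);
            FileOutputFormat.setOutputPath(second, finalOut);
            System.exit(second.waitForCompletion(true) ? 0 : 1);
        }
    }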