Nick Kabra
Hadoop Compression, various file formats and Querying
What is Data Compression: Data compression is storing data in a format that requires less space than
the original size.
Advantages: reduced storage needs, faster data transfer and less disk I/O.
Disadvantage: consumes CPU cycles for compression and decompression.
As a rule, the higher the compression ratio, the lower the compression speed; the two are inversely related.
Six compression codecs are commonly used with Hadoop: gzip, bzip2, LZO, LZ4, zlib and Snappy.
Compression algorithms operate by finding and eliminating redundancy and duplication in data. Thus, truly
random data can never be compressed. Compression strategies generally have three phases: a preprocessing
or transform phase, followed by duplicate elimination and finally a phase that focuses on bit reduction. The
algorithms used in each compression format vary and have a strong impact on the efficacy and speed of
compression.
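As a concrete illustration of how these codecs are exposed in Hadoop (this sketch is not part of this paper's experiment; the file names are placeholders), the listing below compresses a local file through Hadoop's codec API. Swapping GzipCodec for BZip2Codec, SnappyCodec, etc. changes the ratio/speed tradeoff without touching the rest of the code.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.util.ReflectionUtils;

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.InputStream;
    import java.io.OutputStream;

    public class CodecDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Instantiate a codec; any of the codec classes named above could be used here.
            CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);

            // "input.txt" is a placeholder; the output file gets the codec's extension (".gz").
            try (InputStream in = new FileInputStream("input.txt");
                 OutputStream out = codec.createOutputStream(
                         new FileOutputStream("input.txt" + codec.getDefaultExtension()))) {
                IOUtils.copyBytes(in, out, 4096, false); // stream the data through the compressor
            }
        }
    }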
Why should we compress data: Compression provides several benefits and some disadvantages:
1) MR jobs are almost always I/O bound; compressing data speeds up the I/O operations that are more often
than not the performance bottleneck.
2) With compression turned on you can do more with less, i.e. improve cluster utilization through space
savings and faster data transfers across the network, since less data is sent. This is particularly true as
Hadoop uses 3x data replication by default for fault tolerance.
3) As a user, you can improve your overall job performance and your jobs may take less time to complete.
Compression does not come for free though. These benefits come at the cost of:
1) Increased CPU utilization in compressing and decompressing data.
So, compression presents a tradeoff: storage savings, faster I/O and better use of network bandwidth, in
exchange for increased CPU load. But given the nature of Hadoop, using compression generally turns out
to be a good tradeoff to make.
How compression works in MapReduce:
In simplest terms, the pipeline has 5 compression points: the job input is compressed; the mapper
decompresses it; the mapper output is compressed; the reducer decompresses its input; and the reducer
output is compressed again (this is the final output). So, compression is integral to a MapReduce pipeline and
can have a significant impact on the job’s overall performance.
Map outputs destined for the reducers are sorted on their keys. The process of sorting and transferring data
to the reducer phase is called the shuffle. Map writes are buffered in memory and spilled to disk when the
buffer fills. (Data is partitioned before the spill according to the reducer it needs to go to.) Several spill
files may be generated as part of the above process; they are then merged on disk into a larger partition.
The data written to disk can again be compressed.
The reducer may need data from several map tasks on other nodes, so transferring this data compressed helps.
The reduce phase merges the output from map tasks either in memory or in a combination of memory and
disk, and feeds it to the reducer for further processing, after which the data is written to HDFS (and that
write can also be compressed).
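The sketch below is a minimal, assumed job driver (paths, job name and codec choices are placeholders, not prescriptions from this paper) showing where these two compression points live in the standard MapReduce API: the intermediate map output is compressed with a fast codec and the final reducer output with gzip.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.io.compress.SnappyCodec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CompressedJobDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Compress the intermediate map output with a fast codec (Snappy).
            conf.setBoolean("mapreduce.map.output.compress", true);
            conf.setClass("mapreduce.map.output.compress.codec",
                    SnappyCodec.class, CompressionCodec.class);

            Job job = Job.getInstance(conf, "compressed-pipeline-example");
            // Compressed input is decompressed transparently when its codec is recognized.
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // Compress the final reducer output for interchange/archival.
            FileOutputFormat.setCompressOutput(job, true);
            FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Mapper and reducer classes are omitted here; the point is only where the compression settings attach to the job.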
Factors which play a role in selecting a compression algorithm:
• Nature of the data set
• Chained jobs
• Data-storage efficiency requirements
• Frequency of compression vs. decompression
• Requirement for compatibility with a standard data format
• Splittability requirements (explained later)
• Size of the intermediate and final data
• Alternative implementations of compression libraries
Splittability and its importance: Since Hadoop stores and processes data in blocks, you must be able to
begin reading data at any point within a file in order to take full advantage of Hadoop’s distributed
processing. Hence, it is best if the blocks can be independently compressed. Snappy and LZO are commonly
used compression technologies that enable efficient block storage and processing. If a file format does not
support block compression then, if compressed, the file is rendered non-splittable. So when processed, the
decompressor must begin reading the file at its beginning in order to obtain any block within the file. For a
large file with many blocks, this could generate a substantial performance penalty. If a compression method
is splittable, every compressed input split can be extracted and processed independently.
The SequenceFile format (keys and values) was the first to handle splittability. Split capability can also
be added to block-oriented compression algorithms such as LZO, Snappy and LZ4.
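As a small, assumed illustration (the file names are made up), the sketch below asks Hadoop which codec applies to a file based on its extension, and whether that codec is splittable; bzip2 reports splittable, plain gzip does not, which is why a large .gz file becomes a single split.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;
    import org.apache.hadoop.io.compress.SplittableCompressionCodec;

    public class SplittabilityCheck {
        public static void main(String[] args) {
            CompressionCodecFactory factory = new CompressionCodecFactory(new Configuration());
            for (String name : args) {                    // e.g. data.bz2 data.gz data.deflate
                CompressionCodec codec = factory.getCodec(new Path(name));
                if (codec == null) {
                    System.out.println(name + ": no codec (uncompressed, always splittable)");
                } else {
                    // Codecs that can be split implement SplittableCompressionCodec (e.g. BZip2Codec).
                    boolean splittable = codec instanceof SplittableCompressionCodec;
                    System.out.println(name + ": " + codec.getClass().getSimpleName()
                            + ", splittable=" + splittable);
                }
            }
        }
    }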
Findings and results from the test experiment are shared in the Excel sheet:
System used:
16 core CPU
16 GB RAM
Ubuntu
File size=2.8 GB
Compression and Query Formats:
Size of file matters: If your files are smaller than the size of an HDFS block, then splittability and block
compression don’t matter. You may be able to store the data uncompressed or with a simple file
compression algorithm. Of course, small files are the exception in Hadoop and processing too many small
files can cause performance issues. Hadoop wants large, splittable files so that its massively distributed
engine can leverage data locality and parallel processing.
Large files in Hadoop consume a lot of disk -- especially when considering the standard 3x replication. So,
there is an economic incentive to compress data, i.e. store more data per byte of disk. There is also a
performance incentive, as disk I/O is expensive. If you can reduce the disk footprint through compression,
you can relieve I/O bottlenecks. As an example, I converted an uncompressed, 1.8 GB CSV file into the
following formats, achieving much smaller disk footprints.
Uncompressed CSV 1.8 GB
Avro 1.5 GB
Avro w/ Snappy Compression 750 MB
Parquet w/ Snappy Compression 300 MB
I then ran Impala and Hive queries against each of the file formats. As the files became smaller, the query
performance improved. The queries against Parquet were a couple orders of magnitude faster than
uncompressed CSV.
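A conversion like the one above can be scripted with the standard Avro Java API. The sketch below is a minimal, assumed example (the two-field schema, field names and file paths are invented for illustration) that writes CSV rows into a Snappy block-compressed Avro container.

    import org.apache.avro.Schema;
    import org.apache.avro.file.CodecFactory;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileReader;

    public class CsvToAvro {
        public static void main(String[] args) throws Exception {
            Schema schema = new Schema.Parser().parse(
                    "{\"type\":\"record\",\"name\":\"Row\",\"fields\":["
                  + "{\"name\":\"id\",\"type\":\"long\"},"
                  + "{\"name\":\"value\",\"type\":\"string\"}]}");

            try (BufferedReader csv = new BufferedReader(new FileReader("input.csv"));
                 DataFileWriter<GenericRecord> writer =
                         new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
                writer.setCodec(CodecFactory.snappyCodec());    // Snappy block compression
                writer.create(schema, new File("output.avro"));

                String line;
                while ((line = csv.readLine()) != null) {
                    // Assumes well-formed two-column rows of the form id,value.
                    String[] cols = line.split(",", 2);
                    GenericRecord rec = new GenericData.Record(schema);
                    rec.put("id", Long.parseLong(cols[0]));
                    rec.put("value", cols[1]);
                    writer.append(rec);
                }
            }
        }
    }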
What query/file format are you using: Hive (text), Avro, Parquet, ORC, or RC File?
If you want to use Avro, does the data processing language you’ve chosen include Avro readers and
writers? Likewise, suppose you’ve picked the Cloudera distribution because you like Impala. You should
probably know that Impala currently does not support ORC format. Conversely, if you’ve chosen
Hortonworks and Hive-Stinger, you probably want to avoid Parquet. Yes, it is expected that most of the
tools will end up supporting most of the popular formats, but double-check before you make any final
decisions.
If you have a large enough cluster you can rewrite all of your historical data to add a field, but this is often
not ideal. Being able to add a field and still read historical data may be preferred. If so, we should know
which file formats enable flexible and evolving schema.
Processing or query performance – What matters to you: There are three types of performance to
consider:
Write performance -- how fast can the data be written.
Partial read performance -- how fast can you read individual columns within a file.
Full read performance -- how fast can you read every data element in a file.
A columnar, compressed file format like Parquet or ORC may optimize partial and full read performance,
but they do so at the expense of write performance. Conversely, uncompressed CSV files are fast to write
but due to the lack of compression and column-orientation are slow for reads. You may end up with multiple
copies of your data each formatted for a different performance profile.
Comparison of the popular file formats:
1) Avro files: Avro files are quickly becoming the best multi-purpose storage format within Hadoop. Avro
files store metadata with the data but also allow specification of an independent schema for reading the file.
This makes Avro the epitome of schema evolution support since you can rename, add, delete and change
the data types of fields by defining new independent schema. Additionally, Avro files are splittable, support
block compression and enjoy broad, relatively mature, tool support within the Hadoop ecosystem.
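A minimal sketch of that schema evolution (the field names and the extra "source" field are assumed, not from this paper): a file written with an older two-field schema is read back through a newer reader schema that adds a field with a default value, with no rewrite of the existing file.

    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;

    import java.io.File;

    public class AvroEvolutionRead {
        public static void main(String[] args) throws Exception {
            // Newer reader schema: the extra "source" field gets its default value
            // for records written before the field existed.
            Schema readerSchema = new Schema.Parser().parse(
                    "{\"type\":\"record\",\"name\":\"Row\",\"fields\":["
                  + "{\"name\":\"id\",\"type\":\"long\"},"
                  + "{\"name\":\"value\",\"type\":\"string\"},"
                  + "{\"name\":\"source\",\"type\":\"string\",\"default\":\"unknown\"}]}");

            // Writer schema is taken from the file header, reader schema is the new one.
            GenericDatumReader<GenericRecord> datumReader =
                    new GenericDatumReader<>(null, readerSchema);
            try (DataFileReader<GenericRecord> reader =
                         new DataFileReader<>(new File("output.avro"), datumReader)) {
                for (GenericRecord rec : reader) {
                    System.out.println(rec.get("id") + " -> " + rec.get("source"));
                }
            }
        }
    }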
2) Sequence files: Sequence files store data in a binary format with a similar structure to CSV. Like CSV,
sequence files do not store metadata with the data so the only schema evolution option is appending new
fields. However, unlike CSV, sequence files do support block compression. Due to the complexity of
reading sequence files, they are often only used for “in flight” data such as intermediate data storage used
within a sequence of MapReduce jobs.
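A minimal sketch (key/value types, path and record contents are assumed) of writing such intermediate data as a block-compressed sequence file with the standard org.apache.hadoop.io.SequenceFile API; BLOCK compression packs batches of records together, which typically compresses better than per-record compression.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.SequenceFile.CompressionType;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.SnappyCodec;

    public class BlockCompressedSequenceFile {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path path = new Path("intermediate.seq");   // placeholder path

            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(path),
                    SequenceFile.Writer.keyClass(IntWritable.class),
                    SequenceFile.Writer.valueClass(Text.class),
                    SequenceFile.Writer.compression(CompressionType.BLOCK, new SnappyCodec()))) {
                for (int i = 0; i < 100; i++) {
                    writer.append(new IntWritable(i), new Text("record-" + i));
                }
            }
        }
    }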
3) RC files: RC Files or Record Columnar Files were the first columnar file format adopted in Hadoop. Like
columnar databases, the RC file enjoys significant compression and query performance benefits. However,
the current serdes for RC files in Hive and other tools do not support schema evolution. In order to add a
column to your data you must rewrite every pre-existing RC file. Also, although RC files are good for
query, writing an RC file requires more memory and computation than non-columnar file formats. They
are generally slower to write.
4) ORC files: ORC Files or Optimized RC Files were invented to optimize performance in Hive and are
primarily backed by Hortonworks. ORC files enjoy the same benefits and limitations as RC files, just done
better for Hadoop. This means ORC files compress better than RC files, enabling faster queries. However,
they still don’t support schema evolution. Some benchmarks indicate that ORC files compress to be the
smallest of all file formats in Hadoop. It is worthwhile to note that, at the time of this writing, Cloudera
Impala does not support ORC files.
5) Parquet files: Parquet Files are yet another columnar file format that originated from Hadoop creator Doug
Cutting’s Trevni project. Like RC and ORC, Parquet enjoys compression and query performance benefits,
and is generally slower to write than non-columnar file formats. However, unlike RC and ORC files Parquet
serdes support limited schema evolution. In Parquet, new columns can be added at the end of the structure.
At present, Hive and Impala are able to query newly added columns, but other tools in the ecosystem such
as Hadoop Pig may face challenges. Parquet is supported by Cloudera and optimized for Cloudera Impala.
Native Parquet support is rapidly being added for the rest of the Hadoop ecosystem.
One note on Parquet file support with Hive... It is very important that Parquet column names are lowercase.
If your Parquet file contains mixed case column names, Hive will not be able to read the column and will
return queries on the column with null values and not log any errors. Unlike Hive, Impala handles mixed
case column names. A truly perplexing problem when you encounter it!
Factors to consider for query file format:
Hadoop Distribution: Cloudera and Hortonworks support/favor different formats
Schema Evolution: Will the structure of your data evolve? In what way?
Processing Requirements: Will you be crunching the data and with what tools?
Read/Query Requirements: Will you be using SQL on Hadoop? Which engine?
Extract Requirements: Will you be extracting the data from Hadoop for import into an external database
engine or other platform?
Storage Requirements: Is data volume a significant factor? Will you get significantly more bang for your
storage buck through compression?
For MapReduce, some guidelines on which compression method to use where (a sketch of the chained-job
case follows below):
Mapper input: use a splittable algorithm such as bzip2, or use zlib with the RC File, ORC or Parquet format.
Mapper output: use LZO, LZ4 or Snappy, i.e. faster codecs, for intermediate data.
Reducer input: use LZO, LZ4 or Snappy, i.e. faster codecs, for intermediate data.
Reducer output: use a standard utility such as gzip or bzip2 for data interchange, and faster codecs for
chained jobs.
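The sketch below illustrates the chained-jobs guideline with assumed paths, job names and codec choices: the first job hands its data to the second as a Snappy block-compressed SequenceFile rather than gzipped text, so the next stage can split and decode it cheaply. Mapper and reducer classes are again omitted for brevity.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile.CompressionType;
    import org.apache.hadoop.io.compress.SnappyCodec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    public class ChainedJobs {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path raw = new Path(args[0]);
            Path intermediate = new Path(args[1]);
            Path finalOut = new Path(args[2]);

            // Job 1: write intermediate data as a splittable, block-compressed SequenceFile.
            Job first = Job.getInstance(conf, "stage-1");
            FileInputFormat.addInputPath(first, raw);
            first.setOutputFormatClass(SequenceFileOutputFormat.class);
            FileOutputFormat.setOutputPath(first, intermediate);
            FileOutputFormat.setCompressOutput(first, true);
            FileOutputFormat.setOutputCompressorClass(first, SnappyCodec.class);
            SequenceFileOutputFormat.setOutputCompressionType(first, CompressionType.BLOCK);
            if (!first.waitForCompletion(true)) System.exit(1);

            // Job 2: read the compressed intermediate data directly.
            Job second = Job.getInstance(conf, "stage-2");
            second.setInputFormatClass(SequenceFileInputFormat.class);
            FileInputFormat.addInputPath(second, intermediate);
            FileOutputFormat.setOutputPath(second, finalOut);
            System.exit(second.waitForCompletion(true) ? 0 : 1);
        }
    }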