Compression Options In Hadoop –
A Tale of Tradeoffs
Govind Kamat
39th Bay Area Hadoop Users Group (HUG) Meetup
Yahoo! URL’s Café
Sunnyvale, CA
August 21, 2013
Introduction

Sumeet Singh
Director of Products, Hadoop
Cloud Engineering Group
701 First Avenue, Sunnyvale, CA 94089 USA
§  Leads the Hadoop products team at Yahoo!
§  Responsible for Product Management, Customer Engagements, Evangelism, and Program Management
§  Prior to this role, led Strategy functions for the Cloud Platform Group at Yahoo!

Govind Kamat
Technical Yahoo!, Hadoop
Cloud Engineering Group
701 First Avenue, Sunnyvale, CA 94089 USA
§  Member of Technical Staff in the Hadoop Services team at Yahoo!
§  Focuses on HBase and Hadoop performance
§  Worked with the Performance Engineering Group on improving the performance and scalability of several Yahoo! applications
§  Experience includes development of large-scale software systems, microprocessor architecture, instruction-set simulators, compiler technology, and electronic design
Agenda

1. Data Compression in Hadoop
2. Available Compression Options
3. Understanding and Working with Compression Options
4. Problems Faced at Yahoo! with Large Data Sets
5. Performance Evaluations, Native Bzip2, and IPP Libraries
6. Wrap-up and Future Work
Compression Needs and Tradeoffs in Hadoop

The Compression Tradeoff
§  Storage
§  Disk I/O
§  Network bandwidth
§  CPU time

§  Hadoop jobs are data-intensive; compressing data can speed up I/O operations
§  MapReduce jobs are almost always I/O bound
§  Compressed data can save storage space and speed up data transfers across the network
§  Capital allocation for hardware can go further
§  Reduced I/O and network load can bring significant performance improvements
§  MapReduce jobs can finish faster overall
§  On the other hand, CPU utilization and processing time increase during compression and decompression
§  Understanding these tradeoffs is important for the MapReduce pipeline’s overall performance
Data Compression in Hadoop’s MR Pipeline

[Pipeline diagram, adapted from Hadoop: The Definitive Guide by Tom White: input splits feed the Map phase; map output is buffered in memory, partitioned and sorted, and merged on disk; during Sort & Shuffle, reducers fetch output from this map and other maps, merge and sort it, and the Reduce phase writes the final output (other reducers fetch their partitions likewise). Compression applies at three points:
1 – Compressed input (Map I/P) is decompressed by the mapper
2 – Mapper output (Map O/P) is compressed before the shuffle and decompressed as reducer input (Reduce I/P)
3 – Reducer output (Reduce O/P) is compressed]
Compression Options in Hadoop (1/2)

Format | Algorithm | Strategy | Emphasis | Comments
zlib | Uses DEFLATE (LZ77 and Huffman coding) | Dictionary-based, API | Compression ratio | Default codec
gzip | Wrapper around zlib | Dictionary-based, standard compression utility | Same as zlib; codec operates on and produces standard gzip files | For data interchange on and off Hadoop
bzip2 | Burrows-Wheeler transform | Transform-based, block-oriented | Higher compression ratios than zlib | Common for Pig
LZO | Variant of LZ77 | Dictionary-based, block-oriented, API | High compression speeds | Common for intermediate compression, HBase tables
LZ4 | Simplified variant of LZ77 | Fast scan, API | Very high compression speeds | Available in newer Hadoop distributions
Snappy | LZ77 | Block-oriented, API | Very high compression speeds | Came out of Google, previously known as Zippy
Compression Options in Hadoop (2/2)

Format | Codec (defined in io.compression.codecs) | File Extn. | Splittable | Java/Native
zlib/DEFLATE (default) | org.apache.hadoop.io.compress.DefaultCodec | .deflate | N | Y/Y
gzip | org.apache.hadoop.io.compress.GzipCodec | .gz | N | Y/Y
bzip2 | org.apache.hadoop.io.compress.BZip2Codec | .bz2 | Y | Y/Y
LZO (download separately) | com.hadoop.compression.lzo.LzoCodec | .lzo | N | N/Y
LZ4 | org.apache.hadoop.io.compress.Lz4Codec | .lz4 | N | N/Y
Snappy | org.apache.hadoop.io.compress.SnappyCodec | .snappy | N | N/Y

NOTES:
§  Splittability – bzip2 is “splittable”: it can be decompressed in parallel by multiple MapReduce tasks. The other formats require all blocks together, so each file is decompressed by a single MapReduce task.
§  LZO – Removed from Hadoop because the LZO libraries are licensed under the GNU GPL. The LZO format is still supported, and the codec can be downloaded separately and enabled manually.
§  Native bzip2 codec – added by Yahoo! as part of this work in Hadoop 0.23
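To illustrate how the codec classes above are used programmatically, here is a minimal sketch that picks the input codec from the file extension via CompressionCodecFactory and rewrites the data gzip-compressed. The class name, argument handling, and the choice of GzipCodec are illustrative assumptions, not part of the original slides.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

import java.io.InputStream;
import java.io.OutputStream;

public class RecompressToGzip {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Pick the input codec from the file extension (.gz, .bz2, .lzo, ...);
    // getCodec() returns null for an uncompressed file.
    Path in = new Path(args[0]);   // hypothetical input path, e.g. a .bz2 part file
    CompressionCodecFactory factory = new CompressionCodecFactory(conf);
    CompressionCodec inCodec = factory.getCodec(in);

    InputStream is = (inCodec == null)
        ? fs.open(in)
        : inCodec.createInputStream(fs.open(in));

    // Write the same data back out through the gzip codec.
    CompressionCodec outCodec = ReflectionUtils.newInstance(GzipCodec.class, conf);
    Path out = new Path(args[0] + outCodec.getDefaultExtension());
    OutputStream os = outCodec.createOutputStream(fs.create(out));

    IOUtils.copyBytes(is, os, conf, true);   // copies and closes both streams
  }
}
```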
Space-Time Tradeoff of Compression Options

Codec Performance on the Wikipedia Text Corpus

Codec | Space Savings | CPU Time in Sec. (Compress + Decompress)
Bzip2 | 71% | 60.0
Zlib (Deflate, Gzip) | 64% | 32.3
LZO | 47% | 4.8
LZ4 | 44% | 2.4
Snappy | 42% | 4.0

Bzip2 and zlib sit at the high-compression-ratio end of the tradeoff; LZO, Snappy, and LZ4 sit at the high-compression-speed end.

Note:
A 265 MB corpus from Wikipedia was used for the performance comparisons.
Space savings is defined as [1 – (Compressed/Uncompressed)].
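For example, a 64% space savings on the 265 MB corpus corresponds to a compressed size of roughly 265 MB × (1 – 0.64) ≈ 95 MB.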
Using Data Compression in Hadoop

1 – Input data to Map
    Config: file extension recognized automatically for decompression
    Values: file extensions of the supported formats
    Note: for SequenceFile, the header carries the information [compression (boolean), block compression (boolean), and compression codec – one of the codecs defined in io.compression.codecs]

2 – Intermediate (Map) Output
    mapreduce.map.output.compress: false (default), true
    mapreduce.map.output.compress.codec: one defined in io.compression.codecs

3 – Final (Reduce) Output
    mapreduce.output.fileoutputformat.compress: false (default), true
    mapreduce.output.fileoutputformat.compress.codec: one defined in io.compression.codecs
    mapreduce.output.fileoutputformat.compress.type: type of compression to use for SequenceFile outputs – NONE, RECORD (default), BLOCK
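A minimal driver sketch showing how these properties map onto the MapReduce Java API; the job name, paths, and the Snappy/gzip codec choices are illustrative assumptions (Snappy requires its native library on the cluster), and the mapper/reducer classes are omitted.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class CompressedJobDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // 2 – Intermediate (map) output: compress spills and shuffle traffic with a fast codec.
    conf.setBoolean("mapreduce.map.output.compress", true);
    conf.setClass("mapreduce.map.output.compress.codec",
        SnappyCodec.class, CompressionCodec.class);

    Job job = Job.getInstance(conf, "compressed-job");
    // (mapper/reducer classes omitted; defaults run an identity job)

    // 1 – Input: no setting needed; the codec is picked from the file extension (.gz, .bz2, ...).
    FileInputFormat.addInputPath(job, new Path(args[0]));

    // 3 – Final (reduce) output: gzip-compressed, block compression for SequenceFile outputs.
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
    SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```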
When to Use Compression and Which Codec

1 – Input data to Map
§  Compress the input data, if large
§  Use a splittable algorithm such as bzip2, or use zlib with the SequenceFile format (see the sketch below)

2 – Intermediate (Map) Output
§  Always use compression, particularly if there is spillage or slow network transfers
§  Use faster codecs such as LZO, LZ4, or Snappy

3 – Final (Reduce) Output
§  Compress for storage/archival, for better write speeds, or between MR jobs
§  Use a standard utility such as gzip or bzip2 for data interchange, and faster codecs for chained jobs
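As one way to follow the “zlib with SequenceFile” advice for input data, the sketch below writes a block-compressed SequenceFile with the default (zlib/DEFLATE) codec using the Hadoop 2 SequenceFile.Writer API; the path, key/value types, and record contents are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.DefaultCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class WriteCompressedSequenceFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path("/data/input/events.seq");   // hypothetical path

    // BLOCK compression batches many records per compressed block, which
    // compresses better than RECORD and keeps the file usable by parallel tasks.
    DefaultCodec codec = ReflectionUtils.newInstance(DefaultCodec.class, conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(path),
        SequenceFile.Writer.keyClass(LongWritable.class),
        SequenceFile.Writer.valueClass(Text.class),
        SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK, codec));

    try {
      for (long i = 0; i < 1000; i++) {
        writer.append(new LongWritable(i), new Text("record-" + i));
      }
    } finally {
      writer.close();
    }
  }
}
```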
Compression in the Hadoop Ecosystem

Pig
§  When to use: compressing data between MR jobs; typical in Pig scripts that include joins or other operators that expand your data size
§  What to use: enable compression and select the codec:
    pig.tmpfilecompression = true
    pig.tmpfilecompression.codec = gzip, lzo

Hive
§  When to use: intermediate files produced by Hive between multiple map-reduce jobs; Hive writes output to a table
§  What to use: enable intermediate or output compression:
    hive.exec.compress.intermediate = true
    hive.exec.compress.output = true

HBase
§  When to use: compress data at the column-family (CF) level (support for LZO, gzip, Snappy, and LZ4)
§  What to use: list the required JNI libraries in hbase.regionserver.codecs, then enable compression:
    create 'table', { NAME => 'colfam', COMPRESSION => 'LZO' }
    alter 'table', { NAME => 'colfam', COMPRESSION => 'LZO' }
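The same column-family compression can also be set from the Java client. This is a sketch against the HBase 0.96-era admin API (HBaseAdmin, HTableDescriptor), with hypothetical table and column-family names; it assumes the LZO native library is installed on every region server (hbase.regionserver.codecs can be used to make servers fail fast if it is not).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.io.compress.Compression;

public class CreateCompressedTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    // 'table' / 'colfam' mirror the shell example above (hypothetical names).
    HTableDescriptor table = new HTableDescriptor(TableName.valueOf("table"));
    HColumnDescriptor colfam = new HColumnDescriptor("colfam");
    colfam.setCompressionType(Compression.Algorithm.LZO);   // or GZ, SNAPPY, LZ4
    table.addFamily(colfam);

    admin.createTable(table);
    admin.close();
  }
}
```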
Compression in Hadoop at Yahoo!

[Three pie charts across the MR pipeline:
1 – Input data to Map (380M files on Jun 16, 2013, in /data and /projects): 98% / 2% split; codec share: zlib/default 73%, gzip 22%, bzip2 4%, LZO 1%
2 – Intermediate (Map) Output (4.2M jobs, Jun 10-16, 2013): 99.8% / 0.2% split; codec share: LZO 98.3%, gzip 1.1%, zlib/default 0.5%, bzip2 0.1%
3 – Final (Reduce) Output (4.2M jobs, Jun 10-16, 2013): 39.0% / 61.0% split; codec share: LZO 55%, gzip 35%, bzip2 5%, zlib/default 5%
Slide annotations: includes intermediate Pig/Hive compression; Pig intermediate output is compressed.]
Compression for Data Storage Efficiency

§  DSE considerations at Yahoo!
§  RCFile instead of SequenceFile
§  Faster implementation of bzip2
§  Native-code bzip2 codec
§  HADOOP-8462 (1), available in 0.23.7
§  Substituting the IPP library

(1) Native-code bzip2 implementation done in collaboration with Jason Lowe, Hadoop Core PMC member
IPP Libraries
§  Integrated Performance Primitives from Intel
§  Algorithmic and architectural optimizations
§  Processor-specific variants of each function
§  Applications remain processor-neutral
§  Compression: LZ, RLE, BWT, LZO
§  High level formats include: zlib, gzip, bzip2 and LZO
Measuring Standalone Performance
§  Standard programs (gzip, bzip2) used
§  Driver program written for other cases
§  32-bit mode
§  Single-threaded
§  JVM load overhead discounted
§  Default compression level
§  Quad-core Xeon machine
Data Corpuses Used
§  Binary files
§  Generated text from randomtextwriter
§  Wikipedia corpus
§  Silesia corpus
Compression Ratio

[Bar chart: file size in MB (0-300 scale) for the uncompressed data and after compression with zlib, bzip2, LZO, Snappy, and LZ4, for each of the four corpuses (exe, rtext, wiki, silesia).]
Compression Performance

[Bar chart: compression CPU time in seconds for zlib, IPP-zlib, Java-bzip2, bzip2, and IPP-bzip2 on the exe, rtext, wiki, and silesia corpuses (0-90 s scale); labeled values: 29, 23, 63, 44, and 26 seconds respectively.]
Compression Performance (Fast Algorithms)

[Bar chart: compression CPU time in seconds for LZO, Snappy, and LZ4 on the exe, rtext, wiki, and silesia corpuses (0-3.5 s scale); labeled values: 3.2, 2.9, and 1.7 seconds respectively.]
Decompression Performance

[Bar chart: decompression CPU time in seconds for zlib, IPP-zlib, Java-bzip2, bzip2, and IPP-bzip2 on the exe, rtext, wiki, and silesia corpuses (0-25 s scale); labeled values: 3, 2, 21, 17, and 12 seconds respectively.]
Decompression Performance (Fast Algorithms)

[Bar chart: decompression CPU time in seconds for LZO, Snappy, and LZ4 on the exe, rtext, wiki, and silesia corpuses (0-2 s scale); labeled values: 1.6, 1.1, and 0.7 seconds respectively.]
Compression Performance within Hadoop
§  Daytona performance framework
§  GridMix v1
§  Loadgen and sort jobs
§  Input data compressed with zlib / bzip2
§  LZO used for intermediate compression
§  35 datanodes, dual-quad-core machines
Map Performance

[Bar chart: map time in seconds – Java-bzip2: 47, bzip2: 46, IPP-bzip2: 46, zlib: 33.]
Reduce Performance

[Bar chart: reduce time in minutes – Java-bzip2: 31, bzip2: 28, IPP-bzip2: 18, zlib: 14.]
Job Performance

[Bar chart: job time in minutes for the sort and loadgen jobs. Sort – Java-bzip2: 38, bzip2: 34, IPP-bzip2: 23, zlib: 19. Loadgen – Java-bzip2: 38, bzip2: 34, IPP-bzip2: 25, zlib: 18.]
Future Work
§  Splittability support for native-code bzip2 codec
§  Enhancing Pig to use common bzip2 codec
§  Optimizing the JNI interface and buffer copies
§  Varying the compression effort parameter
§  Performance evaluation for 64-bit mode
§  Updating the zlib codec to specify alternative libraries
§  Other codec combinations, such as zlib for transient data
§  Other compression algorithms
Considerations in Selecting Compression Type
§  Nature of the data set
§  Chained jobs
§  Data-storage efficiency requirements
§  Frequency of compression vs. decompression
§  Requirement for compatibility with a standard data format
§  Splittability requirements
§  Size of the intermediate and final data
§  Alternative implementations of compression libraries