Facebook’s Approach to Big Data Storage Challenge


Weiyan Wang
Software Engineer (Data Infrastructure – HStore)
March 1, 2013
Agenda
1   Data Warehouse Overview and Challenge

2   Smart Retention

3   Sort Before Compression

4   HDFS Raid

5   Directory XOR & Compaction

6   Q&A
Life of a tag in Data Warehouse
  [Diagram: flow of a photo tag through the data warehouse]
• A user tags a photo on www.facebook.com; a log line <user_id, photo_id> is generated
• The log line reaches Scribeh (Scribe log storage) in ~10s; Puma realtime analytics
  counts users tagging photos in the last hour (~1 min)
• The copier/loader lands the log line in the Warehouse (~1 hr)
• Scrapes bring user info from UDB into the Warehouse (~1 day)
• Periodic analysis (nocron): daily report on count of photo tags by country (~1 day)
• Adhoc analysis (hipal): count photos tagged by females age 20-25 yesterday
History (2008/03-2012/03)
Data, Data, and more Data


              Facebook Users   Queries/Day   Scribe Data/Day   Nodes in Warehouse   Size (Total)
   Growth          14X             60X            250X               260X              2500X
Directions for handling the data growth problem
• Improve the software
•   HDFS Federation
•   Prism

• Improve storage efficiency
•   Store more data without increasing capacity
•   Increasingly important: translates into millions of
    dollars in savings
Ways to Improve Storage Efficiency
• Better capacity management

• Reduce space usage of Hive tables

• Reduce replication factor of data
Smart Retention – Motivation
• Hive table “retention” metadata
 •   Partitions older than the retention value are automatically
     purged by the system

• Table owners are unaware of table usage
 •   Difficult to set the retention value correctly at the beginning

• An improper retention setting may waste space
 •   e.g., users only access the most recent 30 days of partitions
     in a table with a 3-month retention
Smart Retention
• Add a post-execute hook that logs table/partition
  names and query start time to MySQL.

• Calculate the “empirical retention” per table
  Given a partition P with creation time CT_P:
    Data_age_at_last_query(P) = max{ StartTime_Q - CT_P : for every query Q that accesses P }
  Given a table T:
    Empirical_retention(T) = max{ Data_age_at_last_query(P) : for every partition P ∈ T }
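As a minimal sketch, the empirical retention can be computed straight from the logged accesses; the two nested maxima collapse into a single MAX when grouping by table. The table and column names below (query_access_log, partition_create_time, query_start_time) are hypothetical stand-ins for the hook's actual schema:

  -- Hypothetical access-log table written by the post-execute hook.
  -- DATEDIFF(query_start_time, partition_create_time) is the data age, in days,
  -- at the moment a query touched the partition; the per-table maximum is the
  -- empirical retention.
  SELECT table_name,
         MAX(DATEDIFF(query_start_time, partition_create_time)) AS empirical_retention_days
  FROM query_access_log
  GROUP BY table_name;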
Smart Retention
• Send Empirical_retention(T) to table owners with a
  call to action:
 •   Accept the empirical value and change the retention
 •   Review the table's query history and figure out a better setting

• After 2 weeks, the system will archive partitions
  that are older than Empirical_retention(T)
 •   Free up space after partitions get archived
 •   Users need to restore archived data before querying it
Smart Retention – Things Learned
• Table query history enables table owners to
  identify outliers:
 •   A table was mostly queried for data less than 32 days old,
     but one query accessed a 42-day-old partition

• Prioritize the tables with the most space savings
 •   Saved 8 PB from the top 100 tables!
Sort Before Compression - Motivation
• In the RCFile format, data are stored column-wise inside
  every row group
 •   Sorting by one or two columns with many duplicate values
     reduces the final compressed data size

• Trade extra computation for space savings
Sort Before Compression
• Identify the best column to sort by
 •   Take a sample of the table and sort it by each column. Pick
     the column with the most space savings (see the sketch below).

• Transfer target partitions from service clusters to
  compute clusters
• Sort them into compressed RCFile format
• Transfer the sorted partitions back to the service
  clusters to replace the original ones
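One way to run the column-selection step, as a rough sketch: materialize a small sample once, rewrite it sorted by each candidate column, and compare the compressed output sizes. The table names hive_table_sample and sort_trial, the 1% sampling rate, and the candidate column userid are illustrative assumptions, not the production pipeline:

  -- Materialize a ~1% sample of the partition as compressed RCFile.
  SET hive.exec.compress.output=true;
  CREATE TABLE hive_table_sample STORED AS RCFILE AS
    SELECT * FROM hive_table
    WHERE ds = '2012-08-06' AND source_type = 'mobile' AND rand() <= 0.01;

  -- Rewrite the sample sorted by one candidate column; repeat per candidate
  -- and compare the resulting file sizes (e.g., with hadoop fs -du).
  CREATE TABLE sort_trial STORED AS RCFILE AS
    SELECT * FROM hive_table_sample
    SORT BY userid;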
How we sort
-- Cap the number of reducers and enlarge the RCFile record buffer (64 MB)
-- so each reducer writes large, well-compressed row groups.
set hive.exec.reducers.max=1024;
set hive.io.rcfile.record.buffer.size=67108864;

-- Rewrite the partition into a sorted copy (source_type='mobile_sort').
-- The regex column spec selects every column except the partition columns.
INSERT OVERWRITE TABLE hive_table PARTITION (ds='2012-08-06', source_type='mobile_sort')
    SELECT `(ds|source_type)?+.+` FROM hive_table
    WHERE ds='2012-08-06' AND source_type='mobile'
    -- Spread rows with NULL or 0 userid randomly across reducers to avoid skew;
    -- all other rows with the same userid go to the same reducer.
    DISTRIBUTE BY IF(userid <> 0 AND userid IS NOT NULL, userid, CAST(RAND() AS STRING))
    SORT BY userid, ip_address;
Sort Before Compression – Things Learned
• Sorting achieves >40% space savings!

• It's important to verify data correctness
 •   Compare the original and sorted partitions' hash values
     (see the sketch below)
 •   This comparison caught a Hive bug

• Sort cold data first, and gradually move to hot
  data
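A minimal sketch of the verification step, assuming only two columns (userid, ip_address) for brevity; in practice the checksum would cover every column. An order-independent checksum must match between the original partition and its sorted replacement:

  -- SUM of per-row hashes is insensitive to row order, so it should be
  -- identical for the 'mobile' and 'mobile_sort' copies of the partition.
  SELECT src, SUM(CAST(hash(userid, ip_address) AS BIGINT)) AS checksum
  FROM (
    SELECT 'original' AS src, userid, ip_address FROM hive_table
     WHERE ds = '2012-08-06' AND source_type = 'mobile'
    UNION ALL
    SELECT 'sorted' AS src, userid, ip_address FROM hive_table
     WHERE ds = '2012-08-06' AND source_type = 'mobile_sort'
  ) t
  GROUP BY src;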
HDFS Raid
• In HDFS, data are 3X replicated
  [Diagram: a client sends metadata operations for /warehouse/file1 to the
  NameNode and reads/writes block data directly on the DataNodes; each of the
  file's three blocks (1, 2, 3) is replicated on three of the five DataNodes.]
HDFS Raid – File-level XOR (10, 1)
  [Diagram: /warehouse/file1 before and after file-level XOR (10, 1) raiding.]
• Before: blocks 1-10 stored at 3 replicas each (3X)
• After: blocks 1-10 kept at 2 replicas, plus a parity file
  /raid/warehouse/file1 holding one XOR parity block (block 11) at 2 replicas
• Effective replication factor: 2.2X
HDFS Raid
• What if a file has 15 blocks?
 •   Treat it as 20 blocks and generate one parity file with 2
     blocks
 •   Replication factor = (15*2 + 2*2)/15 = 2.27

• Reconstruction
 •   Online reconstruction – DistributedRaidFileSystem
 •   Offline reconstruction – RaidNode

• Block Placement
HDFS Raid – File-level Reed Solomon (10, 4)
  [Diagram: /warehouse/file1 before and after file-level Reed-Solomon raiding.]
• Before: blocks 1-10 stored at 3 replicas each (3X)
• After: blocks 1-10 kept at 1 replica, plus a parity file
  /raidrs/warehouse/file1 holding four RS parity blocks (blocks 11-14) at 1 replica
• Effective replication factor: 1.4X
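As a worked restatement of the two figures above, using the same arithmetic as the 15-block example:

  Replication factor = (sourceBlocks*sourceReplicas + parityBlocks*parityReplicas) / sourceBlocks
  XOR (10, 1), everything kept at 2 replicas: (10*2 + 1*2)/10 = 2.2
  RS (10, 4), everything kept at 1 replica: (10*1 + 4*1)/10 = 1.4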
HDFS Raid – Hybrid Storage
  [Diagram: life of the file /warehouse/facebook.jpg under hybrid storage.]
• Born: 3X replicated
• 1 day old: XOR raided, 2.2X
• 3 months old and older: RS raided, 1.4X
HDFS Raid – Things Learned
• Replication factor 3 -> 2.65 (12% space savings)

• Avoid flooding the NameNode with requests
 •   A daily pipeline scans the fsimage to pick raidable files
     rather than recursively searching from the NameNode

• Small files prevent further replication reduction
 •   50% of files in the warehouse have only 1 or 2
     blocks; they are too small to be raided.
Raid Warm Small Files: Directory level XOR
  [Diagram: /data/file1, file2, file3, file4 (10 blocks total) before and
  after directory-level XOR.]
• Before: file-level XOR where possible (parity files /raid/data/file1 and
  /raid/data/file3 at 2 replicas each): 2.7X overall
• After: blocks kept at 2 replicas plus one directory-level parity file
  /dir-raid/data (block 11) at 2 replicas: 2.2X
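The after-side arithmetic matches the file-level XOR case (block counts read off the figure): ten source blocks plus one directory-level parity block, all at 2 replicas, give (10*2 + 1*2)/10 = 2.2, versus 2.7X when each raidable file carries its own parity and the remaining files are too small to raid at all.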
Handle Directory Change
• Directory changes happen very infrequently in the warehouse
  [Diagram: numbered flow for the directory /namespace/infra/ds=2013-07-07,
  which contains file1, file2, file3 and their parity.]
  1. When the directory is raided, the stripe store (MySQL) records which
     blocks form each stripe: Blk_file_1, Blk_file_2, Blk_file_3 and
     Blk_parity all map to Strp_1.
  2. The directory later changes (file3 is moved to trash and a file4 appears).
  3. A client tries to read file2 and encounters missing blocks. It looks at
     the stripe table, figures out that file4 does not belong to the stripe
     and that file3 is in trash, and reconstructs file2.
  4. The RaidNode re-raids the directory before file3 is actually deleted
     from the cluster.
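A minimal sketch of the stripe store; the slide only shows a block-id to stripe-id mapping in MySQL, so the table and column names below are hypothetical:

  -- Recorded at raid time by the RaidNode: which blocks form each stripe.
  CREATE TABLE stripe_store (
    block_id  VARCHAR(64) NOT NULL,
    stripe_id VARCHAR(64) NOT NULL,
    PRIMARY KEY (block_id),
    KEY idx_stripe (stripe_id)
  );

  -- A reader that hits a missing block of file2 looks up the sibling blocks
  -- of the same stripe and uses them (plus the parity) to reconstruct it.
  SELECT block_id
  FROM stripe_store
  WHERE stripe_id = (SELECT stripe_id FROM stripe_store
                     WHERE block_id = 'Blk_file_2');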
Raid Cold Small Files: Compaction
• Compact cold small files into large files and apply
  file-level RS
 •   No need to handle directory changes with file-level RS
  •   Re-raiding a directory-RS-raided directory is expensive
 •   Raid-aware compaction can achieve the best space savings
  •   Change the block size so compaction produces files whose
      block counts are multiples of ten
 •   Reduces the amount of metadata
Raid-Aware Compaction
▪       Compaction settings:
         set mapred.min.split.size = 39*blockSize;
         set mapred.max.split.size = 39*blockSize;
         set mapred.min.split.size.per.node = 39*blockSize;
         set mapred.min.split.size.per.rack = 39*blockSize;
         set dfs.block.size = blockSize;
         set hive.input.format =
              org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

▪       Calculate the best block size for a partition
    ▪    Make sure bestBlockSize * N ≈ partition size, where
         N = 39p + q (p ∈ N+, q ∈ {10, 20, 30})
    ▪    Compaction will then generate p 40-block files and one
         q-block file (see the worked example below)
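A rough worked example with a hypothetical partition size: a 27 GB partition can pick bestBlockSize = 256 MB, so that N = 27 GB / 256 MB = 108 = 39*2 + 30; compaction then produces two roughly 40-block files and one 30-block file, all close to multiples of ten blocks and therefore cheap to RS-raid with (10, 4).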
Raid-Aware Compaction
▪       Compact SeqFile format partition
    ▪    INSERT OVERWRITE TABLE seq_table
         PARTITION (ds = "2012-08-17")
            SELECT `(ds)?+.+` FROM seq_table
            WHERE ds = "2012-08-17";
▪       Compact RCFile format partition
    ▪    ALTER TABLE rc_table PARTITION
             (ds="2009-08-31") CONCATENATE;
Directory XOR & Compaction - Things Learned
 • Replication factor 2.65 -> 2.35! (an additional 12% space
   savings); still rolling out

 • Keeping a record of blocks' checksums could avoid data
   corruption caused by bugs

 • HDFS being unaware of Raid causes some issues
  •   Operational errors could cause data loss (e.g., forgetting to
      move parity data along with source data)

 • Directory XOR & Compaction only work for warehouse
   data
Questions?
