HBase Storage Internals, present and future!
    Matteo Bertozzi | @Cloudera
    March 2013 - Hadoop Summit Europe




1
What is HBase?
    • Open source Storage Manager that provides random
      read/write on top of HDFS
    • Provides Tables with a “Key:Column/Value” interface
        • Dynamic columns (qualifiers), no schema needed
        • “Fixed” column groups (families)
        • table[row:family:column] = value
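The data model above can be sketched in a few lines of Python. This is an illustrative in-memory stand-in, not the HBase client API: a table maps a row key plus a "family:qualifier" column name to a value, and qualifiers can be added freely without any schema change.

```python
# Toy model of table[row:family:column] = value.
# Families are fixed per table; qualifiers are dynamic.
from collections import defaultdict

table = defaultdict(dict)  # row -> {"family:qualifier": value}

def put(row, family, qualifier, value):
    table[row][f"{family}:{qualifier}"] = value

def get(row, family, qualifier):
    return table[row].get(f"{family}:{qualifier}")

put("row-1", "cf", "name", "Matteo")
put("row-1", "cf", "city", "Dublin")  # new qualifier, no schema change needed
```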




2
HBase ecosystem
    • Apache Hadoop HDFS for data durability and
      reliability (Write-Ahead Log)
    • Apache ZooKeeper for distributed coordination
    • Apache Hadoop MapReduce built-in support
      for running MapReduce jobs

    [diagram: App and MR sit on top of HBase, which relies on ZK and HDFS]




3
How HBase Works
    “View from 10000ft”




4
Master, Region Servers and Regions
    • Region Server
        • Server that contains a set of Regions
        • Responsible for handling reads and writes
    • Region
        • The basic unit of scalability in HBase
        • Subset of the table’s data
        • Contiguous, sorted range of rows stored together
    • Master
        • Coordinates the HBase Cluster
            • Assignment/Balancing of the Regions
        • Handles admin operations
            • create/delete/modify table, …

    [diagram: Client talks to ZooKeeper and the Master; Region Servers each host several Regions on top of HDFS]


5
Autosharding and .META. table
    •   A Region is a subset of the table’s data
    •   When there is too much data in a Region…
           • a split is triggered, creating 2 regions
    •   The association “Region -> Server” is stored in a System Table
    •   The location of .META. is stored in ZooKeeper

          Table       Start Key   Region ID   Region Server
          testTable   Key-00      1           machine01.host
          testTable   Key-31      2           machine03.host
          testTable   Key-65      3           machine02.host
          testTable   Key-83      4           machine01.host
          …           …           …           …
          users       Key-AB      1           machine03.host
          users       Key-KG      2           machine02.host

    [diagram: machine01 hosts Regions 1 and 4 of testTable; machine02 hosts Region 3 of testTable and Region 1 of users; machine03 hosts Region 2 of testTable and Region 2 of users]
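The "which region owns this key" lookup can be sketched as follows. Since regions are sorted by start key, the owning region is the last one whose start key is less than or equal to the lookup key. The rows and host names are illustrative, not the real .META. layout:

```python
# Locate the region responsible for a key, .META.-style.
import bisect

meta = [  # (start_key, region_id, region_server), sorted by start_key
    ("Key-00", 1, "machine01.host"),
    ("Key-31", 2, "machine03.host"),
    ("Key-65", 3, "machine02.host"),
    ("Key-83", 4, "machine01.host"),
]
start_keys = [r[0] for r in meta]

def locate(key):
    # Last region whose start key is <= key.
    idx = bisect.bisect_right(start_keys, key) - 1
    return meta[max(idx, 0)]

region = locate("Key-42")  # falls in [Key-31, Key-65) -> region 2
```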




6
The Write Path – Create a New Table
• The client asks the master to create a new Table
    • hbase> create ‘myTable’, ‘cf’

• The Master
    • Stores the Table information (“schema”)
    • Creates Regions based on the key-splits provided
          • if no splits are provided, one single region by default
    • Assigns the Regions to the Region Servers
          • The assignment Region -> Server
            is written to a system table called “.META.”

[diagram: the Client calls createTable() on the Master, which stores the Table “Metadata”, then assigns and “enables” the Regions on the Region Servers]




7
The Write Path – “Inserting” data
•   table.put(row-key:family:column, value)

•   The client asks ZooKeeper for the location of .META.
•   The client scans .META. searching for the Region Server
    responsible for handling the Key
•   The client asks the Region Server to insert/update/delete
    the specified key/value
•   The Region Server processes the request and dispatches it to
    the Region responsible for handling the Key
       • The operation is written to a Write-Ahead Log (WAL)
       • …and the KeyValues added to the Store: “MemStore”

[diagram: the Client asks ZooKeeper “Where is .META.?”, scans .META. on one Region Server, then sends “Insert KeyValue” to the Region Server owning the key]
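The WAL-first ordering in the last two bullets can be sketched like this (structures and names are invented for illustration; in the real system the WAL lives on HDFS and survives a crash):

```python
# Minimal sketch of the write path inside a region server:
# append to the WAL first for durability, then update the MemStore.
wal = []        # append-only log
memstore = {}   # in-memory store, sorted only at flush time

def apply(op, key, value=None):
    wal.append((op, key, value))   # 1. durability first
    if op == "put":
        memstore[key] = value      # 2. then the in-memory store
    elif op == "delete":
        memstore[key] = None       # a tombstone, not an in-place removal

apply("put", "row-1:cf:col", "v1")
apply("delete", "row-1:cf:col")
```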




8
The Write Path – Append Only to Random R/W
• Files in HDFS are
      • Append-Only
      • Immutable once closed

• HBase provides Random Writes?
      • …not really from a storage point of view
      • KeyValues are stored in memory and written to disk on pressure
           • Don’t worry, your data is safe in the WAL!
                •   (The Region Server can recover data from the WAL in case of crash)
           • But this allows sorting data by Key before writing to disk

      • Deletes are like Inserts but with a “remove me flag”

[diagram: inside a Region Server, each Region writes to a shared WAL and to its MemStore, which flushes to Store Files (HFiles) holding sorted entries Key0–Key5]
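The flush behaviour can be sketched like this (an in-memory toy, with None standing in for the "remove me" marker):

```python
# Writes arrive in any order; a flush emits the keys sorted,
# producing an append-only, immutable store file.
memstore = {}
store_files = []  # each flush appends one immutable "file" (a tuple here)

def put(key, value):
    memstore[key] = value

def delete(key):
    memstore[key] = None  # tombstone: a record, not an in-place removal

def flush():
    global memstore
    store_files.append(tuple(sorted(memstore.items())))  # sorted + immutable
    memstore = {}

put("Key3", "value 3"); put("Key0", "value 0"); delete("Key3")
flush()
```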




9
The Read Path – “reading” data
• The client asks ZooKeeper for the location of .META.
• The client scans .META. searching for the Region Server
  responsible for handling the Key
• The client asks the Region Server to get the specified key/value.
• The Region Server processes the request and dispatches it to the
  Region responsible for handling the Key
     • MemStore and Store Files are scanned to find the key

[diagram: the Client asks ZooKeeper “Where is .META.?”, scans .META., then sends “Get Key” to the owning Region Server]




10
The Read Path – Append Only to Random R/W
•    Each flush creates a new file
•    Each file has its KeyValues sorted by key
•    Two or more files can contain the same key
     (updates/deletes)
•    To find a Key you need to scan all the files
        • …with some optimizations
        • Filter files by Start/End Key
        • Having a bloom filter on each file

     [example: an older file holding Key0 – value 0.0, Key2, Key3, Key5, Key8, Key9 alongside a newer file holding Key0 – value 0.1, Key1, Key5 – [deleted], Key6, Key7]
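A sketch of the read logic across several store files: newer files win for duplicated keys, tombstones hide older values, and files whose key range or bloom filter excludes the key are skipped. The files and keys are illustrative, and the `in f` membership check stands in for a bloom filter:

```python
# Read across multiple immutable files, newest first.
files = [  # oldest first; each file is {key: value}, None = tombstone
    {"Key0": "value 0.0", "Key5": "value 5.0"},
    {"Key0": "value 0.1", "Key5": None},  # Key5 deleted later
]

def read(key):
    for f in reversed(files):              # newest file first
        if not (min(f) <= key <= max(f)):  # start/end key filter
            continue
        if key in f:                       # bloom-filter stand-in
            return f[key]                  # may be a tombstone (None)
    return None
```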




11
HFile
     HBase Store File Format




12
HFile Format
•    Only Sequential Writes, just append(key, value)
•    Large Sequential Reads are better
•    Why group records in blocks?
        • Easy to split
        • Easy to read
        • Easy to cache
        • Easy to index (if records are sorted)
        • Block Compression (snappy, lz4, gz, …)

     Key/Value (record) layout:
        Key Length : int
        Value Length : int
        Key : byte[]
        Value : byte[]

     File layout: a sequence of Blocks (Header, Record 0 … Record N),
     followed by the block Index (Index 0 … Index N) and a Trailer.


13
Data Block Encoding
•    “Be aware of the data”
•    Block Encoding allows compressing the Key based on what we know
        • Keys are sorted… prefixes may be similar in most cases
        • One file contains keys from one Family only
        • Timestamps are “similar”, we can store the diff
        • Type is “put” most of the time…

     “on-disk” KeyValue layout:
        Row Length : short
        Row : byte[]
        Family Length : byte
        Family : byte[]
        Qualifier : byte[]
        Timestamp : long
        Type : byte
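The key-prefix idea can be sketched as a generic prefix encoding (not HBase's actual encoders): since keys in a block are sorted, each key is stored as a shared-prefix length plus the differing suffix.

```python
# Prefix-encode a sorted list of keys relative to the previous key.
def common_prefix_len(a: bytes, b: bytes) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def encode(keys):
    prev, out = b"", []
    for k in keys:
        p = common_prefix_len(prev, k)
        out.append((p, k[p:]))  # reuse p bytes of the previous key
        prev = k
    return out

def decode(encoded):
    prev, out = b"", []
    for p, suffix in encoded:
        k = prev[:p] + suffix
        out.append(k)
        prev = k
    return out

keys = [b"row-0001:cf:a", b"row-0001:cf:b", b"row-0002:cf:a"]
```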




14
Compactions
     Optimize the read-path




15
Compactions
•    Reduce the number of files to look into during a scan
        • Removing duplicated keys (updated values)
        • Removing deleted keys
•    Creates a new file by merging the content of two or more files
        • Remove the old files

     [example: two files — one with Key0 – value 0.0, Key2, Key3, Key5 and one with Key0 – value 0.1, Key1, Key4, Key5 – [deleted], Key6, Key7, Key8, Key9 — merged into a single file holding Key0 – value 0.1, Key1, Key2, Key4, Key6, Key7, Key8, Key9]
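A compaction merge can be sketched like this (None stands in for a deleted marker; dropping tombstones corresponds to what a major compaction does):

```python
# Merge sorted files, keep only the newest version of each key,
# and optionally drop tombstones. The inputs can then be removed.
def compact(files, drop_deletes=True):
    merged = {}
    for f in files:  # oldest first: later files overwrite older values
        merged.update(f)
    return {k: v for k, v in sorted(merged.items())
            if not (drop_deletes and v is None)}

old = {"Key0": "value 0.0", "Key3": "value 3.0", "Key5": "value 5.0"}
new = {"Key0": "value 0.1", "Key5": None}  # one update, one delete
result = compact([old, new])
```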




16
Pluggable Compactions
•    Try different algorithms
•    Be aware of the data
        • Time Series? I guess no updates from the 80s
•    Be aware of the requests
        • Compact based on statistics
        • which files are hot and which are not
        • which keys are hot and which are not

     [example: the same merge of two files into one, as in the previous slide]
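A hypothetical statistics-driven selection policy might look like the sketch below. The fields and the ranking rule are made up for illustration; they are not HBase's actual compaction policies:

```python
# Pick files to compact from per-file statistics instead of a fixed rule.
def select_for_compaction(file_stats, max_files=3):
    # Prefer small, frequently-read ("hot") files:
    # merging them gives the biggest read-path payoff.
    ranked = sorted(file_stats, key=lambda f: (f["size"], -f["reads"]))
    return [f["name"] for f in ranked[:max_files]]

stats = [
    {"name": "hfile-a", "size": 10, "reads": 900},
    {"name": "hfile-b", "size": 500, "reads": 5},
    {"name": "hfile-c", "size": 12, "reads": 700},
]
```

A pluggable policy is just this function swapped out per table or per use case.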




17
Snapshots
     Zero-Copy Snapshots and Table Clones




18
What Is a Snapshot?
     • “a Snapshot is not a copy of the table”
     • a Snapshot is a set of metadata information
          • The table “schema” (column families and attributes)
          • The Regions information (start key, end key, …)
          • The list of Store Files
          • The list of active WALs

     [diagram: the Master coordinates via ZooKeeper; each Region Server holds its Regions, a WAL and Store Files (HFiles)]
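The "metadata only" idea can be sketched as below (field names are illustrative): taking a snapshot copies lists of names, never the data, and later writes to the table don't affect it.

```python
# A snapshot is just recorded metadata: schema, region boundaries,
# and the names of the (immutable) store files at that moment.
def take_snapshot(table):
    return {
        "schema": dict(table["schema"]),
        "regions": [dict(r) for r in table["regions"]],
        "store_files": list(table["store_files"]),  # names only, no copy
    }

table = {
    "schema": {"families": ["cf"]},
    "regions": [{"start": "", "end": "Key-50"}],
    "store_files": ["hfile-001", "hfile-002"],
}
snap = take_snapshot(table)
table["store_files"].append("hfile-003")  # later flushes don't touch it
```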



19
How Taking a Snapshot Works
     •   The master orchestrates the RSs
           • the communication is done via ZooKeeper
           • using a “2-phase commit like” transaction (prepare/commit)
     •   Each RS is responsible for taking its “piece” of the snapshot
           • For each Region, store the metadata information needed
             (list of Store Files, WALs, region start/end keys, …)

     [diagram: the Master coordinates via ZooKeeper; each Region Server holds its Regions, a WAL and Store Files (HFiles)]



20
Cloning a Table from a Snapshot
     •   hbase> clone_snapshot ‘snapshotName’, ‘tableName’
         …



     •   Creates a new table with the data “contained” in the snapshot
     •   No data copies involved
            • HFiles are immutable, and shared between tables and snapshots
     •   You can insert/update/remove data from the new table
            • No repercussions on the snapshot, original tables or other cloned tables
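Cloning can be sketched the same way (illustrative structures, not the real implementation): the clone starts with references to the same immutable HFiles, and new writes add files to the clone only.

```python
# A clone shares the snapshot's immutable file references; no data copy.
def clone_snapshot(snap):
    return {"store_files": list(snap["store_files"])}  # new list, same names

snap = {"store_files": ["hfile-001", "hfile-002"]}
clone = clone_snapshot(snap)
clone["store_files"].append("hfile-new")  # a write lands in the clone only
```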




21
Compactions & Archiving
     •   HFiles are immutable, and shared between tables and snapshots

     •   On compaction or table deletion, files are removed from disk
     •   If one of these files is referenced by a snapshot or a cloned table
             • The file is moved to an “archive” directory
             • And deleted later, when there are no references to it
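The archiving rule can be sketched as follows (names and structures are invented): a file leaving the table is parked in the archive if any snapshot or clone still references it, and deleted outright otherwise.

```python
# Archive-or-delete decision for a file removed by compaction/deletion.
def remove_file(name, references, disk, archive):
    disk.discard(name)
    if any(name in ref for ref in references.values()):
        archive.add(name)  # still referenced: keep it in the archive
    # else: no references left, the file is gone for good

disk = {"hfile-001", "hfile-002"}
archive = set()
refs = {"snapshot-1": {"hfile-001"}, "clone-t2": set()}
remove_file("hfile-001", refs, disk, archive)  # referenced -> archived
remove_file("hfile-002", refs, disk, archive)  # unreferenced -> deleted
```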




22
Future
     What can be improved?




23
0.96 is coming up
     •   Moving RPC to Protobuf
           •   Allows rolling upgrades with no surprises
     • HBase Snapshots
     • Pluggable Compactions
     • Remove -ROOT-
     • Table Locks




24
0.98 and Beyond
     • Transparent Table/Column-Family Encryption
     • Cell-level security
     • Multiple WALs per Region Server (MTTR)
     • Data Placement Awareness (MTTR)
     • Data Type Awareness
     • Compaction policies, based on the data needs
     • Managing blocks directly (instead of files)


25
     Questions?
     Matteo Bertozzi | @Cloudera




26