HBase Storage Internals, present and future!
    Matteo Bertozzi | @Cloudera
    March 2013 - Hadoop Summit Europe




1
What is HBase?
    • Open source Storage Manager that provides random
      read/write on top of HDFS
    • Provides Tables with a “Key:Column/Value” interface
        • Dynamic columns (qualifiers), no schema needed
        • “Fixed” column groups (families)
        • table[row:family:column] = value
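The data model above can be sketched in a few lines of Python. This is an illustrative in-memory stand-in, not the HBase client API: a table maps a row key plus a "family:qualifier" column name to a value, and qualifiers can be added freely without any schema change.

```python
# Toy model of table[row:family:column] = value.
# Families are fixed per table; qualifiers are dynamic.
from collections import defaultdict

table = defaultdict(dict)  # row -> {"family:qualifier": value}

def put(row, family, qualifier, value):
    table[row][f"{family}:{qualifier}"] = value

def get(row, family, qualifier):
    return table[row].get(f"{family}:{qualifier}")

put("row-1", "cf", "name", "Matteo")
put("row-1", "cf", "city", "Dublin")  # new qualifier, no schema change needed
```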




2
HBase ecosystem
    • Apache Hadoop HDFS for data durability and
      reliability (Write-Ahead Log)
    • Apache ZooKeeper for distributed coordination
    • Apache Hadoop MapReduce built-in support
      for running MapReduce jobs

    [diagram: App and MR sit on top of HBase, which relies on ZK and HDFS]




3
How HBase Works
    “View from 10000ft”




4
Master, Region Servers and Regions
    • Region Server
        • Server that contains a set of Regions
        • Responsible for handling reads and writes
    • Region
        • The basic unit of scalability in HBase
        • Subset of the table’s data
        • Contiguous, sorted range of rows stored together
    • Master
        • Coordinates the HBase Cluster
            • Assignment/Balancing of the Regions
        • Handles admin operations
            • create/delete/modify table, …

    [diagram: Client talks to ZooKeeper and the Master; Region Servers each host several Regions on top of HDFS]


5
Autosharding and .META. table
    •   A Region is a subset of the table’s data
    •   When there is too much data in a Region…
           • a split is triggered, creating 2 regions
    •   The association “Region -> Server” is stored in a System Table
    •   The location of .META. is stored in ZooKeeper

          Table       Start Key   Region ID   Region Server
          testTable   Key-00      1           machine01.host
          testTable   Key-31      2           machine03.host
          testTable   Key-65      3           machine02.host
          testTable   Key-83      4           machine01.host
          …           …           …           …
          users       Key-AB      1           machine03.host
          users       Key-KG      2           machine02.host

    [diagram: machine01 hosts Regions 1 and 4 of testTable; machine02 hosts Region 3 of testTable and Region 1 of users; machine03 hosts Region 2 of testTable and Region 2 of users]
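The "which region owns this key" lookup can be sketched as follows. Since regions are sorted by start key, the owning region is the last one whose start key is less than or equal to the lookup key. The rows and host names are illustrative, not the real .META. layout:

```python
# Locate the region responsible for a key, .META.-style.
import bisect

meta = [  # (start_key, region_id, region_server), sorted by start_key
    ("Key-00", 1, "machine01.host"),
    ("Key-31", 2, "machine03.host"),
    ("Key-65", 3, "machine02.host"),
    ("Key-83", 4, "machine01.host"),
]
start_keys = [r[0] for r in meta]

def locate(key):
    # Last region whose start key is <= key.
    idx = bisect.bisect_right(start_keys, key) - 1
    return meta[max(idx, 0)]

region = locate("Key-42")  # falls in [Key-31, Key-65) -> region 2
```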




6
The Write Path – Create a New Table
• The client asks the master to create a new Table
    • hbase> create ‘myTable’, ‘cf’

• The Master
    • Stores the Table information (“schema”)
    • Creates Regions based on the key-splits provided
          • if no splits are provided, one single region by default
    • Assigns the Regions to the Region Servers
          • The assignment Region -> Server
            is written to a system table called “.META.”

[diagram: the Client calls createTable() on the Master, which stores the Table “Metadata”, then assigns and “enables” the Regions on the Region Servers]




7
The Write Path – “Inserting” data
•   table.put(row-key:family:column, value)

•   The client asks ZooKeeper for the location of .META.
•   The client scans .META. searching for the Region Server
    responsible for handling the Key
•   The client asks the Region Server to insert/update/delete
    the specified key/value
•   The Region Server processes the request and dispatches it to
    the Region responsible for handling the Key
       • The operation is written to a Write-Ahead Log (WAL)
       • …and the KeyValues added to the Store: “MemStore”

[diagram: the Client asks ZooKeeper “Where is .META.?”, scans .META. on one Region Server, then sends “Insert KeyValue” to the Region Server owning the key]
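The WAL-first ordering in the last two bullets can be sketched like this (structures and names are invented for illustration; in the real system the WAL lives on HDFS and survives a crash):

```python
# Minimal sketch of the write path inside a region server:
# append to the WAL first for durability, then update the MemStore.
wal = []        # append-only log
memstore = {}   # in-memory store, sorted only at flush time

def apply(op, key, value=None):
    wal.append((op, key, value))   # 1. durability first
    if op == "put":
        memstore[key] = value      # 2. then the in-memory store
    elif op == "delete":
        memstore[key] = None       # a tombstone, not an in-place removal

apply("put", "row-1:cf:col", "v1")
apply("delete", "row-1:cf:col")
```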




8
The Write Path – Append Only to Random R/W
• Files in HDFS are
      • Append-Only
      • Immutable once closed

• HBase provides Random Writes?
      • …not really from a storage point of view
      • KeyValues are stored in memory and written to disk on pressure
           • Don’t worry, your data is safe in the WAL!
                •   (The Region Server can recover data from the WAL in case of crash)
           • But this allows sorting data by Key before writing to disk

      • Deletes are like Inserts but with a “remove me flag”

[diagram: inside a Region Server, each Region writes to a shared WAL and to its MemStore, which flushes to Store Files (HFiles) holding sorted entries Key0–Key5]
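The flush behaviour can be sketched like this (an in-memory toy, with None standing in for the "remove me" marker):

```python
# Writes arrive in any order; a flush emits the keys sorted,
# producing an append-only, immutable store file.
memstore = {}
store_files = []  # each flush appends one immutable "file" (a tuple here)

def put(key, value):
    memstore[key] = value

def delete(key):
    memstore[key] = None  # tombstone: a record, not an in-place removal

def flush():
    global memstore
    store_files.append(tuple(sorted(memstore.items())))  # sorted + immutable
    memstore = {}

put("Key3", "value 3"); put("Key0", "value 0"); delete("Key3")
flush()
```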




9
The Read Path – “reading” data
• The client asks ZooKeeper for the location of .META.
• The client scans .META. searching for the Region Server
  responsible for handling the Key
• The client asks the Region Server to get the specified key/value.
• The Region Server processes the request and dispatches it to the
  Region responsible for handling the Key
     • MemStore and Store Files are scanned to find the key

[diagram: the Client asks ZooKeeper “Where is .META.?”, scans .META., then sends “Get Key” to the owning Region Server]




10
The Read Path – Append Only to Random R/W
•    Each flush creates a new file
•    Each file has its KeyValues sorted by key
•    Two or more files can contain the same key
     (updates/deletes)
•    To find a Key you need to scan all the files
        • …with some optimizations
        • Filter files by Start/End Key
        • Having a bloom filter on each file

     [example: an older file holding Key0 – value 0.0, Key2, Key3, Key5, Key8, Key9 alongside a newer file holding Key0 – value 0.1, Key1, Key5 – [deleted], Key6, Key7]
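A sketch of the read logic across several store files: newer files win for duplicated keys, tombstones hide older values, and files whose key range or bloom filter excludes the key are skipped. The files and keys are illustrative, and the `in f` membership check stands in for a bloom filter:

```python
# Read across multiple immutable files, newest first.
files = [  # oldest first; each file is {key: value}, None = tombstone
    {"Key0": "value 0.0", "Key5": "value 5.0"},
    {"Key0": "value 0.1", "Key5": None},  # Key5 deleted later
]

def read(key):
    for f in reversed(files):              # newest file first
        if not (min(f) <= key <= max(f)):  # start/end key filter
            continue
        if key in f:                       # bloom-filter stand-in
            return f[key]                  # may be a tombstone (None)
    return None
```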




11
HFile
     HBase Store File Format




12
HFile Format
•    Only Sequential Writes, just append(key, value)
•    Large Sequential Reads are better
•    Why group records in blocks?
        • Easy to split
        • Easy to read
        • Easy to cache
        • Easy to index (if records are sorted)
        • Block Compression (snappy, lz4, gz, …)

     Key/Value (record) layout:
        Key Length : int
        Value Length : int
        Key : byte[]
        Value : byte[]

     File layout: a sequence of Blocks (Header, Record 0 … Record N),
     followed by the block Index (Index 0 … Index N) and a Trailer.


13
Data Block Encoding
•    “Be aware of the data”
•    Block Encoding allows compressing the Key based on what we know
        • Keys are sorted… prefixes may be similar in most cases
        • One file contains keys from one Family only
        • Timestamps are “similar”, we can store the diff
        • Type is “put” most of the time…

     “on-disk” KeyValue layout:
        Row Length : short
        Row : byte[]
        Family Length : byte
        Family : byte[]
        Qualifier : byte[]
        Timestamp : long
        Type : byte
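The key-prefix idea can be sketched as a generic prefix encoding (not HBase's actual encoders): since keys in a block are sorted, each key is stored as a shared-prefix length plus the differing suffix.

```python
# Prefix-encode a sorted list of keys relative to the previous key.
def common_prefix_len(a: bytes, b: bytes) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def encode(keys):
    prev, out = b"", []
    for k in keys:
        p = common_prefix_len(prev, k)
        out.append((p, k[p:]))  # reuse p bytes of the previous key
        prev = k
    return out

def decode(encoded):
    prev, out = b"", []
    for p, suffix in encoded:
        k = prev[:p] + suffix
        out.append(k)
        prev = k
    return out

keys = [b"row-0001:cf:a", b"row-0001:cf:b", b"row-0002:cf:a"]
```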




14
Compactions
     Optimize the read-path




15
Compactions
•    Reduce the number of files to look into during a scan
        • Removing duplicated keys (updated values)
        • Removing deleted keys
•    Creates a new file by merging the content of two or more files
        • Remove the old files

     [example: two files — one with Key0 – value 0.0, Key2, Key3, Key5 and one with Key0 – value 0.1, Key1, Key4, Key5 – [deleted], Key6, Key7, Key8, Key9 — merged into a single file holding Key0 – value 0.1, Key1, Key2, Key4, Key6, Key7, Key8, Key9]
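A compaction merge can be sketched like this (None stands in for a deleted marker; dropping tombstones corresponds to what a major compaction does):

```python
# Merge sorted files, keep only the newest version of each key,
# and optionally drop tombstones. The inputs can then be removed.
def compact(files, drop_deletes=True):
    merged = {}
    for f in files:  # oldest first: later files overwrite older values
        merged.update(f)
    return {k: v for k, v in sorted(merged.items())
            if not (drop_deletes and v is None)}

old = {"Key0": "value 0.0", "Key3": "value 3.0", "Key5": "value 5.0"}
new = {"Key0": "value 0.1", "Key5": None}  # one update, one delete
result = compact([old, new])
```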




16
Pluggable Compactions
•    Try different algorithms
•    Be aware of the data
        • Time Series? I guess no updates from the 80s
•    Be aware of the requests
        • Compact based on statistics
        • which files are hot and which are not
        • which keys are hot and which are not

     [example: the same merge of two files into one, as in the previous slide]
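A hypothetical statistics-driven selection policy might look like the sketch below. The fields and the ranking rule are made up for illustration; they are not HBase's actual compaction policies:

```python
# Pick files to compact from per-file statistics instead of a fixed rule.
def select_for_compaction(file_stats, max_files=3):
    # Prefer small, frequently-read ("hot") files:
    # merging them gives the biggest read-path payoff.
    ranked = sorted(file_stats, key=lambda f: (f["size"], -f["reads"]))
    return [f["name"] for f in ranked[:max_files]]

stats = [
    {"name": "hfile-a", "size": 10, "reads": 900},
    {"name": "hfile-b", "size": 500, "reads": 5},
    {"name": "hfile-c", "size": 12, "reads": 700},
]
```

A pluggable policy is just this function swapped out per table or per use case.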




17
Snapshots
     Zero-Copy Snapshots and Table Clones




18
What Is a Snapshot?
     • “a Snapshot is not a copy of the table”
     • a Snapshot is a set of metadata information
          • The table “schema” (column families and attributes)
          • The Regions information (start key, end key, …)
          • The list of Store Files
          • The list of active WALs

     [diagram: the Master coordinates via ZooKeeper; each Region Server holds its Regions, a WAL and Store Files (HFiles)]
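The "metadata only" idea can be sketched as below (field names are illustrative): taking a snapshot copies lists of names, never the data, and later writes to the table don't affect it.

```python
# A snapshot is just recorded metadata: schema, region boundaries,
# and the names of the (immutable) store files at that moment.
def take_snapshot(table):
    return {
        "schema": dict(table["schema"]),
        "regions": [dict(r) for r in table["regions"]],
        "store_files": list(table["store_files"]),  # names only, no copy
    }

table = {
    "schema": {"families": ["cf"]},
    "regions": [{"start": "", "end": "Key-50"}],
    "store_files": ["hfile-001", "hfile-002"],
}
snap = take_snapshot(table)
table["store_files"].append("hfile-003")  # later flushes don't touch it
```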



19
How Taking a Snapshot Works
     •   The master orchestrates the RSs
           • the communication is done via ZooKeeper
           • using a “2-phase commit like” transaction (prepare/commit)
     •   Each RS is responsible for taking its “piece” of the snapshot
           • For each Region, store the metadata information needed
             (list of Store Files, WALs, region start/end keys, …)

     [diagram: the Master coordinates via ZooKeeper; each Region Server holds its Regions, a WAL and Store Files (HFiles)]



20
Cloning a Table from a Snapshot
     •   hbase> clone_snapshot ‘snapshotName’, ‘tableName’
         …



     •   Creates a new table with the data “contained” in the snapshot
     •   No data copies involved
            • HFiles are immutable, and shared between tables and snapshots
     •   You can insert/update/remove data from the new table
            • No repercussions on the snapshot, original tables or other cloned tables
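Cloning can be sketched the same way (illustrative structures, not the real implementation): the clone starts with references to the same immutable HFiles, and new writes add files to the clone only.

```python
# A clone shares the snapshot's immutable file references; no data copy.
def clone_snapshot(snap):
    return {"store_files": list(snap["store_files"])}  # new list, same names

snap = {"store_files": ["hfile-001", "hfile-002"]}
clone = clone_snapshot(snap)
clone["store_files"].append("hfile-new")  # a write lands in the clone only
```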




21
Compactions & Archiving
     •   HFiles are immutable, and shared between tables and snapshots

     •   On compaction or table deletion, files are removed from disk
     •   If one of these files is referenced by a snapshot or a cloned table
             • The file is moved to an “archive” directory
             • And deleted later, when there are no references to it
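The archiving rule can be sketched as follows (names and structures are invented): a file leaving the table is parked in the archive if any snapshot or clone still references it, and deleted outright otherwise.

```python
# Archive-or-delete decision for a file removed by compaction/deletion.
def remove_file(name, references, disk, archive):
    disk.discard(name)
    if any(name in ref for ref in references.values()):
        archive.add(name)  # still referenced: keep it in the archive
    # else: no references left, the file is gone for good

disk = {"hfile-001", "hfile-002"}
archive = set()
refs = {"snapshot-1": {"hfile-001"}, "clone-t2": set()}
remove_file("hfile-001", refs, disk, archive)  # referenced -> archived
remove_file("hfile-002", refs, disk, archive)  # unreferenced -> deleted
```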




22
Future
     What can be improved?




23
0.96 is coming up
     •   Moving RPC to Protobuf
           •   Allows rolling upgrades with no surprises
     • HBase Snapshots
     • Pluggable Compactions
     • Remove -ROOT-
     • Table Locks




24
0.98 and Beyond
     • Transparent Table/Column-Family Encryption
     • Cell-level security
     • Multiple WALs per Region Server (MTTR)
     • Data Placement Awareness (MTTR)
     • Data Type Awareness
     • Compaction policies, based on the data needs
     • Managing blocks directly (instead of files)


25
     Questions?
     Matteo Bertozzi | @Cloudera




26