ADVANCED HBASE
Architecture and Schema Design
JAX UK, October 2012

Lars George
Director EMEA Services
About Me

•  Director EMEA Services @ Cloudera
    •  Consulting on Hadoop projects (everywhere)
•  Apache Committer
    •  HBase and Whirr
•  O’Reilly Author
    •  HBase – The Definitive Guide
      •  Now in Japanese!

•  Contact
    •  lars@cloudera.com
    •  @larsgeorge
Agenda

•  HBase Architecture
•  Schema Design
HBASE ARCHITECTURE
HBase Tables
HBase Tables and Regions

•  Table is made up of any number of regions
•  Region is specified by its startKey and endKey
    •  Empty table: (Table, NULL, NULL)
    •  Two-region table: (Table, NULL, “com.cloudera.www”)
       and (Table, “com.cloudera.www”, NULL)
•  Each region may live on a different node and is
 made up of several HDFS files and blocks, each
 of which is replicated by Hadoop
Distribution
HBase Tables

•  Tables are sorted by Row in lexicographical order
•  Table schema only defines its column families
    •  Each family consists of any number of columns
    •  Each column consists of any number of versions
    •  Columns only exist when inserted, NULLs are free
    •  Columns within a family are sorted and stored
       together
    •  Everything except table names is byte[]


(Table, Row, Family:Column, Timestamp) -> Value
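
A minimal sketch of how these coordinates look in the native Java client (the table "webtable", family "data", and qualifier "title" are made up for the example; the classic HTable API of that era is assumed):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CoordinateExample {
      public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "webtable");   // hypothetical table

        // Write: (Row, Family:Column) -> Value; the timestamp defaults to
        // the server time when not given explicitly.
        Put put = new Put(Bytes.toBytes("com.cloudera.www"));
        put.add(Bytes.toBytes("data"), Bytes.toBytes("title"), Bytes.toBytes("Cloudera"));
        table.put(put);

        // Read the same coordinates back.
        Get get = new Get(Bytes.toBytes("com.cloudera.www"));
        get.addColumn(Bytes.toBytes("data"), Bytes.toBytes("title"));
        Result result = table.get(get);
        System.out.println(Bytes.toString(
            result.getValue(Bytes.toBytes("data"), Bytes.toBytes("title"))));

        table.close();
      }
    }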
HBase Architecture
HBase Architecture (cont.)

•  HBase uses HDFS (or similar) as its reliable
 storage layer
  •  Handles checksums, replication, failover
•  Native Java API, Gateway for REST, Thrift, Avro
•  Master manages cluster
•  RegionServers manage data
•  ZooKeeper is used as the “neural network”
    •  Crucial for HBase
    •  Bootstraps and coordinates cluster
HBase Architecture (cont.)

•  Based on Log-Structured Merge-Trees (LSM-Trees)
•  Inserts are done in write-ahead log first
•  Data is stored in memory and flushed to disk on
   regular intervals or based on size
•  Small flushes are merged in the background to keep
   number of files small
•  Reads check the memory stores first and the disk-based
   files second
•  Deletes are handled with “tombstone” markers
•  Atomicity on row level no matter how many columns
   •  keeps locking model easy
MemStores
•  After data is written to the WAL, the RegionServer
   saves the KeyValues in the in-memory store (MemStore)
•  Flush to disk based on size, see
   hbase.hregion.memstore.flush.size
•  Default size is 64MB
•  Uses snapshot mechanism to write flush to disk
   while still serving from it and accepting new data
   at the same time
•  Snapshots are released when flush has
   succeeded
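
A minimal sketch of tuning this: the cluster-wide default is set via hbase.hregion.memstore.flush.size in hbase-site.xml, and it can also be overridden per table through the admin API (the table and family names below are made up):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class FlushSizeExample {
      public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor table = new HTableDescriptor("metrics");  // hypothetical table
        table.addFamily(new HColumnDescriptor("data"));
        // Flush this table's MemStores at 128 MB instead of the cluster default.
        table.setMemStoreFlushSize(128L * 1024 * 1024);
        admin.createTable(table);
        admin.close();
      }
    }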
Compactions
•  General Concepts
    •  Two types: Minor and Major Compactions
    •  Asynchronous and transparent to client
    •  Manage file bloat from MemStore flushes
•  Minor Compactions
    •  Combine last “few” flushes
    •  Triggered by number of storage files
•  Major Compactions
    •  Rewrite all storage files
    •  Drop deleted data and those values exceeding TTL and/or number of
       versions
    •  Triggered by time threshold
    •  Cannot be scheduled to start automatically at a specific time (bummer!)
    •  May (most definitely) tax overall HDFS IO performance

Tip: Disable automatic major compactions and schedule them to run
  manually (e.g. via cron) at off-peak times
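
A hedged sketch of that approach: with hbase.hregion.majorcompaction set to 0 in hbase-site.xml (turning off the time-based trigger), a scheduled job can request the compaction explicitly, e.g. via the shell (major_compact 'tablename') or the admin API (the table name below is made up):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class MajorCompactJob {
      public static void main(String[] args) throws IOException, InterruptedException {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        // Asynchronously asks every region of the table to major compact.
        admin.majorCompact("webtable");
        admin.close();
      }
    }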
Block Cache
•  Acts as very large, in-memory distributed cache
•  Assigned a large part of the JVM heap in the RegionServer process,
   see hfile.block.cache.size
•  Optimizes reads on subsequent columns and rows
•  Has priority to keep “in-memory” column families in cache
    if(inMemory) {
           this.priority = BlockPriority.MEMORY;
    } else {
           this.priority = BlockPriority.SINGLE;
    }

•  Cache needs to be used properly to get best read performance
    •  Turn off block cache on operations that cause large churn (see the sketch below)
    •  Store related data “close” to each other
•  Uses LRU cache with threaded (asynchronous) evictions based on
  priorities
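
For example, a full-scan job that touches each block only once can opt out of the cache so it does not evict hot data; a minimal sketch (assumes an already opened HTable):

    import java.io.IOException;

    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;

    public class ChurnScan {
      static void scanAll(HTable table) throws IOException {
        Scan scan = new Scan();
        scan.setCacheBlocks(false);   // do not pollute the block cache
        ResultScanner scanner = table.getScanner(scan);
        try {
          for (Result row : scanner) {
            // process row ...
          }
        } finally {
          scanner.close();
        }
      }
    }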
Region Splits
•  Triggered by configured maximum file size of any
 store file
   •  This is checked directly after the compaction call to
     ensure store files are actually approaching the
     threshold
•  Runs as asynchronous thread on RegionServer
•  Splits are fast and nearly instant
    •  Reference files point to original region files and
       represent each half of the split
•  Compactions take care of splitting original files
 into new region directories
Auto Sharding
Auto Sharding and Distribution

•  Unit of scalability in HBase is the Region
•  Sorted, contiguous range of rows
•  Spread “randomly” across RegionServer
•  Moved around for load balancing and failover
•  Split automatically or manually to scale with
   growing data
•  Capacity is solely a factor of cluster nodes vs.
   regions per node
Column Family vs. Column

•  Use only a few column families
    •  Many families cause many files that need to stay open
       per region, plus class overhead per family
•  Best used when there is a logical separation between
   data and meta columns
•  Sorting per family can be used to convey
   application logic or access pattern
Storage Separation

•  Column Families allow for separation of data
    •  Used by Columnar Databases for fast analytical
       queries, but on column level only
    •  Allows different or no compression depending on the
       content type
•  Segregate information based on access pattern
•  Data is stored in one or more storage files, called HFiles
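
A small sketch of per-family settings (family names are made up; note that the Compression class has moved packages across HBase versions, and SNAPPY/LZO need native libraries, so GZ is used here):

    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.io.hfile.Compression;  // later versions: o.a.h.hbase.io.compress

    public class FamilySettings {
      static HTableDescriptor docsTable() {
        HTableDescriptor table = new HTableDescriptor("docs");   // hypothetical table
        HColumnDescriptor data = new HColumnDescriptor("data");  // bulky content, rarely read
        data.setCompressionType(Compression.Algorithm.GZ);       // compress the big values
        HColumnDescriptor meta = new HColumnDescriptor("meta");  // small, frequently read
        meta.setInMemory(true);                                   // keep it cached with priority
        table.addFamily(data);
        table.addFamily(meta);
        return table;
      }
    }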
Column Families
SCHEMA DESIGN
Key Cardinality

•  The best performance is gained by selecting data via the
   row key
•  Time range bound reads can skip store files
   •  So can Bloom Filters
•  Selecting column families reduces the amount of
   data to be scanned
•  Pure value-based filtering requires a full table scan
   •  Filters often do too, but they reduce network traffic
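
A brief sketch of narrowing a read on the key side instead of filtering on values (the family name, row bounds, and time range are made up):

    import java.io.IOException;

    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class NarrowScan {
      static ResultScanner recentMeta(HTable table) throws IOException {
        Scan scan = new Scan(Bytes.toBytes("rowA"), Bytes.toBytes("rowZ")); // bound by row key
        scan.addFamily(Bytes.toBytes("meta"));             // only read this family's store files
        scan.setTimeRange(1307097848000L, 1307103848000L); // files outside the range can be skipped
        return table.getScanner(scan);
      }
    }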
Fold, Store, and Shift

•  Logical layout does not match physical one
•  All values are stored with the full coordinates,
   including: Row Key, Column Family, Column
   Qualifier, and Timestamp
•  Folds columns into “row per column”
•  NULLs are cost free as nothing is stored
•  Versions are multiple “rows” in folded table
Key/Table Design

•  Crucial to gain best performance
    •  Why do I need to know? Well, an RDBMS also only works
       well when columns are indexed and the query plan is OK
•  Absence of secondary indexes forces use of row
   key or column name sorting
•  Transfer multiple indexes into one
   •  Generate large table -> Good since fits architecture
     and spreads across cluster
DDI

•  Stands for Denormalization, Duplication and
   Intelligent Keys
•  Needed to overcome shortcomings of
   architecture
•  Denormalization -> Replacement for JOINs
•  Duplication -> Design for reads
•  Intelligent Keys -> Implement indexing and
   sorting, optimize reads
Pre-materialize Everything

•  Achieve one read per customer request if
   possible
•  Otherwise keep at lowest number
•  Reads between 10ms (cache miss) and 1ms
   (cache hit)
•  Use MapReduce to compute exact values in batch
•  Store and merge updates live
•  Use incrementColumnValue (see the sketch below)


            Motto: “Design for Reads”
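
A minimal sketch of the live-update path using atomic counters, as mentioned above (row, family, and qualifier names are made up):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CounterUpdate {
      public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "counters");   // hypothetical table
        // Atomic read-modify-write on the server; returns the new value.
        long views = table.incrementColumnValue(
            Bytes.toBytes("com.cloudera.www"),   // row
            Bytes.toBytes("daily"),              // column family
            Bytes.toBytes("page_views"),         // column qualifier
            1L);                                 // increment amount
        System.out.println("page views: " + views);
        table.close();
      }
    }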
Tall-Narrow vs. Flat-Wide Tables

•  Rows do not split
    •  Might end up with one row per region
•  Same storage footprint
•  Put more details into the row key
    •  Sometimes dummy column only
    •  Make use of partial key scans
•  Tall with Scans, Wide with Gets
    •  Atomicity only on row level
•  Example: Large graphs, stored as adjacency
 matrix
Example: Mail Inbox

        <userId> : <colfam> : <messageId> : <timestamp> : <email-message>

12345   :   data   :   5fc38314-e290-ae5da5fc375d       :   1307097848   :   "Hi Lars, ..."
12345   :   data   :   725aae5f-d72e-f90f3f070419       :   1307099848   :   "Welcome, and ..."
12345   :   data   :   cc6775b3-f249-c6dd2b1a7467       :   1307101848   :   "To Whom It ..."
12345   :   data   :   dcbee495-6d5e-6ed48124632c       :   1307103848   :   "Hi, how are ..."


                                               or
12345-5fc38314-e290-ae5da5fc375d         :   data   :   :   1307097848   :   "Hi Lars, ..."
12345-725aae5f-d72e-f90f3f070419         :   data   :   :   1307099848   :   "Welcome, and ..."
12345-cc6775b3-f249-c6dd2b1a7467         :   data   :   :   1307101848   :   "To Whom It ..."
12345-dcbee495-6d5e-6ed48124632c         :   data   :   :   1307103848   :   "Hi, how are ..."


                           ➜   Same Storage Requirements
Partial Key Scans
Key                                          Description
<userId>                                     Scan over all messages for a given user ID
<userId>-<date>                              Scan over all messages on a given date for the given user ID
<userId>-<date>-<messageId>                  Scan over all parts of a message for a given user ID and date
<userId>-<date>-<messageId>-<attachmentId>   Scan over all attachments of a message for a given user ID and date
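
A hedged sketch of the second case in the table above: scanning all messages of one user on one date by bounding the scan with the key prefix (the helper is hypothetical and treats keys as plain strings):

    import java.io.IOException;

    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PartialKeyScan {
      static void scanDay(HTable inbox, String userId, String date) throws IOException {
        Scan scan = new Scan();
        scan.setStartRow(Bytes.toBytes(userId + "-" + date + "-"));
        scan.setStopRow(Bytes.toBytes(userId + "-" + date + "."));  // '.' sorts directly after '-'
        ResultScanner scanner = inbox.getScanner(scan);
        try {
          for (Result message : scanner) {
            System.out.println(Bytes.toString(message.getRow()));
          }
        } finally {
          scanner.close();
        }
      }
    }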
Sequential Keys

  <timestamp><more key>: {CF: {CQ: {TS : Val}}}

•  Hotspotting on Regions: bad!
•  Instead do one of the following:
    •  Salting
        •  Prefix <timestamp> with a distributed value
        •  Binning or bucketing rows across regions
    •  Key field swap/promotion
        •  Move <more key> before the timestamp (see OpenTSDB later)
    •  Randomization
        •  Move <timestamp> out of key
Salting

•  Prefix row keys to gain spread
•  Use well-known or numbered prefixes
•  Use modulo to spread across servers
•  Ensure related data stays close together for
   subsequent scanning or MapReduce processing
      0_rowkey1, 1_rowkey2, 2_rowkey3
      0_rowkey4, 1_rowkey5, 2_rowkey6

•  Sorted by prefix first
    0_rowkey1
    0_rowkey4
    1_rowkey2
    1_rowkey5
    …
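
A minimal sketch of such a prefixing scheme (the bucket count and helper name are assumptions; readers have to fan out over all bucket prefixes to reassemble a full range):

    import org.apache.hadoop.hbase.util.Bytes;

    public class SaltedKey {
      static final int BUCKETS = 8;   // assumed number of salt buckets

      // Deterministic, well-known prefix: the same logical key always lands
      // in the same bucket, but consecutive keys spread across buckets.
      static byte[] saltedKey(String rowKey) {
        int salt = (rowKey.hashCode() & 0x7fffffff) % BUCKETS;
        return Bytes.toBytes(salt + "_" + rowKey);
      }
    }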
Hashing vs. Sequential Keys

•  Use hashes for the best spread
    •  Use, for example, MD5 to be able to recreate the key
        •  Key = MD5(customerID)
    •  Counterproductive for range scans


•  Use sequential keys for locality
    •  Makes use of block caches
    •  May overly tax one server; this can be avoided by salting
       or by splitting regions while keeping them small
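
A sketch of the hashed variant, assuming lookups only ever go through the customer ID (so the key can be recomputed on read, while range scans over customers are lost):

    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    import org.apache.hadoop.hbase.util.Bytes;

    public class HashedKey {
      // Key = MD5(customerID): evenly spread and reproducible from the ID alone.
      static byte[] rowKey(String customerId) throws NoSuchAlgorithmException {
        return MessageDigest.getInstance("MD5").digest(Bytes.toBytes(customerId));
      }
    }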
Key Design
Key Design Summary

•  Based on access pattern, either use sequential or
   random keys
•  Often a combination of both is needed
   •  Overcome architectural limitations
•  Neither is necessarily bad
    •  Use bulk import for sequential keys and reads
    •  Random keys are good for random access patterns
Example: Facebook Insights

•  > 20B Events per Day
•  1M Counter Updates per Second
    •  100 Nodes Cluster
    •  10K OPS per Node
•  “Like” button triggers AJAX request
•  Event written to log file
•  Counters shown to the website owner are current within 30 minutes


     Web ➜ Scribe ➜ Ptail ➜ Puma ➜ HBase
HBase Counters

•  Store counters per Domain and per URL
    •  Leverage HBase increment (atomic read-modify-write) feature
•  Each row is one specific Domain or URL
•  The columns are the counters for specific metrics
•  Column families are used to group counters by
 time range
   •  Set time-to-live on CF level to auto-expire counters by
     age to save space, e.g., 2 weeks on “Daily Counters”
     family
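
A hedged sketch of such a layout: one family per time range, with a two-week TTL on the daily counters (the table and family names are invented for the example):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class CountersTable {
      public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor counters = new HTableDescriptor("counters");
        HColumnDescriptor daily = new HColumnDescriptor("daily");
        daily.setTimeToLive(14 * 24 * 60 * 60);   // TTL in seconds: expire after 2 weeks
        counters.addFamily(daily);
        counters.addFamily(new HColumnDescriptor("lifetime"));  // never expires
        admin.createTable(counters);
        admin.close();
      }
    }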
Key Design
•  Reversed Domains
    •  Examples: “com.cloudera.www”, “com.cloudera.blog”
    •  Helps keep the pages of a site close together, as HBase
       efficiently scans blocks of sorted keys
•  Domain Row Key =
 MD5(Reversed Domain) + Reversed Domain
   •  Leading MD5 hash spreads keys randomly across all regions
      for load balancing reasons
   •  Only hashing the domain groups per site (and per subdomain
      if needed)
•  URL Row Key =
 MD5(Reversed Domain) + Reversed Domain + URL ID
   •  Unique ID per URL already available, make use of it
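
A small sketch of building these composite keys (the urlId stands for the already-available unique URL ID; helper names are made up):

    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    import org.apache.hadoop.hbase.util.Bytes;

    public class InsightsKeys {
      // MD5 prefix spreads domains across regions; the readable reversed
      // domain keeps all pages of one site sorted together behind it.
      static byte[] domainKey(String reversedDomain) throws NoSuchAlgorithmException {
        byte[] md5 = MessageDigest.getInstance("MD5").digest(Bytes.toBytes(reversedDomain));
        return Bytes.add(md5, Bytes.toBytes(reversedDomain));
      }

      static byte[] urlKey(String reversedDomain, long urlId) throws NoSuchAlgorithmException {
        return Bytes.add(domainKey(reversedDomain), Bytes.toBytes(urlId));
      }
    }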
Insights Schema
Summary

•  Design for Use-Case
    •  Read, Write, or Both?
•  Avoid Hotspotting
•  Consider using IDs instead of full text
•  Leverage Column Family to HFile relation
•  Shift details to appropriate position
    •  Composite Keys
    •  Column Qualifiers
Summary (cont.)

•  Schema design is a combination of
    •  Designing the keys (row and column)
    •  Segregating data into column families
    •  Choosing compression and block sizes
•  Similar techniques are needed to scale most
 systems
   •  Add indexes, partition data, consistent hashing
•  Denormalization, Duplication, and Intelligent
 Keys (DDI)
Questions?
