Operating HBase –
Things You Need to Know
       Christian Gügi
Outline
●   HBase internals
●   Overview of HBase utilities
●   HBase split visualisation with Hannibal
●   Challenges & lessons learned
●   Resources to get started




About me
●   Software Architect @ Sentric
●   Founder and organizer of the Swiss Big
    Data User Group
    http://www.bigdata-usergroup.ch

●   Contact:
    christian.guegi@sentric.ch
    http://www.sentric.ch
    @chrisgugi

HBase Internals




Data Model
●   A sparse, multi-dimensional, sorted map
●   A table consists of rows, each with a row key
●   Each row may have any number of columns
●   Rows are sorted lexicographically based on row key
●   Column = Column Family : Column Qualifier
    –   A cell is addressed by {rowkey, column, timestamp}




                                    [Bigtable: A Distributed Storage System for Structured Data]

●   Region: contiguous set of sorted rows
●   Region: unit of distribution and availability
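
To make the "sorted map" idea concrete, here is a minimal Python sketch. It is not HBase code: `TinyTable` and its methods are hypothetical names, and it assumes string row keys and integer timestamps. Keys are stored as (row, column, -timestamp) so that newer versions of a cell sort first.

```python
from bisect import bisect_left, insort


class TinyTable:
    """Toy in-memory model of a sparse, sorted, multi-dimensional map."""

    def __init__(self):
        self._keys = []    # sorted list of (row, column, -timestamp)
        self._values = {}  # key -> value; absent cells take no space (sparse)

    def put(self, row, column, timestamp, value):
        key = (row, column, -timestamp)  # negate so newer versions sort first
        if key not in self._values:
            insort(self._keys, key)
        self._values[key] = value

    def get(self, row, column):
        """Return the newest value stored for (row, column), or None."""
        i = bisect_left(self._keys, (row, column, float("-inf")))
        if i < len(self._keys) and self._keys[i][:2] == (row, column):
            return self._values[self._keys[i]]
        return None

    def scan(self, start_row, stop_row):
        """Yield (row, column, timestamp, value) for start_row <= row < stop_row."""
        i = bisect_left(self._keys, (start_row,))
        while i < len(self._keys) and self._keys[i][0] < stop_row:
            row, column, neg_ts = self._keys[i]
            yield row, column, -neg_ts, self._values[self._keys[i]]
            i += 1
```

Because rows are kept in lexicographic order, a scan over a key range touches one contiguous slice of the map, which is what makes a region (a contiguous set of sorted rows) a natural unit of distribution.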
Physical Data Organization
    [Diagram: a Region holds one Store per column family (content, anchor). Each Store has a Memstore and one or more HFiles (on HDFS), alongside the HLog (WAL on HDFS).]




●   Column families are stored separately on disk
    –   A column family is the unit of access control; families can have different access patterns
●   Writes are held (sorted) in memory until flush
●   Sorted on disk in predictable order
    –   By row key, column key, descending timestamp
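
The write path described above can be sketched roughly as follows. This is a simplified illustration, not HBase's implementation: the `Store` class is hypothetical, and the flush threshold counts cells purely for illustration rather than measuring memory.

```python
class Store:
    """Toy store for one column family: a memstore plus flushed, sorted files."""

    def __init__(self, flush_threshold=3):
        # Assumption: threshold in number of cells, chosen only for this sketch.
        self.flush_threshold = flush_threshold
        self.memstore = {}
        self.hfiles = []  # each entry is an immutable, sorted tuple of cells

    def put(self, row, qualifier, timestamp, value):
        # Negated timestamp gives descending timestamp order when sorted.
        self.memstore[(row, qualifier, -timestamp)] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write the memstore out as one sorted file and start a fresh memstore."""
        if self.memstore:
            self.hfiles.append(tuple(sorted(self.memstore.items())))
            self.memstore = {}
```

Each flush produces a new file, so the number of files per store grows over time.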
Flushes and Compaction
●   Flushing/compaction per Region
    –   One thread (CompactSplitThread) per region
        server
●   Minor compaction
    –   Merges two or more HFiles into one
●   Major compaction
    –   Picks up all HFiles in the region, merges them and
        removes deleted key/values
●   Regions are split when they grow too large
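
The difference between the two compaction types can be sketched as below. This is a toy model under stated assumptions: files are sorted lists of ((row, column, -timestamp), value) pairs, `TOMBSTONE` is a hypothetical delete marker, and a marker is assumed to hide every version of the same cell with an equal or older timestamp.

```python
import heapq

TOMBSTONE = object()  # hypothetical delete marker, for illustration only


def minor_compact(hfiles):
    """Merge several sorted files into one sorted file, keeping every entry."""
    return list(heapq.merge(*hfiles, key=lambda kv: kv[0]))


def major_compact(hfiles):
    """Merge all files, then drop delete markers and the cells they hide."""
    merged = minor_compact(hfiles)
    # Newest delete timestamp seen per (row, column).
    deleted_up_to = {}
    for (row, column, neg_ts), value in merged:
        if value is TOMBSTONE:
            cell = (row, column)
            deleted_up_to[cell] = max(deleted_up_to.get(cell, -neg_ts), -neg_ts)
    return [
        (key, value)
        for key, value in merged
        if value is not TOMBSTONE
        and -key[2] > deleted_up_to.get(key[:2], float("-inf"))
    ]
```

A minor compaction only reduces the number of files; disk space for deleted data is reclaimed by the major compaction.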

System Architecture

    [Diagram: HBase components. API; Master; RegionServer (HFile, Memstore, Write-Ahead Log); HDFS and ZooKeeper underneath.]

    [HBase: The Definitive Guide]
Key Design & Distribution
●   Bad idea: continuous number or timestamp
    (sequential row keys)
    –   RegionServer hot-spotting
●   Better: use hash function and/or composite
    key
    –   Distribute keys over random regions
    –   Uniform reads/writes across key space
●   Proper key design is essential
    –   E.g. reversed URL (Bigtable paper)
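
Both key-design ideas can be illustrated with a short Python sketch. The function names, the bucket count of 16 and the use of MD5 are assumptions made for this example, not something the slides prescribe.

```python
import hashlib


def salted_key(row_key, buckets=16):
    """Prefix a key with a hash-derived bucket so sequential keys spread out."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    bucket = digest[0] % buckets
    return f"{bucket:02d}-{row_key}"


def reversed_host_key(host):
    """Reverse the parts of a host name so pages of one domain sort together."""
    return ".".join(reversed(host.split(".")))
```

For example, `reversed_host_key("www.cnn.com")` yields `"com.cnn.www"`. The trade-off with a hash prefix is that a range scan over the original key order then has to query every bucket, so it suits write-heavy workloads with mostly point reads.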
Overview
HBase Utilities




Useful Tools
●   hbck – checks and fixes table integrity and
    region consistency
●   HFile – examine contents of HFile
●   HLog – examine contents of HLog file
●   OfflineMetaRepair – rebuild meta table
    from file system
●   HBase web interfaces
    –   Master
    –   RegionServer
Monitoring Tools
●   Ganglia
●   Nagios
●   OpenTSDB
●   …

    All tools use metrics provided through JMX




Manual Splitting
●   Via master web interface
    –   Split
●   HBase shell split command
●   RegionSplitter
    –   Create table with pre-split regions
    –   Rolling split of all regions on existing table
    –   ./bin/hbase org.apache.hadoop.hbase.util.RegionSplitter
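
As an illustration of what pre-splitting means, the following sketch computes evenly spaced split points for row keys that are fixed-width hexadecimal strings. It is not RegionSplitter's own algorithm; the function name and the 8-digit key width are assumptions for this example.

```python
def hex_split_points(num_regions, width=8):
    """Return the num_regions - 1 boundary keys that divide a hex key space evenly."""
    if num_regions < 1:
        raise ValueError("num_regions must be at least 1")
    space = 16 ** width
    return [
        format(space * i // num_regions, f"0{width}x")
        for i in range(1, num_regions)
    ]
```

The resulting boundaries could then be passed as split keys when creating a table, provided the real row keys are distributed uniformly over that key space (for instance because they are hashed).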


Disable Automatic Splitting
●   Automatic splitting is determined by hbase.hregion.max.filesize
●   To effectively disable it, set the value very high (max. 100GB)
●   OK, but:
    –   How do I monitor my region growth?
    –   Where do I split when I have irregular data
        growth?
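
One simple way to approach the monitoring question is to compare region sizes against the configured maximum and flag regions that are getting close. The sketch below assumes the sizes have already been collected from somewhere (for example the metrics a monitoring tool exposes); the function name, the input format and the 80% warning ratio are hypothetical.

```python
def regions_to_split(region_sizes, max_filesize, warn_ratio=0.8):
    """Return (region, size) pairs at or above warn_ratio * max_filesize, largest first."""
    limit = max_filesize * warn_ratio
    flagged = [(name, size) for name, size in region_sizes.items() if size >= limit]
    return sorted(flagged, key=lambda item: item[1], reverse=True)
```

With automatic splitting effectively disabled, a report like this tells an operator which regions to split manually before they reach the limit.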




HBase Split Visualisation
    with Hannibal




Hannibal
●   Open source project on GitHub
    – https://github.com/sentric/hannibal
●   Web based
●   Implemented in Scala
●   Compatible with HBase 0.90
●   Support for HBase > 0.92 will be added soon
●   Check it out!

How well are regions balanced
over the cluster?




How well are the regions split for
the table?




How did the region evolve over
time?




Future Plans
●   HBase 0.92 client API changes allow querying the
    compaction state of regions through HBaseAdmin
    → differentiate major from minor compactions
●   Add a tool to find the best region split key for
    irregular data growth
●   Expose metrics through JMX
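
What "finding the best split key" could mean for irregular data is sketched below. This is a purely hypothetical illustration, not Hannibal's planned implementation: it assumes the per-key data sizes of a region are known and picks the key that divides the region into two halves of the most similar size.

```python
def best_split_key(key_sizes):
    """key_sizes: list of (row_key, size) sorted by row key.

    Returns the key that should become the first key of the right-hand region.
    """
    if len(key_sizes) < 2:
        raise ValueError("need at least two keys to split a region")
    total = sum(size for _, size in key_sizes)
    best_key, best_diff = None, None
    left = 0
    for i in range(1, len(key_sizes)):
        left += key_sizes[i - 1][1]
        diff = abs(total - 2 * left)  # |right half - left half|
        if best_diff is None or diff < best_diff:
            best_key, best_diff = key_sizes[i][0], diff
    return best_key
```

With skewed data the balanced split point can sit far from the middle of the key range, which is why splitting at the midpoint key is not always a good choice.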



Challenges
& Lessons Learned




Challenges
●   Everyone is still learning
●   Some issues only appear at scale
    –   At scale, nothing works as advertised
●   Production cluster configuration
    –   Hardware issues
    –   Tuning cluster configuration to our workloads
●   HBase stability
●   Monitoring health of HBase
Lessons Learned
●   Schema & key design
    –   What’s queried together should be stored together
●   Monitoring/Operational tooling is most important
●   Forget “emergency actions”, it takes some time
●   You need DevOps in production
●   Steep learning curve: you need to know the whole
    ecosystem
    –   Hadoop, HDFS, MapReduce, ZooKeeper



Resources to get started
●   https://github.com/sentric/hannibal
●   http://hbase.apache.org/book.html
●   https://github.com/jmhsieh/hbase-repair-scripts
●   http://www.sentric.ch/blog/best-practice-why-monitoring-hbase-is-important
●   HBase: The Definitive Guide


Thank you!



       Questions?
             @chrisgugi





Apachecon Europe 2012: Operating HBase - Things you need to know
