Apache HBase:
Introduction & Use Cases
Subash D’Souza
What is HBase?
S HBase is an open source, distributed, sorted map modeled
after Google's Bigtable (the data model is sketched below)
S NoSQL solution built atop Apache Hadoop
S Top level Apache Project
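The "sorted map" description can be made concrete with a minimal Java sketch. This is an illustration only, not HBase code; the row key, column family, and qualifier names are invented. Logically, an HBase table behaves like nested sorted maps keyed by row, column, and timestamp.

```java
import java.util.TreeMap;

// HBase's logical data model viewed as nested sorted maps:
// row key -> (column "family:qualifier" -> (timestamp -> value)).
// Rows stay sorted by row key, which is what makes range scans cheap.
public class SortedMapModel {
    public static void main(String[] args) {
        TreeMap<String, TreeMap<String, TreeMap<Long, String>>> table = new TreeMap<>();

        // Insert one cell: row "user#42", column "info:name", current timestamp.
        table.computeIfAbsent("user#42", r -> new TreeMap<>())
             .computeIfAbsent("info:name", c -> new TreeMap<>())
             .put(System.currentTimeMillis(), "Ada");

        // Sparse columns cost nothing: a row simply omits the maps it does not use.
        System.out.println(table);
    }
}
```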
CAP Theorem
S Consistency (all nodes see the same data at the same time)
S Availability (a guarantee that every request receives a
response about whether it was successful or failed)
S Partition tolerance (the system continues to operate despite
arbitrary message loss or failure of part of the system)
According to the theorem, a distributed system can satisfy any
two of these guarantees at the same time, but not all three.
Ref: http://blog.nahurst.com/visual-guide-to-nosql-systems
Usage Scenarios
S Lots of Data - 100s of Gigs to Petabytes
S High Throughput – 1000s of records/sec
S Scalable cache capacity – Adding several nodes adds to
available cache
S Data Layout – Excels at key lookups, with no penalty for sparse
columns
Column Oriented Databases
S HBase belongs to the family of databases known as column-oriented
S Column-oriented databases save their data grouped by columns.
S The rationale for storing values on a per-column basis is the
assumption that, for specific queries, not all of the values are
needed.
S Reduced I/O is one of the primary benefits of this layout
S Specialized algorithms, for example delta and/or prefix
compression, selected based on the type of the column (i.e., on the
data stored), can yield huge improvements in compression ratios.
Better ratios result in more efficient bandwidth usage.
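As a rough illustration of why grouping similar, sorted values compresses well, the toy Java sketch below prefix-encodes a sorted run of keys by storing only the shared-prefix length and the new suffix. The keys are invented, and real HBase block encoders (such as PREFIX or DIFF) are considerably more sophisticated, but the underlying idea is the same.

```java
import java.util.List;

// Toy prefix compression over a sorted list of keys: for each key, record how
// many leading characters it shares with the previous key plus the remaining suffix.
// Sorted, similar keys (typical when one column is laid out contiguously) shrink a lot.
public class PrefixEncodeDemo {
    public static void main(String[] args) {
        List<String> sortedKeys = List.of(
            "user#000123", "user#000124", "user#000125", "user#000311");

        String prev = "";
        for (String key : sortedKeys) {
            int shared = 0;
            int max = Math.min(prev.length(), key.length());
            while (shared < max && prev.charAt(shared) == key.charAt(shared)) {
                shared++;
            }
            // Store (shared, suffix) instead of the full key.
            System.out.printf("shared=%d suffix=%s%n", shared, key.substring(shared));
            prev = key;
        }
    }
}
```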
HBase as a Column Oriented
Database
S HBase is not a column-oriented database in the typical RDBMS
sense, but utilizes an on-disk column storage format.
S This is also where the majority of similarities end, because although
HBase stores data on disk in a column-oriented format, it is distinctly
different from traditional columnar databases.
S Whereas columnar databases excel at providing real-time analytical
access to data, HBase excels at providing key-based access to a
specific cell of data, or a sequential range of cells.
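To make the contrast concrete, here is a minimal sketch of both access patterns using the HBase 1.x Java client; the table name, column family, and row keys are assumptions for illustration, not anything from the deck.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class GetVsScan {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("events"))) {

            // Key-based access: fetch a single cell by row key, family, and qualifier.
            Get get = new Get(Bytes.toBytes("user#000123"));
            get.addColumn(Bytes.toBytes("d"), Bytes.toBytes("last_login"));
            Result one = table.get(get);
            System.out.println(Bytes.toString(
                one.getValue(Bytes.toBytes("d"), Bytes.toBytes("last_login"))));

            // Sequential access: scan a contiguous range of row keys.
            Scan scan = new Scan(Bytes.toBytes("user#000100"), Bytes.toBytes("user#000200"));
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result row : scanner) {
                    System.out.println(Bytes.toString(row.getRow()));
                }
            }
        }
    }
}
```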
HBase and Hadoop
S Hadoop excels at storing data of arbitrary, semi-, or even
unstructured formats, since it lets you decide how to interpret
the data at analysis time, allowing you to change the way you
classify the data at any time: once you have updated the
algorithms, you simply run the analysis again.
S HBase sits atop Hadoop using all the best features of HDFS
such as scalability and data replication
HBase UnUsage
S When data access patterns are unknown – HBase follows a
data-centric model rather than a relationship-centric one, hence it
does not make sense to build an ERD model for HBase
S Small amounts of data – Just use an RDBMS
S Limited/No random reads and writes – Just use HDFS directly
HBase Use Cases - Facebook
S One of the earliest and largest users of HBase
S Facebook messaging platform built atop HBase in 2010
S Chosen because of the high write throughput and low latency
random reads
S Other features such as Horizontal Scalability, Strong
Consistency and High Availability via Automatic Failover.
HBase Use Cases - Facebook
S In addition to online transaction processing workloads like
messages, it is also used for online analytic processing
workloads where large data scans are prevalent.
S Also used in production by other Facebook services, including
the internal monitoring system, the recently launched Nearby
Friends feature, search indexing, streaming data analysis, and
data scraping for their internal data warehouses.
Seek vs. Transfer
S One of the fundamental differences between a typical RDBMS and
NoSQL stores is the use of B or B+ trees versus Log-Structured
Merge (LSM) trees, the latter being the basis of Google's Bigtable
B+ Trees
S B+ Trees allow for efficient insertion, lookup and deletion of
records that are identified by keys.
S Represent dynamic, multilevel indexes with lower and upper
bounds per segment or page
S This allows for higher fanout compared to binary trees,
resulting in a lower number of I/O operations
S Range scans are also very efficient
LSM Trees
S Incoming data is first stored in a logfile, completely sequentially
S Once the modification is saved in the log, the update is applied to
an in-memory store.
S Once enough updates have accrued in the in-memory store, it
flushes a sorted list of key->record pairs to disk, creating store
files.
S At this point the corresponding log entries can be deleted, since the
modifications have been persisted.
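The flow above can be condensed into a toy, single-node Java sketch; it is an illustration only, not HBase code, and file names and the flush threshold are invented. Writes go to an append-only log, then to a sorted in-memory map, and the map is flushed as a sorted store file once a threshold is crossed.

```java
import java.io.FileWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;

// Toy LSM writer: WAL append -> sorted in-memory store -> flush as a sorted store file.
public class TinyLsm {
    private static final int FLUSH_THRESHOLD = 4;           // flush after this many entries
    private final TreeMap<String, String> memstore = new TreeMap<>();
    private final FileWriter wal;
    private int flushCount = 0;

    public TinyLsm() throws IOException {
        wal = new FileWriter("wal.log", true);               // append-only log
    }

    public void put(String key, String value) throws IOException {
        wal.write(key + "=" + value + "\n");                  // 1. sequential log write
        wal.flush();
        memstore.put(key, value);                             // 2. update sorted in-memory store
        if (memstore.size() >= FLUSH_THRESHOLD) {
            flush();                                          // 3. persist as a sorted store file
        }
    }

    private void flush() throws IOException {
        Path storeFile = Path.of("store-" + (flushCount++) + ".txt");
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : memstore.entrySet()) {
            sb.append(e.getKey()).append('=').append(e.getValue()).append('\n');
        }
        Files.writeString(storeFile, sb.toString());          // sequential, sorted write
        memstore.clear();                                     // log entries are now safe to discard
    }
}
```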
Fundamental Difference
S B+ trees on disk drives
S Too many modifications force costly optimizations.
S More data added at random locations causes faster fragmentation
S Updates and deletes are done at disk seek rates rather than
disk transfer rates
Fundamental Difference (Contd.)
S LSM trees work at disk transfer rates
S Scale better to handle large amounts of data
S Guarantee a consistent insert rate
S Transform random writes into sequential writes using logfiles
plus an in-memory store
S Reads are independent of writes, so there is no contention between
the two
HBase Basics
S When data is added to HBase, it is first written to the WAL (write-ahead
log), called the HLog.
S Once the write is done, it is then written to an in-memory store
called the MemStore
S Once the MemStore exceeds a certain threshold, it is flushed to disk
as an HFile
S Over time HBase merges smaller HFiles into larger ones. This process
is called compaction (see the sketch below)
Ref: https://www.altamiracorp.com/blog/employee-posts/handling-big-data-with-
hbase-part-3-architecture-overview
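A hedged sketch of this write path with the HBase 1.x client follows; the table and column names are invented, and the explicit flush and compaction calls only force what HBase normally does automatically in the background.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class WritePathDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        TableName name = TableName.valueOf("events");
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(name);
             Admin admin = conn.getAdmin()) {

            // The region server appends this Put to the WAL, then applies it to the MemStore.
            Put put = new Put(Bytes.toBytes("user#000123"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("last_login"), Bytes.toBytes("2014-08-01"));
            table.put(put);

            // Flushes and compactions normally happen automatically; these calls just
            // force them so the HFile lifecycle is visible.
            admin.flush(name);          // MemStore -> new HFile on disk
            admin.majorCompact(name);   // merge smaller HFiles into larger ones
        }
    }
}
```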
Facebook-HydraBase
S In HBase, when a region server fails, all regions hosted by that
region server are moved to another region server.
S Depending on how HBase has been set up, this typically entails
splitting and replaying the WAL files, which can take time and
lengthens the failover
S HydraBase differs from HBase here. Instead of a region being hosted
by a single region server, it is hosted by a set of region servers.
S When a region server fails, there are standby region servers
ready to take over
Facebook-HydraBase
S The standby region servers can be spread across different
racks or even data centers, providing availability.
S The set of region servers serving each region form a quorum.
Each quorum has a leader that services read and write
requests from the client.
S HydraBase uses the Raft consensus protocol to ensure
consistency across the quorum.
S With a quorum of 2F+1, HydraBase can tolerate up to F
failures.
S Increases reliability from 99.99% to 99.999%, i.e., roughly 5 minutes of
downtime/year.
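As a worked example of the quorum arithmetic above: with 2F + 1 = 5 replicas per region (F = 2), HydraBase can keep serving reads and writes through any two simultaneous replica failures, because the remaining three members can still form a majority and elect a leader.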
HBase Users - Flurry
S Mobile analytics, monetization and advertising company
founded in 2005
S Recently acquired by Yahoo
S 2 data centers with 2 clusters each, bi-directional replication
S 1000 slave nodes per cluster – 32 GB RAM, 4 drives (1 or 2 TB),
1 GigE, dual quad-core processors x 2 (HT) = 16 logical processors
S ~30 tables, 250k regions, 430 TB (after LZO compression)
S 2 big tables are approx. 90% of that: 1 wide table with 3 CFs and 4
billion rows with 1 MM cells per row; the other a tall table with
1 CF, 1 trillion rows and 1 cell per row
HBase Security – 0.98
S Cell Tags – All values in HBase are now written as cells, which can
also carry an arbitrary number of tags, such as metadata
S Cell ACLs – enable checking of (R)ead, (W)rite, e(X)ecute,
(A)dmin & (C)reate permissions per cell
S Cell Labels – visibility expression support via a new security
coprocessor (both ACLs and labels are sketched below)
S Transparent Encryption – data is encrypted on disk – HFiles are
encrypted when written and decrypted when read
S RBAC – implemented using the Hadoop Group Mapping Service and ACLs
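A minimal sketch of the cell-level hooks follows, assuming the access-control and visibility coprocessors are enabled and the labels have already been defined by an administrator; the table, label expression, and user name are invented for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.security.access.Permission;
import org.apache.hadoop.hbase.security.visibility.CellVisibility;
import org.apache.hadoop.hbase.util.Bytes;

public class CellSecurityDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("patient_records"))) {

            Put put = new Put(Bytes.toBytes("patient#77"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("diagnosis"), Bytes.toBytes("..."));

            // Cell label: only users whose authorizations satisfy this expression can read the cell.
            // (The PHYSICIAN/NURSE/BILLING labels are hypothetical and must exist in the cluster.)
            put.setCellVisibility(new CellVisibility("(PHYSICIAN|NURSE)&!BILLING"));

            // Cell ACL: grant the (hypothetical) user "alice" read access to the cells of this Put.
            put.setACL("alice", new Permission(Permission.Action.READ));

            table.put(put);
        }
    }
}
```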
Apache Phoenix
S SQL layer atop HBase – Has a query engine, metadata
repository & embedded JDBC driver (see the sketch below); top-level
Apache project, currently only for HBase
S Fastest way to access HBase data – HBase-specific push-down,
compiles queries into native, direct HBase calls (no MapReduce),
executes scans in parallel
S Integrates with Pig, Flume & Sqoop
S Phoenix maps HBase data model to relational world
Ref: Taming HBase with Apache Phoenix and SQL, HBaseCon 2014
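Because Phoenix exposes a standard JDBC driver, access typically looks like the sketch below; the ZooKeeper quorum, table, and columns are assumptions for illustration.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class PhoenixDemo {
    public static void main(String[] args) throws Exception {
        // The JDBC URL points at the ZooKeeper quorum of the HBase cluster.
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk1,zk2,zk3:2181")) {
            try (Statement stmt = conn.createStatement()) {
                stmt.execute("CREATE TABLE IF NOT EXISTS events ("
                        + "user_id VARCHAR NOT NULL PRIMARY KEY, last_login VARCHAR)");
            }

            // Phoenix uses UPSERT rather than INSERT/UPDATE; commits are explicit by default.
            try (PreparedStatement ps =
                     conn.prepareStatement("UPSERT INTO events VALUES (?, ?)")) {
                ps.setString(1, "user#000123");
                ps.setString(2, "2014-08-01");
                ps.executeUpdate();
            }
            conn.commit();

            // The query compiles into native HBase scans executed in parallel, with no MapReduce.
            try (Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT user_id, last_login FROM events")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + " " + rs.getString(2));
                }
            }
        }
    }
}
```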
OpenTSDB 2.0
S Distributed, Scalable Time Series Database on top of HBase
S Time Series – data points for an identity over time.
S Stores trillions of data points, never loses precision, scales
using HBase
S Good for system monitoring & measurement – servers &
networks, Sensor data – The internet of things, SCADA,
Financial data, Results of Scientific experiments, etc.
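OpenTSDB 2.0 also exposes an HTTP interface alongside the classic telnet-style protocol; a minimal Java sketch of pushing one data point through the /api/put endpoint is shown below, with the host, metric name, and tags invented for illustration.

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class TsdbPutDemo {
    public static void main(String[] args) throws Exception {
        // One data point: metric name, Unix timestamp (seconds), numeric value, identifying tags.
        String json = "{"
                + "\"metric\":\"sys.cpu.user\","
                + "\"timestamp\":" + (System.currentTimeMillis() / 1000) + ","
                + "\"value\":42.5,"
                + "\"tags\":{\"host\":\"web01\",\"dc\":\"lax\"}"
                + "}";

        HttpURLConnection conn = (HttpURLConnection)
                new URL("http://tsdb.example.com:4242/api/put").openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(json.getBytes(StandardCharsets.UTF_8));
        }
        // A 2xx response (204 No Content by default) indicates the point was accepted.
        System.out.println("HTTP " + conn.getResponseCode());
    }
}
```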
OpenTSDB 2.0
S Users – OVH (3rd largest cloud/hosting provider) uses it to monitor
everything from networking, temperature, and voltage to resource
utilization, etc.
S Yahoo uses it to monitor application performance & statistics
S Arista Networks uses it for high-performance networking
S Other users include Pinterest, eBay, Box, etc.
Apache Slider(Incubator)
S YARN application to deploy existing distributed applications on
YARN, monitor them and make them larger or smaller as
desired -even while the application is running.
S Incubator Apache Project; Similar to Tez for Hive/Pig
S Applications can be stopped, "frozen" and restarted, "thawed"
later; It allows users to create and run multiple instances of
applications, even with different application versions if needed
S Applications such as HBase, Accumulo & Storm can run atop it
Thanks!!
S Credits – Apache, Cloudera, Hortonworks, MapR, Facebook,
Flurry & HBaseCon
S @sawjd22
S www.linkedin.com/in/sawjd/
S Q & A
Editor's Notes
  • #16: How their architecture makes use of modern hardware especially disk drives. B+ Trees work well until there are too many modifications, because they force you to perform costly optimizations to retain that advantage for a limited amount of time. The more and faster you add data at random locations, the faster the pages become fragmented again. Eventually, you may take in data at a higher rate than the optimization process takes to rewrite the existing files. The updates and deletes are done at disk seek rates, rather than disk transfer rates.
  • #17: LSM-trees work at disk transfer rates and scale much better to handle large amounts of data. They also guarantee a very consistent insert rate, as they transform random writes into sequential writes using the logfile plus in-memory store. The reads are independent from the writes, so you also get no contention between these two operations.