Hadoop World 2011: Building Realtime Big Data Services at Facebook with Hadoop and HBase - Jonathan Gray, Facebook
Realtime Big Data at Facebook
with Hadoop and HBase


Jonathan Gray
November ,
Hadoop World NYC
Agenda


   Why Hadoop and HBase?

   Applications of HBase at Facebook

   Future of HBase at Facebook
About Me                                  Jonathan Gray
▪ Previous life as Co-Founder of Streamy.com
  ▪ Realtime Social News Aggregator

  ▪ Big Data problems led us to Hadoop/HBase

  ▪ HBase committer and Hadoop user/complainer



▪ Software Engineer at Facebook
  ▪ Develop, support, and evangelize HBase across teams

  ▪ Recently joined Database Infrastructure Engineering
      MySQL and HBase together at last!
Why Hadoop and HBase?
For Realtime Data?
[Diagram: stack layers: OS, Web server, Database, Language, Cache, Data analysis]
Problems with existing stack
▪ MySQL is stable, but...
  ▪ Limited throughput

  ▪ Not inherently distributed

  ▪ Table size limits

  ▪ Inflexible schema


▪ Memcached is fast, but...
  ▪ Only key-value so data is opaque

  ▪ No write-through
Problems with existing stack
▪ Hadoop is scalable, but...
  ▪ MapReduce is slow

  ▪ Writing MapReduce is difficult

  ▪ Does not support random writes

  ▪ Poor support for random reads
Specialized solutions
▪ Inbox Search
  ▪ Cassandra


▪ High-throughput, persistent key-value
  ▪ Tokyo Cabinet


▪ Large scale data warehousing
  ▪ Hive


▪ Custom C++ servers for lots of other stuff
Finding a new online data store
▪ Consistent patterns emerge
  ▪ Massive datasets, often largely inactive

  ▪ Lots of writes

  ▪ Fewer reads

  ▪ Dictionaries and lists

  ▪ Entity-centric schemas
     ▪   per-user, per-domain, per-app
Finding a new online data store
▪ Other requirements laid out
  ▪ Elasticity

  ▪ High availability

  ▪ Strong consistency within a datacenter

  ▪ Fault isolation


▪ Some non-requirements
  ▪ Network partitions within a single datacenter

  ▪ Active-active serving from multiple datacenters
Finding a new online data store
▪ In         , engineers at FB compared DBs
   ▪ Apache Cassandra, Apache HBase, Sharded MySQL


▪ Compared performance, scalability, and features
   ▪ HBase gave excellent write performance, good reads

   ▪ HBase already included many nice-to-have features
       ▪   Atomic read-modify-write operations
       ▪   Multiple shards per server
       ▪   Bulk importing
       ▪   Range scans
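Two of the features listed above, atomic read-modify-write operations and range scans, can be illustrated with a toy in-memory table. This is a minimal sketch in Python, not the HBase API: the class `ToySortedTable`, its method names, and the row keys are hypothetical, and a single lock stands in for whatever atomicity a real server provides.

```python
import bisect
import threading


class ToySortedTable:
    """Toy sorted key-value table (hypothetical; not the HBase API)."""

    def __init__(self):
        self._keys = []      # row keys, kept in sorted order
        self._values = {}    # row key -> integer counter
        self._lock = threading.Lock()

    def increment(self, key, amount=1):
        """Atomic read-modify-write: read, add, and store under one lock."""
        with self._lock:
            if key not in self._values:
                bisect.insort(self._keys, key)
                self._values[key] = 0
            self._values[key] += amount
            return self._values[key]

    def scan(self, start, stop):
        """Range scan: return (key, value) pairs with start <= key < stop, in key order."""
        with self._lock:
            lo = bisect.bisect_left(self._keys, start)
            hi = bisect.bisect_left(self._keys, stop)
            return [(k, self._values[k]) for k in self._keys[lo:hi]]


table = ToySortedTable()
table.increment("user1:msg")
table.increment("user1:msg")
table.increment("user2:msg", 5)
table.increment("app7:clicks", 3)

# Because keys are sorted, all rows sharing the "user" prefix are adjacent.
user_rows = table.scan("user", "user~")
```

Keeping keys sorted is what makes the prefix scan cheap: the rows for one entity sit next to each other, so a scan touches one contiguous range.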
HBase uses HDFS
We get the benefits of HDFS as a storage system for free
▪ Fault tolerance

▪ Scalability

▪ Checksums detect corruption, repaired from replicas

▪ MapReduce

▪ Fault isolation of disks

▪ HDFS battle tested at petabyte scale at Facebook

  ▪   Lots of existing operational experience
HBase in a nutshell
▪ Sorted and column-oriented

▪ High write throughput

▪ Horizontal scalability

▪ Automatic failover

▪ Regions sharded dynamically
Applications of HBase at Facebook
Use Case
    Titan
(Facebook Messages)
The New Facebook Messages




Messages   IM/Chat   email   SMS
Facebook Messaging
▪ Largest engineering effort in the history of FB
  ▪    engineers over more than a year
  ▪ Incorporates over   infrastructure technologies
      ▪ Hadoop, HBase, Haystack, ZooKeeper, etc...



▪ A product at massive scale on day one

  ▪ Hundreds of millions of active users

  ▪   + billion messages a month
  ▪   k instant messages a second on average
Messaging Challenges
▪ High write throughput
  ▪ Every message, instant message, SMS, and e-mail

  ▪ Search indexes and metadata for all of the above

  ▪ Denormalized schema


▪ Massive clusters
  ▪ So much data and usage requires a large server footprint

  ▪ Do not want outages to impact availability

  ▪ Must be able to easily scale out
High Write Throughput
[Diagram: each write (Key, Value) is appended to the Commit Log with a sequential write and added to the Memstore, which is sorted in memory; a second sequential write is shown on the Memstore side]
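The write path on this slide can be sketched as follows. This is a toy illustration under stated assumptions, not HBase code: the class `ToyWritePath`, the JSON file layout, and the flush threshold of three entries are all hypothetical, and a plain dict that is sorted at flush time stands in for a memstore that is kept sorted in memory.

```python
import json
import os
import tempfile


class ToyWritePath:
    """Toy log-plus-memstore write path (hypothetical; not HBase code)."""

    def __init__(self, directory, flush_threshold=3):
        self.directory = directory
        self.flush_threshold = flush_threshold  # assumed, tiny for the demo
        self.memstore = {}                       # in-memory buffer of recent writes
        self.flushed_files = []
        self.log = open(os.path.join(directory, "commit.log"), "a")

    def put(self, key, value):
        # 1. Sequential write: append the edit to the end of the commit log.
        self.log.write(json.dumps([key, value]) + "\n")
        self.log.flush()
        # 2. Buffer the edit in memory.
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # 3. Sequential write: dump the buffered edits in sorted key order.
        path = os.path.join(self.directory, f"flush-{len(self.flushed_files)}.json")
        with open(path, "w") as out:
            for key in sorted(self.memstore):
                out.write(json.dumps([key, self.memstore[key]]) + "\n")
        self.flushed_files.append(path)
        self.memstore.clear()

    def close(self):
        self.log.close()


with tempfile.TemporaryDirectory() as tmp:
    store = ToyWritePath(tmp)
    for key, value in [("c", 3), ("a", 1), ("b", 2), ("d", 4)]:
        store.put(key, value)
    store.close()
    with open(store.flushed_files[0]) as f:
        flushed_keys = [json.loads(line)[0] for line in f]
    with open(os.path.join(tmp, "commit.log")) as f:
        logged_keys = [json.loads(line)[0] for line in f]
```

Neither step seeks within a file: the log records edits in arrival order, and the flush writes them out in key order in one pass.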
Horizontal Scalability
[Diagram: Regions spread across multiple servers]
Automatic Failover
[Diagram: a server dies; the HBase client finds the new server from META]
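The client side of this failover can be sketched as a cached location that is refreshed from a lookup table when a request fails. Everything named here (`ToyCluster`, `ToyClient`, `ServerDown`, the region and server names) is hypothetical and only mimics the behavior on the slide; the real client and META table work differently in detail.

```python
class ServerDown(Exception):
    """Raised by the toy cluster when a request hits a dead or wrong server."""


class ToyCluster:
    """Hypothetical stand-in for a cluster plus its META lookup table."""

    def __init__(self):
        self.meta = {"region-1": "server-a"}          # region -> current server
        self.alive = {"server-a", "server-b"}
        self.data = {"region-1": {"row1": "hello"}}   # lives on shared storage

    def lookup(self, region):
        return self.meta[region]

    def read(self, server, region, row):
        if server not in self.alive or self.meta[region] != server:
            raise ServerDown(server)
        return self.data[region][row]

    def kill(self, server, reassign_to):
        # Regions of the dead server are reassigned to a live one; the data
        # itself survives because it lives on shared storage in this toy.
        self.alive.discard(server)
        for region, owner in self.meta.items():
            if owner == server:
                self.meta[region] = reassign_to


class ToyClient:
    def __init__(self, cluster):
        self.cluster = cluster
        self.location_cache = {}
        self.meta_lookups = 0

    def _locate(self, region, refresh=False):
        if refresh or region not in self.location_cache:
            self.location_cache[region] = self.cluster.lookup(region)
            self.meta_lookups += 1
        return self.location_cache[region]

    def get(self, region, row):
        try:
            return self.cluster.read(self._locate(region), region, row)
        except ServerDown:
            # Cached location is stale: find the new server from META, retry once.
            server = self._locate(region, refresh=True)
            return self.cluster.read(server, region, row)


cluster = ToyCluster()
client = ToyClient(cluster)
first = client.get("region-1", "row1")
cluster.kill("server-a", reassign_to="server-b")
second = client.get("region-1", "row1")
```

The caller of `get` never sees the failure; the only visible cost is one extra lookup after the server dies.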
Facebook Messages Stats
▪   B+ messages per day
    ▪   B+ read/write ops to HBase per day
        ▪ . M ops/sec at peak

        ▪     read,     write
        ▪~   columns per operation across multiple families

▪   PB+ of online data in HBase
    ▪ LZO compressed and un-replicated (   PB replicated)
    ▪ Growing at    TB/month
Use Case
   Puma
(Facebook Insights)
Before Puma
[Diagram: Offline ETL. Web Tier -(Scribe)-> HDFS -(MR)-> Hive -(SQL)-> MySQL; a further arrow is labeled SQL. Latency: 8-24 hours]
Puma
[Diagram: Realtime ETL. Web Tier -(Scribe)-> HDFS -(PTail)-> Puma -(HTable)-> HBase; a further arrow is labeled Thrift. Latency: 2-30 seconds]
Puma as Realtime MapReduce
▪ Map phase with PTail
  ▪ Divide the input log stream into N shards

  ▪ First version supported random bucketing

  ▪ Now supports application-level bucketing


▪ Reduce phase with HBase
  ▪ Every row+column in HBase is an output key

  ▪ Aggregate key counts using atomic counters

  ▪ Can also maintain per-key lists or other structures
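The two phases above can be sketched in a few lines of Python. This is an illustration only, not Puma or PTail code: the event tuples, the shard count, the CRC32 hash, and the function names are assumptions, and a `Counter` stands in for atomic counters in HBase.

```python
import zlib
from collections import Counter, defaultdict

N_SHARDS = 4  # assumed shard count for the demo


def shard_for(bucket_key, n_shards=N_SHARDS):
    """Application-level bucketing: the same key always maps to one shard."""
    return zlib.crc32(bucket_key.encode("utf-8")) % n_shards


def map_phase(events):
    """Divide the input stream of (domain, url, action) events into shards."""
    shards = defaultdict(list)
    for domain, url, action in events:
        shards[shard_for(domain)].append((domain, url, action))
    return shards


def reduce_phase(shards):
    """Aggregate counts; each (row, column) pair is one output key."""
    counters = Counter()  # stands in for atomic counters in the store
    for shard_events in shards.values():
        for domain, url, action in shard_events:
            counters[(domain, action)] += 1  # per-domain count
            counters[(url, action)] += 1     # per-URL count
    return counters


events = [
    ("example.com", "example.com/a", "like"),
    ("example.com", "example.com/a", "click"),
    ("example.com", "example.com/b", "like"),
    ("other.org", "other.org/x", "like"),
]
shards = map_phase(events)
counts = reduce_phase(shards)
```

Bucketing by domain rather than at random means every event for one domain lands in the same shard, so a shard could pre-aggregate its counts before writing them out.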
Puma for Facebook Insights
▪ Realtime URL/Domain Insights
  ▪ Domain owners can see deep analytics for their site

  ▪ Clicks, Likes, Shares, Comments, Impressions

  ▪ Detailed demographic breakdowns (anonymized)

  ▪ Top URLs calculated per-domain and globally


▪ Massive Throughput
  ▪ Billions of URLs

  ▪>   Million counter increments per second
Future of Puma
▪ Centrally managed service for many products


▪ Several other applications in production
  ▪ Commerce Tracking

  ▪ Ad Insights


▪ Making Puma generic
  ▪ Dynamically configured by product teams

  ▪ Custom query language
Use Case
         ODS
(Facebook Internal Metrics)
ODS
▪ Operational Data Store
  ▪ System metrics (CPU, Memory, IO, Network)

  ▪ Application metrics (Web, DB, Caches)

  ▪ Facebook metrics (Usage, Revenue)
      ▪   Easily graph this data over time
      ▪   Supports complex aggregation, transformations, etc.

▪ Difficult to scale with MySQL

  ▪ Millions of unique time-series with billions of points

  ▪ Irregular data growth patterns
Dynamic sharding of regions
[Diagram: Regions across servers; one server overloaded]
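One way to picture dynamic sharding is a table whose key ranges split as they grow. The slide only names the idea; the sketch below assumes a split at the median row key once a region exceeds a row-count threshold. The classes `ToyRegion` and `ToyTable` and the threshold of four rows are hypothetical.

```python
import bisect


class ToyRegion:
    """A contiguous, sorted range of row keys (hypothetical sketch)."""

    def __init__(self, start_key, rows=None):
        self.start_key = start_key
        self.rows = dict(rows or {})

    def split(self):
        """Split at the median row key into two daughter regions."""
        keys = sorted(self.rows)
        mid_key = keys[len(keys) // 2]
        lower = ToyRegion(self.start_key, {k: self.rows[k] for k in keys if k < mid_key})
        upper = ToyRegion(mid_key, {k: self.rows[k] for k in keys if k >= mid_key})
        return lower, upper


class ToyTable:
    def __init__(self, max_rows_per_region=4):
        self.max_rows = max_rows_per_region  # assumed split threshold
        self.regions = [ToyRegion("")]       # kept sorted by start_key

    def _region_index(self, key):
        # The owning region is the last one whose start_key is <= key.
        starts = [r.start_key for r in self.regions]
        return bisect.bisect_right(starts, key) - 1

    def put(self, key, value):
        i = self._region_index(key)
        region = self.regions[i]
        region.rows[key] = value
        if len(region.rows) > self.max_rows:
            # Replace the oversized region with its two daughters.
            self.regions[i:i + 1] = region.split()


table = ToyTable()
for n in range(10):
    table.put(f"row{n:02d}", n)
region_sizes = [len(r.rows) for r in table.regions]
region_starts = [r.start_key for r in table.regions]
```

After a split, each daughter region can be served by a different machine, which is how load can move away from an overloaded server.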
Future of HBase at Facebook
User and Graph Data
      in HBase
Why now?
▪ MySQL+Memcached hard to replace, but...
  ▪ Joins and other RDBMS functionality are gone

  ▪ From writing SQL to using APIs

  ▪ Next generation of services and caches make the persistent storage engine transparent to www

▪ Primarily a financially motivated decision
  ▪ MySQL works, but can HBase save us money?

  ▪ Also, are there things we just couldn’t do before?
HBase vs. MySQL
▪ MySQL at Facebook
  ▪ Tier size determined solely by IOPS

  ▪ Heavy on random IO for reads and writes

  ▪ Rely on fast disks or flash to scale individual nodes



▪ HBase showing promise of cost savings
  ▪ Fewer IOPS on write-heavy workloads

  ▪ Larger tables on denser, cheaper nodes

  ▪ Simpler operations and replication “for free”
HBase vs. MySQL
▪ MySQL is not going anywhere soon
  ▪ It works!



▪ But HBase is a great addition to the tool belt
  ▪ Different set of trade-offs

  ▪ Great at storing key-values, dictionaries, and lists

  ▪ Products with heavy write requirements

  ▪ Generated data

  ▪ Potential capital and operational cost savings
UDB Challenges
▪ MySQL has a       + year head start
  ▪ HBase is still a pre-   . database system

▪ Insane Requirements
  ▪ Zero data loss, low latency, very high throughput

  ▪ Reads, writes, and atomic read-modify-writes

  ▪ WAN replication, backups w/ point-in-time recovery

  ▪ Live migration of critical user data w/ existing shards

  ▪ queryf() and other fun edge cases to deal with
Technical/Developer oriented talk tomorrow:




Apache HBase Road Map
A short history of nearly everything HBase. Past, present, and future.

Wednesday @ 1PM in the Met Ballroom
Check out the HBase at Facebook Page:

facebook.com/UsingHbase


    Thanks! Questions?
