Facebook - Jonathan Gray - Hadoop World 2010
HBase at Facebook
Jonathan Gray
Hadoop World
October 12, 2010
Agenda
1 Data at Facebook
2 HBase Development
3 Future Work
Data at Facebook
500 Million active monthly users
>500 Billion page views per month
25 Billion pieces of content per month
[Diagram labels: Cache, OS, Web server, Database, Language, Data analysis]
[Diagram labels: Scribe/Hadoop clusters (Cluster 1, Cluster 2), 5-15 minutes, Federated MySQL, Daily, Platinum Hadoop Cluster, Silver Cluster]
HBase
Key Properties
▪ Linearly scalable
▪ Fast indexed writes
▪ Tight integration with Hadoop
Bridges gap between online and offline
Use Case #1
Incremental Updates into Data Warehouse
▪ Currently
▪ Nightly dumps of UDBs into Warehouse
▪ With HBase
▪ Tail UDB replication logs into HBase
UDB to Warehouse in minutes
Use Case #2
High Frequency Counters and Realtime Analytics
▪ Currently
▪ Scribe to HDFS, periodically aggregate to UDB
▪ With HBase
▪ Scribe to HBase, read in realtime with API or MR
Storage, serving, and analysis in one
Use Case #3
User-facing Database for Write Intensive Workloads
▪ Currently
▪ Constantly expanding UDB and Memcache tiers
▪ With HBase
▪ Fast writes, automatic partitioning, linear scaling
Fast and scalable writes, just add nodes
HBase Development
Hive Integration
HBase and Hive
▪ HBase Tables usable as Hive Tables
▪ ETL data target
▪ Query data source
▪ Support for different read/write patterns
▪ API random write or MR bulk load
▪ API random read or MR table scan
HBase Master
Re-architected for HA and Testability
▪ Increased usage of ZooKeeper for failover
▪ Region transitions in ZK
▪ Working master failover in all cases
▪ Refactor/Redesign of major components
▪ Load balancer, cluster startup, failover redesigned
▪ Emphasis on independent testability
Random Read Optimizations
Performance degrades with lots of files
▪ Bloom filters
▪ Dynamic Row or Row+Column as HFile metadata
▪ Skip files on disk that don’t match
▪ Timestamp ranges
▪ Stored as HFile metadata
▪ Skip files on disk that don’t cover time range
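The two file-skipping ideas on this slide can be sketched together. This is a minimal illustration in plain Python, not HBase code: `BloomFilter`, `StoreFile`, and `files_to_read` are hypothetical names, and the filter sizes are arbitrary assumptions.

```python
import hashlib
from dataclasses import dataclass, field


class BloomFilter:
    """Tiny Bloom filter: may give false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        # Derive several bit positions per key from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


@dataclass
class StoreFile:
    """Hypothetical stand-in for a file on disk plus its metadata."""
    name: str
    cells: dict  # row key -> list of (timestamp, value)
    bloom: BloomFilter = field(default_factory=BloomFilter)
    min_ts: int = 0
    max_ts: int = 0

    def __post_init__(self):
        # Build the metadata once, when the file is "written".
        stamps = [ts for versions in self.cells.values() for ts, _ in versions]
        self.min_ts = min(stamps, default=0)
        self.max_ts = max(stamps, default=0)
        for row in self.cells:
            self.bloom.add(row)


def files_to_read(files, row, ts_from, ts_to):
    """Return only the files that could hold `row` within [ts_from, ts_to]."""
    selected = []
    for f in files:
        if f.max_ts < ts_from or f.min_ts > ts_to:
            continue  # file does not cover the requested time range
        if not f.bloom.might_contain(row):
            continue  # Bloom filter says the row is definitely absent
        selected.append(f)
    return selected
```

Both checks use only per-file metadata, so a file that is skipped is never opened. A Bloom filter can still let an occasional non-matching file through; it only guarantees that a file holding the row is never skipped.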
Random Read Optimizations
Performance degrades with wide rows
▪ Aggressively seek/reseek
▪ Use query and block index to skip blocks
▪ Stop processing as soon as we finish query
▪ Expose seeking to Filter API
▪ Allow specialized optimizations
▪ Millions of versions in a row, grab 10
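A rough sketch of the seek idea for wide rows, assuming a row is stored as sorted blocks with an index of each block's first key. `BlockedRow` and `scan_from` are hypothetical names used only for illustration; the real storage layout is more involved.

```python
from bisect import bisect_right


class BlockedRow:
    """Hypothetical wide row stored as fixed-size blocks of sorted keys."""

    def __init__(self, sorted_keys, block_size=4):
        self.blocks = [sorted_keys[i:i + block_size]
                       for i in range(0, len(sorted_keys), block_size)]
        # Block index: the first key of every block.
        self.index = [block[0] for block in self.blocks]
        self.blocks_read = 0  # counts how many blocks a scan touched

    def scan_from(self, start_key, limit):
        """Return up to `limit` keys >= start_key, reading as few blocks as possible."""
        # Seek: use the index to jump straight to the candidate block
        # instead of walking every block from the start of the row.
        block_no = max(bisect_right(self.index, start_key) - 1, 0)
        results = []
        while block_no < len(self.blocks) and len(results) < limit:
            self.blocks_read += 1
            for key in self.blocks[block_no]:
                if key >= start_key:
                    results.append(key)
                    if len(results) == limit:
                        break  # stop as soon as the query is satisfied
            block_no += 1
        return results
```

With 1,000 keys in blocks of 10, asking for 10 keys starting at key 905 touches 2 blocks rather than all 100, which is the "millions of versions in a row, grab 10" case in miniature.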
Administration Tools
Detect and repair potential issues
▪ HBCK
▪ FSCK for HBase
▪ Detect and repair cluster issues
▪ Cluster Verification
▪ Ensure cluster can be written to, read from
▪ Tables can be created/disabled/dropped
Hadoop Improvements
HDFS Appends
▪ Hadoop 0.20
▪ Widely deployed but no support for appends
▪ Hadoop 0.20 with append support
▪ Apache Hadoop 0.20-Append
▪ Cloudera’s CDH version 3
▪ Facebook’s version of Hadoop 0.20
▪ http://github.com/facebook/hadoop-20-append
Hadoop Improvements
HDFS rolling upgrades and NameNode HA
▪ HDFS in online application
▪ Need to support upgrades without downtime
▪ More sensitive to NameNode SPOF
▪ Hadoop AvatarNode
▪ Hot standby pair of NameNodes
▪ Failover to new version of NameNode
▪ Failover to hot standby in seconds under failure
Coming Soon
New Features
▪ East coast / west coast replication
▪ Asynchronous replication between data centers
▪ Faster recovery
▪ Distributed log splitting
▪ Master controlled rolling restart
▪ Fast, and retains assignment information
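The idea behind distributed log splitting can be sketched as follows, assuming a failed server's logs interleave edits for many regions and must be regrouped per region before replay. The function names and the thread pool are illustrative assumptions; in a real cluster the work would be spread across servers rather than threads.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def split_one_log(entries):
    """Group a single log file's (region, edit) pairs by region, keeping order."""
    per_region = defaultdict(list)
    for region, edit in entries:
        per_region[region].append(edit)
    return per_region


def distributed_split(log_files, workers=4):
    """Split many log files in parallel, then merge the per-region results."""
    merged = defaultdict(list)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, so edits stay ordered
        # across log files as well as within each file.
        for per_region in pool.map(split_one_log, log_files):
            for region, edits in per_region.items():
                merged[region].extend(edits)
    return dict(merged)
```

Each log file is independent work, so recovery time shrinks roughly with the number of workers instead of being bound by a single splitter.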
Future Work
Coprocessors
Complex server-side operations
▪ Dynamically loaded server-side logic
▪ Hook into read/write and cluster operations
▪ Endless possibilities
▪ Server-side merges and joins
▪ Lightweight MapReduce for aggregation
▪ Efficient secondary indexing
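To make "hook into read/write" concrete, here is a toy sketch of server-side hooks that maintain a secondary index. It is not the HBase coprocessor API: `Region`, `SecondaryIndexObserver`, `pre_put`, and `post_put` are hypothetical names chosen for this example.

```python
class Region:
    """Hypothetical in-memory region that calls observers around each write."""

    def __init__(self):
        self.rows = {}       # row key -> {column: value}
        self.observers = []  # dynamically registered server-side logic

    def register(self, observer):
        self.observers.append(observer)

    def put(self, row, column, value):
        for obs in self.observers:
            obs.pre_put(self, row, column, value)
        self.rows.setdefault(row, {})[column] = value
        for obs in self.observers:
            obs.post_put(self, row, column, value)


class SecondaryIndexObserver:
    """Maintains value -> set of row keys for one indexed column."""

    def __init__(self, column):
        self.column = column
        self.index = {}

    def pre_put(self, region, row, column, value):
        # Drop the stale index entry before the old value is overwritten.
        if column != self.column:
            return
        old = region.rows.get(row, {}).get(column)
        if old is not None:
            self.index.get(old, set()).discard(row)

    def post_put(self, region, row, column, value):
        if column == self.column:
            self.index.setdefault(value, set()).add(row)

    def lookup(self, value):
        return sorted(self.index.get(value, set()))
```

The client only calls `put`; the index is kept up to date on the server side, next to the data, which is the appeal of the approach.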
Intelligent Load Balancing
Complex notion of load
▪ Currently based only on region count
▪ Different regions have different access patterns
▪ And data locality equally important
▪ Next generation load balancing algorithms
▪ Consider complex notion of read load / write load
▪ And HDFS block locations for locality
▪ Retain assignment information between restarts
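The gap between count-based and load-based balancing can be shown with a small sketch. The greedy heuristic and the per-region load numbers below are assumptions for illustration, not the algorithm HBase uses, and data locality is left out to keep the example short.

```python
import heapq


def balance_by_count(regions, servers):
    """Round-robin: every server ends up with the same number of regions."""
    plan = {s: [] for s in servers}
    for i, region in enumerate(sorted(regions)):
        plan[servers[i % len(servers)]].append(region)
    return plan


def balance_by_load(regions, servers):
    """Greedy: place the heaviest region on the currently least-loaded server."""
    plan = {s: [] for s in servers}
    heap = [(0, s) for s in servers]  # (assigned load, server)
    heapq.heapify(heap)
    by_weight = sorted(regions.items(), key=lambda kv: (-kv[1], kv[0]))
    for region, load in by_weight:
        assigned, server = heapq.heappop(heap)
        plan[server].append(region)
        heapq.heappush(heap, (assigned + load, server))
    return plan


def server_loads(plan, regions):
    """Total load per server for a given assignment plan."""
    return {s: sum(regions[r] for r in assigned) for s, assigned in plan.items()}


# Hypothetical combined read/write load per region (requests per second).
regions = {"r1": 100, "r2": 5, "r3": 90, "r4": 5}
servers = ["s1", "s2"]
count_loads = server_loads(balance_by_count(regions, servers), regions)
load_loads = server_loads(balance_by_load(regions, servers), regions)
```

Here both servers hold two regions under count-based balancing, yet one carries 190 of the 200 requests per second. The load-aware plan gives one server a single hot region and the other three regions, and each ends up with 100.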
Other Future Work
Cluster Performance
▪ Quality of service
▪ One MapReduce job can take down cluster
▪ Dynamic configuration changes
▪ Change important parameters on running cluster
▪ HDFS performance
▪ Critical target for long-term HBase performance
(c) 2007 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved. 1.0
