Real-time Analytics at Facebook



Zheng Shao
10/18/2011
Agenda
 1   Analytics and Real-time

 2   Data Freeway

 3   Puma

 4   Future Work
Analytics and Real-time
what and why
Facebook Insights
• Use cases
▪   Websites/Ads/Apps/Pages
▪   Time series
▪   Demographic break-downs
▪   Unique counts/heavy hitters

• Major challenges
▪   Scalability
▪   Latency
Analytics based on Hadoop/Hive
HTTP →(seconds)→ Scribe →(seconds)→ NFS →(hourly Copier/Loader)→ Hive/Hadoop →(daily Pipeline Jobs)→ MySQL

• 3000-node Hadoop cluster

• Copier/Loader: Map-Reduce hides machine failures

• Pipeline Jobs: Hive allows SQL-like syntax

• Good scalability, but poor latency! 24–48 hours.
How to Get Lower Latency?




• Small-batch Processing
▪   Run Map-Reduce/Hive every hour, every 15 min, every 5 min, …
▪   How do we reduce per-batch overhead?

• Stream Processing
▪   Aggregate the data as soon as it arrives
▪   How do we solve the reliability problem?
Decisions
• Stream Processing wins!



• Data Freeway
▪   Scalable Data Stream Framework

• Puma
▪   Reliable Stream Aggregation Engine
Data Freeway
scalable data stream
Scribe

Scribe Clients → Scribe Mid-Tier → Scribe Writers → NFS
NFS → Batch Copier → HDFS
NFS → tail/fopen → Log Consumer

• Simple push/RPC-based logging system


• Open-sourced in 2008. 100 log categories at that time.

• Routing driven by static configuration.
Data Freeway
Scribe Clients → Calligraphus Mid-tier → Calligraphus Writers → HDFS (DataNodes)
HDFS → PTail → Log Consumer
HDFS → Continuous Copier → HDFS → PTail (in the plan)
Zookeeper coordinates the Calligraphus Mid-tier and Writers
• 9GB/sec at peak, 10 sec latency, 2500 log categories
Calligraphus
• RPC → File System
▪   Each log category is represented by 1 or more FS directories
▪   Each directory is an ordered list of files

• Bucketing support
▪   Application buckets are application-defined shards.
▪   Infrastructure buckets allow log streams to scale from x B/s to x GB/s

• Performance
▪   Latency: Call sync every 7 seconds
▪   Throughput: Easily saturate 1Gbit NIC
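
As a rough illustration of the bucketing idea (not Calligraphus's actual layout or naming), here is a minimal Python sketch in which each (category, bucket) pair maps to one FS directory holding an ordered list of files; the hash routing and path scheme are assumptions.

```python
import hashlib

def bucket_for(key: str, num_buckets: int) -> int:
    """Pick an infrastructure bucket for a routing key (illustrative hash choice)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % num_buckets

def directory_for(category: str, bucket: int) -> str:
    """Each log category maps to one or more FS directories, one per bucket."""
    return f"/calligraphus/{category}/bucket-{bucket:04d}"

def file_for(directory: str, sequence: int) -> str:
    """Each directory is an ordered list of files, here named by sequence number."""
    return f"{directory}/part-{sequence:06d}"

# Example: route an "ad_click" log line, sharded by an application-defined key.
d = directory_for("ad_click", bucket_for("user:42", num_buckets=64))
print(file_for(d, sequence=3))   # e.g. /calligraphus/ad_click/bucket-00NN/part-000003
```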
Continuous Copier
• File System → File System

• Low latency and smooth network usage

• Deployment
▪   Implemented as long-running map-only job
▪   Can move to any simple job scheduler

• Coordination
▪   Use lock files on HDFS for now
▪   Plan to move to Zookeeper
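
A minimal sketch of the lock-file idea, using the local filesystem's atomic create-if-absent as a stand-in for an HDFS create that fails when the file already exists; the function and path names are made up for illustration.

```python
import os

def try_lock(lock_dir: str, category: str, worker_id: str) -> bool:
    """Claim a log category by atomically creating its lock file.

    Whichever copier creates the file first owns that category's copy;
    everyone else backs off and tries another category.
    """
    path = os.path.join(lock_dir, f"{category}.lock")
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False                      # another copier already holds the lock
    os.write(fd, worker_id.encode())      # record the owner for debugging
    os.close(fd)
    return True

def release_lock(lock_dir: str, category: str) -> None:
    os.remove(os.path.join(lock_dir, f"{category}.lock"))
```

Moving to Zookeeper would presumably replace the lock file with an ephemeral node, so a crashed copier releases its categories automatically.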
PTail
directory / directory / directory (each an ordered list of files) → PTail → one stream with checkpoints

  • File System → Stream (→ RPC)

  • Reliability
  ▪   Checkpoints inserted into the data stream
  ▪   Can roll back to tail from any data checkpoints
  ▪   No data loss/duplicates
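
A minimal sketch of the checkpointing idea, with offsets simplified to line indexes; the slides do not show PTail's actual checkpoint format, so the Checkpoint fields here are assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Checkpoint:
    """A resumable position in one directory's ordered file list
    (offsets are line indexes here for brevity; real offsets would be bytes)."""
    file_index: int
    line_index: int

def tail(files: list[list[str]], start: Checkpoint):
    """Yield log lines interleaved with checkpoints.

    Re-running tail() from any yielded Checkpoint replays exactly the
    lines after it: no loss, no duplicates."""
    for fi in range(start.file_index, len(files)):
        begin = start.line_index if fi == start.file_index else 0
        for li in range(begin, len(files[fi])):
            yield files[fi][li]
        yield Checkpoint(fi + 1, 0)   # checkpoint inserted into the data stream

# Example: resume mid-way through the second file of a category.
stream = list(tail([["a1", "a2"], ["b1", "b2"]], Checkpoint(1, 1)))
# -> ["b2", Checkpoint(file_index=2, line_index=0)]
```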
Channel Comparison
            Push / RPC   Pull / FS
Latency     1-2 sec      10 sec
Loss/Dups   Few          None
Robustness  Low          High
Complexity  Low          High

Channel-conversion components: Scribe (Push/RPC → Push/RPC), Calligraphus (Push/RPC → Pull/FS),
Continuous Copier (Pull/FS → Pull/FS), PTail + ScribeSend (Pull/FS → Push/RPC) —
any combination of channels can be chained.
Puma
real-time aggregation/storage
Overview


Log Stream → Aggregations → Storage → Serving
• ~ 1M log lines per second, but light read

• Multiple Group-By operations per log line

• The first key in Group By is always time/date-related

• Complex aggregations: Unique user count, most frequent
  elements
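
To make the shape of the workload concrete, here is a small Python sketch (field names such as adid and age are illustrative) in which every incoming log line feeds several group-bys whose first key is a time bucket.

```python
from collections import defaultdict

# Two illustrative group-bys over the same ad-impression log line;
# the first key of every group-by is a time bucket (the hour here).
GROUP_BYS = {
    "per_ad":        lambda e: (e["time"] // 3600, e["adid"]),
    "per_ad_by_age": lambda e: (e["time"] // 3600, e["adid"], e["age"]),
}

counters = {name: defaultdict(int) for name in GROUP_BYS}

def process(event: dict) -> None:
    """Apply every group-by to one log line as soon as it arrives."""
    for name, key_fn in GROUP_BYS.items():
        counters[name][key_fn(event)] += 1

process({"time": 1318900000, "adid": 7, "age": 25})
```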
MySQL and HBase: one page
                   MySQL                        HBase
Parallel           Manual sharding              Automatic load balancing
Fail-over          Manual master/slave switch   Automatic
Read efficiency    High                         Low
Write efficiency   Medium                       High
Columnar support   No                           Yes
Puma2 Architecture




PTail → Puma2 → HBase → Serving

• PTail provides parallel data streams

• For each log line, Puma2 issues “increment” operations to
  HBase. Puma2 is symmetric (no sharding).

• HBase: single increment on multiple columns
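
A minimal sketch of the Puma2 write path, with a tiny in-memory stand-in for the HBase table rather than real HBase client calls; the row-key and column names are illustrative.

```python
from collections import defaultdict

class FakeHBase:
    """Stand-in for the HBase table Puma2 writes to (not real client code)."""
    def __init__(self):
        self.rows = defaultdict(lambda: defaultdict(int))

    def increment(self, row_key: str, deltas: dict[str, int]) -> None:
        """One increment call touching multiple columns of one row."""
        for column, delta in deltas.items():
            self.rows[row_key][column] += delta

hbase = FakeHBase()

def puma2_process(log_line: dict) -> None:
    # Puma2-style: no local state, every log line becomes an increment;
    # any node can process any line (symmetric, no sharding).
    row = f"{log_line['time'] // 3600}:{log_line['adid']}"
    hbase.increment(row, {"impressions": 1, "clicks": int(log_line["clicked"])})

puma2_process({"time": 1318900000, "adid": 7, "clicked": True})
```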
Puma2: Pros and Cons
• Pros
▪   Puma2 code is very simple.
▪   Puma2 service is very easy to maintain.

• Cons
▪   “Increment” operation is expensive.
▪   Does not support complex aggregations.
▪   Hacky implementation of “most frequent elements”.
▪   Can cause small data duplicates.
Improvements in Puma2
• Puma2
▪   Batching of requests. Didn't work well because of long-tail distribution.

• HBase
▪   “Increment” operation optimized by reducing locks.
▪   HBase region/HDFS file locality; short-circuited read.
▪   Reliability improvements under high load.

• Still not good enough!
Puma3 Architecture



PTail → Puma3 → HBase → Serving

• Puma3 is sharded by aggregation key.

• Each shard is a hashmap in memory.

• Each entry in the hashmap is a pair of an aggregation key and a user-defined aggregation.

• HBase serves as persistent key-value storage.
Puma3 Architecture



PTail → Puma3 → HBase → Serving

• Write workflow
▪   For each log line, extract the columns for key and value.
▪   Look up the key in the hashmap and call the user-defined aggregation.
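
A minimal Python sketch of this write path: shard by aggregation key, keep an in-memory hashmap per shard, and call a user-defined aggregation object for each log line. The CountAndUnique aggregation and field names are illustrative, not Puma3's actual classes.

```python
from collections import defaultdict

class CountAndUnique:
    """Illustrative user-defined aggregation: count(1) plus unique users
    (an exact set here for brevity; the slides mention sampling/bloom filters)."""
    def __init__(self):
        self.count = 0
        self.users = set()

    def add(self, userid: int) -> None:
        self.count += 1
        self.users.add(userid)

NUM_SHARDS = 4
shards = [defaultdict(CountAndUnique) for _ in range(NUM_SHARDS)]

def write(event: dict) -> None:
    # 1. Extract the aggregation key and value columns from the log line.
    key = (event["time"] // 3600, event["adid"])
    # 2. Route to a shard by aggregation key (Puma3 is sharded, unlike Puma2).
    shard = shards[hash(key) % NUM_SHARDS]
    # 3. Look up in the hashmap and call the user-defined aggregation.
    shard[key].add(event["userid"])

write({"time": 1318900000, "adid": 7, "userid": 42})
```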
Puma3 Architecture



PTail → Puma3 → HBase → Serving

• Checkpoint workflow
▪   Every 5 min, save the modified hashmap entries and the PTail checkpoint to HBase.
▪   On startup (after node failure), load from HBase.
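
A minimal sketch of the checkpoint idea: a shard tracks which hashmap entries changed and persists them together with the latest PTail checkpoint, so a restart resumes the stream from exactly the saved position. The dictionary standing in for HBase and all names are illustrative.

```python
class HashmapShard:
    """Minimal sketch: in-memory counters plus dirty tracking for checkpoints."""
    def __init__(self, store: dict):
        self.store = store                 # stands in for an HBase table
        self.entries: dict[str, int] = {}
        self.dirty: set[str] = set()
        self.ptail_checkpoint = None

    def add(self, key: str, delta: int, checkpoint) -> None:
        self.entries[key] = self.entries.get(key, 0) + delta
        self.dirty.add(key)
        self.ptail_checkpoint = checkpoint

    def checkpoint(self) -> None:
        """Every ~5 min: persist modified entries and the PTail checkpoint together,
        so a restart resumes the stream exactly where the saved state left off."""
        for key in self.dirty:
            self.store[key] = self.entries[key]
        self.store["__ptail_checkpoint__"] = self.ptail_checkpoint
        self.dirty.clear()

    def recover(self) -> None:
        """On startup after a failure, reload state and resume from the checkpoint."""
        self.ptail_checkpoint = self.store.get("__ptail_checkpoint__")
        self.entries = {k: v for k, v in self.store.items()
                        if k != "__ptail_checkpoint__"}
```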
Puma3 Architecture



PTail → Puma3 → HBase → Serving

• Read workflow
▪   Read uncommitted: serve directly from the in-memory hashmap; load from HBase on a miss.
▪   Read committed: read from HBase and serve.
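
Continuing the HashmapShard sketch above (still illustrative), the two read modes differ only in whether they consult the in-memory state:

```python
def read_uncommitted(shard, key):
    """Serve the freshest value straight from memory; fall back to HBase on a miss."""
    if key in shard.entries:
        return shard.entries[key]
    return shard.store.get(key)          # load from HBase on miss

def read_committed(shard, key):
    """Serve only values that have survived a checkpoint (i.e., from HBase)."""
    return shard.store.get(key)
```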
Puma3 Architecture



PTail → Puma3 → HBase → Serving

• Join
▪   Static join table in HBase.
▪   Distributed hash lookup in a user-defined function (udf).
▪   A local cache improves the throughput of the udf a lot.
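
A minimal sketch of the local-cache idea, with a plain dict standing in for the static HBase join table and functools.lru_cache as the local cache; udf_age is a hypothetical udf, not Puma's actual API.

```python
import functools

STATIC_JOIN_TABLE = {42: 25}   # stands in for a static join table in HBase (userid -> age)

@functools.lru_cache(maxsize=100_000)
def udf_age(userid: int):
    """udf.age(userid): remote hash lookup with a local cache in front of it."""
    return STATIC_JOIN_TABLE.get(userid)   # the real lookup would be an HBase get

print(udf_age(42))   # first call hits the "remote" table
print(udf_age(42))   # repeat calls are served from the local cache
```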
Puma2 / Puma3 comparison
• Puma3 is much better in write throughput
▪   Use 25% of the boxes to handle the same load.
▪   HBase is really good at write throughput.

• Puma3 needs a lot of memory
▪   Use 60GB of memory per box for the hashmap
▪   SSD can scale to 10x per box.
Puma3 Special Aggregations
• Unique Counts Calculation
▪   Adaptive sampling
▪   Bloom filter (in the plan)

• Most frequent item (in the plan)
▪   Lossy counting
▪   Probabilistic lossy counting
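
As one concrete example of these approximate aggregations, here is a small sketch of classic lossy counting (Manku & Motwani); the slides do not show Puma3's actual implementation, so treat this only as the textbook algorithm.

```python
import math

def lossy_count(stream, epsilon=0.001):
    """Lossy counting: approximate heavy hitters with at most
    epsilon * N undercount per item, using bounded memory."""
    width = math.ceil(1 / epsilon)        # bucket width
    counts, deltas = {}, {}
    for n, item in enumerate(stream, start=1):
        bucket = math.ceil(n / width)
        if item in counts:
            counts[item] += 1
        else:
            counts[item] = 1
            deltas[item] = bucket - 1     # max possible undercount so far
        if n % width == 0:                # prune at bucket boundaries
            for key in [k for k in counts if counts[k] + deltas[k] <= bucket]:
                del counts[key], deltas[key]
    return counts

print(lossy_count(["a", "b", "a", "c", "a"], epsilon=0.5))
```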
PQL – Puma Query Language
• CREATE INPUT TABLE t ('time', 'adid', 'userid');

• CREATE VIEW v AS
  SELECT *, udf.age(userid)
  FROM t
  WHERE udf.age(userid) > 21

• CREATE HBASE TABLE h …

• CREATE LOGICAL TABLE l …

• CREATE AGGREGATION 'abc'
  INSERT INTO l (a, b, c)
  SELECT
      udf.hour(time),
      adid,
      age,
      count(1),
      udf.count_distinct(userid)
  FROM v
  GROUP BY
      udf.hour(time),
      adid,
      age;
Future Work
challenges and opportunities
Future Work
• Scheduler Support
▪   Just need simple scheduling because the workload is continuous

• Mass adoption
▪   Migrate most daily reporting queries from Hive

• Open Source
▪   Biggest bottleneck: Java Thrift dependency
▪   Will come one by one
Similar Systems
• STREAM from Stanford

• Flume from Cloudera

• S4 from Yahoo

• Rainbird/Storm from Twitter

• Kafka from LinkedIn
Key differences
• Scalable Data Streams
▪   9 GB/sec with < 10 sec of latency
▪   Both Push/RPC-based and Pull/File System-based
▪   Components to support arbitrary combination of channels

• Reliable Stream Aggregations
▪   Good support for Time-based Group By, Table-Stream Lookup Join
▪   Query Language:    Puma : Realtime-MR = Hive : MR
▪   No support for sliding windows or stream joins
(c) 2009 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved. 1.0
