Evolution of Big Data Architectures @ Facebook
Architecture Summit, Shenzhen, August 2012
Ashish Thusoo
About Me

• Currently Co-founder/CEO of Qubole
• Ran the Data Infrastructure Team at
  Facebook till 2011
• Co-founded Apache Hive @ Facebook
Outline

• Big Data @ Facebook - Scope & Scale
• Evolution of Big Data Architectures @ FB
• Qubole
Big Data @ FB (2011): Scale

• 25 PB of compressed data ~ 150 PB of
  uncompressed data
• 400 TB/day (uncompressed) of new data
• 1 new job every second
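
The figures above imply a few derived numbers. The sketch below works them out; it assumes decimal units (1 PB = 1000 TB), which the slide does not state, and the function names are illustrative only.

```python
# Back-of-the-envelope arithmetic on the 2011 scale figures from the slide.
# Assumption: decimal units (1 PB = 1000 TB); the slide does not say.

COMPRESSED_PB = 25
UNCOMPRESSED_PB = 150
NEW_DATA_TB_PER_DAY = 400  # uncompressed
JOBS_PER_SECOND = 1


def compression_ratio(uncompressed, compressed):
    """How many uncompressed bytes each stored byte represents."""
    return uncompressed / compressed


def jobs_per_day(jobs_per_second):
    """Jobs submitted in 24 hours at a steady per-second rate."""
    return jobs_per_second * 24 * 60 * 60


def days_to_add(total_pb, tb_per_day):
    """Days of ingest at a fixed daily rate needed to add total_pb of data."""
    return total_pb * 1000 / tb_per_day


print(compression_ratio(UNCOMPRESSED_PB, COMPRESSED_PB))  # 6.0
print(jobs_per_day(JOBS_PER_SECOND))                      # 86400
print(days_to_add(UNCOMPRESSED_PB, NEW_DATA_TB_PER_DAY))  # 375.0
```

In other words, roughly a 6x compression ratio, about 86,400 jobs a day, and an ingest rate that would add the equivalent of the whole 2011 warehouse in about a year.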
Big Data @ FB: Scope

• Simple reporting
• Model generation
• Ad hoc analysis + data science
• Index generation
• Many many others...
A/B Testing Email #1
A/B Testing Email #2
A/B Testing Email #2 is 3x Better
Evolution: 2007-2011
[Chart: DW Size in TB by year]
• 2007: 15
• 2008: 250
• 2009: 800
• 2010: 8000
• 2011: 25000
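
To make the growth on this chart concrete, here is a small sketch that computes year-over-year growth factors from the warehouse sizes shown (15, 250, 800, 8000 and 25000 TB for 2007 through 2011). The helper name is illustrative, not something from the talk.

```python
# Warehouse size in TB per year, as read off the chart on this slide.
DW_SIZE_TB = {2007: 15, 2008: 250, 2009: 800, 2010: 8000, 2011: 25000}


def yearly_growth(sizes):
    """Map each year to its size divided by the previous year's size."""
    years = sorted(sizes)
    return {year: sizes[year] / sizes[prev] for prev, year in zip(years, years[1:])}


for year, factor in yearly_growth(DW_SIZE_TB).items():
    print(f"{year}: {factor:.1f}x the previous year")

# Overall growth from the first to the last year on the chart.
overall = DW_SIZE_TB[2011] / DW_SIZE_TB[2007]
print(f"2007-2011: about {overall:.0f}x")
```

The largest jumps fall in 2008 and 2010, the years the deck associates with the move to Hadoop and with the isolation and efficiency work.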
2007: Traditional EDW
[Diagram: Web Clusters, Scribe Mid-Tier, NAS Filers, Summarization Cluster, MySQL Clusters, RDBMS Data Warehouse]
2007: Pain Points
[Diagram: Web Clusters, Scribe Mid-Tier, NAS Filers, Summarization Cluster, MySQL Clusters, RDBMS Data Warehouse]
Diagram annotations:
- compute close to storage (early map/reduce)
- daily ETL > 24 hours
- Lots of tuning/indexes etc.
- Lots of hardware planning
2007: Limitations
• Most use cases were
  in business metrics -
  data science, model
  building etc. not
  possible
• Only summary data
  was stored online -
  details archived away
2008: Move to Hadoop
[Diagram: Web Clusters, Scribe Mid-Tier, NAS Filers, Summarization Cluster, MySQL Clusters, RDBMS Data Warehouse]
2008: Move to Hadoop
[Diagram: Web Clusters, Scribe Mid-Tier, NAS Filers, Batch copier/loaders, MySQL Clusters, Hadoop/Hive Data Warehouse, RDBMS Data Mart]
2008: Immediate Pros
• Data science at
  scale became
  possible
• For the first time all
  of the instrumented
  data could be held
  online
• Use cases expanded
2009: Democratizing Data
[Diagram: Web Clusters, Scribe Mid-Tier, NAS Filers, MySQL Clusters, Hadoop/Hive Data Warehouse, RDBMS Data Mart]
2009: Democratizing Data
[Diagram: tools surrounding the Hadoop/Hive Data Warehouse]
• Databee & Chronos: Data Pipeline Framework
• Nectar: instrumentation & schema aware data collection
• HiPal: Adhoc Queries + Data Discovery
• Scrapes: Configuration Driven
2009: Democratizing Data (Nectar)
• Typical Nectar Pipeline
 • Simple schema evolution
    built in
 • JSON encoded short term
    data
 • decomposing JSON for
    long term storage
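
The Nectar bullets above describe JSON-encoded events that are later decomposed into columns, with simple schema evolution. The sketch below illustrates that idea only; it is not Nectar's actual implementation, and the function names, the add-only evolution rule and the sample events are all assumptions made for illustration.

```python
import json


def evolve_schema(schema, record):
    """Add-only schema evolution (an assumed rule): append unseen fields."""
    for key in record:
        if key not in schema:
            schema.append(key)
    return schema


def decompose(json_lines):
    """Turn JSON-encoded events into (schema, rows), one column per field.

    Fields missing from an older event are filled with None, so events
    logged before a field existed still fit the evolved schema.
    """
    records = [json.loads(line) for line in json_lines]
    schema = []
    for record in records:
        evolve_schema(schema, record)
    rows = [tuple(record.get(column) for column in schema) for record in records]
    return schema, rows


# Hypothetical short-term log lines; the second event adds a "page" field.
events = [
    '{"event": "click", "user_id": 1}',
    '{"event": "view", "user_id": 2, "page": "home"}',
]
schema, rows = decompose(events)
print(schema)  # ['event', 'user_id', 'page']
print(rows)    # [('click', 1, None), ('view', 2, 'home')]
```

Keeping the short-term data as JSON lets producers add fields without coordination, while the decomposed form gives the warehouse fixed columns to query.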
2009: Democratizing Data (Tools)
• HiPal - data discovery
  and query authoring
• Charting and
  dashboard generation
  tools
2009: Democratizing Data (Tools)

• Databee: Workflow
  language
• Chronos: Scheduling
  tool
2009: Cons of Democratization
• Isolation to protect
  against Bad Jobs
• Fair sharing of the
  cluster - what is a
  high priority job
  and how to enforce
  it
2010: Controlling Chaos
• Isolation
• Reducing operational overhead
• Better resource utilization
• Measurement, ownership, accountability
2010: Isolation
[Diagram: Web Clusters, Scribe Mid-Tier, NAS Filers, MySQL Clusters, Hadoop/Hive Data Warehouse]
2010: Isolation
[Diagram: Web Clusters, Scribe Mid-Tier, NAS Filers, MySQL Clusters, Platinum Warehouse, Hive Replication, Silver Warehouse]
2010: Ops Efficiency
[Diagram: Web Clusters, Scribe HDFS, ptail (parallel tail on hdfs), near real time data consumers, MySQL Clusters, Platinum Warehouse, Hive Replication, Silver Warehouse]
2010: Resource Utilization (Disk)

•   HDFS-RAID: from 3
    replicas to 2.2 replicas

•   RCFile: Row columnar
    format for compressing
    Hive tables
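
The "3 replicas to 2.2 replicas" figure can be illustrated with XOR parity. The sketch below assumes one parity block per stripe of 10 data blocks, with both data and parity kept at 2 replicas; that configuration is a commonly described way to reach 2.2 and is an assumption here, since the slide gives only the before and after numbers. All function names are illustrative.

```python
from functools import reduce


def xor_parity(blocks):
    """XOR equal-length blocks byte by byte to produce one parity block."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def recover(surviving_blocks, parity):
    """Rebuild the single missing block of a stripe from survivors + parity."""
    return xor_parity(list(surviving_blocks) + [parity])


def effective_replication(data_replicas, parity_replicas, stripe_length):
    """Stored bytes per logical byte with one parity block per stripe."""
    return data_replicas + parity_replicas / stripe_length


# A toy stripe of three 4-byte blocks (real stripes and blocks are far larger).
stripe = [b"abcd", b"efgh", b"ijkl"]
parity = xor_parity(stripe)
assert recover([stripe[0], stripe[2]], parity) == stripe[1]

# Assumed configuration: 2 data replicas, 2 parity replicas, stripe of 10.
raid = effective_replication(2, 2, 10)
print(f"effective replication: {raid:.1f}")      # effective replication: 2.2
print(f"disk saved vs 3 replicas: {1 - raid / 3:.0%}")
```

Under these assumptions, losing a block no longer needs a third full copy: the missing block is recomputed from the rest of the stripe, which is where the disk saving comes from.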
2010: Resource Utilization (CPU)
•   Continuous copier/
    loaders

•   Incremental scrapes

•   Hive optimizations to
    save CPU
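
The slide does not say how incremental scrapes were implemented. One common way to avoid re-copying a whole source table is a watermark on a modification timestamp, sketched below; the `updated_at` column, the row layout and the function name are hypothetical.

```python
def incremental_scrape(rows, last_watermark):
    """Return rows changed since last_watermark, plus the new watermark.

    Assumes every row carries an `updated_at` value that increases whenever
    the row is modified (a hypothetical column, not from the talk).
    """
    changed = [row for row in rows if row["updated_at"] > last_watermark]
    new_watermark = max(
        (row["updated_at"] for row in changed), default=last_watermark
    )
    return changed, new_watermark


# Hypothetical source table with integer modification times.
table = [
    {"id": 1, "updated_at": 100},
    {"id": 2, "updated_at": 205},
    {"id": 3, "updated_at": 310},
]

# First run after watermark 200 copies only rows 2 and 3.
changed, watermark = incremental_scrape(table, last_watermark=200)
print([row["id"] for row in changed], watermark)  # [2, 3] 310

# A second run with nothing new copies nothing and keeps the watermark.
changed, watermark = incremental_scrape(table, last_watermark=watermark)
print(changed, watermark)  # [] 310
```

The CPU saving comes from the work scaling with the number of changed rows rather than with the size of the source table.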
2010: Monitoring (SLAs)

•   Per job statistics rolled
    up to owner/group/team

•   Expected time of arrival
    vs Actual time of arrival
    of data

•   Simple data quality
    metrics
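
The first two monitoring bullets can be sketched as a small roll-up: compare expected and actual arrival times per job, then aggregate by owner. The record fields (`owner`, `cpu_seconds`, `expected`, `actual`) and the sample jobs are hypothetical; the talk does not describe the actual schema.

```python
from collections import defaultdict
from datetime import datetime


def lateness_minutes(expected, actual):
    """Minutes by which data arrived after its expected time (0 if on time)."""
    return max(0.0, (actual - expected).total_seconds() / 60)


def rollup_by_owner(jobs):
    """Sum CPU seconds and count late arrivals per owner."""
    totals = defaultdict(lambda: {"cpu_seconds": 0, "late": 0})
    for job in jobs:
        entry = totals[job["owner"]]
        entry["cpu_seconds"] += job["cpu_seconds"]
        if lateness_minutes(job["expected"], job["actual"]) > 0:
            entry["late"] += 1
    return dict(totals)


# Hypothetical per-job statistics for one day.
jobs = [
    {"owner": "ads", "cpu_seconds": 1200,
     "expected": datetime(2010, 6, 1, 6, 0), "actual": datetime(2010, 6, 1, 6, 45)},
    {"owner": "ads", "cpu_seconds": 300,
     "expected": datetime(2010, 6, 1, 7, 0), "actual": datetime(2010, 6, 1, 6, 50)},
    {"owner": "growth", "cpu_seconds": 900,
     "expected": datetime(2010, 6, 1, 8, 0), "actual": datetime(2010, 6, 1, 8, 0)},
]

for owner, stats in rollup_by_owner(jobs).items():
    print(owner, stats)
```

Rolling statistics up to an owner, group or team is what turns raw job counters into the ownership and accountability the 2010 goals call for.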
2011: New Requirements

• More real time requirements for
  aggregations
• Optimizing resource utilization
2011: Beyond Hadoop


• Puma for real time analytics
• Peregrine for simple and fast queries
2011: Puma
[Diagram: Web Clusters, Scribe HDFS, ptail (parallel tail on hdfs), near real time data consumers, MySQL Clusters, Platinum Warehouse, Hive Replication, Silver Warehouse]
2011: Puma
[Diagram: Scribe HDFS, ptail (parallel tail on hdfs), Puma Clusters, HBase Cluster]
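
The diagram shows log data tailed from Scribe HDFS into Puma clusters, with results kept in HBase. As a rough illustration of the kind of real time aggregation this enables, the sketch below counts events per time bucket as they stream in. It is not Puma's design: the one-minute bucket width, the event shape and the in-memory `Counter` standing in for the HBase store are all assumptions.

```python
from collections import Counter


def bucket(ts_seconds, width=60):
    """Start of the time bucket (in seconds) that contains ts_seconds."""
    return ts_seconds - ts_seconds % width


def aggregate(events, store=None, width=60):
    """Increment a per-(bucket, key) counter for each streamed event.

    `events` yields (timestamp_seconds, key) pairs as they are tailed from
    the log; `store` is a Counter standing in for a persistent store.
    """
    store = Counter() if store is None else store
    for ts, key in events:
        store[(bucket(ts, width), key)] += 1
    return store


# Hypothetical tailed events: (timestamp in seconds, event type).
stream = [(0, "like"), (59, "like"), (60, "like"), (61, "share")]
counts = aggregate(stream)
print(counts[(0, "like")])    # 2
print(counts[(60, "like")])   # 1
print(counts[(60, "share")])  # 1
```

Because counters are updated as events arrive, aggregates are available within the bucket width instead of after a nightly batch job.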
Some takeaways
• Operating and optimizing Data
  Infrastructure is a hard problem
 • Lots of components from log collection,
    storage, compute, query processing, tools
    and interfaces
 • Lots of choices within each part of the
    stack
Qubole
• Mission:
 • Data Infrastructure in the Cloud made
    Easy, Fast and Reliable
 • We take care of operating and optimizing
    this infrastructure so that you can focus
    on your data, analysis, algorithms and
    building your data apps
Qubole - Information
• Early Trial (by invitation):
 • www.qubole.com
• Come talk to us to join a small and
  passionate team
  • jobs@qubole.com
• Follow us on twitter/facebook/linkedin