Fb talk arch_summit

Evolution of Big Data Architectures @ Facebook
Architecture Summit, Shenzhen, August 2012
Ashish Thusoo

About Me

• Currently Co-founder/CEO of Qubole
• Ran the Data Infrastructure Team at Facebook until 2011
• Co-founded Apache Hive @ Facebook

Outline

• Big Data @ Facebook - Scope & Scale
• Evolution of Big Data Architectures @ FB
• Qubole

Big Data @ FB (2011): Scale

• 25 PB of compressed data ~ 150 PB of uncompressed data
• 400 TB/day (uncompressed) of new data
• 1 new job every second
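
As a rough back-of-the-envelope check on these figures, the sketch below derives the implied compression ratio, an estimate of the compressed daily ingest, and the number of jobs per day. It assumes that newly arriving data compresses at the same ratio as the stored data, which the slides do not state.

```python
# Back-of-the-envelope arithmetic on the 2011 figures quoted on this slide.
COMPRESSED_PB = 25            # stored, compressed
UNCOMPRESSED_PB = 150         # same data, uncompressed
NEW_DATA_TB_PER_DAY = 400     # uncompressed
JOBS_PER_SECOND = 1

# Implied compression ratio of the warehouse.
compression_ratio = UNCOMPRESSED_PB / COMPRESSED_PB

# Assumption (not from the slides): new data compresses like existing data.
compressed_ingest_tb_per_day = NEW_DATA_TB_PER_DAY / compression_ratio

# One new job per second, sustained over a full day.
jobs_per_day = JOBS_PER_SECOND * 24 * 60 * 60

print(f"compression ratio: {compression_ratio:.1f}x")
print(f"compressed ingest: {compressed_ingest_tb_per_day:.1f} TB/day (estimate)")
print(f"jobs per day: {jobs_per_day}")
```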

Big Data @ FB: Scope

• Simple reporting
• Model generation
• Adhoc analysis + data science
• Index generation
• Many many others...

A/B Testing: Email #2 is 3x Better

Evolution: 2007-2011
DW Size in TB (bar chart)
• 2007: 15
• 2008: 250
• 2009: 800
• 2010: 8000
• 2011: 25000

2007: Traditional EDW

[Diagram: Web Clusters, Scribe Mid-Tier, NAS Filers, Summarization Cluster, MySQL Clusters, RDBMS Data Warehouse]

2007: Pain Points
- compute close to storage (early map/reduce)
- daily ETL > 24 hours
- Lots of tuning/indexes etc.
- Lots of hardware planning

[Diagram: Web Clusters, Scribe Mid-Tier, NAS Filers, MySQL Clusters, RDBMS Data Warehouse]

2007: Limitations
• Most use cases were in business metrics; data science, model building, etc. were not possible
• Only summary data was stored online; details were archived away

2008: Move to Hadoop

[Diagram: Web Clusters, Scribe Mid-Tier, NAS Filers, MySQL Clusters, RDBMS Data Warehouse]

2008: Move to Hadoop

[Diagram: Web Clusters, Scribe Mid-Tier, NAS Filers, MySQL Clusters, Batch copier/loaders, Hadoop/Hive Data Warehouse, RDBMS Data Mart]

2008: Immediate Pros
• Data science at scale became possible
• For the first time all of the instrumented data could be held online
• Use cases expanded

2009: Democratizing Data

[Diagram: Web Clusters, Scribe Mid-Tier, NAS Filers, MySQL Clusters, RDBMS Data Mart]

2009: Democratizing Data
• Databee & Chronos: Data Pipeline Framework
• Nectar: instrumentation & schema aware data collection
• HiPal: Ad hoc Queries + Data Discovery
• Scrapes: Configuration Driven

[Diagram: Hadoop/Hive Data Warehouse with the callouts above]

2009: Democratizing Data (Nectar)
• Typical Nectar Pipeline
• Simple schema evolution built in
• JSON encoded short term data
• Decomposing JSON for long term storage
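
The last bullets can be illustrated with a small sketch. This is not Nectar's actual code: the function names (`flatten`, `decompose`), the dotted column naming, and the sample events are all hypothetical. It shows JSON-encoded log lines being decomposed into flat columns, with the column set growing whenever a new field appears, which is one simple form of schema evolution.

```python
import json

def flatten(obj, prefix=""):
    """Flatten nested dicts into dotted column names (hypothetical convention)."""
    columns = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            columns.update(flatten(value, prefix=name + "."))
        else:
            columns[name] = value
    return columns

def decompose(json_lines):
    """Turn JSON-encoded events into (schema, rows) for column-oriented storage."""
    schema = []        # ordered column names; columns are only ever added
    flat_events = []
    for line in json_lines:
        flat = flatten(json.loads(line))
        for name in flat:
            if name not in schema:
                schema.append(name)   # schema evolution: new field -> new column
        flat_events.append(flat)
    # Events written before a column existed get None (NULL) for it.
    rows = [[event.get(name) for name in schema] for event in flat_events]
    return schema, rows

# Short-term data: one JSON document per logged event (invented examples).
short_term = [
    '{"event": "click", "user": {"id": 1}}',
    '{"event": "view", "user": {"id": 2}, "page": "home"}',
]

schema, rows = decompose(short_term)
print(schema)
for row in rows:
    print(row)
```

Keeping the short-term data as JSON lets producers add fields without coordination, while the decomposed form is better suited to columnar storage and querying.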

2009: Democratizing Data (Tools)
• HiPal - data discovery and query authoring
• Charting and dashboard generation tools

2009: Democratizing Data (Tools)

• Databee: Workflow language
• Chronos: Scheduling tool

2009: Cons of Democratization
• Isolation to protect against bad jobs
• Fair sharing of the cluster: what is a high-priority job, and how to enforce it

2010: Controlling Chaos
• Isolation
• Reducing operational overhead
• Better resource utilization
• Measurement, ownership, accountability

2010: Isolation

[Diagram: Web Clusters, Scribe Mid-Tier, NAS Filers, MySQL Clusters]

2010: Isolation

[Diagram: Web Clusters, Scribe Mid-Tier, NAS Filers, MySQL Clusters, Platinum Warehouse, Hive Replication, Silver Warehouse]

2010: Ops Efficiency

• ptail: parallel tail on HDFS

[Diagram: Web Clusters, Scribe HDFS, near real time data consumers, MySQL Clusters, Platinum Warehouse, Hive Replication, Silver Warehouse]

2010: Resource Utilization (Disk)

• HDFS-RAID: from 3 replicas to 2.2 replicas
• RCFile: Row columnar format for compressing Hive tables
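
The 2.2 figure for HDFS-RAID can be reproduced with simple arithmetic. The slide does not give the parameters; the sketch below assumes an XOR-style scheme with a stripe of 10 data blocks, one parity block per stripe, and both data and parity blocks kept at 2 replicas, which is one configuration that yields 2.2. The function name is illustrative.

```python
from fractions import Fraction

def effective_replication(stripe_len, data_replicas, parity_blocks, parity_replicas):
    """Physical blocks stored per logical data block, for one stripe."""
    physical = stripe_len * data_replicas + parity_blocks * parity_replicas
    return Fraction(physical, stripe_len)

# Baseline: plain 3-way replication, no parity blocks.
plain = effective_replication(stripe_len=10, data_replicas=3,
                              parity_blocks=0, parity_replicas=0)

# Assumed RAID configuration (not stated on the slide).
raid = effective_replication(stripe_len=10, data_replicas=2,
                             parity_blocks=1, parity_replicas=2)

# Fraction of raw disk saved relative to the 3-replica baseline.
saved = 1 - raid / plain

print(f"plain replication: {float(plain)}")
print(f"with parity:       {float(raid)}")
print(f"disk saved:        {float(saved):.1%}")
```

Under these assumptions the raw footprint drops by roughly a quarter, at the cost of reconstructing a block from its stripe when both replicas are lost.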

2010: Resource Utilization (CPU)
• Continuous copier/loaders
• Incremental scrapes
• Hive optimizations to save CPU
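
The idea behind incremental scrapes can be sketched as follows: instead of re-copying a whole source table on every run, only rows modified since the previous run are pulled. This is not Facebook's implementation; the `updated_at` column, the watermark value, and the in-memory "table" are assumptions made for illustration.

```python
def incremental_scrape(source_rows, watermark):
    """Return (rows changed after `watermark`, new watermark)."""
    changed = [row for row in source_rows if row["updated_at"] > watermark]
    # If nothing changed, keep the old watermark.
    new_watermark = max((row["updated_at"] for row in changed), default=watermark)
    return changed, new_watermark

# Hypothetical source table with a last-modified timestamp per row.
source = [
    {"id": 1, "updated_at": 100},
    {"id": 2, "updated_at": 205},
    {"id": 3, "updated_at": 310},
]

# First run after watermark 200 picks up only rows 2 and 3.
batch, watermark = incremental_scrape(source, watermark=200)
print([row["id"] for row in batch], watermark)

# A second run with no new changes copies nothing.
batch2, watermark2 = incremental_scrape(source, watermark)
print([row["id"] for row in batch2], watermark2)
```

The CPU saving comes from the second case: a run that finds no changes does almost no work, whereas a full scrape would re-read and re-load every row.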

2010: Monitoring (SLAs)

• Per job statistics rolled up to owner/group/team
• Expected time of arrival vs actual time of arrival of data
• Simple data quality metrics
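
A minimal sketch of the first two bullets: per-job statistics rolled up to an owner, and expected versus actual arrival time of data. The record fields, job names, owners, and numbers are invented for illustration and do not come from the talk.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical per-job statistics; field names and values are illustrative only.
jobs = [
    {"job": "daily_metrics", "owner": "growth", "cpu_hours": 120.0,
     "expected": datetime(2010, 6, 1, 6, 0), "actual": datetime(2010, 6, 1, 6, 45)},
    {"job": "spam_model", "owner": "integrity", "cpu_hours": 300.0,
     "expected": datetime(2010, 6, 1, 8, 0), "actual": datetime(2010, 6, 1, 7, 50)},
    {"job": "email_ab_test", "owner": "growth", "cpu_hours": 15.5,
     "expected": datetime(2010, 6, 1, 9, 0), "actual": datetime(2010, 6, 1, 9, 0)},
]

def delay_minutes(job):
    """Minutes by which the data arrived after its expected time (0 if on time)."""
    return max(0.0, (job["actual"] - job["expected"]).total_seconds() / 60)

def rollup_by_owner(jobs):
    """Aggregate job count, CPU hours, and lateness per owner."""
    totals = defaultdict(lambda: {"jobs": 0, "cpu_hours": 0.0,
                                  "late_jobs": 0, "delay_min": 0.0})
    for job in jobs:
        stats = totals[job["owner"]]
        delay = delay_minutes(job)
        stats["jobs"] += 1
        stats["cpu_hours"] += job["cpu_hours"]
        stats["late_jobs"] += 1 if delay > 0 else 0
        stats["delay_min"] += delay
    return dict(totals)

report = rollup_by_owner(jobs)
for owner, stats in sorted(report.items()):
    print(owner, stats)
```

The same roll-up could be keyed by group or team instead of owner; the point is that usage and lateness are attributed to someone accountable.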

2011: New Requirements

• More real time requirements for aggregations
• Optimizing resource utilization

2011: Beyond Hadoop

• Puma for real time analytics
• Peregrine for simple and fast queries

2011: Puma

• ptail: parallel tail on HDFS

[Diagram: Web Clusters, Scribe HDFS, near real time data consumers, MySQL Clusters, Platinum Warehouse, Hive Replication, Silver Warehouse]

2011: Puma

[Diagram: Scribe HDFS, ptail (parallel tail on HDFS), Puma Clusters, HBase Cluster]

Some takeaways
• Operating and optimizing Data Infrastructure is a hard problem
• Lots of components: log collection, storage, compute, query processing, tools and interfaces
• Lots of choices within each part of the stack

Qubole
• Mission:
• Data Infrastructure in the Cloud made Easy, Fast and Reliable
• We take care of operating and optimizing this infrastructure so that you can focus on your data, analysis, algorithms and building your data apps

Qubole - Information
• Early Trial (by invitation):
• www.qubole.com
• Come talk to us to join a small and passionate team
• jobs@qubole.com
• Follow us on twitter/facebook/linkedin
