How @twitterhadoop
chose Google Cloud
Joep Rottinghuis & Lohit VijayaRenu
Twitter Hadoop Team (@twitterhadoop)
1
1. Twitter infrastructure
2. Hadoop evaluation
3. Evaluation outcomes
4. Recommendations and conclusions
5. Q&A
Credit to presentation at GoogleNext 2019 by Derek Lyon & Dave
Beckett (https://youtu.be/4FLFcWgZdo4) 2
Twitter Infrastructure
3
Twitter’s infrastructure
● Twitter founded in 2006
● Global-scale application
● Unique scale and performance characteristics
● Real-time
● Built to purpose and well optimized
● Large data centers
4
Strategic questions
1. What is the long-term mix of cloud versus
datacenter?
2. Which cloud provider(s) should we use?
3. How can we be confident in this type of
decision?
4. Why should we evaluate this now (2016)?
5
Tactical questions
1. What is the feasibility and cost of large-scale
adoption?
2. Which workloads are best-suited for the cloud
and are they separable?
3. How would our architecture change on the
cloud?
4. How do we get to an actionable plan?
6
Evaluation process
● Started evaluation in 2016
● Were able to make a patient, rigorous
decision
● Defined baseline workload requirements
● Engaged major providers
● Analyzed clouds for each major workload
● Built overall cloud plan
● Iterated and optimized choices
7
Evaluation Timeline
(Timeline graphic spanning "Considering" to "Moving", with markers at June '16, Sept '16, Mar '17, July '17, Nov '17, Jan '18, Apr '18, and June '18)
● June '16: Initial Cloud RfP release; 27 synthetic PoCs on GCP begin; testing projects / network established
● PoCs completed & results delivered
● Legal agreement with T&Cs ratified
● Kickoff of Dataproc, BigQuery, Dataflow experimentation
● Security and platform review
● v1 Hadoop on GCP architecture ratified
● Begin build of migration plan
● Consensus built with Product, Revenue, Eng
● Migration kickoff
● Proposal to migrate Hadoop to GCP formally accepted
8
Built overall cloud plan
● Created a series of candidate architectures
for each platform with their resource
requirements
● Developed a migration project plan &
timeline
● Created financial projections
● Along with other business considerations
9
Financial modeling
● 10-year time horizon to avoid timing artifacts
● Compared on premise and multiple cloud
scenarios
● Costs of migration and long-term
● Long-term price/performance curves
(e.g. Moore’s Law, historical pricing)
● Two independent models to avoid model
errors
10
What we found
● An immediate all-in migration at Twitter scale is expensive, distracting, and risky
● More value from new architectures and transformation, so start smaller and learn as we go
● Hadoop offered several important, specific benefits with lower risk
● We gained confidence in our investments in both cloud projects and data centers
11
Hadoop@Twitter scale
● >1.4T messages per day
● >500K compute cores
● >300PB logical storage
● >12,500 peak cluster size
12
Twitter Hadoop cluster types
Type       | Use                                                           | Compute %
Real-time  | Critical-performance production jobs with dedicated capacity | 10%
Processing | Regularly scheduled production jobs with dedicated capacity  | 60%
Ad-hoc     | One-off / ad-hoc queries and analysis                         | 30%
Cold       | Dense storage clusters, not for compute                       | minimal
13
Twitter Hadoop challenges
1. Scaling: Significant YoY Compute & Storage growth
2. Hardware: Designing, building, maintaining & operating
3. Capacity Planning: Especially hard to predict for ad-hoc workloads
4. Agility: Must respond fast, especially for ad-hoc compute
5. Deployment: Must deploy at scale and in-flight
6. Network: Both cross-DC and cross-cluster
7. Disaster Recovery: Durable copies needed in 2+ DCs
14
Twitter Hadoop requirements
● Network sustained bandwidth per core
● Disk (data) sustained bandwidth per core
● Large sequential reads & writes
● Throughput not latency
● Capacity
● CPU / RAM not usually the bottleneck
● Consistency of datasets (set of HDFS files)
15
Twitter Hadoop on premise hardware
numbers
Clusters: 10 to 10K nodes
Network: 10G moving to 25G
Data Disks: 24T-72T over 12 HDDs
CPU: 8 cores with 64G memory
I/O: Network: ~20MB/s sustained, peaks of 10x
HDFS read: 20 rq/s sustained, peaks of 3x
HDFS write: large variation, peaks of 10x
16
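To make the per-core figures above concrete, here is a small back-of-the-envelope calculation (illustrative only; the node shape and I/O rates are taken from this slide, and the NIC conversions are simple bit-to-byte arithmetic). It shows why sustained per-core bandwidth, rather than CPU, drives the hardware choice and why peaks push a node from 10G toward 25G networking.

```python
# Back-of-the-envelope I/O sizing from the per-core figures above (illustrative).
CORES_PER_NODE = 8           # on-premise node shape from this slide
NET_MB_S_PER_CORE = 20       # ~20 MB/s sustained network per core
PEAK_MULTIPLIER = 10         # peaks of ~10x sustained

sustained_node_mb_s = CORES_PER_NODE * NET_MB_S_PER_CORE        # 160 MB/s
peak_node_mb_s = sustained_node_mb_s * PEAK_MULTIPLIER          # 1600 MB/s

# A 10 Gbit/s NIC is ~1250 MB/s; a 25 Gbit/s NIC is ~3125 MB/s.
nic_10g_mb_s = 10_000 / 8
nic_25g_mb_s = 25_000 / 8

print(f"sustained per node: {sustained_node_mb_s} MB/s")
print(f"peak per node:      {peak_node_mb_s} MB/s")
print(f"10G NIC headroom at peak: {nic_10g_mb_s / peak_node_mb_s:.2f}x")
print(f"25G NIC headroom at peak: {nic_25g_mb_s / peak_node_mb_s:.2f}x")
```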
Cloud architectural options
1. Hadoop-as-a-Service (HaaS) from the cloud provider
2. Twitter Hadoop on cloud VMs
   Durable storage: cloud object store
   Scratch storage:
   a. with HDFS over cloud object store
   b. with HDFS on cloud block store
   c. with HDFS on local disks
17
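For option 2 (Twitter Hadoop on cloud VMs with durable data in the object store), the open-source Cloud Storage connector is the usual way to expose gs:// paths to Hadoop. Below is a minimal sketch of the core-site.xml properties typically involved, written as a Python dict for brevity; the property names come from the connector's public documentation, the project value is a placeholder, and none of this is Twitter's actual configuration.

```python
# Minimal sketch: core-site.xml properties typically needed for Hadoop to read
# and write gs:// paths via the open-source Cloud Storage connector.
# Values in angle brackets are placeholders, not Twitter's settings.
core_site_properties = {
    # Register the connector's FileSystem implementations for the gs:// scheme.
    "fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
    "fs.AbstractFileSystem.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS",
    # GCP project that owns the buckets being accessed.
    "fs.gs.project.id": "<your-gcp-project>",
}

def to_core_site_xml(props: dict) -> str:
    """Render the dict as a core-site.xml fragment."""
    entries = "\n".join(
        f"  <property><name>{k}</name><value>{v}</value></property>"
        for k, v in props.items()
    )
    return f"<configuration>\n{entries}\n</configuration>"

if __name__ == "__main__":
    print(to_core_site_xml(core_site_properties))
```

Scratch space (shuffle, temporary HDFS) then lives on local SSD, persistent disk, or local-disk HDFS, which is exactly the axis the benchmarks below vary.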
Testing plan
1. Baseline Tests
● TestDFSIO: low-level IO read/write
● Teragen: measure maximum write rate
● Terasort: read, shuffle, write
2. Functional Test
Gridmix: IO + Compute
● Capture of real production cluster workload (1k-5k jobs)
● Replays reads, writes, shuffles, compute
18
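The baseline tests map to standard Hadoop benchmark jobs. A rough sketch of driving such a run is below; the jar paths, file counts, and sizes are placeholders (they differ per Hadoop distribution), and the GridMix replay step is omitted because its invocation depends on the captured production trace.

```python
"""Sketch: drive the baseline benchmarks from the testing plan above.
Jar locations and sizes are placeholders and vary per Hadoop distribution."""
import subprocess

EXAMPLES_JAR = "/opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples.jar"                 # placeholder path
TEST_JAR = "/opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-tests.jar"       # placeholder path

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. TestDFSIO: low-level sequential write, then read.
run(["hadoop", "jar", TEST_JAR, "TestDFSIO", "-write", "-nrFiles", "64", "-size", "1GB"])
run(["hadoop", "jar", TEST_JAR, "TestDFSIO", "-read", "-nrFiles", "64", "-size", "1GB"])

# 2. Teragen: maximum write rate (each generated row is 100 bytes, so 10^10 rows ~ 1 TB).
run(["hadoop", "jar", EXAMPLES_JAR, "teragen", "10000000000", "/benchmarks/terasort-input"])

# 3. Terasort: read, shuffle, write.
run(["hadoop", "jar", EXAMPLES_JAR, "terasort", "/benchmarks/terasort-input", "/benchmarks/terasort-output"])

# The GridMix functional test replays a captured production trace (reads, writes,
# shuffles, compute); it is cluster- and trace-specific and not shown here.
```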
HDFS configurations tested
Each type of Object, Block, and Local Storage
Availability
● Critical data: 2 regions
● Other data: 2 zones
Dataset consistency: test cloud provider choices:
1. object store
2. object store with external consistency service
19
Hadoop Evaluation
20
GCP HaaS: DataProc config
● Hadoop 2.7.2
● Performance tests with 800 vCPUs:
○ 100 x n1-standard-8 (8 vCPU, 30G memory)
○ 200 x n1-standard-4 (4 vCPU, 15G memory)
● Scale test with 8000 vCPUs:
○ 1000 x n1-standard-8 (8 vCPU, 30G memory)
● Modeled average CPU and average to peak CPU.
● No preemptible instances in initial work
● Similar to on premise hardware SKUs
21
Decided to use DataProc for evaluation.
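For reference, a test cluster shaped like the 100 x n1-standard-8 configuration above could be created with the gcloud CLI roughly as follows. This is a sketch, not the configuration Twitter used: the cluster name, project, region, and image version are placeholders, and the local-SSD flag corresponds to the Local SSD scratch variant in the results below.

```python
"""Sketch: create a Dataproc test cluster shaped like the 100 x n1-standard-8 run.
All names, the region, and the image version are placeholders."""
import subprocess

subprocess.run([
    "gcloud", "dataproc", "clusters", "create", "hadoop-eval",   # placeholder cluster name
    "--project", "my-test-project",                              # placeholder project
    "--region", "us-central1",                                   # placeholder region
    "--master-machine-type", "n1-standard-8",
    "--num-workers", "100",
    "--worker-machine-type", "n1-standard-8",
    "--num-worker-local-ssds", "3",     # for the Local SSD scratch-storage variant
    "--image-version", "1.0",           # placeholder; pick an image carrying Hadoop 2.7.x
], check=True)
```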
DataProc 100 x n1-standard-8 Results
Durable Storage | Scratch Storage (HDFS)  | Speedup vs on premise (normalized by IO-per-core)
Cloud Storage   | Local SSD (3 x 375G SSD) | ~2x (but expensive)
Cloud Storage   | PD-HDD (1.5TB)           | ~1x
None            | PD-HDD (1.5TB)           | ~1x
Tuned Compute Engine instance types to get the optimum balance of network : cores : storage (this changes over time)
22
DataProc 200 x n1-standard-4 Results
Durable Storage | Scratch Storage (HDFS)  | Speedup vs on premise (normalized by IO-per-core)
Cloud Storage   | Local SSD (2 x 375G SSD) | ~2x (but expensive)
Cloud Storage   | PD-HDD (1.5TB)           | 1.4x
23
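The "speedup normalized by IO-per-core" column in the two tables above is the key comparison. One plausible way to compute it is sketched below, assuming the metric is aggregate benchmark throughput divided by core count, compared between the cloud and on-premise runs; this interpretation is ours and the numbers are placeholders, not the measured results.

```python
# Sketch of a "speedup normalized by IO-per-core" comparison, assuming the metric
# is benchmark throughput divided by core/vCPU count. Numbers are placeholders.
def io_per_core(total_throughput_mb_s: float, cores: int) -> float:
    return total_throughput_mb_s / cores

on_prem = io_per_core(total_throughput_mb_s=16_000, cores=800)   # placeholder
cloud   = io_per_core(total_throughput_mb_s=22_400, cores=800)   # placeholder

print(f"speedup vs on premise: {cloud / on_prem:.1f}x")          # 1.4x in this made-up example
```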
Benchmark Findings
1. Application Benchmarks
are critical
Total job time is composed of
multiple steps. We found
variation both better and worse
at each step.
Recommendation: You should
rely on an application
benchmark like GridMix rather
than micro-benchmarks.
2. Can treat network
storage like local disk
Both Cloud Storage and PD
offered nearly as much
bandwidth as typical direct
attached HDDs on premise
24
Functional Test Findings
1. Live Migration of VMs was not noticeable during Hadoop testing. It was noticeable during other Twitter platform testing of Compute Engine (a cache serving small objects at very high rps).
2. Cloud Storage checksums differ from HDFS checksums; fixed via HDFS-13056 in collaboration with Google.
3. fsync() system call on Local SSD was slow (fixed).
25
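On the checksum point: HDFS-13056 added a composite-CRC file checksum mode so that end-to-end checksums can be compared across filesystems with different block layouts. A hedged sketch of what a cross-filesystem comparison can look like is below; the paths are placeholders, the Hadoop build must carry the HDFS-13056 change, and whether the gs:// side exposes a comparable checksum depends on the Cloud Storage connector version and its settings.

```python
"""Sketch: compare file checksums across HDFS and Cloud Storage after a copy.
Relies on the composite-CRC checksum mode added in HDFS-13056; paths are placeholders."""
import subprocess

def checksum(path: str) -> str:
    # -Ddfs.checksum.combine.mode=COMPOSITE_CRC requests the block-layout-independent
    # file checksum introduced by HDFS-13056.
    out = subprocess.run(
        ["hadoop", "fs", "-Ddfs.checksum.combine.mode=COMPOSITE_CRC", "-checksum", path],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()

src = checksum("hdfs://nameservice/logs/part-00000")    # placeholder path
dst = checksum("gs://my-bucket/logs/part-00000")        # placeholder path
print("match" if src.split()[-1] == dst.split()[-1] else "MISMATCH")
```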
Evaluation Outcomes
26
Disqualified Lift-and-Shift *Everything*
Lift-and-shift everything:
+ Leads to the fastest migration
+ Limits duplication of costs during the migration period
- Introduces significant tech debt post-migration
- Requires a major rearchitecture post-migration to capture benefits of cloud
- Concerns around overall cost, risk, and distraction of this approach at Twitter scale
27
Hadoop to Cloud was Interesting
● Separable with fewer dependencies
● Standard open source software:
○ Continue to develop in house and run on premise
○ Reduces lock-in risk
● Rearchitecting is achievable
○ Not a lift-and-shift
● Data in Cloud Storage:
○ Enables broader diversity of data processing frameworks and services
● Long-term bet on Google’s Big Data ecosystem
28
Separating Hadoop Compute and Storage enables:
● Scaling the dimensions independently
● Running multiple clusters and processing frameworks over the same data
● Virtual network and project primitives that provide segmentation of access and cost structures
● Simpler deployments, upgrades, and testing, because state is preserved in Cloud Storage
● Treating storage as a commodity
29
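In practice, stateless compute means a cluster can be created, pointed at the shared gs:// data, and torn down without moving any data. Below is a rough sketch of that lifecycle; the cluster, bucket, and jar names are placeholders, and the actual Twitter deployment uses long-running clusters on Compute Engine for the ad-hoc case rather than this ephemeral Dataproc flow.

```python
"""Sketch: ephemeral compute over shared Cloud Storage data. All names are placeholders."""
import subprocess

def sh(*cmd):
    subprocess.run(cmd, check=True)

REGION, CLUSTER = "us-central1", "adhoc-scratch"    # placeholders

# 1. Bring up compute; no durable data lives on the cluster itself.
sh("gcloud", "dataproc", "clusters", "create", CLUSTER,
   "--region", REGION, "--num-workers", "50", "--worker-machine-type", "n1-standard-8")

# 2. Run a job that reads and writes the shared dataset in Cloud Storage.
sh("gcloud", "dataproc", "jobs", "submit", "hadoop",
   "--cluster", CLUSTER, "--region", REGION,
   "--jar", "gs://my-bucket/jobs/wordcount.jar",                      # placeholder jar
   "--", "gs://my-bucket/datasets/input", "gs://my-bucket/datasets/output")

# 3. Tear down compute; the dataset in gs:// is untouched.
sh("gcloud", "dataproc", "clusters", "delete", CLUSTER, "--region", REGION, "--quiet")
```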
Twitter Hadoop Rearchitected for Cloud
1. Cold Cluster
● Storage: Cloud Storage
● Compute: limited; ephemeral Dataproc is an option
● Scaling: mostly storage driven
2. Ad-Hoc Clusters
● Storage: Cloud Storage
● Compute: Compute Engine and Twitter’s build of Hadoop (long-running clusters)
● Scaling: mixture, with spiky compute
30
Twitter production Hadoop remains on premise
● Not as separable from other production workloads
● Focusing on non-production workloads limits our risk
● Regular compute-intensive usage patterns
● Benefits more from purpose built hardware
● Fewer processing frameworks are needed
31
Twitter Strategic Benefits
What does this do overall for Twitter?
● Next-generation architecture with numerous enhancements:
○ security, encryption, isolation, live migration
● Leverage Google’s capacity and R&D
● Larger ecosystem of open source & cloud software
● Long-term strategic collaboration with Google
● Beachhead that enables teams across Twitter to make tactical cloud adoption decisions
32
Twitter Functional Benefits
Infrastructure benefits
● Large-scale ad-hoc analysis and backfills
● Cloud Storage avoids HDFS limits
● Offsite backup
● Increases availability of cold data
Platform benefits
● Built-in compliance support (e.g. SOX)
● Direct chargeback using Projects
● Simplified retention
● GCP services such as BigQuery, Spanner, Cloud ML, TPUs, etc.
33
Finding: At Twitter Scale, Cloud has limits
● Cloud providers have limits for all sorts of things
and we often need them increased.
● Cloud HaaS offerings do not generally support 10K-node Hadoop clusters
● Dynamic scaling down < O(days) is not yet
feasible / cost-effective with current Hadoop at
Twitter scale
● Capacity planning with cloud providers is
encouraged for O(10K) vCPU deltas and required
for O(100K) vCPU deltas
34
What we are working on now
❏ Finalizing bucket & user creation and IAM designs
❏ Building replication, cluster deployment, and data management software
❏ Hadoop Cloud Storage connector improvements continue (open source)
❏ Retention and “directory” / dataset atomicity in GCS
Already done:
✓ Foundational network (8x100Gbps)
✓ Copy cluster
✓ Copying PBs of data to the cloud
✓ Early Presto analytics use case: up to 100K-core Dataproc cluster querying a 15PB dataset in Cloud Storage
35
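A copy-cluster workflow that moves data from on-premise HDFS into Cloud Storage is commonly built around DistCp running over the Cloud Storage connector. A minimal hedged sketch follows; the paths, mapper count, and flags are placeholders, and Twitter's actual replication and data-management software (listed in the checklist above) is custom rather than plain DistCp.

```python
"""Sketch: bulk copy from on-premise HDFS to Cloud Storage with DistCp.
Paths and tuning values are placeholders."""
import subprocess

subprocess.run([
    "hadoop", "distcp",
    "-m", "500",              # number of concurrent copy mappers (placeholder)
    "-update",                # only copy files that are missing or changed
    "hdfs://nameservice/logs/2018/06/01",        # placeholder source
    "gs://my-backup-bucket/logs/2018/06/01",     # placeholder destination
], check=True)
```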
Recommendations
and Conclusion
36
Recommendations
1. Run the most informative tests
Application-level benchmarking (e.g. GridMix). Scale testing.
2. Compare application benchmark costs
Compare the cost of running an application using benchmark results. Don’t just look at pricing pages. For example, the network is hugely important to performance.
3. Ensure the migration plan captures benefits
Lift-and-shift may not deliver value in all cases. Substantial iteration is required to balance tactical migration work with long-term strategy.
37
Conclusions
1. Separate compute and storage is a real thing
The better the network, the less locality matters. Life gets much easier when compute can be stateless. You can treat PD like direct-attached HDDs.
2. Cloud adoption is complex
Finding separable workloads can be a challenge. Architectural choices are non-obvious. Methodical evaluation is well worth the effort.
3. Very early in this process and lots more to come
We’re excited to be gaining experience with the platform and learning from everyone.
38
Thank You
Questions?
39