DA300:
How Twitter Replicates
Petabytes of Data to
Google Cloud Storage
Lohit VijayaRenu, Twitter
@lohitvijayarenu
Agenda
Describe Twitter's Data Replicator architecture, present our solution
for extending it to Google Cloud Storage, and maintain a consistent
interface for users.
Tweet questions: #GoogleNext19Twitter
Twitter DataCenter
Data Infrastructure for Analytics
Diagram: Log Pipeline and Micro Services feed Incoming Storage, which in
turn feeds the Real Time Cluster (streaming systems), Production Cluster,
Ad hoc Cluster, and Cold Storage.
● Data : generate > 1.5 trillion events every day
● Incoming Storage : produce > 4 PB per day
● Production jobs : process hundreds of PB per day
● Ad hoc queries : execute tens of thousands of jobs per day
● Cold/Backup : hundreds of PBs of data
Data Infrastructure for Analytics
Diagram: each Hadoop Cluster runs its own Replication Service and
Retention Service; a shared Data Access Layer sits between the clusters.
Data Access Layer
● Dataset has a logical name and one or more physical locations
● Users and tools such as Scalding, Presto, and Hive query DAL for
available hourly partitions
● Dataset has hourly/daily partitions in DAL
● DAL also stores properties such as owner, schema, and location with
datasets
* https://blog.twitter.com/engineering/en_us/topics/insights/2016/discovery-and-consumption-of-analytics-data-at-twitter.html
FileSystem abstraction
Path on HDFS cluster : hdfs://cluster-X-nn:8020/logs/partly-cloudy
Path on Federated HDFS cluster : viewfs://cluster-X/logs/partly-cloudy
Path on Twitter's HDFS Clusters* : /DataCenter-1/cluster-X/logs/partly-cloudy
Diagram: Twitter's View FileSystem spans Cluster-X and Cluster-Y in
DataCenter-1 and Cluster-Z in DataCenter-2, each cluster federated across
one or more namespaces; the Replicator works on top of this global view.
* https://blog.twitter.com/engineering/en_us/a/2015/hadoop-filesystem-at-twitter.html
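The global paths above come from ViewFileSystem mount tables. A minimal
sketch using Hadoop's standard fs.viewfs.mounttable link syntax; the mount
table name and namenode addresses here are illustrative, not Twitter's
actual configuration:
<property>
  <name>fs.viewfs.mounttable.global.link./DataCenter-1/cluster-X</name>
  <value>hdfs://cluster-X-nn:8020/</value>
</property>
<property>
  <name>fs.viewfs.mounttable.global.link./DataCenter-2/cluster-Z</name>
  <value>hdfs://cluster-Z-nn:8020/</value>
</property>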
Replicator
Need for Replication
Diagram: Hadoop clusters (ClusterX-1, ClusterX-2, ClusterL, ClusterM,
ClusterN, ClusterC, ClusterZ) spread across DataCenter 1 and DataCenter 2,
with datasets replicated between them.
● Thousands of datasets configured for replication
● Across tens of different clusters
● Data kept in sync hourly/daily/snapshot
● Fault tolerant
Data Replicator
● Replicator per destination
● 1 : 1 Copy from source to destination
● N : 1 Copy + Merge from multiple sources to destination
● Publish to DAL upon completion
Diagram: Copy pulls from one Source Cluster into the Destination Cluster;
Copy + Merge pulls from multiple Source Clusters into the Destination
Cluster.
Replication setup
Dataset : partly-cloudy
Src Cluster : ClusterX
Src path : /logs/partly-cloudy
Dest Cluster : ClusterY
Dest path : /logs/partly-cloudy
Copy Since : 3 days
Owner : hadoop-team
The Replicator registers both locations with the Data Access Layer:
Dataset : partly-cloudy
/ClusterX/logs/partly-cloudy
/ClusterY/logs/partly-cloudy
Data Replicator Copy
Source Cluster : /ClusterX/logs/partly-cloudy/2019/04/10/03
Replicator (ClusterY) runs Distcp for partition 2019/04/10/03
Destination Cluster : /ClusterY/logs/partly-cloudy/2019/04/10/03
DAL entry for dataset partly-cloudy :
/ClusterX/logs/partly-cloudy
/ClusterY/logs/partly-cloudy
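The hourly copy is an ordinary Hadoop DistCp job over View FileSystem
paths. A minimal sketch of the equivalent command, assuming standard
DistCp options rather than Twitter's exact job configuration:
# Pull one hourly partition from the source cluster to the destination
# cluster; -update makes re-runs skip files that already match.
hadoop distcp -update \
  /ClusterX/logs/partly-cloudy/2019/04/10/03 \
  /ClusterY/logs/partly-cloudy/2019/04/10/03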
Data Replicator Copy + Merge
Source Cluster : /ClusterX-1/logs/partly-cloudy/2019/04/10/03
Source Cluster : /ClusterX-2/logs/partly-cloudy/2019/04/10/03
Replicator (ClusterY) runs one Distcp per source for partition
2019/04/10/03, then merges the results.
Destination Cluster : /ClusterY/logs/partly-cloudy/2019/04/10/03
DAL entry for dataset partly-cloudy (Type : Multiple Src) :
/ClusterX-1/logs/partly-cloudy
/ClusterX-2/logs/partly-cloudy
/ClusterY/logs/partly-cloudy
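Per the speaker notes, each DistCp lands in a temporary location and the
merged dataset is moved into place with a cheap, atomic HDFS rename. A
sketch of that sequence; the .tmp staging layout is an assumption for
illustration, not the production path scheme:
# Hypothetical staging layout: both sources land in one staged partition
# (log file names are unique per source, so the copies do not collide).
hadoop distcp -update /ClusterX-1/logs/partly-cloudy/2019/04/10/03 \
  /ClusterY/logs/.tmp/partly-cloudy/2019/04/10/03
hadoop distcp -update /ClusterX-2/logs/partly-cloudy/2019/04/10/03 \
  /ClusterY/logs/.tmp/partly-cloudy/2019/04/10/03
# Publish the merged partition with an atomic HDFS rename.
hadoop fs -mv /ClusterY/logs/.tmp/partly-cloudy/2019/04/10/03 \
  /ClusterY/logs/partly-cloudy/2019/04/10/03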
Extending Replication to GCS
Diagram: the same Hadoop clusters across DataCenter 1 and DataCenter 2
now also replicate to Cloud Storage.
● Same dataset available on GCS for users
● Unlock Presto on GCP, Hadoop on GCP, BigQuery and other tools
Diagram: a Hadoop cluster in DataCenter 1 replicates to Cloud Storage,
which is then read by BigQuery, GCE VMs, and other GCP tools.
View FileSystem and Google Hadoop Connector
Bucket on GCS : gs://logs.partly-cloudy
Connector Path : /logs/partly-cloudy
Twitter Resolved Path : /gcs/logs/partly-cloudy
Diagram: Twitter's View FileSystem spans Cluster-X, Cluster-Y, and
Cluster-Z across DataCenter-1 and DataCenter-2, and reaches Cloud Storage
through the Cloud Storage Connector; the Replicator works on top of this
view.
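Because the connector plugs in beneath the View FileSystem, users keep
the same Hadoop commands they use on-prem; per the speaker notes, access
looks like this:
# Standard Hadoop FileSystem commands work against the /gcs mount;
# the ViewFS layer resolves them to gs:// bucket paths transparently.
hadoop fs -ls /gcs/logs/partly-cloudy/2019/04/10/03
hadoop fs -du -s -h /gcs/logs/partly-cloudy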
Twitter DataCenter
Architecture behind GCS replication
Source Cluster : /ClusterX/logs/partly-cloudy/2019/04/10/03
Replicator (GCS) runs Distcp on a dedicated Copy Cluster
GCS : /gcs/logs/partly-cloudy/2019/04/10/03
DAL entry for dataset partly-cloudy :
/ClusterX/logs/partly-cloudy
/gcs/logs/partly-cloudy
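The copy itself is the same DistCp job, only with a /gcs destination. A
sketch, again assuming standard options rather than the production job
configuration:
# DistCp tasks on the Copy Cluster stream data from the source HDFS
# straight to GCS via the connector; nothing is staged locally.
hadoop distcp -update \
  /ClusterX/logs/partly-cloudy/2019/04/10/03 \
  /gcs/logs/partly-cloudy/2019/04/10/03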
Twitter DataCenter
Network setup for copy
● Replicator daemon reaches GCP through a proxy group
● Distcp data flows directly from the Copy Cluster to GCS
● Twitter & Google private peering (PNI)
GCS : /gcs/logs/partly-cloudy/2019/04/10/03
Merge same dataset on GCS (Multi Region Bucket)
Twitter DataCenter X-1 : Source ClusterX-1 at
/ClusterX-1/logs/partly-cloudy/2019/04/10/03, copied by Copy Cluster X-1
via Distcp.
Twitter DataCenter X-2 : Source ClusterX-2 at
/ClusterX-2/logs/partly-cloudy/2019/04/10/03, copied by Copy Cluster X-2
via Distcp.
Both copies land in the same Multi Region Bucket on Cloud Storage at
/gcs/logs/partly-cloudy/2019/04/10/03.
Merging and updating DAL
● Multiple Replicators copy the same dataset partition to the destination
● Each Replicator checks for availability of data independently
● Creates individual _SUCCESS_<SRC> files
● Updates DAL when all _SUCCESS_<SRC> files are found
● Updates are idempotent
Flow, per Replicator instance:
1. Compare src and dest. If copied already, go to step 3; if data needs
to be copied, continue.
2. Kick off the distcp job. On failure, stop (and retry later); on
success, continue.
3. Check for ALL per-source success files. If some are missing, let
another instance update DAL; if all are present, update DAL. Done.
Each Replicator updates the partition independently.
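A sketch of the marker protocol, under the assumption that the markers
are plain zero-length files inside the partition directory (the naming
beyond _SUCCESS_<SRC> is not specified in the deck):
# Replicator for ClusterX-1 marks its copy of the partition as complete.
hadoop fs -touchz /gcs/logs/partly-cloudy/2019/04/10/03/_SUCCESS_ClusterX-1
# DAL is updated only once markers from ALL configured sources exist;
# the update is idempotent, so racing replicators are harmless.
hadoop fs -ls /gcs/logs/partly-cloudy/2019/04/10/03/_SUCCESS_*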
Uniform Access for Users
Dataset via EagleEye
● View different destinations for the same dataset
● GCS is just another destination
● Also shows delay for each hourly partition
Query dataset
Find dataset by logical name:
$dal logical-dataset list --role hadoop --name logs.partly-cloudy
| 4031 | http://dallds/401 | hadoop | Prod | logs.partly-cloudy | Active |
List all physical locations:
$dal physical-dataset list --role hadoop --name logs.partly-cloudy
| 26491 | http://dalpds/26491 | dw | viewfs://hadoop-dw-nn/logs/partly-cloudy/yyyy/mm/dd/hh |
| 41065 | http://dalpds/41065 | gcs | gcs:///logs/partly-cloudy/yyyy/mm/dd/hh |
Query partitions of dataset
All partitions for dataset on GCS:
$dal physical-dataset list --role hadoop --name logs.partly-cloudy --location-name gcs
2019-04-01T11:00:00Z 2019-04-01T12:00:00Z gcs:///logs/partly-cloudy/2019/04/01/11 HadoopLzop
2019-04-01T12:00:00Z 2019-04-01T13:00:00Z gcs:///logs/partly-cloudy/2019/04/01/12 HadoopLzop
2019-04-01T13:00:00Z 2019-04-01T14:00:00Z gcs:///logs/partly-cloudy/2019/04/01/13 HadoopLzop
2019-04-01T14:00:00Z 2019-04-01T15:00:00Z gcs:///logs/partly-cloudy/2019/04/01/14 HadoopLzop
2019-04-01T15:00:00Z 2019-04-01T16:00:00Z gcs:///logs/partly-cloudy/2019/04/01/15 HadoopLzop
2019-04-01T16:00:00Z 2019-04-01T17:00:00Z gcs:///logs/partly-cloudy/2019/04/01/16 HadoopLzop
Monitoring
● Rich set of monitoring for Replicator and replicator configs
● Uniform monitoring dashboard for on-prem and cloud replicators
Charts: Read/Write bytes per destination; Latency per destination.
Alerting
● Fine-tuned alert configs per metric per replicator
● Pages on-call for critical issues
● Uniform alert dashboard and config for on-prem and cloud replicators
Replicators per project
Diagram: within the Twitter DataCenter, a shared Copy Cluster runs
Replicator X, Replicator Y, and Replicator Z. Each runs its own Distcp
jobs for /gcs/dataX/2019/04/10/03, /gcs/dataY/2019/04/10/03, and
/gcs/dataZ/2019/04/10/03, writing to Cloud Storage in GCP Project X,
GCP Project Y, and GCP Project Z respectively.
RegEx based path resolution
Twitter ViewFS mounttable.xml:
<property>
  <name>fs.viewfs.mounttable.copycluster.linkRegex.replaceresolveddstpath:-:--;replaceresolveddstpath:_:-#.^/gcs/logs/(?!((tst|test)(_|-)))(?<dataset>[^/]+)</name>
  <value>gs://logs.${dataset}</value>
</property>
<property>
  <name>fs.viewfs.mounttable.copycluster.linkRegex.replaceresolveddstpath:-:--;replaceresolveddstpath:_:-#.^/gcs/user/(?!((tst|test)(_|-)))(?<userName>[^/]+)</name>
  <value>gs://user.${userName}</value>
</property>
Twitter ViewFS Path                 -> GCS bucket
/gcs/logs/partly-cloudy/2019/04/10  -> gs://logs.partly-cloudy/2019/04/10
/gcs/user/lohit/hadoop-stats        -> gs://user.lohit/hadoop-stats
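Per the speaker notes, bucket names follow a standard convention, so the
regex mount points above resolve paths on demand with no per-dataset
configuration. Access stays purely path-based:
# Both resolve dynamically through the regex mount points above,
# to gs://logs.partly-cloudy and gs://user.lohit respectively.
hadoop fs -ls /gcs/logs/partly-cloudy/2019/04/10
hadoop fs -ls /gcs/user/lohit/hadoop-stats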
Where are we today
● Tens of instances of GCS Replicators
● Copied tens of petabytes of data
● Hundreds of thousands of copy jobs
● Unlocked multiple use cases on GCP
Made here
together
Twitter + Google
Google Storage Hadoop connector
● Checksum mismatch between Hadoop FileSystem and Google Cloud Storage
○ Composite checksum HDFS-13056
○ More details in blog post*
● Proxy configuration as path
● Per-user credentials
● Lazy initialization to support View FileSystem
* https://cloud.google.com/blog/products/storage-data-transfer/new-file-checksum-feature-lets-you-validate-data-transfers-between-hdfs-and-cloud-storage
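A sketch of what the composite-checksum fix enables, assuming Hadoop 3.1+
(which carries HDFS-13056) and a connector version exposing the
fs.gs.checksum.type option; this is illustrative, not Twitter's exact job
configuration:
# Enable comparable checksums on both sides so DistCp can validate
# HDFS-to-GCS transfers end to end instead of skipping CRC checks.
hadoop distcp \
  -Ddfs.checksum.combine.mode=COMPOSITE_CRC \
  -Dfs.gs.checksum.type=CRC32C \
  /logs/partly-cloudy/2019/04/10/03 \
  /gcs/logs/partly-cloudy/2019/04/10/03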
Performance and Consistency
● Performance optimizations uncovered while evaluating Presto on GCP
● Cooperative locking in the Google connector for atomic renames
○ https://github.com/GoogleCloudPlatform/bigdata-interop/tree/cooperative_locking
● Same version of connector on-prem and in open source
Summary
We described Twitter's Data Replicator architecture, presented our
solution for extending it to Google Cloud Storage, and maintained a
consistent interface for users.
Acknowledgement
Ran Wang @RanWang18
Zhenzhao Wang @zhen____w
Joseph Boyd @sluicing
Joep Rottinghuis @joep
Hadoop Team @TwitterHadoop
https://cloud.google.com/twitter
Tweet to @TwitterEng
https://careers.twitter.com
Questions
Your feedback is greatly appreciated! Complete the session survey in the
mobile app: 1-5 star rating system, open field for comments, Rate icon in
the status bar.
Thank you
Editor's Notes
• #2: Twitter's Data Replicator for GCS at Google Next 2019. Lohit VijayaRenu, Twitter.
• #6: Data is identified by a dataset name. HDFS is the primary storage for analytics. Users configure replication rules for different clusters, and a dataset also has retention rules defined per cluster. Datasets are always represented on fixed-interval partitions (hourly/daily). A dataset is defined in a system called the Data Access Layer (DAL)*, and data is made available at different destinations using the Replicator.
• #8-#11: All systems rely on global filesystem paths, e.g. /cluster1/dataset-1/2019/04/10/03 or /cluster3/user/larry/dataset-5/2019/04/10/03. Built on Hadoop ViewFileSystem*: each path prefix is mapped to a specific cluster's configuration, which makes it very easy to discover data from a location. The Replicator uses this to resolve paths across clusters. Different FileSystem implementations can be hidden behind ViewFileSystem, e.g. /gcs/user/dataset-9/2019/04/10/03 can map to gs://user-dataset-1-bucket.twttr.net/2019/04/10/03.
• #13: Run one Replicator per destination cluster. Always a pull model. Fault tolerant. 1:1 or N:1 setup. Upon copy, publish to DAL.
• #14: Users set up one replication entry per dataset, with properties: source and destination clusters; copy since X days (optionally copy until Y days); owner, team, contact email; copy job configuration. Different ways to specify configuration: yml, DAL, configdb. Configure a contact email for alerts. Fault-tolerant copy keeps data in sync.
• #15: Long-running daemon (on Mesos). The daemon checks configuration and schedules a copy per hourly partition. Copy jobs are executed as Hadoop distcp jobs on the destination cluster. After the hourly copy, the partition is published to DAL.
• #16: Some datasets are collected across multiple DataCenters. The Replicator kicks off multiple DistCp jobs that copy to a tmp location, then merges the dataset into a single directory and does an atomic rename to the final destination. Renames on HDFS are cheap and atomic, which makes this operation easy.
• #22: Use the same Replicator code to sync data to GCS. Utilize the ViewFileSystem abstraction to hide GCS: /gcs/dataset/2019/04/10/03 maps to gs://dataset.bucket/2019/04/10/03. Use the Google Hadoop Connector to interact with GCS through Hadoop APIs. Distcp jobs run on a dedicated Copy cluster, with a ViewFileSystem mount point on the Copy cluster standing in for the GCS destination. Distcp tasks stream data from the source HDFS to GCS (no local copy).
• #23: The Replicator daemon uses a proxy, while the actual data flows directly to GCP from Twitter. PNI is set up between Twitter and Google.
• #24: Data for the same dataset is aggregated at multiple DataCenters (DC x and DC y). Replicators in each DC schedule individual DistCp jobs, and data from multiple DCs ends up under the same path on GCS.
• #27: UI support via EagleEye to view all replication configurations and their properties (src, dest, owner, email, etc.). CLI support to manage replication configurations: load new or modify existing configurations, list all configurations, mark configurations active/inactive. API support for clients and replicators, with rich API access for all the above operations.
• #28: Command line tools: the dal command line looks up datasets, destinations, and available partitions. API access to DAL: Scalding/Presto query DAL to check partitions for a time range; jobs also link to the scheduler, which can kick off jobs based on new partitions. UI access: EagleEye shows details about datasets and available partitions, including delay per hourly partition. Uniform access on-prem or in the cloud: the interface to dataset properties is the same.
• #32: GCP projects are based on organization. Deploy a separate Replicator with its own credentials per project, on a shared copy cluster per DataCenter. This enables independent updates and reduces the risk of errors.
• #33: Logs vs user path resolution. Projects and buckets have a standard naming convention: logs at gs://logs.<category name>.twttr.net/ and user data at gs://user.<user name>.twttr.net/. Access to these buckets is via standard paths: logs at /gcs/logs/<category name>/ and user data at /gcs/user/<user name>/. Typically a mapping of path prefix to bucket name is needed in the Hadoop ViewFileSystem mounttable.xml; we modified ViewFileSystem to create the mount-table mapping dynamically on demand, since bucket names and path names are standard. No configuration or update is needed.
• #36: The Google Cloud Storage connector is used to access GCS. Existing applications using Hadoop FileSystem APIs continue to work, and existing tools continue to work against GCS; for the most part users do not know there is a separate tool or API for GCS. Users use commands such as:
hadoop fs -ls /gcs/user/larry/my_dataset/2019/01/04
hadoop fs -du -s -h /gcs/user/larry/my_dataset
hadoop fs -put ./file.txt /gcs/user/larry/my_dataset/2019/01/04/file.txt
The Hadoop Cloud Storage connector is installed along with the Hadoop client on jump hosts and Hadoop nodes; applications can also package the connector jar.
• #42: Google supports data at petabyte scale, securely, with our best-in-class analytics and machine learning capabilities to inform real-time decisions and coordinate response on the roads.