DC Migration and Hadoop
Scale For Big Billion Days
Presented By:
Rahul Agarwal, Operation Engineer, Flipkart
Dhiraj Kumar, Software Development Engineer, Flipkart
Chetna Chaudhari, Software Development Engineer, Flipkart
● DC Landscape Before Migration
● Data at Flipkart - FDP Stats
● Infrastructure Challenges
● Hadoop Clusters Iterations
● Data Migration - Challenges
● Data Migration - Utilities Considered and Developed
● Data Migration - Execution
Agenda
1 Gbps shared link
10 Gbps shared link
Primary: All User, Order & FDP systems
Secondary: A few batch-processing systems (User Insights, Ads and Recommendations)
New: All User and FDP systems
DC Landscape
Flipkart Data Platform in the old DC
● ~340 nodes
● ~1.7 PB storage
● ~30 TB RAM
● ~11,000 cores
Flipkart Data Platform in the new DC
● ~1000 nodes
● ~30 PB storage
● ~75 TB RAM
● ~32,000 cores
Data at Flipkart - FDP Stats
● New data ingested daily
o ~6 TB on a business-as-usual day
o ~30 TB on sale days
● Number of raw data streams: ~1000s
● Number of raw events in a day: ~3 billion
● Volume of data processed daily: ~0.6 PB
● Number of Hadoop jobs run each day: ~10K
Infrastructure Challenges
C1: Validation of the hardware in an entirely new
infrastructure
S1: Ripper, an MR-based utility performing native
writes/reads on all disks of all DataNodes.
C2: Ephemeral IP Addresses and Lack of DNS
S2: Managed Config Service:
• Stores hostname/IP mappings as key/value pairs
• ConfD implementation on all clients
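The S2 idea can be sketched as a resolver backed by a key/value store rather than DNS, with a ConfD-style render step emitting hosts-file entries on every client. The class name, store contents, and API below are hypothetical; they only illustrate the shape of the solution.

```python
# Sketch of hostname -> IP resolution backed by a key/value store
# instead of DNS. Names and the store API are illustrative; in the
# deck's setup this role was played by a managed config service with
# ConfD rendering host entries on all clients.

class ConfigServiceResolver:
    def __init__(self, kv_store):
        # kv_store: dict-like mapping hostname -> current IP
        self.kv_store = kv_store

    def resolve(self, hostname):
        ip = self.kv_store.get(hostname)
        if ip is None:
            raise LookupError(f"no entry for {hostname}")
        return ip

    def render_hosts_file(self):
        # Roughly what a ConfD template would emit into /etc/hosts.
        return "\n".join(
            f"{ip}\t{host}" for host, ip in sorted(self.kv_store.items())
        )
```

When an instance is replaced and its IP changes, only the key/value entry is updated; every client re-renders its host entries instead of waiting on DNS propagation.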
Infrastructure Challenges - Contd.
C3: Ephemeral Disks and Nodes
S3.1: Quorum of 5 for ZK/JN
S3.2: Isolated Deployments for Each Component
C4: Less memory on the NameNode - ~180 million FS objects
S4.1: Delete zero-size files.
S4.2: Switching to the G1 garbage collector helped reduce pause times.
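The zero-size-file cleanup in S4.1 can be sketched as a parse of `hdfs dfs -ls -R` output that collects empty files for deletion. The field positions assume the standard Hadoop `ls` layout (permissions, replication, owner, group, size, date, time, path); verify against your version's output before relying on it.

```python
# Sketch: find zero-size files from `hdfs dfs -ls -R` output so they
# can be deleted to shrink the NameNode's FS-object count. Assumes the
# standard ls field layout; directories (size 0 by convention) are
# skipped by checking the permissions column.

def zero_size_files(ls_output):
    paths = []
    for line in ls_output.splitlines():
        fields = line.split(None, 7)
        if len(fields) == 8 and not fields[0].startswith("d"):
            size, path = fields[4], fields[7]
            if size == "0":
                paths.append(path)
    return paths
```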
Hadoop Cluster Iterations - RedPill
• Utility Around IaaS and Ambari:
– Acquire Instance Types as requested
– Setup MySQL, Ambari Server and Agents
– Determine cluster configurations based on Instance Types
– Co-host master components for Dev/Test deployments; isolated
components for Prod setups
– Generate Blueprint and Cluster Templates
– Deploy the cluster from the Blueprint, with an initial batch of 50 nodes
– Horizontally scale the cluster by adding further batches, to prevent
repo-service-related failures
– TAT of ~20 mins for a 100-node cluster
Data Migration
Data Migration Challenges
• Data publishers/consumers not moving together
– Data consumers could move earlier than the
publishers, or vice versa.
• Migrating PBs of data not feasible over network
• Consistency for raw, prepared and reporting data
• Moving Disks from one center to another
– Data centers in different states - Legal Challenges
– Live data, in 24/7 use for analytics
• Replicate the data (Copy, Mirror and Regenerate)
– Files being created and deleted continuously
– Build the supporting services for scale
Solutions Considered | Data Migration
Data Migration Utilities - DistCp
• Poor performance on small files
• Takes long to build the copy index
• Hard to figure out corruption/copy aborts
• Content-based data validation is weak (CRC only)
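The CRC weakness is why the in-house utilities below validate copies with MD5 digests of file content instead. A minimal sketch of that check, hashing in chunks so arbitrarily large files verify with constant memory (illustrative; in the actual pipeline this ran inside MapReduce tasks):

```python
import hashlib

# Content-based validation: MD5 over the file bytes, streamed in
# chunks rather than read whole, so multi-GB files hash in constant
# memory.

def md5_of_chunks(chunks):
    digest = hashlib.md5()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()

def md5_of_file(fileobj, chunk_size=8 * 1024 * 1024):
    # iter(callable, sentinel) keeps reading until read() returns b"".
    return md5_of_chunks(iter(lambda: fileobj.read(chunk_size), b""))
```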
Data Migration Utilities - Transporter
• Configurable Batch Sizes
• Compression at Source
• MD5 sum of the content
• HAR to bundle small files
• DistCP HAR in Binary Mode
• UnHAR at Destination
• MR Validation for MD5 sum of the content
• Regenerate Production Hierarchy
• File Counts Verification
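Transporter's last two steps (MD5 validation plus file-count verification) can be sketched as a manifest comparison: digests computed at the source are checked against digests recomputed after un-HARing at the destination. The manifest format (path to MD5 hex digest) is an assumption for illustration.

```python
# Sketch of destination-side verification: compare a source MD5
# manifest against one recomputed at the destination, plus a
# file-count check. An empty result means the batch verified clean.

def verify_manifests(source, dest):
    problems = []
    if len(source) != len(dest):
        problems.append(f"file count mismatch: {len(source)} vs {len(dest)}")
    for path, md5 in source.items():
        if path not in dest:
            problems.append(f"missing at destination: {path}")
        elif dest[path] != md5:
            problems.append(f"checksum mismatch: {path}")
    return problems
```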
Data Migration Utilities - BlueShift
• OSS: https://github.com/flipkart-incubator/blueshift
• Features:
– On-the-fly compression
– Bulk migration in batches of over 10 million files
– State management options: either HDFS or a DB
– Optimized task scheduling
– Capable of using different protocols for source and destination
– MD5-based checksums to ensure no corruption
– Time-based file filtering
– File-size-based filtering
– Option to ignore exceptions and continue processing
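The time- and size-based filtering features can be sketched as predicates over (path, size, mtime) records. The record layout and parameter names are assumptions for illustration, not BlueShift's actual configuration.

```python
# Sketch of BlueShift-style file selection: keep only files modified
# after a cutoff and/or under a size limit. Records are
# (path, size_bytes, mtime) tuples; field names are illustrative.

def select_files(files, modified_after=None, max_size=None):
    selected = []
    for path, size, mtime in files:
        if modified_after is not None and mtime <= modified_after:
            continue  # too old: already migrated in an earlier pass
        if max_size is not None and size > max_size:
            continue  # too large: handled by a separate batch
        selected.append(path)
    return selected
```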
Data Replication
• Copied only raw data, about O(100 TB) compressed
• Too many small files
• Some files were very large
– All prepared and reporting data was regenerated from raw data
• Propagated delta changes using an Apache Kafka mirror
• Verification utilities checked the correctness of data in both clusters
• Ran the full data platform stack in both places for over 2 weeks, until
all data publishers and consumers had moved
2-way sync of Kafka Streams
Old DC (Kafka A, Kafka B) ⇄ Mirror ⇄ New DC (Kafka C)
HBase Migration | Solutions
• CopyTable
– Issues:
• Full table scan - time consuming
• Secure-to-unsecure copy not supported
• HBase Import/Export
– Issues:
• Full table scan
• Slower than CopyTable
• Needs manual intervention
• Extra space required
• Decompression during export
• Chosen approach: Blue-Shift and the HBase bulk loader
Blue-Shift + HBase Bulk Load
HDFS → FTP → Disks transferred over road (trucks) → HDFS → HBase
Blue-Shift + HBase Bulk Load
• Moved snapshots of derived/computed data over wire (relatively
small)
• Used physical disks to move data (stored in HBase)
• Avoided HBase export; instead transferred HFiles onto disks using
Blueshift
– knapsack'ed ~50K files onto dozens of physical hard disks
• Disks shipped to new DC
• Transferred HFiles into HDFS using Blueshift
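The "knapsack" step of spreading ~50K HFiles across physical disks can be sketched with a first-fit-decreasing bin-packing heuristic. The actual packing strategy isn't described in the slides, so treat this as one plausible approach, not the method used.

```python
# Sketch: pack files onto disks of fixed capacity using
# first-fit-decreasing, a simple heuristic for bin packing. Largest
# files are placed first; each file goes on the first disk with room,
# otherwise a new disk is opened.

def pack_onto_disks(file_sizes, disk_capacity):
    disks = []  # each entry: [remaining_capacity, [file names]]
    for name, size in sorted(file_sizes.items(), key=lambda kv: -kv[1]):
        for disk in disks:
            if disk[0] >= size:
                disk[0] -= size
                disk[1].append(name)
                break
        else:
            # No existing disk fits: open a new one.
            disks.append([disk_capacity - size, [name]])
    return [files for _, files in disks]
```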
Learnings
• MD5 checksums - a big win
• Should have used a workflow system
• An automated process is a must
• Per-tenant isolation is a must
Achievement :)
• Migration without downtime
Thank You !!
Questions ??
@cchaudhari11
@rahul67
@dhiraj2kumar