Scaling Managed MySQL Platform in Flipkart
The story of how flipkart.com manages its massive MySQL fleets
Sachin Japate
LEAD SRE
India's largest e-commerce player
400 Million Registered Users
10 Million Daily Page Visits
8 Million Shipments per month
100,000 Sellers
22 state-of-the-art warehouses
3 On-Prem Data Centers
Sachin Japate
Lead SRE/MySQL SME @ Flipkart
9+ Years in Flipkart
Managed MySQL and D-SQL Platform Teams
Flipkart Group, India's largest e-commerce player
Agenda
01 Tech Landscape
02 The Big Problem
03 Enter Altair
04 9 Challenges
05 Stats
06 Architecture
07 Future
08 Demo (time permitting)
09 Questions
Time slots listed on the slide: 05 min, 20 min, 10 min, 10 min
Tech Landscape
Overview of Databases @ Flipkart
At the heart of all e-commerce businesses is an incredibly complex transactional network of multiple microservices, such as Order Management, Supply Chain, Logistics, and Seller-Management, that have strong consistency requirements.
A wide variety of tech stacks power the different microservices, which facilitate the seamless functioning of the e-commerce systems.
● MySQL is the most common data store, used by over 70% of our systems.
● Other datastores are Redis / ElasticSearch / HBase / MongoDB / ZooKeeper / TiDB / Cassandra, etc.
● The Hot Store Transactional footprint is over 2 petabytes.
Flipkart Cloud Platform
3 state-of-the-art on-prem Data Centers in India
Two in Chennai & one in Hyderabad (Renewable Energy)
Customized Hardware
Customized hardware for mission-critical computing, storage, artificial intelligence & machine learning capabilities, backed by an ultra-low latency network.
VMs and their choices
● Compute / Memory / Storage optimized instance types
● Various generations of hardware (cores, disk, memory)
● Storage flavours: local HDDs / SSDs / JBODs / network-attached storage
● Custom cuts for very specific use-cases
Robust Design
All Data Centers are built for security, scale, elasticity and multi-zone resilience, with custom-designed racks and intelligent power and cooling.
Hybrid Cloud
Hybrid setup with Google Cloud Platform for bursting into public cloud.
Why?
The BIG Problem
Developer Productivity
Developer productivity was seen to take a major hit. Every team using MySQL needed to invest heavily in:
● Developer bandwidth
● Best practice adoption
● DB tuning
● Time spent on Ops (solutioning / setup / maintenance, backup, migration)
● Overdependence on MySQL specialists
● Tribal knowledge risks
Policy Enforcement Challenges
Enforcing Security & Auditing policies on a decentralised model meant heavy program management and a far longer time to get to the desired state.
Core vs Context
As a result, teams were finding it increasingly difficult to focus on the core business products, as a lot of time was instead being spent on managing these underlying technology stacks.
Enter Altair
Flipkart's in-house DBaaS
★ DBaaS for MySQL built on top of Flipkart Cloud Platform (in-house)
★ Offered a seamless MySQL provisioning / maintenance / cluster management experience
★ Abstracted infrastructure provisioning with complete platform service integration
★ Systematically solved Flipkart's MySQL challenges
Let's see how this was achieved and what challenges came along!
Challenge #1
The Time Challenge
“How do we reduce the overall time to create a MySQL cluster?”
Engineers first had to get hardware funded, then create a VM using the CLI, figure out all the permissions, install MySQL and its dependent libraries, and work out how to import data, relying mainly on documentation that could be out of date. This process typically took almost a day.
Here’s what we did:
● Removed the need for infrastructure provisioning and for installing and maintaining MySQL software. Everything was under the hood now.
● Built a self-serve user interface and pre-provisioned all accounts so there were no manual operations.
● Altair took teams from project conception to deployment, with a target of under 2 minutes to provision a production-grade MySQL on :3306. Behind the scenes, all integrations with Cloud Services happened in a jiffy.
Challenge #2
The High Adoption Challenge
“How to ensure adoption is high?”
Most of Flipkart was on MySQL 5.6 and 5.7. Teams feared the move (losing control of their MySQL databases to a different team) and came up with various reasons not to onboard.
Here’s what we did:
● Hand-held some of the largest teams and moved them to Altair.
● Built a seamless cluster migration flow.
● Drove an internal program encouraging teams to move their Stage/Dev/NFR clusters to Altair.
● Eventually teams started moving their production clusters to Altair and haven't moved out since!
Challenge #3
The High Security Challenge
“How to ensure tight security controls?”
Teams were using insecure versions of MySQL, installing scripts on the DB box, sharing root credentials openly, and not paying much attention to security controls.
Here’s what we did:
● We completely blocked all SSH access for everyone, including the owners of the MySQL clusters. Only the central team had access.
● We completely blocked giving out SUPER/Admin privileges to MySQL users.
● Differentiated between human and machine access: service accounts for apps, while humans had an approval-based system for controlled, time-bound access to MySQL.
● No more spurious scripts and nondescript crons running on MySQL boxes. Only certain limited privileges were now available to MySQL users. The internal databases were accessible only by root.
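The approval-based, time-bound access for humans described above could be modelled roughly as follows. This is only a sketch under stated assumptions: `AccessGrant`, its fields, and `is_access_allowed` are hypothetical names chosen for illustration, not Altair's actual data model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class AccessGrant:
    """Hypothetical record of an approved, time-bound access request."""
    user: str
    cluster: str
    approved_by: str      # empty string means the request was never approved
    expires_at: datetime


def is_access_allowed(grant: AccessGrant, user: str, cluster: str, now: datetime) -> bool:
    """Allow access only for the approved user and cluster, and only until expiry."""
    return (
        grant.user == user
        and grant.cluster == cluster
        and bool(grant.approved_by)
        and now < grant.expires_at
    )


# Example: a two-hour grant issued at 10:00.
issued = datetime(2023, 1, 1, 10, 0)
grant = AccessGrant("alice", "orders-db", "oncall-lead", issued + timedelta(hours=2))
```

With this grant, a check at 11:00 passes and a check at 12:30 fails, so access lapses on its own without anyone having to revoke it.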
Challenge #4
The Disaster Recovery and Business Continuity Process Challenge
“How to ensure disaster recovery and business continuity planning?”
BCP/DR was a decentralised model in Flipkart, meaning more program management. Not all teams paid close attention to BCP/DR. In addition, the tooling had to be set up manually via the CLI.
Here’s what we did:
● Integrated with internal tooling that allowed teams to define RPOs and RTOs as first-class settings for their databases.
● The tool ensured backups were taken regularly at a predefined time. It also supported both INCR and FULL backups.
● Backups were taken from a dedicated backup node instead of an HS or RR node, and were pushed to both near-site and far-site storage so that DC-wide failures could be recovered from.
● Built a self-serve way to restore the latest backup in either region, in addition to supporting multi-region MySQL clusters.
● Schrödinger Backups were eliminated: "The state of a backup is unknown unless a restore is performed on it."
● Started regularly tracking the backups that kept failing for various reasons and fixing them under the hood systematically.
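One way to picture the backup tracking above is a periodic check that flags clusters whose newest restore-verified backup is older than the team's RPO. The sketch below is an assumption-laden illustration: `BackupRecord`, `rpo_violations`, and the use of backup age as a stand-in for RPO compliance are simplifications introduced here, not a description of the internal tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List


@dataclass
class BackupRecord:
    """Hypothetical backup catalogue entry."""
    cluster: str
    finished_at: datetime
    restore_verified: bool  # True only if a test restore succeeded


def rpo_violations(records: List[BackupRecord],
                   rpo_by_cluster: Dict[str, timedelta],
                   now: datetime) -> List[str]:
    """Return clusters with no restore-verified backup within their RPO."""
    latest: Dict[str, datetime] = {}
    for rec in records:
        if not rec.restore_verified:
            continue  # an unverified backup is treated as if it did not exist
        prev = latest.get(rec.cluster)
        if prev is None or rec.finished_at > prev:
            latest[rec.cluster] = rec.finished_at
    violations = []
    for cluster, rpo in rpo_by_cluster.items():
        last = latest.get(cluster)
        if last is None or now - last > rpo:
            violations.append(cluster)
    return sorted(violations)
```

Ignoring unverified backups is the design choice that matters here: it encodes the "Schrödinger backup" rule that a backup only counts once a restore has been performed on it.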
Challenge #5
The High Availability Challenge
“How to ensure High Availability?”
High Availability was one of the most important challenges to solve in Flipkart. MySQL could go down late at night, and failover was manual, requiring config changes in apps followed by a restart.
Here’s what we did:
● Built a ZK-based highly available monitoring system that detected failures in seconds.
● Developed the Auto-promote feature using well-tested recovery workflows that immediately kick-started the recovery process after thorough & deep checks for false positives.
● Integrated with internal DNS and Floating-IP to ensure the newly promoted Source continued to be accessible on the same DNS.
● This meant no more stopping apps, changing IP addresses, and restarting. It was just a blip in the traffic, and the regular connection retry handled DB failure just fine.
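The two decisions in an auto-promote flow, confirming that the Source is really down and choosing which node to promote, can be sketched as below. All names (`Replica`, `applied_txn`, the quorum rule) are hypothetical and chosen for illustration; the actual checks Altair performs are not described in the slides.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Replica:
    name: str
    healthy: bool
    applied_txn: int  # hypothetical counter: highest transaction applied so far


def source_is_down(probe_results: List[bool], quorum: int) -> bool:
    """Guard against false positives: require `quorum` independent failed probes."""
    failures = sum(1 for ok in probe_results if not ok)
    return failures >= quorum


def pick_promotion_candidate(replicas: List[Replica]) -> Optional[Replica]:
    """Pick the healthy replica that has applied the most transactions."""
    healthy = [r for r in replicas if r.healthy]
    if not healthy:
        return None
    return max(healthy, key=lambda r: r.applied_txn)
```

Promotion would then be followed by moving the DNS name or floating IP to the chosen node, so applications reconnect through their normal retry path.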
Challenge #6
The DB Tuning Challenge
“How to ensure databases are well tuned?”
DB tuning was not a well-understood problem: tuning the memory configuration of MySQL needed specialised knowledge (SME / DBA), which wouldn’t scale.
Here’s what we did:
● Built an in-house variable validation system working on various combinations of about 50 variables, and a recommendation system that recommended values for the tunables, considering the hardware and the MySQL version.
● Developed a team of HA DBAs for tuning very specific corner cases.
● Set up an auto-restart for variables that needed a MySQL restart, and differentiated tuning for Source and RR.
● Posted clear error messages for users who wanted to increase all parameters for the best performance.
● Teams were far more confident of their tuning. The settings were also saved in Altair, so teams no longer had to worry about losing them.
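A variable validation of the kind described above might, for example, reject memory settings whose worst case exceeds the VM's RAM. The sketch below uses real MySQL variable names, but the formula (global buffer pool plus per-connection buffers times `max_connections`) is a common rough heuristic and the 80% budget is an assumption; neither is claimed to be Altair's actual rule set.

```python
from typing import Dict, List

GIB = 1024 ** 3
MIB = 1024 ** 2


def validate_memory_settings(settings: Dict[str, int], ram_bytes: int,
                             max_percent: int = 80) -> List[str]:
    """Return error messages if the worst-case memory use exceeds the budget."""
    per_connection = (settings["sort_buffer_size"]
                      + settings["join_buffer_size"]
                      + settings["read_buffer_size"])
    worst_case = (settings["innodb_buffer_pool_size"]
                  + settings["max_connections"] * per_connection)
    budget = ram_bytes * max_percent // 100  # assumed headroom for the OS
    errors = []
    if worst_case > budget:
        errors.append(
            f"worst-case memory {worst_case} bytes exceeds budget {budget} bytes; "
            "lower innodb_buffer_pool_size or max_connections"
        )
    return errors


# Example: a 16 GiB VM where the user asks for a 12 GiB buffer pool.
requested = {
    "innodb_buffer_pool_size": 12 * GIB,
    "max_connections": 500,
    "sort_buffer_size": 2 * MIB,
    "join_buffer_size": 2 * MIB,
    "read_buffer_size": 1 * MIB,
}
```

Validating combinations rather than single variables is what lets the platform return a clear error when a user tries to raise every parameter at once.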
Challenge #7
The Observability Challenge
“How to ensure good observability despite lack of ROOT access?”
There were no standard deep dashboards for MySQL observability across teams. Such dashboards were typically powered by metrics that needed ROOT access, something we didn't intend to provide.
Here’s what we did:
● Standardised dashboards across the organization and integrated with the OpenTSDB-based internal metric monitoring system.
● Pre-built deep Grafana dashboards with overall cluster health, member health, MySQL-specific, InnoDB-specific, System & Network dashboards at a MySQL cluster level. PMM was the benchmark here; we have started work on supporting PMM.
● Pre-created cluster-level alerts with recommended thresholds and frequencies, integrated event-based alerting tied directly to the team's on-call calendar, and separated customer alerts from Altair admin alerts.
● Built auditing & event-logging on the cluster. Users could download the slow-query log, error.log, etc., directly from the UI.
Challenge #8
The Hardware Abstraction Challenge
“How to abstract hardware problems away from the user?”
Hardware failures are common in any large fleet. Earlier, we tracked the hardware maintenance schedule over email, which was cumbersome to remember, regulate, and reschedule.
Here’s what we did:
● Integrated with the hardware maintenance schedule API (low-level APIs).
● Created an internal Scheduled Maintenance mapped to the underlying hardware Scheduled Maintenance, which the client could reschedule to low-traffic hours.
● Scheduled Maintenance helped move the VMs away from the affected mother ships well before the actual hardware maintenance activity. Ensured a good FD (Failure Domain) distribution at a cluster level.
● Built deep health-checks to track various hardware problems and replace VMs for unplanned maintenance.
● Teams benefited greatly from this feature and gained back significant time. Adoption also increased.
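The idea of a good failure-domain distribution can be illustrated with a simple round-robin placement and a check of how many cluster members a single FD outage would take out. This is a minimal sketch with hypothetical node and FD names; the platform's real placement logic is not described in the slides.

```python
from collections import Counter
from typing import Dict, List


def place_nodes(nodes: List[str], failure_domains: List[str]) -> Dict[str, str]:
    """Assign nodes to failure domains round-robin so they spread evenly."""
    if not failure_domains:
        raise ValueError("at least one failure domain is required")
    return {node: failure_domains[i % len(failure_domains)]
            for i, node in enumerate(nodes)}


def max_nodes_lost(placement: Dict[str, str]) -> int:
    """Worst case: number of cluster members sharing a single failure domain."""
    counts = Counter(placement.values())
    return max(counts.values(), default=0)


placement = place_nodes(["source", "hs", "rr1", "rr2"], ["fd-a", "fd-b", "fd-c"])
```

Listing the Source and HS nodes first keeps them in different failure domains whenever at least two domains are available.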
Challenge #9
The Feature Compatibility without ROOT Challenge
“How to achieve feature compatibility without ROOT access?”
Teams were using common features that needed ROOT/elevated access. Altair had to bridge that gap for successful onboarding without increasing our on-call load.
Here’s what we did:
● Built interfaces to CRUD databases and users.
● Pre-created stored procedures that allowed viewing debugging information without ROOT access.
● Automatic binlog trimming and binlog streaming for binlogs and GTID.
● Automatic handling of disk divergence between Primary and Replicas, and auto durability settings (for reducing replica lag).
● Custom topology support by scaling out read replicas and adding / removing HS/Backup nodes.
● We could support upgrading and downgrading MySQL clusters before sale events, Migration & Cutover, along with user and DB creation from the UI. So far, nobody has complained about losing the ROOT privileges!
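Automatic binlog trimming has to decide how far it is safe to purge. A minimal sketch, assuming binlog files are named `<prefix>.<6-digit index>` and that a file is safe to drop only once every replica has moved past it and it has been streamed to an archive, could look like this. The function names and the archive concept are illustrative assumptions, not Altair's implementation.

```python
from typing import List, Optional


def binlog_index(file_name: str) -> int:
    """Extract the numeric suffix, e.g. 'mysql-bin.000012' -> 12."""
    return int(file_name.rsplit(".", 1)[1])


def purge_bound(replica_files: List[str], archived_up_to: str) -> Optional[str]:
    """Return the oldest binlog that must be kept; older files are safe to purge.

    replica_files: the binlog file each replica is currently reading.
    archived_up_to: the newest binlog already streamed to the archive.
    """
    if not replica_files:
        return None  # no replica positions known: hold off rather than guess
    prefix = archived_up_to.rsplit(".", 1)[0]
    oldest_needed = min(binlog_index(f) for f in replica_files)
    first_unarchived = binlog_index(archived_up_to) + 1
    bound = min(oldest_needed, first_unarchived)
    return f"{prefix}.{bound:06d}"
```

The returned name fits MySQL's `PURGE BINARY LOGS TO '<file>'` statement, which removes the log files that precede the named file.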
Stats
● 700+ Clusters: across 128 teams
● 2500+ VMs: across CH and HYD continuously
● 1.5 Petabyte footprint
● 8 Member Team
● 8000+ Dashboards
● 3500+ Failure Recoveries: includes planned and unplanned failures of all nodes
● 400+ Auto Failovers: includes planned and unplanned failures of the Source node
● 500+ Live Migrations: existing clusters to Altair and A2A
Architecture
What Next ?
Building for the future
● K8s statefulset support
● GCP support
● Compute Storage Segregation
MySQL upgrades
● MySQL v8.0 support
● Semi Sync Replication
● Bidirectional Replication
Open Sourcing Track
We have started work on open-sourcing Altair as an operator on Kubernetes, starting with MySQL.
Product Demo
Self Serve Portal
http://altair.fkcloud.it
Questions ?
Thank You !
sachin.japate@flipkart.com
https://www.linkedin.com/in/sachin-japate-20403a2a/

Scaling managed MySQL Platform in Flipkart - (Sachin Japate - Flipkart) - Mydbops 13th Opensource Database Meetup
