Lessons from database failures
Colin Charles, Chief Evangelist, Percona Inc.

colin.charles@percona.com / byte@bytebot.net

http://www.bytebot.net/blog/ | @bytebot on Twitter

Percona Live Europe Amsterdam, Netherlands

5 October 2016
whoami
• Chief Evangelist (in the CTO office), Percona Inc

• Focusing on the MySQL ecosystem (MySQL, Percona Server, MariaDB
Server), as well as the MongoDB ecosystem (Percona Server for
MongoDB) + 100% open source tools from Percona like Percona
Monitoring & Management, Percona xtrabackup, Percona Toolkit, etc.

• Founding team of MariaDB Server (2009-2016), previously at Monty
Program Ab, merged with SkySQL Ab, now MariaDB Corporation

• Formerly MySQL AB (exit: Sun Microsystems)

• Past lives include Fedora Project (FESCO), OpenOffice.org

• MySQL Community Contributor of the Year Award winner 2014
Agenda
• Backups (and verification)

• Replication (and failover)

• Security (and encryption)
ma.gnolia.com
ma.gnolia.com’s failure
• January 30 2009: complete outage

• February 17 2009: data corruption in the UDB, essentially dead

• What happened?

• Ruby on Rails on four self-hosted Mac Minis, a couple of
Xserves, 500GB+ MySQL 5 DB

• Filesystem corruption, corrupted database backup

• No versioning, didn’t check if the backups worked, used rsync
to back up the database over a FireWire network
ma.gnolia.com today?
• EC2 for the app with EBS snapshots, RDS with snapshots, Multi-AZ
deployment

• Self-hosted?

• xtrabackup
• START TRANSACTION WITH CONSISTENT SNAPSHOT +
mysqldump --single-transaction --master-data
• Backup a replica

• Replication event checksums
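The self-hosted options above can be sketched in Python. This is a minimal illustration, not a complete backup tool: it builds a mysqldump command line with the two flags named on the slide and compares checksums of a backup and its copy. The database name "appdb" and the helper names are assumptions made for the example.

```python
import hashlib
import shlex


def mysqldump_command(database, master_data=2):
    """Build a mysqldump argv for a consistent InnoDB dump.

    --single-transaction dumps from one consistent snapshot;
    --master-data=2 writes the binary log coordinates as a comment.
    """
    return [
        "mysqldump",
        "--single-transaction",
        f"--master-data={master_data}",
        database,
    ]


def sha256_of(path, chunk_size=1 << 20):
    """Checksum a backup file in chunks so large dumps fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_is_intact(original_path, copy_path):
    """True when the copied backup is byte-for-byte identical."""
    return sha256_of(original_path) == sha256_of(copy_path)


if __name__ == "__main__":
    # "appdb" is a placeholder database name.
    print(shlex.join(mysqldump_command("appdb")))
```

A matching checksum only shows the copy was not damaged in transit. It does not show the dump restores cleanly, so a periodic test restore is still needed.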
Couchsurfing, 2006
Couchsurfing problems
1. major, avoidable hard drive crash

2. incremental backups weren’t executed in the correct manner, and
twelve of our most important data files didn’t survive
Time-delayed replication
• MySQL 5.6+ has time-delayed replication. Stop replication when you
know a mistake has happened before it propagates to all the slaves.

• Feature suggestion since 2001! Bug reported August 2006
(mysql#21639). Pushed June 2010 (WL#344). GA February 2013.
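As a rough sketch of how a delayed slave might be set up and used, the helpers below generate the SQL statements involved. MASTER_DELAY and START SLAVE UNTIL are MySQL 5.6 syntax. The function names, the one-hour delay, and the binary log coordinates are placeholders chosen for illustration.

```python
def delay_replica_statements(delay_seconds):
    """SQL to make a slave apply events delay_seconds behind the master."""
    if delay_seconds < 0:
        raise ValueError("delay must not be negative")
    return [
        "STOP SLAVE;",
        f"CHANGE MASTER TO MASTER_DELAY = {int(delay_seconds)};",
        "START SLAVE;",
    ]


def replay_until_statements(log_file, log_pos):
    """SQL to apply events up to a position just before the mistake.

    log_file and log_pos are the master's binary log coordinates of the
    bad statement; finding them (e.g. with mysqlbinlog) is out of scope.
    """
    if "'" in log_file or "\\" in log_file:
        raise ValueError("unexpected character in log file name")
    return [
        "STOP SLAVE;",
        "START SLAVE UNTIL "
        f"MASTER_LOG_FILE = '{log_file}', MASTER_LOG_POS = {int(log_pos)};",
    ]


if __name__ == "__main__":
    # One hour of delay; the coordinates below are made-up placeholders.
    for statement in delay_replica_statements(3600):
        print(statement)
    for statement in replay_until_statements("mysql-bin.000042", 1337):
        print(statement)
```

The delay buys time to notice the mistake; the second helper covers the step after that, rolling the slave forward to just before the bad event.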
Why replicate?
• Scale out

• [automatic] (master) failover

• Geographical redundancy across multiple data centres

• Online schema changes
Replication
• Asynchronous (default)

• (Enhanced loss-less) Semi-synchronous (plugin)

• Synchronous (Galera, group replication, NDBCLUSTER)

• DRBD
Frameworks
• MySQL-MMM

• Severalnines ClusterControl

• Orchestrator

• MySQL MHA

• Tungsten Replicator

• 5.6+ utilities: mysqlfailover, mysqlrpladmin

• Percona Replication Manager
(https://github.com/percona/percona-pacemaker-agents/)

• Replication Manager
(github.com/tanji/replication-manager)
GitHub
https://github.com/blog/1261-github-availability-this-week
Fully automated failover a good idea?
• False alarms

• Repeated failover

• Overloaded master? MHA doesn’t allow another failover within 8h of
the last one, unless --last_failover_minute=n is set

• Data loss

• master’s latest event is id=103, slave relay logs only reach
id=101 => events 102-103 are lost

• group commit in the binary log

• Split brain
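Two of the checks above can be sketched as plain functions: refusing a failover that follows too closely on the previous one, and counting events the slave never received. The 480-minute default mirrors the 8-hour window mentioned for MHA. Both function names and the simple integer event IDs are assumptions for illustration, not MHA's implementation.

```python
from datetime import datetime, timedelta


def failover_allowed(now, last_failover, min_interval_minutes=480):
    """Guard against repeated failover (flapping).

    Returns False when the previous failover happened less than
    min_interval_minutes ago; None means no failover has happened yet.
    """
    if last_failover is None:
        return True
    return now - last_failover >= timedelta(minutes=min_interval_minutes)


def events_at_risk(master_last_id, replica_received_id):
    """Number of events the master committed but the slave never received."""
    return max(0, master_last_id - replica_received_id)


if __name__ == "__main__":
    now = datetime(2016, 10, 5, 12, 0)
    two_hours_ago = now - timedelta(hours=2)
    print(failover_allowed(now, two_hours_ago))  # too recent
    print(events_at_risk(103, 101))              # the slide's example
```

A real failover manager would also need fencing of the old master to avoid split brain, which this sketch does not attempt.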
Proxies
• MariaDB MaxScale

• MaxScale as binlog server @ Booking - to replace intermediate
masters (downloads binlog from master, saves to disk, serves to
slave as if served from master)

• Popular use: load balancing Galera clusters

• MySQL Router + MySQL Fabric

• ProxySQL

• Used alongside Galera clusters too
Sharding
• SPIDER

• Tungsten Replicator

• Tumblr JetPants
Vitess
• Servers & tools, written in Go, to scale MySQL for the web

• Has MariaDB support too (*)

• Python client interface

• DML annotation, connection pooling, shard management, workflow
management, zero downtime restarts

• Has become super easy to use: http://vitess.io/ (with the help of
Kubernetes)
Failwhales
• Twitter started on MySQL, and is still MySQL - you just need to
“evolve”

• Gizzard (sharding), Mesos + Apache Cotton

• Digg started on MySQL, migrated to Cassandra, and came back to
MySQL
Security
• Philippines voter data leak puts 55m at risk: 338GB MySQL dump

• Ashley Madison: 6.9GB compressed dump, 36m email addresses
leaked, 9.6m credit card transactions

• Patreon: 13.7GB MySQL dump, 99 tables
Mossack Fonseca: Panama Papers
Prevent SQL injections
• MariaDB MaxScale database firewall filter

• Configurable filter actions on rule match (allow the query, block
the query or ignore the match); logging of matching and/or
non-matching queries

• MySQL Enterprise firewall
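The firewalls above sit in front of the database server. A complementary application-side technique, not named on the slide, is to pass user input as bound parameters instead of concatenating it into SQL. The sketch below uses Python's built-in sqlite3 module only so that it runs without a server; the users table and the function names are hypothetical.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database with a small, made-up users table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)]
    )
    return conn


def find_user(conn, name):
    """Look up a user with a bound parameter.

    The driver sends the value separately from the SQL text, so quotes
    in the input are treated as data rather than as SQL syntax.
    """
    cursor = conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (name,)
    )
    return cursor.fetchall()


if __name__ == "__main__":
    conn = make_demo_db()
    print(find_user(conn, "alice"))
    # A classic injection string matches no row when bound as a parameter.
    print(find_user(conn, "' OR '1'='1"))
```

MySQL drivers offer the same idea with their own placeholder syntax; the point is only that the query text never contains the user's input.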
Encryption at rest
• MariaDB Server 10.1: table or tablespace encryption

• design goal: Encrypt all user data that may touch the disk — InnoDB
data, InnoDB logs, binary logs, temporary tables, temporary files

• key management on the filesystem? [no key rotation] Amazon KMS? 

• caveats: mysqlbinlog needs work with encrypted binlogs; Galera
Cluster gcache isn’t encrypted

• MySQL 5.7: only encrypts InnoDB tablespaces (innodb_file_per_table;
logs unencrypted)
In conclusion…
• Use semi-sync replication with a failover solution that ensures you
don’t fail over too often

• Make good backups. Test them. Save them.

• You’ll most definitely need to shard your data, use proven
frameworks and get a proxy involved. Complete backups with
multi-source replication when needed.

• Use mysqldump and xtrabackup together (and mydumper for
parallel backup/restore; mysqlpump)

• Security is key: prevent SQL injections, encrypt your data at rest
It’s 2016, you don’t want this…
Percona Monitoring and Management (PMM)
• http://pmmdemo.percona.com/
Thank you. Q&A?
colin.charles@percona.com / byte@bytebot.net
@bytebot on Twitter | http://www.bytebot.net/blog/
slides: slideshare.net/bytebot