Methods of Sharding MySQL
            Percona Live NYC 2012
Who are Palomino?
Bespoke Services: we work with and like you.
Production Experienced: senior DBAs, admins, and engineers.
24x7: globally-distributed on-call staff.
Short-term no-lock-in contracts.
Professional Services (DevOps):
 ➢ Chef,

 ➢ Puppet,

 ➢ Ansible.


Big Data Cluster Administration (OpsDev):
 ➢ MySQL, PostgreSQL,

 ➢ Cassandra, HBase,

 ➢ MongoDB, Couchbase.
Methods of Sharding MySQL
               Percona Live NYC 2012
Who am I?
Tim Ellis
CTO/Principal Architect, Palomino

Achievements:
 ➢ Palomino Big Data Strategy.

 ➢ Datawarehouse Cluster at Riot Games.

 ➢ Back-end Storage Architecture for Firefox Sync.

 ➢ Led DB teams at Digg for four years.

 ➢ Harassed the Reddit team at one of their parties.


Ensured Successful Business for:
 ➢ Digg, Friendster,

 ➢ Riot Games,

 ➢ Mozilla,

 ➢ StumbleUpon.
Methods of Sharding MySQL
         What is this Talk?
Large cluster admin: when one DB isn't enough.
 ➢ What is a shard?

 ➢ What shard types can I choose?

 ➢ How to build a large DB cluster.

 ➢ How to administer that giant mess of DBs.




Types of large clusters:
 ➢ Just a bunch of databases.

 ➢ Distributed database across machines.
Methods of Sharding MySQL
         Where the Focus will Lie
12% – Sharding theory/considerations.

25% – Building a Cluster to administer (tutorial):
 ➢ Palomino Cluster Tool.




50% – Flexible large-cluster administration (tutorial):
 ➢ Tumblr's Jetpants.




13% – Other sharding technologies (talk-only):
 ➢ Youtube's Vtocc (Vitess),

 ➢ Twitter's Gizzard,

 ➢ HAproxy.
Methods of Sharding MySQL
         What about the Silver Bullets?
NoSQL Distributed Databases:
➢ Promise “sharding” for free,

➢ Uptime and trivial horizontal scaling.




Reality:
➢ RDBMS is 40-yr-old tech,

➢ NoSQL is 10-yr-old tech,

➢ Which has been responsible for more high-profile
  downtimes in the past 10 years?
➢ Evaluate the alternatives without illusions.
Methods of Sharding MySQL
                      What is a Shard?
A location for a subset of data:
➢ Itself made of pieces.

➢ Typically itself redundant.



   [Diagram: three example shards (User Data, Logging Data, Posts
   Data), each a Master replicating to three Slaves.]
Methods of Sharding MySQL
         What are the Sharding Method Choices?
By-Function:
➢ Move busy tables onto new shard.

➢ Writes of busiest tables on new hardware.

➢ Writes of remaining tables on current.


By-Columns:
➢ Split table into chunks of related columns,

  store each set on its own Master/Slaves shard.
By-Rows:
➢ A table is split across N shards; each shard gets
  a subset of the rows of the table.
Methods of Sharding MySQL
         Shard Method Choices
By-function and By-Column Methods:
➢ Much easier.

➢ Can get you through months to years.

➢ Eventually you run out of options here.




By-Row Method:
➢ The hardest to do.

➢ Requires new ways of accessing data.

➢ Often requires sophisticated cache strategies.

➢ Itself can be done several ways.
Methods of Sharding MySQL
         By-Function Sharding
Picking a Functional Split:
 ➢ A subset of tables commonly joined.

 ➢ Tables outside this subset nearly never joined.

 ➢ One of them responsible for many writes.




Every JOIN that crosses the subset boundary must be
rewritten as code-based multi-SELECTs.

Once subset of tables moved onto their own
server, writes are distributed.
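The JOIN rewrite above can be sketched like this (a hypothetical example, not Palomino's code; sqlite3 stands in for the two MySQL pools, and the table and column names are invented; a MySQL driver would use %s placeholders instead of ?):

```python
import sqlite3

# After moving "posts" onto its own shard, this JOIN can no longer
# run server-side:
#   SELECT u.name, p.title FROM users u JOIN posts p ON p.user_id = u.id
# Instead the application issues one SELECT per pool and joins in code.
users_db = sqlite3.connect(":memory:")
users_db.execute("CREATE TABLE users (id INTEGER, name TEXT)")
users_db.execute("INSERT INTO users VALUES (1, 'alice')")

posts_db = sqlite3.connect(":memory:")
posts_db.execute("CREATE TABLE posts (user_id INTEGER, title TEXT)")
posts_db.execute("INSERT INTO posts VALUES (1, 'hello'), (1, 'world')")

def user_with_posts(user_id):
    # First SELECT hits the users pool...
    row = users_db.execute(
        "SELECT name FROM users WHERE id = ?", (user_id,)).fetchone()
    if row is None:
        return None
    # ...second SELECT hits the posts pool; the "join" happens here.
    titles = [t for (t,) in posts_db.execute(
        "SELECT title FROM posts WHERE user_id = ?", (user_id,))]
    return {"name": row[0], "posts": titles}
```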
Methods of Sharding MySQL
         By-Column Sharding (Vertical Partition)
Identifying candidate table:
 ➢ Many columns (“users” anyone?),

 ➢ Many updates,

 ➢ Many indexes.




Required: even split of columns/indexes by
update frequency. Attempt: logical grouping.

JOINs are neither possible nor desirable: write multi-
SELECT code in the application DAL.
Methods of Sharding MySQL
         Row-based Sharding Choices
Range-based Sharding:
➢ Easy to understand.

➢ Each shard gets a range of rows.

➢ Oft-times some shards are “hot.”

➢ Hot shards are split into separate shards.

➢ Cold shards are joined into a single shard.

➢ Juggling shard load is a frequent process.




Typically the best solution. Shortcomings have
known work-arounds.
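Range-based routing can be sketched in a few lines (shard names and boundaries here are invented for illustration):

```python
import bisect

# Each shard owns a contiguous, non-overlapping range of row keys.
SHARD_RANGES = [
    (1,      300000,    "shard-a"),
    (300001, 999999999, "shard-b"),
]
UPPER_BOUNDS = [hi for _, hi, _ in SHARD_RANGES]

def shard_for(row_id):
    # Find the first range whose upper bound is >= row_id.
    i = bisect.bisect_left(UPPER_BOUNDS, row_id)
    lo, hi, name = SHARD_RANGES[i]
    assert lo <= row_id <= hi, "row key outside every range"
    return name
```

Splitting a hot shard is then just replacing one entry in the list with two; merging cold shards is the reverse. That is why the juggling is routine rather than painful.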
Methods of Sharding MySQL
         Row-based Sharding Choices
Modulus/Hash-based Sharding:
➢ Row key is hashed to integer modulo number

  of shards, then placed on that shard.
➢ Only rarely are some shards “hot.”

➢ Shard splitting is difficult to implement.




Also a common method of sharding. We hope
not to split shards often (or ever).

When we do, it's a multi-week process.
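A hedged sketch of modulus/hash routing, and of why splitting hurts (the hash choice is illustrative; real systems pick any stable hash):

```python
import hashlib

def shard_for(row_key, n_shards):
    # Stable hash of the key, reduced modulo the shard count.
    digest = hashlib.md5(str(row_key).encode()).hexdigest()
    return int(digest, 16) % n_shards

# Why splits are a multi-week project: growing from 4 to 5 shards
# remaps most keys, so nearly every row has to move (roughly 80%
# of keys change shards under a uniform hash).
moved = sum(1 for k in range(10000) if shard_for(k, 4) != shard_for(k, 5))
```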
Methods of Sharding MySQL
         Row-based Sharding Choices
Lookup Table-based Sharding:
 ➢ Easy to understand.

 ➢ Row key mapped to shard in a lookup table.

 ➢ Easy to move load off hot shards.

 ➢ Lookup table method is problematic:

   ➢ Single point of failure.
   ➢ Performance bottleneck.


   ➢ Billions of rows, itself may need sharding.
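Lookup-table routing in miniature (a sketch with invented names; a dict stands in for the lookup table, which in production is itself a database):

```python
# Explicit key -> shard mapping: moving load off a hot shard is a
# one-row update. The flip side: this table is a single point of
# failure, a bottleneck, and can itself grow to billions of rows.
shard_map = {}
DEFAULT_SHARD = "shard-a"

def shard_for(user_id):
    return shard_map.get(user_id, DEFAULT_SHARD)

def move_user(user_id, new_shard):
    # In real life: copy the rows to new_shard first, then flip
    # the pointer.
    shard_map[user_id] = new_shard
```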
Prerequisite: Build a Large Cluster
         Allocating the Hardware
Getting Hardware – your own company's:
➢ Can be politically-charged.

➢ Get a small batch first.

➢ Build small demonstration cluster.

➢ Get everyone on-board with the demo.


Renting/Leasing Hardware – the Cloud:
➢ Allocate hardware in EC2 or elsewhere.

➢ Usually easier, but possibly harder admin:

   ➢ Hardware failure more common.
   ➢ Hardware/network flakiness more common.
Prerequisite: Build a Large Cluster
        Building the Cluster




Okay, I've got the hardware. What next?
Prerequisite: Build a Large Cluster
         Building the Cluster
Configuring the Hardware. The old dilemma:
➢ Spend days to install/configure DB software?


  Subsequent management is painful.
➢ Use SSH in “for” loops?


  Rolling your own configuration management
  tools is a lot of work.
➢ Learn a configuration management tool?


  Obvious choice in 2012. Well-documented
  tools like Chef, Puppet, Ansible.
Configuration Management Tools
         My Experience
Puppet: 6 years ago at Digg
 ➢ Manage/Deploy of hundreds of servers.

 ➢ Painful, but not as bad as hand-coding it all.


Chef: 2 years ago at Drawn to Scale and Riot
 ➢ Manage/Deploy dozens of servers.

 ➢ Learning Ruby is a “joy” of its own.


Ansible: 6 months ago at Palomino
 ➢ Manage/Deploy dozens of servers.

 ➢ First Palomino Cluster Tool subset built.
Prerequisite: Build a Large Cluster
         Configuration Management Options
Pick your Configuration Management:
 ➢ Chef: Popular, use Ruby to “code your

   infrastructure.” Must learn Ruby.
 ➢ Puppet: Mature, use data structures to “define

   your infrastructure.” Less coding.
 ➢ Ansible: Tiny and modular, similar to Puppet,

   but with ordering for deployment. Pragmatic.
Write/Get Recipes, Manifests, Playbooks?
 ➢ Writing is tedious. Can take >1 week.

 ➢ Get from internet? Often incomplete.
Prerequisite: Build a Large Cluster
               The Palomino Cluster Tool
Palomino's tool for building large DB clusters:
 ➢ Chef, Puppet, Ansible modules.

 ➢ Open-source on Github.

     ➢   https://github.com/time-palominodb/PalominoClusterTool
     ➢   Google: “Palomino Cluster Tool.”
➢   Will build a large cluster for you in hours:
     ➢ Master(s)
     ➢ Slaves – hundreds of them as easy as two.


     ➢ MHA – when master fails, a slave takes over.


➢   Previously this would take days.
The Palomino Cluster Tool
         Building the Management Node
Cluster Management Node:
➢ Will build the initial cluster.

➢ Will do subsequent cluster management.




Tool for Initial Cluster Build:
 ➢ Palomino Cluster Tool (Ansible subset).




Tool for Cluster Management:
 ➢ Jetpants (Ruby).
The Palomino Cluster Tool
           Building the Management Node
Palomino Cluster Tool (Ansible subset).

Why Ansible?
➢ No server to set up, simply uses SSH.

➢ Easy-to-understand non-code Playbooks.

➢ Use a language you know for modules.

➢ For demo purposes, obvious choice.

➢ Also production-worthy:

   ➢   Built by Michael DeHaan, long-time
       configuration management guru.
The Palomino Cluster Tool
          Building the Management Node
Management node lives alongside your cluster.
➢ We are building our cluster in EC2.

➢ Thus management node in EC2.

➢ This tutorial assumes Ubuntu 12.04.

➢ t1.micro is fine for management node.




Install basic tools:
 ➢ apt-get install git (for Ansible/P.C.T.)

 ➢ apt-get install make python-jinja2 (for Ansible)
The Palomino Cluster Tool
         Configuring the Management Node
Install Ansible:
 ➢ git clone git://github.com/ansible/ansible.git

 ➢ make install




Install Palomino Cluster Tool:
 ➢ git clone git://github.com/time-palominodb/PalominoClusterTool.git

I think we just finished the management node!
The Palomino Cluster Tool
         Allocating Shard Nodes
Shard nodes:
 ➢ m1.small or larger: at least 1.6GB RAM,

 ➢ :3306, :80, and :22 open between all (one

   security group in EC2),
 ➢ Ubuntu 12.04 (other Debian-alikes at your

   own risk – but may work!).

Do not need OS/database configuration:
➢ Ansible will configure them.
The Palomino Cluster Tool
            Building the First Shard – Step 1
  From README: edit IP addresses in cluster
  layout file (PalominoClusterToolLayout.ini):
# Alerting/Trending -----
[alertmaster]
10.252.157.110
[trendmaster]
10.252.157.110

# Servers -----
[mhamanager]
10.252.157.110


  This section identical for all Shards.
The Palomino Cluster Tool
            Building the First Shard – Step 2
  From README: edit IP addresses in cluster
  layout file (PalominoClusterToolLayout.ini):
[mysqlmasters]
10.244.17.6

[mysqlslaves]
10.244.26.199
10.244.18.178

[mysqls:vars]
master_host=10.244.17.6


  This section different for every Shard.
The Palomino Cluster Tool
            Building the First Shard – Step 3
  Run setup command to put configuration and
  SSH keys into /etc:
$ cd PalominoClusterTool/AnsiblePlaybooks/Ubuntu-12.04
$ ./00-Setup_PalominoClusterTool.sh ShardA


  Run build command – it's a wrapper around
  Ansible Playbooks:
$ ./10-MySQL_MHA_Manager.sh ShardA
The Palomino Cluster Tool
            Building the Second Shard
  Just make one shard with a master and many
  slaves. In real life, you might do something like
  this instead:
for i in ShardB ShardC ShardD ; do
  (manual step):
  vim PalominoClusterToolLayout.ini
  (scriptable steps):
  ./00-Setup_PalominoClusterTool.sh $i
  ./10-MySQL_MHA_Manager.sh $i
done


  Run them in separate terminals to save time.
Make the Cluster Real
              Data makes Shard Split Interesting
    Fill ShardA using random data script.*

    Palomino Cluster Tool includes such a tool.
     ➢ HelperScripts/makeGiantDatafile.pl



$   ssh root@sharda-master
#   cd PalominoClusterTool/HelperScripts
#   mysql -e 'create database palomino'
#   ./makeGiantDatafile.pl 1200000 3 | mysql -f palomino


    Install Jetpants, do shard split now.
    * Be sure /var/lib/mysql is on large partition!
Administering the Cluster
              Install Jetpants
    General idea: Install Ruby >=1.9.2 and
    RubyGems, then Jetpants via RubyGems.

On my systems, /etc/alternatives is always
incorrect, so ln the proper binaries for Jetpants:
#   apt-get install ruby1.9.3 rubygems libmysqlclient-dev
#   ln -sf /usr/bin/ruby1.9.3 /etc/alternatives/ruby
#   ln -sf /usr/bin/gem1.9.3 /etc/alternatives/gem
#   gem install jetpants
Administering the Cluster
              Configure Jetpants
General idea: edit /etc/jetpants.yaml, then create
the Jetpants inventory and application configuration
files and chown them to the Jetpants user:
#   vim /etc/jetpants.yaml
#   mkdir -p /var/jetpants
#   touch /var/jetpants/assets.json
#   chown jetpantsusr: /var/jetpants/assets.json
#   mkdir -p /var/www
#   touch /var/www/databases.yaml
#   chown jetpantsusr: /var/www/databases.yaml
Administering the Cluster
              Jetpants Shard Splits
  Tell Jetpants Console about your ShardA:
Jetpants> s = Shard.new(1, 999999999, '10.12.34.56',
:ready) #10.12.34.56==ShardA master
Jetpants> s.sync_configuration


  Create spares within Console for all others
  (improved workflow in Jetpants 0.7.8):
Jetpants>   topology.tracker.spares << '10.23.45.67'
Jetpants>   topology.tracker.spares << '10.23.45.68'
Jetpants>   topology.tracker.spares << '10.23.45.69'
Jetpants>   topology.write_config
Jetpants>   topology.update_tracker_data
Administering the Cluster
           Jetpants Shard Splits
Just for this tutorial:
 ➢ Create the “palomino” database,

 ➢ Break the replication on all the spares,

 ➢ Be sure spares are read/write:

     ➢ Edit my.cnf,
     ➢ service mysql restart


➢   Ensure “jetpants pools” proper:
     ➢ One master,
     ➢ Two slaves.
Administering the Cluster
            Jetpants Shard Splits
  How to perform an actual Shard Split:
$ jetpants shard_split --min-id=1 --max-id=999999999


  Notes:
  ➢ Process takes hours. Use screen or nohup.

  ➢ LeftID == parent's first, RightID == parent's

    last, no overlap/gap.
  ➢ Make children 1-300000,300001-999999999.
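The child-range rule above can be written as a quick sanity check (a sketch, not Jetpants code): the children must tile the parent exactly.

```python
def split_range(lo, hi, cut):
    # Parent [lo, hi] becomes children [lo, cut] and [cut+1, hi]:
    # contiguous, no overlap, no gap.
    assert lo <= cut < hi
    return (lo, cut), (cut + 1, hi)

left, right = split_range(1, 999999999, 300000)
```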
Jetpants Shard Splitting
                         The Gory Details
       After “jetpants shard_split”:
ubuntu@ip-10-252-157-110:~$ jetpants pools
shard-1-999999999 [3GB]
master          = 10.244.136.107 ip-10-244-136-107
standby slave 1 = 10.244.143.195 ip-10-244-143-195
standby slave 2 = 10.244.31.91 ip-10-244-31-91
shard-1-400000 (state: replicating) [2GB]
master          = 10.244.144.183 ip-10-244-144-183
shard-400001-999999999 (state: replicating) [1GB]
master          = 10.244.146.27 ip-10-244-146-27

   0   global pools
   3   shard pools
----   --------------
   3   total pools

   3   masters
   0   active slaves
   2   standby slaves
   0   backup slaves
----   --------------
   5   total nodes
Jetpants Improvements
         The Result of an Experiment
Jetpants only well-tested on RHEL/CentOS.

Palomino Cluster Tool only well-tested to build
Ubuntu 12.04 clusters.

Little effort to fix Jetpants:
 ➢ /sbin/service location different,

 ➢ service mysql status output different.
Jetpants Improvements
         The Result of an Experiment
Jetpants only well-tested on MySQL 5.1.

I built a cluster of MySQL 5.5.

A little more effort to fix Jetpants:
➢ Setting master_host=' ' is a syntax error,

➢ reset slave needs keyword “all” appended.
Jetpants Improvements
         The Result of an Experiment
Jetpants only well-tested on large datasets.

I built a cluster with only hundreds of MB.

A wee tad more effort to fix Jetpants:
➢ Some timings assumed large datasets,

➢ Edge cases for small/quick operations

  reported back to the author.
Jetpants Improvements
         OSS Collaboration and Win
Evan Elias implemented these fixes last week!
 ➢ jetpants add_pool,

 ➢ jetpants add_shard,

 ➢ jetpants add_spare (with sanity-check spare),

 ➢ Shards with 1 slave (not for prod!),

 ➢ read_only spares not fatal,

 ➢ Debian-alike (Ubuntu) fixes,

 ➢ MySQL 5.5 fixes,

 ➢ Mid-split Jetpants pools output simpler.


Really responsive ownership of project!
Twitter's Gizzard
             What is it?
A general framework for distributed databases.
➢ Hides sharding from you.

➢ Literally, it is middleware.

    ➢ Applications connect to Gizzard,
    ➢ Gizzard sends connections to proper place,


    ➢ Shard splits and hardware failure taken care of.


➢ Created at Twitter by rogue cowboys.
➢ Not completely production-ready.

    ➢   Better than rolling your own!
Twitter's Gizzard
         Why should I use it?
You've settled on row-based partition scheme:
 ➢ Master nearing I/O capacity, won't scale up,

 ➢ Can't move some tables to their own pool,

 ➢ Can't split the columns/indexes out,

 ➢ You want to keep using the DBMS you

   already know and love: Percona Server.*
 ➢ Don't want to think about fault-tolerance or

   shard splits (much),

* Actually use any storage back-end.
Twitter's Gizzard
         The Fine Print
This sounds perfect. Why not Gizzard?

Writes must follow strict diet. Must be:
➢ Idempotent*,

➢ Commutative**,

➢ Must not have tuberculosis.




* Pfizer cannot remove the idempotency
requirement of Gizzard.
** Even on evenings and weekends.
Twitter's Gizzard
         Expanding the Fine Print
Idempotency:
 ➢ Submit a write. Again. And again.

 ➢ Must be identical to doing it once.

 ➢ Bad: “update set col = col + 1”




Commutative – writes in arbitrary order:
➢ WriteA→WriteB→WriteC on Node1.

➢ WriteB→WriteC→WriteA on Node2.

➢ Bad: “update set col1 = 42”→“update set

  col2 = col1 + 5”
Twitter's Gizzard
         Expanding the Fine Print
Cluster is Eventually Consistent:
➢ May return old values for reads.

➢ Unknown when consistency will occur.




Like a politician's position on the budget:
 ➢ Might be consistent in the future.

 ➢ Just not right now.

 ➢ Or now.
Twitter's Gizzard
           Working Around the Shortcomings
Gizzard work-around:
➢ Add timestamp to every transaction.

➢ Good:

     ➢ “col1.ts=1; update set col1=42” →
     ➢ “update set col2=col1 + 5 where col1.ts=1”


➢   Implementation trickier if DBMS doesn't
    support column attributes.

Cannot escape: must radically re-think schema
and application/DBMS interaction.
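The timestamp work-around can be sketched as last-write-wins cells (a hypothetical illustration, not Gizzard's implementation): every write carries a timestamp and only lands if it is newer than what is stored, so replays (idempotency) and reordering (commutativity) converge to the same state.

```python
cells = {}  # column name -> (timestamp, value)

def write(col, ts, value):
    # Apply the write only if it is newer than the stored one.
    current = cells.get(col)
    if current is None or ts > current[0]:
        cells[col] = (ts, value)

write("col1", 1, 42)
write("col2", 2, 47)   # "col1 + 5" computed at write time, ts attached
write("col1", 1, 42)   # replayed write: no-op
```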
Twitter's Gizzard
             Trying it Out
I'm convinced! How do I begin?
 ➢ Learn Scala.

 ➢ Clone “rowz” from Github.

    ➢   https://github.com/twitter/Rowz
➢ Modify it to suit your needs.
➢ Learn how it interacts with existing tools.

➢ Write new monitoring/alerting plugins.

➢ Write unit tests!

➢ You should OSS it to help with overhead.
Twitter's Gizzard
          Trying it Out
Sounds daunting. Maybe I'll roll my own?

Learn from others' mistakes:
 ➢ Digg: 2 engineers 6 months. Code thrown

   away. Digg out of business.
 ➢ Countless identical stories in Silicon Valley.




NIHS (not-invented-here syndrome) == go out of business*.

* 8-figure R&D budgets excepted.
Youtube's Vitess/Vtocc
         What is it?
Vitess is a library. Vtocc is an implementation
using it.

Vtocc is another middleware solution.
➢ Sharding,

➢ Caching,

➢ Connection-pooling,

➢ In-use at Youtube,

➢ Built-in fail-safe features.
Youtube's Vtocc
         Why use it?
Proven high-volume sharding solution.

Interesting feature-list:
 ➢ Auto query/transaction over-limit killing.

 ➢ Better query-cache implementation.

 ➢ Query comment-stripping for query cache.

 ➢ Query consolidation.

 ➢ Zero downtime restarts.




Less coding than Gizzard (more plug-in).
Youtube's Vtocc
         Hold on, Zero Downtime Restarts?
Just start new Vtocc instance.
 ➢ Instance1 passes new requests to Instance2,

 ➢ Instance1's connections get 30s to complete,

 ➢ Instance2 kills Instance1 and takes over.




   [Diagram: Vtocc Instance 2 starts alongside Vtocc Instance 1, then
   takes over its traffic.]
Youtube's Vtocc
          The Fine Print
Requires Particular Primary Keys:
➢ varbinary datatype,

➢ Choose carefully to prevent hot-spots.




Max result-set size: larger resultsets fail.

Additional administration burden:
➢ “My query was killed. Why?”

➢ Middleware adds spooky hard-to-diagnose

  failure modes.
Youtube's Vtocc
                 Implementation Details
➢   Run Vtocc on same server as MySQL.
➢   Configure Vtocc fail-safes for expected load:
    ➢ Pool Size (connection count),

    ➢ Max Transactions (has own connection pool),

    ➢ Query Timeout (before killed),

    ➢ Transaction Timeout (before killed),

    ➢ Max Resultset Size in rows

        ➢   Go language doesn't free allocated memory, so
            pick this value carefully.
➢   More details: http://code.google.com/p/vitess/wiki/Operations
HAproxy
        Re-thinking Proxy Topology
Old-school Proxy Topology:
➢ DB Clients on one side,

➢ DB Servers on the other,

➢ Proxy in-between.




                 Single Point of Failure
HAproxy
         Re-thinking Proxy Topology
Free proxy provides new architecture option:
 ➢ Proxy on every DB client node.

 ➢ Good-bye single-point-of-failure.

 ➢ Hello configuration management for proxy.



   [Diagram: an HAproxy instance runs on every DB client node, each
   routing connections to the DB servers.]
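A minimal per-client haproxy.cfg fragment might look like this (a sketch: the listen name is invented and the IPs reuse the example addresses from earlier slides; a real deployment would tune checks and add a separate listener for slaves):

```
# App on this node connects to 127.0.0.1:3306;
# HAproxy forwards to the shard's MySQL servers.
listen mysql-shard-a
    bind 127.0.0.1:3306
    mode tcp
    balance roundrobin
    server master 10.244.17.6:3306 check
```

Because this file lives on every client node, changing the pool means pushing new config everywhere, which is exactly why the slide pairs this topology with configuration management.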
Methods of Sharding MySQL
         Q&A
Questions? Suggestions:
➢ Interesting stuff. Got a job for me?

➢ Well I got a job for you. Interested?

➢ Warn me next time so I can sleep in the back

  row.
➢ Was that a question?




Thank you! Emails to domain palominodb,
username time. Percona Live 2012 in New York
City. Enjoy the rest of the show!

More Related Content

PDF
Scaling MySQL in Amazon Web Services
PDF
High-Availability using MySQL Fabric
PPTX
MySQL Multi Master Replication
PDF
HTTP Plugin for MySQL!
PDF
Galera Cluster for MySQL vs MySQL (NDB) Cluster: A High Level Comparison
ODP
MySQL Group Replication
ODP
MySQL 5.7 clustering: The developer perspective
PPTX
High Availability with MariaDB Enterprise
Scaling MySQL in Amazon Web Services
High-Availability using MySQL Fabric
MySQL Multi Master Replication
HTTP Plugin for MySQL!
Galera Cluster for MySQL vs MySQL (NDB) Cluster: A High Level Comparison
MySQL Group Replication
MySQL 5.7 clustering: The developer perspective
High Availability with MariaDB Enterprise

What's hot (20)

ODP
MySQL 5.7 Fabric: Introduction to High Availability and Sharding
PPTX
MariaDB Galera Cluster
PDF
Mysql User Camp : 20-June-14 : Mysql Fabric
ODP
Data massage: How databases have been scaled from one to one million nodes
PDF
Webinar slides: Introduction to Database Proxies (for MySQL)
PDF
Building Scalable High Availability Systems using MySQL Fabric
PDF
MySQL Fabric Tutorial, October 2014
PDF
Running Galera Cluster on Microsoft Azure
PDF
MySQL Group Replication - an Overview
ODP
Built-in query caching for all PHP MySQL extensions/APIs
ODP
NoSQL in MySQL
PDF
Choosing a MySQL High Availability solution - Percona Live UK 2011
PDF
MySQL High Availability Solutions
PDF
Introduction to Galera
PPTX
MySQL Options in OpenStack
PDF
Mysql User Camp : 20th June - Mysql New Features
PPTX
Choosing between Codership's MySQL Galera, MariaDB Galera Cluster and Percona...
PPTX
Tips to drive maria db cluster performance for nextcloud
PDF
MySQL High Availability and Disaster Recovery with Continuent, a VMware company
PDF
Using MySQL in Automated Testing
MySQL 5.7 Fabric: Introduction to High Availability and Sharding
MariaDB Galera Cluster
Mysql User Camp : 20-June-14 : Mysql Fabric
Data massage: How databases have been scaled from one to one million nodes
Webinar slides: Introduction to Database Proxies (for MySQL)
Building Scalable High Availability Systems using MySQL Fabric
MySQL Fabric Tutorial, October 2014
Running Galera Cluster on Microsoft Azure
MySQL Group Replication - an Overview
Built-in query caching for all PHP MySQL extensions/APIs
NoSQL in MySQL
Choosing a MySQL High Availability solution - Percona Live UK 2011
MySQL High Availability Solutions
Introduction to Galera
MySQL Options in OpenStack
Mysql User Camp : 20th June - Mysql New Features
Choosing between Codership's MySQL Galera, MariaDB Galera Cluster and Percona...
Tips to drive maria db cluster performance for nextcloud
MySQL High Availability and Disaster Recovery with Continuent, a VMware company
Using MySQL in Automated Testing
Ad

Viewers also liked (20)

PDF
MySQL Sharding: Tools and Best Practices for Horizontal Scaling
PPTX
Distributed RDBMS: Data Distribution Policy: Part 3 - Changing Your Data Dist...
PPTX
ScaleBase Webinar: Scaling MySQL - Sharding Made Easy!
PDF
Sharding using MySQL and PHP
PPTX
Distributed RDBMS: Data Distribution Policy: Part 2 - Creating a Data Distrib...
PDF
사례를 통해 알아보는 IoT 분석 플랫폼 요건
PDF
Getting Started with PL/Proxy
PDF
MySQL High-Availability and Scale-Out architectures
PDF
MySQL Proxy: Architecture and concepts of misuse
PPTX
MySQL Fabric: High Availability using Python/Connector
PDF
High Availability with MySQL
PDF
MySQL Performance Tuning
PDF
MySQL highav Availability
PDF
MySQL Proxy. From Architecture to Implementation
PDF
DIY: A distributed database cluster, or: MySQL Cluster
PDF
MySQL Proxy tutorial
PPTX
Distributed RDBMS: Data Distribution Policy: Part 1 - What is a Data Distribu...
KEY
Inside PyMongo - MongoNYC
PDF
MySQL HA Solutions
PDF
MySQL Proxy. A powerful, flexible MySQL toolbox.
MySQL Sharding: Tools and Best Practices for Horizontal Scaling
Distributed RDBMS: Data Distribution Policy: Part 3 - Changing Your Data Dist...
ScaleBase Webinar: Scaling MySQL - Sharding Made Easy!
Sharding using MySQL and PHP
Distributed RDBMS: Data Distribution Policy: Part 2 - Creating a Data Distrib...
사례를 통해 알아보는 IoT 분석 플랫폼 요건
Getting Started with PL/Proxy
MySQL High-Availability and Scale-Out architectures
MySQL Proxy: Architecture and concepts of misuse
MySQL Fabric: High Availability using Python/Connector
High Availability with MySQL
MySQL Performance Tuning
MySQL highav Availability
MySQL Proxy. From Architecture to Implementation
DIY: A distributed database cluster, or: MySQL Cluster
MySQL Proxy tutorial
Distributed RDBMS: Data Distribution Policy: Part 1 - What is a Data Distribu...
Inside PyMongo - MongoNYC
MySQL HA Solutions
MySQL Proxy. A powerful, flexible MySQL toolbox.
Ad

Similar to Methods of Sharding MySQL (20)

PDF
Massively sharded my sql at tumblr presentation
PDF
Evan Ellis "Tumblr. Massively Sharded MySQL"
PDF
Scaling MySQL -- Swanseacon.co.uk
PDF
Scaling MySQl 1 to N Servers -- Los Angelese MySQL User Group Feb 2014
ODP
MySQL And Search At Craigslist
PPTX
Tech Talk Series, Part 2: Why is sharding not smart to do in MySQL?
KEY
MongoDB vs Mysql. A devops point of view
PDF
MySQL cluster 72 in the Cloud
PPTX
Database highload solutions
KEY
Scaling MongoDB (Mongo Austin)
PPTX
Database highload solutions
PDF
How We Scaled Freshdesk To Take 65M Requests/week
PDF
Scaling-MongoDB-with-Horizontal-and-Vertical-Sharding Mydbops Opensource Data...
PDF
Scaling MongoDB with Horizontal and Vertical Sharding
ODP
MySQL? Load? Clustering! Balancing! PECL/mysqlnd_ms 1.4
ODP
MySQL HA with PaceMaker
PDF
Become a MySQL DBA - webinar series - slides: Which High Availability solution?
PDF
MySQL Conference 2011 -- The Secret Sauce of Sharding -- Ryan Thiessen
PDF
Drupal Con My Sql Ha 2008 08 29
ODP
MySQL HA Alternatives 2010
Massively sharded my sql at tumblr presentation
Evan Ellis "Tumblr. Massively Sharded MySQL"
Scaling MySQL -- Swanseacon.co.uk
Scaling MySQl 1 to N Servers -- Los Angelese MySQL User Group Feb 2014
MySQL And Search At Craigslist
Tech Talk Series, Part 2: Why is sharding not smart to do in MySQL?
MongoDB vs Mysql. A devops point of view
MySQL cluster 72 in the Cloud
Database highload solutions
Scaling MongoDB (Mongo Austin)
Database highload solutions
How We Scaled Freshdesk To Take 65M Requests/week
Scaling-MongoDB-with-Horizontal-and-Vertical-Sharding Mydbops Opensource Data...
Scaling MongoDB with Horizontal and Vertical Sharding
MySQL? Load? Clustering! Balancing! PECL/mysqlnd_ms 1.4
MySQL HA with PaceMaker
Become a MySQL DBA - webinar series - slides: Which High Availability solution?
MySQL Conference 2011 -- The Secret Sauce of Sharding -- Ryan Thiessen
Drupal Con My Sql Ha 2008 08 29
MySQL HA Alternatives 2010

More from Laine Campbell (10)

PDF
Recruiting for diversity in tech
PDF
Database engineering
PDF
Velocity pythian operational visibility
PPTX
Pythian operational visibility
PDF
RDS for MySQL, No BS Operations and Patterns
PDF
Running MySQL in AWS
PDF
An Introduction To Palomino
PDF
Hybrid my sql_hadoop_datawarehouse
PDF
CouchConf SF 2012 Lightning Talk - Operational Excellence
PPT
Understanding MySQL Performance through Benchmarking
Recruiting for diversity in tech
Database engineering
Velocity pythian operational visibility
Pythian operational visibility
RDS for MySQL, No BS Operations and Patterns
Running MySQL in AWS
An Introduction To Palomino
Hybrid my sql_hadoop_datawarehouse
CouchConf SF 2012 Lightning Talk - Operational Excellence
Understanding MySQL Performance through Benchmarking

Recently uploaded (20)

PPTX
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
PPTX
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
PDF
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
PDF
NewMind AI Monthly Chronicles - July 2025
PDF
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
PDF
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
PPTX
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
PDF
CIFDAQ's Market Insight: SEC Turns Pro Crypto
PDF
Advanced methodologies resolving dimensionality complications for autism neur...
PDF
Dropbox Q2 2025 Financial Results & Investor Presentation
PDF
Review of recent advances in non-invasive hemoglobin estimation
PDF
Modernizing your data center with Dell and AMD
PDF
KodekX | Application Modernization Development
PDF
Unlocking AI with Model Context Protocol (MCP)
PDF
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
PPTX
Understanding_Digital_Forensics_Presentation.pptx
PDF
Encapsulation_ Review paper, used for researhc scholars
PDF
Diabetes mellitus diagnosis method based random forest with bat algorithm
PDF
Empathic Computing: Creating Shared Understanding
PPT
Teaching material agriculture food technology
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
NewMind AI Monthly Chronicles - July 2025
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
CIFDAQ's Market Insight: SEC Turns Pro Crypto
Advanced methodologies resolving dimensionality complications for autism neur...
Dropbox Q2 2025 Financial Results & Investor Presentation
Review of recent advances in non-invasive hemoglobin estimation
Modernizing your data center with Dell and AMD
KodekX | Application Modernization Development
Unlocking AI with Model Context Protocol (MCP)
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
Understanding_Digital_Forensics_Presentation.pptx
Encapsulation_ Review paper, used for researhc scholars
Diabetes mellitus diagnosis method based random forest with bat algorithm
Empathic Computing: Creating Shared Understanding
Teaching material agriculture food technology

Methods of Sharding MySQL
         What is a Shard?
A location for a subset of data:
 ➢ Itself made of pieces.
 ➢ Typically itself redundant.

Diagram: three shards (User Data, Logging Data, Posts Data), each a Master with three Slaves.
Methods of Sharding MySQL
         What are the Sharding Method Choices?
By-Function:
 ➢ Move busy tables onto a new shard.
 ➢ Writes of the busiest tables go to new hardware.
 ➢ Writes of the remaining tables stay on current hardware.

By-Columns:
 ➢ Split a table into chunks of related columns; store each set on its own Master/Slaves shard.

By-Rows:
 ➢ A table is split into N shards; each shard gets a subset of the rows of the table.
Methods of Sharding MySQL
         Shard Method Choices
By-Function and By-Column Methods:
 ➢ Much easier.
 ➢ Can get you through months to years.
 ➢ Eventually you run out of options here.

By-Row Method:
 ➢ The hardest to do.
 ➢ Requires new ways of accessing data.
 ➢ Often requires sophisticated cache strategies.
 ➢ Itself can be done several ways.
Methods of Sharding MySQL
         By-Function Sharding
Picking a Functional Split:
 ➢ A subset of tables commonly joined.
 ➢ Tables outside this subset nearly never joined.
 ➢ One of them responsible for many writes.

Every JOIN against a table outside the subset must be rewritten as multiple SELECTs in application code.

Once the subset of tables is moved onto its own server, writes are distributed.
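The JOIN-rewriting step can be sketched in application code. A minimal, hypothetical Python sketch: the two dicts stand in for separate MySQL servers after a by-function split; in production each lookup would be a SELECT over its own connection. All names and data here are illustrative, not from the deck.

```python
# Two "shards" after a by-function split: users and posts now live on
# different servers, so SQL can no longer JOIN them.
users_shard = {1: {"name": "alice"}, 2: {"name": "bob"}}
posts_shard = {101: {"user_id": 1, "title": "hello"},
               102: {"user_id": 2, "title": "world"}}

def posts_with_authors():
    # First "SELECT" hits the posts shard...
    posts = list(posts_shard.values())
    # ...then a second "SELECT ... WHERE id IN (...)" hits the users shard,
    # and the JOIN happens here in application code instead of in MySQL.
    user_ids = {p["user_id"] for p in posts}
    names = {uid: users_shard[uid]["name"] for uid in user_ids}
    return [{"title": p["title"], "author": names[p["user_id"]]} for p in posts]
```

The same two-step pattern applies wherever a cross-shard JOIN used to be.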
Methods of Sharding MySQL
         By-Column Sharding (Vertical Partitioning)
Identifying a candidate table:
 ➢ Many columns (“users,” anyone?),
 ➢ Many updates,
 ➢ Many indexes.

Required: an even split of columns/indexes by update frequency.
Attempt: logical grouping.

JOINs are neither possible nor desirable: write multi-SELECT code in the application DAL.
Methods of Sharding MySQL
         Row-based Sharding Choices
Range-based Sharding:
 ➢ Easy to understand.
 ➢ Each shard gets a range of rows.
 ➢ Oft-times some shards are “hot.”
 ➢ Hot shards are split into separate shards.
 ➢ Cold shards are merged into a single shard.
 ➢ Juggling shard load is a frequent process.

Typically the best solution. Shortcomings have known work-arounds.
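Range-based routing can be sketched in a few lines. A hypothetical Python sketch (the boundaries and shard names are illustrative, not from any Palomino tooling); splitting a hot shard is just inserting a new boundary:

```python
import bisect

# Each shard owns a contiguous range of row keys. `boundaries` holds the
# inclusive upper end of each shard's range, kept sorted.
boundaries = [300000, 999999999]
shards = ["shard-1-300000", "shard-300001-999999999"]

def shard_for(row_id):
    # bisect_left finds the first boundary >= row_id, i.e. the owning shard.
    return shards[bisect.bisect_left(boundaries, row_id)]
```

Because routing is an ordered search rather than arithmetic, resizing one shard's range leaves every other shard's rows where they are.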
Methods of Sharding MySQL
         Row-based Sharding Choices
Modulus/Hash-based Sharding:
 ➢ Row key is hashed to an integer modulo the number of shards; the row is placed on that shard.
 ➢ Only rarely are some shards “hot.”
 ➢ Shard splitting is difficult to implement.

Also a common method of sharding. We hope not to split shards often (or ever). When we do, it's a multi-week process.
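A minimal sketch of the hash-modulo placement (md5 is just a stable stand-in; any uniform hash works). It also shows why splitting is painful: changing the shard count remaps most keys, which is what turns a split into a multi-week data move.

```python
import hashlib

def shard_for(row_key, num_shards):
    # Hash the key to a big integer, then take it modulo the shard count.
    h = int(hashlib.md5(str(row_key).encode()).hexdigest(), 16)
    return h % num_shards

def keys_remapped(keys, old_n, new_n):
    # How many keys land on a different shard after changing the modulus.
    return sum(1 for k in keys if shard_for(k, old_n) != shard_for(k, new_n))
```

With a range scheme, adding a shard moves only the split-off range; here, going from N to N+1 shards reshuffles roughly N/(N+1) of all keys.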
Methods of Sharding MySQL
         Row-based Sharding Choices
Lookup Table-based Sharding:
 ➢ Easy to understand.
 ➢ Row key mapped to shard in a lookup table.
 ➢ Easy to move load off hot shards.
 ➢ The lookup table itself is problematic:
   ➢ Single point of failure.
   ➢ Performance bottleneck.
   ➢ Billions of rows; may itself need sharding.
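The lookup-table idea in miniature, as a hypothetical Python sketch (in production the map would itself be a table, which is exactly the SPOF/bottleneck the slide warns about):

```python
# Explicit key -> shard map. Moving a hot key off a busy shard is a single
# map update (after the rows have been copied over).
shard_map = {"user:1": "shardA", "user:2": "shardA", "user:3": "shardB"}

def shard_for(row_key):
    return shard_map[row_key]

def move_key(row_key, new_shard):
    # Assumes the data copy has already happened; this just flips the pointer.
    shard_map[row_key] = new_shard
```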
Prerequisite: Build a Large Cluster
         Allocating the Hardware
Getting Hardware – your own company's:
 ➢ Can be politically charged.
 ➢ Get a small batch first.
 ➢ Build a small demonstration cluster.
 ➢ Get everyone on board with the demo.

Renting/Leasing Hardware – the Cloud:
 ➢ Allocate hardware in EC2 or elsewhere.
 ➢ Usually easier, but possibly harder admin:
   ➢ Hardware failure more common.
   ➢ Hardware/network flakiness more common.
Prerequisite: Build a Large Cluster
         Building the Cluster
Okay, I've got the hardware. What next?
Prerequisite: Build a Large Cluster
         Building the Cluster
Configuring the Hardware. The old dilemma:
 ➢ Spend days to install/configure DB software? Subsequent management is painful.
 ➢ Use SSH in “for” loops? Rolling your own configuration management tools is a lot of work.
 ➢ Learn a configuration management tool? Obvious choice in 2012. Well-documented tools like Chef, Puppet, Ansible.
Configuration Management Tools
         My Experience
Puppet: 6 years ago at Digg.
 ➢ Managed/deployed hundreds of servers.
 ➢ Painful, but not as bad as hand-coding it all.

Chef: 2 years ago at Drawn to Scale and Riot.
 ➢ Managed/deployed dozens of servers.
 ➢ Learning Ruby is a “joy” of its own.

Ansible: 6 months ago at Palomino.
 ➢ Managed/deployed dozens of servers.
 ➢ First Palomino Cluster Tool subset built.
Prerequisite: Build a Large Cluster
         Configuration Management Options
Pick your Configuration Management:
 ➢ Chef: Popular; use Ruby to “code your infrastructure.” Must learn Ruby.
 ➢ Puppet: Mature; use data structures to “define your infrastructure.” Less coding.
 ➢ Ansible: Tiny and modular, similar to Puppet, but with ordering for deployment. Pragmatic.

Write or get Recipes, Manifests, Playbooks?
 ➢ Writing is tedious. Can take >1 week.
 ➢ Get them from the internet? Often incomplete.
Prerequisite: Build a Large Cluster
         The Palomino Cluster Tool
Palomino's tool for building large DB clusters:
 ➢ Chef, Puppet, Ansible modules.
 ➢ Open-source on Github:
   ➢ https://github.com/time-palominodb/PalominoClusterTool
   ➢ Google: “Palomino Cluster Tool.”
 ➢ Will build a large cluster for you in hours:
   ➢ Master(s),
   ➢ Slaves – hundreds of them as easily as two,
   ➢ MHA – when the master fails, a slave takes over.
 ➢ Previously this would take days.
The Palomino Cluster Tool
         Building the Management Node
Cluster Management Node:
 ➢ Will build the initial cluster.
 ➢ Will do subsequent cluster management.

Tool for Initial Cluster Build:
 ➢ Palomino Cluster Tool (Ansible subset).

Tool for Cluster Management:
 ➢ Jetpants (Ruby).
The Palomino Cluster Tool
         Building the Management Node
Palomino Cluster Tool (Ansible subset). Why Ansible?
 ➢ No server to set up; simply uses SSH.
 ➢ Easy-to-understand non-code Playbooks.
 ➢ Use a language you know for modules.
 ➢ For demo purposes, the obvious choice.
 ➢ Also production-worthy:
   ➢ Built by Michael DeHaan, long-time configuration management guru.
The Palomino Cluster Tool
         Building the Management Node
The management node lives alongside your cluster:
 ➢ We are building our cluster in EC2,
 ➢ Thus the management node is in EC2.
 ➢ This tutorial assumes Ubuntu 12.04.
 ➢ t1.micro is fine for the management node.

Install basic tools:
 ➢ apt-get install git (for Ansible/P.C.T.)
 ➢ apt-get install make python-jinja2 (for Ansible)
The Palomino Cluster Tool
         Configuring the Management Node
Install Ansible:
 ➢ git clone git://github.com/ansible/ansible.git
 ➢ make install

Install Palomino Cluster Tool:
 ➢ git clone git://github.com/time-palominodb/PalominoClusterTool.git

I think we just finished the management node!
The Palomino Cluster Tool
         Allocating Shard Nodes
Shard nodes:
 ➢ m1.small or larger: at least 1.6GB RAM,
 ➢ Ports 3306, 80, and 22 open between all (one security group in EC2),
 ➢ Ubuntu 12.04 (other Debian-alikes at your own risk – but may work!).

They do not need OS/database configuration:
 ➢ Ansible will configure them.
The Palomino Cluster Tool
         Building the First Shard – Step 1
From the README: edit IP addresses in the cluster layout file (PalominoClusterToolLayout.ini):

# Alerting/Trending -----
[alertmaster]
10.252.157.110

[trendmaster]
10.252.157.110

# Servers -----
[mhamanager]
10.252.157.110

This section is identical for all Shards.
The Palomino Cluster Tool
         Building the First Shard – Step 2
From the README: edit IP addresses in the cluster layout file (PalominoClusterToolLayout.ini):

[mysqlmasters]
10.244.17.6

[mysqlslaves]
10.244.26.199
10.244.18.178

[mysqls:vars]
master_host=10.244.17.6

This section is different for every Shard.
The Palomino Cluster Tool
         Building the First Shard – Step 3
Run the setup command to put configuration and SSH keys into /etc:

$ cd PalominoClusterTool/AnsiblePlaybooks/Ubuntu-12.04
$ ./00-Setup_PalominoClusterTool.sh ShardA

Run the build command – it's a wrapper around Ansible Playbooks:

$ ./10-MySQL_MHA_Manager.sh ShardA
The Palomino Cluster Tool
         Building the Second Shard
Just make one shard with a master and many slaves. In real life, you might do something like this instead:

for i in ShardB ShardC ShardD ; do
    (manual step): vim PalominoClusterToolLayout.ini
    (scriptable steps):
    ./00-Setup_PalominoClusterTool.sh $i
    ./10-MySQL_MHA_Manager.sh $i
done

Run them in separate terminals to save time.
Make the Cluster Real
         Data makes Shard Split Interesting
Fill ShardA using a random data script.* The Palomino Cluster Tool includes such a tool:
 ➢ HelperScripts/makeGiantDatafile.pl

$ ssh root@sharda-master
# cd PalominoClusterTool/HelperScripts
# mysql -e 'create database palomino'
# ./makeGiantDatafile.pl 1200000 3 | mysql -f palomino

Install Jetpants, do the shard split now.

* Be sure /var/lib/mysql is on a large partition!
Administering the Cluster
         Install Jetpants
General idea: install Ruby >=1.9.2 and RubyGems, then Jetpants via RubyGems. On my systems, /etc/alternatives is always incorrect; ln the proper binaries for Jetpants.

# apt-get install ruby1.9.3 rubygems libmysqlclient-dev
# ln -sf /usr/bin/ruby1.9.3 /etc/alternatives/ruby
# ln -sf /usr/bin/gem1.9.3 /etc/alternatives/gem
# gem install jetpants
Administering the Cluster
         Configure Jetpants
General idea: edit /etc/jetpants.yaml, then create the Jetpants inventory and application configuration files and chown them to the Jetpants user:

# vim /etc/jetpants.yaml
# mkdir -p /var/jetpants
# touch /var/jetpants/assets.json
# chown jetpantsusr: /var/jetpants/assets.json
# mkdir -p /var/www
# touch /var/www/databases.yaml
# chown jetpantsusr: /var/www/databases.yaml
Administering the Cluster
         Jetpants Shard Splits
Tell the Jetpants Console about your ShardA:

Jetpants> s = Shard.new(1, 999999999, '10.12.34.56', :ready)  # 10.12.34.56 == ShardA master
Jetpants> s.sync_configuration

Create spares within the Console for all others (improved workflow in Jetpants 0.7.8):

Jetpants> topology.tracker.spares << '10.23.45.67'
Jetpants> topology.tracker.spares << '10.23.45.68'
Jetpants> topology.tracker.spares << '10.23.45.69'
Jetpants> topology.write_config
Jetpants> topology.update_tracker_data
Administering the Cluster
         Jetpants Shard Splits
Just for this tutorial:
 ➢ Create the “palomino” database,
 ➢ Break the replication on all the spares,
 ➢ Be sure spares are read/write:
   ➢ Edit my.cnf,
   ➢ service mysql restart
 ➢ Ensure “jetpants pools” output is proper:
   ➢ One master,
   ➢ Two slaves.
Administering the Cluster
         Jetpants Shard Splits
How to perform an actual Shard Split:

$ jetpants shard_split --min-id=1 --max-id=999999999

Notes:
 ➢ The process takes hours. Use screen or nohup.
 ➢ Left child's first ID == parent's first; right child's last ID == parent's last; no overlap, no gap.
 ➢ Make the children 1-300000 and 300001-999999999.
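The no-overlap/no-gap rule for child ranges can be expressed as a small check. A hypothetical Python helper (not part of Jetpants) over inclusive (min_id, max_id) tuples:

```python
def valid_split(parent, children):
    """True if the child (min_id, max_id) ranges exactly tile the parent:
    first child starts at parent's min, last child ends at parent's max,
    and each child starts right after the previous one ends."""
    children = sorted(children)
    if children[0][0] != parent[0] or children[-1][1] != parent[1]:
        return False
    return all(children[i][1] + 1 == children[i + 1][0]
               for i in range(len(children) - 1))
```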
Jetpants Shard Splitting
         The Gory Details
After “jetpants shard_split”:

ubuntu@ip-10-252-157-110:~$ jetpants pools
shard-1-999999999 [3GB]
  master          = 10.244.136.107  ip-10-244-136-107
  standby slave 1 = 10.244.143.195  ip-10-244-143-195
  standby slave 2 = 10.244.31.91    ip-10-244-31-91
shard-1-400000 (state: replicating) [2GB]
  master          = 10.244.144.183  ip-10-244-144-183
shard-400001-999999999 (state: replicating) [1GB]
  master          = 10.244.146.27   ip-10-244-146-27

   0 global pools
   3 shard pools
----  --------------
   3 total pools

   3 masters
   0 active slaves
   2 standby slaves
   0 backup slaves
----  --------------
   5 total nodes
Jetpants Improvements
         The Result of an Experiment
Jetpants is only well-tested on RHEL/CentOS. The Palomino Cluster Tool is only well-tested building Ubuntu 12.04 clusters.

Little effort to fix Jetpants:
 ➢ /sbin/service location differs,
 ➢ service mysql status output differs.
Jetpants Improvements
         The Result of an Experiment
Jetpants is only well-tested on MySQL 5.1. I built a cluster of MySQL 5.5.

A little more effort to fix Jetpants:
 ➢ Setting master_host='' is a syntax error,
 ➢ RESET SLAVE needs the keyword “ALL” appended.
Jetpants Improvements
         The Result of an Experiment
Jetpants is only well-tested on large datasets. I built a cluster with only hundreds of MB.

A wee tad more effort to fix Jetpants:
 ➢ Some timings assumed large datasets,
 ➢ Edge cases for small/quick operations reported back to the author.
Jetpants Improvements
         OSS Collaboration and Win
Evan Elias implemented these fixes last week!
 ➢ jetpants add_pool,
 ➢ jetpants add_shard,
 ➢ jetpants add_spare (with sanity-checking of the spare),
 ➢ Shards with 1 slave (not for prod!),
 ➢ read_only spares not fatal,
 ➢ Debian-alike (Ubuntu) fixes,
 ➢ MySQL 5.5 fixes,
 ➢ Mid-split “jetpants pools” output simpler.

Really responsive ownership of the project!
Twitter's Gizzard
         What is it?
A general framework for a distributed database:
 ➢ Hides sharding from you.
 ➢ Literally, it is middleware:
   ➢ Applications connect to Gizzard,
   ➢ Gizzard sends connections to the proper place,
   ➢ Shard splits and hardware failure are taken care of.
 ➢ Created at Twitter by rogue cowboys.
 ➢ Not completely production-ready.
 ➢ Better than rolling your own!
Twitter's Gizzard
         Why should I use it?
You've settled on a row-based partition scheme:
 ➢ Master nearing I/O capacity, won't scale up,
 ➢ Can't move some tables to their own pool,
 ➢ Can't split the columns/indexes out,
 ➢ You want to keep using the DBMS you already know and love: Percona Server,*
 ➢ Don't want to think about fault-tolerance or shard splits (much).

* Actually, use any storage back-end.
Twitter's Gizzard
         The Fine Print
This sounds perfect. Why not Gizzard?

Writes must follow a strict diet. They must be:
 ➢ Idempotent*,
 ➢ Commutative**,
 ➢ Must not have tuberculosis.

* Pfizer cannot remove the idempotency requirement of Gizzard.
** Even on evenings and weekends.
Twitter's Gizzard
         Expanding the Fine Print
Idempotency:
 ➢ Submit a write. Again. And again.
 ➢ The result must be identical to doing it once.
 ➢ Bad: “update set col = col + 1”

Commutative – writes apply in arbitrary order:
 ➢ WriteA→WriteB→WriteC on Node1.
 ➢ WriteB→WriteC→WriteA on Node2.
 ➢ Bad: “update set col1 = 42”→“update set col2 = col1 + 5”
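Why “col = col + 1” is bad can be made concrete. A hypothetical Python sketch: under Gizzard-style retries a write may be replayed, so an idempotent write (plain assignment) leaves the row unchanged on replay, while an increment does not:

```python
def increment(row):
    # NOT idempotent: every replay changes the stored value again.
    row["col"] = row["col"] + 1

def set_col(row, value):
    # Idempotent: replaying this write any number of times is a no-op.
    row["col"] = value
```

The commutativity requirement is the same idea across replicas: two nodes applying the same writes in different orders must end up with identical rows.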
Twitter's Gizzard
         Expanding the Fine Print
The cluster is Eventually Consistent:
 ➢ May return old values for reads.
 ➢ Unknown when consistency will occur.

Like a politician's position on the budget:
 ➢ Might be consistent in the future.
 ➢ Just not right now.
 ➢ Or now.
Twitter's Gizzard
         Working Around the Shortcomings
Gizzard work-around:
 ➢ Add a timestamp to every transaction.
 ➢ Good:
   ➢ “col1.ts=1; update set col1=42” →
   ➢ “update set col2=col1 + 5 where col1.ts=1”
 ➢ Implementation is trickier if your DBMS doesn't support column attributes.

Cannot escape: you must radically re-think your schema and application/DBMS interaction.
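The timestamp work-around amounts to last-writer-wins. A hypothetical Python sketch (the ".ts" key stands in for a per-column timestamp attribute): a write is applied only if it is newer than what the row already holds, so replays and reorderings converge to the same state:

```python
def apply_write(row, column, value, ts):
    # Apply the write only if its timestamp beats the stored one;
    # a replayed or out-of-order older write becomes a no-op.
    if ts > row.get(column + ".ts", -1):
        row[column] = value
        row[column + ".ts"] = ts
```

Two replicas can now receive the same set of writes in any order, with duplicates, and still agree.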
Twitter's Gizzard
         Trying it Out
I'm convinced! How do I begin?
 ➢ Learn Scala.
 ➢ Clone “rowz” from Github:
   ➢ https://github.com/twitter/Rowz
 ➢ Modify it to suit your needs.
 ➢ Learn how it interacts with existing tools.
 ➢ Write new monitoring/alerting plugins.
 ➢ Write unit tests!
 ➢ You should OSS it to help with overhead.
Twitter's Gizzard
         Trying it Out
Sounds daunting. Maybe I'll roll my own?

Learn from others' mistakes:
 ➢ Digg: 2 engineers, 6 months. Code thrown away. Digg out of business.
 ➢ Countless identical stories in Silicon Valley.

A NIHS attitude == going out of business.*

* 8-figure R&D budgets excepted.
Youtube's Vitess/Vtocc
         What is it?
Vitess is a library. Vtocc is an implementation using it.

Vtocc is another middleware solution:
 ➢ Sharding,
 ➢ Caching,
 ➢ Connection-pooling,
 ➢ In use at Youtube,
 ➢ Built-in fail-safe features.
Youtube's Vtocc
         Why use it?
A proven high-volume sharding solution with an interesting feature list:
 ➢ Automatic killing of over-limit queries/transactions.
 ➢ A better query-cache implementation.
 ➢ Query comment-stripping for the query cache.
 ➢ Query consolidation.
 ➢ Zero-downtime restarts.

Less coding than Gizzard (more plug-in).
Youtube's Vtocc
         Hold on, Zero-Downtime Restarts?
Just start a new Vtocc instance:
 ➢ Instance1 passes new requests to Instance2,
 ➢ Instance1's connections get 30s to complete,
 ➢ Instance2 kills Instance1 and takes over.

Diagram: Vtocc Instance 1 handing off to Vtocc Instance 2.
Youtube's Vtocc
         The Fine Print
Requires particular Primary Keys:
 ➢ varbinary datatype,
 ➢ Choose carefully to prevent hot-spots.

Max result-set size: larger result sets fail.

Additional administration burden:
 ➢ “My query was killed. Why?”
 ➢ Middleware adds spooky, hard-to-diagnose failure modes.
Youtube's Vtocc
         Implementation Details
 ➢ Run Vtocc on the same server as MySQL.
 ➢ Configure Vtocc fail-safes for expected load:
   ➢ Pool Size (connection count),
   ➢ Max Transactions (has its own connection pool),
   ➢ Query Timeout (before it is killed),
   ➢ Transaction Timeout (before it is killed),
   ➢ Max Resultset Size in rows:
     ➢ The Go language doesn't free allocated memory, so pick this value carefully.
 ➢ More details: http://code.google.com/p/vitess/wiki/Operations
HAproxy
         Re-thinking Proxy Topology
Old-school Proxy Topology:
 ➢ DB Clients on one side,
 ➢ DB Servers on the other,
 ➢ Proxy in between: a Single Point of Failure.
HAproxy
         Re-thinking Proxy Topology
A free proxy provides a new architecture option:
 ➢ Proxy on every DB client node.
 ➢ Good-bye, single point of failure.
 ➢ Hello, configuration management for the proxy.

Diagram: an HAproxy instance running on every client node.
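A client-local listener along these lines might look like the following sketch. All names, addresses, and the check user are illustrative assumptions, not from the deck; verify the directives against your HAproxy version's documentation before use:

```
# Sketch: every application host runs its own HAproxy bound to localhost,
# so there is no central proxy left to fail.
listen mysql-shardA
    bind 127.0.0.1:3306
    mode tcp
    option mysql-check user haproxy_check
    server shardA-master 10.244.17.6:3306 check
    server shardA-backup 10.244.26.199:3306 check backup
```

Applications then connect to 127.0.0.1:3306, and rolling out this file across all client nodes is exactly the configuration-management job the slide warns about.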
Methods of Sharding MySQL
         Q&A
Questions? Suggestions:
 ➢ Interesting stuff. Got a job for me?
 ➢ Well, I got a job for you. Interested?
 ➢ Warn me next time so I can sleep in the back row.
 ➢ Was that a question?

Thank you! Emails to domain palominodb, username time.

Percona Live 2012 in New York City. Enjoy the rest of the show!