High-Performance Production Databases on Ceph
Who are you?
Hi, I’m Thorvald, Architect @ Medallia
At Medallia, we collect, analyze, and display terabytes of structured & unstructured feedback for our multibillion-dollar clients in real time. And what’s more: we have a lot of fun doing it.
I’ve been at Medallia since 2010, as we’ve grown from 70 to 700 employees.
AGENDA
1. The Dream
2. Networking/Storage Mobility
3. Provisioning/Orchestration
4. Demo!
5. Real-world performance
6. Challenges and next steps
A long, long time ago...
Tech Industry Speak for “Last Year”
• New version of our analytics engine
  ○ Dream: Horizontally super-scalable! 1000s of servers!
• Reality: Peeking at Production
  ○ 100s of servers
    ■ .. that had individual names
    ■ .. almost, but not quite, entirely unlike each other
    ■ .. manual service placement .. and server placement
    ■ “Don’t touch it”
Rapid Evolution Time!
Jump into the future
• Skip 2-3 generations and go direct to “next gen”
  ○ MicroServices, Containers, <insert buzz-word here>
• Proof-of-Concept using 40GbE, Ceph, Docker
  ○ Resilient enough that it’s a problem to test resiliency
  ○ Performant enough to replace dedicated servers
• Can we run everything on this new infrastructure?
Design Goals
Keep it SIMPLE
Commodity Products
• Commodity Components & Supported Open Standards
• Fully automated provisioning and reinstall
• Cheap & Scalable
• Immutable Servers
No Special Machines
● No service that is tied to specific hardware
● Every component must be able to run anywhere
● Redundancy at App Layer
● Self-Healing
Standard Rack
Unified and Scalable
22 x Compute Node: Linux (Ubuntu), 2x Intel E5-v3, 256 GB Memory, 40GbE Network, 100 GB SSD
3 x Networking: Linux (Cumulus), 1x Intel Atom 64-bit, 8 GB Memory, 32x 40GbE Network
8 x Storage Node: Linux (Ubuntu), 1x Intel E5-v3, 64 GB Memory, 40GbE Network, 8x 800 GB SSD, PCIe NVRAM
Challenge
Everything in Containers
Where do you draw the line?
• Application in relocatable Container?
• Load-balancer in relocatable Container?
• DNS server in relocatable Container?
• Database in relocatable Container?
Problem: Network Mobility
Relocating non-movable services
[Diagram: a datacenter firewall in front of hosts 10.1.2.3, 10.1.2.4 and 10.1.2.5; the containers behind them (nginx at 172.17.0.3:80, zookeeper at 172.17.1.0:2181, an application at 172.17.1.2:80) are only reachable through fixed host IP:port mappings.]
Network Mobility
[Diagram: container-A (10.4.5.6/32) attached through veth0 to host1 (eth0, 10.1.2.3/30); host2, host3, ... hostN sit alongside, and the container’s /32 can be routed to whichever host it lands on.]
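Not our actual tooling, but a minimal sketch of the idea, assuming a named network namespace and made-up interface names: give the container a /32, add a host route for it, and let the routing protocol advertise that route.
  # Sketch only: plumb a routable /32 into a container namespace (names are illustrative)
  ip netns add container-A
  ip link add vethA type veth peer name vethA-c
  ip link set vethA-c netns container-A
  ip netns exec container-A ip addr add 10.4.5.6/32 dev vethA-c
  ip netns exec container-A ip link set lo up
  ip netns exec container-A ip link set vethA-c up
  ip netns exec container-A ip route add default dev vethA-c
  ip link set vethA up
  sysctl -w net.ipv4.ip_forward=1
  sysctl -w net.ipv4.conf.vethA.proxy_arp=1   # host answers ARP for the container's default route
  ip route add 10.4.5.6/32 dev vethA          # this /32 is what gets announced to the fabric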
Route Propagation
OSPF
Fully relocated IP address
• Open Shortest Path First
  ○ Propagated Link State Database
  ○ Supported by every vendor
• Computes network paths with Dijkstra’s algorithm
• Moving 30 000 routes: ~1 second
• BGP works just as well; OSPF auto-configures more easily
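As an illustration (not our exact config), a Quagga/FRR ospfd stanza that redistributes those kernel /32 host routes into area 0; the router-id and prefixes are made up.
  ! ospfd.conf sketch -- values are illustrative
  router ospf
   ospf router-id 10.1.2.3
   redistribute kernel
   network 10.1.2.0/30 area 0.0.0.0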
Problem: Storage Mobility
Relocating very non-movable services
[Diagram: an application (172.17.1.2 on host 10.1.2.3) talks to PostgreSQL (172.17.5.5); the PostgreSQL container can move from host 10.1.2.4 to host 10.1.3.5 and keep its IP, but its on-disk data does not follow.]
Storage Mobility
Where did the filesystem go?
• Docker images are ephemeral
• Persistent volumes to the rescue!
  ○ Which work on your local machine only
• Proprietary solutions needed for HA
  ○ iSCSI (Large Storage Vendor)
  ○ NFS (Large Storage Vendor, and.. performance?)
  ○ pNFS (... right)
• Scale up, but not out
• SLA? A 4-hour hardware support window is not good enough!
Ceph
Short Version
• No need to communicate with metadata servers in the hot path
• Clean design; we understand enough to go fix problems ourselves
• Need more capacity?
  ○ Add servers!
• Need more aggregate performance?
  ○ Add servers!
• Need more single-node performance?
  ○ Get creative!
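For illustration, “add servers” with the ceph-deploy tooling of the day might look like this (hostname and device names are made up):
  ceph-deploy install storage-09
  ceph-deploy osd prepare storage-09:/dev/sdb storage-09:/dev/sdc
  ceph-deploy osd activate storage-09:/dev/sdb1 storage-09:/dev/sdc1
  ceph osd tree    # the new OSDs appear and CRUSH starts rebalancing onto them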
Storage “Solved”
Relocating very non-movable services
[Diagram: the same application and PostgreSQL containers, but the hosts now sit on a replicated Ceph cluster, so the PostgreSQL volume follows the container from host 10.1.2.4 to host 10.1.3.5.]
Relocatable Infrastructure
Relocatable Monitors
What happens when the server for your monitor dies?
• It’s “interesting” to switch Ceph monitor IPs. So don’t.
  ○ The monitors are services; each gets a unique IP.
• If the machine hosting a monitor dies, start the same monitor somewhere else with the same IP.
  ○ It’ll clone its data from the other monitors
• Not automated (somewhat high fubar potential)
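Roughly, the manual relocation looks like this (a hedged sketch; the monitor id “a” and service IP 10.1.2.10 are illustrative):
  ip addr add 10.1.2.10/32 dev lo        # re-home the monitor's service IP on the new machine
  ceph mon getmap -o /tmp/monmap         # monmap and keyring come from the surviving quorum
  ceph auth get mon. -o /tmp/mon.keyring
  ceph-mon --mkfs -i a --monmap /tmp/monmap --keyring /tmp/mon.keyring
  ceph-mon -i a --public-addr 10.1.2.10  # the fresh store then syncs from the other monitors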
Remote Boot
Always boot from PXE
• Pre-OS Linux + initramfs from PXE+HTTP
• Unlocks self-encrypting drives (data-at-rest encryption)
  ○ Key never known by the runtime OS
• Check state:
  ○ Update firmware? Unify BIOS version and config?
  ○ Install OS?
  ○ Boot OS?
• Completely uniform machines -- no half-installed, half-forgotten state.
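An illustrative pxelinux entry for such a pre-OS boot (the paths and kernel arguments are invented):
  DEFAULT preos
  LABEL preos
    KERNEL preos/vmlinuz
    INITRD preos/initrd.img
    APPEND console=ttyS0 provision_url=http://boot.internal/state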
Apache Aurora/Mesos
“Program against your datacenter like it’s a single pool of resources”
[Diagram: three Mesos Masters coordinate through Zookeeper; Aurora schedulers (alongside Hadoop and Storm schedulers) submit work to them, and Mesos slaves NODE-1 through NODE-8 (12 or 32 CPUs, 128 or 256 GB each) run the tasks. A new job is described by its docker-image (medallia/service1), resources (2x CPU, 1x GB) and instance count (3).]
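Submitting such a job through the Aurora client would look something like this (the cluster/role/environment path and file name are made up):
  aurora job create main/medallia/prod/service1 service1.aurora
  aurora job status main/medallia/prod/service1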
Extended Docker
docker run -it \
  --net=routed --ip-address=1.2.3.4/32 \
  -v demo:/demo:ceph,rw \
  ubuntu
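A quick sanity check after the container comes up might be (addresses are the made-up ones from the example; assumes the volume is mapped via krbd on the hosting node):
  ip route get 1.2.3.4   # from any host: resolves to the /32 route the fabric learned
  rbd showmapped         # on the hosting node: the "demo" image mapped as /dev/rbdN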
How Fast Is It?
Network:
• ~5 us latency
• 38 Gbit/s single-stream TCP
• 22 Gbit/s single-stream TCP
• 39.5 Gbit/s multi-stream TCP
• Relocate < 50 ms
Storage:
• ~550 MB/sec single-stream IO
• ~4 GB/sec multi-stream IO
• Reattach < 50 ms
• Limited by SATA SSD
NUMA
[Diagram: dual-socket layout; the 40GbE NIC hangs off CPU #0 and the SAS/SATA controller off CPU #1, each socket with four DRAM channels.]
DEMO!
Real-World vs Synthetic IO
Latency, not IOPS or bandwidth!
• SSDs do 100k 4k random-write IOPS!
  ○ If you have an “IO pipeline”
• Real-world:
  ○ Read: Databases don’t have an IO depth of 64. It’s 1.
    ■ Read index, process, seek to correct index, read, process..
  ○ Write: Databases want each and every transaction to be acknowledged by the storage layer
    ■ Full round-trip down to the storage layer
• Dedicated DB servers have a LOT of buffer cache
  ○ 24x 800 GB SSD = $15k. 512 GB RAM = $4k.
What performance matters for a DB?
fdatasync() is the bottleneck
• We have two types of tables
  ○ “A few GB”
  ○ “A few TB”
• The application does heavy caching; few read requests
• DB containers have plenty of memory; most tables sit in the buffer cache
• If a user actually modifies something, there’s a transaction...
3 Ways to Mount
FUSE: Easy! Slow! Mixed read-write: ~640 iops
KRBD: Easy! Fast...er. Mixed read-write: ~1550 iops. No fancy image features
iSCSI tgt rbd: Hard! Slow! Mixed read-write: ~600 iops
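The “no fancy image features” caveat means images meant for krbd get created with only the layering feature enabled; a hedged example (pool/image names are made up, and the exact flag spelling depends on the rbd version):
  rbd create --size 102400 --image-feature layering volumes/pgdata
  rbd map volumes/pgdata        # -> /dev/rbd0
  mkfs.ext4 /dev/rbd0
  mount /dev/rbd0 /var/lib/postgresql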
“Realistic” testing with FIO
Something that resembles PG
• Can (and do) use pgbench, but the pgbench workload and our real workload differ.
• Observe the production IO pattern, replicate it with fio
  ○ Once something gives good results on fio, apply it to the real DB
• Allow the buffer cache
  ○ Yes, you have it on in production
• iodepth=1, 8 jobs, 8 kB blocks
• fdatasync() every 100th block
• Very large files, semi-random access
• PG doesn’t use fancy IO, so neither does our benchmark
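An approximate fio job file for those parameters: psync keeps the IO plain, direct=0 leaves the buffer cache on, and fdatasync=100 issues fdatasync() every 100th write. The file location, size, and the zipf “semi-random” distribution are our own placeholders, not the exact production trace.
  ; approximate fio job -- sizes, path and zipf factor are placeholders
  [global]
  ioengine=psync
  direct=0
  rw=randrw
  bs=8k
  iodepth=1
  numjobs=8
  fdatasync=100
  size=200g
  random_distribution=zipf:1.1
  [pg-like]
  directory=/var/lib/postgresql/fio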
Local SSD comparison
3x 850 Pro RAID0 (soft RAID0, no SuperCap)
  Writes: 99.9%: 22us, 99.99%: 57us, ~700 iops/job
MegaRAID RAID6 (battery-backed write-back cache, unpredictable performance)
  Writes: 99.5%: 22us, 99.9%: 15ms, 99.99%: 119ms, ~1100 iops/job
KRBD (survives controller failure!)
  Writes: 99.9%: 9us, 99.99%: 11us, ~1000 iops/job
Ext4 on RBD
Test all failure scenarios
Fun With Locking
• Switch is rebooted
• Aurora detects the compute node as dead
  ○ Restarts the job somewhere else
• New location mounts the ext4 filesystem
• Switch finishes rebooting
• Old job, still running, now writes to the already-mounted filesystem
• “How to repair a broken ext4 filesystem with a critical database”
Workaround
Modified RBD wrapper; /bin/sh to the rescue!
• On map: “rbd lock add <image>”
  ○ If that fails:
    ■ “rbd status <image>”: check for a watcher, 3 times, 15 s apart
      ● If a watcher is found: ABORT, ABORT!
    ■ “ceph osd blacklist add <previous lock holder>”
    ■ Steal the lock
• On unmap: “rbd lock remove”
• On reboot: “ceph osd blacklist rm <self>”
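A simplified /bin/sh sketch of the map-side logic above (the image argument, lock id, and the parsing of rbd output are illustrative; the real wrapper does more error handling):
  #!/bin/sh
  IMAGE="$1"; LOCK_ID="$(hostname)"
  if ! rbd lock add "$IMAGE" "$LOCK_ID"; then
      for attempt in 1 2 3; do
          if rbd status "$IMAGE" | grep -q 'watcher='; then
              echo "image is still watched elsewhere -- ABORT" >&2
              exit 1
          fi
          sleep 15
      done
      # no watcher seen: blacklist the previous lock holder and steal its lock
      rbd lock list "$IMAGE" | awk 'END { print $1, $2, $3 }' | {
          read LOCKER OLD_ID ADDR
          ceph osd blacklist add "$ADDR"
          rbd lock remove "$IMAGE" "$OLD_ID" "$LOCKER"
      }
      rbd lock add "$IMAGE" "$LOCK_ID"
  fi
  rbd map "$IMAGE"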
Make it faster!
Great, we beat legacy hardware… Or did we?
• Legacy hardware has better write latency below the 90th percentile, worse above it, and higher average write IOPS.
• We want no compromise on performance
• Currently rolling out PMC NV1616 NVRAM for the Ceph write journal
  ○ Single storage-server test very promising.
  ○ Large-scale test ready in 2 weeks
• Experimenting with RoCE v2: RDMA over UDP
• Will post results to the Ceph mailing list
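For reference, in a pre-BlueStore setup the journal move is little more than a ceph.conf pointer at the NVRAM-backed block device (the partition label is made up):
  [osd]
  osd journal = /dev/disk/by-partlabel/nvram-journal-$id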
Try this out!
Available now:
• Docker w/ Storage and Networking
• Aurora
Coming soon:
• DCIB
github.com/medallia
If you want an exact replica...
Compute Node: 2x E5-2667v3 or 2690v3, 16x 16GB DDR4 RDIMM, SuperMicro X10DDW-i, Mellanox ConnectX-3 Pro, Intel DC-S3500 100GB SSD
Networking: Dell S6000-ON, Cumulus, Dell AOC cable (switches), Mellanox copper (servers)
Storage Node: 1x E5-2667v3, 4x 16GB DDR4 RDIMM, SuperMicro X10SRW-F, Mellanox ConnectX-3 Pro, Intel DC-S3500 100GB SSD, 8x Intel DC-S3500 800GB, Flashtec NVRAM NV1616 (w/ Encryption)
Thank you!
engineering.medallia.com
