Designing for
High Performance Ceph at Scale
April 26, 2016
James Saint-Rossy - Principal Storage Engineer, Comcast
John Benton - Consulting Systems Engineer, WWT
Today’s Agenda
• Our Lab/Production Environment
• Holistic Architecture
• Strategies for Benchmarking
• Performance Bottlenecks/Lessons Learned
• Tuning Tips and Tricks
Our Typical Node Configuration
Storage Node
• 72 x 6 TB SATA 7.2K HDDs
• 3 x 1.6 TB PCIe NVMe drives (journals)
• 2 x Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz (12 cores)
• 256 GB of RAM
• Dual-port 40 GbE NIC
Mon/RGW Node
• 2 x Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
• 32 GB of RAM
• Dual-port 10 GbE NIC
• ...Nothing Special
Lab/Production Environment Layout
Holistic Architecture
Customer Requirements
-IOPS/Read Write Mix/Object Size …
-How Much Replication
-Which APIs
Cost
-HW Cost/Support Cost/Operational Cost?
Failure Domain
-Servers/Racks/Rows, etc.
Data Center Constraints
-Space/Power/Thermal
Operational Complexity
-Complex Hardware Configs
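
A rough worked example of how replication drives cost per usable TB, using the 72 x 6 TB storage node above; the 85% fill ceiling is our assumption (headroom for recovery and rebalancing), not a fixed rule:

# Illustrative sketch: usable capacity of one storage node under 3x replication.
raw_tb = 72 * 6            # 72 x 6 TB HDDs = 432 TB raw
replicas = 3               # replication factor under consideration
fill_ceiling = 0.85        # assumed headroom so recovery has room to work

usable_tb = raw_tb / replicas * fill_ceiling
print(f"raw: {raw_tb} TB, usable at {replicas}x: {usable_tb:.0f} TB")   # ~122 TB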
Holistic Architecture Cont’d
Journals
- Colocated?
- SSD vs NVMe?
Strategies for Benchmarking
Tools
-fio for block
-COSBench for object
IOPS Isn’t Everything
-1,000 workers may give you 30% more IOPS, but at the cost of 600% higher latency
Verify Published Stats With Benchmarks
-… Always
Verify Scale-Out
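
A minimal sketch of the kind of sweep we mean: drive fio at increasing queue depths and record IOPS alongside latency, so the latency cost of extra parallelism stays visible. The scratch-file target, sizes, and runtime below are placeholders, and the JSON field names assume a recent fio; point the job at an RBD image (fio's rbd ioengine) or run the equivalent COSBench workload for the Ceph-facing numbers.

import json
import subprocess

# Sweep queue depth on a scratch file and print IOPS vs. mean completion latency.
for iodepth in (1, 4, 16, 64, 256):
    result = subprocess.run(
        ["fio", "--name=sweep", "--filename=/tmp/fio.test", "--size=1G",
         "--rw=randwrite", "--bs=4k", "--ioengine=libaio", "--direct=1",
         "--runtime=30", "--time_based", f"--iodepth={iodepth}",
         "--output-format=json"],
        capture_output=True, text=True, check=True)
    write_stats = json.loads(result.stdout)["jobs"][0]["write"]
    lat_ms = write_stats["clat_ns"]["mean"] / 1e6
    print(f"iodepth={iodepth:4d}  iops={write_stats['iops']:10.0f}  "
          f"mean_lat={lat_ms:8.2f} ms")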
Performance - TCMalloc
• As cluster size increased, %SYS was increasingly taxed
• System profiling revealed up to 50% of CPU resources used by TCMalloc
• TCMalloc can be tuned to use a larger thread cache; this alone was good for
nearly a 50% performance increase
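
The knob in question is TCMalloc's thread cache, raised by exporting TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES before the OSDs start (on RHEL-family hosts the packaged /etc/sysconfig/ceph file carries it); the 128 MB figure in the comment is an example, not a recommendation. A small sketch (run as root) to confirm running OSDs actually picked the override up:

import os

VAR = b"TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES"   # e.g. 134217728 for 128 MB

# Walk /proc, find ceph-osd processes, and report whether the override is set.
for pid in filter(str.isdigit, os.listdir("/proc")):
    try:
        with open(f"/proc/{pid}/comm") as f:
            if f.read().strip() != "ceph-osd":
                continue
        with open(f"/proc/{pid}/environ", "rb") as f:
            env_entries = f.read().split(b"\0")
    except OSError:
        continue
    found = [e.decode() for e in env_entries if e.startswith(VAR)]
    print(pid, found[0] if found else "override NOT set")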
Modern PC Architecture
Performance - Inter-node data flow
OSD Data Workflow
[Diagram omitted — image credit: "complicated situation" by bandinisonfire, licensed under CC BY-NC-SA 2.0]
Performance - NUMA
• The bigger and faster the data node, the bigger the
bottleneck potential
• We tuned several areas to avoid unnecessary trips
across the QPI bus
• To map everything you must (see the sysfs sketch below):
• Map CPU cores to sockets
• Map PCIe devices to sockets
• Map storage disks (and journals) to the associated HBA
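
Most of that mapping can be read straight out of sysfs. The sketch below prints the cores behind each NUMA node plus the node behind each NIC and SATA disk; it assumes the usual layout where a device's PCI ancestor exposes a numa_node attribute (some platforms report -1 when that information is missing).

import glob
import os

def pci_numa_node(sysfs_device):
    """Walk up a device's sysfs ancestry until a numa_node attribute appears."""
    path = os.path.realpath(sysfs_device)
    while path not in ("/", "/sys"):
        node_file = os.path.join(path, "numa_node")
        if os.path.exists(node_file):
            return int(open(node_file).read())
        path = os.path.dirname(path)
    return -1   # unknown / platform did not expose NUMA locality

# CPU cores that belong to each NUMA node
for node_dir in sorted(glob.glob("/sys/devices/system/node/node[0-9]*")):
    cpulist = open(os.path.join(node_dir, "cpulist")).read().strip()
    print(os.path.basename(node_dir), "cpus:", cpulist)

# PCIe devices we care about: NICs and the HBAs behind each disk
for nic in glob.glob("/sys/class/net/*/device"):
    print(nic.split("/")[4], "-> node", pci_numa_node(nic))
for disk in glob.glob("/sys/block/sd*"):
    print(os.path.basename(disk), "-> node", pci_numa_node(disk + "/device"))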
NUMA - IRQs
Pin all soft IRQs for each IO device to its associated NUMA node
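
A sketch of what the pinning looks like in practice (run as root): point every MSI-X vector of a device at the CPUs the kernel reports as local to it. It assumes irqbalance is disabled so the affinities stick, and the interface name is only a placeholder.

import glob
import os

def pin_device_irqs(sysfs_device):
    """Bind each of a device's MSI-X IRQs to the CPUs local to its NUMA node."""
    local_cpus = open(os.path.join(sysfs_device, "local_cpulist")).read().strip()
    for irq_dir in glob.glob(os.path.join(sysfs_device, "msi_irqs", "*")):
        irq = os.path.basename(irq_dir)
        with open(f"/proc/irq/{irq}/smp_affinity_list", "w") as f:
            f.write(local_cpus)
        print(f"IRQ {irq} -> CPUs {local_cpus}")

# Example: the 40 GbE port carrying the cluster network (name is a placeholder).
pin_device_irqs("/sys/class/net/ens1f0/device")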
NUMA - Mount Points
Align mount points so that the OSD and journal are on the
same NUMA node
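
One way to catch misaligned OSDs is to compare the NUMA node behind each OSD's data disk with the node behind the NVMe partition its journal points at. A minimal sketch, assuming a FileStore-style layout where /var/lib/ceph/osd/ceph-* is the data mount and its journal entry is a symlink to the journal partition:

import glob
import os

def blockdev_numa_node(dev_name):
    """Resolve /sys/class/block/<dev> and walk up to the PCI parent's numa_node."""
    path = os.path.realpath("/sys/class/block/" + dev_name)
    while path not in ("/", "/sys"):
        node_file = os.path.join(path, "numa_node")
        if os.path.exists(node_file):
            return int(open(node_file).read())
        path = os.path.dirname(path)
    return -1

# Build a mount-point -> device map so we can find each OSD's data disk.
mounts = {}
for line in open("/proc/mounts"):
    dev, mountpoint = line.split()[:2]
    mounts[mountpoint] = dev

for osd_dir in sorted(glob.glob("/var/lib/ceph/osd/ceph-*")):
    data_dev = os.path.basename(mounts.get(osd_dir, ""))
    journal_dev = os.path.basename(os.path.realpath(os.path.join(osd_dir, "journal")))
    data_node = blockdev_numa_node(data_dev)
    journal_node = blockdev_numa_node(journal_dev)
    note = "" if data_node == journal_node else "  <-- crosses NUMA nodes"
    print(f"{osd_dir}: data {data_dev} (node {data_node}), "
          f"journal {journal_dev} (node {journal_node}){note}")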
NUMA - OSD Processes
Pin OSD processes to the NUMA node associated with the
storage they control
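
Once the disk-to-node map exists, the pinning itself is one syscall per daemon. A minimal sketch, assuming systemd-managed ceph-osd@<id> units (the --value flag needs a reasonably recent systemd) and a hand-built osd-to-node map whose values below are placeholders; for something that survives restarts we would rather bake the affinity into the unit or a numactl wrapper.

import os
import subprocess

# Placeholder map produced by the sysfs walk on the NUMA mapping slide.
osd_numa_node = {0: 0, 1: 0, 36: 1, 37: 1}

def cpus_of_node(node):
    """Expand /sys/devices/system/node/nodeN/cpulist (e.g. '0-11,24-35')."""
    cpulist = open(f"/sys/devices/system/node/node{node}/cpulist").read().strip()
    cpus = set()
    for part in cpulist.split(","):
        low, _, high = part.partition("-")
        cpus.update(range(int(low), int(high or low) + 1))
    return cpus

for osd_id, node in osd_numa_node.items():
    pid = int(subprocess.check_output(
        ["systemctl", "show", "-p", "MainPID", "--value", f"ceph-osd@{osd_id}"]))
    if pid:
        os.sched_setaffinity(pid, cpus_of_node(node))
        print(f"osd.{osd_id} (pid {pid}) pinned to NUMA node {node}")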
Performance - General Tips
• Use latest vendor drivers.
-We have seen 30% improvements over stock drivers
• OS tuning focused on increasing threads, file handles,
etc.
• Jumbo frames help, particularly on the cluster network
• Watch for flow control issues with 40 GbE network adapters
• Scan for failing (but perhaps not completely failed) disks
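
None of these settings are one-size-fits-all, so rather than prescribe values the sketch below just reports what a node is currently running with, for comparison against the switch configuration and the OS tuning profile. The cluster-network interface name is a placeholder, and reading another process's /proc/<pid>/limits needs root.

import glob

CLUSTER_IF = "ens1f1"   # placeholder: whichever port carries the cluster network

print("cluster MTU:", open(f"/sys/class/net/{CLUSTER_IF}/mtu").read().strip())
print("fs.file-max:", open("/proc/sys/fs/file-max").read().strip())
print("kernel.pid_max:", open("/proc/sys/kernel/pid_max").read().strip())

# Effective open-file limit of one running OSD, straight from /proc.
for proc_dir in glob.glob("/proc/[0-9]*"):
    try:
        if open(proc_dir + "/comm").read().strip() != "ceph-osd":
            continue
        for line in open(proc_dir + "/limits"):
            if "open files" in line:
                print(proc_dir, line.strip())
        break
    except OSError:
        continue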
[Questions slide — image credit: "Question" by alphageek, licensed under CC BY-NC-SA 2.0]
Performance - Mons
• Mons are generally glorified TFTP servers, and you can
get away with 1+2 for redundancy
• That is, until they aren’t...
• In certain situations, like cluster rebalancing or deleting a
pool with a lot of PGs, a single CPU on *ALL* mons will
become jammed up. They start evicting each other and
mayhem ensues.
• How to fix this: