Accelerating Cassandra Workloads on Ceph with All-Flash PCIe SSDs
Reddy Chagam – Principal Engineer, Storage Architect
Stephen L Blinick – Senior Cloud Storage Performance Engineer
Acknowledgments: Warren Wang, Anton Thaker (WalMart)
Orlando Moreno, Vishal Verma (Intel)
Intel technologies’ features and benefits depend on system configuration and may require
enabled hardware, software or service activation. Performance varies depending on system
configuration. No computer system can be absolutely secure. Check with your system
manufacturer or retailer or learn more at http://intel.com.
Software and workloads used in performance tests may have been optimized for performance
only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are
measured using specific computer systems, components, software, operations and functions.
Any change to any of those factors may cause the results to vary. You should consult other
information and performance tests to assist you in fully evaluating your contemplated purchases,
including the performance of that product when combined with other products.
§ Configurations: Ceph v0.94.3 Hammer Release, CentOS 7.1, 3.10-229 Kernel, Linked with JEMalloc 3.6, CBT used for testing and data acquisition, OSD
System Config: Intel Xeon E5-2699 v3 2x@ 2.30 GHz, 72 cores w/ HT, 96GB, Cache 46080KB, 128GB DDR4, Each system with 4x P3700 800GB NVMe, partitioned into 4
OSD’s each, 16 OSD’s total per node, FIO Client Systems: Intel Xeon E5-2699 v3 2x@ 2.30 GHz, 72 cores w/ HT, 96GB, Cache 46080KB, 128GB DDR4, Single 10GbE
network for client & replication data transfer. FIO 2.2.8 with LibRBD engine. Tests run by Intel DCG Storage Group in an Intel lab. Ceph configuration and CBT YAML file
provided in backup slides.
§ For more information go to http://www.intel.com/performance.
Intel, Intel Inside and the Intel logo are trademarks of Intel Corporation in the United States and
other countries. *Other names and brands may be claimed as the property of others.
© 2015 Intel Corporation.
Agenda
• The transition to flash and the impact of NVMe
• NVMe technology with Ceph
• Cassandra & Ceph – a case for storage convergence
• The all-NVMe high-density Ceph Cluster
• Raw performance measurements and observations
• Examining the performance of a Cassandra DB-like workload
Evolution of Non-Volatile Memory Storage Devices
[Figure: storage device evolution compared on 4K read latency, IOPS, endurance, and throughput - HDDs (~ms latency, sub-100 MB/s), SATA/SAS SSDs (100s of µs, ~100s MB/s, <10 drive writes/day, 10s of K IOPS), and PCIe NVMe SSDs (10s of µs, GB/s, >10 drive writes/day, 100s of K IOPS); also noted: PCI Express® (PCIe), NVM Express™ (NVMe), 3D XPoint™ DIMMs, and 3D XPoint NVM SSDs]
NVM plays a key role in delivering performance for latency-sensitive workloads
Ceph Workloads
[Figure: Ceph workload map plotting storage performance (IOPS, throughput) against storage capacity (PB). Workloads shown span block (e.g., boot volumes, remote disks, VDI, databases, app storage) and object (e.g., CDN, enterprise Dropbox, backup/archive, mobile content depot) interfaces, with big data, test & dev, cloud DVR, and HPC also placed on the map; the NVM focus area covers the higher-performance workloads such as databases]
Ceph - NVM Usages
[Diagram: where NVM fits in a Ceph deployment. Clients reach RADOS either from a virtual machine (application → Qemu/Virtio → RBD/RADOS in the hypervisor) or from bare metal (application → kernel RBD driver → RADOS), both speaking the RADOS protocol over 10GbE to a RADOS node running the OSD daemon with a journal and Filestore on a local file system. NVM usages highlighted: client-side caching with write-through, OSD journaling, read cache, and OSD data]
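To make the user-space (librbd) client path above concrete, here is a minimal Python sketch using the standard rados/rbd bindings. The pool name, image name, and the client-side cache settings are illustrative assumptions, not the configuration used in these tests.

```python
import rados
import rbd

# Connect through librbd's user-space path; the cache options illustrate the
# "client caching w/ write-through" usage called out in the diagram.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf',
                      conf={'rbd_cache': 'true',
                            'rbd_cache_writethrough_until_flush': 'true'})
cluster.connect()
try:
    ioctx = cluster.open_ioctx('rbd')            # pool name (assumption)
    try:
        image = rbd.Image(ioctx, 'vm-disk')      # existing image (assumption)
        try:
            image.write(b'\x00' * 4096, 0)       # 4K write at offset 0
            data = image.read(0, 4096)           # 4K read back
        finally:
            image.close()
    finally:
        ioctx.close()
finally:
    cluster.shutdown()
```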
Cassandra – What and Why?
[Diagram: Cassandra ring - a client connects to a cluster of peer nodes, each owning a set of token-range partitions (p1, p2, ...)]
• Cassandra is a column-oriented NoSQL database with a CQL interface
 Each row has a unique key, which is used for partitioning
 No relations
 A row can have multiple columns - not necessarily the same number of columns
• Open source, distributed, decentralized, highly available, linearly scalable, multi-DC, ...
• Used for analytics, real-time insights, fraud detection, IoT/sensor data, messaging, etc.
Use cases: http://www.planetcassandra.org/apachecassandra-use-cases/
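As a concrete illustration of the CQL interface and key-based partitioning described above, here is a minimal sketch using the DataStax Python driver; the contact point, keyspace, and table are hypothetical.

```python
from datetime import datetime
from cassandra.cluster import Cluster

cluster = Cluster(['10.0.0.1'])          # contact point (placeholder)
session = cluster.connect()

# The partition key (user_id) determines which ring partition owns a row.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.events (
        user_id    text,
        event_time timestamp,
        payload    text,
        PRIMARY KEY (user_id, event_time)
    )
""")

# Rows are addressed by key; different rows may populate different columns.
session.execute(
    "INSERT INTO demo.events (user_id, event_time, payload) VALUES (%s, %s, %s)",
    ("user-42", datetime.utcnow(), "login"),
)
cluster.shutdown()
```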
• Ceph is a popular open source unified storage platform
• Many large-scale Ceph deployments are in production
• End customers prefer converged infrastructure that supports multiple workloads (e.g., analytics) to achieve CapEx and OpEx savings
• Several customers are asking for Cassandra workloads on Ceph
Ceph and Cassandra Integration
[Diagram: three Cassandra guest VMs (application and Cassandra → RBD/RADOS via Qemu/Virtio in the hypervisor) connected over an IP fabric to a Ceph storage cluster of SSD-backed OSD nodes and monitors (MONs)]
Deployment Considerations
• Bootable Ceph volumes (OS & Cassandra data)
• Cassandra RBD data volumes (see the provisioning sketch below)
• Data protection (Cassandra or Ceph)
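A minimal sketch of the first two considerations - provisioning a bootable OS volume and a separate Cassandra data volume as RBD images - using the Python rbd binding; the pool, image names, and sizes are assumptions. In a real deployment these images would then be attached to the guest VM through Qemu/Virtio.

```python
import rados
import rbd

GiB = 1024 ** 3

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ioctx = cluster.open_ioctx('rbd')                    # pool name (assumption)
    try:
        r = rbd.RBD()
        r.create(ioctx, 'cassandra-vm1-boot', 40 * GiB)  # OS and Cassandra binaries
        r.create(ioctx, 'cassandra-vm1-data', 500 * GiB) # Cassandra data directory
    finally:
        ioctx.close()
finally:
    cluster.shutdown()
```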
Hardware Environment Overview
[Diagram: a single 10Gbps Ceph network (192.168.142.0/24) connects a CBT/Zabbix monitoring host and six FIO RBD client systems to the Ceph storage cluster]
• OSD System Config: Intel Xeon E5-2699 v3 2x@ 2.30 GHz, 72 cores w/ HT, 96GB, Cache 46080KB, 128GB DDR4
• Each system with 4x P3700 800GB NVMe, partitioned into 4 OSDs each, 16 OSDs total per node
• FIO Client Systems: Intel Xeon E5-2699 v3 2x@ 2.30 GHz, 72 cores w/ HT, 96GB, Cache 46080KB, 128GB DDR4
• Ceph v0.94.3 Hammer Release, CentOS 7.1, 3.10-229 Kernel, Linked with JEMalloc 3.6
• CBT used for testing and data acquisition
• Single 10GbE network for client & replication data transfer, Replication factor 2
[Diagram: five SuperMicro 1028U OSD nodes, each with dual Intel Xeon E5 v3 18-core CPUs and 4x Intel P3700 NVMe PCIe flash drives hosting CephOSD1 through CephOSD16; the six FIO RBD clients and the monitoring host run on FatTwin chassis (4x dual-socket Xeon E5 v3). NVMe drives are front-mounted and easily serviceable]
Multi-partitioning flash devices
• High-performance NVMe devices are capable of high parallelism at low latency
• DC P3700 800GB raw performance: 460K read IOPS and 90K write IOPS at QD=128
• By using multiple OSD partitions, Ceph performance scales linearly
• Reduces lock contention within a single OSD process
• Lower latency at all queue depths, with the biggest impact on random reads
• Introduces the concept of multiple OSDs on the same physical device
• CRUSH map data placement rules are conceptually similar to managing disks in an enclosure
• High resiliency of “Data Center” class NVMe devices
• At least 10 drive writes per day
• Power-loss protection, full data path protection, device-level telemetry
[Diagram: one NVMe device (NVMe1) partitioned into four OSDs (CephOSD1-CephOSD4). A partitioning sketch follows the disclaimer below]
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Any difference in system hardware or
software design or configuration may affect actual performance. See configuration slides in backup for details on software configuration and test benchmark
parameters.
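Below is a hedged sketch of carving one NVMe device into four OSD data partitions with sgdisk, matching the "4 OSDs per P3700" layout described above. The device path, partition sizes, and partition names are assumptions; the subsequent OSD provisioning step (ceph-disk in the Hammer era) is only indicated in a comment.

```python
import subprocess

DEVICE = "/dev/nvme0n1"   # hypothetical device path; adjust for your system
NUM_PARTITIONS = 4        # four OSD partitions per 800GB P3700, as in this study

# Destroy any existing partition table (destructive!), then create four
# roughly equal GPT partitions; each one will back a separate OSD.
subprocess.run(["sgdisk", "--zap-all", DEVICE], check=True)
for i in range(1, NUM_PARTITIONS + 1):
    subprocess.run(
        ["sgdisk", f"--new={i}:0:+186G", f"--change-name={i}:osd-data-{i}", DEVICE],
        check=True,
    )

# Each partition (/dev/nvme0n1p1 .. p4) is then prepared as an OSD,
# e.g. with ceph-disk/ceph-deploy on Hammer, or ceph-volume on newer releases.
```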
Partitioning multiple OSDs per NVMe
• Multiple OSDs per NVMe result in higher performance, lower latency, and better CPU utilization
[Chart: latency vs. IOPS for 4K random reads with 1, 2, and 4 OSDs per NVMe device - 5 nodes, 20/40/80 OSDs, Intel DC P3700, dual-socket Xeon E5 2699v3, 128GB RAM, 10GbE, Ceph 0.94.3 w/ JEMalloc]
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Any difference in system hardware or
software design or configuration may affect actual performance. See configuration slides in backup for details on software configuration and test benchmark
parameters.
[Chart: single-node CPU utilization for 4K random reads at QD=32 with 1, 2, and 4 OSDs per NVMe (4/8/16 OSDs per node) - Intel DC P3700, dual-socket Xeon E5 2699v3, 128GB RAM, 10GbE, Ceph 0.94.3 w/ JEMalloc]
4K Random Read & Write Performance Summary
First Ceph cluster to break 1 Million 4K random IOPS
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Any difference in system hardware or
software design or configuration may affect actual performance. See configuration slides in backup for details on software configuration and test benchmark
parameters.
Workload Pattern                                   Max IOPS
4K 100% Random Reads (2TB Dataset)                 1.35 Million
4K 100% Random Reads (4.8TB Dataset)               1.15 Million
4K 100% Random Writes (4.8TB Dataset)              200K
4K 70%/30% Read/Write OLTP Mix (4.8TB Dataset)     452K
4K Random Read & Write Performance and Latency
First Ceph cluster to break 1 Million 4K random IOPS, at ~1ms response time
[Chart: IO-depth scaling, latency vs. IOPS for 100% 4K random read, 100% 4K random write, and a 70/30% 4K random OLTP mix - 5 nodes, 60 OSDs, dual-socket Xeon E5 2699v3, 128GB RAM, 10GbE, Ceph 0.94.3 w/ JEMalloc. Callouts: 1M 100% 4K random read IOPS at ~1.1ms; 400K 70/30% (OLTP) 4K random IOPS at ~3ms; 171K 100% 4K random write IOPS at 6ms; 1.35M 4K random read IOPS with a 2TB hot data set]
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Any difference in system hardware or
software design or configuration may affect actual performance. See configuration slides in backup for details on software configuration and test benchmark
parameters.
Sequential performance (512KB)
• With 10GbE per node, both writes and reads reach line rate, bottlenecked by the single network interface on each OSD node (a line-rate sanity check follows the disclaimer below).
• Higher throughput would be possible through NIC bonding or 40GbE connectivity.
[Chart: 512K sequential bandwidth - 5 nodes, 80 OSDs, DC P3700, dual-socket Xeon E5 2699v3, 128GB RAM, 10GbE, Ceph 0.94.3 w/ JEMalloc: 3,214 MB/s for 100% write, 5,888 MB/s for 100% read, 5,631 MB/s for a 70/30% R/W mix]
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Any difference in system hardware or
software design or configuration may affect actual performance. See configuration slides in backup for details on software configuration and test benchmark
parameters.
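As a back-of-the-envelope check on the line-rate claim above, the sketch below estimates the aggregate 10GbE ceilings for reads and for 2x-replicated writes. It ignores protocol overhead and assumes perfectly balanced traffic, so it is only meant to show that the measured numbers sit near the network limits.

```python
nodes = 5
link_gbps = 10                          # one 10GbE interface per OSD node
line_rate = link_gbps * 1000 / 8        # ~1250 MB/s per node (ignoring overhead)

# Reads: data crosses each OSD node's link once on its way to the clients.
read_ceiling = nodes * line_rate        # ~6250 MB/s vs. 5,888 MB/s measured

# Writes: with 2x replication on a single shared network, every client write
# also arrives a second time as a replica copy, roughly halving the
# client-visible ceiling.
write_ceiling = read_ceiling / 2        # ~3125 MB/s vs. 3,214 MB/s measured

print(f"read ceiling ~{read_ceiling:.0f} MB/s, write ceiling ~{write_ceiling:.0f} MB/s")
```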
Cassandra-like workload
242K IOPS at < 2ms latency
• Based on a typical customer Cassandra workload profile
• 50% reads and 50% writes, predominantly 8K reads and 12K writes, FIO queue depth = 8 (a hedged fio sketch of this profile follows the disclaimer below)
[Charts: IO-size breakdown - reads are 78% 8K, 19% 5K, and 3% 7K; writes are predominantly 12K (92%), with the remainder split across 33K, 115K, 50K, and 80K sizes - plus IOPS and latency for the 50/50 read/write mix on 5 nodes, 80 OSDs, dual-socket Xeon E5 2699v3, 128GB RAM, 10GbE, Ceph 0.94.3 w/ JEMalloc]
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Any difference in system hardware or
software design or configuration may affect actual performance. See configuration slides in backup for details on software configuration and test benchmark
parameters.
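Below is a hedged sketch of how such a profile could be expressed as an fio job using the librbd engine and per-direction block-size splits. The pool/image names, runtime, and the exact write-side split are placeholders, not the job file used for the published numbers.

```python
import subprocess
import textwrap

# 50/50 random read/write at queue depth 8, with block sizes approximating
# the IO-size breakdown above (write-side remainder is an assumption).
job = textwrap.dedent("""\
    [global]
    ioengine=rbd
    clientname=admin
    pool=rbd
    rbdname=cassandra-like
    rw=randrw
    rwmixread=50
    iodepth=8
    time_based=1
    runtime=300
    norandommap=1

    [cassandra-like-mix]
    ; reads: mostly 8K; writes: mostly 12K
    bssplit=8k/78:5k/19:7k/3,12k/92:33k/8
""")

with open("cassandra_like.fio", "w") as f:
    f.write(job)

subprocess.run(["fio", "cassandra_like.fio"], check=True)
```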
Summary & Conclusions
• Flash technology including NVMe enables new performance capabilities in small
footprints
• Ceph and Cassandra provide a compelling case for feature-rich converged storage that can support latency-sensitive analytics workloads
• Using the latest standard high-volume servers and Ceph, you can now build an open, high-density, scalable, high-performance cluster that can handle a low-latency mixed workload.
• Ceph performance improvements over recent releases are significant, and today
over 1 Million random IOPS is achievable in 5U with ~1ms latency.
• Next steps:
• Address small block write performance, limited by Filestore backend
• Improve long tail latency for transactional workloads
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Any difference in system hardware or
software design or configuration may affect actual performance. See configuration slides in backup for details on software configuration and test benchmark
parameters.
Thank you!
Configuration Detail – ceph.conf
Columns: Section, Perf. tuning group, Parameter, Default, Tuned
[global]
Authentication
auth_client_required cephx none
auth_cluster_required cephx none
auth_service_required cephx none
Debug logging
debug_lockdep 0/1 0/0
debug_context 0/1 0/0
debug_crush 1/1 0/0
debug_buffer 0/1 0/0
debug_timer 0/1 0/0
debug_filer 0/1 0/0
debug_objector 0/1 0/0
debug_rados 0/5 0/0
debug_rbd 0/5 0/0
debug_ms 0/5 0/0
debug_monc 0/5 0/0
debug_tp 0/5 0/0
debug_auth 1/5 0/0
debug_finisher 1/5 0/0
debug_heartbeatmap 1/5 0/0
debug_perfcounter 1/5 0/0
debug_rgw 1/5 0/0
debug_asok 1/5 0/0
debug_throttle 1/1 0/0
Configuration Detail – ceph.conf (continued)
Columns: Section, Perf. tuning group, Parameter, Default, Tuned
[global]
CBT specific
mon_pg_warn_max_object_skew 10 10000
mon_pg_warn_min_per_osd 0 0
mon_pg_warn_max_per_osd 32768 32768
osd_pg_bits 8 8
osd_pgp_bits 8 8
RBD cache rbd_cache true true
Other
mon_compact_on_trim true false
log_to_syslog false false
log_file /var/log/ceph/$name.log /var/log/ceph/$name.log
perf true true
mutex_perf_counter false true
throttler_perf_counter true false
[mon] CBT specific
mon_data /var/lib/ceph/mon/ceph-0 /home/bmpa/tmp_cbt/ceph/mon.$id
mon_max_pool_pg_num 65536 166496
mon_osd_max_split_count 32 10000
[osd]
Filestore parameters
filestore_wbthrottle_enable true false
filestore_queue_max_bytes 104857600 1048576000
filestore_queue_committing_max_bytes 104857600 1048576000
filestore_queue_max_ops 50 5000
filestore_queue_committing_max_ops 500 5000
filestore_max_sync_interval 5 10
filestore_fd_cache_size 128 64
filestore_fd_cache_shards 16 32
filestore_op_threads 2 6
Mount parameters
osd_mount_options_xfs rw,noatime,inode64,logbsize=256k,delaylog
osd_mkfs_options_xfs -f -i size=2048
Journal parameters
journal_max_write_entries 100 1000
journal_queue_max_ops 300 3000
journal_max_write_bytes 10485760 1048576000
journal_queue_max_bytes 33554432 1048576000
Op tracker osd_enable_op_tracker true false
OSD client
osd_client_message_size_cap 524288000 0
osd_client_message_cap 100 0
Objecter
objecter_inflight_ops 1024 102400
objecter_inflight_op_bytes 104857600 1048576000
Throttles ms_dispatch_throttle_bytes 104857600 1048576000
OSD number of threads
osd_op_threads 2 32
osd_op_num_shards 5 5
osd_op_num_threads_per_shard 2 2
Configuration Detail - CBT YAML File
cluster:
user: "bmpa"
head: "ft01"
clients: ["ft01", "ft02", "ft03", "ft04", "ft05", "ft06"]
osds: ["hswNode01", "hswNode02", "hswNode03", "hswNode04", "hswNode05"]
mons:
ft02:
a: "192.168.142.202:6789"
osds_per_node: 8
fs: xfs
mkfs_opts: '-f -i size=2048 -n size=64k'
mount_opts: '-o inode64,noatime,logbsize=256k'
conf_file: '/home/bmpa/cbt/ceph_nvme_2partition_5node_hsw.conf'
use_existing: False
rebuild_every_test: False
clusterid: "ceph"
iterations: 1
tmp_dir: "/home/bmpa/tmp_cbt"
pool_profiles:
2rep:
pg_size: 4096
pgp_size: 4096
replication: 2
Configuration Detail - CBT YAML File (Continued)
benchmarks:
librbdfio:
time: 300
ramp: 600
vol_size: 81920
mode: ['randrw']
rwmixread: [0, 70, 100]
op_size: [4096]
procs_per_volume: [1]
volumes_per_client: [10]
use_existing_volumes: False
iodepth: [4, 8, 16, 32, 64, 96, 128]
osd_ra: [128]
norandommap: True
cmd_path: '/usr/bin/fio'
pool_profile: '2rep'
log_avg_msec: 250
Storage Node Diagram
Two CPU Sockets: Socket 0 and Socket 1
 Socket 0
• 2 NVMes
• Intel X540-AT2 (10Gbps)
• 64GB: 8x 8GB 2133 DIMMs
 Socket 1
• 2 NVMes
• 64GB: 8x 8GB 2133 DIMMs
Explore additional optimizations using cgroups and IRQ affinity (a hedged CPU-affinity sketch follows).
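As one hedged illustration of the cgroups/affinity idea, the sketch below pins running ceph-osd daemons to the CPU socket local to their NVMe drive using taskset. The CPU enumeration, the OSD-to-socket mapping, and the process lookup are all assumptions about this particular system.

```python
import subprocess

# Hypothetical CPU enumeration for a dual-socket 18-core (HT) node:
# socket 0 -> cores 0-17 plus HT siblings 36-53, socket 1 -> the rest.
SOCKET_CPUS = {0: "0-17,36-53", 1: "18-35,54-71"}

# Hypothetical mapping of OSD ids to the socket local to their NVMe drive.
OSD_SOCKET = {0: 0, 1: 0, 2: 1, 3: 1}

for osd_id, socket in OSD_SOCKET.items():
    # Find a running ceph-osd daemon for this id (lookup is simplified).
    pid = subprocess.check_output(
        ["pgrep", "-f", f"ceph-osd -i {osd_id} "], text=True).split()[0]
    # Pin the daemon to the CPUs of its local socket.
    subprocess.run(["taskset", "-pc", SOCKET_CPUS[socket], pid], check=True)
```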
High Performance Ceph Node Hardware Building Blocks
• Generally available server designs built for high density and high performance
• High-density 1U standard high-volume server
• Dual-socket 3rd generation Xeon E5 (2699v3)
• 10 front-removable 2.5" form-factor drive slots, SFF-8639 connector
• Multiple 10Gb network ports, additional slots for 40Gb networking
• Intel DC P3700 NVMe drives are available in a 2.5" drive form factor
• Allowing easier service in a datacenter environment