Optimizing Ceph Performance by Leveraging Intel® Optane™ and 3D NAND TLC SSDs
Kenny Chang, Storage Solution Architect, kenny.chang@intel.com
July, 2017
Agenda
• Ceph* configuration with Intel® Non-Volatile Memory Technologies
• 2.8M IOPS Ceph* cluster with Intel® Optane™ SSDs + Intel® 3D TLC SSDs
• Ceph* performance analysis on an Intel® Optane™ SSD-based all-flash array
• Summary
Intel® 3D NAND SSDs and Optane™ SSDs Transform Storage
Expand the reach of Intel® SSDs. Deliver disruptive value to the data center.
Refer to appendix for footnotes
Capacity for less. Performance for less. Optimized storage.
Innovation for Cloud Storage: Intel® Optane™ + Intel® 3D NAND SSDs
New storage infrastructure enabling high-performance, cost-effective storage for
journal/log/cache and data
OpenStack/Ceph:
‒ Intel® Optane™ as Journal/Metadata/WAL (best write performance, lowest latency and best QoS)
‒ Intel® 3D NAND TLC SSD as data store (cost-effective capacity)
‒ Best IOPS/$, IOPS/TB and TB/rack
Ceph node (yesterday): 4x P3520 2TB data drives + 1x P3700 U.2 800GB journal drive.
Transition to Ceph node (today): 8x 3D NAND P4500 4TB data drives + 1x 3D XPoint™ Intel® Optane™ P4800X (375GB) for journal/metadata/WAL.
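To make the transition concrete, here is a minimal provisioning sketch for a node of this shape: carve the P4800X into per-OSD DB and WAL partitions and attach each P4500 as a BlueStore data device. Device names, partition sizes, one OSD per drive, and the ceph-volume invocation are all assumptions for illustration, not the exact setup used later in this deck.

# Hedged provisioning sketch: one Optane P4800X (assumed /dev/nvme0n1) serving
# RocksDB DB + WAL for eight P4500 data drives (assumed /dev/nvme1n1 .. /dev/nvme8n1).
# Partition sizes and device names are illustrative; the tested cluster also ran
# two OSDs per P4500 rather than one.
set -e

parted -s /dev/nvme0n1 mklabel gpt
start=1
for i in $(seq 0 7); do                              # 8x ~30 GiB DB partitions
  parted -s /dev/nvme0n1 mkpart db-$i ${start}MiB $((start + 30720))MiB
  start=$((start + 30720))
done
for i in $(seq 0 7); do                              # 8x ~2 GiB WAL partitions
  parted -s /dev/nvme0n1 mkpart wal-$i ${start}MiB $((start + 2048))MiB
  start=$((start + 2048))
done

# One BlueStore OSD per data drive, DB/WAL on the Optane partitions
# (ceph-volume syntax is Luminous-era; adapt to your deployment tool).
for i in $(seq 0 7); do
  ceph-volume lvm create --bluestore \
    --data /dev/nvme$((i + 1))n1 \
    --block.db /dev/nvme0n1p$((i + 1)) \
    --block.wal /dev/nvme0n1p$((i + 9))
done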
Ceph* on all-flash array
Storage providers are struggling to achieve the required high performance
 There is a growing trend for cloud providers to adopt SSDs
– CSPs who want to build an EBS-like service for their OpenStack*-based public/private cloud
Strong demand to run enterprise applications
 For OLTP workloads running on Ceph, tail latency is critical
 A high-performance, multi-purpose Ceph cluster is a key advantage
 Performance is still an important factor
SSD prices continue to decrease
Ceph* performance trend with SSD – 4K Random Write
38x performance improvement in Ceph all-flash array! (Chart annotations: 1.98x, 3.7x, 1.66x, 1.23x, 1.19x.)
Who is using Ceph?
Searchable examples: http://www.slideshare.net/inktank_ceph
Telecom, CSP/IPDC, OEM/ODM, and enterprises (FSI, healthcare, retailers)
Suggested Configurations for Ceph* Storage Node
Standard/good (baseline):
Use cases/applications that need high-capacity storage with high throughput performance
 NVMe*/PCIe* SSD for journal + caching, HDDs as OSD data drives
Better IOPS:
Use cases/applications that need higher performance, especially for throughput, IOPS and SLAs, with medium storage capacity requirements
 NVMe/PCIe SSD as journal, high-capacity SATA SSDs as data drives
Best performance:
Use cases/applications that need the highest performance (throughput and IOPS) and low latency/QoS (Quality of Service)
 All NVMe/PCIe SSDs
More information at Ceph.com (new reference architectures to be updated soon!)
http://tracker.ceph.com/projects/ceph/wiki/Tuning_for_All_Flash_Deployments
Ceph* storage node --Good
CPU Intel(R) Xeon(R) CPU E5-2650v4
Memory 64 GB
NIC Dual 10GbE
Disks 1x 1.6TB P3700 + 12 x 4TB HDDs (1:12 ratio)
P3700 as Journal and caching
Caching software Intel(R) CAS 3.0, option: Intel(R) RSTe/MD4.3
Ceph* Storage node --Better
CPU Intel(R) Xeon(R) CPU E5-2690v4
Memory 128 GB
NIC Dual 10GbE
Disks 1x Intel(R) DC P3700 (800GB) + 4x Intel(R) DC S3510 1.6TB
Or 1x Intel® P4800X (375GB) + 8x Intel® DC S3520 1.6TB
Ceph* Storage node --Best
CPU Intel(R) Xeon(R) CPU E5-2699v4
Memory >= 128 GB
NIC Dual 40GbE
Disks 1x Intel® P4800X (375GB) + 8x Intel® DC P4500 4TB
Ceph* All Flash Optane configuration
8x Client Node
• Intel® Xeon™ processor E5-2699 v4 @ 2.3GHz,
64GB mem
• 1x X710 40Gb NIC
8x Storage Node
• Intel Xeon processor E5-2699 v4 @ 2.3 GHz
• 256GB Memory
• 1x 400GB SSD for OS
• 1x Intel® Optane™ DC P4800X 375GB SSD as WAL and DB
• 8x 4.0TB Intel® SSD DC P4500 as data drives
• 2 OSD instances on each P4500 SSD
• Ceph 12.0.0 with Ubuntu 16.10
Test Environment
8 fio client nodes (1x 40Gb NIC each) drive 8 Ceph storage nodes CEPH1–CEPH8 (2x 40Gb NIC each); one storage node also hosts the MON, and each storage node runs 16 OSDs (OSD1–OSD16).
Workloads
• fio with librbd (a sample job file is sketched below)
• 20x 30 GB volumes per client
• 4 test cases: 4K random read & write; 64K sequential read & write
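As a reference point, a minimal fio job for the 4K random-write case with the librbd engine might look like the sketch below. The pool and image names, queue depth, and runtime are illustrative assumptions rather than the exact parameters behind the results that follow.

# Hedged sketch of the 4K random-write case with fio's librbd engine.
# Pool/image names, queue depth and runtime are illustrative; volumes are pre-created.
cat > rbd-4k-randwrite.fio <<'EOF'
[global]
ioengine=rbd
clientname=admin
pool=rbd
rw=randwrite
bs=4k
iodepth=16
direct=1
time_based=1
runtime=300
group_reporting=1

[vol0]
rbdname=fio-vol-0

[vol1]
rbdname=fio-vol-1
; ... one job section per pre-created 30 GB volume (20 per client in these tests)
EOF

fio rbd-4k-randwrite.fio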
Ceph* Optane Performance overview
 Excellent performance on the Optane cluster
– Random read & write hit a CPU bottleneck

Workload              Throughput   Latency (avg.)   99.99% latency
4K Random Read        2876K IOPS   0.9 ms           2.25 ms
4K Random Write       610K IOPS    4.0 ms           25.435 ms
64K Sequential Read   27.5 GB/s    7.6 ms           13.744 ms
64K Sequential Write  13.2 GB/s    11.9 ms          215 ms
Ceph* Optane Performance – Tunings
• Good node scalability, but poor disk scalability for 4K block workloads (CPU throttled!)
• NUMA-aware placement helps performance significantly (see the pinning sketch below)
• Fine-tune the number of OSDs per node and drives per node
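A hedged sketch of the NUMA pinning idea: discover which socket owns an OSD's NVMe device and NIC, then run that OSD bound to the same node. The device names, interface name, and OSD id are assumptions for illustration.

# Hedged sketch: run an OSD on the NUMA node local to its NVMe drive and NIC.
# Device names, interface name and OSD id are illustrative.
cat /sys/class/nvme/nvme1/device/numa_node        # NUMA node of the data drive
cat /sys/class/net/ens785f0/device/numa_node      # NUMA node of the 40Gb NIC

numactl --cpunodebind=0 --membind=0 \
  /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph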
Ceph* Optane Performance – Performance improvement
• The breakthrough performance of Optane eliminates the WAL & RocksDB bottleneck
• One P4800X or P3700 can serve up to 8x P4500 data drives as both WAL and RocksDB device
Ceph* Optane Performance – Latency improvement
• Significant tail latency improvement with Optane
• 20x latency reduction at the 99.99th percentile
(Chart values: 5.7 ms, 14.7 ms, 18.3 ms, 317.8 ms)
Ceph* Optane Performance Analysis - CPU utilization
• Random read & write performance is throttled by the CPU
• Unbalanced CPU utilization, caused by hyper-threading efficiency for random workloads
• This limits drive scalability for small-block random workloads
• Need to optimize CPU utilization
Ceph* Optane Performance Analysis - CPU profiling for 4K RR
• Perf recorded for 30 seconds
• ceph-osd: 34.1%, tp_osd_tp: 65.6%
• Heavy network messenger (AsyncMessenger) overhead
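The profiling flow behind numbers like these can be reproduced roughly as below; the sampling frequency and the per-thread breakdown via comm sorting are assumptions about methodology, not the exact commands used.

# Hedged sketch: sample every ceph-osd process for ~30 seconds, then break the
# profile down by thread name (ceph-osd worker threads vs. tp_osd_tp, msgr-worker, ...).
perf record -F 99 -g -p "$(pgrep -d, ceph-osd)" -- sleep 30
perf report --sort comm,symbol --stdio | head -50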
Ceph* Optane Performance Analysis - CPU profiling for 4K RR (continued)
Ceph* Optane Performance Analysis - CPU profiling for 4K RR
• The top three consumers within the tp_osd_tp threads are:
• KernelDevice::read
• OSD::ShardedOpWQ::_process
• PrimaryLogPG::do_op
• Perf recorded for 30 seconds
Ceph Optane: Performance improvement
 4K RW performance increased by 7% with Optane with BlueFS buffered IO enabled
 The data set used in the test is small, so most of the metadata could be cached and the P3700 was not a bottleneck
 In a real environment with a large data set, metadata cache misses will lead to many DB-disk reads, which would make the P3700 a bottleneck; bluefs_buffered_io = false can be used to simulate that scenario
– In that case Optane brings a 1.33x performance improvement

4K RW throughput (KIOPS):
                     P3700   Optane
bluefs_bufferedIO     389     418    (+7% with Optane)
bluefs_directIO       322     429    (+33% with Optane)
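One way to set up that simulation is sketched below; applying the option through ceph.conf and restarting the OSDs is one possibility among several, and the file path and systemd target name are assumptions.

# Hedged sketch: disable BlueFS buffered IO to emulate a metadata-cache-miss-heavy
# workload (forces RocksDB reads to hit the DB device directly).
cat >> /etc/ceph/ceph.conf <<'EOF'
[osd]
bluefs_buffered_io = false
EOF

# Restart the OSDs so the setting takes effect (systemd target name assumed).
systemctl restart ceph-osd.target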
Ceph* Optane Performance Analysis - WAL tunings and optimizations
• Key takeaways:
• RocksDB memtable overhead leads to a 50,000 IOPS difference (Tuning2 vs. Tuning3)
• If we don't clean BlueStore WAL data out of RocksDB, RocksDB overhead increases dramatically as the amount of WAL metadata grows
• Use an external WAL device to store WAL data, and write only the WAL metadata into RocksDB (see the configuration sketch after the table)
Tuning | 4K RW IOPS | Comments
Default: baseline (NVMe as DB && WAL drive) | 340,000 | Separate DB && WAL device
Tuning1: DB on NVMe && WAL on ramdisk | 360,000 | Move WAL to ramdisk
Tuning2: Tuning1 + disable RocksDB WAL | 360,000 | RocksDB tuning
Tuning3: Tuning2 + omit WAL in deferred write mode | 410,000 | Don't write WAL in deferred write mode
Tuning4: Tuning1 + write WAL but don't remove it from RocksDB | 240,000 | Write WAL before writing metadata into RocksDB, but don't clean the WAL after data is written to the data device in deferred write mode
Tuning5: Tuning1 + external WAL | 380,000 | Write WAL to an external WAL device, and write only its metadata to RocksDB

Based on a 5-OSD-node cluster
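For orientation, the snippet below sketches how a separate-DB/WAL layout of the baseline or Tuning1/Tuning5 flavor can be expressed with Luminous-era BlueStore options. The device paths are placeholders, the ramdisk idea mirrors Tuning1 only loosely, and these path/size options are consumed when an OSD is created (mkfs), not picked up by a simple restart.

# Hedged sketch: separate DB and WAL devices for BlueStore (paths are placeholders).
# Note: these options are read when the OSD is created (mkfs), not on a plain restart.
cat >> /etc/ceph/ceph.conf <<'EOF'
[osd]
bluestore_block_db_path = /dev/optane/db-$id      # RocksDB on an Optane partition per OSD
bluestore_block_wal_path = /dev/optane/wal-$id    # WAL on its own partition (or a ramdisk, as in Tuning1)
bluestore_block_db_size = 32212254720             # 30 GiB, illustrative
bluestore_block_wal_size = 2147483648             # 2 GiB, illustrative
EOF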
Summary & Next
Summary
• Ceph* is awesome!
• Strong demands for all-flash array Ceph* solutions
• An Optane-based all-flash Ceph* cluster is capable of delivering over 2.8M IOPS with very low latency!
• Let’s work together to make Ceph* more efficient with all-flash array!
Next
• OLTP workloads over AFA Ceph
• Client-side cache on Optane
Call for action
• Participate in the open source community, and different storage projects
• Try our tools – and give us feedback
• CeTune: https://github.com/01org/CeTune
• Virtual Storage Manager: https://01.org/virtual-storage-manager
• COSBench: https://github.com/intel-cloud/cosbench
• Optimize Ceph* for efficient SDS solutions!
Backup
Ceph All Flash Tunings

[global]
pid_path = /var/run/ceph
auth_service_required = none
auth_cluster_required = none
auth_client_required = none
mon_data = /var/lib/ceph/ceph.$id
osd_pool_default_pg_num = 2048
osd_pool_default_pgp_num = 2048
osd_objectstore = bluestore
public_network = 172.16.0.0/16
cluster_network = 172.18.0.0/16
enable experimental unrecoverable data corrupting features = *
bluestore_bluefs = true
bluestore_block_create = false
bluestore_block_db_create = false
bluestore_block_wal_create = false
mon_allow_pool_delete = true
bluestore_block_wal_separate = false
debug objectcacher = 0/0
debug paxos = 0/0
debug journal = 0/0
mutex_perf_counter = True
rbd_op_threads = 4
debug ms = 0/0
debug mds = 0/0
mon_pg_warn_max_per_osd = 10000
debug lockdep = 0/0
debug auth = 0/0
ms_crc_data = False
debug mon = 0/0
debug perfcounter = 0/0
perf = True
debug monc = 0/0
debug throttle = 0/0
debug mds_migrator = 0/0
debug mds_locker = 0/0
[mon]
mon_data = /var/lib/ceph/mon.$id
mon_max_pool_pg_num = 166496
mon_osd_max_split_count = 10000
mon_pg_warn_max_per_osd = 10000
[osd]
osd_data = /var/lib/ceph/mnt/osd-device-$id-data
osd_mkfs_type = xfs
osd_mount_options_xfs = rw,noatime,inode64,logbsize=256k
bluestore_extent_map_shard_min_size = 50
bluefs_buffered_io = true
mon_osd_full_ratio = 0.97
mon_osd_nearfull_ratio = 0.95
bluestore_rocksdb_options = compression=kNoCompression,max_write_buffer_number=32,min_write_buffer_number_to_merge=2,recycle_log_file_num=32,compaction_style=kCompactionStyleLevel,write_buffer_size=67108864,target_file_size_base=67108864,max_background_compactions=31,level0_file_num_compaction_trigger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,num_levels=7,max_bytes_for_level_base=536870912,max_bytes_for_level_multiplier=8,compaction_threads=32,flusher_threads=8
bluestore_min_alloc_size = 65536
osd_op_num_threads_per_shard = 2
osd_op_num_shards = 8
bluestore_extent_map_shard_max_size = 200
bluestore_extent_map_shard_target_size = 100
bluestore_csum_type = none
bluestore_max_bytes = 1073741824
bluestore_wal_max_bytes = 2147483648
bluestore_max_ops = 8192
bluestore_wal_max_ops = 8192
debug rgw = 0/0
debug finisher = 0/0
debug osd = 0/0
debug mds_balancer = 0/0
rocksdb_collect_extended_stats = True
debug hadoop = 0/0
debug client = 0/0
debug zs = 0/0
debug mds_log = 0/0
debug context = 0/0
rocksdb_perf = True
debug bluestore = 0/0
debug bluefs = 0/0
debug objclass = 0/0
debug objecter = 0/0
debug log = 0
ms_crc_header = False
debug filer = 0/0
debug rocksdb = 0/0
rocksdb_collect_memory_stats = True
debug mds_log_expire = 0/0
debug crush = 0/0
debug optracker = 0/0
osd_pool_default_size = 2
debug tp = 0/0
cephx require signatures = False
cephx sign messages = False
debug rados = 0/0
debug journaler = 0/0
debug heartbeatmap = 0/0
debug buffer = 0/0
debug asok = 0/0
debug rbd = 0/0
rocksdb_collect_compaction_stats = False
debug filestore = 0/0
debug timer = 0/0
rbd_cache = False
throttler_perf_counter = False
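If you reuse parts of this tuning file, the admin socket is a quick way to confirm what a running OSD actually picked up; a small example (OSD id and option names chosen for illustration) follows.

# Sketch: confirm a running OSD picked up the intended settings via its admin socket.
ceph daemon osd.0 config get bluefs_buffered_io
ceph daemon osd.0 config show | grep -E 'bluestore_min_alloc_size|osd_op_num_shards'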
Legal notices
Copyright © 2016 Intel Corporation.
All rights reserved. Intel, the Intel logo, Xeon, Intel Inside, and 3D XPoint are trademarks of Intel Corporation in the U.S. and/or
other countries.
*Other names and brands may be claimed as the property of others.
FTC Optimization Notice
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not
unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations.
Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured
by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain
optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product
User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision
#20110804
The cost reduction scenarios described in this document are intended to enable you to get a better understanding of how
the purchase of a given Intel product, combined with a number of situation-specific variables, might affect your future cost
and savings. Nothing in this document should be interpreted as either a promise of or contract for a given level of costs.
Tests document performance of components on a particular test, in specific systems. Differences in hardware, software,
or configuration will affect actual performance. Consult other sources of information to evaluate performance as you
consider your purchase. For more complete information about performance and benchmark results, visit
http://www.intel.com/performance.