1
How to Build a
True no Single Point of Failure
Ceph Cluster
周振倫
Aaron JOUE
Founder & CEO
Agenda
• About Ambedded
• Ceph Architecture
• Why Ceph has no single point of failure
• How to build a true no single point of failure Ceph cluster
• High Availability
• Scalable
• How does Ceph Support OpenStack?
• Build a no Single Point of Failure Ceph Cluster
• Build an OpenStack A-Team Taiwan
2
About Ambedded Technology
Y2013: Founded in Taipei, Taiwan, with an office in the National Taiwan University Innovation Center
Y2014-2015: Delivered 2,000+ Gen 1 microservers to partner Cynny for its cloud storage service (9 petabytes of capacity in service to date); demoed at the ARM Global Partner Meeting in Cambridge, UK
Y2016:
• Launched the first-ever Ceph storage appliance powered by the Gen 2 ARM microserver
• Awarded 2016 Best of Interop Las Vegas in the storage category, beating VMware Virtual SAN
Y2017:
• Won the Computex 2017 Best Choice Golden Award
• Mars 200 successfully deployed at tier-1 telecom companies in France and Taiwan
3
4
Ceph is Unified Storage
• RADOS: a reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
• LIBRADOS: a library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
• RBD: a reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
• CephFS: a POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
• RADOS Gateway: a bucket-based REST gateway, compatible with S3 and Swift
[Diagram: apps, hosts/VMs, and clients consume object, block, and file storage on top of RADOS; the cluster runs OSD, MON, and MDS daemons.]
Why Does Ceph Have No Single Point of Failure?
• Data and its replicas are distributed across the disks in the cluster
• CRUSH algorithm: Controlled Replication Under Scalable Hashing
• Objects are distributed across OSDs according to a pre-defined failure domain
• CRUSH rules ensure that copies of data are never stored in the same failure domain
• Self-healing automates data recovery when a server or device fails
• No controller, so no bottleneck limits scalability
• Clients use the cluster maps and a hash calculation to write/read objects directly to/from OSDs (see the sketch below)
• Geo-replication & mirroring
5
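Because there is no central controller, a client that holds the cluster maps reads and writes objects by talking to OSDs directly. As a minimal sketch with the librados Python binding (the pool name and object name here are assumptions for illustration, not part of the original material):

import rados

# connect() fetches the MON and OSD maps, which is all a client needs
# to locate objects itself -- there is no central server in the data path.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

ioctx = cluster.open_ioctx('rbd')           # assumed pool name
ioctx.write_full('demo-object', b'hello')   # written straight to the primary OSD
print(ioctx.stat('demo-object'))            # (size, modification time)
ioctx.close()
cluster.shutdown()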
CRUSH Algorithm & Replication
6
A CRUSH rule ensures that replicated copies of data are located on different server nodes.
The failure domain can be defined as: node, chassis, rack, or data center.
[Diagram: (1) the client gets the cluster map from the MONs, (2) computes the object location, i.e. its placement group and primary OSD, and (3) writes to the primary OSD, which replicates the data to two other OSDs on different nodes.]
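The real mapping uses Ceph's rjenkins hash and the full CRUSH algorithm; the toy Python sketch below only illustrates the two-step idea (object name -> placement group -> one OSD per failure domain), and the hash, layout, and selection rule are simplified assumptions:

import zlib

def object_to_pg(obj_name, pg_num):
    # Stand-in for Ceph's object-name hash (Ceph actually uses rjenkins).
    return zlib.crc32(obj_name.encode()) % pg_num

def pg_to_osds(pg, failure_domains, replicas=3):
    # Pick at most one OSD per failure domain so no two copies share one.
    chosen = []
    for domain in sorted(failure_domains):
        if len(chosen) == replicas:
            break
        osds = failure_domains[domain]
        chosen.append(osds[pg % len(osds)])
    return chosen

domains = {'node-1': [0, 1], 'node-2': [2, 3], 'node-3': [4, 5]}  # hypothetical layout
pg = object_to_pg('rbd_data.1234', pg_num=128)
print(pg, pg_to_osds(pg, domains))  # the first OSD returned acts as the primary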
CRUSH Map of a Large Cluster
7
[Diagram: CRUSH hierarchy with a root bucket at the top, racks (Rack 11, Rack 12, ...) under the root, three nodes per rack (Node 11 through Node 33), and Disk 1 through Disk 8 under each node; the three replicas (1, 2, 3) are placed in different branches of the tree.]
CRUSH Algorithm & Erasure Coding
8
[Diagram: the client computes the object location; the placement group maps to (K+M) OSDs located in different failure domains, holding K data chunks and M coding chunks.]
Example: K+M = 4+2 tolerates the failure of at most 2 OSDs.
The capacity consumed is (K+M)/K of the original data.
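The space overhead is easy to verify with a quick calculation; the raw capacity used below is an assumed figure purely for illustration:

# Assume 96 TB of raw capacity (e.g. 16 x 6 TB disks) for the comparison.
raw_tb = 96
k, m = 4, 2

ec_usable  = raw_tb * k / (k + m)  # 64 TB usable; overhead = (k+m)/k = 1.5x
rep_usable = raw_tb / 3            # 32 TB usable with 3 replicas; overhead = 3x
print(ec_usable, rep_usable)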
9
Self Healing
Auto-detection: the missing copies of data are re-generated automatically.
The re-generated copies are written back to the existing cluster, following the CRUSH rule configured via UVS.
Autonomic! Self-healing activates as soon as the cluster detects that data is at risk; no human intervention is needed.
Auto Balance vs. Auto Scale-Out
[Diagram: before and after adding new nodes: data from the full nodes is rebalanced onto the newly added empty nodes until every node is evenly balanced.]
When a new Mars 200/201 joins the cluster, the total capacity scales out automatically, and the cluster autonomously rebalances data to spread the performance and capacity load.
OSD Self-Heal vs. RAID Re-build
11
Test condition: Microserver Ceph cluster | Disk array
Disk number / capacity: 16 x 10TB HDD OSDs | 16 x 3TB HDDs
Data protection: replica = 2 | RAID 5
Data stored on the failed disk: 3TB | not applicable
Time to re-heal / re-build: 5 hours 10 min | 41 hours
Administrator involvement: re-heal starts automatically | re-build starts only after the failed disk is replaced
Re-heal / re-build rate: 169 MB/s (about 10 MB/s per OSD) | 21 MB/s
Re-heal time vs. total number of disks: more disks -> shorter recovery time | more disks -> longer recovery time
*OSD backfilling configuration left at its default values
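The measured re-heal time is consistent with a simple model: recovery runs on many OSDs in parallel, so the aggregate rate grows with the number of OSDs, while a RAID rebuild is limited by the single replacement disk. A rough estimate using the table's numbers (the perfect-parallelism assumption is a simplification):

per_osd_rate_mb_s = 10   # per-OSD backfill rate from the table
osds = 16                # OSDs participating in recovery
data_tb = 3              # data that lived on the failed disk

aggregate_rate = per_osd_rate_mb_s * osds            # ~160 MB/s cluster-wide
hours = data_tb * 1_000_000 / aggregate_rate / 3600  # MB / (MB/s) -> s -> hours
print(round(hours, 1))   # ~5.2 hours, close to the measured 5 h 10 min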
Build a no Single Point of Failure Ceph Cluster
• Hardware will always fail
• Protect data with software intelligence instead of hardware redundancy
• Minimize the failure domain and make it configurable
12
Issues of Using Single Server Node
with Multiple Ceph OSDs
• Large failure domain: one server failure takes many OSDs down.
• CPU utilization is only 30%-40% when the network is saturated; the bottleneck is the network, not computing.
• Power consumption and heat eat into your budget.
13
1x OSD with 1x Micro Server
[Diagram: three ARM microserver clusters, each with one OSD per microserver and aggregated 4x 10Gb uplinks, compared with three traditional servers, each driving N OSDs over a 20Gb uplink, serving the same clients over the network.]
ARM microserver cluster
- 1 OSD to 1 microserver reduces the failure risk
- Aggregated network bandwidth without a bottleneck
Traditional server
- 1 server to many OSDs means a single server failure has a much larger impact
- CPU utilization is low because the network is the bottleneck
14
Mars 200: 8-Node ARM Microserver Cluster
8x 1.6GHz ARM v7 dual-core microservers
- 2 GB DRAM
- 8 GB flash (system disk)
- 5 Gbps LAN
- < 5 W power consumption
Every node can act as an OSD, MON, MDS, or gateway
Storage devices
- 8x SATA3 HDD/SSD OSDs
- 8x SATA3 journal SSDs
OOB BMC port
Dual uplink switches
- Total 4x 10 Gbps
Hot swappable
- Microserver
- HDD/SSD
- Ethernet switch
- Power supply
15
The Basic High Availability Cluster
16
Scale it out
The Benefits of Using a
1-Node-to-1-OSD Architecture with Ceph
• Minimizes the failure domain to a single OSD
• The MTBF of a microserver is much higher than that of an all-in-one motherboard (MTBF > 120K hours)
• High availability: 15 nines with 3 replicas (see the sketch below)
• High performance: dedicated hardware resources per OSD
  – CPU, memory, network, SATA interface, SSD journal disk
• High bandwidth: aggregated network bandwidth with failover
• Low power consumption (60 W) and cooling cost savings
• 3x 1U chassis form a high-availability cluster
17
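The many-nines figure follows from the fact that data becomes unavailable only if every node holding a replica is down at the same time. A simplified independence model illustrates the effect; the per-node availability below is an assumed number for illustration, not a measured one:

# Assume each microserver is available 99.9% of the time and that the three
# replicas sit on separate nodes that fail independently (an approximation).
node_availability = 0.999
replicas = 3

data_availability = 1 - (1 - node_availability) ** replicas
print(f"{data_availability:.12f}")  # 0.999999999000 -> about nine nines with these inputs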
Ceph Storage Appliance
18
2U, 8 nodes, front-panel disk access
1U, 8 nodes, high density (2017)
RBD Performance Test
19
• 40 VM clients on a Xeon server as load workers, connected to a 10G switch with 2x 10Gbps
• Ceph cluster: 21x SSD OSDs with SSD journals, 3 MONs, 4x 10Gbps uplinks
• 40 RBD images, one per client (see the creation sketch below)
• Run fio from 1 client up to 40 clients and take the maximum unsaturated bandwidth as the aggregated performance
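For reference, the 40 RBD images mapped by the fio clients could be created with the rbd Python binding; the pool name, image names, and size are assumptions for illustration:

import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')  # assumed pool name

rbd_inst = rbd.RBD()
for i in range(40):                # one image per test client
    rbd_inst.create(ioctx, f'bench-{i:02d}', 100 * 1024**3)  # 100 GiB each (assumed size)

ioctx.close()
cluster.shutdown()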
Scale Out Test (SSD)
[Chart: 4K random IOPS scales nearly linearly with the number of OSDs.]
Number of OSDs: 7 | 14 | 21
4K random read IOPS: 62,546 | 125,092 | 187,639
4K random write IOPS: 8,955 | 17,910 | 26,866
20
Network Does Matter
The purpose of this test is to measure the improvement when the uplink bandwidth is increased from 20Gb to 40Gb. Mars 200 has 4x 10Gb uplink ports. The test results show a 42-57% IOPS improvement.
21
Ceph Management GUI
Demo
22
Build an OpenStack A-Team Taiwan
23
晨宇創新 數位無限
The Power of Partnership
24
Aaron Joue
aaron@ambedded.com.tw
Ceph and OpenStack
25
[Diagram: OpenStack services (Keystone, Swift, Cinder, Glance, Nova, Manila, Ceilometer) access the RADOS cluster through LIBRADOS, the RADOS Gateway (librgw), LIBRBD (used by KVM/QEMU via libvirt), and libcephfs; the RADOS cluster itself is made up of OSD, MON, and MDS daemons.]