By Sergey Sverchkov
Software Architect at Altoros
sergey.sverchkov@altoros.com
Taking Cloud to Extremes: Scaled-down, Highly
Available, and Mission-critical Architecture
www.altoros.com
@altoros
Requirements
Solution Requirements
● An IoT healthcare solution:
○ Connect devices and users located at customer sites
○ Thousands of devices
○ Hundreds of customers
○ Collect, process, and visualize device data
Solution Requirements
● Available as a private regional cloud:
○ Operated by a third party
○ Addressing region-specific regulations
○ Serving clients and providing region proximity
● A “scaled-down” version for on-site deployments:
○ Cost-effective
○ Easy remote maintenance
○ Data backup to the regional cloud
[Diagram: the regional cloud linked to local clouds at Customer Facility 1 and Customer Facility 2]
Solution Requirements
● Consider implementation restrictions:
○ Limited resources for on-site deployment
● Review and approval by government agencies:
○ Open source technologies and products
○ Unified architecture for regional and local clouds
Solution Requirements
● High availability and scalability:
○ A hardware and infrastructure platform
○ Cloud services and applications
● Security is essential:
○ VPN connectivity
○ Non-VPN connections should be supported
○ WebSocket, TCP, and HTTP protocols
Implementation
Infrastructure: OpenStack vs. VMware
● VMware vSphere is about virtualization:
○ ESXi is the only supported hypervisor
○ vCenter for management
● OpenStack is about cloud:
○ Storage, network, and compute services
○ Security groups and access control
○ Projects and quotas
○ Supports KVM, ESXi, and QEMU
Infrastructure: OpenStack vs. VMware
● Cost estimation for 5 nodes

VMware component                  License cost, USD
VMware vSphere Standard, 1 CPU    $995
VMware vCenter Server Standard    $4,995

Server                    CPU                     Cost per node, USD
SuperMicro 5038MR-H8TRF   Intel Xeon E5-2620 v2   $1,800

OpenStack              Cost, USD
5 compute nodes        5 * $1,800
3 controller nodes     3 * $1,800
Total                  $14,400

VMware                   Cost, USD
5 ESXi (compute) nodes   5 * $1,800 + 5 * $995
1 vCenter appliance      1 * $4,995
Total                    $18,970
Platform Deployment View
OpenStack Deployment Considerations
● Availability zones:
○ Identical zones for compute and storage services
● Support for VM migration:
○ Use Ceph for volumes and ephemeral disks
○ Keep spare capacity of about one compute node in each zone
● Increase default values in nova.conf:
○ security_groups = 100
○ security_group_rules = 300
○ volumes = 500
○ cpu_overcommit = 4
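A minimal sketch of how that shorthand might map onto real option names (assuming an OpenStack release of this era; quota_security_groups, quota_security_group_rules, and cpu_allocation_ratio are the nova.conf names, while volume quotas are assumed to sit in cinder.conf; verify against your release before applying):

[DEFAULT]
# raised quota defaults
quota_security_groups = 100
quota_security_group_rules = 300
# scheduler CPU overcommit (allocation) ratio
cpu_allocation_ratio = 4.0

# in cinder.conf:
# quota_volumes = 500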
Cloud Services
● Cloud Services—HA support:
○ Cassandra
○ MariaDB Galera
○ RabbitMQ
○ Elasticsearch, Logstash, and Kibana (ELK)
The Application Platform: Cloud Foundry
● For microservices architectures
● Runtime automation
● Organizations, users, spaces, and security groups
● Health checks, load balancing, and scaling
● Runs on AWS, OpenStack, and VMware
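As an illustration of the org/space/security-group model and scaling, a hypothetical CF CLI session (the org, space, group, and app names are made up):

$ cf create-org healthcare
$ cf create-space devices -o healthcare
$ cf target -o healthcare -s devices
$ cf create-security-group device-net asg-rules.json    # asg-rules.json lists allowed egress rules
$ cf bind-security-group device-net healthcare devices
$ cf scale device-api -i 3                              # run three instances of an app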
The Cloud Platform: HA Deployment
Cloud Foundry Planning

Job                  z1   z2   z3   CPU/inst   RAM/inst, GB   RAM total, GB   CPU total
etcd                 1    1    1    1          2              6               3
UAA + CC DB          1    -    -    1          2              2               1
Cloud Controller     1    1    -    1          4              8               2
Doppler              1    1    1    1          1              3               3
Traffic Controller   1    1    -    1          1               2               2
Runners              2    2    2    16         64             384             96

Total for CF jobs: 33 instances, 447 GB RAM, 133 CPUs
(z1-z3: instances per availability zone; totals include CF jobs not shown here)
Cloud Foundry HA Deployment Issues
● CC and UAA databases?
✓ Use BOSH Resurrector
✓ Use external MariaDB Galera
● BOSH Director?
✓ Plan BOSH VM recovery
● Blob store?
✓ Store blobs in OpenStack Swift
BOSH Director Recovery
● You will need:
○ bosh-state.json
○ bosh.yml manifest
○ BOSH persistent disk
● Edit bosh-state.json, keeping only these properties:
○ installation_id
○ current_disk_id
● Re-deploy BOSH and attach the persistent disk:
bosh-init deploy bosh.yml
Total time: around 25 min
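For illustration, the trimmed bosh-state.json keeps only the two properties above (the IDs here are hypothetical):

{
  "installation_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "current_disk_id": "42"
}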
Blob Storage in OpenStack Swift
● Set OpenStack as the provider in the deployment manifest:
properties:
cc:
packages:
app_package_directory_key: cc-packages
fog_connection: &fog_connection
provider: 'OpenStack'
openstack_username: 'cfdeployer'
openstack_api_key: 'ddd3dd23'
openstack_auth_url: 'http://172.30.0.3:5000/v2.0/tokens'
openstack_temp_url_key: '1328d0212'
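The temp URL key must be unique per Cloud Foundry installation and also has to be registered on the Swift account itself; a minimal sketch with the swift CLI (assuming the OS_* auth environment variables point at the same account):

$ swift post -m "Temp-URL-Key:1328d0212"    # sets X-Account-Meta-Temp-URL-Key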
BOSH Resurrection
● Configure resurrection for the database VM:
$ bosh vm resurrection pg_data/0 on
● Measure the approximate time for restoring a VM:
○ 60 sec: agent health-check interval
○ 60 sec: to mark the agent as unresponsive
○ 120 sec: to recreate the VM on OpenStack
○ 60 sec: to initialize
Total: around 5 min.
● When a physical machine is down:
○ The Resurrector recreates all of its VMs in the same AZ
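The Resurrector itself is switched on through the Health Monitor in the BOSH Director manifest; a minimal sketch (property path from the bosh release):

properties:
  hm:
    resurrector_enabled: true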
Cassandra in OpenStack Ceph
Cassandra in OpenStack Ceph: Pros and Cons
● Pros:
○ Automation—all cloud services are in OpenStack.
○ Ceph is distributed and replicated storage.
○ Low cost compared to hardware SAN.
● Cons:
○ The effective replication factor is 6: 2 in Ceph * 3 in Cassandra.
○ Cassandra performance is impacted by network performance.
Testing Cassandra in OpenStack Ceph
● OpenStack configuration:
○ 1 Gb network
○ 1 CPU per node — E5-2630 v3 2.40 GHz
○ 2.0 TB SATA 6.0 Gb/s 7200RPM for Ceph
● Cassandra configuration:
○ Node: 8 vCPUs, 32 GB of RAM
○ 6 nodes in 3 AZs; 2 nodes per AZ
○ SimpleStrategy with a replication factor of 3
○ The cassandra-stress tool (see the sketch below)
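A sketch of the cassandra-stress invocations behind the three workloads on the next slide (the node address, thread count, and operation count are assumptions, not the actual test parameters):

$ cassandra-stress write n=1000000 -schema "replication(strategy=SimpleStrategy,factor=3)" -rate threads=100 -node 10.0.0.11
$ cassandra-stress read n=1000000 -rate threads=100 -node 10.0.0.11
$ cassandra-stress mixed "ratio(write=1,read=1)" n=1000000 -rate threads=100 -node 10.0.0.11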
Testing Cassandra in OpenStack Ceph

Workload                Operations / sec   Avg. latency, ms   Latency 99%, ms   Max. latency, ms
100% writes             47,700             2.8                10.1              3,851.7
100% reads              65,250             2.1                5.5               50.8
50% writes, 50% reads   54,150             2.5                7.1               2,062.1
Cassandra Recommendations
● Cluster and node sizing:
○ Effective data size per node: 3–5 TB
○ Tables in all keyspaces: 500–1,000
○ 30–50% of free space for the compaction process
● DataStax storage recommendations:
○ Use local SSD drives in JBOD mode
Contributions
Altoros’s Contributions to Cloud Foundry
● Cassandra Service Broker for CF (registration sketch after this list):
https://github.com/Altoros/cf-cassandra-broker-release.git
● Improvements to the ELK BOSH release and CF integration:
○ RabbitMQ input, Cassandra output for Logstash
○ Logstash filters
https://github.com/logsearch/logsearch-boshrelease/commits?author=axelaris
https://github.com/cloudfoundry-community/logsearch-for-cloudfoundry/
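For context, registering such a broker follows the standard service-broker flow; a hypothetical session (the broker URL, credentials, and plan name are made up):

$ cf create-service-broker cassandra broker-admin broker-secret https://cassandra-broker.example.com
$ cf enable-service-access cassandra
$ cf create-service cassandra default my-keyspace    # provisions a keyspace through the broker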
Altoros’s Contributions to Other Projects
● Cassandra Web Tool for Developers—run CQL
○ Coming soon in open source!
Questions?
sergey.sverchkov@altoros.com
Thank you!
For more:
altoros.com
altoros.com/research-papers
blog.altoros.com
twitter.com/altoros

Editor's Notes

  • #2: Hello, colleagues. My name is Sergey, and I’m glad to see you at this session. I work as a project manager and software architect at Altoros. Today, I’m going to share the experience our team gained while working on an ongoing healthcare project. The project is about building a highly available solution for customers who operate various medical devices.
  • #3: Let’s take a look at some of the requirements.
  • #4: First of all, what are the business requirements for the solution? We call this system the “Internet of Things for healthcare.” The main idea is to create a Software-as-a-Service solution that lets clients connect medical devices and users in a secure way. The service collects, stores, and visualizes device data. Users will have various dashboards to view data in near real time, and they will also be able to locate and manage devices. Some of the customers are large organizations that operate many facilities and thousands of devices. This IoT solution is expected to dramatically simplify device and user connectivity, making it transparent and unified. The cloud solution should reduce the time to deliver, upgrade, and support healthcare applications for clients.
  • #5: The new solution should serve customers in different geographical regions, providing region proximity. It also makes it possible to address the specific regulations of each region; the rules for the healthcare industry differ across North and South America, Europe, and Asia. Besides the regional cloud, there is a plan to create a scaled-down version of the cloud for on-site deployments, so that the biggest customers who are sensitive to data locality can install the solution and keep all the data inside their data center. This scaled-down version needs to be cost-effective and support remote maintenance in the same way as planned for the regional cloud. As an additional feature, the data stored in the local cloud can be backed up to a regional cloud.
  • #6: If we talk about the two versions of the cloud, the regional cloud and the cloud for on-site deployment, we understand that their architectures must be very similar or identical. First of all, one implementation covering different scales of deployment reduces the time to deliver the solution to the market. You also need to consider a whole range of implementation restrictions; for example, the cloud for on-site deployment has limited resources and a limited budget. It is clear that a healthcare solution must be reviewed and approved by government agencies, so the platform must be based on open source products that can be tested and examined for vulnerabilities. Open components also make it possible to easily extend the functionality of the platform and the products used in the solution, and they make it easier to review all the components and get the necessary approvals.
  • #7: Another set of requirements is related to the availability and security of the solution. High availability is extremely important in healthcare. In our case, it means that all apps and services, as well as the hardware and the infrastructure platform, must be available all the time. The platform deals with very sensitive data, so security is essential. In most cases, customers are connected to the cloud through secure VPN tunnels, but for small customers, the cloud needs to provide connectivity without a firewall. As for communication with the cloud, devices operate over Internet protocols, and support for TCP devices is planned for the near future.
  • #8: OK, now let’s take a look at how the platform is implemented. I won’t go into all the technical details. Instead, I will focus on the infrastructure platform and the cloud services we’ve selected, as well as some of the high availability and scalability aspects. I’ll also share which parts of this project we contributed to the community.
  • #9: Speaking about infrastructure, we were choosing between VMware vSphere and OpenStack, because we had to build a private cloud. We chose OpenStack, because VMware vSphere is about virtualization and management of virtual resources, and all VMware products are licensed and proprietary. In contrast to VMware, OpenStack is open source and includes components for building storage, network, and compute services in the cloud. OpenStack supports multi-tenancy for cloud resources through projects, and it has fine-grained security and access controls. It also supports several hypervisors and can be integrated with the ESXi hypervisor, too.
  • #10: Let’s take a look at this rough cost estimation for an infrastructure platform running VMware and OpenStack. As an example, we are calculating the effective cost for five nodes. We’re using a blade chassis with five compute nodes for virtual machines and storage. The cost is estimated for a SuperMicro chassis with 6-core Intel Xeon CPUs. If we use VMware vSphere, we have to buy licenses for five ESXi hypervisors and vCenter management, so the initial cost will be around $19K. With OpenStack, we need three additional nodes for the OpenStack management services. As you can see, even though OpenStack uses three additional machines, the total cost is less than with VMware. You can use this example for cost estimation when selecting a private infrastructure platform.
  • #11: On the next slide, you can see a high-level deployment view of our OpenStack cloud. It is protected by a firewall that supports VPN tunnels and non-VPN HTTPS connections. At the hardware level, we are using a blade chassis to build a highly available OpenStack. At least three nodes are used for OpenStack management components, and compute services are distributed across three availability zones. This creates redundancy for virtual machines launched by OpenStack. One reason we create three zones is that some components and cloud services require three or more virtual nodes for availability. The OpenStack storage may be distributed across compute nodes, or we can set up separate storage nodes. Additional management services, like DNS, NTP, and the OpenStack deployment tools, run on additional chassis nodes. With this deployment approach, we can scale OpenStack’s computing and storage capacity simply by adding new blades or nodes.
  • #12: What are some of the important OpenStack deployment considerations? First, it’s required to create identical availability zones for compute and storage services. Second, to enable live VM migration, we need to configure OpenStack Ceph for persistent volumes and ephemeral disks, and there should be spare capacity of around one physical node in every availability zone. Third, we recommend increasing the default limits for the number of security groups, security rules, and volumes. It is also important to evaluate the CPU overcommit ratio: the recommended value is 1.5 to 2, but in our tests, we were able to reach a CPU overcommit ratio of 4.
  • #13: Besides the OpenStack platform, we are using a number of other services in the cloud platform. Cassandra is a scalable, redundant, masterless data store; this is where we keep all the device data. MariaDB Galera is our relational database cluster for structured data with low velocity. RabbitMQ provides queueing and messaging for different applications. And Elasticsearch, Logstash, and Kibana serve for application log aggregation and indexing.
  • #14: What about running applications? The solution we are building is based on a microservices architecture, so we need an application platform that will manage microservices effectively. When it comes to microservices, we think Cloud Foundry is by far the best option. It automates up to 90% of all routine work related to application lifecycle management. It is a complete platform that supports traditional application runtime automation as well as Docker containers. And the most important advantage, at least for our customer, was that, with Cloud Foundry, new features and apps can be released a lot faster.
  • #15: So what does it take to distribute the components of the Cloud Foundry platform and the cloud services inside the OpenStack deployment? As I have already said, there are three availability zones, which are actually three groups of physical nodes in the chassis. If we distribute our service instances across the availability zones, we can ensure redundancy at the service level. For example, the MariaDB cluster requires at least three nodes, so for redundancy we place one node in every availability zone. The same approach is applied to RabbitMQ and Cassandra. As for Cloud Foundry, we need to place the components that support HA in at least two zones. We can expect that most of the platform resources in Cloud Foundry are allocated to application runners (the DEA and Diego cells). The runners are deployed to three availability zones, so that the application workload can be distributed evenly across all hardware nodes. Management services can be replicated as well; the approach to replication depends on the specific service. Some services, like DNS and NTP, are treated as mission-critical, so they have two instances on two physical nodes, while other services may have more relaxed HA requirements. OK, let’s see a more detailed plan of resources for a CF deployment.
  • #16: On this slide, you can see how we distributed the configuration of Cloud Foundry across three OpenStack availability zones. Of course, this slide shows only some of the CF jobs, to give the idea of how the planning is done. This planning page helps us calculate the usage of memory and CPU by OpenStack zone and the number of virtual machines for Cloud Foundry. The values in the “Total” row represent the total number of instances, memory, and virtual CPUs; there are also totals calculated for each availability zone. The cells highlighted in yellow are the jobs that we recommend placing in all three availability zones: the service registry, etcd, which should have three instances; the Loggregator Traffic Controller, which is recommended to have at least one instance in every zone; and, as I mentioned, the runners for application containers. Runners are the major resource consumers in any Cloud Foundry deployment. At the same time, there are CF jobs that don’t support a high availability configuration by default, and we need to decide how to recover them in case they fail, or find workarounds. Let’s move on to the next slide and see what can be done.
  • #17: One non-HA piece is the CC and UAA databases. For the databases, we can configure BOSH resurrection or use an external MariaDB Galera cluster. Another non-HA component in the deployment is the BOSH Director, which provides CF and cloud services automation. BOSH is not directly related to the availability of Cloud Foundry and applications, but we need to plan how to recover the virtual machine with the BOSH Director. The last non-HA component is the blobstore: the default NFS blobstore is a single instance, so we can use object storage, for example OpenStack Swift, instead. Let’s take a look at some of the details for these points.
  • #18: So, what does a plan for recovering the BOSH Director virtual machine look like? The approach is quite straightforward. To recover the BOSH Director, we need the BOSH state file, the deployment manifest, and the persistent disk of the BOSH VM. First, we edit the BOSH state file, leaving only several properties. Then we can re-deploy BOSH and attach the persistent disk. In our tests, recovering the BOSH Director VM with this scenario took around 25 minutes. As an alternative, we can use OpenStack’s VM migration functionality if the ephemeral drives are located in OpenStack Ceph storage and can be attached to the new VM in the same way as the persistent disk. In addition, the OpenStack Ceph option for ephemeral drives enables live migration of VMs in OpenStack.
  • #19: To set OpenStack Swift as the blobstore, we need to define the credentials and the URL for connecting to OpenStack, and also set a temporary URL key in the Cloud Foundry deployment manifest. It is very important that the temporary key is unique for every Cloud Foundry installation on OpenStack, if you have, for example, two installations in one OpenStack. And it should work!
  • #20: OK, let’s see the effect of BOSH resurrection. What is important about BOSH resurrection is that it takes around two minutes to mark the agent as unresponsive; after that, the VM is recreated. The total time in our tests ranged from 4 to 6 minutes for a Postgres database instance. This timeframe can be acceptable, because applications that have already been deployed will continue to work, provided we don’t install new applications during this downtime. In our case, we decided to go with this approach. But take the side effect into account: when you stop a physical machine intentionally, the BOSH Resurrector tries to recreate all the VMs hosted on this physical machine in the same OpenStack availability zone, and you should have enough resources in that zone for this process. As an alternative to BOSH resurrection, you can configure an external MariaDB cluster for the CF databases.
  • #21: Let’s take a look at Cassandra storage. In our case, we are using OpenStack Ceph with replication, and the data blocks are distributed among all storage nodes. This means a single data read request triggers several network operations. First, the application calls the Cassandra coordinator node, which is the virtual machine the application is connected to. Second, the Cassandra coordinator contacts the Cassandra data node that stores the requested data row; this Cassandra node runs on a specific compute node in OpenStack. Then the compute node talks to the OpenStack Ceph controller. And, finally, the Ceph controller reads the data blocks from the OpenStack storage nodes.
  • #22: So, what are the pros and cons of running Cassandra in OpenStack Ceph? On the good side: with Ceph, all cloud services are in OpenStack, which simplifies deployment automation and management, because services can be deployed and managed, for example, by BOSH. Ceph is scalable, replicated storage, so the failure of one drive or storage node should not affect the availability of data volumes. And, last but not least, the price of this storage is quite low compared to dedicated hardware SAN systems. Speaking about the cons: in Ceph storage, the additional replication factor of 2 results in a total of 6 replicas of the Cassandra data, if we use the recommended replication factor of 3 in Cassandra. And Cassandra performance depends directly on network performance, so it is recommended to use a 10G or faster network for connecting the OpenStack storage nodes.
  • #23: In our case, we decided to benchmark Cassandra in OpenStack to understand whether it can satisfy our requirements. We used the Cassandra stress test tool on a cluster of six nodes with a simple replication strategy and a factor of 3. The network was 1 Gb. Every Cassandra node was configured with 8 vCPUs and 32 GB of RAM, which is the recommended ratio between CPU and memory for one Cassandra node. The test was conducted with one table, and the approximate test duration was 300 seconds.
  • #24: On this slide, you can see the results of the benchmark. The Cassandra stress test tool measures throughput as the number of operations per second, plus several latencies that show the distribution of response times during the test. We report the number of operations per second and the average, 99th-percentile, and maximum latencies, measured in milliseconds. In terms of deviation, the 99th-percentile and maximum latencies are the interesting figures; they can give you an idea of what should be examined in more detail. This type of test can be executed very quickly after you’ve installed the cluster, and it gives you an insight into what kind of performance you can expect. For example, if your requirement is to serve 10,000 operations per second with an average latency below 10 ms, this Cassandra deployment in OpenStack can meet it. But also remember that Cassandra’s data model and access patterns influence application performance, too.
  • #25: Other recommendations for Cassandra cluster planning: the effective data size per Cassandra node is 3–5 TB; the number of tables in all keyspaces should be less than 1,000 to keep the compaction process effective; and 30 to 50% of disk space should be free for compaction. As for recommended storage options, DataStax recommends running Cassandra on bare metal with local SSD drives.
  • #26: These are some of the technical details from the project that I decided to share with you within our short time frame. In the last part of my presentation, I would like to say a few words about what Altoros contributed to the community from this project. Don’t be surprised: even though we work in an area as restricted as healthcare, we can find ways to spread ideas and experience.
  • #27: During the project, we created a CF service broker for a Cassandra cluster that supports authentication and keyspace provisioning. We update it regularly to accommodate changes in the latest Cassandra versions. We are also continuously improving the ELK stack; specifically, we have added a number of inputs and outputs to Logstash, such as RabbitMQ and Cassandra. In this project, ELK serves as the main storage for all log events. Our team has developed an approach and some Logstash filters to merge the multiple lines of exceptions and stack traces into one message object in Elasticsearch (a sketch follows below). This helps to find and view the full context of any application error in Kibana.
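As an illustration of that merging approach, a minimal Logstash multiline filter of the kind used for stack traces (the pattern is a hypothetical sketch, not the project’s actual filter):

filter {
  multiline {
    # glue indented lines and "Caused by:" lines onto the previous event
    pattern => "^(\s+at |\s+\.\.\.|Caused by:)"
    what => "previous"
  }
}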
  • #28: We also developed a web tool that allows developers who work with Cassandra to view keyspaces and objects, run any valid Cassandra CQL statements, and store them in history. This tool is extremely useful if you need to interact with a Cassandra cluster in a private cloud without access to any of the Cassandra nodes. We were inspired by DataStax DevCenter, a desktop tool for working with a Cassandra cluster over direct connectivity to the cluster nodes. But for a private cloud in OpenStack behind a firewall, DataStax DevCenter doesn’t work, and we needed a web-based tool. Moreover, it’s a Cloud Foundry-ready application.
  • #29: That’s all for this short presentation. I’ll be glad to answer your questions.