Embracing Failure
Self-Healing, Decentralized Resource Management
for Apache CloudStack
John Burwell
Vice President, Software Engineering
john.burwell@shapeblue.com | @john_burwell
@shapeblue #ccceu
 VP of Software Engineering @ ShapeBlue
 Member, Apache CloudStack PMC (June 2013)
 Ran operations and designed automated
provisioning for analytic/virtualization clouds
 Led architectural design and server-side
development of a SaaS physical security
platform
About Me
“ShapeBlue are expert builders of public &
private clouds. They are the leading global
Apache CloudStack integrator & consultancy”
…and we’re hiring!
About ShapeBlue
Bang ups and Hang Ups
Can Happen to You
Derive the normative operation
design from failure recovery
What is a Resource?
[Diagram: the control plane holds the desired state, the devices hold the actual state, and a resource converges the desired state with the actual state]
Eventually, the desired and actual states will be co
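As a minimal sketch of the convergence idea on this slide, the Python below compares a desired state with an actual state and plans the steps that close the gap. The dictionary representation and the names `plan_convergence` and `apply_steps` are assumptions made for illustration; the slides do not specify how state is represented.

```python
def plan_convergence(desired: dict, actual: dict) -> list:
    """Return the ordered steps that would make `actual` match `desired`."""
    steps = []
    for key in sorted(desired):
        if key not in actual:
            steps.append(("create", key, desired[key]))
        elif actual[key] != desired[key]:
            steps.append(("update", key, desired[key]))
    for key in sorted(actual):
        if key not in desired:
            steps.append(("delete", key, None))
    return steps


def apply_steps(actual: dict, steps: list) -> dict:
    """Apply planned steps to a copy of the actual state and return it."""
    result = dict(actual)
    for op, key, value in steps:
        if op == "delete":
            result.pop(key)
        else:
            result[key] = value
    return result
```

Once the states match, planning again yields an empty list, so the same routine can run repeatedly without side effects.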
CloudStack partitions resources
into zones, pods, and clusters
 Resource status information is stale or lost
 Resource definitions conflict with device
state
 Entropy
Failure Modes
Consistency
Availability
Partition Tolerance
Orchestration operations are available and
eventually consistent
... but device modifications must be
consistent.
AP: Orchestration Tier
CP: Automation Control Tier
AP: Desired Resource State
CP: Actual Resource State
AP: Scheduling
CP: State Convergence
Resource Off
Resource Sta
State Transitions
Hoke
 Simple
 Self-contained
 Locality
 Non-persistent
Hoke Design Goals
Runtime Resource View
[Diagram labels: Resource FSM; Management Process; Device; Queue; State Transition (1 to 1); Monitor Process; Resource Offer; Resource Status]
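The speaker notes for this slide describe a finite state machine per resource that tries to converge state and fails fast and loudly when no convergence path exists. A minimal sketch of that behavior follows. The state names, the `TRANSITIONS` table, and the `convergence_path` function are hypothetical; the slides do not list Hoke's actual states.

```python
from collections import deque

# Hypothetical transition table for a VM-like resource (an assumption).
TRANSITIONS = {
    "stopped": {"starting"},
    "starting": {"running", "error"},
    "running": {"stopping"},
    "stopping": {"stopped", "error"},
    "error": set(),
}


class NoConvergencePath(Exception):
    """Raised when the desired state is unreachable from the actual state."""


def convergence_path(actual, desired, transitions=TRANSITIONS):
    """Return the shortest list of states leading from actual to desired."""
    if actual == desired:
        return [actual]
    pending = deque([[actual]])
    seen = {actual}
    while pending:
        path = pending.popleft()
        # Sort successors so the search order is deterministic.
        for nxt in sorted(transitions.get(path[-1], ())):
            if nxt in seen:
                continue
            if nxt == desired:
                return path + [nxt]
            seen.add(nxt)
            pending.append(path + [nxt])
    # Fail fast and loudly instead of retrying forever.
    raise NoConvergencePath(f"no path from {actual!r} to {desired!r}")
```

A breadth-first search is used here only because it is short and returns a shortest path; any path-finding strategy would fit the same interface.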
 An actor represents state and behavior
 Communicate by message passing — each
actor has a dedicated queue or mailbox
 Each actor is allocated a lightweight thread
— implicit lock
Actor Model
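To illustrate the mailbox and implicit lock, here is a minimal actor sketch in Python. It is an assumption-laden illustration, not Hoke's implementation: `CounterActor`, `tell`, and `ask` are hypothetical names, and an ordinary OS thread stands in for the lightweight thread the slide mentions.

```python
import queue
import threading

_STOP = object()  # sentinel message that ends the actor's loop


class CounterActor:
    """Owns a counter that only the actor's own thread ever touches."""

    def __init__(self):
        self._mailbox = queue.Queue()
        self._count = 0
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def tell(self, message):
        """Fire-and-forget: enqueue a message and return immediately."""
        self._mailbox.put((message, None))

    def ask(self, message, timeout=5.0):
        """Enqueue a message and block until the actor replies."""
        reply = queue.Queue(maxsize=1)
        self._mailbox.put((message, reply))
        return reply.get(timeout=timeout)

    def stop(self):
        self._mailbox.put((_STOP, None))
        self._thread.join()

    def _run(self):
        # Messages are handled one at a time, so no explicit lock is needed.
        while True:
            message, reply = self._mailbox.get()
            if message is _STOP:
                return
            if message == "increment":
                self._count += 1
            elif message == "get" and reply is not None:
                reply.put(self._count)
```

Because every mutation happens on the actor's thread, concurrent senders cannot race on `_count`, and there is no lock to clean up if a sender fails.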
 All resources represented in a directed,
acyclic graph
 The root node of the graph is the region, and
partitions are organized in the following manner:
region -> zone -> pod -> cluster
 Each resource is a child of the partition node
that owns it
Resource Graph
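A minimal sketch of such a graph, assuming each node has exactly one owning parent (a tree, which is a special case of a directed acyclic graph). The `Node` class, its fields, and the example names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class Node:
    """A partition (region, zone, pod, cluster) or a resource it owns."""
    name: str
    kind: str
    children: list = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False)

    def add(self, child: "Node") -> "Node":
        """Attach a child to this node and return the child."""
        child.parent = self
        self.children.append(child)
        return child

    def path(self) -> str:
        """Return the names from the root down to this node."""
        parts, node = [], self
        while node is not None:
            parts.append(node.name)
            node = node.parent
        return "/".join(reversed(parts))


def walk(node: Node):
    """Yield the node and all of its descendants, depth first."""
    yield node
    for child in node.children:
        yield from walk(child)
```

Usage: build the hierarchy top down with `add`, then call `path()` on any resource to find the partitions that own it.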
 Google’s resource scheduler
 Transactional shared state model enabling
sophisticated, global decision making
 Supports both high churn and low churn
workloads
 Multiple, pluggable schedulers working in
parallel
Inspiration from Omega
 Two level scheduler
 Resource Offers
 Pessimistic Locking
 Pluggable
 Geared towards high churn workloads
Inspiration from Mesos
 Best Effort shared-state scheduler
 Multiple parallel schedulers distributed by
partition
 Combines allocators and planners
 Pluggable
Hybrid Scheduler
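One possible reading of a best-effort shared-state scheduler is an optimistic commit in the style of Omega: each scheduler plans against a snapshot and the commit is rejected if the shared state changed in the meantime. The sketch below assumes that reading; the slides do not confirm it, and `CellState`, `schedule`, and the host names are hypothetical.

```python
import threading


class CellState:
    """Shared view of free CPUs per host, guarded by a version counter."""

    def __init__(self, free_cpus):
        self._lock = threading.Lock()
        self._free = dict(free_cpus)
        self._version = 0

    def snapshot(self):
        """Return (version, copy of free CPUs) for planning."""
        with self._lock:
            return self._version, dict(self._free)

    def commit(self, seen_version, host, cpus):
        """Claim CPUs only if nothing changed since the snapshot."""
        with self._lock:
            stale = seen_version != self._version
            if stale or self._free.get(host, 0) < cpus:
                return False
            self._free[host] -= cpus
            self._version += 1
            return True


def schedule(cell, cpus, max_attempts=3):
    """Best effort: pick the first host that fits, retrying on conflict."""
    for _ in range(max_attempts):
        version, free = cell.snapshot()
        candidates = [h for h, c in sorted(free.items()) if c >= cpus]
        if not candidates:
            return None
        if cell.commit(version, candidates[0], cpus):
            return candidates[0]
    return None
```

Several schedulers can call `schedule` in parallel; a conflicting commit costs one retry rather than blocking the others while a plan is computed.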
 Partition controllers spawn system VMs for
their child partitions as needed to address
scheduler business and reliability guarantees
 Parent partition controllers monitor the
health of their child partition controllers and
re-spawn as necessary
Auto Scaling, Self Healing
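The monitor and re-spawn behavior can be sketched with heartbeats. This is an illustrative assumption: the slides do not say how health is detected, and `PartitionController`, `heartbeat_timeout`, and the explicit `now` argument (used here so the example is deterministic) are hypothetical.

```python
class PartitionController:
    """Parent controller that watches heartbeats from child controllers."""

    def __init__(self, name, heartbeat_timeout=3.0):
        self.name = name
        self.heartbeat_timeout = heartbeat_timeout
        self.last_seen = {}    # child name -> time of last heartbeat
        self.respawn_log = []  # children re-spawned so far, in order

    def spawn(self, child, now):
        """Start (or restart) a child and record it as healthy at `now`."""
        self.last_seen[child] = now

    def heartbeat(self, child, now):
        """Record that a child reported in at `now`."""
        self.last_seen[child] = now

    def check(self, now):
        """Re-spawn every child whose last heartbeat is too old."""
        respawned = []
        for child, seen in sorted(self.last_seen.items()):
            if now - seen > self.heartbeat_timeout:
                self.spawn(child, now)
                respawned.append(child)
        self.respawn_log.extend(respawned)
        return respawned
```

In a real controller `spawn` would create a system VM; here it only resets the timestamp so the supervision logic can be read on its own.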
 Evaluate implementing the concepts in the
Orleans paper to reduce the number of active
actors required
 Determine the best approach to causality tracking
for state transitions (e.g. version vectors)
 Create a library implementing these concepts
to demonstrate viability, separate
concerns, and test performance
Next Steps
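For readers unfamiliar with version vectors, the sketch below shows the standard comparison that decides whether one state transition causally precedes another or whether the two are concurrent. It illustrates the general technique only; the function names are hypothetical and nothing here reflects a decision made for Hoke.

```python
def increment(vector: dict, node: str) -> dict:
    """Return a copy of the vector with this node's counter advanced."""
    updated = dict(vector)
    updated[node] = updated.get(node, 0) + 1
    return updated


def merge(a: dict, b: dict) -> dict:
    """Return the element-wise maximum of two vectors."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in set(a) | set(b)}


def compare(a: dict, b: dict) -> str:
    """Classify a relative to b: before, after, equal, or concurrent."""
    keys = set(a) | set(b)
    a_ahead = any(a.get(k, 0) > b.get(k, 0) for k in keys)
    b_ahead = any(b.get(k, 0) > a.get(k, 0) for k in keys)
    if a_ahead and b_ahead:
        return "concurrent"
    if a_ahead:
        return "after"
    if b_ahead:
        return "before"
    return "equal"
```

A "concurrent" result is the interesting case for state transitions: neither update saw the other, so the system has to reconcile them rather than simply pick the later one.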
 Gilbert, Seth; Lynch, Nancy. Brewer’s
Conjecture and the Feasibility of Consistent,
Available, Partition-Tolerant Web Services.
2002.
 Schwarzkopf, Malte; Konwinski, Andy; et al.
Omega: Flexible, Scalable Schedulers for Large
Compute Clusters. 2013.
References
 Hindman, Benjamin; Konwinski, Andy; et al.
Mesos: A Platform for Fine-Grained Resource
Sharing in the Data Center. 2011.
 Bernstein, Philip; Bykov, Sergey; et al.
Orleans: Distributed Virtual Actors for
Programmability and Scalability. 2014.
References
Questions
Comments
Thank you

Editor's Notes

  • #2: Live from Alexandria, VA #cloudstackworks. Thank Sebastian and the audience for their accommodation. Time crunch! Follow-up to “How to Run from a Zombie” (June 2013 @ CCC Santa Clara) -> processes fail slowly and quietly. That led down a rabbit hole to resource management. Resource management is the core of the control backplane. Its resilience is critical to the overall reliability of the system. Commodity hardware -> pulling reliability out of expensive, specialized hardware into the control plane. Networks partition, disks and power supplies fail, bugs happen. Derive the normative operation of the system from its failure modes. Distribute the resource management function across the infrastructure to isolate failures. Automatically recognize failures and recover with little to no operator intervention. Churn. Hoke.
  • #3: I have experienced “cloud” from both a developer and operations perspective.
  • #4: We do engineering as well. I am presenting an important initiative for ShapeBlue to expand the workloads supported by CloudStack, and provide world-class infrastructure resilience.
  • #5: Normative operation assumes the system is recovering from a failure. For example, if we start the system with no data, was that the result of a fresh installation or of data corruption? If the system assumes data corruption, the same startup path works in both cases.
  • #6: Highlight the transient (non-persistent) nature of a resource
  • #7: These partitions become the isolation barriers for failures.
  • #8: Not an exhaustive list. They are the most significant.
  • #9: CloudStack must obey the rules …
  • #10: Consistency is not the “C” from ACID. It is atomic consistency applied to the scope of a single request/response operation sequence: “There must exist a total order on all operations such that each operation looks as if it were completed in a single instant.” Linearizable. Availability: “Every request received by a non-failing node in the system must result in a response.” Partition Tolerance: The ability of the system to operate when all messages sent between one or more nodes are lost.
  • #11: Resources require operations be applied in a specific order. In order to prevent race conditions that would cause overcommitment or an invalid state, only one state transition can be applied in any given instant.
  • #12: We have an apparent violation of the CAP Theorem
  • #15: The flexibility of an opaque resource offer abstraction: the what, not the how. We are currently working on the how.
  • #16: I am willing to sacrifice some performance for simplicity and correctness.
  • #17: Each resource has a finite state machine with a single queue and processing thread. We attempt to converge state. If we cannot find a convergence path, we fail fast and loudly.
  • #18: No locking overhead or locks to clean up on failure
  • #28: Slides will be posted to SlideShare shortly.