Embracing Failure
The art of being at the edge
«Failures are a given, and everything will eventually fail over time»
(Werner Vogels – CTO Amazon)
Change Mindset
Building a reliable application in the cloud is different from building a reliable application in an enterprise setting.
A new mindset is needed.
Eight Fallacies of Distributed Computing
- The network is reliable
- Latency is zero
- Bandwidth is infinite
- The network is secure
- Topology doesn’t change
- There is one administrator
- Transport cost is zero
- The network is homogeneous
Peter Deutsch
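One way to act on the first fallacy ("the network is reliable") is to assume that any remote call can fail transiently and to retry it with exponential backoff. The sketch below is only an illustration: `TransientNetworkError`, `call_with_retry` and `make_flaky` are hypothetical names, not part of any library, and the delays are arbitrary.

```python
import random
import time


class TransientNetworkError(Exception):
    """Hypothetical error standing in for a timeout or a dropped connection."""


def call_with_retry(operation, max_attempts=4, base_delay=0.1, sleep=time.sleep):
    """Call operation(), retrying transient failures with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientNetworkError:
            if attempt == max_attempts:
                raise  # give up and let the caller decide what to do
            # The delay doubles on each attempt; jitter avoids synchronized retries.
            delay = base_delay * (2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay))


def make_flaky(failures_before_success):
    """Build a simulated remote call that fails a fixed number of times first."""
    state = {"calls": 0}

    def operation():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise TransientNetworkError("simulated network drop")
        return "ok after %d calls" % state["calls"]

    return operation


if __name__ == "__main__":
    print(call_with_retry(make_flaky(2)))
```

Retries are only safe when the operation can be repeated without side effects, which is why they are usually paired with idempotent handlers.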
Conway's law
On-premises Application
- Before the cloud, users connected to our applications through the company's local network.
- A server's downtime was planned and involved stopping production.
- Conway’s law model
Modern Application
- Now our users connect through the Internet.
- The workload on our services will increase significantly as the applications themselves become more widespread.
- Many microservices replace the monolith.
Microservices: is it really a matter of size?
We cannot say there is a formal definition of the
microservices architectural style, but we can attempt to
describe what we see as common characteristics for
architectures that fit the label.
Common Characteristics
Componentisation via services
Organised around business capabilities
Decentralised data management
Products not projects
Decentralised governance
Smart endpoints and dumb pipes
Evolutionary design
Infrastructure automation
?????????????
(Martin Fowler, James Lewis)
Microservices: or a question of Business?
Or is it a matter of paradigms?
Sync communication (e.g. HTTP) vs. async communication (e.g. Service Bus)
Reactive Manifesto (16.09.2014)
• (Jonas Bonér, Dave Farley, Roland Kuhn, Martin Thompson)
• The single most important property is responsiveness: a reactive system needs to remain responsive even when a failure occurs.
Responsive
“The system responds in a timely manner if at all possible. Responsiveness is the cornerstone of usability and utility,
but more than that, responsiveness means that problems may be detected quickly and dealt with effectively.”
https://www.reactivemanifesto.org/it
Availability
Availability         Downtime per year    Categories
95% (1-nine)         18 days 6 hours      Batch processing, data extraction, load jobs
99% (2-nines)        3 days 15 hours      Internal tools, project tracking
99.9% (3-nines)      8 hours 45 minutes   Online commerce
99.99% (4-nines)     52 minutes           Video delivery, broadcast systems
99.999% (5-nines)    5 minutes            Telecom industry (ATM transactions)
99.9999% (6-nines)   31 seconds           Answering to my loved one
Availability
The beauty of Math at work!
Components in series (both must be up):
Component             Availability         Downtime
X                     99% (2-nines)        3 days 15 hours
Y                     99.99% (4-nines)     52 minutes
X and Y combined      98.99%               3 days 16 hours 33 minutes

Redundant components in parallel (one healthy copy is enough):
Component             Availability         Downtime
X                     99% (2-nines)        3 days 15 hours
Two X in parallel     99.99% (4-nines)     52 minutes
Three X in parallel   99.9999% (6-nines)   31 seconds
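The figures in these tables follow from two small formulas: availabilities multiply when components depend on each other, and unavailabilities multiply when components are redundant. The sketch below assumes independent failures and a 365-day year; the function names are illustrative, and the computed downtimes may differ by a few minutes from the rounded values on the slide.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def serial(*availabilities):
    """Availability of a chain in which every component must be up."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def parallel(*availabilities):
    """Availability of redundant components where one healthy copy is enough."""
    all_down = 1.0
    for a in availabilities:
        all_down *= (1.0 - a)
    return 1.0 - all_down


def downtime_hours_per_year(availability):
    """Expected yearly downtime in hours for a given availability (0..1)."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    x, y = 0.99, 0.9999
    rows = [
        ("X and Y combined", serial(x, y)),
        ("Two X in parallel", parallel(x, x)),
        ("Three X in parallel", parallel(x, x, x)),
    ]
    for label, a in rows:
        print(f"{label}: {a:.6%}, {downtime_hours_per_year(a):.2f} h/year")
```

The independence assumption matters: redundant copies that share a power supply, a zone, or a bad deployment fail together, and the parallel formula then overstates the result.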
Reactive Manifesto - Resilient
Resilient
• Resilient systems embrace the idea that failures are normal and that it
is perfectly acceptable to run systems in what we call partially failing
mode.
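A partially failing mode can be as simple as serving a response without its optional parts. The sketch below is a hypothetical example, not something taken from the talk: `build_product_page` treats the details call as essential and the recommendations call as optional, and both fetch functions are placeholders supplied by the caller.

```python
def build_product_page(product_id, fetch_details, fetch_recommendations):
    """Assemble a page; only the details call is treated as essential."""
    page = {"details": fetch_details(product_id), "degraded": False}
    try:
        page["recommendations"] = fetch_recommendations(product_id)
    except Exception:
        # The optional dependency is down: serve the page without it.
        page["recommendations"] = []
        page["degraded"] = True
    return page


if __name__ == "__main__":
    def details(product_id):
        return {"id": product_id, "name": "sample product"}

    def broken_recommendations(product_id):
        raise RuntimeError("recommendation service unavailable")

    print(build_product_page(42, details, broken_recommendations))
```

The `degraded` flag is a design choice: it lets monitoring count degraded responses instead of hiding the failure entirely.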
Services resiliency
All Azure management services are architected to be resilient
from region-level failures. In the spectrum of failures, one or
more Availability Zone failures within a region have a smaller
failure radius compared to an entire region failure. Azure can
recover from a zone-level failure of management services
within the region or from another Azure region. Azure
performs critical maintenance one zone at a time within a
region, to prevent any failures impacting customer resources
deployed across Availability Zones within a region.
Azure solution
• Availability Zones
• Zonal services: you pin the resource to a specific zone (for example,
virtual machines, managed disks, Standard IP addresses)
• Zone-redundant services: platform replicates automatically across
zones (for example, zone-redundant storage, SQL Database)
What are Availability Zones in Azure?
Reactive Manifesto - Elastic
Elastic
The degree to which a system is able to
adapt to workload changes by provisioning
and de-provisioning resources in an
autonomic manner, such that at each
point in time the available resources
match the current demand as closely as
possible.
Azure solution
• In the Free and Shared service plans, you cannot scale the application because only one instance is available.
• In the Basic plan, you can scale the application manually: you have to check the metrics yourself to see whether more instances are needed, and then increase or decrease them from the Azure management portal.
• In the Standard and Premium plans, you can choose to autoscale based on a few parameters.
• The code that we use for scripting (PowerShell or bash) is still code, so we have to treat it as such.
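To make "match the current demand as closely as possible" concrete, the sketch below computes a desired instance count from average CPU usage. It is a generic proportional rule written for illustration, not the Azure autoscale API; `desired_instances`, the 60% target and the instance limits are all assumptions.

```python
import math


def desired_instances(current, cpu_percent, target_cpu=60.0,
                      min_instances=1, max_instances=10):
    """Return the instance count that would bring average CPU near target_cpu."""
    if current < 1:
        raise ValueError("current must be at least 1")
    # Total load is roughly current * cpu_percent; spread it over enough instances.
    needed = math.ceil(current * cpu_percent / target_cpu)
    # Never scale below the floor or above the ceiling.
    return max(min_instances, min(max_instances, needed))


if __name__ == "__main__":
    print(desired_instances(2, 90))   # load is above target: scale out
    print(desired_instances(4, 15))   # load is well below target: scale in
```

A real autoscaler would also add a cool-down period between decisions so that a noisy metric does not cause constant scaling in and out.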
Reactive Manifesto – Message Driven
Guaranteeing Delivery
- The Two Generals Problem
- When the network is unreliable, which it always is, we cannot guarantee that a message was received.
- Instead, we must settle for one of these delivery guarantees:
  - At most once
  - At least once
  - Exactly once
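In practice, an exactly-once effect is commonly approximated by combining at-least-once delivery with a consumer that ignores duplicates. The sketch below simulates that idea in memory; `IdempotentConsumer` and `deliver_at_least_once` are hypothetical names, and a real system would keep the set of processed ids in durable storage.

```python
class IdempotentConsumer:
    """Processes each message id once, even if the broker redelivers it."""

    def __init__(self, handler):
        self._handler = handler
        self._seen_ids = set()  # assumption: durable storage in a real system

    def receive(self, message_id, payload):
        """Return True if the message was processed, False if it was a duplicate."""
        if message_id in self._seen_ids:
            return False
        self._handler(payload)
        # Record the id only after the handler succeeds, so a failure leads to a retry.
        self._seen_ids.add(message_id)
        return True


def deliver_at_least_once(consumer, message_id, payload, ack_lost_times=1):
    """Simulate a sender that resends because acknowledgements were lost."""
    results = []
    for _ in range(ack_lost_times + 1):
        results.append(consumer.receive(message_id, payload))
    return results


if __name__ == "__main__":
    processed = []
    consumer = IdempotentConsumer(processed.append)
    print(deliver_at_least_once(consumer, "order-1", {"amount": 10}))
    print(processed)
```

The message is delivered twice but handled once, which is the behaviour an order-processing or payment handler needs.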
Azure solution
• Event Grid
• Event Hubs
• Service Bus

Service       Purpose                           Type                            When to use
Event Grid    Reactive programming              Event distribution (discrete)   React to status changes
Event Hubs    Big data pipeline                 Event streaming (series)        Telemetry and distributed data streaming
Service Bus   High-value enterprise messaging   Message                         Order processing and financial transactions
Chaos Engineering
Before starting your journey into chaos engineering, make sure you’ve done your homework and have built resiliency
into every level of your organization. Building resilient systems isn’t all about software. It starts at the infrastructure
layer, progresses to the network and data, influences application design and extends to people and culture.
Adrian Hornsby
Chaos Engineering
Chaos engineering is the discipline of experimenting on a software system in production in order to build confidence in the system's capability to withstand turbulent and unexpected conditions.
- Chaos engineering is a technique to meet the resilience requirement.
- Chaos engineering can be used to achieve resilience against
  - Infrastructure failures
  - Network failures
  - Application failures
(Image: the logo for Chaos Monkey used by Netflix)
Which Chaos Engineering Experiments?
The Phases of Chaos Engineering
It’s important to understand that chaos engineering is NOT about letting monkeys loose or allowing them to break
things randomly without a purpose. Chaos engineering is about breaking things in a controlled environment, through
well-planned experiments in order to build confidence in your application to withstand turbulent conditions.
https://medium.com/@adhorn/chaos-engineering-ab0cc9fbd12a
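A well-planned experiment can be sketched as a small function: measure the steady state, inject one failure, observe again, and always roll back. The phase breakdown and every name below (`run_experiment`, the simulated replica pool, the 5% tolerance) are assumptions made for illustration, not the exact phases shown on the slide.

```python
def run_experiment(measure_steady_state, inject_failure, rollback, tolerance=0.05):
    """Run one controlled experiment and report whether the hypothesis held."""
    baseline = measure_steady_state()        # 1. capture the steady state
    inject_failure()                         # 2. introduce one controlled failure
    try:
        observed = measure_steady_state()    # 3. observe the system under failure
    finally:
        rollback()                           # 4. always restore the system
    # Hypothesis: the metric stays within `tolerance` of the baseline.
    deviation = abs(observed - baseline) / baseline if baseline else float("inf")
    return {
        "baseline": baseline,
        "observed": observed,
        "hypothesis_held": deviation <= tolerance,
    }


if __name__ == "__main__":
    pool = {"replicas": 3}  # simulated service with three replicas

    def success_rate():
        return 1.0 if pool["replicas"] >= 1 else 0.0

    def kill_one_replica():
        pool["replicas"] -= 1

    def restore_replica():
        pool["replicas"] += 1

    print(run_experiment(success_rate, kill_one_replica, restore_replica))
```

The `finally` block reflects the controlled nature of the experiment: the rollback runs even if the measurement itself fails.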
Canary Deployment
Canary deployment: start small, and slowly build confidence within your team and your organization.
- How many customers are affected?
- What functionality is impaired?
- Which locations are impacted?
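"Start small" can be implemented by routing a fixed, stable percentage of users to the new version and raising that percentage as confidence grows. The sketch below is one possible approach under stated assumptions: `routes_to_canary` is a hypothetical helper that hashes the user id into one of 100 buckets, so the same user always gets the same answer.

```python
import hashlib


def routes_to_canary(user_id, canary_percent):
    """Deterministically place a user in the canary group for the given percentage."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    return bucket < canary_percent


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10000)]
    for percent in (1, 5, 25):
        share = sum(routes_to_canary(u, percent) for u in users) / len(users)
        print(f"target {percent}% -> observed {share:.1%}")
```

Because the bucket depends only on the user id, a user who is in the canary group at 5% stays in it when the rollout widens to 25%, which keeps the answer to "how many customers are affected?" easy to reason about.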
New Tools
One of the most efficient methods for uncovering misalignments in software is to put the code together and run it. Continuous Integration was promoted heavily as part of the XP methodology as a way to achieve this and is now a common industry norm.
Continuous Delivery builds on the success of CI by automating the steps of preparing code and deploying it to an environment. CD tools allow engineers to choose a build that passed the CI stage and promote it through the pipeline to run in production.
Like CI/CD, Continuous Verification is born out of a need to navigate increasingly complex systems. Modern organizations can’t validate that the internal machinations of the system work as intended, so instead they verify that the output of the system matches expectations.
Benefits of Chaos Engineering
- Customer: increased availability and durability of the service mean fewer outages disrupting their day-to-day lives.
- Business: Chaos Engineering can help prevent extremely large losses in revenue and maintenance costs, create happier and more engaged engineers, and improve on-call training for engineering teams.
- Technical: the insights from chaos experiments can mean a reduction in incidents, a reduction in on-call burden, an increased understanding of system failure modes, and improved system design.
Designed for failure
Common Characteristics
Componentisation via services
Organised around business capabilities
Decentralised data management
Products not projects
Decentralised governance
Smart endpoints and dumb pipes
Evolutionary design
Infrastructure automation
Designed for failure
Chaos Engineering is an experiment to ensure that the impact of failures is mitigated.
Adrian Cockcroft
Tools don’t create reliability.
Humans do.
[But tools can help.]
@CaseyRosenthal
Thank You!!!
Resources
• Reactive Manifesto
• Asynchronous Message-Based Communication (Microsoft)
• Patterns For Resilient Architecture (Medium)
• The Quest for Availability
• Chaos Engineering
• Availability modes for an Always On availability group
• Configure availability group on Azure SQL Server VM manually
Thanks to
@aacerbis
alberto.acerbis@4solid.it
Software Architect

Embracing Failure - AzureDay Rome
