Confidential. For: Company Name. Version 1.0
Latency Control & Supervision in
Resilience Design Patterns
Tu Pham - CTO @ Eway
TOC
Terminology
Why It Is So IMPORTANT
Why It Is So HARD
Design Patterns
Anti Patterns
Q & A
Terminology
Distributed Systems
Networked components that communicate with each other by passing
messages, most often to achieve a common goal.
Resiliency
The capacity of any system to recover from difficulties.
Availability
Probability that any system is operating at time `t`.
Reliability
Degree to which a system / component performs specified functions
under specified conditions for a specified period of time
Faults
A fault is an incorrect internal state in your system. Examples:
1. Slowdown of the storage layer
2. Memory leaks in the application
3. Blocked threads
4. Dependency failures
5. Bad data propagating through the system (most often because input data is not validated enough)
Failure
A failure is the inability of the system to perform its intended job.
Failure means loss of uptime and availability. Faults, if they are not
contained, propagate and can lead to failures.
Why It Is So IMPORTANT
1. Losing customers and partners to competitors => financial losses for the company
2. Affecting the livelihood of publishers and advertisers
3. Affecting the salary and bonus of OUR TEAM :))
4. Affecting services for customers and colleagues
Why It Is So HARD
Building resiliency in a complex microservices architecture, with
multiple distributed systems communicating with each other, is
difficult.
Some of the things which make it
hard are:
1. The network is unreliable
2. Dependencies can always fail
3. User behavior is unpredictable
Patterns
Latency Control
● Complements isolation
● Detection and handling of non-timely
responses
● Avoid cascading temporal failures
● Different approaches and patterns available
Timeout
● Preserve responsiveness
independent of downstream latency
● Measure response time of
downstream calls
● Stop waiting after a pre-determined
timeout
● Take alternate action if timeout was
reached
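A minimal sketch of the timeout pattern, using only the Python standard library. The helper `call_with_timeout` and the two downstream functions are hypothetical names made up for this illustration; the 0.1 s limit is an arbitrary assumption.

```python
import concurrent.futures
import time


def call_with_timeout(fn, timeout_s, fallback):
    """Run fn in a worker thread and stop waiting after timeout_s seconds."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # Alternate action: the caller gets the fallback instead of hanging.
        return fallback
    finally:
        # Do not block on the worker. Note the thread itself is not killed,
        # it just finishes in the background and its result is ignored.
        pool.shutdown(wait=False)


def slow_downstream():
    # Hypothetical downstream call that is too slow.
    time.sleep(0.5)
    return "real answer"


def fast_downstream():
    # Hypothetical downstream call that answers in time.
    return "real answer"


slow_result = call_with_timeout(slow_downstream, 0.1, "cached answer")
fast_result = call_with_timeout(fast_downstream, 0.1, "cached answer")
```

In this sketch the caller stays responsive, but the slow call still consumes a thread until it returns, so a timeout is usually combined with the other patterns below.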
Fail Fast
● “If you know you’re going to fail, you
better fail fast”
● Avoid foreseeable failures
● Usually implemented by adding
checks in front of costly actions
● Enhances probability of not failing
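One way to read "checks in front of costly actions" is sketched below. The function `generate_report`, its parameters and the row limit are hypothetical and exist only for this example.

```python
class FailFastError(Exception):
    """Raised before any costly work has started."""


MAX_ROWS = 10_000  # assumed limit for this illustration


def generate_report(user_id, rows, db_available):
    # Cheap checks first: reject requests that are known to fail.
    if not isinstance(user_id, int) or user_id <= 0:
        raise FailFastError("invalid user_id")
    if not db_available:
        raise FailFastError("database is not available")
    if len(rows) > MAX_ROWS:
        raise FailFastError("too many rows")

    # The costly action only runs when all checks passed.
    return sum(rows) / len(rows) if rows else 0.0
```

The caller gets an error immediately instead of after the expensive part has already used up time and resources.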
Circuit Breaker
● Probably the most often cited resilience
pattern
● Extension of the timeout pattern
● Takes downstream unit offline if
calls fail multiple times
● Specific variant of the fail fast
pattern
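The bullets above can be sketched as a small state machine. This is a simplified illustration, not the implementation used by Hystrix, Sentinel, resilience4j or Failsafe; the class name, the thresholds and the injectable `clock` parameter are assumptions made for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is not attempted."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                # Open: the downstream unit is considered offline, fail fast.
                raise CircuitOpenError("circuit is open")
            # Otherwise half-open: let this one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                # Open (or re-open after a failed trial call).
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure counter.
        self.failures = 0
        self.opened_at = None
        return result
```

Usage would look like `breaker.call(fetch_profile, user_id)`, where `fetch_profile` stands for any downstream call. While the circuit is open, callers fail immediately instead of waiting for a timeout on every request.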
Fan out & quickest reply
● Send request to multiple workers
● Use quickest reply and discard all
other responses
● Reduces probability of late
responses
● Tradeoff is WASTE of resources
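A minimal sketch of fan out & quickest reply with standard-library threads. The workers `slow_worker` and `fast_worker` are hypothetical stand-ins for replicas of the same service.

```python
import concurrent.futures
import time


def quickest_reply(workers, request):
    """Send the same request to every worker and return the first success."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(workers))
    futures = [pool.submit(worker, request) for worker in workers]
    try:
        for future in concurrent.futures.as_completed(futures):
            if future.exception() is None:
                return future.result()
        raise RuntimeError("all workers failed")
    finally:
        # Slower workers keep running in the background and their replies
        # are discarded: this is the wasted work the slide warns about.
        pool.shutdown(wait=False)


def slow_worker(request):
    time.sleep(0.3)
    return f"slow:{request}"


def fast_worker(request):
    time.sleep(0.01)
    return f"fast:{request}"


reply = quickest_reply([slow_worker, fast_worker], "q")
```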
Bounded Queues
● Limit request queue sizes in front of
highly utilized resources
● Avoids latency due to overloaded
resources
● Introduces pushback on the callers
● Another variant of the fail fast
pattern
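A bounded queue can be sketched with `queue.Queue(maxsize=...)` from the Python standard library. The `submit` helper and the queue size of 2 are assumptions for illustration.

```python
import queue


def submit(request_queue, item):
    """Try to enqueue without blocking; False is the pushback signal."""
    try:
        request_queue.put_nowait(item)
        return True
    except queue.Full:
        # The queue is full: reject now instead of letting latency grow.
        return False


requests = queue.Queue(maxsize=2)
accepted = [submit(requests, n) for n in range(3)]
```

Here the third request is rejected immediately, so the caller can back off or try another resource rather than wait in an ever-growing queue.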
Supervision
● Provides failure handling beyond the means of
a single failure unit
● Detect unit failures
● Provide means for error escalation
● Different approaches and patterns available
Shed Load
● Upstream isolation pattern
● Avoid becoming overloaded due to
too many requests
● Install a gatekeeper in front of the
resource
● Shed requests based on resource
load
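A gatekeeper can be as simple as a counter of in-flight requests. The sketch below uses that counter as the load signal, which is an assumption; real systems may use CPU, queue depth or latency instead. The class name and the `(status, body)` return shape are hypothetical.

```python
import threading


class LoadShedder:
    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self.in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self):
        with self._lock:
            if self.in_flight >= self.max_in_flight:
                return False
            self.in_flight += 1
            return True

    def release(self):
        with self._lock:
            self.in_flight -= 1

    def handle(self, fn, *args):
        """Run fn behind the gate, or shed the request if overloaded."""
        if not self.try_acquire():
            return (503, "overloaded, try again later")
        try:
            return (200, fn(*args))
        finally:
            self.release()
```

Shedding happens before the protected resource does any work, so the resource keeps serving the requests it has already accepted.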
Monitor
● Observe unit behavior and
interactions from the outside
● Automatically respond to detected
failures
● Part of the system – complex failure
handling strategies possible
● Outside the system – more robust
against system level failures
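One common way to observe units from the outside is a heartbeat check. The sketch below is an assumption about how such a monitor could look; `HeartbeatMonitor`, `max_silence_s` and the `on_failure` callback are hypothetical names.

```python
import time


class HeartbeatMonitor:
    def __init__(self, max_silence_s, on_failure, clock=time.monotonic):
        self.max_silence_s = max_silence_s
        self.on_failure = on_failure  # automatic response, e.g. a restart
        self.clock = clock
        self.last_seen = {}

    def heartbeat(self, unit):
        """Called by (or on behalf of) a unit to signal it is alive."""
        self.last_seen[unit] = self.clock()

    def check(self):
        """Report units that have been silent for too long."""
        now = self.clock()
        failed = [unit for unit, seen in self.last_seen.items()
                  if now - seen > self.max_silence_s]
        for unit in failed:
            del self.last_seen[unit]
            self.on_failure(unit)
        return failed
```

Running `check()` from a separate process or host corresponds to the "outside the system" option: the monitor survives failures that take the monitored units down.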
Error Handler
● Units often don’t have enough time
or information to handle errors
● Separate business logic and error
handling
● Business logic just focuses on
getting the task done (quickly)
● Error handler has sufficient time
and information to handle errors
Escalation
● Units often don’t have enough time
or information to handle errors
● An escalation peer with more time and
information is needed
● Often multi-level hierarchies
● Pure design issue
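The error handler and escalation ideas can be sketched together: business logic runs without any error handling of its own, and errors are passed up a hierarchy until someone can handle them. `Supervisor`, `run_unit` and the two-level hierarchy are hypothetical and only illustrate the shape of the design.

```python
class Supervisor:
    def __init__(self, name, can_handle, parent=None):
        self.name = name
        self.can_handle = can_handle  # tuple of exception types
        self.parent = parent

    def handle(self, error):
        if isinstance(error, self.can_handle):
            return f"{self.name} handled {type(error).__name__}"
        if self.parent is None:
            # Top of the hierarchy and still unhandled: give up loudly.
            raise error
        # Escalate to the peer with more time and information.
        return self.parent.handle(error)


def run_unit(task, supervisor):
    """Business logic just runs; error handling lives in the supervisor."""
    try:
        return task()
    except Exception as error:
        return supervisor.handle(error)


root = Supervisor("root", (Exception,))
team = Supervisor("team", (TimeoutError,), parent=root)
```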
Other Patterns
Fallback
● Units often don’t have enough time
or information to handle errors
● Instead of aborting the computation
because of a missing response, we
fill in a fallback value.
● Of course, it can be DANGEROUS !!!
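A minimal fallback sketch. To make the danger visible, this version returns a flag alongside the value so callers can tell a real response from a substitute; that flag, the helper name `with_fallback` and the chosen exception types are assumptions for the example.

```python
import logging

logger = logging.getLogger("fallback")


def with_fallback(fn, fallback_value, expected=(TimeoutError, ConnectionError)):
    """Return (value, used_fallback). Unexpected errors still propagate."""
    try:
        return fn(), False
    except expected as error:
        # Log every substitution so fallback usage does not go unnoticed.
        logger.warning("using fallback value: %s", error)
        return fallback_value, True
```

For example, a recommendation service might fall back to an empty list, while a price or balance lookup probably should not fall back at all.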
Retry
● Units have enough time or
information to handle errors
● Just send the request again and
again until it reaches the BOUNDARY of
the retry policy
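A sketch of a retry with an explicit boundary. The policy shown here (at most `max_attempts` tries, exponential backoff starting at `base_delay_s`) is one possible choice, not something the slides prescribe; the function name and parameters are hypothetical.

```python
import time


def retry(fn, max_attempts=3, base_delay_s=0.1,
          retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call fn until it succeeds or the policy boundary is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                # Boundary reached: stop retrying and surface the error.
                raise
            # Wait longer after each failed attempt: 1x, 2x, 4x, ...
            sleep(base_delay_s * 2 ** (attempt - 1))
```

Only errors listed in `retry_on` are retried, so a bug such as a `ValueError` fails immediately instead of being repeated.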
Just Don’t
● Infinite delay
● One config / policy for all situations
● Fallback logic without confirmation from
business departments / upper managers
● Laggy / buggy monitoring system
References
● https://github.com/Netflix/Hystrix
● https://github.com/alibaba/Sentinel
● https://github.com/resilience4j/resilience4j
● https://github.com/jhalterman/failsafe
“Just Design Our Systems For Failure”
Q&A
