Latency Control And Supervision In Resilience Design Patterns

Confidential. For Company Name. Version 1.0
Tu Pham - CTO @ Eway

TOC
● Terminology
● Why It Is So IMPORTANT
● Why It Is So HARD
● Design Patterns
● Anti Patterns
● Q & A

Terminology
Distributed Systems
Networked components that communicate with each other by passing
messages, most often to achieve a common goal.
Resiliency
The capacity of a system to recover from difficulties.
Availability
The probability that a system is operating at time `t`.
Reliability
The degree to which a system or component performs specified functions
under specified conditions for a specified period of time.

Terminology
Faults
A fault is an incorrect internal state in your system. Examples:
1. Slowing down of the storage layer
2. Memory leaks in the application
3. Blocked threads
4. Dependency failures
5. Bad data propagating in the system (most often because there is not
enough validation of input data)
Failure
A failure is the inability of the system to perform its intended job.
Failure means loss of uptime and availability. Faults, if not contained,
can propagate and lead to failures.

Why It Is So IMPORTANT
1. Losing customers and partners to competitors => financial losses for
the company
2. Affecting the livelihood of publishers and advertisers
3. Affecting the salary and bonus of OUR TEAM :))
4. Affecting services for customers and colleagues

Why It Is So HARD
But building resiliency in a complex microservices architecture with
multiple distributed systems communicating with each other is
difficult.

Some of the things which make it hard are:
1. The network is unreliable
2. Dependencies can always fail
3. User behavior is unpredictable

Latency Control
● Complements isolation
● Detection and handling of non-timely
responses
● Avoid cascading temporal failures
● Different approaches and patterns available

Timeout
● Preserve responsiveness
independent of downstream latency
● Measure response time of
downstream calls
● Stop waiting after a pre-determined
timeout
● Take alternate action if timeout was
reached

Fail Fast
● “If you know you’re going to fail, you
better fail fast”
● Avoid foreseeable failures
● Usually implemented by adding
checks in front of costly actions
● Enhances probability of not failing
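
As a rough sketch of "checks in front of costly actions", assume a hypothetical order flow where charging a card is the expensive step. The names `place_order`, `FailFastError`, `inventory`, and `charge_card` are invented for this example.

```python
class FailFastError(Exception):
    """Raised when a cheap precondition check predicts a failure."""


def place_order(order, inventory, charge_card):
    # Cheap checks first: reject requests that are certain to fail later.
    if order["quantity"] <= 0:
        raise FailFastError("quantity must be positive")
    if inventory.get(order["sku"], 0) < order["quantity"]:
        raise FailFastError("not enough stock")
    # Only now run the costly action (hypothetical payment call).
    return charge_card(order)
```

The costly action is never started for a request that the checks already know will fail.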

Circuit Breaker
● Probably the most often cited resilience
pattern
● Extension of the timeout pattern
● Takes the downstream unit offline if
calls fail multiple times
● Specific variant of the fail fast
pattern
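
A minimal single-threaded sketch of the idea follows. It is not the implementation used by Hystrix or resilience4j; the class name, the `failure_threshold` and `reset_after_s` parameters, and the injectable `clock` are assumptions made for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised instead of calling a downstream unit that is considered offline."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock            # injectable so the sketch is easy to test
        self.failures = 0
        self.opened_at = None         # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                # Open: fail fast without touching the downstream unit.
                raise CircuitOpenError("downstream unit is offline")
            # Reset period elapsed: let one trial call through ("half-open").
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()   # (re)open the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers get an immediate `CircuitOpenError` rather than waiting for another timeout, which is what makes this a variant of fail fast.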

Fan out & quickest reply
● Send request to multiple workers
● Use quickest reply and discard all
other responses
● Reduces probability of latent
responses
● Tradeoff is WASTE of resources
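
One way to sketch this with threads is shown below. `quickest_reply` is a hypothetical helper, and the workers are assumed to be plain callables that take the request as their only argument.

```python
import concurrent.futures


def quickest_reply(workers, request):
    """Send request to every worker and return the first successful reply."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(workers))
    futures = [pool.submit(worker, request) for worker in workers]
    try:
        for future in concurrent.futures.as_completed(futures):
            if future.exception() is None:
                return future.result()      # quickest successful reply wins
        raise RuntimeError("all workers failed")
    finally:
        # Discard the rest. Calls that already started cannot be cancelled
        # and keep running to completion: this is the WASTE of resources.
        for future in futures:
            future.cancel()
        pool.shutdown(wait=False)
```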

Bounded Queues
● Limit request queue sizes in front of
highly utilized resources
● Avoids latency due to overloaded
resources
● Introduces pushback on the callers
● Another variant of the fail fast
pattern
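
A small sketch using the standard library `queue.Queue` with a `maxsize`. The `submit` helper and the `Overloaded` exception are hypothetical names for this example; the pushback is simply an immediate error returned to the caller.

```python
import queue


class Overloaded(Exception):
    """Pushback signal: the caller should back off or try elsewhere."""


def submit(request_queue, request):
    """Enqueue a request, or fail immediately if the bounded queue is full."""
    try:
        request_queue.put_nowait(request)
    except queue.Full:
        raise Overloaded("queue is full, try again later") from None


# Assumed limit for illustration: at most 2 requests may wait.
requests = queue.Queue(maxsize=2)
```

Because the queue never grows past its limit, the time a request can spend waiting is bounded as well.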

Supervision
● Provides failure handling beyond the means of
a single failure unit
● Detect unit failures
● Provide means for error escalation
● Different approaches and patterns available

Shed Load
● Upstream isolation pattern
● Avoid becoming overloaded due to
too many requests
● Install a gatekeeper in front of the
resource
● Shed requests based on resource
load
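
The gatekeeper can be sketched as below. Here "resource load" is approximated by the number of requests currently in flight, which is an assumption of this example; a real gatekeeper might look at CPU, queue depth, or latency instead. `Gatekeeper` and `max_in_flight` are hypothetical names.

```python
import threading


class Gatekeeper:
    """Sits in front of a resource and sheds requests when it is too busy."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self.in_flight = 0
        self._lock = threading.Lock()

    def handle(self, request, resource):
        with self._lock:
            if self.in_flight >= self.max_in_flight:
                return ("rejected", None)       # shed the request
            self.in_flight += 1
        try:
            return ("ok", resource(request))
        finally:
            with self._lock:
                self.in_flight -= 1
```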

Monitor
● Observe unit behavior and
interactions from the outside
● Automatically respond to detected
failures
● Part of the system – complex failure
handling strategies possible
● Outside the system – more robust
against system level failures
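
As one possible illustration of observing units from the outside, the sketch below assumes that units report heartbeats and that a silent unit should be restarted. The `Monitor` class, the `restart` callback, and the `max_silence_s` parameter are all hypothetical.

```python
import time


class Monitor:
    """Watches heartbeats from the outside and restarts silent units."""

    def __init__(self, max_silence_s, restart, clock=time.monotonic):
        self.max_silence_s = max_silence_s
        self.restart = restart        # hypothetical callback: restart(unit_name)
        self.clock = clock
        self.last_seen = {}

    def heartbeat(self, unit):
        self.last_seen[unit] = self.clock()

    def check(self):
        """Restart every unit that has been silent for too long."""
        now = self.clock()
        restarted = []
        for unit, seen in self.last_seen.items():
            if now - seen > self.max_silence_s:
                self.restart(unit)
                self.last_seen[unit] = now   # fresh window after the restart
                restarted.append(unit)
        return restarted
```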

Error Handler
● Units often don’t have enough time
or information to handle errors
● Separate business logic and error
handling
● Business logic just focuses on
getting the task done (quickly)
● Error handler has sufficient time
and information to handle errors

Escalation
● An escalation peer with more time and
information is needed
● Often multi-level hierarchies
● Pure design issue
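
A multi-level hierarchy might be sketched like this. The `Supervisor` class and the choice to route by exception type are assumptions made for illustration; the slide itself does not prescribe a mechanism.

```python
class Supervisor:
    """One level of an escalation hierarchy."""

    def __init__(self, name, handles, parent=None):
        self.name = name
        self.handles = handles    # tuple of exception types this level handles
        self.parent = parent      # next level up, or None at the top

    def escalate(self, error):
        if isinstance(error, self.handles):
            return f"{self.name} handled {type(error).__name__}"
        if self.parent is None:
            raise error           # nobody in the hierarchy could handle it
        return self.parent.escalate(error)
```

For example, a `team` supervisor with `root` as its parent handles what it knows about and passes everything else upward.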

Fallback
● Instead of aborting the computation
because of a missing response, we
fill in a fallback value.
● Of course, it can be DANGEROUS!!!
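
A hedged sketch of a fallback wrapper is shown below. `with_fallback` is a hypothetical helper; returning a flag next to the value is one assumed way to reduce the danger, because callers can then tell a real response from a filled-in one.

```python
def with_fallback(fn, fallback_value, handled=(TimeoutError, ConnectionError)):
    """Return (value, used_fallback).

    Only the exception types in `handled` trigger the fallback; anything
    else is treated as a bug and propagates to the caller.
    """
    try:
        return fn(), False
    except handled:
        return fallback_value, True
```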

Retry
● Units have enough time or
information to handle errors
● Just send the request again and
again until it reaches the BOUNDARY of
the retry policy
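
A minimal sketch of a bounded retry follows. The policy boundary is assumed here to be a maximum number of attempts, with exponential backoff between them; `retry`, `max_attempts`, and `base_delay_s` are hypothetical names, and `sleep` is injectable only to keep the example testable.

```python
import time


def retry(fn, max_attempts=3, base_delay_s=0.1, sleep=time.sleep):
    """Call fn until it succeeds or max_attempts is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise                     # boundary of the policy reached
            # Exponential backoff: base, 2x base, 4x base, ...
            sleep(base_delay_s * 2 ** (attempt - 1))
```

Retrying is only safe when the request can be repeated without side effects, so in practice the policy also has to say which calls are eligible.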

Just Don’t
● Infinite delay (waiting without a timeout)
● One config / policy for all situations
● Fallback logic without confirmation from
business departments / upper managers
● Laggy / buggy monitoring system

References
● https://github.com/Netflix/Hystrix
● https://github.com/alibaba/Sentinel
● https://github.com/resilience4j/resilience4j
● https://github.com/jhalterman/failsafe

“Just Design Our Systems For Failure”
Q&A
