Stability anti-patterns in cloud-native applications

#devfestRO
@ammbra1508

HELLO!
I am Ana
Solutions Architect @ IBM
Co-founder of Bucharest Software Craftsmanship Community

What are anti-patterns?
An anti-pattern is a common response to a recurring problem that is usually ineffective and risks being highly counterproductive. (Wikipedia)

System Stability
Users can continue to work with a system even when disruptions occur: temporary shocks, continuous load stress, component failures.

Application deployments in a specific order

Examples and symptoms

A specific order to the start and stop processes when bringing up applications.
A deployment depends on previous successful deployment of another application.
A service waits for another component to be available.
kubectl wait --for=condition=ready pod -l app=backend

Why is it bad?
While one deployment waits for another, the application is not fully functional.
If the condition is never met, the next deployment cannot proceed and the process breaks.

Solutions at design level
Concurrently deploy and start all parts of an application.
Use retry patterns. Use circuit-breaker patterns.
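A retry pattern lets a service start before its dependency is ready and recover once it is. The following is a minimal sketch, not a production implementation: `call_with_retry` and `fetch_from_backend` are hypothetical names, and the doubling delay is one assumed backoff policy among several.

```python
import time


def call_with_retry(operation, attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call operation(); on a connection failure wait, then retry with a doubled delay."""
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts:
                raise  # out of attempts: surface the failure to the caller
            sleep(delay)
            delay *= 2


# Hypothetical dependency that only becomes available on the third call.
calls = {"count": 0}


def fetch_from_backend():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("backend not ready yet")
    return "backend response"


# The sleep function is swapped for a no-op so the demo runs instantly.
result = call_with_retry(fetch_from_backend, sleep=lambda seconds: None)
print(result)
```

With this in place, the caller does not need to be deployed after the backend; it simply retries until the backend answers or the attempts run out.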

Choose a deployment strategy
Blue/green deployments for instant rollout/rollback
[Diagram, steps 1-4: the load balancer (LB) fronts two v2 pods while two v3 pods run alongside; the LB then switches to the two v3 pods.]

Choose a deployment strategy
Canary deployments when the user does the testing
[Diagram, steps 1-4: the load balancer (LB) fronts two v3 pods; one v4 pod is added next to them; finally all three pods behind the LB run v4.]
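To get a feel for how much user traffic a canary receives, here is a small worked sketch. It assumes the load balancer spreads requests evenly across all pods, which is an assumption for illustration and not something the slides state; `canary_share` is a hypothetical helper.

```python
def canary_share(stable_pods, canary_pods):
    """Fraction of requests reaching the new version under even per-pod balancing."""
    total = stable_pods + canary_pods
    if total == 0:
        raise ValueError("no pods behind the load balancer")
    return canary_pods / total


# (v3 pods, v4 pods) behind the load balancer at three points of the rollout.
rollout_steps = [(2, 0), (2, 1), (0, 3)]
shares = [canary_share(v3, v4) for v3, v4 in rollout_steps]
print(shares)
```

Under that assumption, a single v4 pod next to two v3 pods takes roughly a third of the requests, so the size of the canary decides how many users do the testing.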

Integration communication and composition

Synchronous call-and-response based systems
Queue-based messaging systems
System-to-system messaging via SMTP or SMS
Integration via synchronous communication with software that forces the calling system to wait and stop what it is doing.
Containerizing the middleware as is.
Distorted usage of the declarative deployment pattern.

Why is it bad?
Requests may be sent but not receive a reply.
The provider claims to send a different response format.
Synchronous calls are vicious amplifiers that facilitate blockages.
Tightly coupled middleware amplifies shocks to the system.

Application Level Solutions (1)
Circuit Breaker pattern
✚ Consider delayed retries.
✚ Report, record and correlate state changes.

Application Level Solutions (2)
Configuration example for Spring Boot with Resilience4j
Failure rate threshold
Sliding window
Wait duration
Allowed number of calls in half-open state
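To show how the four settings named above interact, here is a simplified circuit breaker sketch. It is not Resilience4j's implementation: the class, its parameter names, and the rule that a single failed trial call reopens the circuit are assumptions made for illustration. State changes are recorded in `transitions` so they can be reported and correlated.

```python
import time
from collections import deque


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without invoking the dependency."""


class CircuitBreaker:
    def __init__(self, failure_rate_threshold=0.5, sliding_window_size=4,
                 wait_duration=5.0, half_open_calls=2, clock=time.monotonic):
        self.failure_rate_threshold = failure_rate_threshold
        self.window = deque(maxlen=sliding_window_size)  # True marks a failed call
        self.wait_duration = wait_duration
        self.half_open_calls = half_open_calls
        self.clock = clock
        self.state = "CLOSED"
        self.opened_at = 0.0
        self.trial_successes = 0
        self.transitions = []  # recorded state changes

    def _move_to(self, new_state):
        self.transitions.append((self.state, new_state))
        self.state = new_state
        self.window.clear()
        self.trial_successes = 0
        if new_state == "OPEN":
            self.opened_at = self.clock()

    def _record(self, failed):
        if self.state == "HALF_OPEN":
            if failed:
                self._move_to("OPEN")  # a failed trial call reopens the circuit
            else:
                self.trial_successes += 1
                if self.trial_successes >= self.half_open_calls:
                    self._move_to("CLOSED")
            return
        self.window.append(failed)
        window_full = len(self.window) == self.window.maxlen
        failure_rate = sum(self.window) / len(self.window)
        if window_full and failure_rate >= self.failure_rate_threshold:
            self._move_to("OPEN")

    def call(self, operation):
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.wait_duration:
                raise CircuitOpenError("call rejected, circuit is open")
            self._move_to("HALF_OPEN")  # wait duration elapsed: allow trial calls
        try:
            result = operation()
        except Exception:
            self._record(failed=True)
            raise
        self._record(failed=False)
        return result


# Demo with a fake clock so no real waiting is needed.
now = [0.0]
breaker = CircuitBreaker(clock=lambda: now[0])


def failing():
    raise ConnectionError("dependency down")


def healthy():
    return "ok"


for _ in range(4):
    try:
        breaker.call(failing)
    except ConnectionError:
        pass
state_after_failures = breaker.state

try:
    breaker.call(healthy)
    rejected = False
except CircuitOpenError:
    rejected = True

now[0] = 6.0  # past the wait duration: trial calls are allowed again
results = [breaker.call(healthy), breaker.call(healthy)]
print(breaker.transitions)
```

While the circuit is open, callers fail fast instead of blocking on the dependency, which is what stops a synchronous call from amplifying the outage.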

Is this enough?
Yes, if your end user is happy that "real" data is not present in the response.
Yes, if your fallback method returns a unified, satisfying response.

Transition Solutions (1)
Determine the outage pattern based on the maintenance window.
Cache solution
Stress testing + load testing

Transition Solutions (2)
Cache hit:
  Check whether the TTL of the first-layer cache has expired.
  NO: Return the response from Redis.
  YES: Go to the second-layer cache, find the entry, and try to call the real method.
    On success: Store the new result with a proper TTL.
    On failure: Extend the existing TTL to put the entry back into the first layer and return the result.
Cache miss:
  Try to call the real method.
    On success: Store the new result with a proper TTL.
    On failure: Return a result covering the miss. Record the failure.
helm3 install my-redis stable/redis
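The cache flow above can be sketched in a few lines. This is a minimal illustration under stated assumptions: two in-memory dictionaries stand in for the Redis-backed layers, and `TwoLayerCache`, `load_price`, and the `fallback` argument are hypothetical names. Unlike the slide, the sketch records failures in both the expired-hit and the miss case.

```python
import time


class TwoLayerCache:
    def __init__(self, ttl=30.0, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self.first_layer = {}   # key -> (value, expires_at); short-lived entries
        self.second_layer = {}  # key -> last known good value, kept past the TTL
        self.failures = []      # recorded failures of the real method

    def get(self, key, real_method, fallback=None):
        entry = self.first_layer.get(key)
        if entry is not None and self.clock() < entry[1]:
            return entry[0]  # hit and TTL not expired: answer from the cache
        try:
            value = real_method(key)
        except Exception as error:
            self.failures.append((key, repr(error)))
            if key in self.second_layer:
                # Expired hit: extend the TTL and serve the last known value.
                stale = self.second_layer[key]
                self.first_layer[key] = (stale, self.clock() + self.ttl)
                return stale
            return fallback  # miss: return a result covering the miss
        # Success: store the new result in both layers with a fresh TTL.
        self.first_layer[key] = (value, self.clock() + self.ttl)
        self.second_layer[key] = value
        return value


# Demo with a fake clock and a backend that goes down after the first call.
now = [0.0]
backend_up = [True]
cache = TwoLayerCache(ttl=30.0, clock=lambda: now[0])


def load_price(key):
    if not backend_up[0]:
        raise ConnectionError("backend unavailable")
    return f"price-of-{key}"


first = cache.get("book", load_price)    # miss, real call succeeds
backend_up[0] = False
cached = cache.get("book", load_price)   # hit, TTL still valid
now[0] = 60.0
stale = cache.get("book", load_price)    # TTL expired, real call fails
missing = cache.get("pen", load_price, fallback="unavailable")
print(first, cached, stale, missing)
```

The trade-off is the one raised earlier in the talk: during an outage the user receives older data, so this only helps when a slightly stale answer is acceptable.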

Source: “Inside Out” animation by Pixar
https://www.psychologies.co.uk/inside-out-interview-creators

Solutions at Kubernetes level (1)
Circuit breaker: Traefik

Circuit breaker: Istio, Traefik

Unlimited resources mirage


Why is it bad?
Not setting memory or CPU requests and limits can result in scheduling an unlimited number of pods on any node.
The container could use all of the available memory on its node, possibly invoking the OOM (out of memory) killer.
The default memory limit of the namespace (in which the container is running) is assigned to the container.

Solutions at Kubernetes level
Set memory and CPU requests below their limits.
Control resource limits via ResourceQuotas and LimitRange in the namespace settings.
Keep the CPU request at 1 core or below and use ReplicaSets to scale it out.
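The first and last rules above are easy to check automatically before a manifest is applied. The sketch below assumes the container's `resources` section has already been loaded into a dictionary; `check_resources` and the parsing helpers are hypothetical, and only a subset of Kubernetes quantity suffixes ("m" for CPU; "Ki", "Mi", "Gi" for memory) is handled.

```python
MEMORY_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def parse_cpu(quantity):
    """Convert a CPU quantity such as "500m" or "1" to a number of cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)


def parse_memory(quantity):
    """Convert a memory quantity such as "256Mi" to bytes (binary suffixes only)."""
    for suffix, factor in MEMORY_UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[:-len(suffix)]) * factor
    return int(quantity)  # plain number of bytes


def check_resources(resources):
    """Return a list of problems found in a container's resources section."""
    problems = []
    requests = resources.get("requests", {})
    limits = resources.get("limits", {})
    for name, parse in (("cpu", parse_cpu), ("memory", parse_memory)):
        if name not in requests or name not in limits:
            problems.append(f"{name}: request or limit is missing")
        elif parse(requests[name]) >= parse(limits[name]):
            problems.append(f"{name}: request is not below the limit")
    if "cpu" in requests and parse_cpu(requests["cpu"]) > 1:
        problems.append("cpu: request is above 1 core")
    return problems


good = {"requests": {"cpu": "500m", "memory": "256Mi"},
        "limits": {"cpu": "1", "memory": "512Mi"}}
bad = {"requests": {"cpu": "2", "memory": "1Gi"},
       "limits": {"cpu": "2"}}

good_problems = check_resources(good)
bad_problems = check_resources(bad)
print(good_problems)
print(bad_problems)
```

A check like this could run in a pipeline step, while ResourceQuotas and LimitRange enforce the namespace-wide limits inside the cluster.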

Scaling results

[Figure: 2 hosts, 2 instances, 300 threads; 20 hosts, 70 instances, 3000 threads]
Anytime you have a “many-to-one” or “many-to-few” relationship.
Amplifying the scaling effects through a “shared resource” or “commons project”.
Dangerously combining horizontal and vertical autoscaling.
Counterproductive usage of the predictable demands and elastic scale patterns.

Why is it bad?
One service can flood another with requests beyond its capacity.
Shared resources are a capacity constraint.

Solutions at design level
Approximate a shared-nothing architecture by reducing the number of callers of the shared resource.
Design for pairs of applications that each act as a failover for the other.

Provision cluster nodes to have the same resource footprint.
Ensure that the cluster autoscaler pod has enough resources.
To avoid delays in provisioning, over-provision your cluster.
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-can-i-configure-overprovisioning-with-cluster-autoscaler

Ensure that every pod has resource requests defined.
Validate that resource requests are close to actual usage.
Install metrics-server and configure custom/external metrics.
Specify PodDisruptionBudget for application pods.

Takeaways
Avoid deploying things in a specific order: applications should not wait because a dependency is not ready.
Set memory and CPU limits to reduce the risk of resource contention, and make sure that resource requests are close to actual usage.
Utilize Kubernetes’s self-healing mechanism; implement retries and circuit breakers at both application and Kubernetes level.
Avoid using HPA and VPA together; consider installing metrics-server and adding custom metrics for horizontal scaling.

Thank YOU!
