Amazon Builder Library
Notes
Diego Pacheco
@diego_pacheco
❏ Cat's Father
❏ Principal Software Architect
❏ Agile Coach
❏ SOA/Microservices Expert
❏ DevOps Practitioner
❏ Speaker
❏ Author
diegopacheco
http://diego-pacheco.blogspot.com.br/
About me...
https://diegopacheco.github.io/
Amazon Builders Library
https://aws.amazon.com/builders-library/
❏ How They Build AWS
❏ Amazon Experience
❏ Theory
❏ Practice
❏ Real Cases
❏ Techniques and products
❏ Super interesting
❏ 13 Articles so far
The Library
Avoid one way doors
The Importance of Rollback
Type 1 decisions are not reversible, and you have to be very
careful making them. (One way doors)
Type 2 decisions are like walking through a door — if you
don't like the decision, you can always go back. (Two-Way
Doors).
1. No Errors
2. No Service Disruption
Backward Compatibility
2 Phase Deployment
https://aws.amazon.com/builders-library/ensuring-rollback-safety-during-deployments/
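The two-phase, backward-compatible deployment above can be sketched as follows. This is a minimal illustration, not Amazon's actual mechanism: the `name` → `first`/`last` field split and all function names are hypothetical.

```python
import json

# Hypothetical sketch of a two-phase (backward-compatible) rollout:
# Phase 1 ships readers that accept BOTH the old and new message shapes;
# Phase 2, only after Phase 1 is everywhere, ships writers emitting the
# new shape. Rolling back either phase never produces a message that
# some host in the fleet cannot read.

def read_user(payload: str) -> dict:
    msg = json.loads(payload)
    if "name" in msg:                      # old shape from un-upgraded writers
        first, _, last = msg["name"].partition(" ")
        return {"first": first, "last": last}
    return {"first": msg["first"], "last": msg["last"]}

def write_user_phase1(user: dict) -> str:
    # Phase 1 still emits the OLD shape, so un-upgraded readers keep working.
    return json.dumps({"name": f"{user['first']} {user['last']}"})

def write_user_phase2(user: dict) -> str:
    # Phase 2 emits the NEW shape; by now every reader understands it.
    return json.dumps({"first": user["first"], "last": user["last"]})
```

Either writer round-trips through the tolerant reader, which is exactly what makes each phase independently rollback-safe.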
Local & External Caches
Local cache
❏ Added on demand
❏ No Ops overhead
❏ In-memory HashTable
Issues
❏ Downstream load proportional to fleet size
❏ Cache Coherence
❏ Cold Start
External cache
❏ Memcached | Redis
❏ Reduces Cache Coherence issues
❏ No Cold Start issues
❏ Downstream load is reduced
Issues
❏ More Complexity
❏ More Ops Overhead
Inline & Side Caches
Inline cache
❏ R/W Through
❏ Embedded cache mgmt
❏ DAX, Nginx, Varnish
❏ Uniform API model for
clients
❏ Cache logic outside of
the code (eliminating
potential bugs).
Side cache
❏ ElastiCache(Redis|Memcached)
❏ Guava | EhCache
❏ Application controls the cache
Cache Challenges
Figure out the right:
❏ Cache Size
❏ Expiration Policy
❏ Eviction Policy
Most Common Expiration policy
❏ Time-based: TTL
Amazon uses 2 TTLs
❏ Soft: For updates
❏ Hard: For eviction
* Used in IAM
Most Common Eviction policy
❏ LRU
Keep an eye on
❏ Cache Hit / Miss metrics
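The soft/hard two-TTL scheme above can be sketched like this. The semantics are an assumption drawn from the bullets (past the soft TTL, serve the stale value but signal a refresh; past the hard TTL, evict):

```python
import time

class TwoTtlCache:
    """Soft TTL: serve stale but ask the caller to refresh.
    Hard TTL: evict the entry entirely."""

    def __init__(self, soft_ttl, hard_ttl, clock=time.monotonic):
        assert soft_ttl < hard_ttl
        self._soft, self._hard, self._clock = soft_ttl, hard_ttl, clock
        self._store = {}                   # key -> (value, write_time)

    def put(self, key, value):
        self._store[key] = (value, self._clock())

    def get(self, key):
        """Return (value, needs_refresh); raise KeyError past the hard TTL."""
        value, written = self._store[key]
        age = self._clock() - written
        if age >= self._hard:
            del self._store[key]           # hard TTL reached: evict
            raise KeyError(key)
        return value, age >= self._soft    # soft TTL reached: refresh hint
```

Injecting `clock` keeps the expiration logic testable without sleeping.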
Downstream fallback
Be Careful
❏ Could spike traffic in
downstream
❏ Could lead to:
❏ Throttling
❏ Brownout
Better Options
❏ In case of External Cache
outage:
❏ Fallback to Local Cache
❏ Use Load Shedding -
reduce the number of
requests going to
downstream
Thundering Herd Problem
The Issue
❏ Many clients requesting the same key / data
❏ Uncached - so requests are forced to go downstream.
❏ Empty local cache (just joined the fleet)
❏ The situation could lead to:
❏ Brownout
❏ Throttling
The Solution
❏ Request Coalescing
❏ Varnish and Nginx have this feature
❏ Make sure just 1 request goes to the downstream
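The deck names Varnish/Nginx as having coalescing built in; for application-side caches the idea can be hand-rolled as below. This is an illustrative sketch, not any library's API:

```python
import threading

class CoalescingCache:
    """Concurrent misses on the same key share ONE downstream fetch."""

    def __init__(self, fetch):
        self._fetch = fetch                # downstream loader function
        self._cache = {}
        self._inflight = {}                # key -> Event set when fetch done
        self._lock = threading.Lock()

    def get(self, key):
        while True:
            with self._lock:
                if key in self._cache:
                    return self._cache[key]
                done = self._inflight.get(key)
                if done is None:           # we become the single fetcher
                    done = self._inflight[key] = threading.Event()
                    owner = True
                else:
                    owner = False
            if owner:
                try:
                    self._cache[key] = self._fetch(key)
                finally:
                    with self._lock:
                        del self._inflight[key]
                    done.set()
                return self._cache[key]
            done.wait()                    # piggyback on the owner's fetch
```

Only the first miss becomes the "owner"; every other caller waits on the owner's `Event`, so the thundering herd collapses into a single downstream request per key.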
Leader Election (Single-Leader)
Benefits
❏ Easier to Understand
❏ Works Simply
❏ Offers client
consistency
Downsides
❏ SPOF
❏ Single Point of Scaling
❏ Single point of truth (bad
leader has high blast
radius)
❏ Partial Deployments are
hard to apply
Leader Election Best Practices
Amazon does:
❏ Modeling systems with TLA+
❏ Check the remaining lease before side-effect ops outside of the leader
❏ Consider in the code: slow networks, timeouts, retries, GC pauses
❏ Avoid heartbeating leases on a background thread
❏ Make it easy to find the host that is the current leader
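The "check the remaining lease before side effects" bullet can be sketched as below. The `safety_margin` and the `Lease` shape are assumptions for illustration, not Amazon's actual API:

```python
import time

class Lease:
    """Leader lease with an expiry checked against a monotonic clock."""

    def __init__(self, expires_at, clock=time.monotonic):
        self.expires_at, self._clock = expires_at, clock

    def remaining(self):
        return self.expires_at - self._clock()

def do_as_leader(lease, side_effect, safety_margin=2.0):
    # Re-check the lease IMMEDIATELY before the side effect, leaving a
    # margin for clock error, GC pauses, and slow networks -- the exact
    # hazards the slide lists.
    if lease.remaining() <= safety_margin:
        raise RuntimeError("lease too close to expiry; refusing to act as leader")
    return side_effect()
```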
Avoiding Fallback
Issues
❏ Hard to Test
❏ Fallback could fail
❏ Fallback could make things worse
❏ Fallback could introduce latent bugs
Let it Crash
❏ Erlang
❏ Akka
❏ ...now Amazon
How Amazon Avoids Fallbacks
Do:
❏ Make non-fallback code more resilient
❏ Let the caller handle the failure
❏ Push data proactively (IAM pushes credential data and it's valid for several
hours).
❏ Convert fallback to failover
❏ Ensure retries/timeouts don't become fallbacks
Static Stability
Amazon does for EC2:
❏ Control Plane vs Data Plane
❏ Control plane is more complex
❏ Data plane is simpler and therefore more reliable
❏ AZs(Availability Zones) don't share:
❏ Power
❏ Infrastructure
❏ AZs are connected to each other by a fast fiber-optic network
Static Stability ~ EC2
Control Plane
❏ Finds a physical server
❏ Allocates a network interface
❏ Creates the EBS volume
❏ Installs SG rules
❏ More complex
Data Plane
❏ Routes packets per the VPC route table
❏ R/W from EBS volumes
❏ Much simpler than the Control Plane,
therefore more available
❏ Control Plane impairment:
❏ SG updates are lost
❏ But machines keep working
Static Stability Under the hood
Ec2 Static Stability:
❏ 2 AZs in the same region get deploys on different days
❏ Deploy first to one Box / Cell, then 1/N servers
❏ Align EC2 deploys with AZ boundaries ~ if a deploy goes wrong it affects only 1
AZ, then it is rolled back, fixed, and deployed again.
❏ Packet flows stay within the same AZ (avoid crossing boundaries)
❏ Always provision more capacity than you need:
❏ AZs are 50% overprovisioned
❏ AZs operate at a maximum of 66% of the load-tested level
Implementing Health Checkers
Types of Health Checkers:
❏ Liveness Health Checker: am I healthy?
❏ Local Health Checker:
❏ Check disk
❏ Critical proxies
❏ Missing support processes ~ Observability (flying-blind issue)
❏ Dependency Health Checkers
❏ Bad configuration or state metadata
❏ Inability to communicate with peer services
❏ Other issues: memory leaks, deadlocks can make the server show errors
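The layering above can be sketched as nested predicates, where each deeper check includes the ones before it. The function names and the disk threshold are illustrative assumptions, not a real health-check API:

```python
# Layered health checks: liveness answers "did the process answer at
# all?"; local health adds host resources (e.g. disk); dependency
# health adds downstream reachability. Deeper checks catch more failure
# modes but also produce more false alarms the caller must handle.

def liveness_check() -> bool:
    return True                            # the process answered at all

def local_health_check(disk_free_bytes, min_free_bytes=1 << 30) -> bool:
    return liveness_check() and disk_free_bytes >= min_free_bytes

def dependency_health_check(disk_free_bytes, ping_dependency) -> bool:
    return local_health_check(disk_free_bytes) and ping_dependency()
```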
Implementing Health Checkers
Anomaly Detection
❏ Compare a server with its peers
to tell if it is behaving oddly.
❏ Aggregate data and
compare error rates.
Cannot Detect
❏ Clock Skew
❏ Old Code
❏ Any unanticipated failure
mode
React to HC Failures
❏ Fail Open (ELB)
❏ Central authority
❏ When all fail - allow
traffic
❏ Prioritize your health
❏ Max socket
connections to avoid a
death spiral
Going fast with CD
Takeaways:
❏ Always improve the release process without being a blocker to the business
❏ Add checkers on the pipelines/steps rather than manual processes
❏ Reducing the risk that a defect affects customers:
❏ Deployment hygiene (Minimum healthy hosts ~ CodeDeploy)
❏ Test prior to production: Unit, Integration, Browser, Inject Failure
❏ Validate in production: don't release all at once.
❏ Deploys are done during business hours
Timeouts, Retries, Backoff + Jitter
Takeaways:
❏ It's impossible to avoid failure (only reduce the probability)
❏ Basic constructs to make systems more reliable (Google SRE says the same):
❏ Timeouts, Retry, Exponential Backoff + Jitter
❏ Retries let the client survive partial failures
❏ Picking the right timeout is hard. Too low: increased traffic + latency
❏ Latency metrics help you pick the right value
❏ Amazon accepts a false-timeout rate of 0.1% (timeout set at p99.9)
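Capped exponential backoff with "full jitter" - the combination the slide names - fits in a few lines. The default `base`/`cap` values here are illustrative:

```python
import random

# Full jitter: sleep a random amount in [0, min(cap, base * 2**attempt)].
# The randomness spreads retries from many clients apart instead of
# synchronizing them into waves that re-overload the downstream.
def backoff_full_jitter(attempt, base=0.1, cap=5.0, rng=random.random):
    return rng() * min(cap, base * (2 ** attempt))
```

Injecting `rng` makes the schedule testable; in production the default `random.random` is fine for this purpose.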
Timeouts, Retries, Backoff + Jitter
When the default strategy doesn't work:
❏ Clients with substantial
network latency (over the
internet)
❏ Clients with a tight latency
bound: p99.9 close to p50
❏ Implementations that do not cover
DNS or TLS handshake
times
Retry Issues
❏ Circuit breakers introduce modal
behavior which is difficult to test
❏ A local token bucket fixes CB issues
❏ The local token bucket has been in the AWS SDK
since 2016
❏ Also important to know when to
retry and to analyze HTTP errors
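The local token bucket idea can be sketched as below. The capacity/refill parameters are illustrative assumptions, not the AWS SDK's actual values:

```python
# Client-side retry token bucket: a retry spends a token; a success
# refunds a fraction. During a sustained outage the bucket drains and
# the client fails fast instead of retry-storming -- without the
# all-or-nothing modal behavior of a circuit breaker.
class RetryTokenBucket:
    def __init__(self, capacity=10.0, retry_cost=1.0, success_refill=0.1):
        self._capacity, self._tokens = capacity, capacity
        self._retry_cost, self._success_refill = retry_cost, success_refill

    def try_acquire_retry(self):
        if self._tokens >= self._retry_cost:
            self._tokens -= self._retry_cost
            return True
        return False                       # bucket empty: no retry, fail fast

    def record_success(self):
        self._tokens = min(self._capacity, self._tokens + self._success_refill)
```

Because successes refill the bucket slowly, retry capacity recovers gradually as the downstream heals, rather than snapping open.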
Using Load shedding to avoid overload
❏ Amazon avoids overload by designing systems to scale proactively before the
overload
❏ Protection in layers: automatic scaling, shedding excess load gracefully,
monitoring all mechanisms, and continuous testing
❏ Universal Scalability Law
❏ A derivation of Amdahl's law
❏ Theory ~ Universal Scalability Law
❏ System throughput can improve using parallelization
❏ But it's limited by the points of serialization (what
cannot be parallelized)
Using Load shedding to avoid overload
❏ Throughput is bounded by system resources
❏ Throughput also decreases with Overload
Using Load shedding to avoid overload
❏ The graph is hard to read; it's better to distinguish goodput vs throughput
❏ Throughput = total number of requests per second (RPS)
❏ Goodput = the subset of throughput handled without errors and with low
latency
Using Load shedding to avoid overload
● Preventing work from going to waste
○ Load Shedding: when the server is overloaded, start rejecting some requests.
○ Load Shedding goal: keep latency low and make the system
more available
● Even with load shedding, at some point the server pays the price per Amdahl's
law and goodput drops.
Using Load shedding to avoid overload
❏ Load Shedding mechanisms
❏ Overload might happen due to:
❏ Unexpected traffic
❏ Loss of fleet capacity (bad deployment or other reasons)
❏ Clients shifting from cheap requests (like cached reads) to expensive requests (cache misses or writes)
❏ The cost of dropping requests
❏ Amazon drops requests only after goodput plateaus
❏ Amazon makes sure the cost of dropping requests is small
❏ Dropping requests too early could be more expensive than it needs to be
❏ In rare cases dropping requests could be more expensive than holding the requests
❏ In these cases Amazon slows rejections down to, at minimum, the latency of successful responses
❏ Prioritize requests
❏ The most important request the server will receive is the ping from the load balancer
❏ Prioritization and throttling can be used together
❏ Amazon spends lots of time on placement algorithms but favors predictably provisioned workloads over unpredictable ones
Using Load shedding to avoid overload
❏ Keeping an eye on the clock
❏ If the server realizes, half-way through a request, that the client has timed out, it can skip the rest of the work and fail the request
❏ It's important to include timeout hints on requests, which tell the server how long the client will wait
❏ If an API has start() and end() operations, end() should be prioritized over start().
❏ Pagination can be dangerous - Amazon designs services to perform bounded work and not paginate endlessly
❏ Watching out for queues
❏ Look at request duration when managing internal queues
❏ Record how long work was sitting in the queue waiting to be processed
❏ Bounded-size queues are important
❏ Limit the upper bound of time that work will wait in the queue, and discard it past that
❏ Sometimes a LIFO approach is used, which HTTP/2 supports
❏ LBs might queue incoming requests (surge queues) - these queues can lead to brownouts
❏ It's safer to use a spillover configuration which fails fast instead of queueing
❏ Classic ELB uses surge queues but ALB rejects excess traffic
❏ Protecting from overload in lower layers
❏ Max connections (like Nginx has) is used as a last resort, not as a default mechanism
❏ iptables can be used to reject connections in emergencies
❏ AWS WAF can shed excess traffic on a number of dimensions
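A minimal load-shedding sketch of the idea running through this section: reject beyond a fixed in-flight limit so accepted requests keep low latency (protecting goodput). The concurrency-limit policy is one illustrative choice among those the slides mention:

```python
# Beyond a fixed number of in-flight requests, the server rejects
# cheaply and immediately instead of queueing -- latency stays low for
# the requests it DID accept, which is what keeps goodput up.
class LoadShedder:
    def __init__(self, max_in_flight):
        self._max, self._in_flight = max_in_flight, 0

    def try_begin(self):
        if self._in_flight >= self._max:
            return False                   # shed: reject fast and cheaply
        self._in_flight += 1
        return True

    def end(self):
        self._in_flight -= 1
```

Typical use per request: `if not shedder.try_begin(): return 503`, then handle the request and call `shedder.end()` in a `finally` block.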
Avoid queue backlogs
❏ Queues are supposed to increase availability but can backfire, making recovery time worse
❏ In a queue-based system, when the system is down, messages keep arriving (big backlog)
❏ Queue-based systems have 2 modes
Fast Mode
❏ When there is no backlog
❏ Latency is low
❏ The system is fast
Sinister Mode
❏ If load increases or a failure happens
❏ End-to-end latency goes higher
❏ Sinister mode kicks in
❏ Takes a long time to go back to fast
mode.
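A bounded queue that also discards stale work combines two of the defenses discussed later in this section (bounded size, upper bound on wait time). The limits here are illustrative:

```python
import collections
import time

class BoundedQueue:
    """Full queue => spillover (fail fast). Expired work is dropped at
    poll time instead of being processed uselessly late, so a backlog
    cannot keep the system stuck in 'sinister mode'."""

    def __init__(self, max_size, max_wait, clock=time.monotonic):
        self._q = collections.deque()
        self._max_size, self._max_wait, self._clock = max_size, max_wait, clock

    def offer(self, item):
        if len(self._q) >= self._max_size:
            return False                   # spillover: reject instead of queueing
        self._q.append((item, self._clock()))
        return True

    def poll(self):
        while self._q:
            item, enqueued = self._q.popleft()
            if self._clock() - enqueued <= self._max_wait:
                return item                # still fresh enough to be useful
            # expired: drop it and look at the next entry
        return None
```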
Avoid queue backlogs
How to measure availability and latency?
❏ Producer availability is proportional to queue availability
❏ If we measure availability on the consumer side, it might look worse than it is.
❏ Availability measured from the DLQ.
❏ DLQ metrics are good but might detect the problem too late.
❏ SQS has timestamps for each message consumed from the queue: the producer can log metrics of how far behind it is.
❏ IoT strategy: categorize metrics of first attempts separately from latency metrics of retry
attempts
❏ X-Ray and distributed tracing can help to understand/debug
Avoid queue backlogs
Backlogs in multi tenant async systems
❏ Amazon doesn't expose internal queues directly to you (AWS Lambda)
❏ Throttling to guarantee fairness - per-consumer rate-based limits
❏ Limits provide guard rails for unexpected spikes, allowing AWS to do the provisioning
needed under the hood
❏ Design patterns to avoid large queue backlogs
❏ Protection at every layer - throttling
❏ Using more than one queue helps to shape the traffic
❏ Real-time systems use FIFO queues but prefer LIFO behavior
Avoid queue backlogs
Amazon Approach: Creating Resilient multi tenant async systems
❏ Amazon separates workloads into different queues
❏ Shuffle sharding - AWS Lambda and IoT have queues for every device/function
❏ Sidelining excess traffic to a separate queue
❏ Sidelining old traffic to a separate queue
❏ Dropping old messages
❏ Limiting threads and other resources per queue
❏ Sending back pressure upstream - Amazon MQ
❏ Delay queues
❏ Avoid many in-flight messages
❏ DLQ for messages that cannot be processed
❏ Ensuring an additional buffer of polling threads for workloads - to absorb bursts
❏ Heartbeating long-running messages
❏ Plan for cross-host debugging
Workload isolation with shuffle sharding
Amazon Invented Shuffle Sharding
❏ Route53 serves the biggest websites in the world
❏ Using Amazon for the root domain is not simple/easy, thanks to design decisions made in the
DNS protocol in the 1980s
❏ CNAME can offload part of a sub-domain to another provider but does not work at the root/top level
❏ To serve customer needs, Amazon needs to host customers' domains.
❏ Hosting DNS is no small task: if there are problems, you can take the whole business OFFLINE
❏ Shuffle sharding was invented to handle DDoS attacks on Route53
❏ A powerful pattern to deliver cost-effective / multi-tenant services
❏ Regular sharding can make the whole system go down during a DDoS attack - the scope of failure is
"Everything for everyone".
Workload isolation with shuffle sharding
Dividing the workers into 4 shards reduces the blast radius from 100% to 25%
Workload isolation with shuffle sharding
With shuffle sharding we create virtual shards and divide even more - 8 workers, 2 per shard = 28 unique combinations =
28 shuffle shards - the scope of a problem is 1/28 == 7 times better than regular sharding.
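The arithmetic on this slide is just combinations:

```python
from itertools import combinations

# With 8 workers and shards of 2, there are C(8, 2) = 28 distinct
# shuffle shards, so a poisoned shard touches 1/28 of tenants --
# 7x smaller than the 1/4 blast radius of plain 4-shard sharding.
def shuffle_shards(workers, shard_size):
    return list(combinations(workers, shard_size))

shards = shuffle_shards(range(8), 2)       # 28 unique worker pairs
```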
Workload isolation with shuffle sharding
Route53 has 2048 virtual name servers == 730 billion shuffle shards == a unique shuffle shard for every domain
https://github.com/awslabs/route53-infima
Instrumenting dist sys for Observability
Amazon Learnings
❏ Great instrumentation helps to see what experience we are giving to
customers
❏ Amazon considers more than average latency and focuses on outliers, p99.9 and
p99.99 - even 1 slow request in 10k is still a poor experience.
Instrumenting dist sys for Observability
❏ Amazon has standard libraries to
instrument logs and metrics.
❏ Amazon instruments logs with 2
kinds of data: request data and
debug data (different log files)
Instrumenting dist sys for Observability
Request Log Best Practices
❏ Emit 1 and only 1 log entry per request
❏ Record request details before doing validations
❏ Sanitize requests before logging (encode, escape, and truncate)
❏ Don't add 1MB strings into the log just because they're in the request
❏ Keep metric names short but not too short
❏ Break long-running tasks (minutes / hours) into multiple log entries
❏ Amazon's log format is binary and uses http://amzn.github.io/ion-docs/
❏ Ensure log volumes are big enough to handle logging at max throughput
❏ Consider the behavior of the system with a full disk - operating without logs is risky;
detect when a server's disk is nearly full.
Instrumenting dist sys for Observability
Request Log Best Practices
❏ Synchronize clocks
https://aws.amazon.com/blogs/aws/keeping-time-with-amazon-time-sync-service/
❏ Amazon also uses: https://chrony.tuxfamily.org/
❏ Emit zero counts for availability metrics
❏ 1 Request succeeded
❏ 0 Request failed
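The "emit zero counts" practice can be shown in a few lines (the counter names are illustrative):

```python
# Every request reports BOTH counters with explicit 0/1 values, so
# availability = mean(Success) aggregates correctly even over servers
# that saw no failures at all (they still emit Failure=0 samples).
def request_counters(succeeded):
    return {"Success": 1 if succeeded else 0,
            "Failure": 0 if succeeded else 1}
```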
Instrumenting dist sys for Observability
What to Log?
❏ Log availability and latency of dependencies
❏ Break out dependency metrics per call, per resource, per status code
❏ Record in-memory queue depths when accessing them
❏ Organize errors by category of cause | Add an additional counter for error
reason (Diego Pacheco note: I did this in the past - called it "Error
Observability" - and also exposed it via REST)
❏ Log important metadata about the unit of work
❏ Protect logs with access control and encryption
Instrumenting dist sys for Observability
What to Log?
❏ Avoid putting overly sensitive information in logs
❏ Log a trace ID and propagate it to backend calls (Diego Pacheco note: I did this
a lot too, called MID (Message ID), generated at the Gateway/Edge layer and
propagated to all calls via HTTP headers and message headers, e.g.
JMS).
❏ Log different latency metrics depending on status code and size
❏ Categorized, like Small Request Latency and Large Request Latency
Instrumenting dist sys for Observability
Application Log Best Practices
❏ Keep the application log free of spam - INFO / DEBUG are disabled in prod.
❏ The application log is a location for trace information
❏ Include the corresponding request ID
❏ Rate-limit application log error spam
❏ Prefer format strings over String#format or string concatenation - format
args for disabled DEBUG calls won't be evaluated.
❏ Log request IDs from failed service calls
Instrumenting dist sys for Observability
High throughput Services Log Best Practices
❏ DynamoDB serves 20M RPS of Amazon-internal traffic
❏ Log sampling - write out every Nth entry, not every single one. Prioritize logging
slow and failed requests over successful ones.
❏ Offload serialization and log flushing to a separate thread.
❏ Frequent log rotation
❏ Write logs pre-compressed
❏ Write to a ramdisk / tmpfs
❏ In-memory aggregates | Monitor resource utilization
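The log-sampling policy above can be sketched like this (the knobs and thresholds are illustrative):

```python
# Failures and slow requests are always logged; fast successes are
# logged 1-in-N. At DynamoDB-scale request rates this keeps log volume
# bounded while preserving the entries you actually debug with.
class LogSampler:
    def __init__(self, every_n, slow_ms):
        self._n, self._slow_ms = every_n, slow_ms
        self._count = 0

    def should_log(self, status, latency_ms):
        if status >= 500 or latency_ms >= self._slow_ms:
            return True                    # errors / slow: always keep
        self._count += 1
        return self._count % self._n == 0  # fast successes: every Nth
```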
Amazon Builder Library
Notes
Diego Pacheco

More Related Content

PDF
Cassandra
PPT
Web Speed And Scalability
PPTX
Caching & Performance In Cold Fusion
PPTX
Show Me The Cache!
PDF
Provisioning and Capacity Planning Workshop (Dogpatch Labs, September 2015)
PPT
Understanding
PDF
Introduction to Java performance tuning
PDF
DrupalCamp LA 2014 - A Perfect Launch, Every Time
Cassandra
Web Speed And Scalability
Caching & Performance In Cold Fusion
Show Me The Cache!
Provisioning and Capacity Planning Workshop (Dogpatch Labs, September 2015)
Understanding
Introduction to Java performance tuning
DrupalCamp LA 2014 - A Perfect Launch, Every Time

What's hot (20)

PDF
Cloud-Native DevOps Engineering
PPTX
Architecting fail safe data services
PPTX
Cloud Architecture & Distributed Systems Trivia
PDF
Varnish Cache Plus. Random notes for wise web developers
PDF
Boyan Ivanov - latency, the #1 metric of your cloud
PDF
Modern day jvm controversies
ODP
Choosing a Web Architecture for Perl
PPTX
Tech Talk Series, Part 4: How do you achieve high availability in a MySQL env...
PDF
Care and feeding notes
PDF
Take home your very own free Vagrant CFML Dev Environment - Presented at dev....
PDF
Distributed Queue System using Gearman
PPTX
[충격] 당신의 안드로이드 앱이 느린 이유가 있다??!
PDF
Vladimir Ulogov - Large Scale Simulation | ZabConf2016 Lightning Talk
PDF
Spring One 2 GX 2014 - CACHING WITH SPRING: ADVANCED TOPICS AND BEST PRACTICES
PPTX
Achieving Massive Scalability and High Availability for PHP Applications in t...
PDF
Gearman: A Job Server made for Scale
PPTX
Tech Talk Series, Part 2: Why is sharding not smart to do in MySQL?
PPTX
BTV PHP - Building Fast Websites
PDF
How to cache your static resources
PPTX
How to make your site 5 times faster in 10 minutes
Cloud-Native DevOps Engineering
Architecting fail safe data services
Cloud Architecture & Distributed Systems Trivia
Varnish Cache Plus. Random notes for wise web developers
Boyan Ivanov - latency, the #1 metric of your cloud
Modern day jvm controversies
Choosing a Web Architecture for Perl
Tech Talk Series, Part 4: How do you achieve high availability in a MySQL env...
Care and feeding notes
Take home your very own free Vagrant CFML Dev Environment - Presented at dev....
Distributed Queue System using Gearman
[충격] 당신의 안드로이드 앱이 느린 이유가 있다??!
Vladimir Ulogov - Large Scale Simulation | ZabConf2016 Lightning Talk
Spring One 2 GX 2014 - CACHING WITH SPRING: ADVANCED TOPICS AND BEST PRACTICES
Achieving Massive Scalability and High Availability for PHP Applications in t...
Gearman: A Job Server made for Scale
Tech Talk Series, Part 2: Why is sharding not smart to do in MySQL?
BTV PHP - Building Fast Websites
How to cache your static resources
How to make your site 5 times faster in 10 minutes
Ad

Similar to Amazon builder Library notes (20)

PDF
"Resiliency and Availability Design Patterns for the Cloud", Sebastien Storma...
PDF
Serverless best practices plus design principles 20m version
PDF
Serverless AWS reInvent 2019 recap
PDF
All the Ops you need to know to Dev Serverless
PPTX
Gluecon 2018 - The Best Practices and Hard Lessons Learned of Serverless Appl...
PDF
Serverless on AWS: Architectural Patterns and Best Practices
PPTX
Keynote - Chaos Engineering: Why breaking things should be practiced
PDF
What to do when it's not you
PDF
What to do when it's not you
PPTX
Chaos Engineering: Why Breaking Things Should Be Practised.
PPTX
AWS fault tolerant architecture
PPTX
From Monolithic to Modern Apps: Best Practices
PPTX
Chaos Engineering: Why Breaking Things Should Be Practised.
PDF
(Kishore Jalleda) Launching products at massive scale - the DevOps way
PDF
Journey towards serverless infrastructure
PPTX
Inovação Rápida: O caso de negócio para desenvolvimento de aplicações modernas.
PPTX
Expect the unexpected: Anticipate and prepare for failures in microservices b...
PDF
How we scaled to 80K users by doing nothing!.pdf
PDF
Resilience Planning & How the Empire Strikes Back
PDF
Serverless use cases with AWS Lambda - More Serverless Event
"Resiliency and Availability Design Patterns for the Cloud", Sebastien Storma...
Serverless best practices plus design principles 20m version
Serverless AWS reInvent 2019 recap
All the Ops you need to know to Dev Serverless
Gluecon 2018 - The Best Practices and Hard Lessons Learned of Serverless Appl...
Serverless on AWS: Architectural Patterns and Best Practices
Keynote - Chaos Engineering: Why breaking things should be practiced
What to do when it's not you
What to do when it's not you
Chaos Engineering: Why Breaking Things Should Be Practised.
AWS fault tolerant architecture
From Monolithic to Modern Apps: Best Practices
Chaos Engineering: Why Breaking Things Should Be Practised.
(Kishore Jalleda) Launching products at massive scale - the DevOps way
Journey towards serverless infrastructure
Inovação Rápida: O caso de negócio para desenvolvimento de aplicações modernas.
Expect the unexpected: Anticipate and prepare for failures in microservices b...
How we scaled to 80K users by doing nothing!.pdf
Resilience Planning & How the Empire Strikes Back
Serverless use cases with AWS Lambda - More Serverless Event
Ad

More from Diego Pacheco (20)

PDF
Naming Things Book : Simple Book Review!
PDF
Continuous Discovery Habits Book Review.pdf
PDF
Thoughts about Shape Up
PDF
Holacracy
PDF
AWS IAM
PDF
PDF
Encryption Deep Dive
PDF
Sec 101
PDF
Reflections on SCM
PDF
Management: Doing the non-obvious! III
PDF
Design is not Subjective
PDF
Architecture & Engineering : Doing the non-obvious!
PDF
Management doing the non-obvious II
PDF
Testing in production
PDF
Nine lies about work
PDF
Management: doing the nonobvious!
PDF
AI and the Future
PDF
Dealing with dependencies
PDF
Dealing with dependencies in tests
PDF
Kanban 2020
Naming Things Book : Simple Book Review!
Continuous Discovery Habits Book Review.pdf
Thoughts about Shape Up
Holacracy
AWS IAM
Encryption Deep Dive
Sec 101
Reflections on SCM
Management: Doing the non-obvious! III
Design is not Subjective
Architecture & Engineering : Doing the non-obvious!
Management doing the non-obvious II
Testing in production
Nine lies about work
Management: doing the nonobvious!
AI and the Future
Dealing with dependencies
Dealing with dependencies in tests
Kanban 2020

Recently uploaded (20)

PDF
Diabetes mellitus diagnosis method based random forest with bat algorithm
PPT
“AI and Expert System Decision Support & Business Intelligence Systems”
PPTX
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
PPTX
20250228 LYD VKU AI Blended-Learning.pptx
PDF
Electronic commerce courselecture one. Pdf
PDF
Agricultural_Statistics_at_a_Glance_2022_0.pdf
PDF
Mobile App Security Testing_ A Comprehensive Guide.pdf
PDF
Advanced methodologies resolving dimensionality complications for autism neur...
PPTX
Cloud computing and distributed systems.
PDF
Spectral efficient network and resource selection model in 5G networks
PDF
Per capita expenditure prediction using model stacking based on satellite ima...
PDF
Unlocking AI with Model Context Protocol (MCP)
PDF
NewMind AI Weekly Chronicles - August'25 Week I
PDF
Chapter 3 Spatial Domain Image Processing.pdf
PDF
cuic standard and advanced reporting.pdf
PPT
Teaching material agriculture food technology
PDF
Machine learning based COVID-19 study performance prediction
PDF
Dropbox Q2 2025 Financial Results & Investor Presentation
PPTX
ACSFv1EN-58255 AWS Academy Cloud Security Foundations.pptx
PDF
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
Diabetes mellitus diagnosis method based random forest with bat algorithm
“AI and Expert System Decision Support & Business Intelligence Systems”
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
20250228 LYD VKU AI Blended-Learning.pptx
Electronic commerce courselecture one. Pdf
Agricultural_Statistics_at_a_Glance_2022_0.pdf
Mobile App Security Testing_ A Comprehensive Guide.pdf
Advanced methodologies resolving dimensionality complications for autism neur...
Cloud computing and distributed systems.
Spectral efficient network and resource selection model in 5G networks
Per capita expenditure prediction using model stacking based on satellite ima...
Unlocking AI with Model Context Protocol (MCP)
NewMind AI Weekly Chronicles - August'25 Week I
Chapter 3 Spatial Domain Image Processing.pdf
cuic standard and advanced reporting.pdf
Teaching material agriculture food technology
Machine learning based COVID-19 study performance prediction
Dropbox Q2 2025 Financial Results & Investor Presentation
ACSFv1EN-58255 AWS Academy Cloud Security Foundations.pptx
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...

Amazon builder Library notes

  • 2. @diego_pacheco ❏ Cat's Father ❏ Principal Software Architect ❏ Agile Coach ❏ SOA/Microservices Expert ❏ DevOps Practitioner ❏ Speaker ❏ Author diegopacheco http://diego-pacheco.blogspot.com.br/ About me... https://diegopacheco.github.io/
  • 3. Amazon Builders Library https://aws.amazon.com/builders-library/ ❏ How the Build AWS ❏ Amazon Experience ❏ Theory ❏ Practice ❏ Real Cases ❏ Techniques and products ❏ Super interesting ❏ 13 Articles so far
  • 5. Avoid one way doors The Importance of Rollback Type 1 decisions are not reversible, and you have to be very careful making them. (One way doors) Type 2 decisions are like walking through a door — if you don't like the decision, you can always go back. (Two-Way Doors).
  • 6. 1. No Errors 2. No Service Disruption Backward Compatibility
  • 8. Local & External Caches Local cache ❏ Added on Demand ❏ No Ops Overhead ❏ In-memory HashTable External cache ❏ Memcached | Redis ❏ Reduce Cache Coherence issue ❏ No Cold Start issues ❏ Load of Downstream is reducedIssues ❏ Downstream load proportional to fleet size ❏ Cache Coherence ❏ Cold Start Issues ❏ More Complexity ❏ More Ops Overhead
  • 9. Inline & Side Caches Inline cache ❏ R/W Trought ❏ Embedded Cache mgmt ❏ Dax, Nginx, Varnish ❏ Uniform API model for clients ❏ Cache logic outside of the code (Eliminating potential bugs). Side cache ❏ ElastiCache(Redis|Memcached) ❏ Guava | EhCache ❏ Application controls the cache
  • 10. Cache Challenges Figure it out the right ❏ Cache Size ❏ Expiration Policy ❏ Eviction Policy Most Common Expiration policy ❏ Time-based: TTL Amazon use 2 TTLS ❏ Soft: For updates ❏ Hard: For eviction * Used in IAM Most Common Eviction policy ❏ LRU Keep Eye on ❏ Cache HIT / Miss metrics
  • 11. Downstream fallback Be Careful ❏ Could spike traffic in downstream ❏ Could lead to: ❏ Throttling ❏ Burnout Better Options ❏ In case of External Cache outage: ❏ Fallback to Local Cache ❏ Use Load Shedding - reduce the number of requests going to downstream
  • 12. Thundering Herd Problem The Issue ❏ Many clients requesting the same key / data ❏ Uncached - so forces go to downstream. ❏ Empty Local cache (just joined the fleet) ❏ Situation could lead to: ❏ Burnout ❏ Throttling The Solution ❏ Cache Coaleasing ❏ Varnish nginx have this feature ❏ Make sure just 1 request goto the downstream
  • 13. Leader Election (Single-Leader) Benefits ❏ Easier to Understand ❏ Works Simply ❏ Offers client consistency Downsides ❏ SPOF ❏ Single Point of Scaling ❏ Single point of truth (bad leader has high blast radius) ❏ Partial Deployments are hard to apply
  • 14. Leader Election Best Practices Amazon does: ❏ Modeling systems with TLA+ ❏ Check Remaining lease before side-effect ops outside of the leader ❏ Consider on the code: slow network, timeouts, retrys, gc pauses ❏ Avoid Heart Beating leases on background thread ❏ Make it easy to find the host who is current leader
  • 15. Avoinding Fallback Issues ❏ Hard to Test ❏ Fallback could fail ❏ Fallback could make it worst ❏ Fallback could introduce latent bug
  • 16. Let it Crash ❏ Erland ❏ Akka ❏ ...now Amazon
  • 17. How Amazon Avoid Fallbacks Do: ❏ Make non-fallback code more resilient ❏ Let the caller handle the failure ❏ Push Data Proactivity (IAM credential push data and its valid for several hours). ❏ Convert fallaback to failover ❏ Ensure retry/timeouts don't become fallback
  • 18. Static Stability Amazon does dor Ec2: ❏ Control Plane vs Data Plane ❏ Control plane is more complex ❏ Data plane is more simple therefore more reliable ❏ AZs(Availability Zones) don't share: ❏ Power ❏ Infrastructure ❏ AZs are connected to each other fast fiber optical network
  • 19. Static Stability ~ EC2 Control Plane ❏ Finds physical server ❏ Allocate network interface ❏ Generate EBS volume ❏ Install SG rules ❏ More Complex Data Plane ❏ Routes Packages to the VPC route table ❏ R/W from Amazon Volumes ❏ Much more simple than Control plane therefore more available ❏ Control Plane impairment: ❏ Loose updates SGs ❏ But machine keep working
  • 20. Static Stability Under the hood Ec2 Static Stability: ❏ 2 Azs in same regions get deploys in different days ❏ Deploy first in one Box / Cell then 1/N Servers ❏ Align Ec2 deploy with AZ boundary ~ if deploy goes wrong affects only 1 AZ, them is rollback, fixed and deployed again. ❏ Packets flow stay under same AZ(avoid cross boundaries) ❏ Always provision capacity you don't need: ❏ AZs are 50% overprovisioned ❏ AZs operate at maximum 66% of the level which was load-tested
  • 21. Implementing Health Checkers Types of Health Checkers: ❏ Liveness Health Checker: an I healthy? ❏ Local Health Checker: ❏ Check disk ❏ critical proxy ❏ missing support process ~Observability (flying blind issue) ❏ Dependency Health Checkers ❏ Bad Configuration or State Metadata ❏ Inability to communicate with Peers Services ❏ Other issues: memory leaks, deadlocks can make server show errors
  • 22. Implementing Health Checkers Anomaly Detection ❏ Compare Server with peers To realize if is behaving oddly. ❏ Aggregate data and compare errors rates. Cannot Detect ❏ Clock Skew ❏ Old Code ❏ Any unanticipated failure more React to HC Failures ❏ Fail Open (ELB) ❏ Central authority ❏ When all fail - allow traffic ❏ Prioritize your Health ❏ Max socket connections to avoid death spiral
  • 23. Going fast with CD Takeaways: ❏ Always improve release process without being a blocker to business ❏ Add checkers on the Pipelines/Steps rather than manual process ❏ Reducing risk defect affects customers: ❏ Deployment hygiene (Minimum health hosts ~ CodeDeploy) ❏ Test Prior Production: Unit, Integration, Browser, Inject Failure ❏ Validate in Production: Don't release all at once. ❏ Deploys are done in business hours
  • 24. Timeouts, Retries, Backoff + Jitter Takeaways: ❏ It's impossible to avoid failure(only reduce the probability) ❏ Basic Constructs to make systems more reliable(Google SRE saus the same): ❏ Timeouts, Retry, Exponential Backoff + Jitter ❏ Retries make the client survive partial failures ❏ Pick the right timeout is hard. Too low: Increase traffic + latency ❏ Latency metrics help you to pick the right value ❏ Amazon accept the rate of false timeouts o.1% (p99,9)
  • 25. Timeouts, Retries, Backoff + Jitter When the default strategy doesn't work: ❏ clients with substantial network latency (over the internet) ❏ clients with tight latency bounds, p99.9 close to p50 ❏ implementations that do not cover DNS or TLS handshake times Retry issues ❏ Circuit breakers introduce modal behavior which is difficult to test ❏ A local token bucket fixes the CB issues ❏ The local token bucket has been in the AWS SDK since 2016 ❏ It's also important to know when to retry and to analyze HTTP errors
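One way to sketch the local token bucket that the slides prefer over a circuit breaker: retries spend tokens, successes refund a fraction, and an empty bucket skips the retry (fails fast) while first attempts still flow, avoiding the all-or-nothing modal behavior of a classic breaker. The capacity, cost, and refund values are illustrative assumptions, not the AWS SDK's actual parameters:

```python
class RetryTokenBucket:
    """Local retry budget: when downstream failures burn through the
    bucket, retries stop, but new first attempts are never blocked."""

    def __init__(self, capacity: float = 10.0, retry_cost: float = 1.0,
                 refund: float = 0.5):
        self.capacity = capacity
        self.tokens = capacity
        self.retry_cost = retry_cost
        self.refund = refund

    def try_acquire_retry(self) -> bool:
        """Spend a token to retry; False means fail fast instead."""
        if self.tokens >= self.retry_cost:
            self.tokens -= self.retry_cost
            return True
        return False

    def on_success(self) -> None:
        """Successful calls slowly refill the budget."""
        self.tokens = min(self.capacity, self.tokens + self.refund)
```

Unlike a circuit breaker, behavior degrades gradually as the bucket drains rather than flipping between open and closed modes.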
  • 26. Using Load shedding to avoid overload ❏ Amazon avoids overload by designing systems to scale proactively before overload hits ❏ Protection in layers: automatic scaling, shedding excess load gracefully, monitoring all mechanisms, and continuous testing ❏ Universal Scalability Law ❏ a derivation of Amdahl's law ❏ Theory ~ Universal Scalability Law ❏ system throughput can improve with parallelization ❏ but it is limited by the throughput of the points of serialization (what cannot be parallelized)
  • 27. Using Load shedding to avoid overload ❏ Throughput is bounded by system resources ❏ Throughput also decreases with Overload
  • 28. Using Load shedding to avoid overload ❏ The graph is hard to read; it is better to distinguish goodput vs throughput ❏ Throughput = total number of requests per second (RPS) ❏ Goodput = the subset of throughput handled without errors and with low latency
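The goodput/throughput distinction can be computed directly from request outcomes; the latency SLO value and the `(ok, latency_ms)` tuple shape are illustrative assumptions:

```python
def throughput_and_goodput(results, latency_slo_ms: float = 100.0):
    """Throughput counts every request; goodput counts only those that
    succeeded AND met the latency SLO. Returns (throughput, goodput)."""
    throughput = len(results)
    goodput = sum(1 for ok, latency_ms in results
                  if ok and latency_ms <= latency_slo_ms)
    return throughput, goodput
```

Under overload the gap between the two numbers widens: requests still complete, but too slowly or with errors to count as useful work.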
  • 29. Using Load shedding to avoid overload
  • 30. Using Load shedding to avoid overload ● Preventing work from going to waste ○ Load shedding: when a server is overloaded, start rejecting some requests. ○ Load shedding: the goal is to keep latency low and make the system more available ● Even with load shedding, at some point the server pays the price of Amdahl's law and goodput drops.
  • 31. Using Load shedding to avoid overload
  • 32. Using Load shedding to avoid overload ❏ Load-shedding mechanisms ❏ Overload might happen from: ❏ unexpected traffic ❏ loss of fleet capacity (a bad deployment or other reasons) ❏ clients shifting from making cheap requests (like cached reads) to expensive requests (cache misses or writes) ❏ The cost of dropping requests ❏ Amazon drops requests only after goodput plateaus ❏ Amazon makes sure the cost of dropping requests is small ❏ dropping requests too early could be more expensive than it needs to be ❏ in rare cases dropping requests could be more expensive than holding them ❏ in those cases Amazon slows rejections so they take, at a minimum, the latency of successful responses ❏ Prioritize requests ❏ the most important request the server will receive is the ping from the load balancer ❏ prioritization and throttling can be used together ❏ Amazon spends lots of time on placement algorithms but favors predictable, provisioned workloads over unpredictable workloads
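A minimal load-shedding sketch in the spirit of the slides: cap the number of concurrent in-flight requests, reject the excess immediately (fail fast) rather than queueing it, and always admit the load-balancer ping since it is the most important request. The cap value and the `is_health_check` flag are illustrative assumptions:

```python
import threading

class LoadShedder:
    """Admission control by bounded concurrency. Rejections are cheap
    and immediate, which keeps latency low for admitted requests."""

    def __init__(self, max_in_flight: int):
        self.max_in_flight = max_in_flight
        self.in_flight = 0
        self.lock = threading.Lock()

    def try_admit(self, is_health_check: bool = False) -> bool:
        with self.lock:
            if is_health_check or self.in_flight < self.max_in_flight:
                self.in_flight += 1
                return True
            return False  # shed: reject now instead of queueing

    def release(self) -> None:
        with self.lock:
            self.in_flight -= 1
```

Callers pair every successful `try_admit()` with a `release()` in a finally block; a False return maps to an HTTP 503 or equivalent.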
  • 33. Using Load shedding to avoid overload ❏ Keeping an eye on the clock ❏ if the server realizes a request is half-way done but the client has already timed out, it can skip the rest of the work and fail the request ❏ it's important to include timeout hints on requests that tell the server how long the client will wait ❏ if an API has start() and end() operations, end() should be prioritized over start(). ❏ pagination can be dangerous - Amazon designs services to perform bounded work and not paginate endlessly ❏ Watching out for queues ❏ look at request duration when managing internal queues ❏ record how long the work was sitting in the queue waiting to be processed ❏ bounded-size queues are important ❏ limit the upper-bound time work may wait in the queue and discard it past that ❏ sometimes use a LIFO approach, which HTTP/2 supports ❏ LBs might queue incoming requests (surge queues) - these queues can lead to brownout ❏ it's safer to use a spillover configuration which fails fast instead of queueing ❏ Classic ELB uses surge queues but ALB rejects excess traffic ❏ Protecting from overload in lower layers ❏ max connections (like nginx has) is used as a last resort and not as a default mechanism ❏ iptables can be used to reject connections in emergencies ❏ AWS WAF can shed excess traffic on a number of dimensions
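The "keeping an eye on the clock" idea can be sketched as a handler that checks the client's timeout hint between work steps and abandons work the client has already given up on; the step-list shape and the injectable clock are illustrative assumptions:

```python
import time

def handle_with_deadline(work_steps, client_timeout_s: float,
                         now=time.monotonic):
    """Compute a deadline from the client's timeout hint, then check it
    between steps: past the deadline, nobody will read the response, so
    abandon the remaining work instead of wasting capacity on it."""
    deadline = now() + client_timeout_s
    results = []
    for step in work_steps:
        if now() > deadline:
            raise TimeoutError("client deadline passed; abandoning work")
        results.append(step())
    return results
```

The same deadline can be propagated to downstream calls so the whole chain stops wasting effort at once.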
  • 34. Avoid queue backlogs Fast mode ❏ when there is no backlog ❏ latency is low ❏ the system is fast Sinister mode ❏ if load increases or a failure happens ❏ end-to-end latency goes higher ❏ sinister mode kicks in ❏ it takes a long time to go back to fast mode. ❏ Queues that are supposed to increase availability can backfire and make recovery time worse ❏ In a queue-based system, when the system is down messages keep arriving (a big backlog) ❏ Queue-based systems have these 2 modes
  • 35. Avoid queue backlogs How to measure availability and latency? ❏ Producer availability is proportional to queue availability ❏ If we measure availability on the consumer side it might look worse than it is. ❏ Availability measures from the DLQ: ❏ DLQ metrics are good but might detect the problem too late. ❏ SQS has timestamps for each message consumed from the queue: the consumer can log producer metrics showing how far behind it is. ❏ IoT strategy: categorize metrics of first attempts separately from the latency metrics of retry attempts ❏ X-Ray and distributed tracing can help to understand/debug
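A rough sketch of the two measurement ideas above: computing how far behind the consumer is from message enqueue timestamps, and the IoT-style separation of first-attempt latency from retry latency. Field shapes and metric names are illustrative assumptions:

```python
def backlog_age_seconds(message_enqueue_times, now: float) -> float:
    """Age of the oldest unprocessed message: a backlog signal that
    fires much earlier than a DLQ alarm would."""
    if not message_enqueue_times:
        return 0.0
    return now - min(message_enqueue_times)

def record_latency(metrics: dict, attempt: int, latency_s: float) -> None:
    """Keep first-attempt latency separate from retry latency so
    retries don't hide a healthy first-attempt experience (or vice versa)."""
    key = "first_attempt_latency" if attempt == 1 else "retry_latency"
    metrics.setdefault(key, []).append(latency_s)
```

Both metrics are cheap to emit from the consumer loop on every poll.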
  • 36. Avoid queue backlogs Backlogs in multi-tenant async systems ❏ Amazon doesn't expose internal queues directly to you (AWS Lambda) ❏ Throttling to guarantee fairness - per-consumer rate-based limits ❏ Limits provide guard rails for unexpected spikes, allowing AWS to do the provisioning needed under the hood ❏ Design patterns to avoid large queue backlogs ❏ protection at every layer - throttling ❏ using more than one queue helps to shape the traffic ❏ real-time systems use FIFO queues but prefer LIFO behavior
  • 37. Avoid queue backlogs Amazon's approach: creating resilient multi-tenant async systems ❏ Amazon separates workloads into different queues ❏ Shuffle sharding - AWS Lambda and IoT have queues for every device/function ❏ Sidelining excess traffic to a separate queue ❏ Sidelining old traffic to a separate queue ❏ Dropping old messages ❏ Limiting threads and other resources per queue ❏ Sending back pressure upstream - Amazon MQ ❏ Delay queues ❏ Avoiding many in-flight messages ❏ DLQs for messages that cannot be processed ❏ Ensuring an additional buffer in polling-thread workloads - to absorb bursts ❏ Heartbeating long-running messages ❏ Planning for cross-host debugging
  • 38. Workload isolation with shuffle sharding Amazon invented shuffle sharding ❏ Route 53 serves the biggest websites in the world ❏ They use Amazon for the root domain, but thanks to design decisions made in the DNS protocol in the 1980s it is not simple/easy ❏ a CNAME can offload part of a sub-domain to another provider, but it does not work at the root/apex level ❏ To serve customer needs Amazon has to host customers' domains. ❏ Hosting DNS is no small task: if there are problems, you can take the whole business OFFLINE ❏ Shuffle sharding was invented to handle DDoS attacks on Route 53 ❏ A powerful pattern to deliver cost-effective multi-tenant services ❏ Regular sharding can make the whole system go down during a DDoS attack - the scope of failure is "everything for everyone".
  • 39. Workload isolation with shuffle sharding
  • 40. Workload isolation with shuffle sharding Dividing the workers into 4 shards reduces the blast radius from 100% to 25%
  • 41. Workload isolation with shuffle sharding With shuffle sharding we create virtual shards and divide even further - 8 workers = 28 unique 2-worker combinations = 28 shuffle shards - the scope of a problem is 1/28, which is 7 times better than regular sharding.
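A sketch of the arithmetic: with 8 workers and shards of size 2 there are C(8,2) = 28 combinations, and with 2048 virtual name servers and shards of size 4 there are roughly 730 billion. The hash-based tenant assignment below is an illustrative assumption, not Route 53's actual algorithm (which lives in the Infima library):

```python
import hashlib
from itertools import combinations
from math import comb

def shard_count(workers: int, shard_size: int) -> int:
    """Number of possible shuffle shards: C(workers, shard_size)."""
    return comb(workers, shard_size)

def shuffle_shard(tenant_id: str, workers: int = 8, shard_size: int = 2):
    """Deterministically assign a tenant to one worker combination.
    Enumerating all combinations is fine for 8 workers; real systems
    compute the assignment without materializing the full list."""
    shards = list(combinations(range(workers), shard_size))
    digest = int(hashlib.sha256(tenant_id.encode()).hexdigest(), 16)
    return shards[digest % len(shards)]
```

Because two tenants rarely share their full worker set, a poison tenant takes down only its own combination, not a whole conventional shard.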
  • 42. Workload isolation with shuffle sharding Route 53 has 2048 virtual name servers == 730 billion shuffle shards == a unique shuffle shard for every domain https://github.com/awslabs/route53-infima
  • 43. Instrumenting dist sys for Observability Amazon learnings ❏ Great instrumentation helps to see what experience we are giving to customers ❏ Amazon considers more than average latency and focuses on outliers, p99.9 and p99.99 - 1 in 10k requests being slow is still a poor experience.
  • 44. Instrumenting dist sys for Observability ❏ Amazon has standard libraries to instrument logs and metrics. ❏ Amazon instruments logs with 2 kinds of data: request data and debug data (in different log files)
  • 45. Instrumenting dist sys for Observability Request log best practices ❏ Emit one and only one log entry per request ❏ Record request details before doing validations ❏ Sanitize requests before logging (encode, escape, and truncate) ❏ Don't add 1MB strings to the log just because they are in the request ❏ Keep metric names short but not too short ❏ Break long-running tasks (minutes / hours) into multiple log entries ❏ Amazon's log format is binary and uses https://amzn.github.io/ion-docs/ ❏ Ensure log volumes are big enough to handle max throughput ❏ Consider the behavior of the system with a full disk - operating without logs is risky; detect when a server's disk is nearly full.
  • 46. Instrumenting dist sys for Observability Request log best practices ❏ Synchronize clocks https://aws.amazon.com/blogs/aws/keeping-time-with-amazon-time-sync-service/ ❏ Amazon also uses: https://chrony.tuxfamily.org/ ❏ Emit zero counts for availability metrics ❏ 1 = request succeeded ❏ 0 = request failed
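The zero-counts practice can be sketched as emitting an explicit 0 or 1 on every request, so availability is a plain mean and a missing datapoint stays distinguishable from a failure; the function and list shapes are illustrative assumptions:

```python
def emit_availability(datapoints: list, succeeded: bool) -> None:
    """Emit 1 for success and an explicit 0 for failure on EVERY
    request -- never skip the failure case."""
    datapoints.append(1 if succeeded else 0)

def availability(datapoints: list) -> float:
    """Availability is then simply the mean of the emitted datapoints."""
    return sum(datapoints) / len(datapoints) if datapoints else float("nan")
```

If failures were simply not emitted, the mean would stay at 1.0 during an outage; the explicit zeros make the dip visible.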
  • 47. Instrumenting dist sys for Observability What to log? ❏ Log availability and latency of dependencies ❏ Break out dependency metrics per call, per resource, per status code ❏ Record in-memory queue depths when accessing them ❏ Organize errors by category of cause | add an additional counter for the error reason (Diego Pacheco note: I did this in the past - called it "Error Observability" - also exposed it via REST) ❏ Log important metadata about the unit of work ❏ Protect logs with access control and encryption
  • 48. Instrumenting dist sys for Observability What to log? ❏ Avoid putting overly sensitive information in logs ❏ Log the trace ID and propagate it to backend calls (Diego Pacheco note: I did this a lot as well - called it the MID (Message ID), generated at the gateway/edge layer and propagated to all calls via HTTP headers and message headers, e.g. JMS). ❏ Log different latency metrics depending on status code and size ❏ categorized, like small-request latency and large-request latency
  • 49. Instrumenting dist sys for Observability Application log best practices ❏ Keep the application log free of spam - INFO / DEBUG are disabled in prod. ❏ The application log is a location for trace information ❏ Include the corresponding request ID ❏ Rate-limit application log error spam ❏ Prefer parameterized format strings over String#format or string concatenation - so the formatting of disabled DEBUG calls is never evaluated. ❏ Log request IDs from failed service calls
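Rate-limiting log error spam might look like the following fixed-window sketch; the window size, per-key limit, and injected clock are illustrative assumptions:

```python
class RateLimitedLogger:
    """Allow at most max_per_window copies of each message key per time
    window, suppressing the rest so a hot error path can't flood the log."""

    def __init__(self, max_per_window: int, window_s: float):
        self.max_per_window = max_per_window
        self.window_s = window_s
        self.counts = {}  # key -> (window_start, count)

    def should_log(self, key: str, now: float) -> bool:
        start, count = self.counts.get(key, (now, 0))
        if now - start >= self.window_s:
            start, count = now, 0  # window expired: start a fresh one
        if count < self.max_per_window:
            self.counts[key] = (start, count + 1)
            return True
        self.counts[key] = (start, count)
        return False  # suppressed: over budget for this window
```

A production version would also periodically log how many entries were suppressed, so the spam is summarized rather than silently lost.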
  • 50. Instrumenting dist sys for Observability High-throughput services log best practices ❏ DynamoDB serves 20M RPS of Amazon internal traffic ❏ Log sampling - write out every Nth entry, not every single one. Prioritize logging slow and failed requests over successful ones. ❏ Offload serialization and log flushing to a separate thread. ❏ Frequent log rotation ❏ Write logs pre-compressed ❏ Write to a ramdisk / tmpfs ❏ In-memory aggregates | monitor resource utilization
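The sampling rule above - always keep failed and slow requests, keep only every Nth fast success - can be sketched as a single predicate; the sampling rate and slow threshold are illustrative assumptions:

```python
def should_sample(index: int, ok: bool, latency_ms: float,
                  every_n: int = 100, slow_ms: float = 500.0) -> bool:
    """Decide whether to write this request's log entry.
    Failures and slow requests are always kept, since they are the
    entries you need when debugging; fast successes are decimated."""
    if not ok or latency_ms >= slow_ms:
        return True
    return index % every_n == 0  # keep every Nth boring success
```

The decision runs before serialization, so sampled-out entries cost almost nothing.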