Operations @ Scale
Anurag Gupta, VP
AWS Database Services
Dev/Ops
How I learned to stop worrying and love my pager
Dev/Ops - your dev org is your ops org
I get a pager! You get a pager! Everyone gets a pager!
Why would I possibly want this?
It motivates design for operability
It aligns your interests w/ your customer experience
It improves the feedback loop to customer needs
Monitor everything
Every API call to your service,
Every API call you make to a dependent service
Canary traffic for things that vary (e.g. SQL statements)
Most of the metrics won’t be meaningful. That’s OK
Page on your high signal-to-noise metrics
Monitor these metrics during deployments
Median/average, fleet-wide, and coarse-time-grain metrics obscure problems
Measure TP90 and TP99 (90th and 99th percentile response time)
Measure at finer and finer grain
Evaluate per-customer metrics
Look for the needles in the haystack
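A minimal sketch of why tail percentiles and per-customer slices surface what a fleet-wide median hides. The data, the customer names, and the helpers `percentile` and `per_customer_tp` are hypothetical illustrations, not taken from the talk; the nearest-rank method is one common percentile definition among several.

```python
import math
from collections import defaultdict
from statistics import mean, median


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def per_customer_tp(latencies, pct):
    """latencies: iterable of (customer_id, latency_ms) pairs. Returns {customer_id: percentile}."""
    by_customer = defaultdict(list)
    for customer, ms in latencies:
        by_customer[customer].append(ms)
    return {customer: percentile(values, pct) for customer, values in by_customer.items()}


# Hypothetical data: 98 fast calls from customer "a", 2 slow calls from customer "b".
calls = [("a", 10)] * 98 + [("b", 900)] * 2
all_ms = [ms for _, ms in calls]

print("median:", median(all_ms), "mean:", mean(all_ms))  # the median is 10: looks healthy
print("TP90:", percentile(all_ms, 90))                   # 10: still hides the problem
print("TP99:", percentile(all_ms, 99))                   # 900: the tail shows up
print("per-customer TP99:", per_customer_tp(calls, 99))  # customer "b" is the needle
```

In this made-up data set the fleet-wide median and TP90 both report 10 ms, while TP99 and the per-customer view expose the one customer having a bad experience.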
Correction-of-Error (COE) Reporting
Meet weekly on operations (execs, service operators)
Review each issue that happened.
“Spin the wheel” to review a service’s metrics
Support a “truth-seeking” culture
Looking for data, process improvements
COE
- Customer impact
- Timeline: incident to detection to response to resolution
- 5 Whys: get to actionable changes that extinguish the cause
- Actions
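One way to make the COE outline above concrete is a small record type whose timeline fields yield detection, response, and resolution intervals. This is only a sketch: the class name `CorrectionOfError`, its field names, and the sample incident are all hypothetical and do not describe the actual AWS COE format.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class CorrectionOfError:
    """Hypothetical COE record mirroring the four sections on the slide."""
    customer_impact: str
    started: datetime    # when the incident began
    detected: datetime   # when monitoring or a customer noticed it
    responded: datetime  # when an engineer engaged
    resolved: datetime   # when customer impact ended
    whys: list[str] = field(default_factory=list)     # the "5 Whys" chain
    actions: list[str] = field(default_factory=list)  # follow-up work items

    def time_to_detect(self) -> timedelta:
        return self.detected - self.started

    def time_to_respond(self) -> timedelta:
        return self.responded - self.detected

    def time_to_resolve(self) -> timedelta:
        return self.resolved - self.started


# Invented example incident, for illustration only.
coe = CorrectionOfError(
    customer_impact="Elevated query latency for a subset of clusters",
    started=datetime(2015, 11, 1, 3, 0),
    detected=datetime(2015, 11, 1, 3, 12),
    responded=datetime(2015, 11, 1, 3, 20),
    resolved=datetime(2015, 11, 1, 4, 5),
    whys=["Disk filled up", "Log rotation was disabled"],
    actions=["Alarm on disk usage above a threshold"],
)
print(coe.time_to_detect(), coe.time_to_respond(), coe.time_to_resolve())
```

Keeping the timeline as structured timestamps rather than free text makes it straightforward to track time-to-detect across incidents in the weekly review.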
Ops is Dev
Humans are fallible
circa 1% defect injection rate
Error rate changes based on time of day (3am vs 3pm)
New ones show up, have unique issues
Limit human access to machines
Use code/scripts/tools instead
Scripts are code
unit test, code review, deploy, automate
Ops load correlates to business growth
As your business does well, your operations need to become great
Growing 100-200% YoY is hard.
Improving ops 100-200% YoY is really hard.
Improving ops 2% each week is possible.
Use Pareto analysis to prioritize work
Bonus: each customer gets a better experience even as your own ops load stays constant
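The arithmetic behind the weekly target can be checked directly. The sketch below assumes that the 2% weekly improvements compound multiplicatively over 52 weeks, which is an interpretation rather than something the slide states; `weekly_rate_for` is a hypothetical helper.

```python
WEEKS_PER_YEAR = 52

# Compounding a 2% weekly improvement over a year.
yearly_factor = (1 + 0.02) ** WEEKS_PER_YEAR
print(f"2% per week compounds to about {yearly_factor:.2f}x per year")  # about 2.80x


def weekly_rate_for(yearly_multiple):
    """Weekly improvement rate needed to reach a given yearly multiple when compounding."""
    return yearly_multiple ** (1 / WEEKS_PER_YEAR) - 1


# 100% YoY growth is a 2x multiple; 200% YoY growth is a 3x multiple.
print(f"2x per year needs about {weekly_rate_for(2.0):.2%} per week")
print(f"3x per year needs about {weekly_rate_for(3.0):.2%} per week")
```

Under that assumption, 2% per week works out to roughly a 180% improvement per year, which falls inside the 100-200% growth range the slide is trying to keep pace with.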
Amazon Redshift has grown rapidly since it became generally
available in February 2013. While our guiding principles have
served us well over the past two years, we now manage many
thousands of database instances and below offer some lessons we
have learned from operating databases at scale.
Design escalators, not elevators: Failures are common when
operating large fleets with many service dependencies. A key
lesson for us has been to design systems that degrade on failures
rather than losing outright availability. This is a common
design pattern when working with hardware failures, for example,
replicating data blocks to mask issues with disks. It is less
common when working with software or service dependencies,
though still necessary when operating in a dynamic environment.
Amazon overall (including AWS) had 50 million code
deployments over the past 12 months. Inevitably, at this scale, a
small number of regressions will occur and cause issues until
reverted. It is helpful to make one’s own service resilient to an
underlying service outage. For example, we support the ability to
preconfigure nodes in each data center, allowing us to continue to
provision and replace nodes for a period of time if there is an
Amazon EC2 provisioning interruption. One can locally increase
replication to withstand an Amazon S3 or network interruption.
We are adding similar mitigation strategies for other external [...]
[...] understanding that, even if not a widespread concern, each issue is
meaningful to the customer experiencing it. In Figure 5, Sev 2
refers to a severity 2 alarm that causes an engineer to get paged.
This means operational load roughly correlates to business
success. Within Amazon Redshift, we collect error logs across our
fleet and monitor tickets to understand the top ten causes of error,
with the aim of extinguishing one of them each week.
Figure 5: Tickets per cluster over time
Pareto analysis is equally useful in understanding customer
functional requirements. However, it is more difficult to collect.
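The top-ten analysis described above can be sketched as a simple count with a cumulative share. The cause labels and ticket data are invented for illustration, and `pareto` is a hypothetical helper, not a description of the tooling Amazon Redshift actually uses.

```python
from collections import Counter


def pareto(causes, top_n=10):
    """Return (cause, count, cumulative_share) rows for the top_n most frequent causes."""
    counts = Counter(causes)
    total = sum(counts.values())
    rows, running = [], 0
    for cause, count in counts.most_common(top_n):
        running += count
        rows.append((cause, count, running / total))
    return rows


# Hypothetical ticket causes gathered from error logs over one week.
tickets = ["disk_full"] * 5 + ["dns_timeout"] * 3 + ["bad_config"] * 2

for cause, count, cumulative in pareto(tickets):
    print(f"{cause:12s} {count:3d} {cumulative:6.0%}")
```

The cumulative column shows how much of the total load disappears if the top causes are eliminated, which is what makes "fix one of the top ten each week" a useful prioritization rule.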
Escalators, not elevators
Failures happen.
Durability failures are “easy”
mirroring, quorums, well understood techniques
Availability failures are “hard”
want to degrade on unavailability, not cascade failures
tolerate 1-2 hours of unavailability (time to detect, fix)
- e.g. caching IP addresses when DNS is unavailable
- e.g. maintaining instance warm pools rather than provisioning on demand
- e.g. losing the ability to restore a backup rather than losing writes
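The DNS example above can be sketched as a resolver that serves its last known-good answer when a lookup fails. This is an illustrative sketch, not the mechanism AWS uses: the class `StaleOnErrorResolver`, its parameters, and the two-hour staleness limit (chosen to echo the 1-2 hour tolerance on the slide) are assumptions. Only `socket.getaddrinfo` is a real standard-library call.

```python
import socket
import time


class StaleOnErrorResolver:
    """Resolve hostnames, falling back to the last known-good answer when lookup fails."""

    def __init__(self, lookup=None, max_stale_seconds=2 * 60 * 60, clock=time.monotonic):
        # lookup and clock are injectable so the fallback path can be exercised in tests.
        self._lookup = lookup or self._system_lookup
        self._max_stale = max_stale_seconds
        self._clock = clock
        self._last_good = {}  # host -> (addresses, time the answer was stored)

    @staticmethod
    def _system_lookup(host):
        infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
        return sorted({info[4][0] for info in infos})

    def resolve(self, host):
        try:
            addresses = self._lookup(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            cached = self._last_good.get(host)
            if cached and self._clock() - cached[1] <= self._max_stale:
                return cached[0]  # degrade: serve a possibly stale answer
            raise  # nothing usable cached, so surface the failure
        self._last_good[host] = (addresses, self._clock())
        return addresses
```

The trade-off is deliberate: a stale address may occasionally be wrong, but callers keep working through a resolver outage instead of failing along with it.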
Ship often
Continuous delivery should be to the customer
Benefits
Customers prefer small patches
Rollback is easier
Rollback is less likely
Faster response to customer issues
We push a new database engine version, including both features and bug fixes, every two weeks.
[...] dependencies that can fail independently from the database itself.
Continuous delivery should be to the customer: Many
engineering organizations now use continuous build and
automated test pipelines to a releasable staging environment.
However, few actually push the release itself at a frequent pace.
While customers would prefer small patches to large ones for the
same reasons engineering organizations prefer to build and test
continuously, patching is an onerous process. This often leads to
special-case, one-off patches per customer that are limited in
scope – while necessary, they make patching yet more fragile.
Figure 4: Cumulative features deployed over time
Amazon Redshift is set up to automatically patch customer
clusters on a weekly basis in a 30-minute window specified by the


Anurag Gupta's talk on DevOps at AWS. Nov 17 at the Palo Alto AWS Big Data Meetup
