Safe and Fast Automation on AWS for Fun and Profit (?)
Raghavendra Prabhu, Yelp
me@rdprabhu.com / @randomsurfer
Pets vs. Cattle
Automation in context of DevOps
➔ Provisioning
➔ Remediation
➔ Lifecycle Management
➔ Alerting and/or Actioning
➔ Incident Management
➔ On-call handling
➔ Configuration Management
➔ Fleet actions
➔ …
➔ Anything that is done often and unsupervised
Automation at Yelp
● Ranges from Declarative to Imperative
○ Depending on how much is on fire
● Puppet and Terraform
● Fabric
● MCO
● Marley
● Taskerman
● Jenkins
● Clusterman
● SSH
● And many others…
Motivation
● More extensible and elegant than SSH/MCO
● Lighter than Taskerman
● More AWS aware and tighter integration
● Straddling the declarative and imperative
● Use cases
○ Chaos Reliability Engineering
○ A/B testing
○ Quick prototyping
● Better error handling
● Safer concurrency
● Secure!
AWS Systems Manager
● Boooring Enterprisey Name
● “Gain Operational Insights and Take Action on AWS Resources”
● “Systems Manager includes a unified interface that allows you to easily
centralize operational data and automate tasks across your AWS resources.”
What SSM can do
● Resource groups
○ “Tag all resources easily and query”
● Actions
○ “Run something on tagged resources”
○ “Patch for security updates”
● Insights on Inventory
○ “Tell me something about resources that is not obvious”
● Parameter Store
○ “Let us parameterize our configuration”
● State Manager
○ “Let us manage resources based on state”
Run Command
● SSM service
● Run pre-defined or ad-hoc “commands” or flow-based automation
● Command = Target (Who) + Document (What) + Params (When) + How
● “Managed Instance”
● Parameterized “Document”
○ Predefined
○ Custom
● Security
● Free to use
● Invocation from the CLI, an SDK (boto et al.), or the console
Workflow
1. IAM instance profile on nodes
a. Done in Puppet (and never on the console)
b. Giant web of permissions
c. Needed for invoker, executor, and viewer
2. Deploy SSM agent on nodes
a. Packages available for Ubuntu
b. Set up a Jenkins job and added to repo: amazon-ssm-agent
c. Puppetized and deployed on the ZooKeeper cluster
3. Tag instances if not already tagged
a. Reused the ubiquitous puppet::role tag
4. Fire!
a. `aws ssm send-command`
Security
● Managed through IAM (Identity and Access Management)
○ Users, Groups, Roles, APIs.
● CloudTrail-based auditing (sourcetype=aws:cloudtrail)
● Not dependent on SSH
● IAM instance profile for executor
○ Instance role defined in Puppet
○ Attached to the host in Terraform
● Configure a user to read status
● Secret management through Parameter Store
● Highly granular access control
○ Adding approvers and SNS
○ Possible to separate roles of invoker, executor and viewer.
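The split between invoker and viewer could look roughly like the two IAM policy documents below. This is a minimal sketch, not Yelp's actual policies: the account ID, tag key and tag value are hypothetical placeholders, and the executor side (the instance profile) is left out. The action names and the `ssm:resourceTag/...` condition key are standard IAM identifiers for Systems Manager.

```python
import json

REGION = "us-west-2"         # region used in the deck's example command
ACCOUNT_ID = "123456789012"  # hypothetical account ID
TAG_KEY = "role"             # hypothetical tag key
TAG_VALUE = "zookeeper"      # hypothetical tag value


def invoker_policy() -> dict:
    """Allow sending one document, and only to instances carrying one tag."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "ssm:SendCommand",
                "Resource": f"arn:aws:ssm:{REGION}:*:document/AWS-RunShellScript",
            },
            {
                "Effect": "Allow",
                "Action": "ssm:SendCommand",
                "Resource": f"arn:aws:ec2:{REGION}:{ACCOUNT_ID}:instance/*",
                "Condition": {
                    "StringLike": {f"ssm:resourceTag/{TAG_KEY}": [TAG_VALUE]}
                },
            },
        ],
    }


def viewer_policy() -> dict:
    """Allow reading command status and output, but not sending commands."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "ssm:ListCommands",
                    "ssm:ListCommandInvocations",
                    "ssm:GetCommandInvocation",
                ],
                "Resource": "*",
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(invoker_policy(), indent=2))
    print(json.dumps(viewer_policy(), indent=2))
```

Keeping the two documents separate means a dashboard or on-call viewer role never needs `ssm:SendCommand`. The IAM Policy Simulator can be used to check either document before it goes into configuration management.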
Security observations
● The documentation describes multiple ways to achieve the same thing
● Quick prototyping is hard with IAM and Puppet
○ IAM Policy Simulator helps here.
○ Updating on the console is disallowed (for good reasons).
● The IAM policies recommended in the docs are wide open.
● Documentation on the IAM workflow and IAM policies at Yelp could improve
● Executor’s permissions
● Nature of agent
○ Open source
Execution Control
● Circuit breakers
○ max-errors (% or count)
○ timeout
● Concurrency (% or count)
○ Exponentially scaled (1, 2, 4, … up to the value)
● Targets
○ Instance IDs
○ Union or intersection of Tags
● Cancellation
● Maintenance Windows
● Command Timeout
● Integration
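The interplay of ramped concurrency and the max-errors circuit breaker can be illustrated with a small local simulation. This is a sketch of the control flow as the slide describes it, not AWS's implementation: `ramp_batches` and `rollout` are hypothetical names, targets within a batch run sequentially here, and the rule "stop once the error count exceeds max-errors" is an assumption about the semantics.

```python
from typing import Callable, List, Sequence, Tuple


def ramp_batches(total: int, max_concurrency: int) -> List[int]:
    """Batch sizes that double (1, 2, 4, ...) until capped at max_concurrency."""
    sizes: List[int] = []
    size, remaining = 1, total
    while remaining > 0:
        batch = min(size, max_concurrency, remaining)
        sizes.append(batch)
        remaining -= batch
        size *= 2
    return sizes


def rollout(
    targets: Sequence[str],
    run: Callable[[str], bool],
    max_concurrency: int,
    max_errors: int,
) -> Tuple[List[str], List[str]]:
    """Run `run(target)` batch by batch; return (attempted, failed).

    Assumption: no new batch is started once failures exceed max_errors.
    Targets in a batch that was already dispatched still run.
    """
    attempted: List[str] = []
    failed: List[str] = []
    pending = list(targets)
    for batch_size in ramp_batches(len(pending), max_concurrency):
        batch, pending = pending[:batch_size], pending[batch_size:]
        for target in batch:
            attempted.append(target)
            if not run(target):
                failed.append(target)
        if len(failed) > max_errors:
            break  # circuit breaker trips: remaining targets are never touched
    return attempted, failed


if __name__ == "__main__":
    nodes = ["zk1", "zk2", "zk3", "zk4", "zk5"]  # hypothetical node names
    print(ramp_batches(len(nodes), 2))
    print(rollout(nodes, lambda n: n != "zk2", max_concurrency=2, max_errors=0))
```

With five targets and a concurrency of 2, the batches are 1, 2, 2, so a bad command is first tried on a single node before it fans out.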
Output and Monitoring
● Audit logs with AWS CloudTrail
● Short output and return statuses through SSM
● Output storage in S3
● CloudWatch Logs
● SNS notifications
○ Asynchronous nature
● AWS Console
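Because commands run asynchronously, a caller that wants the return status has to poll for it (or subscribe to SNS). The helper below is a minimal polling sketch: `wait_for_invocation` and its defaults are hypothetical, while `get_command_invocation` and the `Status` values are those of the boto3 SSM client. The client is passed in so the function can be exercised without AWS access.

```python
import time
from typing import Any, Dict

# Statuses after which an invocation will not change any more.
TERMINAL_STATUSES = {"Success", "Cancelled", "TimedOut", "Failed"}


def wait_for_invocation(
    ssm: Any,
    command_id: str,
    instance_id: str,
    poll_seconds: float = 2.0,
    max_polls: int = 30,
) -> Dict[str, Any]:
    """Poll one instance's invocation until it reaches a terminal status.

    `ssm` is expected to behave like boto3.client("ssm").
    """
    for _ in range(max_polls):
        result = ssm.get_command_invocation(
            CommandId=command_id, InstanceId=instance_id
        )
        if result["Status"] in TERMINAL_STATUSES:
            return result
        time.sleep(poll_seconds)
    raise TimeoutError(
        f"command {command_id} on {instance_id} did not finish "
        f"within {max_polls} polls"
    )
```

With real credentials this would be called as `wait_for_invocation(boto3.client("ssm", region_name="us-west-2"), command_id, instance_id)`; the returned dictionary carries the short output in `StandardOutputContent` and the exit status in `ResponseCode`.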
Use case: ZooKeeper restart on a 5-node cluster
aws --region us-west-2 ssm send-command \
    --document-name "AWS-RunShellScript" \
    --targets 'Key=tag:puppet:role::zookeeper,Values=cluster_name=test_generic-uswest2-devc' \
    --parameters '{"workingDirectory":["/"],"executionTimeout":["100"],"commands":["zk_tool local check_cluster", "sudo service zookeeper restart", "sleep 30", "zk_tool local check_cluster"]}' \
    --timeout-seconds 300 \
    --max-concurrency 2 \
    --max-errors "1" \
    --output text
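The same restart can be expressed through the SDK. The sketch below maps the CLI flags onto keyword arguments of boto3's `send_command`. The function names `zookeeper_restart_request` and `send` are hypothetical, boto3 is a third-party dependency imported only when the request is actually sent, and the tag key and commands are copied from the CLI example above.

```python
from typing import Any, Dict


def zookeeper_restart_request(cluster: str) -> Dict[str, Any]:
    """Build send_command keyword arguments for a checked ZooKeeper restart."""
    commands = [
        "zk_tool local check_cluster",     # refuse to start on an unhealthy cluster
        "sudo service zookeeper restart",
        "sleep 30",                        # give the node time to rejoin
        "zk_tool local check_cluster",     # non-zero exit counts as an error
    ]
    return {
        # Who: instances selected by tag
        "Targets": [
            {
                "Key": "tag:puppet:role::zookeeper",
                "Values": [f"cluster_name={cluster}"],
            }
        ],
        # What: the predefined shell-script document
        "DocumentName": "AWS-RunShellScript",
        # Params: document parameters, every value is a list of strings
        "Parameters": {
            "workingDirectory": ["/"],
            "executionTimeout": ["100"],
            "commands": commands,
        },
        # How: timeout, concurrency and error budget (the last two are strings)
        "TimeoutSeconds": 300,
        "MaxConcurrency": "2",
        "MaxErrors": "1",
    }


def send(request: Dict[str, Any], region: str = "us-west-2") -> str:
    """Send the request and return the command ID (needs AWS credentials)."""
    import boto3  # third-party; imported lazily so building requests works without it

    ssm = boto3.client("ssm", region_name=region)
    return ssm.send_command(**request)["Command"]["CommandId"]


if __name__ == "__main__":
    print(zookeeper_restart_request("test_generic-uswest2-devc"))
```

Building the request as plain data and sending it in a separate step makes it easy to review, log, or unit-test the command before anything touches the fleet.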
Credits
➢ https://www.pexels.com/photo/person-holding-black-pen-1020325/
➢ https://imgur.com/gallery/LVSJDgy
➢ https://www.pexels.com/photo/gray-steel-tubes-586019/
➢ https://i.kym-cdn.com/photos/images/original/000/572/078/d6d.jpg
➢ https://aws.amazon.com/systems-manager/getting-started/
➢ https://giphy.com/gifs/animation-cartoon-robot-8qrrHSsrK9xpknGVNF
➢ https://www.pexels.com/photo/amplifier-analogue-audio-blur-462439/
➢ https://www.pexels.com/photo/black-and-white-business-chart-computer-241544/
➢ https://media3.giphy.com/media/lXiRpzaKeG9nWR5eM/giphy.gif
