Jürgen Etzlstorfer
@jetzlstorfer
Technology Strategist
How to build your own auto-remediation workflow for your applications using Ansible
Ansible Meetup Munich, 10th July 2018
confidential
The journey – Part 1
The journey – Part 2
• What is (auto-)remediation and why you need it
• How to build your auto-remediation workflow?
• Demo Time!
• Outlook: embed auto-remediation in your CI/CD pipeline
On average, a single transaction uses 82 different types of technology
Browser
Multi-geo
Mobile Network
Code
Hosts
Logs
IoT
3rd parties
Services
Cloud SDN
Containers
Applications are getting more complex!
If you write applications,
they will break eventually
~ Murphy’s law
What if you had
something similar to
a self-healing robot?
What is needed for self-healing applications?
• Monitoring: know what’s going on in your applications
• End-to-end
• Full-stack – fully integrated in production (or even in staging)
• Automation/Execution: perform mitigation/remediation actions
• Access to all systems
• Automation system should be isolated from production system
APIs
Know what’s going on in your applications
• Monitor your applications
• Identify the root cause of the problem!
Auto-remediation with Ansible (Tower)
• APIs are key to enabling automation
• Ansible Tower makes extensive use of APIs internally and also exposes them externally
• Ansible playbooks are scripts that are executed from a central host on different machines
• Multiple operating systems are supported
• Idempotent
• Playbooks can be orchestrated in workflows and job templates
How to enable auto-remediation
1. Full-stack environment is monitored
2. Anomalies are detected automatically
3. Root cause analysis is performed
4. Problem notification is sent
5. Event is received
6. Job is triggered
7. Playbook is executed
8. Problem is remediated
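The "Job is triggered" step boils down to one call against the Ansible Tower REST API. The following is a minimal sketch, not the setup used in the talk: the hostname, the template ID, the `problem_id` variable and the bearer token are placeholders, and it assumes a Tower version that accepts OAuth2 tokens as well as a job template that prompts for extra variables on launch.

```yaml
---
# Sketch only: launch a Tower job template through the REST API.
# tower_host, job_template_id, tower_token and problem_id are hypothetical names.
- name: trigger a remediation job in Ansible Tower
  hosts: localhost
  gather_facts: false
  vars:
    tower_host: "https://tower.example.com"   # assumed hostname
    job_template_id: 42                        # assumed template ID
  tasks:
    - name: launch the job template
      uri:
        url: "{{ tower_host }}/api/v2/job_templates/{{ job_template_id }}/launch/"
        method: POST
        headers:
          # tower_token is expected from a vault or from extra vars
          Authorization: "Bearer {{ tower_token }}"
        body_format: json
        body:
          extra_vars:
            problem_id: "{{ problem_id | default('unknown') }}"
        status_code: 201   # Tower answers a successful launch with 201 Created
      register: launch_result

    - name: show the ID of the started job
      debug:
        msg: "Started Tower job {{ launch_result.json.job }}"
```

In practice the monitoring tool would send this request itself as part of its problem notification; the playbook form is used here only to stay in one language.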
Scenario: How to mitigate a bad deployment?
Pipeline: Staging, Approve Staging, Production, Approve Production, Up and running
Pipeline with remediation: Staging, Approve Staging, Production, Approve Production, Remediation, Rollback
---
- name: rollback to previous version
  hosts: localhost
  vars:
    ...
  tasks:
    # Leave an audit trail in Dynatrace that remediation has started
    - name: push comment to dynatrace
      uri:
        url: "{{ dtcommentapiurl }}"
        method: POST
        body_format: json
        body:
          comment: "Remediation playbook started."
          user: "{{ commentuser }}"
          context: "Ansible Tower"

    # Look up the deployment events for every impacted entity
    - name: fetch custom deployment events
      uri:
        url: "{{ dtdeploymentapiurl }}"
        return_content: yes
      with_items: "{{ impactedEntities }}"
      register: customproperties
      ignore_errors: no

    - name: parse deployment events
      set_fact:
        deployment_events: "{{ item.json.events }}"
      with_items: "{{ customproperties.results }}"
      register: app_result
    # Invoke the remediation action attached to the deployment event
    - name: call remediation action
      uri:
        url: "{{ myItem.remediationAction }}"
        method: POST
        body_format: json
        body: "{{ payload | to_json }}"
        return_content: yes
      ignore_errors: yes
      register: result

    # Report the outcome back to Dynatrace in either case
    - name: push success comment to dynatrace
      uri:
        url: "{{ dtcommentapiurl }}"
        method: POST
        body_format: json
        body:
          comment: "Invoked remediation action successfully executed: {{ result.content }}"
          user: "{{ commentuser }}"
          context: "Ansible Tower"
      when: result.status == 200

    - name: push error comment to dynatrace
      ...
        body:
          comment: "Invoked remediation action failed: {{ result.content }}"
          user: "{{ commentuser }}"
          context: "Ansible Tower"
      when: result.status != 200
Steps to mitigate the bad deployment
1. Fetch information about the event
2. Process the data
3. Select the corresponding remediation action
4. Execute the remediation action
Keep track of all automation steps
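The "process the data" and "select the remediation action" steps could look roughly like the sketch below. It is an illustration under assumptions, not the playbook from the talk: the sample events, the `startTime` and `deploymentVersion` fields, the Tower URLs and the `candidates` variable are made up, and only the names `deployment_events`, `myItem` and `remediationAction` come from the slides.

```yaml
---
# Sketch only: pick the remediation action of the most recent deployment.
- name: select a remediation action
  hosts: localhost
  gather_facts: false
  vars:
    # Sample data standing in for the events fetched from the monitoring API
    deployment_events:
      - startTime: 1531200000000
        deploymentVersion: "6"
        remediationAction: "https://tower.example.com/api/v2/job_templates/7/launch/"
      - startTime: 1531210000000
        deploymentVersion: "7"
        remediationAction: "https://tower.example.com/api/v2/job_templates/8/launch/"
  tasks:
    - name: keep only events that define a remediation action
      set_fact:
        candidates: "{{ deployment_events | selectattr('remediationAction', 'defined') | list }}"

    - name: choose the event of the most recent deployment
      set_fact:
        myItem: "{{ candidates | sort(attribute='startTime') | last }}"
      when: candidates | length > 0

    - name: report which action would be called
      debug:
        msg: "Would call {{ myItem.remediationAction }}"
      when: myItem is defined
```

Choosing the newest deployment is one possible policy. Guarding the last two tasks with `when` keeps the play from failing if no event carries a remediation action.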
Demo Time!
Auto-remediation as a safety net
It does not fix your problem
https://blogs.msdn.microsoft.com/visualstudioalmrangers/2017/04/17/set-up-a-cicd-pipeline-for-your-team-services-extension/
Embed auto-remediation in your CI/CD pipeline
Shift-Left: Break Pipeline Earlier
Path to NoOps: Self-Healing, …
Shift-Right: Tags, Deploys, Events
Actionable Feedback Loops
Injecting speed & quality: automatic gate at test & performance
• Continuous Performance Validation for daily builds
• Root Cause details automatically pushed to JIRA
• Decisions made to compare, break, or good-to-go
Shift-left: engage Dev with earlier & automated feedback
Shift-right: empower Ops with more context to react faster
https://github.com/Dynatrace/AWSDevOpsTutorial
• pushDynatraceDeploymentEvent: pushes deployment info to Dynatrace entities
• validateBuildDynatraceWorker: compares builds and approves/rejects the pipeline
• pushDynatraceDeploymentEvent: pushes deployment info to Dynatrace entities
• validateBuildDynatraceWorker: validates production and approves/rejects the pipeline
• handleDynatraceProblemNotification: executes auto-remediating actions, e.g. rollback
Diagram labels: Build 6, Build 7, Production, Production; gate outcomes: Auto-Approve! / Auto-Reject!
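A step like pushDynatraceDeploymentEvent can be approximated from Ansible as well. The sketch below is not taken from the linked tutorial: the tenant URL, entity ID, build number and token variable are placeholders, and the field names follow my reading of the Dynatrace Events API v1, so they should be checked against the current API documentation before use.

```yaml
---
# Sketch only: push a deployment event to Dynatrace from a pipeline step.
# dt_tenant_url, deployed_entity, build_number and dt_api_token are hypothetical names.
- name: push deployment event
  hosts: localhost
  gather_facts: false
  vars:
    dt_tenant_url: "https://abc12345.live.dynatrace.com"   # assumed tenant
    deployed_entity: "SERVICE-0000000000000000"            # assumed entity ID
    build_number: "7"
  tasks:
    - name: send a CUSTOM_DEPLOYMENT event for the deployed entity
      uri:
        url: "{{ dt_tenant_url }}/api/v1/events"
        method: POST
        headers:
          # dt_api_token is expected from a vault or from extra vars
          Authorization: "Api-Token {{ dt_api_token }}"
        body_format: json
        body:
          eventType: CUSTOM_DEPLOYMENT
          attachRules:
            entityIds:
              - "{{ deployed_entity }}"
          deploymentName: "Build {{ build_number }}"
          deploymentVersion: "{{ build_number }}"
          source: "Ansible"
          # URL that a remediation playbook could call to undo this deployment
          remediationAction: "https://tower.example.com/api/v2/job_templates/8/launch/"
      register: event_result
```

Attaching the remediation URL at deployment time means the later rollback playbook only has to read it back from the event instead of keeping its own mapping.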
How to start to build your own remediation workflow?
1. Monitor your environment
2. Define your runbooks
3. Start small, with the low-hanging fruit
• What are frequent issues?
• Of these, which ones are easy to deal with?
4. Build more and more automation along the way
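A first runbook for step 2 can be very small. The following sketch restarts a service and waits until it reports healthy again; the inventory group, the service name and the health URL are assumptions chosen for illustration.

```yaml
---
# Sketch only: runbook that restarts a service and checks that it recovers.
# app_servers, myapp and the health URL are hypothetical.
- name: restart service runbook
  hosts: app_servers
  become: true
  vars:
    service_name: myapp
    health_url: "http://localhost:8080/health"
  tasks:
    - name: restart the service
      service:
        name: "{{ service_name }}"
        state: restarted

    - name: wait until the health endpoint answers with HTTP 200
      uri:
        url: "{{ health_url }}"
        status_code: 200
      register: health
      until: health is succeeded
      retries: 10
      delay: 6
```

Once a runbook like this works when started by hand, it can be registered as a job template and triggered automatically by a problem notification.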
Cultural Change!
www.dynatrace.com
Jürgen Etzlstorfer
Technology Strategist
juergen.etzlstorfer@dynatrace.com
@jetzlstorfer
Thank you!
dynatrace.com/trial

Editor's Notes

  • #2: I’m Jürgen and I work at Dynatrace. Who of you knows Dynatrace? We are the global leader in application performance monitoring: full-stack, end-to-end (Java, .NET, on-prem and in the cloud). Why Ansible? Because I care about automation.
  • #3: Who of you knows Dynatrace? We are the global leader in application performance monitoring: full-stack, end-to-end (Java, .NET, on-prem and in the cloud). Why Ansible? Because I care about automation.
  • #5: That’s not going to be easy: container and cloud platforms allow for faster deployments and independent release cycles while increasing operational complexity. Moving from monolith to microservices turns in-memory calls into network calls, and Istio adds more hops and more technologies. Overall we see 82 technologies per transaction on average! Applications are incredibly complex. How does it all work end-to-end? Nobody knows all the parts ...
  • #6: It might not break immediately, but there will be a point in time when your applications will break. It can be a broken dependency, an infrastructure failure, or a database slowdown severely impacting your service. Either way, your application will break. Murphy’s law: whatever can go wrong, will go wrong!
  • #7: A self-healing robot fixes itself when it experiences trouble. This could mean freeing up additional resources, restarting things that are not doing well, or rolling back to a state where everything worked perfectly…
  • #8: Monitoring: end-to-end means that you have to track the complete path of your requests so that you are not looking at black boxes. Full-stack: monitoring has to cover your complete application stack, from frontend to backend technologies. Automation: means that you can execute automatically what you would otherwise do manually in case of outages.
  • #9: What we see a lot in customer environments is that the actual root cause of the problem is buried somewhere other than where you would expect at first sight. For example, if your services experience a slowdown, the actual problem might be the network, or the underlying database of a different service that the one you are looking at depends on.
  • #11: We at Dynatrace have automated this process, since the traditional way still means a lot of manual monitoring and looking at dashboards. We achieve this by using our own monitoring tool and integrating it with third-party vendors. Dynatrace also provides full-stack monitoring to detect issues in any layer of your environment. Automatic baselining detects anomalies without the need to define thresholds manually, which matters because thresholds can differ substantially between applications. Our AI-based root cause analysis then identifies the real root cause of the problem and sends a notification for exactly that. Now a third-party tool such as Ansible Tower can take over.
  • #12: As an example, let’s take a look at a simple delivery pipeline. When deploying a new version, we make sure to test the new build carefully. However, despite thorough tests in staging, and maybe even in production, errors might occur. Although the pipeline was built to fail early, this is not always possible, so an error might only be discovered in production. If the error occurs on a Saturday night, it might not be possible to inspect it immediately and schedule countermeasures. With auto-remediation in place we can, for example, automatically roll back to the previous stable version to save the weekend.
  • #20: Do you see the problem in the picture for automation?
  • #29: As we can see, being able to automate lies at the core of auto-remediation and self-healing. First you need runbooks or scripts that can kick in every time they are needed. Next you can connect your tools of choice to these scripts to enable auto-remediation. However, you still need dedicated runbooks for each scenario and have to connect the right problems to the right counteractions. Finally, with self-healing we can leverage the power of AI and big data to fully understand the root causes of problems and automatically determine executable steps for remediation.
  • #31: Real customer problem in a complex cloud environment. The problem is not only the money spent on this, but also the time and the damage to brand reputation – problem was that
  • #32: Does your Enterprise look like this today?
  • #33: Bob has many layers to look through for problems. Mean Time to Recovery (MTTR) for application problems could be 72 hours or more. Can Bob find the problem quickly, let alone fix it? What about the impact? In many cases the Mean Time to Discovery (MTTD) takes up two-thirds of the MTTR. In that time, how many other users or applications may be impacted?