Innovate Everywhere
Choosing the right tools for your SRE toolchain
@jasonhand
Leonard Gram
Core Developer at Grafana Labs
@xlson
Margo Schaedel
Developer Advocate at InfluxData
@mschae16
Jason Hand
DevOps at VictorOps
@jasonhand
Expect To Learn
● Industry expectations around service reliability and availability (New Challenges)
● How to create simple and lightweight representations of your systems for everyone in the organization (Observability)
● How an SRE toolchain (e.g. Grafana, InfluxData and VictorOps) works together through the entire lifecycle of an incident
“Reliability is the single most important feature we provide.”
- Dan Jones, CTO, VictorOps
Reliability & Availability
What Do We Mean?
Innovate Everywhere: Choosing the Right Tools When Building Your SRE Toolchain
© 2017 FORRESTER. REPRODUCTION PROHIBITED.
Systems Are ...
DevOps vs SRE
DevOps
● Reduce organizational silos
● Accept failure as normal
● Implement gradual change
● Leverage tooling & automation
● Measure everything
jhand.co/2rLjEV7
SRE
● Share ownership with developers by using the same tools and techniques across the stack
● Encourage moving quickly by reducing the cost of failure
● Have a formula for balancing accidents and failures against new releases
● Encourage "automating this year's job away" and minimizing manual systems work to focus on efforts that bring long-term value to the system
● Believe that operations is a software problem, and define prescriptive ways for measuring availability, uptime, outages, toil, etc.
jhand.co/2rLjEV7
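The "formula for balancing accidents and failures against new releases" is often expressed as an error budget derived from an availability target. The following is a minimal sketch of that idea, not something shown in the talk; the function names, the 30-day window, and the example numbers are assumptions chosen for illustration.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of unavailability allowed by an availability target over the window."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo)


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    budget = error_budget_minutes(slo, window_days)
    if budget == 0:
        raise ValueError("a 100% target leaves no error budget")
    return 1.0 - downtime_minutes / budget


if __name__ == "__main__":
    # Hypothetical example: a 99.9% target and 10.8 minutes of downtime so far.
    print(round(error_budget_minutes(0.999), 1))
    print(round(budget_remaining(0.999, downtime_minutes=10.8), 2))
```

Under these assumptions, a 99.9% target over 30 days leaves roughly 43.2 minutes of allowed downtime. A team could slow releases as the remaining fraction approaches zero.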
Starting The SRE Journey
Our First SRE Question...
What keeps you up at night?
What Does the Data Mean?
Even though you are collecting the data, what does it actually mean?
Collecting → Storing → Thresholds/triggers.
Defining what needs action is not always the same for everyone.
(events, logging, tracing, etc.)
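The Collecting → Storing → Thresholds/triggers flow can be sketched in a few lines. This is an in-memory illustration only: the `Sample` and `Store` types, the metric name, and the per-team thresholds are hypothetical stand-ins for a real collector, time-series store, and alert rule.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Sample:
    metric: str
    value: float
    timestamp: float  # seconds since the epoch


class Store:
    """Stand-in for a time-series database: keeps samples per metric."""

    def __init__(self) -> None:
        self._series: Dict[str, List[Sample]] = {}

    def write(self, sample: Sample) -> None:
        self._series.setdefault(sample.metric, []).append(sample)

    def latest(self, metric: str) -> Sample:
        return self._series[metric][-1]


def needs_action(store: Store, metric: str, threshold: float) -> bool:
    """Trigger when the most recent value is above the threshold."""
    return store.latest(metric).value > threshold


if __name__ == "__main__":
    store = Store()
    # Collecting and storing two samples of a made-up metric.
    store.write(Sample("disk_used_percent", 71.0, 1_700_000_000.0))
    store.write(Sample("disk_used_percent", 86.0, 1_700_000_060.0))
    # The same data can call for action from one team and not from another.
    thresholds = {"ops": 90.0, "data-team": 80.0}
    for team, limit in thresholds.items():
        print(team, needs_action(store, "disk_used_percent", limit))
```

The per-team threshold table shows why "what needs action" differs between audiences even when everyone reads the same stored data.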
You Are Here
Specialized operations tools
Convergence between SRE and software engineering toolchains
The tools SREs use at any given time will depend on where an organization is in its SRE journey.
How Do We DevOps/SRE?
It Depends
SRE Toolchain
A transition story
How a development team gained ownership of their services and learned more about observability and reliability.
New responsibilities:
● On-call
● Deployment
● Reliability
The story continues...
Users contact support about the service being unreliable.
What they have:
● Telegraf & InfluxDB (InfluxData) for metrics
● Grafana, with a basic dashboard for requests per minute
● VictorOps for on-call & incident management
The Problem
An increase in requests to their systems, but no increase in logins or purchases.
The RED Method
Tracking already:
● (Request) Rate
Added metrics:
● (Request) Errors
● (Request) Duration
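The three RED signals can be derived from a list of observed requests. The sketch below is illustrative rather than the team's actual instrumentation: the `Request` type, the choice of a nearest-rank 95th percentile for duration, and the sample values are assumptions.

```python
import math
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Request:
    duration_ms: float
    ok: bool


def red_summary(requests: Iterable[Request], window_seconds: float) -> dict:
    """Compute Rate, Errors and Duration for the requests seen in one window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    reqs: List[Request] = list(requests)
    total = len(reqs)
    errors = sum(1 for r in reqs if not r.ok)
    durations = sorted(r.duration_ms for r in reqs)
    p95: Optional[float] = None
    if durations:
        # Nearest-rank percentile: the value at position ceil(0.95 * n).
        p95 = durations[math.ceil(0.95 * total) - 1]
    return {
        "rate_per_second": total / window_seconds,
        "error_ratio": errors / total if total else 0.0,
        "duration_p95_ms": p95,
    }


if __name__ == "__main__":
    window = [
        Request(100.0, True),
        Request(120.0, True),
        Request(80.0, True),
        Request(3000.0, False),  # a slow, failed request
    ]
    print(red_summary(window, window_seconds=60.0))
```

Rate alone would report this window as healthy traffic. The error ratio and the duration percentile are what expose the slow, failing request.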
Client Side Failures
After some more investigation, they find that requests are failing on the client side without being properly terminated on the server side, which leads to requests timing out and returning errors. They set up alarms to guard against similar failures in the future.
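The talk does not show the alarm rules themselves. One common shape for such an alarm is "fire only when the error ratio stays above a threshold for several consecutive evaluations", which avoids paging on a single bad minute. The function name, the 5% threshold, and the three-interval requirement below are assumptions for illustration.

```python
from typing import Sequence


def should_alert(
    error_ratios: Sequence[float],
    threshold: float = 0.05,
    for_intervals: int = 3,
) -> bool:
    """True when the last `for_intervals` evaluations all breached the threshold."""
    if for_intervals <= 0:
        raise ValueError("for_intervals must be positive")
    if len(error_ratios) < for_intervals:
        return False  # not enough history yet
    recent = error_ratios[-for_intervals:]
    return all(ratio > threshold for ratio in recent)


if __name__ == "__main__":
    # One value per evaluation interval, oldest first (made-up numbers).
    print(should_alert([0.01, 0.08, 0.02, 0.09]))  # breach is not sustained
    print(should_alert([0.01, 0.06, 0.07, 0.09]))  # three breaches in a row
```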
Setting Up An SRE Toolchain
Platform for All Metrics and Event Workloads
@influxdb
Telegraf & InfluxDB
What is Telegraf
● Plugin-driven server agent for collecting & reporting metrics
● Input plugins source a variety of metrics
○ directly from the system it's running on
○ pulled from third-party APIs
○ received via StatsD and Kafka consumer services
● Output plugins send metrics to other datastores, services, & message queues
○ InfluxDB, Graphite, OpenTSDB, Datadog, Librato, Kafka, MQTT, NSQ, and many others
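Metrics written to InfluxDB travel as line protocol text: a measurement name, optional tags, one or more fields, and an optional timestamp. The sketch below formats a single point that way. It is a simplified illustration, not Telegraf's own serializer; the helper names and the sample measurement are hypothetical, and special float values such as NaN are not handled.

```python
from typing import Mapping, Optional, Union

FieldValue = Union[bool, int, float, str]


def _escape(text: str, chars: str) -> str:
    """Backslash-escape each character in `chars` wherever it occurs in `text`."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def _format_field(value: FieldValue) -> str:
    # bool is checked before int because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"  # integers carry an "i" suffix
    if isinstance(value, float):
        return repr(value)
    # Strings are double-quoted; backslashes and quotes are escaped.
    return '"' + _escape(value, '\\"') + '"'


def to_line_protocol(
    measurement: str,
    tags: Mapping[str, str],
    fields: Mapping[str, FieldValue],
    timestamp_ns: Optional[int] = None,
) -> str:
    if not fields:
        raise ValueError("at least one field is required")
    line = _escape(measurement, ", ")
    for key in sorted(tags):  # sorted tag keys give a stable output
        line += "," + _escape(key, ",= ") + "=" + _escape(tags[key], ",= ")
    line += " " + ",".join(
        _escape(key, ",= ") + "=" + _format_field(value)
        for key, value in fields.items()
    )
    if timestamp_ns is not None:
        line += f" {timestamp_ns}"
    return line


if __name__ == "__main__":
    print(
        to_line_protocol(
            "http_requests",
            {"host": "web-1", "status": "500"},
            {"count": 3, "duration_ms": 182.5},
            1_700_000_000_000_000_000,
        )
    )
```

Tags are indexed strings suited to filtering and grouping (host, status), while fields hold the measured values.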
What is InfluxDB
• Time-series database built from the ground up to handle high write & query loads
• Data store for timestamped data (DevOps monitoring, application metrics, IoT sensor data, & real-time analytics)
• Conserve space by configuring it to
– keep data for a defined length of time
– automatically expire data
– delete any unwanted data from the system
• Offers a SQL-like query language for interacting with data
• Dashboarding in popular projects like Grafana
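To make the retention and query bullets concrete, the sketch below builds two statements in InfluxQL, the SQL-like language of InfluxDB 1.x. It only composes strings and does not talk to a server. The database, policy, measurement, and field names are hypothetical, and identifiers are interpolated without sanitizing, so this is not suitable for untrusted input.

```python
def create_retention_policy(
    name: str, database: str, duration: str, default: bool = True
) -> str:
    """Statement that keeps data for `duration` (e.g. "14d") and expires older data."""
    stmt = (
        f'CREATE RETENTION POLICY "{name}" ON "{database}" '
        f"DURATION {duration} REPLICATION 1"
    )
    return stmt + " DEFAULT" if default else stmt


def mean_per_interval(
    field: str, measurement: str, lookback: str, interval: str
) -> str:
    """Query for the mean of a field, bucketed by time, over a recent window."""
    return (
        f'SELECT MEAN("{field}") FROM "{measurement}" '
        f"WHERE time > now() - {lookback} GROUP BY time({interval})"
    )


if __name__ == "__main__":
    print(create_retention_policy("two_weeks", "sre_demo", "14d"))
    print(mean_per_interval("duration_ms", "http_requests", "1h", "1m"))
```

A time-bucketed query of this kind is what a Grafana panel backed by InfluxDB typically issues for each graph.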
What is Grafana
● An open source platform for beautiful analytics and monitoring
● Easy to use
● Over 40 data sources
○ InfluxDB
○ Prometheus
○ Graphite
○ Elasticsearch
○ ...and many more
Try Grafana on https://play.grafana.org/
Timeseries are everywhere
Creating a dashboard
Adding some metrics
Setting up alerting
Annotations
Visualizing your data
Human(e) Response
Phases of an Incident
React
Respond
Ownership & Responsibility
Escalate Or Automate
Communicate
jhand.co/chatopsbook
Learn & Improve
jhand.co/PIR_book
Learn More
try.victorops.com/trial
influxdata.com/download
grafana.com/get
try.victorops.com/SRE
Questions
Thank You
