1
5 practical operability
techniques for teams
Matthew Skelton, Conflux
@matthewpskelton
confluxdigital.net
UNICOM DevOps Showcase North 2019
Manchester, 27 February 2019
Practical operability
Why do we need a focus on
operability?
5 practical operability techniques
that work
2
modern event-based logging
Run Book dialogue sheets
endpoint healthchecks
correlation IDs
user personas
3
team
collaboration
techniques
4
About me
Matthew Skelton, Conflux
@matthewpskelton
matthewskelton.net
Leeds, UK
5
20% discount for
DevOps Showcase!
Team Guide to
Software Operability
Matthew Skelton & Rob Thatcher
operabilitybook.com
20% discount for Unicom events
http://leanpub.com/SoftwareOperability/c/UNICOM
6
You?
Software Developer
Tester / QA
DevOps Engineer
Ops Engineer / SRE
Head of Department
7
5 Practical
Operability Techniques
for Teams
8
Operability
making software work well
in Production
9
Operability
10
Scale
Restore
Inspect
Failover
Monitor
Diagnose
Secure
Cleardown
Report
“But can’t we just give
those things to an SRE?”
11
“But can’t we just give
those things to the
DevOp?”
12
Operability is
a shared concern
#BizDevTestSecOps
13
Operability is
a shared concern
#BizDevTestSecOps
14
15
16
A self-managed Kubernetes
cluster near you
Operability is
a shared concern
#BizDevTestSecOps
17
18
SRE: operability consultants
Collaborate on
operability
here
CC BY-SA devopstopologies.com
Practical operability techniques
1. Modern logging with event IDs
2. Run Book dialogue sheets
3. Endpoint healthchecks
4. Correlation IDs
5. Lightweight User Personas
19
20
Logging with Event IDs
Lack of observability for
distributed systems
21
Modern logging w/ Event IDs
Distinct application states
No “logorrhoea” (!)
Distributed tracing via logs
Build a shared understanding
22
search by event
Event ID
{Delivered,
InTransit,
Arrived}
23
transaction
trace
Correlation ID
612999958…
24
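A minimal sketch of the two lookups named on these slides, assuming logs are already available as structured records. The record fields (ts, event, correlation_id, node) and the sample values are illustrative assumptions, not taken from a real system.

```python
# Hypothetical structured log records; field names and values are assumptions.
LOGS = [
    {"ts": 1, "event": "InTransit", "correlation_id": "abc", "node": "web-1"},
    {"ts": 2, "event": "InTransit", "correlation_id": "def", "node": "web-2"},
    {"ts": 3, "event": "Arrived", "correlation_id": "abc", "node": "worker-1"},
    {"ts": 4, "event": "Delivered", "correlation_id": "abc", "node": "worker-2"},
]


def search_by_event(records, event):
    """Search by Event ID: every record for one distinct application state."""
    return [r for r in records if r["event"] == event]


def transaction_trace(records, correlation_id):
    """Transaction trace: every record for one request, in time order."""
    matching = [r for r in records if r["correlation_id"] == correlation_id]
    return sorted(matching, key=lambda r: r["ts"])


if __name__ == "__main__":
    for record in transaction_trace(LOGS, "abc"):
        print(record["ts"], record["node"], record["event"])
```

In practice a log aggregation tool would run these queries; the point is that both become simple filters once every log line carries an Event ID and a Correlation ID.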
Modern logging with event IDs
helps to produce a well-defined
event space:
human-readable events
25
Which calls might fail?
26
How many distinct event
types (state transitions) in
your application?
27
28
represent distinct states
29
enum
Human-readable sets:
unique values, sparse, immutable
C#, Java, Python, node
(Ruby, PHP, …)
30
Technical
Domain
public enum EventID
{
// Badly-initialised logging data
NotSet = 0,
// An unrecognised event has occurred
UnexpectedError = 10000,
ApplicationStarted = 20000,
ApplicationShutdownNoticeReceived = 20001,
MessageQueued = 40000,
MessagePeeked = 40001,
BasketItemAdded = 60001,
BasketItemRemoved = 60002,
CreditCardDetailsSubmitted = 70001,
// ...
}
31
BasketItemAdded = 60001
BasketItemRemoved = 60002
32
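The same idea sketched in Python, one of the languages listed earlier. This is an illustration under assumptions, not the opslogger API: the helper names format_event and log_event and the JSON field names are hypothetical, and only a few of the enum values from the C# example are repeated.

```python
import json
import logging
from enum import IntEnum, unique


@unique  # reject accidental duplicate values: Event IDs must stay unique
class EventID(IntEnum):
    # Technical events
    NOT_SET = 0
    UNEXPECTED_ERROR = 10000
    APPLICATION_STARTED = 20000
    # Domain events
    BASKET_ITEM_ADDED = 60001
    BASKET_ITEM_REMOVED = 60002


def format_event(event, message, **fields):
    """Build one JSON log line carrying both the numeric ID and its name."""
    record = {"event_id": int(event), "event_name": event.name, "message": message}
    record.update(fields)
    return json.dumps(record, sort_keys=True)


def log_event(logger, event, message, **fields):
    """Emit the event at INFO level through a standard library logger."""
    logger.info(format_event(event, message, **fields))


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format="%(message)s")
    log_event(logging.getLogger("basket"), EventID.BASKET_ITEM_ADDED,
              "item added to basket", sku="A1")
```

Sparse numbering (10000, 20000, 60001, ...) leaves room to add new events to a group later without renumbering existing ones.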
example:
https://github.com/EqualExperts/opslogger
Sean Reilly
@seanjreilly
33
Example: video processing
On-demand processing of TV and
mobile streaming adverts
Ad-agency → TV broadcaster
High throughput
Glitch-free video & audio
34
Storage I/O
Worker Job
Queue
Upload
35
36
37
Example: video processing
Discover processing bottlenecks
Trigger alerts
Report on KPIs
Target areas for improvement
38
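A rough sketch of how Event IDs could expose a processing bottleneck, assuming each job logs one timestamped event per pipeline stage. The event names (UploadCompleted, JobDequeued, TranscodeFinished), job IDs, and timings are invented for illustration and do not come from the system described in the talk.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (job_id, event_name, timestamp_in_seconds) tuples.
EVENTS = [
    ("job-1", "UploadCompleted", 0.0),
    ("job-1", "JobDequeued", 12.0),
    ("job-1", "TranscodeFinished", 15.0),
    ("job-2", "UploadCompleted", 1.0),
    ("job-2", "JobDequeued", 21.0),
    ("job-2", "TranscodeFinished", 25.0),
]


def stage_durations(events):
    """Average time between consecutive events, keyed by (from_event, to_event)."""
    by_job = defaultdict(list)
    for job, event, ts in events:
        by_job[job].append((ts, event))
    durations = defaultdict(list)
    for timeline in by_job.values():
        timeline.sort()  # order each job's events by timestamp
        for (t0, e0), (t1, e1) in zip(timeline, timeline[1:]):
            durations[(e0, e1)].append(t1 - t0)
    return {stage: mean(values) for stage, values in durations.items()}


def slowest_stage(events):
    """The transition with the highest average duration: a bottleneck candidate."""
    averages = stage_durations(events)
    return max(averages, key=averages.get)


if __name__ == "__main__":
    print(slowest_stage(EVENTS))
```

The same per-stage averages could feed an alert threshold or a KPI report.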
Modern logging w/ Event IDs
clarity about software behaviour
reduce time to detect problems
increase team engagement
enhance collaboration
39
Modern Logging:
Collaborate on Event IDs
and Correlation traces for
better system awareness
40
Run Book dialogue sheets
41
Operational aspects not
addressed, or addressed
too late in the cycle
42
Run Book dialogue sheets
Checklists for typical operational
considerations
Team-friendly exploration
43
Run Book dialogue sheets help
to increase awareness of
operability within teams
44
runbooktemplate.info
Run Book dialogue sheets
45
System characteristics
Hours of operation
During what hours does the service or system actually need to operate? Can portions or features of the
system be unavailable at times if needed?
Hours of operation - core features
(e.g. 03:00-01:00 GMT+0)
Hours of operation - secondary features
(e.g. 07:00-23:00 GMT+0)
Data and processing flows
How and where does data flow through the system? What controls or triggers data flows?
(e.g. mobile requests / scheduled batch jobs / inbound IoT sensor data )
…
46
http://runbooktemplate.info/
GitHub, CC BY-SA
47
runbooktemplate.info
Run Book dialogue sheets
48
Run Book dialogue sheets
Early discovery of operational needs
Input to team backlog
“Shift-left” testing
Avoid operational problems
49
50
http://operabilityquestions.com/
GitHub, CC BY-SA
OperabilityQuestions.com
Freeform, exploratory questions for
teams
Usability, viability, reliability,
observability, securability, …
(GitHub, CC BY-SA)
51
Run Book dialogue sheets:
Collaborate on operational
requirements for better
system awareness
52
Endpoint healthchecks
53
“Why has my deployment
failed again?”
“Why is Pre-Prod always
so flaky?”
54
Endpoint healthchecks
Simple HTTP check
Common way to assess any
service/app/component
Key operational requirement
55
endpoint healthchecks
Every runnable app/service/daemon
exposes /status/health
An HTTP GET to the endpoint returns:
200 – "I am healthy"
500 – "I am sick"
56
Endpoint healthchecks help
teams to collaborate on
service viability
57
endpoint healthchecks
Each component is responsible for
determining its own health and viability
– this is very contextual
58
endpoint healthchecks
Use JSON as a response type –
parsable by both
machines and humans!
59
60
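A minimal sketch of a /status/health endpoint using only the Python standard library. The individual checks (config_loaded, queue_reachable), the JSON field names, and the port are placeholders: each component decides what "healthy" means in its own context.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def check_health():
    """Return (healthy, checks). The checks here are hard-coded placeholders."""
    checks = {"config_loaded": True, "queue_reachable": True}
    return all(checks.values()), checks


def build_response(healthy, checks):
    """Map health to 200 / 500 plus a JSON body readable by tools and humans."""
    status = 200 if healthy else 500
    body = json.dumps(
        {"status": "healthy" if healthy else "sick", "checks": checks},
        sort_keys=True,
    )
    return status, body


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/status/health":
            self.send_error(404)
            return
        status, body = build_response(*check_health())
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


if __name__ == "__main__":
    # Port 8080 is an arbitrary choice for local experimentation.
    HTTPServer(("127.0.0.1", 8080), HealthHandler).serve_forever()
```

Keeping build_response separate from the handler makes the 200 / 500 mapping easy to unit test without starting a server.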
endpoint healthchecks
For databases and other non-HTTP
components, run a lightweight HTTP
service in front of the component
200 / 500 responses
61
Helper service
62
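One way the helper service's check could look, sketched with sqlite3 standing in for the real database. The "SELECT 1" probe, the function names, and the JSON fields are assumptions; a real helper would use the driver for its own database and expose this result over HTTP.

```python
import json
import sqlite3


def probe_database(connect):
    """Open a connection, run a trivial query, and report (ok, detail)."""
    try:
        conn = connect()
        try:
            conn.execute("SELECT 1").fetchone()
        finally:
            conn.close()
        return True, "SELECT 1 succeeded"
    except Exception as exc:  # any failure means "I am sick"
        return False, f"{type(exc).__name__}: {exc}"


def health_response(connect):
    """Translate the probe result into the 200 / 500 convention with JSON details."""
    ok, detail = probe_database(connect)
    body = json.dumps({"component": "database", "healthy": ok, "detail": detail})
    return (200 if ok else 500), body


if __name__ == "__main__":
    # An in-memory SQLite database is used only so the sketch runs anywhere.
    print(health_response(lambda: sqlite3.connect(":memory:")))
```

Passing the connect function in as a parameter keeps the probe lightweight and lets the same helper front different components.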
https://github.com/Lugribossk/simple-dashboard
63
64
Question:
What does this look like for Serverless?
¯\_(ツ)_/¯
Endpoint healthchecks
Rapid diagnosis and visibility
Reduce confusion around
environment state
“Fail fast” → “learn sooner”
65
Endpoint healthchecks:
Collaborate on component
health status for better
system awareness
66
Correlation IDs
67
“Which nodes handled the
request?”
68
Correlation IDs
Unique-ish identifiers
Trace calls across machine &
container boundaries
Re-assemble the HTTP call later
69
‘Unique-ish’ identifier for each request
Passed through downstream layers
70
Correlation IDs help teams to
think about the big picture:
end-to-end outcomes
71
Unique-ish ID
72
Synchronous HTTP:
X-HEADER e.g. X-trace-id
X-trace-id: 348e1cf8
If header is present, pass it on
(Yes, RFC 6648 deprecates the "X-" prefix, but this is internal only)
73
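A small sketch of the pass-it-on rule for synchronous HTTP calls. Generating a fresh ID when the header is absent is an assumption added here (the slide only says to pass the header on if present), as are the function names and the 8-character ID length, chosen to resemble the 348e1cf8 example.

```python
import uuid

TRACE_HEADER = "X-trace-id"


def get_or_create_trace_id(incoming_headers):
    """Reuse the caller's trace ID if present; otherwise start a new trace."""
    for name, value in incoming_headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == TRACE_HEADER.lower() and value:
            return value
    # Assumption: a short 'unique-ish' ID is good enough for internal tracing.
    return uuid.uuid4().hex[:8]


def outgoing_headers(incoming_headers, extra=None):
    """Headers for a downstream call, always carrying the trace ID."""
    headers = dict(extra or {})
    headers[TRACE_HEADER] = get_or_create_trace_id(incoming_headers)
    return headers


if __name__ == "__main__":
    print(outgoing_headers({"x-trace-id": "348e1cf8"}, {"Accept": "application/json"}))
```

The returned dictionary can be handed to whichever HTTP client the service already uses for its downstream request.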
Asynchronous (queues, etc.):
Message Attributes, name:value pair
e.g. "trace-id":"348e1cf8"
AWS SQS: SendMessage() / ReceiveMessage()
Log the Correlation ID if present
74
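A sketch of carrying the Correlation ID in SQS message attributes. The dictionaries follow the request and response shapes used by the AWS SDK for Python (boto3), which is an assumed client here; the code only builds and reads plain dictionaries, so it runs without AWS access. The queue URL and function names are hypothetical.

```python
def send_message_kwargs(queue_url, body, trace_id):
    """Keyword arguments for an SQS SendMessage call, with the trace ID attached."""
    return {
        "QueueUrl": queue_url,
        "MessageBody": body,
        "MessageAttributes": {
            "trace-id": {"DataType": "String", "StringValue": trace_id},
        },
    }


def trace_id_from(message):
    """Read the trace ID from a received message; None if it was not set."""
    attribute = message.get("MessageAttributes", {}).get("trace-id")
    return attribute.get("StringValue") if attribute else None


if __name__ == "__main__":
    # Hypothetical queue URL, for illustration only.
    kwargs = send_message_kwargs("https://example.invalid/queue", "payload", "348e1cf8")
    print(kwargs["MessageAttributes"])
    # With boto3 this would be passed as: sqs_client.send_message(**kwargs)
```

On the receiving side the consumer has to ask for the attribute when calling ReceiveMessage, otherwise SQS does not return it; the consumer then logs the value with every event it emits for that message.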
Example: OpenTracing / PCF
3 tracing elements:
TraceID, SpanID, ParentSpan
"X-B3-TraceId" "X-B3-SpanId"
"X-B3-ParentSpan"
75
Example: OpenTracing / PCF
Always log the TraceID as-is
Log calling SpanID as ParentSpan
Log new SpanID
76
Trace
Span
ParentSpan
77
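The three logging rules above can be sketched as one function. The header names are the ones shown on the slide; the 16-character hexadecimal ID format and the choice to start a new trace when no TraceId arrives are assumptions made for this illustration.

```python
import secrets


def next_span(incoming_headers):
    """Derive the tracing fields this service should log and send downstream."""
    # Rule 1: keep the TraceID as-is (assumption: start a new trace if absent).
    trace_id = incoming_headers.get("X-B3-TraceId") or secrets.token_hex(8)
    # Rule 2: the caller's SpanID becomes our ParentSpan.
    parent_span = incoming_headers.get("X-B3-SpanId")
    # Rule 3: create a new SpanID for the work done in this service.
    fields = {"X-B3-TraceId": trace_id, "X-B3-SpanId": secrets.token_hex(8)}
    if parent_span:
        fields["X-B3-ParentSpan"] = parent_span
    return fields


if __name__ == "__main__":
    print(next_span({"X-B3-TraceId": "aaaa", "X-B3-SpanId": "bbbb"}))
```

Logging all three fields at every hop is what lets the parent and child spans be stitched back into a single trace later.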
Correlation IDs
Detect bottlenecks and unexpected
interactions
Increase transparency
Learn about the system
78
Correlation IDs:
Collaborate on distributed
tracing for better system
awareness
79
Lightweight user personas
80
Software is difficult to
operate: poor UX for Ops.
81
Lightweight User Personas
Simple characterisation of user
needs for Dev/Test/Ops
Based on full UX user personas but
less detailed
82
Lightweight user personas:
Ops Engineer
Test Engineer
Build & Deployment Engineer
Service Owner
83
Lightweight user personas
help teams to build systems
with good UX for all users
84
Lightweight user personas:
Consider the User Experience (UX) of
engineers and team members using
and working with the software
85
http://www.keepitusable.com/blog/?tag=alan-cooper
86
Motivations
Goals
Frustrations
Lightweight user personas:
What data does the User Persona need
visible on a dashboard in order to
make decisions rapidly & safely?
87
https://www.geckoboard.com/blog/visualisation-upgrades-progressing-towards-a-more-useful-and-beautiful-dashboard/
88
Lightweight User Personas
Empathise better with people from
other roles
Capture missing operational
requirements
89
Lightweight User Personas:
Collaborate on user needs
for better system awareness
90
Summary
91
Operability
making software work well
in Production
92
93
Lack of observability
Operational aspects not known
“Why has deployment failed?”
What handled the request?
Poor UX for Ops
94
95
SRE: operability consultants
Collaborate on
operability
here
CC BY-SA devopstopologies.com
Logging with Event IDs
use enum-based Event IDs to
explore runtime behaviour and
fault conditions
96
Run Book dialogue sheets
explore and establish operational
requirements as a team, around a
physical table, together
97
Endpoint healthchecks
HTTP 200 / 500 responses to
/status/health call with JSON
details – good for tools and
humans
98
Correlation IDs
trace execution using correlation IDs:
synchronous (HTTP X-trace-id)
async (SQS MessageAttribute)
99
Lightweight user personas
explore the UX and needs of
different roles for rapid decisions
via dashboards
100
use modern logging, Run Book
dialogue sheets, endpoint
healthchecks, correlation IDs,
and user personas as
team collaboration techniques
101
Team Guide to
Software Operability
Matthew Skelton & Rob Thatcher
operabilitybook.com
20% discount for Unicom events
http://leanpub.com/SoftwareOperability/c/UNICOM
102
Resources
• Team Guide to Software Operability by Matthew Skelton and Rob Thatcher
http://operabilitybook.com/
• Run Book template & Run Book dialogue sheets
http://runbooktemplate.info/
• Operability Questions
http://operabilityquestions.com/
• 5 proven operability techniques for software teams
https://techbeacon.com/5-proven-operability-techniques-software-teams
103
thank you
104
@matthewpskelton / operabilitybook.com
@ConfluxHQ / confluxdigital.net
