bertjan@openvalue.eu
Debugging distributed systems
Bert Jan Schrijver
@bjschrijver
Debugging distributed systems: the good parts
Networking 101
How the internet works
Why?
Bert Jan Schrijver
Let’s meet
@bjschrijver
Why are distributed systems difficult?
Networking 101
What?
Why?
✅
Demo
War stories
Conclusion
What’s next?
Outline
A structured approach
What is a distributed system?
A distributed system is a system whose components are located on different networked computers which communicate and coordinate their actions by passing messages to one another.
Characteristics of distributed systems
• Concurrency of components
• Lack of a global clock
• Independent failure of components
➡ Distributed systems are harder to reason about
Source: http://www.nasa.gov/images/content/218652main_STOCC_FS_img_lg.jpg
“Working with distributed systems is fundamentally different from writing software on a single computer - and the main difference is that there are lots of new and exciting ways for things to go wrong.”
- Martin Kleppmann
Photo: Dave Lehl
Why do things go wrong?
Photo: Dave Lehl
Fallacies of distributed computing
The fallacies of distributed computing are a set of assertions made by L. Peter Deutsch and others at Sun Microsystems describing false assumptions that programmers new to distributed applications invariably make.
1. The network is reliable;
2. Latency is zero;
3. Bandwidth is infinite;
4. The network is secure;
5. Topology doesn't change;
6. There is one administrator;
7. Transport cost is zero;
8. The network is homogeneous.
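Code that takes the first two fallacies seriously usually wraps remote calls in a timeout and a bounded retry. The sketch below is one hypothetical way to do that in Python; the function name, the parameters, and the choice of exponential backoff are assumptions for illustration and do not come from the talk.

```python
import time


def call_with_retries(operation, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call operation(), retrying on connection errors and timeouts.

    The delay doubles after every failed attempt (exponential backoff).
    `sleep` is injectable so the behaviour can be tested without waiting.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return operation()
        except (ConnectionError, TimeoutError) as error:
            last_error = error
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    # Every attempt failed: surface the last error to the caller.
    raise last_error
```

Retrying is only safe when the remote operation is idempotent; otherwise a retry after a lost response can apply the same change twice.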
What could possibly go wrong?
Photo: Dave Lehl
OSI & TCP/IP
Source: https://www.guru99.com/difference-tcp-ip-vs-osi-model.html
What happens when you type google.com in your browser’s address bar and press Enter?
Source: https://github.com/alex/what-happens-when
Source: https://7216-presscdn-0-76-pagely.netdna-ssl.com/wp-content/uploads/2011/12/confused-man-single-good-men.jpg
Where do I start?
A structured approach to debugging distributed systems
1. Observe & document
2. Create minimal reproducer
3. Debug client side
4. Check DNS & routing
5. Check connection
6. Inspect traffic / messages
7. Debug server side
8. Wrap up & post mortem
Step 1: Observe & document
• What do you know about the problem?
• Inspect logging, errors, metrics, tracing
• Draw the path from source to target - what’s in between? Focus on details!
• Document what you know
• Can we reproduce in a test?
  • By injecting errors, for example
Tools: Whiteboard, documentation, logging, metrics, tracing (opentracing.io), tests, jepsen.io
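Injecting errors in a test can be as small as wrapping the real send function in something that fails on purpose. The sketch below is a hypothetical example: `FlakyTransport` and `echo` are invented names, and a real test would wrap the actual client call instead.

```python
import random


class FlakyTransport:
    """Wraps a send function and fails a configurable fraction of calls."""

    def __init__(self, send, failure_rate, seed=0):
        self._send = send
        self._failure_rate = failure_rate
        # Seeded so that a failing run can be replayed exactly.
        self._rng = random.Random(seed)

    def send(self, message):
        if self._rng.random() < self._failure_rate:
            raise ConnectionError("injected failure")
        return self._send(message)


def echo(message):
    """Stand-in for a real remote call (hypothetical)."""
    return message


if __name__ == "__main__":
    transport = FlakyTransport(echo, failure_rate=0.5)
    for i in range(5):
        try:
            print("sent", transport.send(f"message {i}"))
        except ConnectionError as error:
            print("failed:", error)
```

A fixed seed keeps the failure pattern identical between runs, which makes it possible to assert on how the code under test reacts.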
Step 2: Create minimal reproducer
• Goal: maximise the number of debugging cycles
• Focus on short development iterations / feedback loops
• Get close to the action!
Tools: IDE, shell scripts, SSH tunnels, curl
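A reproducer gives fast feedback when it runs the failing call many times and summarises what happened. The sketch below assumes the call is available as a Python callable; `run_reproducer` and `works_half_the_time` are hypothetical names used only for illustration.

```python
from collections import Counter
from itertools import count


def run_reproducer(attempt, runs=20):
    """Call attempt() repeatedly and count outcomes by result or exception type."""
    outcomes = Counter()
    for _ in range(runs):
        try:
            outcomes[f"ok: {attempt()}"] += 1
        except Exception as error:  # broad on purpose: every failure mode is interesting
            outcomes[f"error: {type(error).__name__}"] += 1
    return outcomes


if __name__ == "__main__":
    calls = count()

    def works_half_the_time():
        # Stand-in for the real request: every second call fails.
        if next(calls) % 2:
            raise ConnectionError("simulated failure")
        return 200

    for outcome, seen in run_reproducer(works_half_the_time).items():
        print(f"{seen:3d} x {outcome}")
```

A tally like this shows at a glance whether a problem is intermittent, and with what ratio, which is often the first real clue.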
Step 3: Debug client side
• Focus on eliminating anything that could be wrong on the client side
  • Are we connecting to the right host?
  • Do we send the right message?
  • Do we receive a response?
• Not much different from local debugging
Tools: IDE, debugger, logging
Step 4: Check DNS & routing
• DNS:
  • Make sure you know what IP address the hostname should resolve to
  • Verify that this actually happens at the client
• Routing:
  • Verify you can reach the target machine
Tools: host, nslookup, dig, whois, ping, traceroute, nslookup.io, dnschecker.org
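The DNS check can also be scripted from the client's point of view, so it uses the same resolver configuration as the application. The sketch below uses Python's standard `socket.getaddrinfo`; the helper names, `service.example`, and the address `192.0.2.10` are placeholders, not values from the talk.

```python
import socket


def resolved_addresses(hostname, resolver=socket.getaddrinfo):
    """Return the sorted, de-duplicated IP addresses the resolver gives for hostname."""
    infos = resolver(hostname, None)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def check_dns(hostname, expected, resolver=socket.getaddrinfo):
    """Compare resolved addresses with the expected ones; return (matches, actual)."""
    actual = resolved_addresses(hostname, resolver)
    return set(actual) == set(expected), actual


if __name__ == "__main__":
    # Placeholder hostname and address: replace with the real target.
    try:
        matches, actual = check_dns("service.example", ["192.0.2.10"])
        print("as expected" if matches else "MISMATCH", actual)
    except socket.gaierror as error:
        print("resolution failed:", error)
```

Running this on the client machine itself matters: a hosts file, a split-horizon DNS setup, or a stale cache can make the client see a different address than a lookup from a laptop.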
Step 5: Check connection
• Can we connect to the port?
• If not, do we get a REJECT or a DROP?
• Does the connection open and stay open?
• Are we talking TLS?
• What is the connection speed between us?
Tools: telnet, nc, curl, iperf
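The difference between a REJECT and a DROP is visible from the client: a rejected connection usually fails immediately, a dropped one hangs until the timeout. The sketch below classifies a TCP connection attempt with Python's standard `socket` module; `probe_port`, its labels, and the example host and port are assumptions for illustration.

```python
import socket


def probe_port(host, port, timeout=3.0):
    """Classify a TCP connection attempt as 'open', 'refused', 'timeout' or 'error: ...'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Immediate failure: typically a closed port or a firewall REJECT rule.
        return "refused"
    except socket.timeout:
        # No answer at all: often a firewall DROP rule or an unreachable host.
        return "timeout"
    except OSError as error:
        # Anything else, such as a DNS failure or "no route to host".
        return f"error: {error}"


if __name__ == "__main__":
    # Placeholder target: replace with the real host and port.
    print(probe_port("127.0.0.1", 8080, timeout=1.0))
```

This only answers whether the port accepts a TCP connection. Whether the connection stays open and whether the other side speaks TLS still need a tool such as `curl` or a longer-lived test.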
Step 6: Inspect traffic / messages
• Do we send the right request?
• Do we receive the right response?
• How do we know?
• How do we handle TLS?
• Are there any load balancers or proxies in between?
Tools: curl, Wireshark, tcpdump, network tab in browser, mitm/tls proxy
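Response headers sometimes reveal intermediaries without a packet capture. The sketch below filters a header mapping for a few names that proxies and caches commonly add; the list is an assumption and far from complete (`Via` and `Age` are standard HTTP headers, `X-Cache` is a widespread but non-standard convention), and `intermediary_hints` is a hypothetical helper.

```python
# Header names that often indicate a proxy, cache or CDN in the path.
# Assumption: illustrative only, not an exhaustive or authoritative list.
HINT_HEADERS = {"via", "age", "x-cache"}


def intermediary_hints(headers):
    """Return the subset of response headers that hint at an intermediary."""
    return {
        name: value
        for name, value in headers.items()
        if name.lower() in HINT_HEADERS
    }


if __name__ == "__main__":
    # Made-up example response headers.
    sample = {
        "Content-Type": "text/html",
        "Via": "1.1 proxy.example",
        "Age": "42",
    }
    print(intermediary_hints(sample))
```

With the standard library, the mapping could come from `dict(response.headers.items())` on the object returned by `urllib.request.urlopen`. An empty result proves nothing: many load balancers add no headers at all.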
Step 7: Debug server side
• Inspect the remote host
• Can we attach a remote debugger?
  • See https://youtube.com/OpenValue
• Profiling
  • Java Flight Recorder
• strace
Tools: SSH tunnels, remote debugger, profiler, strace, JFR
Step 8: Wrap up & post mortem
• Document the issue:
  • Timeline
  • What did we see?
  • Why did it happen?
  • What was the impact?
  • How did we find out?
  • What did we do to mitigate and fix?
  • What should we do to prevent repetition?
Tools: Whiteboard, documentation
“If you really want a reliable system, you have to understand what its failure modes are. You have to actually have witnessed it misbehaving.”
- Jason Cahoon
Distributed systems war stories
The one where it worked half of the time…
The one at a school…
Devoxx Belgium 2022 - Debugging distributed systems
The one with breaking news…
Summary: a structured approach to debugging distributed systems
1. Observe & document
2. Create minimal reproducer
3. Debug client side
4. Check DNS & routing
5. Check connection
6. Inspect traffic / messages
7. Debug server side
8. Wrap up & post mortem
Source: https://cdn2.vox-cdn.com/thumbor/J9OqPYS7FgI9fjGhnF7AFh8foVY=/148x0:1768x1080/1280x854/cdn0.vox-cdn.com/uploads/chorus_image/image/46147742/cute-success-kid-1920x1080.0.0.jpg
THAT’S IT.
NOW GO KICK SOME ASS!
Questions?
@bjschrijver
Thanks for your time.
Got feedback? Tweet it!
All pictures belong to their respective authors
@bjschrijver
