System Diagrams Are Performance Caches for Cognitive Load
As is common in my line of work as a Staff Engineer, today I found myself discussing a complex scaling problem in a complex sub-system of a complex distributed system.
This discussion is the sort of thing that sometimes happens offhand as we encounter problems, but in this case it was in a weekly office-hours session where we reserve time to work through exactly this kind of problem with a broader group of engineers than might usually tackle it. All of the participants except me had multiple years of experience with the larger complex system, and were able to lean on that shared knowledge to have an efficient discussion.
But, having joined just a few months ago, I was overwhelmed about five minutes in. The individual words and concepts all made sense. JSON parsing slow. Network transit treacherous. Changing things at the source hard. I understood each of those components of the discussion, but through the whole thing I was just barely able to follow the overall system conversation and could only ask very basic questions to understand what was going on. I came away with a bunch of exploratory personal action items, and a very clear hole in my mental model of the system that needs to be filled.
It occurred to me, though, that we, the engineers, are part of the system too. If we think about what we know about caching, computation, and storage, and apply it here, we can say that we had several nodes with primed local in-memory caches, and one new node with almost no local cache.
During this discussion, latency requirements were pretty relaxed. I can take days to fetch things from documentation and read the code when necessary. Since I'm new, it's expected that I will run my cognitive processors at a high load precisely to populate my local in-memory cache. And everyone else in the meeting will also be making slower decisions based on their own best guesses of how the system works, drawn from their local caches and this discussion.
But imagine responding to an incident on this system with only a partial understanding of this problem. Not everyone is present. Fetching from the code is the most expensive cache miss, expending precious cognitive load. Reading user docs, an RFC, or a design document top to bottom is only slightly more efficient. Meanwhile, we have a system to observe, alarms to silence, run-books to read, and mitigation steps to enact, all with a very tight latency requirement of ASAP.
A single system diagram is where those primed nodes can push the most relevant bits of their information out of their local brain-caches, and into a high-performance distributed cache from which everyone can read. This will preserve precious cognitive load for those critical low-latency tasks.
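To make the analogy concrete, here is a minimal sketch of that read hierarchy. Every name in it is invented for illustration: check the local cache first, then the shared one, and only fall back to the expensive source fetch when both miss, populating the caches on the way back.

```python
# A hypothetical tiered lookup mirroring the analogy: recalling something
# yourself is nearly free, reading the shared diagram is cheap, and going
# back to the code or design docs is the expensive path of last resort.
local_cache = {}    # one engineer's head
shared_cache = {}   # the system diagram everyone can read

def fetch_from_source(key):
    """Stand-in for reading the code or a design doc end to end: slow."""
    return f"value-for-{key}"  # pretend this took hours of focused reading

def lookup(key):
    if key in local_cache:                 # cheapest: already in your head
        return local_cache[key]
    if key in shared_cache:                # cheap: someone pushed it to the diagram
        local_cache[key] = shared_cache[key]
        return local_cache[key]
    value = fetch_from_source(key)         # expensive miss: go to the source
    shared_cache[key] = value              # ...then populate both caches so the
    local_cache[key] = value               # next reader pays far less
    return value
```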
Of course, all of these caches may be stale. The local in-memory ones are particularly hard to test, but at least the system diagram is observable. Everyone can look at it, and if there are nodes with updates, they can update the cache.
And the diagram is probably going to be stale as the system evolves, and will be stale no matter what, because it's just a map, not the territory. But, as we know with distributed caches, a stale read that keeps the system up beats a consistent read that costs more resources than you have.
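As a rough illustration of that trade-off (the names, TTL, and failure mode here are all made up), a cache that serves a stale entry when the authoritative read is too expensive or unavailable keeps answering questions, while insisting on consistency would simply fail:

```python
import time

CACHE_TTL_SECONDS = 300   # how long an entry counts as fresh (arbitrary)
cache = {}                # key -> (value, stored_at)

def fetch_authoritative(key):
    """Stand-in for the expensive, fully consistent read."""
    raise TimeoutError("source overloaded")  # pretend we can't afford it right now

def read(key):
    entry = cache.get(key)
    if entry is not None and time.time() - entry[1] < CACHE_TTL_SECONDS:
        return entry[0]                       # fresh hit
    try:
        value = fetch_authoritative(key)      # try the consistent path first
        cache[key] = (value, time.time())
        return value
    except TimeoutError:
        if entry is not None:
            return entry[0]                   # stale, but it keeps us running
        raise                                 # nothing cached: no choice but to fail
```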
So, prime those caches. Draw a picture of your system today!