Operations

The runbook that lived in one person's head

Production went down on a Saturday. The only engineer who knew how to fix it was on a plane. Two hours of revenue burned waiting for him to land.

Illustration · Deimar Gutiérrez

Production went down on a Saturday afternoon. The on-call engineer paged the senior backend engineer who, by company convention, fixed this subsystem when it failed. The senior engineer was on a transatlantic flight, with his phone in airplane mode, somewhere over Greenland.

The company burned two hours of revenue waiting for the plane to land. The fix, when he reached his laptop, took eleven minutes.

The runbook for that subsystem existed in exactly one place: his head. There was a wiki page with the system name on it. The page had been written eighteen months earlier and was wrong in three ways that would have actively misled anyone trying to use it. The functional runbook was the senior engineer's memory of the last four times he had fixed it.

Every company has these. They are visible only at the moment they fail. Tribal knowledge looks like an asset most of the time and becomes a liability the one Saturday a quarter when it matters. The asset framing is dangerous because it discourages the documentation work: the senior engineer is fast, reliable, on Slack. The senior engineer is also a single point of failure the org chart does not show.

The cheapest audit is unsentimental. Make the senior person take a real week off. Phone off. No Slack. Whatever breaks while they are gone is the documentation map. Whatever the on-call team had to escalate, route around, or hot-patch is the list of runbooks that need to exist. The exercise is uncomfortable and exposes things the senior person did not realize only they knew. That discomfort is the point.

The writeup that comes out is predictable. It opens with the symptom, not the system. The first sentence is what someone actually types into Google at three in the morning: "checkout 500 error on stripe webhook", not "payment infrastructure subsystem overview". The next three sentences are the exact diagnostic commands. The runbook is judged by whether a tired on-call engineer who has never seen this system can resolve the incident from the doc alone. If they cannot, the doc is wrong, regardless of how comprehensive it looks.

The hire who knows everything is a feature only as long as they are reachable. The day they are not, the feature reveals itself as the bug it always was. The fix is not to clone the engineer. It is to make the engineer dispensable, which is the same thing every engineer wishes for in principle and resists in practice. Make them resist it less. Start with their next vacation.