Static runbooks are useless during a crisis. The real value comes when they are scripts you can trigger.
How I approach it
- Each runbook is a script (bash, Python, Playwright) stored in the repo with a clear description.
- Critical steps can be executed through a helper CLI (e.g.,
npm run incident --runbook=slow-query). - Every runbook includes a post-mortem template and triggers alerts that verify when it was last executed.
These runbooks do real work: cleaning queues, validating transactions, and posting summaries to Slack/Teams/Telegram.
Why it matters
- An incident becomes a routine instead of a panic. Automated runbooks deliver the exact commands and remove copy-paste errors.
- Teams can extend a runbook by coding a new step and hooking it up to the alert.
- Each execution produces logs, so we can improve the playbook and prevent future outages.