February 7, 2026 · 1 min read

Automating runbooks for critical incidents

Tags: Operations, Incidents, Runbook

Static runbooks are useless during a crisis. The real value comes when they are scripts you can trigger.

How I approach it

Each runbook is a script (bash, Python, Playwright) stored in the repo with a clear description.
Critical steps can be executed through a helper CLI (e.g., npm run incident --runbook=slow-query).
Every runbook includes a post-mortem template and is covered by an alert that checks when it was last executed.
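
To make the three points above concrete, here is a minimal sketch in Python of a runbook registry with a small CLI entry point and a "last executed" check. All names (`Runbook`, `REGISTRY`, `stale_runbooks`, the placeholder `slow_query` body) are hypothetical and only illustrate the idea; an actual setup would persist `last_run` somewhere durable and wire the staleness check into its alerting.

```python
"""Minimal runbook registry with a staleness check (illustrative names only)."""
import argparse
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable, Dict, List, Optional


@dataclass
class Runbook:
    name: str
    description: str
    action: Callable[[], str]
    last_run: Optional[datetime] = None  # assumption: kept in memory for the sketch


REGISTRY: Dict[str, Runbook] = {}


def runbook(name: str, description: str):
    """Register a function as a runbook under a stable name."""
    def decorator(func: Callable[[], str]) -> Callable[[], str]:
        REGISTRY[name] = Runbook(name, description, func)
        return func
    return decorator


@runbook("slow-query", "Report long-running database queries")
def slow_query() -> str:
    # Placeholder body: a real runbook would query the database here.
    return "no slow queries found"


def run(name: str, now: Optional[datetime] = None) -> str:
    """Execute a runbook by name and record when it ran."""
    entry = REGISTRY[name]  # raises KeyError for unknown names
    result = entry.action()
    entry.last_run = now or datetime.now(timezone.utc)
    return result


def stale_runbooks(max_age: timedelta, now: Optional[datetime] = None) -> List[str]:
    """Names of runbooks never run, or last run longer ago than max_age."""
    now = now or datetime.now(timezone.utc)
    return sorted(
        rb.name for rb in REGISTRY.values()
        if rb.last_run is None or now - rb.last_run > max_age
    )


def main(argv: List[str]) -> str:
    """Parse --runbook=<name> and run the matching runbook."""
    parser = argparse.ArgumentParser(description="Run an incident runbook")
    parser.add_argument("--runbook", required=True, choices=sorted(REGISTRY))
    args = parser.parse_args(argv)
    return run(args.runbook)
```

Calling `main(["--runbook=slow-query"])` mirrors the CLI invocation above, and an alert could periodically call `stale_runbooks(timedelta(days=30))` and fire when the list is not empty.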

These runbooks do real work: cleaning queues, validating transactions, and posting summaries to Slack/Teams/Telegram.
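
As an illustration of the "posting summaries" part, the sketch below builds a plain-text summary from step results and posts it to a chat webhook. The function names and the step list are hypothetical. The `{"text": ...}` JSON body is the shape Slack incoming webhooks accept; Teams and Telegram expect different payloads, so treat the posting half as an assumption to adapt.

```python
import json
import urllib.request
from typing import List, Tuple


def build_summary(runbook: str, steps: List[Tuple[str, bool]]) -> str:
    """Render one line per step plus an overall status line."""
    lines = [f"Runbook {runbook}:"]
    for step, ok in steps:
        lines.append(f"- {step}: {'ok' if ok else 'FAILED'}")
    failed = sum(1 for _, ok in steps if not ok)
    lines.append("All steps succeeded." if failed == 0 else f"{failed} step(s) failed.")
    return "\n".join(lines)


def post_summary(webhook_url: str, text: str) -> int:
    """POST the summary as JSON to a chat webhook and return the HTTP status."""
    # Assumption: the webhook accepts a Slack-style {"text": ...} body.
    body = json.dumps({"text": text}).encode("utf-8")
    request = urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Keeping the formatting separate from the HTTP call makes the summary easy to test without a network, and lets one runbook post the same text to several channels.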

Why it matters

An incident becomes a routine instead of a panic. Automated runbooks deliver the exact commands and remove copy-paste errors.
Teams can extend a runbook by coding a new step and hooking it up to the alert.
Each execution produces logs, so we can improve the runbook after every incident and make future outages less likely.
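
One possible way to get a log out of every execution is to append a JSON line per step, as in the sketch below. The record fields, the `run_logged` and `failed_steps` helpers, and the stop-at-first-failure behavior are all assumptions made for the example, not a description of a specific tool.

```python
import json
import time
from typing import Callable, Dict, List


def run_logged(runbook: str, steps: Dict[str, Callable[[], None]], log_path: str) -> bool:
    """Run steps in order, append one JSON line per step, stop at the first failure."""
    with open(log_path, "a", encoding="utf-8") as log:
        for name, step in steps.items():
            started = time.time()
            try:
                step()
                error = None
            except Exception as exc:  # record the failure instead of losing it
                error = repr(exc)
            record = {
                "runbook": runbook,
                "step": name,
                "ok": error is None,
                "error": error,
                "duration_s": round(time.time() - started, 3),
            }
            log.write(json.dumps(record) + "\n")
            if error is not None:
                return False
    return True


def failed_steps(log_path: str) -> List[str]:
    """Read the log back and list the steps that failed."""
    with open(log_path, encoding="utf-8") as log:
        return [r["step"] for r in map(json.loads, log) if not r["ok"]]
```

Adding a step to a runbook then means adding one entry to the `steps` dictionary, and the log shows afterwards which steps fail most often and which ones are slow.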