Read in: Français

Automating runbooks for critical incidents

Static runbooks are useless during a crisis. The real value comes when they are scripts you can trigger.

How I approach it

  1. Each runbook is a script (bash, Python, Playwright) stored in the repo with a clear description.
  2. Critical steps can be executed through a helper CLI (e.g., npm run incident --runbook=slow-query).
  3. Every runbook includes a post-mortem template and triggers alerts that verify when it was last executed.

These runbooks do real work: cleaning queues, validating transactions, and posting summaries to Slack/Teams/Telegram.

Why it matters

  • An incident becomes a routine instead of a panic. Automated runbooks deliver the exact commands and remove copy-paste errors.
  • Teams can extend a runbook by coding a new step and hooking it up to the alert.
  • Each execution produces logs, so we can improve the playbook and prevent future outages.