How the Runbook Coverage Check Works: Inside the Mission

Harry's companion piece on why coverage gaps hurt makes the business case; my job is to open the hood. When people hear "AI checks your runbook coverage," they picture a model hallucinating documentation into a wiki. That's not what this is. The Runbook Coverage Check is a StudioX Mission — a stateful, observable workflow that reasons one step at a time toward a verdict, with a human gate on anything that writes. Let me walk through exactly how it's assembled and what runs when a new failure mode lands.

A Mission is a small org chart of agents

A Mission isn't a script. It's a roster of specialist agents, each backed by its own bot and its own knowledge base, coordinated by a two-tier reasoning system. For coverage checking, the roster looks like this:

a Change agent that reads what actually changed in the codebase,
an Alert agent that queries the alerting platform to see whether a matching alert already exists,
a Runbook agent that searches the runbook store and the incident-history knowledge base,
a Draft agent that authors a proposed alert definition and a proposed runbook when something is missing,
a Report agent that records the coverage verdict.

None of these are hand-coded pipelines. Each is a StudioX Vibe plus a bot with a knowledge base, registered into the Mission. That distinction matters: if tomorrow you want the check to also consider dashboard coverage, you register a Dashboard agent and the reasoning core starts considering it automatically. No platform release. Change the roster, change the behavior.
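To make "change the roster, change the behavior" concrete, here is a minimal sketch of a roster as plain data. It is not the StudioX API: the Agent and Mission classes, the register and directory methods, and the agent names are all hypothetical, chosen only to show that adding a Dashboard agent is a registration rather than a code change.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Agent:
    name: str
    purpose: str  # one-line description a router could read


@dataclass
class Mission:
    name: str
    roster: Dict[str, Agent] = field(default_factory=dict)

    def register(self, agent: Agent) -> None:
        """Add an agent to the roster; duplicate names are rejected."""
        if agent.name in self.roster:
            raise ValueError(f"agent already registered: {agent.name}")
        self.roster[agent.name] = agent

    def directory(self) -> List[str]:
        """One line per agent, in registration order."""
        return [f"{a.name}: {a.purpose}" for a in self.roster.values()]


# Hypothetical names mirroring the five agents listed above.
mission = Mission("runbook-coverage-check")
for name, purpose in [
    ("change", "reads what changed in the codebase"),
    ("alert", "checks the alerting platform for a matching alert"),
    ("runbook", "searches the runbook store and incident history"),
    ("draft", "authors a proposed alert definition and runbook"),
    ("report", "records the coverage verdict"),
]:
    mission.register(Agent(name, purpose))

# Extending the check is one more registration; nothing else changes.
mission.register(Agent("dashboard", "checks for dashboard coverage"))
print("\n".join(mission.directory()))
```

The point of the sketch is the shape: whatever decides who acts next only ever sees the directory, so a new entry becomes a new option without touching the decision logic.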

Tier 1: the reasoning core decides who acts next

When the Mission is triggered on a merged change, control goes to the reasoning core — Tier 1, the router. It reads the intent ("a new failure mode appeared; is it covered?"), the agent roster, and a one-line directory of every knowledge base, and then it makes a routing decision: pick exactly one agent to act this round, or declare the request answered. It runs up to ten rounds, re-reading every previous agent result before each decision.

This is genuinely reasoning, not a decision tree. The router re-evaluates semantically each round — a grounded negative like "no alert matches this failure signature" is a complete finding, not an error, and the router treats it as one. So a natural sequence emerges without anyone scripting it: route to the Change agent to characterize the new failure mode, then to the Alert agent to check for a matching alert, then to the Runbook agent to check for a matching runbook, and — only if a gap is confirmed — to the Draft agent. When the router judges the question fully answered, it stops and the Mission synthesizes one coherent verdict.
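The round structure can be sketched in a few lines. This is an illustration under stated assumptions, not the platform's code: run_mission, stub_router, and the canned agent findings are hypothetical. In particular, the real router is a model call that re-evaluates semantically; the fixed rules in stub_router stand in for it only so the loop can run on its own.

```python
from typing import Callable, Dict, List, Optional, Tuple

MAX_ROUNDS = 10  # the round cap described above

Finding = Dict[str, object]
History = List[Tuple[str, Finding]]


def run_mission(
    intent: str,
    agents: Dict[str, Callable[[str], Finding]],
    choose_next: Callable[[str, History], Optional[str]],
) -> History:
    """Route to one agent per round until the router says the intent is answered."""
    history: History = []
    for _ in range(MAX_ROUNDS):
        pick = choose_next(intent, history)  # sees every earlier result
        if pick is None:  # router judges the question answered
            break
        history.append((pick, agents[pick](intent)))
    return history


def stub_router(intent: str, history: History) -> Optional[str]:
    """Stand-in for the model call; fixed rules only so the example runs."""
    done = [name for name, _ in history]
    for name in ("change", "alert", "runbook"):
        if name not in done:
            return name
    # A grounded "not found" is a finding, and it is what confirms a gap.
    gap = any(not f["found"] for n, f in history if n in ("alert", "runbook"))
    if gap and "draft" not in done:
        return "draft"
    return None


# Canned, made-up findings for illustration.
agents = {
    "change": lambda intent: {"found": True, "note": "new failure mode characterized"},
    "alert": lambda intent: {"found": False, "note": "no alert matches this failure signature"},
    "runbook": lambda intent: {"found": False, "note": "no runbook matches"},
    "draft": lambda intent: {"found": True, "note": "proposed alert and runbook drafted"},
}

history = run_mission("a new failure mode appeared; is it covered?", agents, stub_router)
print([name for name, _ in history])  # ['change', 'alert', 'runbook', 'draft']
```

Note that the negative findings from the Alert and Runbook agents do not abort the loop. They are recorded like any other result, and they are exactly what leads to the Draft agent being picked.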

Tier 2: each agent plans and executes against real tools

When the router hands a goal to an agent that has a bot, Tier 2, the agent planner, takes over. It first discovers what that agent's bot can actually do: its MCP tools, its knowledge bases, its vibes. Then, if the agent has more than one real capability, it plans an ordered set of steps (capped at six, always ending in a synthesis step) and executes them by chatting with the bot in plain language. A separate model call validates each step (did this step actually do its job?), with one retry on failure.
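The two constraints in that paragraph, the six-step cap with a forced synthesis step and the validate-then-retry-once rule, are easy to state as code. The sketch below is hypothetical throughout: build_plan, run_step, and the flaky executor are invented names, and the validator here is a trivial non-empty check standing in for the separate model call. What happens after a second failed validation is not specified above, so the sketch simply reports the step as invalid.

```python
from typing import Callable, Dict, List

MAX_STEPS = 6  # cap from the text, including the final synthesis step


def build_plan(candidate_steps: List[str]) -> List[str]:
    """Trim a proposed plan to the cap and force it to end in synthesis."""
    body = [s for s in candidate_steps if s != "synthesize"]
    return body[: MAX_STEPS - 1] + ["synthesize"]


def run_step(
    step: str,
    execute: Callable[[str], str],
    validate: Callable[[str, str], bool],
) -> Dict[str, object]:
    """Execute one step, validate it, and retry once if validation fails."""
    output = ""
    for attempt in (1, 2):  # first try plus one retry
        output = execute(step)
        if validate(step, output):
            return {"step": step, "output": output, "attempts": attempt, "valid": True}
    return {"step": step, "output": output, "attempts": 2, "valid": False}


# A proposed plan that is too long gets trimmed to five steps plus synthesis.
plan = build_plan([f"step {i}" for i in range(1, 9)])

# An executor that returns nothing on its first call, to exercise the retry.
calls = {"n": 0}


def flaky_execute(step: str) -> str:
    calls["n"] += 1
    return "" if calls["n"] == 1 else f"done: {step}"


def non_empty(step: str, output: str) -> bool:
    return bool(output)  # stand-in for the separate validation model call


result = run_step("search runbook store", flaky_execute, non_empty)
print(plan)
print(result["attempts"], result["valid"])  # 2 True
```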

Here's the honest part about read versus write. The Alert agent and the Runbook agent are read-only: they query the alerting API and search the runbook store to answer "does coverage exist?" The Change agent reads the diff. None of these mutate anything. The Draft agent produces text — a proposed alert rule and a proposed runbook — but producing a draft is still not the same as publishing it. Nothing lands in your alerting config or your wiki from inside this loop. That gate comes later, and it's deliberate.

Observations: you watch it reason

The reason I trust this in front of an SRE team is that nothing is a black box. Every phase (each routing decision, each agent's discovery, its plan, every step and that step's validation verdict, the final answer check) is emitted as an observation. These stream live to the Explain rail as the Mission runs and are recorded in order as trace events. When the Mission says "no runbook exists for this failure mode," you can expand the trace and see which runbook store it searched, what query it used, and why it concluded there was no match. That's the difference between a tool you can audit and a tool you can only hope is right.
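For a sense of what an ordered, inspectable trace might look like, here is a small sketch. The Observation fields, the phase names, and the Trace class are my assumptions for illustration and do not describe the platform's actual event schema; the store name and query are made up.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class Observation:
    seq: int                 # position in the trace, starting at 1
    phase: str               # e.g. "route", "plan", "step", "validate"
    agent: Optional[str]     # None for Mission-level phases
    detail: Dict[str, Any]   # whatever the phase wants to expose


class Trace:
    def __init__(self) -> None:
        self._events: List[Observation] = []

    def emit(self, phase: str, agent: Optional[str] = None, **detail: Any) -> Observation:
        """Record one observation, numbered in emission order."""
        obs = Observation(len(self._events) + 1, phase, agent, detail)
        self._events.append(obs)
        return obs

    def for_agent(self, agent: str) -> List[Observation]:
        """All observations emitted on behalf of one agent, in order."""
        return [o for o in self._events if o.agent == agent]

    def to_jsonl(self) -> str:
        """One JSON object per line, suitable for streaming or storage."""
        return "\n".join(json.dumps(asdict(o)) for o in self._events)


trace = Trace()
trace.emit("route", picked="runbook", reason="alert check done; runbook coverage unknown")
trace.emit("step", agent="runbook", store="runbook-store", query="payments timeout")
trace.emit("validate", agent="runbook", verdict="pass")
trace.emit("answer_check", verdict="no runbook exists for this failure mode")

# "Expanding the trace" for the Runbook agent: which store, which query.
for obs in trace.for_agent("runbook"):
    print(obs.seq, obs.phase, obs.detail)
```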

Instant MCP servers wire the enterprise tools

The agents don't have GitHub, your alerting platform, or your wiki hard-coded into them. Those arrive as MCP servers, registered once and discovered at runtime. The Alert agent's bot calls whatever alerting API you've wired; swap PagerDuty for Opsgenie and you re-point the MCP server, not the agent. This is what lets a coverage check stand up in hours: you connect the three systems that hold the truth — the repo, the alerting config, the runbook store — and the agents call the tool names, indifferent to what's behind them.
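The indirection is the whole trick, and it fits in a short sketch. To be clear about assumptions: this is not the MCP SDK or any vendor's API. ToolRegistry, the tool name alerts.search, and the two backend functions are hypothetical stand-ins that show an agent calling a tool name while the thing behind the name is swapped.

```python
from typing import Callable, Dict, List

ToolFn = Callable[[str], List[str]]


class ToolRegistry:
    """Maps tool names to whatever implementation is currently wired in."""

    def __init__(self) -> None:
        self._tools: Dict[str, ToolFn] = {}

    def register(self, name: str, fn: ToolFn) -> None:
        self._tools[name] = fn  # registering an existing name re-points it

    def call(self, name: str, arg: str) -> List[str]:
        return self._tools[name](arg)


def alert_agent(tools: ToolRegistry, signature: str) -> Dict[str, object]:
    """Knows only the tool name, never the platform behind it."""
    matches = tools.call("alerts.search", signature)
    return {"covered": bool(matches), "matches": matches}


# Two made-up backends standing in for two alerting platforms.
def backend_a_search(signature: str) -> List[str]:
    return ["payments-timeout-high"] if "timeout" in signature else []


def backend_b_search(signature: str) -> List[str]:
    return []  # this platform has no matching alert yet


tools = ToolRegistry()
tools.register("alerts.search", backend_a_search)
before = alert_agent(tools, "payments timeout")

tools.register("alerts.search", backend_b_search)  # swap the platform
after = alert_agent(tools, "payments timeout")

print(before["covered"], after["covered"])  # True False
```

The alert_agent function is byte-for-byte the same across the swap, which is the property the paragraph above is describing.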

The human gate on the only write

The check itself is read-only, and drafting is just proposing. Publishing is the one action with blast radius, so it never happens autonomously. When the Draft agent produces a proposed alert and runbook, the Mission's synthesis ends with a [REQUEST_APPROVAL] block instead of claiming it shipped anything. The route layer turns that into a decision queue row, emails the owning engineer a magic-link approve/reject URL, and shows the pending item in the portal. A human reads the draft, edits or approves, and only then does it land. Human-in-the-loop isn't a bolt-on — it's where the design puts the one irreversible step.
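Here is a rough sketch of what the route layer's first step could look like. Only the [REQUEST_APPROVAL] tag comes from the description above; everything else is an assumption made for illustration: the closing [/REQUEST_APPROVAL] tag, the key: value lines inside the block, the field names, the row layout, and the example.com URLs.

```python
import re
import secrets
from typing import Dict, Optional

# Assumed block shape: an opening tag, "key: value" lines, a closing tag.
BLOCK = re.compile(r"\[REQUEST_APPROVAL\](.*?)\[/REQUEST_APPROVAL\]", re.DOTALL)


def extract_request(synthesis: str) -> Optional[Dict[str, str]]:
    """Pull the approval request out of a synthesis, or None if there is none."""
    match = BLOCK.search(synthesis)
    if match is None:
        return None
    fields: Dict[str, str] = {}
    for line in match.group(1).strip().splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that are not "key: value"
            fields[key.strip()] = value.strip()
    return fields


def to_decision_row(fields: Dict[str, str], base_url: str) -> Dict[str, Optional[str]]:
    """Build a pending decision row with single-use approve/reject links."""
    token = secrets.token_urlsafe(16)  # unguessable token for the magic link
    return {
        "status": "pending",  # nothing is published until a human decides
        "owner": fields.get("owner"),
        "action": fields.get("action"),
        "approve_url": f"{base_url}/decide/{token}?verdict=approve",
        "reject_url": f"{base_url}/decide/{token}?verdict=reject",
    }


synthesis = """Coverage verdict: gap confirmed. Drafts are attached for review.
[REQUEST_APPROVAL]
action: publish proposed alert and runbook
owner: owner@example.com
[/REQUEST_APPROVAL]"""

fields = extract_request(synthesis)
row = to_decision_row(fields, "https://portal.example.com") if fields else None
print(row)
```

The design choice worth noticing is that the synthesis never performs the write. It only asks, and the row stays pending until someone follows one of the links.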

That's the whole machine: a roster of agents, a router that reasons round by round, observations you can watch, MCP servers that supply the truth, and a decision gate on the single write. Patrick's piece on the Runbook Coverage Check in practice shows what it feels like on a real team, and the general shape lives in the Missions platform, which runs entirely inside your own perimeter.