The Deployment Risk Scorer in Practice: A Real Before and After

I run security and deployment, which is a polite way of saying I am the person who gets the call when a release goes sideways. So when we turned the Deployment Risk Scorer on our own pipeline, I wasn't looking for a demo. I was looking for whether it would have caught the three worst deploys of my last year. Here's what actually happened over one week in April, told the way it happened.

The change that looked boring

Tuesday, 11 a.m. A platform engineer opens a pull request that migrates our rate-limiter from an in-memory store to a shared cache. On the surface: a dependency bump and about 200 lines. The kind of change that gets a thumbs-up in four minutes because it's "just infrastructure." Historically, this is exactly the class of change that has cost us the most — not the scary-looking ones, which everyone rolls out carefully, but the boring ones that carry hidden blast radius.

Before, our process here was: the author picks a rollout percentage based on gut, an approver skims the diff, and we find out at runtime whether anyone remembered that the checkout service and the login service both sit behind that rate limiter. Reconstructing that by hand — pulling the incident history for the touched components, checking which services depend on them — is maybe forty minutes of work across PagerDuty, our incident wiki, and a dependency graph nobody keeps current. Under deadline, nobody spends the forty minutes. They ship.

This time the CI job posted the change to the Deployment Risk Scorer mission and I watched it run on the Explain rail.

Watching the forecast happen

The reasoning core routed first to the Change Intake agent, which read the diff and came back with the touched surface: the rate-limiter module, its config, and — this is the part a human skim misses — the two services that import it. Then it routed to the History agent, which queried its knowledge base and surfaced two escalations from the previous eight months, both on this exact module, both involving the shared-cache dependency behaving badly under login-peak load. The Blast-Radius agent connected the dots: a failure here degrades authentication for everyone, not just the feature that shipped it.
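The "two services that import it" step is, at heart, a reverse-dependency lookup. Here is a minimal sketch of that idea, not the Blast-Radius agent's actual implementation: the `blast_radius` function, the service names, and the shape of the dependency data are all hypothetical, chosen only to mirror the rate-limiter example.

```python
from collections import defaultdict, deque


def blast_radius(imports: dict[str, set[str]], touched: set[str]) -> set[str]:
    """Return every component that directly or transitively depends on a touched one."""
    # Invert the "A imports B" edges so we can walk from a module to its dependents.
    dependents: dict[str, set[str]] = defaultdict(set)
    for component, deps in imports.items():
        for dep in deps:
            dependents[dep].add(component)

    affected: set[str] = set()
    queue = deque(touched)
    while queue:
        current = queue.popleft()
        for parent in dependents[current]:
            if parent not in affected and parent not in touched:
                affected.add(parent)
                queue.append(parent)
    return affected


# Hypothetical dependency data: component -> the components it imports.
imports = {
    "checkout-service": {"rate-limiter"},
    "login-service": {"rate-limiter"},
    "reporting-service": {"metrics"},
    "rate-limiter": {"shared-cache-client"},
}

print(sorted(blast_radius(imports, {"rate-limiter"})))
# ['checkout-service', 'login-service']
```

The walk is breadth-first over inverted edges, so a change to something deeper (the cache client, say) would also pull in the rate limiter and both services behind it.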

None of that was new information. It all existed in our systems already. What was new was that it arrived before the deploy, assembled in about ninety seconds, with every step traced so I could see where each claim came from. The Canary Planner's proposal was specific: start at 2% of traffic, hold for 30 minutes across a login peak, watch auth error rate and cache latency, promote to 25% only if both stay flat, then 100%. And a clear abort line — if auth errors rise above baseline in phase one, roll back automatically and page the author, not the whole on-call rotation.
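To make the shape of that plan concrete, here is a rough sketch of how its promotion rules could be written down. This is my own illustration, not the Canary Planner's output format: the `Phase` and `decide` names, the 5% tolerance used to define "flat," and the hold time for the 25% phase are assumptions. Only the 2% / 25% / 100% steps, the 30-minute first hold, the two watched metrics, and the phase-one abort rule come from the plan itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    traffic_percent: int
    hold_minutes: int


# 2% for 30 minutes is from the plan; the 25% hold time is an assumed placeholder.
PLAN = [Phase(2, 30), Phase(25, 30), Phase(100, 0)]


def is_flat(current: float, baseline: float, tolerance: float) -> bool:
    """Treat a metric as flat if it is within `tolerance` above its baseline (assumed rule)."""
    return current <= baseline * (1 + tolerance)


def decide(phase_index: int, auth_error_rate: float, auth_baseline: float,
           cache_latency_ms: float, latency_baseline_ms: float,
           tolerance: float = 0.05) -> str:
    auth_flat = is_flat(auth_error_rate, auth_baseline, tolerance)
    latency_flat = is_flat(cache_latency_ms, latency_baseline_ms, tolerance)

    # Abort line: auth errors above baseline in phase one means roll back and page the author.
    if phase_index == 0 and not auth_flat:
        return "rollback_and_page_author"
    # Promote only if both watched metrics stayed flat.
    if auth_flat and latency_flat:
        return "promote" if phase_index < len(PLAN) - 1 else "done"
    # Anything else is left to a human: hold at the current percentage.
    return "hold"


print(decide(0, 0.010, 0.010, 48.0, 40.0))  # hold
print(decide(0, 0.010, 0.010, 40.5, 40.0))  # promote
print(decide(0, 0.020, 0.010, 40.0, 40.0))  # rollback_and_page_author
```

The first call is the interesting case: auth errors are flat but cache latency has drifted, so the sketch neither promotes nor aborts. It holds, and the person watching decides what to do next.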

Before: merge 11:04, ship to 100% 11:06, auth errors spike 13:20, resolved 20:40. About 9.5 hours, 2 SLA breaches.

After, with the Risk Scorer: PR opened 11:04, forecast (90 seconds) 11:06, plan approved 11:12, 2% canary held 11:15, 100% clean 12:05. About 1 hour, 0 breaches.

The forecast took ninety seconds, the approval added six minutes up front, and together they removed a nine-hour night.

The approval, and the part I care about

Here's where the honesty of the design showed. The forecast itself is read-only — it touched nothing in production, it just read the diff, the history, and the dependency data and produced a plan. Executing the rollout is a separate act, and it does not happen automatically. The mission ended its turn with an approval request, which landed as a row in our Decision Queue and emailed the release owner a magic-link approve/reject. He read the plan, agreed with the 2% start, and clicked approve at 11:12. Six minutes of a human's time.
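The separation between forecasting and executing can be pictured as a gate that refuses to start a rollout until a human has decided. The sketch below is an assumption-laden illustration, not the product's API: `RolloutRequest`, `Decision`, `ApprovalRequired`, and `start_rollout` are hypothetical names, and the change ID is made up.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


class ApprovalRequired(Exception):
    """Raised when a rollout is attempted without an explicit approval."""


@dataclass
class RolloutRequest:
    change_id: str
    plan_summary: str
    decision: Decision = Decision.PENDING  # every request starts undecided


def start_rollout(request: RolloutRequest) -> str:
    # Producing the forecast never reaches this function; only an approved request may proceed.
    if request.decision is not Decision.APPROVED:
        raise ApprovalRequired(
            f"{request.change_id} is {request.decision.value}; a human must approve first"
        )
    return f"rollout started for {request.change_id}"


request = RolloutRequest("change-123", "2% for 30 min, then 25%, then 100%")
try:
    start_rollout(request)
except ApprovalRequired as exc:
    print(exc)  # change-123 is pending; a human must approve first

request.decision = Decision.APPROVED  # the release owner clicks approve
print(start_rollout(request))  # rollout started for change-123
```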

We ran the canary exactly as planned. At 2%, over the next login peak, cache latency ticked up — not enough to breach, but visible, and precisely the signal the History agent had warned about. Because we were at 2% and not 100%, it affected almost nobody. We held, tuned a connection-pool setting, and promoted cleanly. The whole thing was done and at full traffic by early afternoon, with no page, no cascade, and no 3 a.m. anybody.

Run the counterfactual against the "before" timeline and the arithmetic is stark. The old path for this class of change was a two-hour delay before symptoms even appeared, then a multi-hour hunt because the failure showed up as auth errors with no obvious link to a rate-limiter change. Nine and a half hours, two SLA breaches, one burned evening. The new path cost six minutes of approval time up front and removed the incident entirely.

What a month looked like

One good week is an anecdote, so here's the month. We routed 41 changes through the mission. It flagged 6 as high-blast-radius and proposed tightened canary plans; the other 35 it cleared for a normal rollout. That mattered more than I expected: being told a change is safe, with the history to back it, let us ship the boring 85% of changes faster instead of slathering every deploy in the same defensive caution. Of the 6 flagged, 2 showed early canary signals that we caught at low percentage. My rough accounting: two avoided incidents at our historical average of roughly eight hours of response each, plus the forty minutes of manual history-digging we no longer do per risky change, against a setup cost of roughly an afternoon spent wiring MCP servers to GitHub, PagerDuty, and our dependency data.
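For anyone who wants to check the arithmetic, here it is spelled out. Two assumptions are mine: I count the forty minutes of history-digging once for each of the 6 flagged changes, and I leave the setup afternoon out because I never put a number on it. The function name is just for illustration.

```python
def monthly_accounting(changes_total: int = 41, flagged: int = 6,
                       avoided_incidents: int = 2, avg_response_hours: float = 8.0,
                       history_minutes_per_change: int = 40) -> dict[str, float]:
    cleared = changes_total - flagged
    incident_hours_avoided = avoided_incidents * avg_response_hours
    # Assumption: "per risky change" means each of the flagged changes.
    history_hours_saved = flagged * history_minutes_per_change / 60
    return {
        "cleared": cleared,
        "cleared_share": cleared / changes_total,
        "incident_hours_avoided": incident_hours_avoided,
        "history_hours_saved": history_hours_saved,
        "total_hours_saved": incident_hours_avoided + history_hours_saved,
    }


result = monthly_accounting()
print(f"{result['cleared']} cleared ({result['cleared_share']:.0%})")
# 35 cleared (85%)
print(f"{result['total_hours_saved']:.0f} hours saved, before setup cost")
# 20 hours saved, before setup cost
```

Sixteen of those twenty hours come from the two avoided incidents, so the estimate is only as good as the eight-hour historical average behind it.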

The thing I keep coming back to isn't the hours, though. It's that the knowledge was always there. Every one of those escalations was already in our systems; we just never got it in front of the person shipping in time to matter. Closing that gap is the whole game.

If you want the leadership framing, read Ajay's "why it matters" piece; for the architecture under the hood, Mark's "how it works" is the definitive one. And if you're weighing running this inside your own perimeter, our enterprise deployment and StudioX Missions pages cover the ground.
