
Integration Risk Scorer in Practice: Two Cycles Saved

Patrick Gilberg · Head of Security & Deployment
June 11, 2026

I run security and deployment, which means I am the person who has to believe a new system before it touches a customer's perimeter. So I don't get excited by demos. I get excited by a Tuesday that would have gone badly and didn't. Let me tell you about one, because the Integration Risk Scorer earned its keep on an ordinary change that nobody flagged as dangerous — which is exactly the kind that hurts.

The change that looked fine

A platform team I'd been working with — payments-adjacent, high transaction volume, the kind of shop where a bad cycle is measured in customer trust as much as hours — had a mid-sized integration in flight. An engineer, Deepak, was wiring their order service to a replacement fraud-scoring provider. The diff was clean. It swapped one client for another and adjusted a retry wrapper. Every reviewer who looked at it would have said the same thing: low risk, standard test plan, ship it.

Eighteen months earlier, that team had shipped escape 2024-1183 — a defect in how a different integration handled retries under partial failure, where a non-idempotent call got replayed and double-charged a sliver of customers. It slipped the test cycle, reached production, and cost them two full cycles to unwind: one to diagnose across three services, one to fix and re-certify under scrutiny. Nobody on the current change connected it to Deepak's diff. The engineer who lived through 1183 had moved to a different org six months prior. The knowledge existed; the recall didn't.

What actually happened on the Tuesday

We'd deployed the Scorer as a pre-test mission inside their perimeter — their escape log and post-mortems loaded into the Escape History Agent's knowledge base, their GitLab and their test-management system wired as instant MCP servers. It runs the moment a change is proposed, before the cycle spends a minute.

Deepak opened the change at 10:14. By 10:16 the mission had returned a verdict. I watched it on the Explain rail in real time, which is the part I care about as a security owner — I could see exactly what it looked at:

  • It read the diff and identified the seam: a retry wrapper around a non-idempotent external call.
  • It queried the escape history for that seam and surfaced 2024-1183 at high similarity, with the two-cycle cost attached.
  • It checked the planned suite and found no test exercising replay-under-partial-failure on the new client.

The verdict, in plain language: "Elevated test-escape risk. This change resembles escape 2024-1183 (retry/idempotency, cost two cycles). The current plan does not cover the replay path that escape slipped through. Recommend adding idempotency-under-retry coverage before the cycle."
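For readers who think in code, the three steps on the rail and the verdict they produce can be sketched roughly as below. This is a minimal illustration, not the Scorer's implementation: the `Escape` record, the seam tags, the Jaccard overlap used as a stand-in for "similarity", and the 0.6 threshold are all assumptions made up for the example.

```python
from dataclasses import dataclass

# Hypothetical sketch: every name, tag, and threshold here is an
# illustrative assumption, not the Scorer's actual design.


@dataclass(frozen=True)
class Escape:
    escape_id: str
    seam_tags: frozenset   # e.g. {"retry", "non-idempotent", "external-call"}
    uncovered_path: str    # the path the escape slipped through
    cost_cycles: int


def jaccard(a, b):
    """Set overlap in [0, 1]; a simple stand-in for a real similarity measure."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def score_change(change_tags, history, planned_paths, threshold=0.6):
    """Compare a change's seam to past escapes and check the planned suite."""
    scored = [(jaccard(change_tags, e.seam_tags), e) for e in history]
    scored = [(s, e) for s, e in scored if s >= threshold]
    if not scored:
        return {"risk": "normal", "cites": None,
                "similarity": 0.0, "missing_coverage": None}
    similarity, best = max(scored, key=lambda pair: pair[0])
    covered = best.uncovered_path in planned_paths
    return {
        "risk": "normal" if covered else "elevated",
        "cites": best.escape_id,            # cited, so a human can check it
        "similarity": round(similarity, 2),
        "missing_coverage": None if covered else best.uncovered_path,
    }


history = [
    Escape("2024-1183",
           frozenset({"retry", "non-idempotent", "external-call"}),
           "replay-under-partial-failure", 2),
]
change_tags = frozenset({"retry", "non-idempotent", "external-call", "client-swap"})
verdict = score_change(change_tags, history, planned_paths={"happy-path", "timeout"})
```

The design point worth keeping from the sketch is that the result carries the cited escape and the named gap, not just a score, which is what makes it checkable in seconds.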

One Tuesday, two minutes, two cycles saved

  • 10:14  Change proposed (looks low-risk)
  • 10:15  Pre-test scan runs, reads diff + history
  • 10:16  Verdict: resembles 2024-1183
  • By noon  Targeted tests added, bug caught in test

Savings ledger: this one change

  • Re-certification cycles avoided: 2 cycles (~3 weeks)
  • Engineer-hours reclaimed (diagnosis + rework): ~120 hrs
  • Customer-facing escapes prevented: 1

Cost of the forecast: two minutes of read-only compute and one targeted test suite, spent before the cycle.
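To make "idempotency-under-retry coverage" concrete, here is a minimal sketch of what one such targeted test could look like. It is not the team's actual suite. `FlakyProvider`, `with_retry`, `submit_safely`, and `submit_unsafely` are hypothetical names, and the fake assumes the provider deduplicates on an idempotency key, which a real provider may or may not do.

```python
import uuid


class FlakyProvider:
    """Hypothetical fake: performs the side effect, then loses the first response."""

    def __init__(self):
        self.seen_keys = set()
        self.applied = []   # every side effect that actually took place
        self.calls = 0

    def submit(self, idempotency_key, amount):
        self.calls += 1
        if idempotency_key not in self.seen_keys:
            self.seen_keys.add(idempotency_key)
            self.applied.append(amount)
        if self.calls == 1:
            # Partial failure: the work happened, but the caller never hears back.
            raise TimeoutError("response lost after side effect")
        return "ok"


def with_retry(fn, attempts=3):
    """Retry on timeout; re-raise the last error if every attempt fails."""
    last_error = None
    for _ in range(attempts):
        try:
            return fn()
        except TimeoutError as exc:
            last_error = exc
    raise last_error


def submit_safely(provider, amount):
    key = str(uuid.uuid4())   # one key, generated outside the retry loop
    return with_retry(lambda: provider.submit(key, amount))


def submit_unsafely(provider, amount):
    # Bug pattern: a fresh key per attempt, so the replay looks like new work.
    return with_retry(lambda: provider.submit(str(uuid.uuid4()), amount))


def test_replay_under_partial_failure_applies_once():
    provider = FlakyProvider()
    assert submit_safely(provider, 500) == "ok"
    assert provider.calls == 2          # first attempt timed out, second succeeded
    assert provider.applied == [500]    # the side effect happened exactly once


def test_unsafe_wrapper_replays_the_side_effect():
    provider = FlakyProvider()
    assert submit_unsafely(provider, 500) == "ok"
    assert provider.applied == [500, 500]   # the duplicate this coverage exists to catch
```

The second test is there only to show that the fake is capable of exposing the replay; a suite that cannot fail on the bad wrapper is not covering the path.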

Why the team believed it

A forecast is only useful if people act on it, and people only act on a machine's judgment when they can check the machine's work. Two things earned Deepak's trust in the moment.

First, it was cited, not asserted. The verdict didn't say "this is risky." It pointed at a specific past escape, in their own history, and named the exact uncovered path. That is an argument an engineer can evaluate in thirty seconds — and if it were wrong, he could have dismissed it just as fast. Which brings me to the honest part.

Second, it is calibrated, and I insist teams treat it that way. The Scorer forecasts resemblance; it is not an oracle. On that same team it has, on other changes, surfaced a resemblance that a lead checked against the trace and dismissed in under a minute: "no, that seam is different, we're fine." That is a healthy false positive, not a failure. The cost of a resemblance that turns out not to matter is one minute of a human's attention. The cost of the resemblance we miss is 2024-1183 all over again. Any security owner should take that trade every single time.
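The trade can be put in rough numbers using only figures from this story: about one minute per dismissed resemblance, and roughly 120 engineer-hours for the miss. The sketch below is back-of-envelope arithmetic under those assumptions. It counts engineer time only and ignores the two cycles and the customer impact, so it understates the cost of a miss.

```python
# Figures taken from this case study; both are approximate.
FALSE_POSITIVE_MINUTES = 1     # "one minute of a human's attention"
MISS_ENGINEER_HOURS = 120      # engineer-hours in the savings ledger


def break_even_dismissals(miss_hours, fp_minutes):
    """How many dismissed resemblances cost as much time as one missed escape."""
    return (miss_hours * 60) / fp_minutes


print(break_even_dismissals(MISS_ENGINEER_HOURS, FALSE_POSITIVE_MINUTES))
# prints 7200.0
```

On engineer time alone, a team would have to dismiss thousands of irrelevant resemblances before the review overhead matched one escape of this size.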

The rollout that let me sleep

My job is deployment, so let me be plain about the part that mattered to me. The escape history — post-mortems, defect records, the exact institutional memory an attacker or a competitor would love — never left their environment. The mission runs on the enterprise AI platform, inside their perimeter, against a knowledge base they own. Every tool it touches is a read-only MCP connection to systems they already ran. There was no new data pipeline to secure, because the scan doesn't move data out; it reasons over it in place and returns a verdict.

Standing it up took an afternoon: point the Escape History Agent at their log, register GitLab and their test tracker as MCP servers, and set the mission to run pre-test. That's the Missions model — you describe the agents and connect the knowledge, and the platform does the orchestration.
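As a way to picture that setup, here is a purely illustrative sketch of a mission description with a read-only check on its connections. It is not the platform's configuration schema or API: `MissionSpec`, `McpConnection`, the field names, and the access labels are all hypothetical, chosen only to mirror the three steps above.

```python
from dataclasses import dataclass

# Hypothetical data model for illustration only; the real platform's
# configuration format is not shown in this article.


@dataclass(frozen=True)
class McpConnection:
    name: str
    access: str   # assumed labels: "read-only" or "read-write"


@dataclass(frozen=True)
class MissionSpec:
    name: str
    trigger: str                # when the mission runs
    knowledge_sources: tuple    # what the Escape History Agent reads
    connections: tuple          # systems reached over MCP


def read_only_violations(spec):
    """Return the names of any connections that are not read-only."""
    return [c.name for c in spec.connections if c.access != "read-only"]


spec = MissionSpec(
    name="integration-risk-scorer",
    trigger="pre-test",
    knowledge_sources=("escape-log", "post-mortems"),
    connections=(
        McpConnection("gitlab", "read-only"),
        McpConnection("test-management", "read-only"),
    ),
)

violations = read_only_violations(spec)
```

A check like `read_only_violations` is the kind of gate a security owner might want in front of any such setup, so a write-capable connection gets caught at review time instead of in production.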

What it adds up to

One change. Two minutes of read-only compute. Two cycles and roughly a hundred and twenty engineer-hours that the team spent building instead of firefighting — and a double-charge bug that customers never saw. Multiply that across a quarter of integrations and the pattern is obvious: most of your escapes are variations on escapes you've already survived, and the record of them is the most under-used asset you own.

If you want the leadership framing of why this problem is worth solving, Ajay lays it out in "Why it matters". And if you want the architecture behind that two-minute verdict (the agents, the observable trace, and where a human still signs off), Mark walks through it in "How it works".
