The Fault-Injection Forge in Practice: A Reconciliation Bug, Caught

Patrick Gilberg · Head of Security & Deployment
February 15, 2026

I run security and deployment at StudioX, which means I spend most of my time on the seam between "it works on my machine" and "it works in front of a customer." That seam is where field escapes live, and I've watched enough of them go by to have strong opinions about where they come from. So when we started running the Synthetic Data & Fault-Injection Forge against our own release pipeline, I paid close attention to the numbers. Here's a real cycle, start to finish, with the outcome it produced.

Before: the change that would have escaped

A payments team was shipping a change to how partial refunds get applied to an invoice — the kind of change that looks small in the diff and touches everything downstream. Their test data was what test data always is: a handful of clean invoices, positive balances, the three currencies in the seed file. Two reviewers signed off. CI was green. On the old process, that change ships, and we find out whether it was safe when a customer does something we didn't picture.

I asked the lead to route it through the Forge before promotion. The intent she typed was one sentence: "Exercise this partial-refund change against every currency-and-balance combination we support, include the malformed cases we've seen in production, and add a fault scenario where the ledger service times out after the refund is written but before the invoice is updated."

That last clause is the one that mattered, and it's the one no one would have hand-built. It describes a partial failure — the exact silent, mid-transaction gap that becomes an eleven-hour reconciliation problem nobody notices until the numbers don't tie out.

The run

The Mission started streaming its observations immediately, and I watched it the way I'd watch a junior engineer: trust, but verify. The Reasoning Core routed first to the Schema Agent, which pulled the current invoice and ledger models (the live schema, not a snapshot). Then it routed to the Escape-History Agent, which read our incident tracker and surfaced three past defects on the refund path, including one negative-balance case from a market we'd opened after the last time anyone updated the fixtures. That's the imagination we don't have at the desk: it remembered what we'd forgotten.

From there the AI workers generated in parallel. The Fixture Agent produced clean records across every currency. The Edge-Case Agent built the boundary set — zero-balance invoices, refunds exceeding the remaining balance, the negative-balance-in-a-new-currency shape the history agent had flagged. The Fault-Scenario Agent constructed the mid-write timeout as a runnable scenario against a scratch environment.
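To make the boundary set concrete, here is a minimal sketch of the kind of currency-and-balance matrix an engineer would otherwise build by hand. It is not the Forge's output or API: the currency codes, amounts, and the `RefundCase` type are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from decimal import Decimal
from itertools import product

# Hypothetical values for illustration; the real seed file and schema differ.
CURRENCIES = ["USD", "EUR", "GBP"]
BALANCES = [Decimal("100.00"), Decimal("0.00"), Decimal("-25.00")]
REFUNDS = [Decimal("10.00"), Decimal("150.00")]


@dataclass(frozen=True)
class RefundCase:
    currency: str
    balance: Decimal  # remaining invoice balance before the refund
    refund: Decimal   # partial refund amount being applied

    @property
    def label(self) -> str:
        """Classify the case by the boundary it exercises."""
        if self.balance < 0:
            return "negative-balance"
        if self.balance == 0:
            return "zero-balance"
        if self.refund > self.balance:
            return "refund-exceeds-balance"
        return "clean"


def build_cases() -> list:
    """Cross every currency with every balance and refund amount."""
    return [RefundCase(c, b, r) for c, b, r in product(CURRENCIES, BALANCES, REFUNDS)]


if __name__ == "__main__":
    for case in build_cases():
        print(case.label, case.currency, case.balance, case.refund)
```

Even this toy version shows why the hand-built set tends to stay small: each new currency or balance shape multiplies the case count, and somebody has to remember to add it.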

Then the Mission stopped. Seeding that fault scenario into our shared staging ledger is a consequential action, and the Forge doesn't take consequential actions on its own. It posted an approval request to the decision queue: here is the corpus, here is the fault I'm about to inject, here is the environment it touches. The lead reviewed it in the portal, saw exactly what would be applied, and approved. Human-in-the-loop, right where it belongs — not slowing down the generation, just gating the one step that reaches a shared system.
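The gate described above can be pictured as a queue that refuses to run a consequential action until a human has approved it. The sketch below is an assumption about the shape of that pattern, not the Forge's implementation: `DecisionQueue`, `ApprovalRequest`, and `run_if_approved` are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class ApprovalRequest:
    corpus: str       # what data would be seeded
    fault: str        # what fault would be injected
    environment: str  # which shared system it touches
    status: Status = Status.PENDING


class DecisionQueue:
    """Holds requests until a human reviews them (hypothetical)."""

    def __init__(self) -> None:
        self._requests: List[ApprovalRequest] = []

    def submit(self, corpus: str, fault: str, environment: str) -> ApprovalRequest:
        request = ApprovalRequest(corpus, fault, environment)
        self._requests.append(request)
        return request

    def pending(self) -> List[ApprovalRequest]:
        return [r for r in self._requests if r.status is Status.PENDING]


def run_if_approved(request: ApprovalRequest, action: Callable[[], str]) -> str:
    """Run the consequential action only after explicit human approval."""
    if request.status is not Status.APPROVED:
        raise PermissionError("consequential action requires human approval")
    return action()
```

In use, generation proceeds without touching the queue; only the step that reaches the shared ledger calls `run_if_approved`, so an unapproved request fails loudly instead of silently proceeding.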

The change failed the timeout scenario. Under the mid-write fault, the refund posted to the ledger but the invoice update rolled back, leaving the two out of sync — precisely the kind of silent divergence that runs for hours in production before anyone sees it. The Coverage Agent's report named it plainly, matched it to the historical pattern, and the team fixed the ordering before the change ever left staging.
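For readers who want to see the failure mode in miniature, here is a hedged sketch of a mid-write timeout and the divergence check that exposes it. It assumes an in-memory fake ledger and a dictionary of invoice balances; `FakeLedger`, `apply_partial_refund`, and `is_in_sync` are hypothetical names, and the real services and the team's actual fix are not shown.

```python
from decimal import Decimal


class LedgerTimeout(Exception):
    """Raised when the fake ledger times out after its write has landed."""


class FakeLedger:
    def __init__(self, fail_after_write: bool = False) -> None:
        self.entries = []
        self.fail_after_write = fail_after_write

    def post_refund(self, invoice_id: str, amount: Decimal) -> None:
        # The refund is written first...
        self.entries.append((invoice_id, amount))
        # ...then the injected fault fires, before the caller hears back.
        if self.fail_after_write:
            raise LedgerTimeout("ledger timed out after refund was written")

    def refunded_total(self, invoice_id: str) -> Decimal:
        return sum((a for i, a in self.entries if i == invoice_id), Decimal("0"))


def apply_partial_refund(invoices: dict, ledger: FakeLedger,
                         invoice_id: str, amount: Decimal) -> None:
    """Mimics the failing behavior: a timeout rolls back only the invoice side."""
    original = invoices[invoice_id]
    try:
        ledger.post_refund(invoice_id, amount)
        invoices[invoice_id] = original - amount
    except LedgerTimeout:
        invoices[invoice_id] = original  # invoice rolled back; ledger entry remains
        raise


def is_in_sync(invoices: dict, ledger: FakeLedger,
               invoice_id: str, opening_balance: Decimal) -> bool:
    """The invoice balance should equal the opening balance minus posted refunds."""
    return invoices[invoice_id] == opening_balance - ledger.refunded_total(invoice_id)
```

Run with `fail_after_write=True`, the refund stays in the ledger while the invoice keeps its original balance, so `is_in_sync` returns `False`. That is the same divergence the scenario surfaced, reduced to a few lines.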

[Figure: before and after. Before (happy-path data only): write change, clean fixtures, CI green, ship, field escape, 11 hours of bad ledger entries. After (Forge run before promotion): one-sentence intent, schema and history, edge-case and fault generation, approve in portal, scenario fails, fixed at desk. Elapsed: about 9 minutes of Mission time and one human approval. Escape avoided; reconciliation defect caught in staging; fixture-building went from a day to a sentence.]

The numbers

I care about ROI that survives contact with a finance team, so let me be concrete about this cycle and what it generalizes to.

The hand-built version of that corpus — clean fixtures across every currency, the boundary cases, and a runnable mid-write fault harness — is realistically a full day of an engineer's time, and that's assuming they think to build the timeout case at all, which they usually don't. The Forge produced it in about nine minutes of Mission time plus one human approval. Call it a day of engineering compressed into ten minutes, but that undersells it, because the day-of-work version never happens on an ordinary change. It only happens on the changes important enough to justify the effort — which are not the changes that escape.

The escape itself is the number that actually matters. A silent reconciliation divergence like the one the fault scenario caught is, from experience, an eleven-hour production incident: bad ledger entries, a remediation project to unwind them, and the trust cost of a customer finding it before you do. We've priced incidents in that class in the low hundreds of thousands once you count remediation and the engineering pulled off roadmap to fix it. Catching it in staging cost nine minutes and one click.

Across the pipeline, the pattern held. Teams that route changes through the Forge before promotion are exercising conditions they simply were not exercising before — not because they got more diligent, but because the cost of building adversarial data dropped from a day to a sentence. When the cost of the right thing collapses, people do the right thing by default. That's the whole game.

What I tell teams adopting it

Two things, both about honesty. First: the Forge doesn't make your code safe, it makes your code checked against the conditions that actually break it — the gap the whole use case is named for. It surfaces the escape; your engineers still fix it. Second: trust the human gate. Generation is safe and runs on its own, but the decision queue exists precisely so that injecting a fault into a shared environment is always a human's explicit call. In a security-and-deployment role, that boundary is the thing that lets me say yes to autonomy without losing sleep.

If you want the leadership case for why this gap costs what it costs, read why the Forge matters. If you want the architecture underneath the run I just described — the agents, the observations, the MCP wiring — read how the Forge works. What I'll leave you with is the moment I keep coming back to: a reconciliation bug that would have run silently for eleven hours in production instead showed up as a failed scenario on a screen, nine minutes after someone typed a single sentence. That's the difference between finding out at the desk and finding out from a customer.
