The Architecture Review That Shows Up to Every PR

Patrick Gilberg · Head of Security & Deployment
March 14, 2026

I run security and deployment at StudioX, which means I spend most of my time thinking about the gap between what a team intends to ship and what actually lands in production. So when our own platform group put the Architectural Intent Verifier in front of their merge queue, I asked to shadow it for a couple of sprints — not as the person who sells it, but as the person who has to trust it. What follows is what I saw, with the numbers I could actually count.

Tuesday, 10:14 a.m.

A backend engineer named Dana opens PR #4127. It's a good change: it makes the checkout page faster by fetching a customer's loyalty balance up front instead of on a second round trip. The build is green in four minutes. The two assigned reviewers are both heads-down on a release. Six months ago, this was exactly the kind of PR that merged on a Friday and, five months later, became an excavation, which is the story my colleague Mark tells in "why this matters".

This time, the moment the PR opens, the Verifier mission wakes up. On the Explain rail I watch its observations stream in real time: routing to Diff Agent… reading PR #4127 via GitHub… new call edge detected: checkout/balance.ts → loyalty.db… routing to Decision Registry… matched ADR-014: "checkout must not read loyalty tables directly; go through the ledger service"… routing to Drift Report Agent. Eleven seconds after Dana pushed, the verdict posts on the PR: one drift finding, ADR-014, line 88, with the reason in plain language and a link to the decision itself.
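To make that flow concrete, here is a minimal sketch of the check at its core: a new call edge from the diff is compared against a registered decision, and a match becomes a drift finding. This is an illustration under stated assumptions, not the Verifier's implementation. The real mission routes between agents that reason over a knowledge base, while this stand-in uses a simple prefix match. The class names, field names, and the `ledger-service` target are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CallEdge:
    """A new dependency introduced by a diff (hypothetical shape)."""
    source: str  # file making the call
    target: str  # resource being called
    line: int    # line in the diff where the call appears


@dataclass(frozen=True)
class Decision:
    """One registry entry, reduced to a single forbidden-edge rule."""
    adr_id: str
    summary: str
    source_prefix: str     # callers the rule applies to
    forbidden_target: str  # resource those callers must not touch


@dataclass(frozen=True)
class DriftFinding:
    adr_id: str
    line: int
    reason: str


def find_drift(edges: List[CallEdge], registry: List[Decision]) -> List[DriftFinding]:
    """Return one finding per new edge that violates a registered decision."""
    findings = []
    for edge in edges:
        for decision in registry:
            if (edge.source.startswith(decision.source_prefix)
                    and edge.target == decision.forbidden_target):
                findings.append(DriftFinding(decision.adr_id, edge.line, decision.summary))
    return findings


REGISTRY = [
    Decision(
        adr_id="ADR-014",
        summary="checkout must not read loyalty tables directly; go through the ledger service",
        source_prefix="checkout/",
        forbidden_target="loyalty.db",
    )
]

# Dana's first push: checkout reads the loyalty database directly.
first_push = find_drift([CallEdge("checkout/balance.ts", "loyalty.db", 88)], REGISTRY)

# Her second push: the same call routed through the ledger service.
# "ledger-service" is a placeholder name for that target.
second_push = find_drift([CallEdge("checkout/balance.ts", "ledger-service", 88)], REGISTRY)

print(first_push)   # one finding: ADR-014 at line 88
print(second_push)  # empty list: clean verdict
```

The finding carries the ADR identifier, the line, and the decision's own wording, which is what lets the comment on the PR explain itself rather than just fail a check.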

Dana didn't know ADR-014 existed. It was ratified before she joined. She reads it, sees the point — the ledger service exists precisely so loyalty can shard its database without a dozen hidden readers breaking — and changes one call. New push, clean verdict, merged by 10:40. Total cost of the catch: one comment and twenty-six minutes. The alternative cost, which I've now watched play out enough times to quote from memory, was a quarter.

[Figure: two timelines compared.]
Without the Verifier: PR merges Fri 4 pm; built upon, forgotten; shard blocked at +5 months; 3-week excavation.
With the Verifier: PR opens 10:14; drift flagged at +11 sec; one call fixed 10:31; merged clean 10:40 (26 min, 1 comment).

Not every finding is a violation

The failure mode I was most worried about, wearing my security hat, was the boring one: a gate that cries wolf gets disabled. So the case I paid closest attention to came on Thursday, when the Verifier flagged PR #4155 for reaching across a service boundary — and the team argued it was justified. It was a temporary read during a migration, explicitly time-boxed, and blocking it would have stalled a piece of work the architecture group actually wanted.

This is where the design earns its keep. The Verifier doesn't get to wave the change through on its own, and it doesn't get to block it either. It emitted a [REQUEST_APPROVAL] block, which landed as a row in the Decision Queue. The architect on rotation got a magic-link approve/reject with the full context attached: the finding, the rule it touched, and the team's stated rationale. She approved it as a documented, time-boxed exception, which took thirty seconds of her time, and the exception is now in the audit trail with her name on it. Nobody silently overrode architecture in a PR comment thread that would be lost by next week. That's the human-in-the-loop boundary I need to see before I sign off on a tool going near a merge queue, and here it's the default. The mechanics behind that gate are laid out in "how it works".
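The shape of that gate can be sketched in a few lines. What follows is a hypothetical model, not the product's API: the `DecisionQueue` and `ApprovalRequest` classes, their fields, and the expiry date are all assumptions made for illustration. The property it tries to capture is the one described above: a request stays pending until a named person decides it, and the decision is recorded with that name.

```python
from dataclasses import dataclass, field
from datetime import date
from enum import Enum
from typing import List, Optional


class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class ApprovalRequest:
    pr: int
    finding: str
    rule: str
    rationale: str
    status: Status = Status.PENDING
    decided_by: Optional[str] = None
    expires_on: Optional[date] = None  # set when the exception is time-boxed


@dataclass
class DecisionQueue:
    pending: List[ApprovalRequest] = field(default_factory=list)
    audit_trail: List[ApprovalRequest] = field(default_factory=list)

    def submit(self, request: ApprovalRequest) -> ApprovalRequest:
        """Called by the checker: it can ask, but it cannot decide."""
        self.pending.append(request)
        return request

    def decide(self, request: ApprovalRequest, reviewer: str, approve: bool,
               expires_on: Optional[date] = None) -> None:
        """Called on behalf of a human reviewer; the name is mandatory."""
        if not reviewer.strip():
            raise ValueError("a named human reviewer is required")
        if request.status is not Status.PENDING:
            raise ValueError("request has already been decided")
        request.status = Status.APPROVED if approve else Status.REJECTED
        request.decided_by = reviewer
        request.expires_on = expires_on if approve else None
        self.pending.remove(request)
        self.audit_trail.append(request)


queue = DecisionQueue()
request = queue.submit(ApprovalRequest(
    pr=4155,
    finding="read across a service boundary",
    rule="(rule ID not given in the article)",
    rationale="temporary, time-boxed read during a migration",
))

# The expiry date below is a placeholder; the article gives no actual date.
queue.decide(request, reviewer="architect on rotation", approve=True,
             expires_on=date(2026, 6, 30))

print(request.status, request.decided_by, len(queue.audit_trail))
```

Keeping `submit` and `decide` as separate entry points is the design choice that matters here: the automated side only ever gets the first one.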

What the two sprints actually bought

Over two sprints on one platform team of about twenty engineers, here's what I could count. The Verifier ran on 213 pull requests. It flagged 19 for review. Fourteen were genuine drift that the author corrected before merge — the Dana case, repeated. Four went through the Decision Queue as approved, documented exceptions. One was a false positive traced to a stale decision in the registry; the team retired the outdated ADR, and because behavior lives in the knowledge base and not in code, the fix was a document edit, not a deploy.
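For anyone who wants those counts as rates, the arithmetic is short. The sketch below uses only the figures quoted above; the variable names are mine, and the percentages are derived values rather than numbers reported by the tool.

```python
total_prs = 213
flagged = 19
corrected = 14        # genuine drift, fixed by the author before merge
exceptions = 4        # approved through the Decision Queue
false_positives = 1   # traced to a stale ADR

# The three outcomes should account for every flagged PR.
assert corrected + exceptions + false_positives == flagged

flag_rate = flagged / total_prs
corrected_share = corrected / flagged
exception_share = exceptions / flagged
false_positive_share = false_positives / flagged

print(f"flagged: {flag_rate:.1%} of PRs")                    # flagged: 8.9% of PRs
print(f"corrected: {corrected_share:.1%} of flags")          # corrected: 73.7% of flags
print(f"exceptions: {exception_share:.1%} of flags")         # exceptions: 21.1% of flags
print(f"false positives: {false_positive_share:.1%} of flags")  # false positives: 5.3% of flags
```

Roughly one PR in eleven drew a flag, and about one flag in nineteen was noise, which is the ratio that decides whether a gate stays switched on.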

The number I keep coming back to isn't the count of catches; it's when they happened. All fourteen genuine corrections landed at the pull request, at a cost of a comment and a few minutes each. Not one of them became a downstream refactor, a migration, or an incident. In the two prior quarters, this same team had traced three separate production issues back to exactly this class of undetected boundary erosion. Over these two sprints: zero escapes to production. The senior architect who used to spend a chunk of every week reading diffs for intent violations spent that time on the design work only she can do, because the intent now defends itself on every change.

Why I trust it in the perimeter

The last thing I'll say is the thing my role cares about most. This entire mission runs inside the company's own boundary. The Diff Agent reads the repository over a registered MCP server that never leaves the perimeter; the Decision Registry is your own knowledge base; the reasoning happens on infrastructure you control. Every judgment it makes is on the Explain rail, so an audit is a matter of reading the trace, not reconstructing intent after the fact. And the list of consequential things it does without a human is empty: the checking is read-only, and the overrides go through a person.

That combination — early catches, transparent reasoning, a real human gate, all in your own perimeter — is what makes this deployable in a security-conscious org rather than just demoable. It's one workflow among many built on the same pattern: AI Missions that reason, act where you permit, and explain themselves, running on an enterprise AI platform you actually own. After two sprints of shadowing it, I stopped thinking of it as a checker and started thinking of it as the architecture review that finally shows up to every pull request.
