In Practice: Catching an ISR Overrun Before It Ships

I run security and deployment, which means I spend most of my time on the boundary between "this works in theory" and "this survived contact with a real team's Tuesday." So let me tell you about a real Tuesday — composited from a few deployments, names filed off, but the shape is exactly what happens.

Before: the PR that looked fine

A firmware team on a battery-powered sensor module. The product has a hard rule: the radio-scheduling ISR must complete inside 250 microseconds, and the current worst-case margin is thin — around 60 microseconds of slack on a bad day. That number lives in a design doc. Everyone "knows" it. Nobody checks it per-PR, because how would you?

A developer — call her Priya — opens a pull request. It's a good change: she's hardening a packet parser with a bounds check and a small retry on a checksum mismatch. Twelve lines. Her reviewer, a strong engineer, reads the diff, agrees it's correct, and approves it. Correct code, clean review, merged before lunch.

Here's what neither of them could see. The retry sits inside a path the ISR can reach under a specific error condition, and the bounds check added a few dozen cycles to the hot loop. Individually, nothing. Together, on a noisy channel, the worst-case ISR path now runs about 30 microseconds longer — eating half the remaining margin. No test catches it, because the bench channel is clean. It merges. It ships in the next build. Six weeks later, field units in an RF-noisy warehouse start dropping packets and occasionally resetting. The bug report says "intermittent connectivity." It takes two senior engineers nine days, a logic analyzer, and a reproduced warehouse to trace it back to Priya's twelve lines — which by then are buried under forty other commits.

That's the before. Nobody was careless. The cost was invisible at the only moment it was cheap to fix.

After: the same PR, with the mission watching

Now the same team, same PR, with the Power/Timing Impact Estimator mission wired to their GitLab. Priya opens the PR. Within about ninety seconds, before her reviewer has even opened it, a comment appears:

Resource forecast: +0.4% CPU · +2KB heap · ISR worst-case latency +31µs — consumes 31 of 60µs remaining margin on the radio-scheduling ISR. New worst-case: 221µs against a 250µs deadline. Trend: this subsystem has spent 22µs of margin across the last 3 PRs.

That last line is the one that changes behavior. It's not just "you're within budget." It's "you're within budget and the budget is disappearing." Priya reads it, moves the retry out of the interrupt-reachable path into a deferred handler, and pushes again. New forecast: +0.4% CPU, +2KB heap, +3µs ISR latency. She merges. The nine-day warehouse expedition never happens because the defect never ships.
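The arithmetic behind that comment is simple enough to sketch. What follows is a minimal illustration, not the estimator's actual implementation: the names `IsrForecast` and `forecast_isr_margin` are hypothetical, and the only figures used are the ones in this example (a 250µs deadline, 60µs of slack, and the +31µs and +3µs deltas).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IsrForecast:
    new_worst_case_us: int       # projected worst-case ISR time if the PR merges
    margin_before_us: int        # slack against the deadline before the PR
    margin_after_us: int         # slack left if the PR merges
    margin_consumed_pct: float   # share of the previous slack this PR spends


def forecast_isr_margin(deadline_us: int, current_worst_case_us: int, delta_us: int) -> IsrForecast:
    """Hypothetical helper: turn a latency delta into a margin forecast."""
    margin_before = deadline_us - current_worst_case_us
    new_worst_case = current_worst_case_us + delta_us
    margin_after = deadline_us - new_worst_case
    # Guard against a subsystem that is already at or past its deadline.
    consumed_pct = 100.0 * delta_us / margin_before if margin_before > 0 else float("inf")
    return IsrForecast(new_worst_case, margin_before, margin_after, consumed_pct)


# 250 µs deadline with 60 µs of slack implies a 190 µs current worst case.
first_push = forecast_isr_margin(250, 190, 31)   # retry inside the ISR-reachable path
second_push = forecast_isr_margin(250, 190, 3)   # retry moved to a deferred handler
```

Under these assumptions the first push forecasts a 221µs worst case with 29µs of slack left, and the second forecasts 193µs with 57µs left.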

What the numbers actually were

I'm allergic to ROI slides with invented figures, so here's how I frame it with real deployments, in units you can audit.

The direct save on this one incident is the nine engineer-days of field debugging that never happened, plus the field escape itself — the OTA, the support tickets, the customer call. But the single incident isn't the real return. Over the first quarter on that team, the estimator ran on every PR — a few hundred of them. The vast majority came back green in about a minute and a half and nobody thought about it again; that's the point, it's quiet. A handful — I think it was nine that quarter — came back flagging real margin erosion the author then fixed before merge. Nine potential escapes, caught at the 1x column of Ajay's cost chart instead of the 100x one.

The cost of running it is a mission that sits next to their repo and burns a couple of minutes of analysis per PR. The team spent zero time maintaining a resource-budget spreadsheet, because the budget now lives in the Budget Agent's knowledge base and gets consulted automatically. That reclaimed the informal "resource cop" duty that used to fall on their most senior engineer — call it a few hours a week of his attention handed back.

Why the practitioners actually trust it

Two things earned trust on the ground, and I watched both happen.

First, the observations. When the estimator posts a number an engineer doubts, they don't argue with a black box — they open the Explain rail and watch the mission's reasoning in order: which budget it pulled, what the diff analysis found, how the estimate was composed. Twice in that first quarter the number was wrong — and both times the trace showed the miss was a stale figure in the Budget Agent's knowledge base, not a modeling error. They fixed the knowledge base in ten minutes and the mission got smarter with no code change. A tool you can correct by editing knowledge, and see yourself correcting, is a tool people stop fighting.

Second, the honest gate. We deliberately shipped the estimator as advice first: a comment, read-only, no teeth. It never blocked a merge it shouldn't have, because it couldn't block anything at all. Once the team had a quarter of accurate forecasts behind them, they turned on the decision-queue gate for the one hard rule: a PR that would push the ISR past its deadline now emits an approval request that a lead has to sign off on before the merge check can go green. Autonomous where it's just informing, human-in-the-loop where it's enforcing. That sequencing (earn trust as advice, then add teeth) is the deployment pattern I recommend to every team, and it's the same posture we take across Enterprise Deployment.
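The split between informing and enforcing can be sketched as a single decision function. This is an illustration under stated assumptions, not the product's gate logic: `GateMode`, `Outcome`, and `evaluate_pr` are hypothetical names, and the only rule modeled is the one described above, a forecast that pushes the ISR past its deadline.

```python
from enum import Enum


class GateMode(Enum):
    ADVISORY = "advisory"     # comment only; can never block a merge
    ENFORCING = "enforcing"   # a deadline violation needs a lead's sign-off


class Outcome(Enum):
    COMMENT_ONLY = "comment_only"
    NEEDS_LEAD_APPROVAL = "needs_lead_approval"


def evaluate_pr(mode: GateMode, forecast_worst_case_us: int, deadline_us: int) -> Outcome:
    """Hypothetical gate: only a forecast past the deadline can require approval."""
    if mode is GateMode.ENFORCING and forecast_worst_case_us > deadline_us:
        return Outcome.NEEDS_LEAD_APPROVAL
    # Everything else, including margin erosion that stays inside the deadline,
    # remains advice in the form of a comment.
    return Outcome.COMMENT_ONLY
```

In this sketch a 221µs forecast against a 250µs deadline stays a comment even in enforcing mode, because it erodes margin without breaking the rule. Only a forecast above 250µs would raise the approval request.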

The whole thing runs inside their perimeter, next to their source, which for firmware IP is non-negotiable. If you want the leadership case for why this matters, Ajay lays it out in Why It Matters; for the architecture underneath, Mark walks through it in How It Works; and for the pattern in general, see AI Missions.

Priya's twelve lines shipped. They just shipped fixed. That's the entire product.
