245 Specs Told Me Everything Wrong With My Agent System
After running 245 specs through my AI engineering lead, the failure data painted a clear picture: 36% timeout rate, 43% missing retrospectives, and an orchestration system that needed to be rebuilt from scratch.
I ran 245 engineering specs through Leroy, my AI engineering lead, over the course of building FORGE. Yesterday I sat down and read the data on what actually happened.
36% timeout rate. 43% of tasks never got a retrospective. 30-40% respec rate, meaning roughly a third of specs, and sometimes more, had to be rewritten and resent. 86% were missing target machine information.
Those aren't statistics from a broken system. That's a system that shipped real code for months while hiding structural problems behind brute-force retry logic. It worked often enough to seem functional. The data said otherwise.
What the numbers actually mean
A 36% timeout rate means more than a third of tasks hit a wall and stopped. Not failed gracefully. Timed out. The agent was still working, or stuck, or waiting on something, and the system's only response was a clock running out.
The 43% retro gap is worse than it sounds. Retrospectives are how the system learns. When a task completes and nobody records what worked, what failed, and what the spec should have included, that knowledge evaporates. Run 245 specs with a 43% retro gap and you've thrown away nearly half your institutional learning.
The 30-40% respec rate tells me my specs were incomplete, ambiguous, or missing context that the builder needed. Some of that is PM quality. Some of that is the system not surfacing the right questions early enough.
And 86% missing target machine? That's a routing problem. FORGE runs across multiple machines. Kush handles brain infrastructure. Haze handles development. A spec that doesn't say where it should execute is a spec that's going to fail or land on the wrong box.
The rebuild: 13 phases from the failure data
Leroy v2 is a ground-up redesign of the orchestration layer. Not a refactor. A rebuild. Thirteen phases, each one traceable to a specific failure pattern in the v1 data.
The foundation is an 11-state finite state machine with a formal failure taxonomy. Twelve categories of failure, each with defined retry behavior and escalation paths. The v1 system had two states: working and done. Everything else was a timeout.
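A state machine like that can be enforced with a transition table: every move a task makes is checked against the set of legal transitions, so nothing drifts into an undefined state and times out silently. The state names, failure categories, and retry policies below are illustrative assumptions — the post says eleven states and twelve categories but does not list them.

```python
from enum import Enum, auto

# Hypothetical state names -- the actual 11 states are not listed in the post.
class TaskState(Enum):
    QUEUED = auto()
    SPEC_REVIEW = auto()
    DISPATCHED = auto()
    RUNNING = auto()
    BLOCKED = auto()
    RETRYING = auto()
    ESCALATED = auto()
    RESPEC = auto()
    AWAITING_RETRO = auto()
    DONE = auto()
    FAILED = auto()

# Failure taxonomy sketch: each category carries defined retry behavior
# and an escalation path. Categories are examples, not Leroy's twelve.
FAILURE_POLICY = {
    "timeout":         {"max_retries": 2, "escalate_to": "pm"},
    "missing_context": {"max_retries": 0, "escalate_to": "respec"},
    "wrong_machine":   {"max_retries": 1, "escalate_to": "router"},
    "tool_error":      {"max_retries": 3, "escalate_to": "ops"},
}

# Legal transitions only -- anything else is a bug, not a timeout.
ALLOWED = {
    TaskState.QUEUED: {TaskState.SPEC_REVIEW},
    TaskState.SPEC_REVIEW: {TaskState.DISPATCHED, TaskState.RESPEC},
    TaskState.DISPATCHED: {TaskState.RUNNING},
    TaskState.RUNNING: {TaskState.BLOCKED, TaskState.RETRYING,
                        TaskState.ESCALATED, TaskState.AWAITING_RETRO,
                        TaskState.FAILED},
    TaskState.BLOCKED: {TaskState.RUNNING, TaskState.ESCALATED},
    TaskState.RETRYING: {TaskState.RUNNING, TaskState.ESCALATED},
    TaskState.ESCALATED: {TaskState.RESPEC, TaskState.FAILED},
    TaskState.RESPEC: {TaskState.SPEC_REVIEW},
    TaskState.AWAITING_RETRO: {TaskState.DONE},
}

def transition(current: TaskState, target: TaskState) -> TaskState:
    """Reject moves the FSM does not define, instead of letting a clock run out."""
    if target not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition: {current.name} -> {target.name}")
    return target
```

The point of the table is that "working and done" stops being the whole universe: a blocked task, a retrying task, and an escalated task are distinct states with distinct exits, so none of them has to look like a timeout.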
The dispatcher gets complexity-based routing. Simple tasks go to lightweight agents. Complex tasks get the full treatment. The v1 system treated every spec the same regardless of whether it was a config change or a multi-service architecture task.
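One minimal way to do complexity-based routing is a cheap heuristic over the spec before dispatch. The tier names, thresholds, and spec fields here are assumptions for illustration, not Leroy's actual scoring.

```python
def estimate_complexity(spec: dict) -> str:
    """Crude pre-dispatch heuristic: services touched and files changed.
    Thresholds are illustrative assumptions."""
    services = len(spec.get("services", []))
    files = spec.get("estimated_files", 0)
    if services > 1 or files > 10:
        return "complex"      # multi-service architecture work
    if files > 2:
        return "standard"
    return "simple"           # config change, one-liner

# Hypothetical agent tiers: cheap agents for cheap work.
AGENT_TIERS = {
    "simple": "lightweight-agent",
    "standard": "builder-agent",
    "complex": "full-pipeline",
}

def dispatch(spec: dict) -> str:
    """Route the spec to the agent tier matching its complexity."""
    return AGENT_TIERS[estimate_complexity(spec)]
```

Even a heuristic this crude separates the config change from the multi-service task, which is exactly the distinction v1 never made.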
Retrospectives become mandatory and machine-enforced. The system won't close a task without one. No more 43% gaps.
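Machine enforcement can be as simple as a gate in the close path: the task cannot reach its terminal state without a retrospective attached. Field names below are assumptions.

```python
class RetroMissing(Exception):
    """Raised when a task tries to close without a retrospective."""

def close_task(task: dict) -> dict:
    """Refuse to mark a task done unless a retrospective with lessons exists.
    'retrospective' and 'lessons' are hypothetical field names."""
    retro = task.get("retrospective")
    if not retro or not retro.get("lessons"):
        raise RetroMissing(f"task {task.get('id')}: no retrospective, refusing to close")
    task["state"] = "done"
    return task
```

Because the gate sits in the only code path that closes tasks, the retro gap goes to zero by construction rather than by policy.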
Target machine routing becomes a first-class concept. Every spec gets routed to the right infrastructure automatically, with the PM only needing to flag exceptions.
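A sketch of that routing, using the machine roster from the post (Kush for brain infrastructure, Haze for development). The tag-to-machine mapping and the default are assumptions; the one real constraint is that an explicit PM override always wins.

```python
# Roster from the post; the keyword mapping is an illustrative assumption.
MACHINE_FOR = {
    "brain": "kush",
    "infra": "kush",
    "dev": "haze",
    "build": "haze",
}
DEFAULT_MACHINE = "haze"

def route(spec: dict) -> str:
    """Every spec gets a target machine: PM override first, then tag
    matching, then a default -- never 'unspecified'."""
    if "target_machine" in spec:           # PM flagged an exception
        return spec["target_machine"]
    for tag in spec.get("tags", []):
        if tag in MACHINE_FOR:
            return MACHINE_FOR[tag]
    return DEFAULT_MACHINE
```

The key property is totality: `route` returns a machine for every spec, so "86% missing target machine" becomes impossible rather than merely discouraged.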
Why I'm sharing the failure data
There's a version of this story where I talk about Leroy v2 as an upgrade and skip the part where v1 was leaking value for months. That version is useless.
The useful version is this: if you're building AI agent systems, instrument everything from day one. Not because the metrics will be good. Because the metrics will be bad, and the specific ways they're bad will tell you exactly what to fix.
I didn't design Leroy v2 from first principles. I designed it from 245 data points of what went wrong. The failure data was the spec.
The takeaway
Your agent system is generating performance data whether you're collecting it or not. If you're not instrumenting task duration, completion rate, respec rate, and knowledge capture, you're flying blind. And the system will feel like it's working right up until you look at the numbers.
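Instrumenting those four signals takes very little code. A minimal sketch, assuming each task leaves behind a record with duration, completion, respec, and retro fields (field names are mine, not Leroy's):

```python
from dataclasses import dataclass

@dataclass
class TaskRecord:
    """Per-task record; field names are illustrative assumptions."""
    duration_s: float
    completed: bool
    respecced: bool
    has_retro: bool

def summarize(records: list[TaskRecord]) -> dict:
    """Compute the four metrics the post argues you must collect."""
    n = len(records)
    return {
        "completion_rate": sum(r.completed for r in records) / n,
        "respec_rate": sum(r.respecced for r in records) / n,
        "retro_gap": sum(not r.has_retro for r in records) / n,
        "mean_duration_s": sum(r.duration_s for r in records) / n,
    }
```

Run this over a few hundred records and the numbers stop being a feeling and start being a spec.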