Four Papers That Rewrote My Agent Architecture
I read four academic papers on agent design and came away with one insight that changed everything: stop making models bigger, start making them recursive.
The whole industry is racing in one direction: more parameters, longer context windows, higher benchmarks. I spent Saturday reading four academic papers that argue the opposite: the architecture matters more than the model.
I'm building FORGE, an agent orchestration system where a PM agent writes specs, an engineering lead decomposes them, and a workforce of builder agents executes. It works. But it hits walls: tasks stall, builders loop on dead ends, and the system has no mechanism for learning from its own failures mid-flight. So I went looking for answers in the research.
The four papers and what each one does
DisCIPL separates planning from execution. A planner model writes structured programs, not prose instructions, and cheap follower models execute them mechanically. The planner thinks. The followers do. Clean separation.
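The planner/follower split can be sketched in a few lines. Everything here is my own toy rendering of the idea, not DisCIPL's actual program format or interface: the `Step` schema, the ops, and the hard-coded plan are all illustrative assumptions.

```python
from dataclasses import dataclass

# Hypothetical structured program: the planner emits explicit steps,
# not prose. The Step schema and op names are illustrative, not DisCIPL's.
@dataclass
class Step:
    op: str    # "read", "transform", or "write"
    args: dict

def planner(task: str) -> list[Step]:
    # A real planner is a strong model; the plan is hard-coded here
    # to show the shape of its output: a program, not instructions.
    return [
        Step("read", {"path": "input.txt"}),
        Step("transform", {"fn": "upper"}),
        Step("write", {"path": "output.txt"}),
    ]

def follower(step: Step, state: dict) -> dict:
    # Followers execute mechanically: no judgment, just dispatch.
    if step.op == "read":
        state["data"] = "hello"                # stand-in for file I/O
    elif step.op == "transform":
        state["data"] = state["data"].upper()
    elif step.op == "write":
        state["out"] = state["data"]
    return state

state: dict = {}
for step in planner("uppercase a file"):
    state = follower(step, state)
print(state["out"])  # HELLO
```

The point of the structure is that the follower loop never needs to reason; all the thinking is frozen into the program before execution starts.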
RLM treats the codebase like an environment. Instead of loading entire files into the prompt, the agent explores surgically: peek at a function, grep for a pattern, decompose a problem into smaller searches. When complexity spikes, it spawns recursive sub-agents that each handle a piece. The context window stays manageable because the agent earns its context through exploration instead of dumping everything in up front.
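The explore-don't-dump pattern reduces to a recursion over the repo. This is a toy stand-in, not RLM's API: the dict-as-filesystem and the "directory means complex, so spawn a sub-agent" rule are my simplifying assumptions.

```python
# Sketch of exploration over dumping: the agent requests only what it
# needs and recurses when a node is too complex to handle inline.
# The repo model and recursion rule are illustrative assumptions.
TREE = {
    "repo": ["auth.py", "billing/"],
    "billing/": ["invoice.py", "ledger.py"],
}

def explore(node: str, depth: int = 0) -> list[str]:
    findings = []
    for child in TREE.get(node, []):
        if child in TREE:          # complex node: spawn a "sub-agent"
            findings += explore(child, depth + 1)
        else:                      # leaf: a cheap, surgical "peek"
            findings.append(child)
    return findings

print(explore("repo"))  # ['auth.py', 'invoice.py', 'ledger.py']
```

Each recursive call sees only its own slice of the tree, which is the whole trick: context is earned per sub-problem instead of paid for up front.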
RSA runs N builders in parallel, then a separate aggregator combines the best fragments from each attempt. It's not mechanical code splicing. The aggregator sees all previous attempts and re-builds with that knowledge. Run it for T iterations and the population of solutions evolves toward something better than any single builder produced.
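The N-builders-plus-aggregator loop looks roughly like this. It's a numeric toy, not RSA's method: "parts" with quality scores stand in for code fragments, and the perturbation model and part names are invented for illustration.

```python
import random
random.seed(3)

# Toy RSA loop: N builders each attempt every part of a problem; the
# aggregator keeps the best fragment per part, and the merged result
# seeds the next round. Parts and scoring are illustrative assumptions.
PARTS = ["parse", "solve", "format"]

def attempt(seed: dict) -> dict:
    # One builder's attempt: perturb the seed solution per part.
    return {p: max(0.0, min(1.0, seed[p] + random.uniform(-0.2, 0.3)))
            for p in PARTS}

def aggregate(attempts: list) -> dict:
    # Best fragment per part, chosen across all attempts -- the
    # aggregator sees everything, not just the single best attempt.
    return {p: max(a[p] for a in attempts) for p in PARTS}

population = {p: 0.1 for p in PARTS}
for t in range(4):                                        # T iterations
    attempts = [attempt(population) for _ in range(3)]    # N builders
    population = aggregate(attempts)

print(population)
```

Because the aggregator selects per fragment rather than per attempt, the merged solution can beat every individual builder in the round.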
TRM is the foundation layer. It proves that iterative refinement with carried state beats single-pass depth. A small model that runs five times with memory of previous attempts outperforms a large model that runs once. Recursion depth equals reasoning depth.
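The carried-state claim is easy to feel with a numeric analogy. The "model" below is just a Newton step for computing sqrt(2), which is my stand-in, not TRM's network: the point is that the same cheap step, applied five times with memory of the previous estimate, crushes a single application of it.

```python
# One cheap "forward pass": refine the previous estimate of sqrt(2).
# Newton's method here is an analogy for TRM's refinement step.
def refine(x: float, target: float = 2.0) -> float:
    return 0.5 * (x + target / x)

single_pass = refine(1.0)       # run once: 1.5, error ~0.086

state = 1.0
for _ in range(5):              # run five times, carrying state
    state = refine(state)

print(abs(single_pass - 2**0.5))   # ~0.086
print(abs(state - 2**0.5))         # many orders of magnitude smaller
```

Each iteration starts from the last answer rather than from scratch, so depth accumulates across runs instead of having to exist inside a single pass.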
The unified insight
These papers aren't competing with each other. They're layers of the same system:
- Planning layer (DisCIPL): Write structured execution programs, not vague instructions
- Execution layer (RLM): Explore surgically, spawn sub-agents for complexity, never load everything at once
- Quality layer (RSA): Run parallel attempts, combine the best fragments, iterate
- Foundation layer (TRM): Carry state across iterations, let the model refine its own work
The takeaway is blunt: don't make the model bigger. Make it recursive. Give it state. Let it iterate. Score intermediate results. Cull bad paths. Combine good fragments.
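All six verbs in that sentence fit into one loop. Here is the whole recipe in miniature as a beam search over bit strings, which is purely my illustration, not code from any of the four papers: iterate, score intermediates, cull bad paths, keep the good ones as state for the next round.

```python
import random
random.seed(1)

# The recipe in miniature: iterate, score intermediate results, cull
# bad paths, carry the survivors forward. A toy beam search; the
# target string and mutation model are illustrative assumptions.
TARGET = "10110"

def score(s: str) -> int:              # score intermediate results
    return sum(a == b for a, b in zip(s, TARGET))

def expand(s: str) -> str:             # one refinement step
    i = random.randrange(len(s))
    return s[:i] + random.choice("01") + s[i + 1:]

beam = ["00000"] * 4
for _ in range(10):                    # iterate with state
    candidates = beam + [expand(s) for s in beam for _ in range(3)]
    beam = sorted(candidates, key=score, reverse=True)[:4]   # cull

best = max(beam, key=score)
print(best, score(best))
```

Keeping the parents in the candidate pool makes the best score monotone: culling can only discard worse paths, never lose ground.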
What this means for my build
I mapped this directly into a 13-phase improvement plan for Leroy, my engineering lead agent. The changes that matter most:
- A spec analyzer that queries the brain for relevant lessons before any builder touches the work
- A progress heartbeat so builders signal life instead of going silent for ten minutes
- A task event system that auto-persists every significant event to the memory layer
- Parallel builder runs with aggregation for complex tasks
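To make the heartbeat idea concrete, here is a minimal sketch of what a builder liveness monitor could look like. This is not FORGE code; the class name, the `ping`/`stalled` interface, and the timeout are all my assumptions about the shape of the mechanism.

```python
import threading
import time

# Hypothetical heartbeat monitor: builders ping it periodically, and
# anything silent past the timeout gets flagged. Names and the timeout
# value are illustrative assumptions, not FORGE's actual design.
class Heartbeat:
    def __init__(self, timeout: float):
        self.timeout = timeout
        self.last: dict = {}
        self.lock = threading.Lock()

    def ping(self, builder_id: str) -> None:
        with self.lock:
            self.last[builder_id] = time.monotonic()

    def stalled(self) -> list:
        now = time.monotonic()
        with self.lock:
            return [b for b, t in self.last.items()
                    if now - t > self.timeout]

hb = Heartbeat(timeout=0.05)
hb.ping("builder-1")
hb.ping("builder-2")
time.sleep(0.1)
hb.ping("builder-2")            # builder-1 goes silent
print(hb.stalled())  # ['builder-1']
```

The orchestrator polls `stalled()` and can kill, restart, or reassign anything on the list instead of waiting ten minutes to notice a dead builder.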
The thread running through all of it is brain enforcement. Every phase wires into the memory system. The spec analyzer checks for past lessons. The event system persists outcomes. The quality scorer weights brain compliance heavily: did you query memory before building? Did you persist what you learned? If not, your quality score drops.
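A compliance-weighted scorer can be as simple as a weighted checklist. The weights and field names below are invented for illustration, not FORGE's actual scoring rubric; the point is only that brain compliance carries more weight than anything else.

```python
# Hypothetical quality scorer: brain compliance dominates the score.
# Weights and report fields are illustrative, not FORGE's rubric.
WEIGHTS = {
    "tests_pass": 0.30,
    "queried_brain": 0.35,       # did you query memory before building?
    "persisted_lessons": 0.35,   # did you persist what you learned?
}

def quality(report: dict) -> float:
    return sum(w * (1.0 if report.get(k) else 0.0)
               for k, w in WEIGHTS.items())

compliant = {"tests_pass": True, "queried_brain": True,
             "persisted_lessons": True}
skipped_brain = {"tests_pass": True, "queried_brain": False,
                 "persisted_lessons": False}

print(quality(compliant))      # ~1.0
print(quality(skipped_brain))  # 0.3: passing tests isn't enough
```

Under weights like these, a builder that ships working code but skips the memory system scores worse than one that engages the brain at both ends.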
This isn't theoretical. The system already has 14,000+ memory chunks, 42 recorded lessons, and sub-100ms query times. The papers gave me a framework for using that infrastructure at every step of the build cycle instead of just at the beginning and end.
The contrarian bet
The industry is spending billions making models bigger. The research says you can get more out of a small model that iterates with state than out of a large model that makes one blind pass. That's not a minor optimization. That's a fundamentally different approach to building agent systems.
I'm betting on recursion over scale. Not because scale doesn't matter, but because at the execution layer, how you orchestrate matters more than how many parameters you throw at the problem.
The papers are the receipts. The build starts this week.