My AI Agents Shipped Three Product Pages While I Slept
What happens when you stop treating AI as a copilot and start treating it as a workforce that operates while you're offline.
I went to bed on a Tuesday with three open build tasks and no expectation that any of them would be done by morning.
When I woke up, all three had been built, QA'd, and the results were sitting in my inbox. One of them passed 8 out of 8 test criteria. Another passed 10 out of 10. The third had been validated with zero issues on a previous run nobody had checked. Not one line of code was written by a human overnight.
This is the part most people skip when they talk about AI in production. Not "I asked ChatGPT to write a function." Not "my copilot autocompleted a loop." A team of AI agents, operating autonomously, building product pages, running QA against written specs, persisting results to long-term memory, and dispatching follow-up tasks. While I was asleep.
The setup
I run a system called FORGE. It has an engineering lead (Leroy) that receives specs via an agent-to-agent protocol and decomposes them into subtasks. It has a headless PM agent that monitors for task completions and automatically dispatches QA specs when builds finish. And it has a shared memory system (Aianna) where every agent persists what it learned.
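To make the shape of that pipeline concrete, here's a minimal Python sketch of the decompose step: a spec comes in, the engineering lead breaks it into subtasks, and each subtask starts in a pending state. Every name below is illustrative; this is my reconstruction of the pattern, not FORGE's actual interfaces:

```python
from dataclasses import dataclass

@dataclass
class Subtask:
    name: str
    status: str = "pending"  # pending -> working -> done

@dataclass
class Spec:
    title: str
    criteria: list[str]  # written acceptance criteria, one per check

def decompose(spec: Spec) -> list[Subtask]:
    # Illustrative stand-in for the engineering lead:
    # one subtask per spec criterion, all starting as pending.
    return [Subtask(name=c) for c in spec.criteria]

spec = Spec("Assessment page", ["route renders", "data binds", "design system"])
subtasks = decompose(spec)
print([t.name for t in subtasks])
```

The point isn't the data model. It's that the spec is structured enough for an agent to fan it out into checkable units without a human in the loop.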
The product being built is a Decision Support Report for cannabis cultivation operations. Think of it as an assessment tool that takes a grower's facility data, their constraints, their goals, and produces a structured recommendation. The wireframe is a Next.js app with JSON data fixtures for two test accounts.
On March 4, three things ran in sequence without me touching anything:
- Phase 1 QA passed. The scaffold, navigation, and overview page all validated against spec criteria. 190 seconds of work.
- Phase 2A QA ran against the assessment page. Eight test criteria covering route rendering, data binding, design system compliance, constraint card layout, priority badges, and visual hierarchy. 353 seconds. 8/8 pass.
- Phase 2C verification discovered the proposal page was already built by a previous run that had failed. 10/10 criteria already passing. Zero changes needed.
The headless PM picked up each completion, evaluated the results, decided whether to dispatch QA or skip it (Phase 2C was skipped because the build task had already verified everything), and persisted retrospectives to shared memory.
Why this matters more than it looks
The common AI narrative right now is about copilots. Tools that sit next to you while you work. The value is measured in keystrokes saved or suggestions accepted.
That framing misses the bigger shift. The interesting question isn't "how much faster can I work with AI?" It's "what work can happen without me being present at all?"
When I woke up and reviewed the QA results, my job wasn't to build or test. My job was to decide: is this good enough to ship? That is a fundamentally different relationship with the work. I went from producer to reviewer. The agents didn't need my judgment during execution. They needed it after.
What actually breaks
This isn't a victory lap. The same night, a McMahon Sales Ops task ran for seven hours and produced essentially nothing. The stuck detector auto-completed it. The build logs were empty. That was the fourth attempt at getting six revenue charts to render correctly.
AI agent teams are not reliable in the way that CI/CD pipelines are reliable. They fail in strange, non-deterministic ways. A subprocess hangs. A timeout kills an agent that spent too long thinking before acting. An agent completes its subtasks, but the parent task never transitions out of "working" status. I built a stuck detector specifically because this kept happening.
The honest picture is that about 70% of tasks complete as expected. The other 30% require investigation, re-speccing, or manual intervention. But the 70% that work? They happen at 2 AM while I'm unconscious. That's the trade.
The takeaway
If you're evaluating AI strategy for your organization, stop benchmarking copilot productivity. Start asking a different question: what decisions and deliverables can be produced by agents operating independently, then reviewed by humans on their own schedule?
The answer right now isn't "everything." It's "more than you'd expect, if you invest in the orchestration layer." Specs, QA criteria, memory persistence, stuck detection, headless monitoring. None of that is the AI itself. It's the scaffolding that lets AI operate as a workforce instead of a tool.
Three product pages shipped overnight. Not because the AI is magic. Because the system around it is disciplined enough to let it work unsupervised.