My AI Team Ran a 12-Hour Shift Without Me
While I slept, four AI agents coordinated to resolve seven production issues, reset a circuit breaker, find a root cause, and deploy fixes. No human in the loop.
I went to bed on a Sunday night with seven open issues in my agent message bus. Production deployment needed syncing. A circuit breaker had tripped after 24 consecutive failures. An SSH pipe was orphaning processes. A broadcast bug was causing duplicate status updates. And the Salesforce extract had timed out.
When I checked in the next morning, all seven were resolved. Four agents had coordinated the work: an ops agent, an engineering agent, a monitoring agent, and a headless PM that the monitor spawned when tasks completed overnight.
Nobody woke me up. Nobody needed approval. The work just got done.
What Actually Happened
The ops agent came online and assessed the inbox. Seven items. Instead of working them sequentially, it parallelized. Circuit breaker reset, Kush production checks, rsync deployment, launchd persistence, SSE fix, and SSH pipe investigation all ran simultaneously.
The circuit breaker was the simplest: 24 failures had tripped it open. The forge-brain service was healthy again, so the agent reset it to closed and the persist queue started draining.
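The mechanics here are standard circuit-breaker behavior. A minimal sketch, with hypothetical names and thresholds (the 24-failure trip point comes from the incident; the cooldown value is an assumption):

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: trips open after a failure threshold,
    allows a probe after a cooldown, and can be reset manually."""

    def __init__(self, failure_threshold=24, cooldown_s=300):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.state = "closed"
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = time.monotonic()

    def record_success(self):
        self.failures = 0
        self.state = "closed"

    def allow_request(self):
        if self.state == "closed":
            return True
        # After the cooldown, let one probe through (half-open behavior)
        return time.monotonic() - self.opened_at >= self.cooldown_s

    def reset(self):
        """What the ops agent did once the service was healthy again."""
        self.failures = 0
        self.state = "closed"
        self.opened_at = None
```

Resetting to closed is what lets the persist queue drain: requests stop being short-circuited and flow to the now-healthy service again.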
The SSH pipe orphan was the most interesting. Tasks had been stalling for 60 to 120 seconds after completion. Root cause: SSH keeps the pipe open after the remote process finishes. The main process blocks on stdout.read(), and the orphan killer never fires because the SSH process is technically still alive. The fix was an inactivity timeout in the selector loop: if no output arrives for 600 seconds (configurable via environment variable), the process group gets killed. As a backstop, the stuck detector watches for last-activity staleness.
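The shape of that fix looks roughly like this. A minimal sketch, assuming a POSIX platform and a hypothetical environment variable name; the real loop surely does more:

```python
import os
import selectors
import signal
import subprocess
import time

# Hypothetical env var; the article only says the timeout is configurable
INACTIVITY_TIMEOUT_S = float(os.environ.get("SSH_INACTIVITY_TIMEOUT", "600"))

def run_with_inactivity_timeout(cmd, timeout_s=INACTIVITY_TIMEOUT_S):
    """Read stdout via a selector instead of a blocking read(); if no
    output arrives for timeout_s, kill the whole process group rather
    than waiting on a pipe SSH is holding open."""
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        start_new_session=True,  # own process group, so orphans die with it
    )
    sel = selectors.DefaultSelector()
    sel.register(proc.stdout, selectors.EVENT_READ)
    output = []
    last_activity = time.monotonic()
    while True:
        events = sel.select(timeout=1.0)
        if events:
            chunk = proc.stdout.read1(4096)
            if not chunk:  # EOF: the pipe actually closed
                break
            output.append(chunk)
            last_activity = time.monotonic()
        elif time.monotonic() - last_activity > timeout_s:
            # Pipe held open with no output: kill the process group
            os.killpg(proc.pid, signal.SIGKILL)
            break
    sel.close()
    proc.wait()
    return b"".join(output)
```

The key detail is `start_new_session=True`: it puts the child in its own process group, so `os.killpg` takes out the SSH process and anything it left behind, not just the immediate child.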
Meanwhile, the monitoring agent noticed task completions arriving overnight. It spawned headless PM instances to handle the QA review flow. Those instances checked results against spec criteria, wrote retrospectives, and filed everything without human intervention.
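That routing decision, spawn a reviewer for QA completions and nothing else, is the kind of rule a monitor can encode in a few lines. A hypothetical sketch; the event fields, the `pm-agent` command, and its flags are all invented for illustration:

```python
import subprocess

def on_task_completed(event, spawn=subprocess.Popen):
    """React to a task-completion event from the message bus.
    Only QA completions get a headless PM; build completions don't."""
    if event.get("kind") != "qa":
        return None
    # Launch a one-shot headless PM to review results against spec criteria
    return spawn(["pm-agent", "--headless", "--task", event["task_id"]])
```

Injecting the `spawn` callable keeps the routing rule testable without actually forking processes, which matters when the rule itself is what you need to get right.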
The Pattern That Matters
This isn't a demo. This isn't a hackathon project. This is a production system where multiple AI agents share a message bus, read each other's status updates, and make operational decisions.
The interesting part isn't that they ran. It's that they coordinated. The ops agent knew not to restart the A2A server until the SSH fix was deployed. The monitor agent knew to spawn a PM when a QA task completed, not when a build task completed. The headless PM knew to write the retrospective before reporting results.
Each agent has a narrow scope. None of them are "general intelligence." They're specialists with clear boundaries. But the message bus adds a coordination layer, and the emergent behavior looks a lot like a team.
What I Learned
The hardest part of multi-agent systems isn't building the agents. It's building the coordination primitives. Message delivery, task ownership, escalation paths, stuck detection, circuit breakers. The unsexy infrastructure that makes autonomous operation possible.
I spent weeks building those primitives. Stuck detectors that auto-complete orphaned tasks. Circuit breakers that back off when a service is down. Message routing that ensures the right agent sees the right event. None of it is impressive in isolation. All of it is essential in aggregate.
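A stuck detector is a good example of how unglamorous these primitives are. A minimal sketch, assuming tasks track a last-activity timestamp (field names are hypothetical):

```python
import time
from dataclasses import dataclass, field

STUCK_AFTER_S = 600  # matches the 10-minute inactivity backstop above

@dataclass
class Task:
    task_id: str
    status: str = "running"
    last_activity: float = field(default_factory=time.monotonic)

def sweep_stuck_tasks(tasks, now=None, stuck_after_s=STUCK_AFTER_S):
    """Auto-complete tasks whose last activity is older than the threshold,
    so an orphaned task can't hold its owner's queue forever."""
    now = time.monotonic() if now is None else now
    reaped = []
    for task in tasks:
        if task.status == "running" and now - task.last_activity > stuck_after_s:
            task.status = "auto_completed"
            reaped.append(task.task_id)
    return reaped
```

Run on a timer, a sweep like this is boring. Run on a timer for weeks, it's the difference between agents that recover overnight and agents that sit blocked until a human notices.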
The result is a system that ran a 12-hour shift, resolved production issues, deployed code, found and fixed a root cause, and left me a clean status report. Not because the AI is brilliant. Because the plumbing is solid.
That's the lesson. The agents are the easy part. The orchestration is the work.