Twenty-Seven Zombie Agents
I built a pipeline dashboard for my agent workforce and found 27 tasks stuck in 'working' status. None of them were doing anything. The fix exposed an even harder problem: you need monitoring that understands intent, not just state.
I built a pipeline dashboard on Saturday. The kind of thing you build when you want to see where work sits in the lifecycle: queued, working, completed, failed. Simple status board for an agent workforce.
The first thing it showed me was that 27 tasks were listed as "working."
I checked the logs. None of them were doing anything. One had been "working" for 178 hours. That's over a week of a task sitting in an active status with no agent attached, no process running, no output being produced. Just a row in a database that said "working" and a system that believed it.
How Zombies Happen
The pattern is straightforward. An agent picks up a task, sets the status to "working," and starts building. If the build succeeds, the agent updates the status to "completed." If it fails, "failed." But there's a third path that nobody plans for: the agent process dies mid-task. SSH pipe timeout. Container killed. Builder crashes. Memory limit hit.
When that happens, the status stays "working" forever. There's no heartbeat check. There's no timeout. The task just sits there, technically active, actually dead. A zombie.
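A minimal sketch of that lifecycle makes the gap visible. The table, columns, and `build` step here are hypothetical stand-ins, not the actual system's schema:

```python
import sqlite3
import time

def run_task(db: sqlite3.Connection, task_id: int, build) -> None:
    """Happy-path lifecycle: queued -> working -> completed/failed."""
    db.execute("UPDATE tasks SET status = 'working', started_at = ? WHERE id = ?",
               (time.time(), task_id))
    db.commit()
    try:
        build(task_id)  # the actual work: may take minutes or hours
        db.execute("UPDATE tasks SET status = 'completed' WHERE id = ?", (task_id,))
    except Exception:
        db.execute("UPDATE tasks SET status = 'failed' WHERE id = ?", (task_id,))
    db.commit()
    # The third path: if the process dies inside build() -- OOM kill,
    # dropped SSH pipe, container eviction -- neither UPDATE runs and
    # the row says 'working' forever. Nothing in this code can observe that.
```

The `except` clause only catches failures inside a live process. A dead process catches nothing, which is the whole problem.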
Twenty-seven of them. Some had completed successfully: the agent crashed after finishing the work but before writing the status. Some were genuine failures that never got recorded. Some had been superseded by newer specs but never cleaned up.
The dashboard didn't create this problem. It revealed it. Without the dashboard, I had no way to see these zombies. The task list showed them as active work, which made it look like the system was busier than it was and made capacity planning impossible.
The Stuck Detector Wasn't Enough
I already had a stuck detector. It ran on a schedule, checked for tasks that had been "working" beyond a threshold, and flagged them. But it had a blind spot: it only flagged tasks that had an active process ID or active subtasks. If a task was "working" with no PID and no subtasks, it fell through.
That's exactly the zombie pattern. The process is gone. The subtasks are gone. But the status says "working."
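Reconstructed from the description above (the keys and threshold are hypothetical, not the real schema), the old detector's filter looked something like this:

```python
import time

def flag_stuck(tasks, threshold_s=3600):
    """Stuck detector with the blind spot described above.

    `tasks` is a list of dicts with illustrative keys.
    """
    now = time.time()
    flagged = []
    for t in tasks:
        overdue = t["status"] == "working" and now - t["started_at"] > threshold_s
        # Blind spot: the check requires a live PID or active subtasks,
        # so a fully orphaned task -- the zombie case -- falls through.
        if overdue and (t.get("pid") is not None or t.get("subtasks")):
            flagged.append(t["id"])
    return flagged
```

A task that is overdue but has neither a PID nor subtasks never enters the flagged list, and so never surfaces anywhere.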
So ops deployed an orphan detector. A check that specifically looks for the no-PID, no-subtask case and marks those tasks as failed. Simple. Obvious fix.
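A sketch of that orphan check, assuming a simple list-of-dicts task store with hypothetical keys:

```python
import time

def find_orphans(tasks, grace_s=600):
    """Flag 'working' tasks with no live process and no active subtasks.

    The grace period avoids racing a task that set its status moments
    ago but hasn't spawned a process yet.
    """
    now = time.time()
    return [
        t["id"] for t in tasks
        if t["status"] == "working"
        and t.get("pid") is None
        and not t.get("subtasks")
        and now - t["started_at"] > grace_s
    ]
```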
The Paradox
The orphan detector's first act was to mark a successfully completed task as failed.
Here's what happened: a builder had finished its work, produced all the expected outputs, and the auto-validator had confirmed the criteria passed. But the builder crashed before updating the task status from "working" to "completed." So when the orphan detector ran, it saw a task with no PID, no active subtasks, status "working," and did exactly what it was designed to do. It marked it failed.
The task wasn't stuck. It was done. But the detector couldn't tell the difference between "stuck because crashed before completing" and "stuck because completed but crashed before updating status." Both look identical from the outside: a task in "working" status with no active process.
This is the fundamental problem with state-based monitoring in agent systems. State tells you where something is. It doesn't tell you where it was going. A task that's "working" with no process could be a zombie that crashed on line 3 or a success that crashed on the last line. The state is the same. The intent is completely different.
What This Actually Requires
You need monitoring that understands intent, not just state. That means:
Output-aware detection. Before marking a task as failed, check whether it produced the expected outputs. If the files exist, the tests pass, and the acceptance criteria are met, the task succeeded regardless of its status field.
Heartbeat with context. A heartbeat that just says "I'm alive" isn't enough. The heartbeat should carry what the agent is currently doing: which file it's writing, which test it's running, what step it's on. When the heartbeat stops, you know exactly where it stopped.
Status reconciliation on recovery. When a zombie is detected, don't just mark it failed. Run the validator against whatever output exists. If the work is done, mark it completed. If the work is partial, mark it failed with a note about where it stopped.
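The three pieces fit together in a few lines. This is a sketch under stated assumptions: `validate` stands in for the existing acceptance-criteria check, and the heartbeat field names are illustrative, not a real schema:

```python
import json
import time

def heartbeat(task_id, step, detail):
    """Contextual heartbeat: carries what the agent is doing,
    not just that it is alive."""
    return json.dumps({
        "task_id": task_id,
        "ts": time.time(),
        "step": step,      # e.g. "writing_file", "running_tests"
        "detail": detail,  # e.g. the file path or test name
    })

def reconcile_zombie(task, validate):
    """Decide a zombie's real status from its outputs, not its status field.

    `validate` returns True if the expected outputs exist and pass
    the acceptance criteria.
    """
    if validate(task):
        # Crashed after finishing: the work is done, only the final
        # status write was lost. Record the success.
        return ("completed", "outputs valid; status write lost in crash")
    # Genuinely incomplete: fail it, and note where the last heartbeat
    # placed the agent so recovery can resume instead of guessing.
    last = task.get("last_heartbeat") or {}
    return ("failed", f"incomplete; last seen at {last.get('detail', 'unknown step')}")
```

Run against the paradox case above, this marks the finished-but-crashed task completed instead of failed, because it asks what the task produced rather than what its row says.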
None of this is exotic. It's the same pattern you'd build for any distributed workforce where workers can die mid-task. The difference is that most agent frameworks skip it because the demo works fine with three tasks and a single process.
The Takeaway
Agent orchestration has the same visibility problem as any operations team. You need a dashboard. You need monitoring. And your monitoring needs to be smarter than "is the process alive?"
Twenty-seven zombies in a system I built and operate daily. Not because the system was broken, but because the observability layer wasn't looking for the right thing. The work was fine. The window into the work was lying.
Build the dashboard first. You'll be surprised what it shows you.