My Agent Network Went Dark for Nine Hours and I Didn't Notice

OAuth tokens expire. Daemons do not care. The gap between having an AI agent and having a reliable AI agent is not about model capability — it is about token refresh logic, error handling, and the hundred small infrastructure decisions nobody writes blog posts about.

At 11pm Tuesday night, something quietly broke. The OAuth token that authenticates my seven-agent network expired, and the system started failing. Every five minutes, the orchestrator tried to run a cycle; every call it made inside that cycle came back 401. It racked up 1,557 failed requests before I woke up and noticed.

Nobody paged me. Nobody sent a Telegram alert. The system had no concept of "I am broken" — it just kept retrying with the confidence of a bot that does not know it is dead.

This is the part about running AI agents that nobody writes blog posts about. The models are the easy part. GPT-5, Claude, Gemini — pick your favorite, they all work. The hard part is everything around the model: auth tokens that expire, daemons that need keychain access they do not have, launchd processes running in security contexts that cannot refresh credentials. The failure mode was not "the AI made a bad decision." It was "the AI never got to make any decision at all because a bearer token expired at midnight."

Here is what I learned:

Your agent is only as reliable as its auth layer

My system runs on Claude Code using my subscription — not per-token API billing. That is a deliberate budget decision. But OAuth tokens are designed for interactive sessions, not 24/7 daemons. The token lasts about 16 hours. When it dies, the daemon needs a human to re-authenticate through a browser. That is a fundamental mismatch between how the auth was designed and how I am using it.
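The mismatch can at least be detected before it bites. A minimal sketch, assuming the ~16-hour lifetime observed above and that the daemon records the timestamp of its last successful authentication (the OAuth flow won't tell it):

```python
import time

TOKEN_LIFETIME_S = 16 * 3600   # ~16h observed lifetime (assumption from this post)
SAFETY_MARGIN_S = 30 * 60      # warn half an hour before it dies

def token_stale(minted_at, now=None):
    """True once the token is inside the safety margin of its expiry.

    `minted_at` is when you last authenticated (epoch seconds) -- you
    have to record that yourself when the browser flow completes.
    """
    now = time.time() if now is None else now
    return now - minted_at > TOKEN_LIFETIME_S - SAFETY_MARGIN_S
```

With a check like this at the top of each cycle, the daemon can page a human before the token dies instead of burning cycles on guaranteed 401s after.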

Failure detection is not optional

I added three things today: exponential backoff so the system stops hammering a dead endpoint, a Telegram alert that fires on the first auth failure so I know immediately, and a long-lived token that bypasses the OAuth expiry entirely. The backoff and alerting took 20 minutes to build. The token setup took 5. These should have been in the system from day one. They were not because the auth had been working fine for four days and I assumed it would keep working.
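The backoff-plus-alert logic fits in a few lines. This is a sketch, not my exact implementation: the alert goes through Telegram's Bot API `sendMessage` endpoint, and the bot token and chat ID are placeholders you'd fill from your own bot config.

```python
import time
import urllib.parse
import urllib.request

BOT_TOKEN = "..."  # placeholder: your Telegram bot token
CHAT_ID = "..."    # placeholder: your chat ID

def send_telegram(text):
    """Fire a message via the Telegram Bot API."""
    data = urllib.parse.urlencode({"chat_id": CHAT_ID, "text": text}).encode()
    urllib.request.urlopen(
        f"https://api.telegram.org/bot{BOT_TOKEN}/sendMessage", data=data, timeout=10
    )

def next_delay(delay, cap=3600):
    """Double the retry delay, capped (60s, 120s, ... 1h)."""
    return min(delay * 2, cap)

def run_with_backoff(cycle, alert=send_telegram, base=60, cap=3600, sleep=time.sleep):
    """Run `cycle` forever; alert once on the first failure, then back off."""
    delay, alerted = base, False
    while True:
        try:
            cycle()
            delay, alerted = base, False   # healthy again: reset both
        except Exception as e:
            if not alerted:                # page a human on the FIRST failure
                alert(f"agent network down: {e!r}")
                alerted = True
            sleep(delay)
            delay = next_delay(delay, cap)
```

The two properties that matter: the alert fires exactly once per outage instead of 1,557 times, and the backoff caps out at an hour so a dead endpoint gets probed, not hammered.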

The approval queue is the real bottleneck

While the system was down, 19 draft replies piled up — responses to Simon Willison, Ethan Mollick, Andrej Karpathy, Dan Shipper. High-value engagement targets, time-sensitive conversations. By the time I got the system back online and reviewed the drafts, most of those conversations had moved on. The engagement window for a reply to a tweet is about 2-4 hours. Nine hours of downtime means nine hours of missed windows.

The gap nobody talks about

I run seven agents: an orchestrator, a publisher, an outbound engager, an inbound engager, an analyst, a creator, and a scout. They coordinate through a shared SQLite database, communicate through Telegram for human approval, and run on a $500/month budget. The whole system lives on a single Mac Mini. It is not sophisticated infrastructure. It is a bash script, a while loop, and a prompt.
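The coordination layer really can be that simple. A sketch of how one agent might atomically claim work from the shared database — the `tasks` table and its columns are hypothetical, and the connection is assumed to be opened in autocommit mode (`isolation_level=None`) so the explicit transaction below works:

```python
import sqlite3

def claim_next(db, agent):
    """Atomically claim the oldest pending task for `agent`.

    BEGIN IMMEDIATE takes the write lock up front, so two agents
    polling the same SQLite file can't claim the same row.
    """
    db.execute("BEGIN IMMEDIATE")
    try:
        row = db.execute(
            "SELECT id FROM tasks WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            db.execute("COMMIT")
            return None
        db.execute(
            "UPDATE tasks SET status = 'claimed', owner = ? WHERE id = ?",
            (agent, row[0]),
        )
        db.execute("COMMIT")
        return row[0]
    except Exception:
        db.execute("ROLLBACK")
        raise
```

No message broker, no queue service: one file on disk, and SQLite's own locking does the coordination.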

And that is exactly the point. The gap between "I have an AI agent" and "I have a reliable AI agent" is not about model capability. It is about token refresh logic, error handling, alerting, and the hundred small infrastructure decisions that determine whether your agent runs for 16 hours or runs forever.

Today my system is more resilient than it was yesterday. Tomorrow something else will break. That is the job.