Ghost Posts and the Watchdog That Cried Wolf

I built a watchdog agent to monitor my AI workforce. Its first act was to catch two real failures. Its second act was to lie about the third. Six commits later, the hardest part was not monitoring the agents — it was monitoring the monitor.

Last Sunday morning I stared at my agent dashboard and realized I had no idea whether my AI workforce was actually doing anything.

Not whether the engine was running. I could see that. The Groundswell run loop was cycling every five minutes, dutifully executing tasks, logging events, moving on. By every internal metric, the system was healthy. Thirty-two scheduled tasks across twenty-four agents. Green across the board.

But I had zero posts on X in three days. A ninety-six-item content backlog sitting untouched. LinkedIn had gone dark. The engine was running hot and producing nothing — like a car on blocks with the gas pedal floored.

So I did what any operator does when they suspect the instruments are lying: I built a new instrument.

The Watchdog

The watchdog agent is 191 lines of Python. Zero Claude cost. It runs every six hours via the same run.sh loop that drives everything else, and it checks six things:

  1. Agent schedules — is any agent more than two hours overdue?
  2. Posting cadence — has each platform received at least one post in the last twenty-four hours?
  3. Error spikes — more than five errors in six hours?
  4. Backlog health — fewer than five ready items means Creator needs to replenish.
  5. Engine process — is run.sh actually alive?
  6. Telegram bot — is the approval pipeline functional?

When it finds problems, it sends a Telegram alert directly to me. No intermediary agents, no approval queues, no routing through the Marketing Manager. A direct line from the monitor to the operator.
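The direct line is one HTTP call to the Telegram Bot API's `sendMessage` method. A minimal sketch — the token and chat ID are placeholders for real credentials:

```python
import json
import urllib.request

TELEGRAM_API = "https://api.telegram.org/bot{token}/sendMessage"

def build_alert(token, chat_id, text):
    """Build the URL and JSON payload for a Bot API sendMessage call."""
    return TELEGRAM_API.format(token=token), {"chat_id": chat_id, "text": text}

def send_telegram_alert(token, chat_id, text):
    """POST the alert straight to the operator; returns the decoded response."""
    url, payload = build_alert(token, chat_id, text)
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)
```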

First run: two real findings. The engine had died at some point and restarted without me noticing. LinkedIn had a posting gap. Both legitimate, both actionable. The watchdog was earning its keep before the commit was ten minutes old.

Then I looked at the alert about X.

Ghost Posts

The X alert said no posts in twenty-four hours. That surprised me because I could see post_sent events in the database. The system believed it had posted. The events were logged. The backlog items were marked as consumed. Everything downstream of the post action thought it had succeeded.

But nothing was on the platform.

This is what I started calling ghost posts. The system logged success because the API call returned a 200. But the post never materialized on the platform. Maybe it was a shadow ban. Maybe the API accepted the request and silently discarded it. Maybe the Playwright fallback navigated to the right page, typed the text, clicked the button, and the button did nothing. Whatever the cause, the system had no way to know the difference between "I posted" and "I think I posted."

The distinction matters. A system that fails loudly is a system you can fix. A system that fails silently is a system that erodes trust in ways you cannot see until the damage is already done.

So I added a seventh check: verify that the last post on each platform actually exists on the platform. Trust but verify.

For X, the first pass was simple — call the API with the post ID and check if it returns data. If the API says the post exists, it exists. If it returns a 404, ghost post.
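In sketch form, using the v2 tweet-lookup endpoint: the useful part is separating the HTTP call from the interpretation, because X can also return a 200 with an `errors` array for a post that is gone. The three-way result is my framing, not the actual watchdog code:

```python
import json
import urllib.error
import urllib.request

def verify_x_post(post_id, bearer_token):
    """Ask the X API whether a post still exists (v2 tweet lookup)."""
    url = f"https://api.twitter.com/2/tweets/{post_id}"
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {bearer_token}"})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return interpret_lookup(resp.status, json.load(resp))
    except urllib.error.HTTPError as err:
        return interpret_lookup(err.code, {})

def interpret_lookup(status, body):
    """Map an API response to exists / ghost / unknown."""
    if status == 200 and "data" in body:
        return "exists"
    if status == 404 or (status == 200 and "errors" in body):
        return "ghost"      # logged as posted, gone from the platform
    return "unknown"        # rate limits, auth failures: don't alert yet
```

The "unknown" bucket matters: a rate-limited lookup is not evidence of a ghost post, and alerting on it would reintroduce the noise problem.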

For LinkedIn and Threads, there was no verify endpoint I could use cheaply. So I reached for Playwright — launch a headless browser, navigate to the post URL, check if the page returns content or a 404.

LinkedIn worked. Threads did not.

Threads Blocks Headless Browsers

Threads detected the headless Chromium instance and refused to serve the page. Which, honestly, fair. Meta has been fighting scrapers and bots for two decades. A headless browser showing up to verify a post URL looks exactly like the thing they are trying to prevent.

The fix was to skip Playwright for Threads entirely and hit the Threads Graph API directly. One HTTP call — pass the post ID and the access token, get back the post metadata or a 404. Cheaper, faster, and it does not trigger bot detection.
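That one call looks roughly like this — the endpoint shape and field names follow Meta's Threads API docs but should be treated as assumptions, and the token is a placeholder:

```python
import json
import urllib.error
import urllib.parse
import urllib.request

def threads_lookup_url(post_id, access_token):
    """Build the Graph API media-lookup URL (endpoint shape assumed)."""
    query = urllib.parse.urlencode(
        {"fields": "id,permalink", "access_token": access_token}
    )
    return f"https://graph.threads.net/v1.0/{post_id}?{query}"

def verify_threads_post(post_id, access_token):
    """One HTTP call: metadata back means it exists, a 404 means ghost."""
    url = threads_lookup_url(post_id, access_token)
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return "exists" if json.load(resp).get("id") == post_id else "ghost"
    except urllib.error.HTTPError as err:
        return "ghost" if err.code == 404 else "unknown"
```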

Three platforms, three different verification strategies. X gets an API call with a Playwright fallback. LinkedIn gets Playwright only. Threads gets a direct Graph API check. The watchdog went from one verification path to three in two commits.
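Those three paths reduce to a small dispatch table. The verifier functions here are stand-ins for the real API and Playwright checks — the point is the ordering and the fallback:

```python
# Stand-in verifiers; each returns "exists", "ghost", or "unknown".
def verify_via_api(post):         # X: API lookup first
    return post.get("api_status", "unknown")

def verify_via_playwright(post):  # LinkedIn: headless browser only
    return post.get("page_status", "unknown")

def verify_via_graph(post):       # Threads: direct Graph API call
    return post.get("graph_status", "unknown")

STRATEGIES = {
    "x":        [verify_via_api, verify_via_playwright],  # API with fallback
    "linkedin": [verify_via_playwright],
    "threads":  [verify_via_graph],
}

def verify_post(platform, post):
    """Try each strategy in order; the first definitive answer wins."""
    for strategy in STRATEGIES[platform]:
        result = strategy(post)
        if result in ("exists", "ghost"):
            return result
    return "unknown"
```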

Then the Watchdog Broke

This is the part nobody warns you about.

Fifteen minutes after deploying the improved watchdog, my phone started buzzing. Telegram alerts, one every five minutes. The same alerts, over and over — engine health, posting gaps, backlog status — arriving faster than I could read them.

The watchdog was supposed to run every six hours. It was running every cycle.

The scheduling logic in run.sh worked like this: query the database for the last watchdog event, calculate how many hours ago it ran, and if the answer is six or more, run it again. Simple. Except the bash date math was broken.

Here is what the broken version did:

LAST_WATCHDOG=$(python3 -c "...")  # Returns timestamp
HOURS_SINCE=$(python3 -c "...")    # Calculates hours since last run
if [ "$HOURS_SINCE" -ge 6 ]; then
    python3 tools/watchdog.py check
fi

The problem was variable interpolation. The bash script was passing a Python datetime string through a shell variable into a second Python invocation. When the timestamp contained characters that bash interpreted differently — or when the first script errored silently and returned nothing — the second script would default to 999 hours. Which is always greater than six. Which means: always run.

The fix was to collapse the two Python calls into one. A single Python script that queries the database, calculates the gap, and returns "yes" or "no." No intermediate shell variables. No bash date math. No opportunity for the shell to corrupt the data between steps.

WATCHDOG_DUE=$(python3 -c "...")  # Returns yes or no
if [ "$WATCHDOG_DUE" = "yes" ]; then
    python3 tools/watchdog.py check
fi

Two or three characters of output instead of a full ISO timestamp. The surface area for failure shrank by an order of magnitude.
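The single Python gatekeeper can be sketched like this, against an in-process SQLite connection. The `events` table and column names here are assumptions, not the actual schema:

```python
import sqlite3
from datetime import datetime, timezone

def watchdog_due(conn, now=None, interval_hours=6):
    """Answer "yes" or "no" in one process, with no shell variables in between.

    Assumes an `events` table with `event_type` and an ISO-8601
    `created_at` column (names are assumptions, not the real schema).
    """
    now = now or datetime.now(timezone.utc)
    row = conn.execute(
        "SELECT created_at FROM events WHERE event_type = 'watchdog_run' "
        "ORDER BY created_at DESC LIMIT 1"
    ).fetchone()
    if row is None:
        return "yes"  # never ran: run it now
    hours = (now - datetime.fromisoformat(row[0])).total_seconds() / 3600
    return "yes" if hours >= interval_hours else "no"
```

A missing row, a malformed timestamp, or a database error all surface inside one Python process, where they can raise or default deliberately — instead of leaking an empty string into bash arithmetic.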

The Engine Detection Problem

While I was fixing the spam, I noticed another false positive. The watchdog kept reporting the engine was down when I could clearly see it running. The health check used pgrep -f "groundswell/run.sh" to find the process, but the actual process name in the process table did not match that pattern. Pgrep returned nothing. The watchdog concluded the engine was dead.

The fix was to switch from pgrep to ps aux | grep run.sh with appropriate exclusions for the grep process itself and for the aianna-social engine that runs a different run.sh on the same machine. Two greps and a count. Ugly, but it works.

This is the pattern I keep running into: the elegant solution fails in production because production is not elegant. Pgrep should work. It is the right tool. But the process was launched through a shell wrapper that changed the command string, and pgrep matched against the command string, not the script path. The correct fix is the fix that works on the actual machine, not the fix that works in theory.
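The two-greps-and-a-count rule is easier to test when the filter is separated from the `ps` call. A sketch of the same logic in Python — the exclusion list mirrors the fix above, and `engine_alive` is a hypothetical wrapper, not the watchdog's actual code:

```python
import subprocess

def engine_process_count(ps_lines, needle="run.sh",
                         exclusions=("grep", "aianna-social")):
    """Count processes whose command line mentions run.sh, skipping the
    search itself and the unrelated aianna-social engine — the same rule
    as `ps aux | grep run.sh | grep -v grep | grep -v aianna-social | wc -l`."""
    return sum(
        1 for line in ps_lines
        if needle in line and not any(x in line for x in exclusions)
    )

def engine_alive():
    """Run ps aux and apply the filter (sketch only)."""
    out = subprocess.run(["ps", "aux"], capture_output=True, text=True).stdout
    return engine_process_count(out.splitlines()) > 0
```

Because the filter takes plain lines of text, the weird production command strings that defeated pgrep can go straight into a regression test.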

The Real Architecture Bug

All of this monitoring work revealed a deeper problem that the watchdog merely exposed.

The reason X had zero posts in three days was not a ghost post issue and was not a platform failure. It was an architecture flaw. The Marketing Manager agent was responsible for routing content to platform agents. It would select backlog items, choose the right platform, and dispatch them. But the Marketing Manager ran as a Claude agent, and dispatching to a platform agent meant spawning another Claude agent inside its own session. That spawn chain was broken. The Marketing Manager would route the content, log the routing event, and then fail to actually start the platform agent.

Posts were logged as routed but never posted. Ninety-six items in the backlog, all marked as "routed," none of them live.

The fix was to bypass the Marketing Manager entirely for posting. I added outbound_x and outbound_linkedin as direct scheduled tasks in the orchestrator config. Five times a day for X at the posting windows. Twice a day for LinkedIn on weekdays. The orchestrator spawns the platform agent directly with a task_type of outbound_post. No intermediate agent. No broken spawn chain.
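The shape of those entries, sketched as a Python dict — the field names and times here are illustrative, not the actual Groundswell orchestrator schema:

```python
# Illustrative shape of the direct scheduled tasks; keys and times are
# hypothetical stand-ins for the real orchestrator config.
SCHEDULED_TASKS = {
    "outbound_x": {
        "agent": "x_agent",
        "task_type": "outbound_post",
        # five posting windows a day (times are placeholders)
        "times": ["09:00", "11:30", "14:00", "17:00", "20:00"],
    },
    "outbound_linkedin": {
        "agent": "linkedin_agent",
        "task_type": "outbound_post",
        "days": ["mon", "tue", "wed", "thu", "fri"],  # weekdays only
        "times": ["10:00", "16:00"],                  # twice a day
    },
}
```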

The Marketing Manager still handles content strategy, topic selection, and routing decisions. But the actual act of posting is now a direct line from the orchestrator to the platform. The same pattern that already worked for inbound mention handling — the orchestrator spawns the X Agent directly for inbound_x. I just replicated the pattern for outbound.

Six commits in three hours. Every one of them a response to a new discovery.

The Operational Lesson

Here is what I keep learning and apparently need to keep relearning:

Monitoring is not a feature you add at the end. It is not a nice-to-have. It is not the thing you build after the system is stable. The system is never stable. The system is a living thing that drifts, degrades, and develops new failure modes faster than you can catalog the old ones.

But — and this is the part that matters — the monitor itself is part of the system. It has bugs. It has false positives. It has its own failure modes that require their own monitoring. The watchdog that spammed me every five minutes was worse than no watchdog at all, because it trained me to ignore the alerts. The engine detection that cried wolf was worse than no engine detection, because it masked the real engine failures behind a wall of noise.

This is not unique to AI agent systems. This is the oldest problem in operations. Nagios alerts that fire so often they get piped to /dev/null. PagerDuty escalations that everyone mutes. CloudWatch dashboards that turn red on Tuesday and nobody investigates because they turned red last Tuesday too and it fixed itself.

The AI part makes it worse because the failure modes are stranger. A system that successfully calls an API, receives a 200 response, and has nothing to show for it. A spawn chain that logs success at every step except the step that actually does the work. A process detection pattern that matches in testing and fails on the production machine because the process was launched through a different shell wrapper.

Six commits to build a watchdog, fix the watchdog, fix the thing the watchdog exposed, fix the watchdog again, and then fix it one more time. That is not a failure of engineering. That is engineering. The system tells you what is wrong with it, but only if you build the thing that listens, and then build the thing that listens to the listener.

I am one hundred percent certain the watchdog has another bug I have not found yet. That is fine. The next alert will tell me where it is.