The Production Deploy That Called My Bluff
A revenue dashboard showed the wrong number in front of the wrong people. What followed was a three-hour production deploy that exposed build-time secrets, schema drift, and the gap between having a fix and shipping a fix.
I had the fix. It existed in a branch. Tests passed locally. And my revenue dashboard was still showing a $40K uncertified downsell to anyone who opened it.
That's the gap nobody talks about in AI-assisted engineering. Having working code on your machine and having working code in production are separated by a minefield of build pipelines, secret scoping, schema drift, and deployment orchestration. The fix existed. It just wasn't real yet.
I was looking at HorizonLens, the revenue certification system I built for Addium. The trust layer, the thing that refuses to display a number unless it can certify it to the penny, was supposed to suppress uncertified movement. Locally, it did. In production, the old image was still running. The number was still wrong. And I needed it fixed now, not next sprint.
Cloud Build doesn't know your runtime secrets
The first deploy failed in a way I should have predicted but didn't.
Next.js evaluates your route modules during next build. My auth configuration called resolveAuthConfig() at import time, which needs GOOGLE_CLIENT_ID, GOOGLE_CLIENT_SECRET, and NEXTAUTH_SECRET. Those secrets exist in Cloud Run's runtime environment. They do not exist during Cloud Build's image construction.
The fix was architectural: make auth config lazy. Move authOptions construction inside the request handler so it evaluates at request time, not at import time. NextAuth(authOptions()) instead of NextAuth(authOptions). One line of structural difference. The build passed.
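The move described above can be sketched as follows. This is a minimal stand-in, not the actual HorizonLens code: requireEnv is a hypothetical helper, and the wiring into NextAuth itself is omitted.

```typescript
// A hypothetical requireEnv helper: fails loudly when a secret is absent.
function requireEnv(name: string): string {
  const value = process.env[name];
  if (value === undefined || value === "") {
    // At import time during `next build`, Cloud Build has none of these set,
    // so an eager read here is exactly what killed the first deploy.
    throw new Error(`Missing required env var: ${name}`);
  }
  return value;
}

// Eager (breaks the build): evaluated the moment the module is imported.
// export const authOptions = { clientId: requireEnv("GOOGLE_CLIENT_ID"), ... };

// Lazy (builds cleanly): nothing touches the environment until a request
// handler actually calls authOptions().
export function authOptions() {
  return {
    clientId: requireEnv("GOOGLE_CLIENT_ID"),
    clientSecret: requireEnv("GOOGLE_CLIENT_SECRET"),
    secret: requireEnv("NEXTAUTH_SECRET"),
  };
}
```

The route handler then calls NextAuth(authOptions()) per request, so secret resolution happens in Cloud Run's runtime environment, where the values actually exist.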
This is the kind of thing that works perfectly in local development because your environment has everything. Production separates build-time from runtime, and if your code doesn't respect that boundary, it breaks silently until the day you actually need it to work.
Schema drift will find you
The web service deployed cleanly. The extractor did not.
The extractor job had been running on the previous image tag. The new code included an ON CONFLICT clause targeting a four-column unique index on the quarantine table: (segment_id, snapshot_month_end, segment_index, reject_reason). That index existed in my development database. It did not exist in production.
Production still had the old index definition. CREATE UNIQUE INDEX IF NOT EXISTS doesn't replace an existing index with a different column set. It sees the name exists and does nothing. The schema had drifted, and the only way to find out was to run the job against production data and watch it fail.
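The failure mode is easy to reproduce. The index and table names below are illustrative, not the real HorizonLens schema; the column list follows the article.

```sql
-- What production actually had: an older index under the same name with a
-- different (hypothetical) column set.
-- CREATE UNIQUE INDEX quarantine_uniq ON quarantine (segment_id, snapshot_month_end);

-- What the new code assumed. IF NOT EXISTS checks only the NAME: the name
-- exists, so this statement silently does nothing, and the ON CONFLICT
-- clause targeting these four columns fails at runtime.
CREATE UNIQUE INDEX IF NOT EXISTS quarantine_uniq
  ON quarantine (segment_id, snapshot_month_end, segment_index, reject_reason);

-- A changed definition needs an explicit migration, not a guard:
DROP INDEX IF EXISTS quarantine_uniq;
CREATE UNIQUE INDEX quarantine_uniq
  ON quarantine (segment_id, snapshot_month_end, segment_index, reject_reason);

-- And verify what production actually has before trusting it:
SELECT indexdef FROM pg_indexes WHERE indexname = 'quarantine_uniq';
```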
I connected directly to Cloud SQL Postgres, dropped the old index, created the new one, verified the definition, cancelled the stale retry executions, and ran a clean extraction. It completed in 2 minutes and 20 seconds.
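That remediation sequence maps onto standard gcloud commands. Instance, project, database, and job names here are placeholders, not the real ones.

```shell
# Connect to the Cloud SQL instance to run the DROP INDEX / CREATE UNIQUE
# INDEX migration and verify the new definition (\di or pg_indexes).
gcloud sql connect my-instance --user=postgres --database=mydb

# Find and cancel the stale retry executions still running the old image.
gcloud run jobs executions list --job=extractor-job --region=us-central1
gcloud run jobs executions cancel EXECUTION_NAME --region=us-central1

# Kick off a clean extraction against the repaired schema and wait for it.
gcloud run jobs execute extractor-job --region=us-central1 --wait
```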
The lesson is blunt: if you're managing schema evolution through application code and IF NOT EXISTS guards, you have a schema management strategy that works exactly until your first breaking change. Then it doesn't.
The gap between "fixed" and "deployed"
Three separate services needed updates. The web frontend, the extractor job, and the Horizon analysis service. Each had its own image, its own deploy script, its own failure mode. The web needed the lazy auth pattern. The extractor needed schema DDL. Horizon needed a new image tag but no behavioral change.
By the end, five Cloud Run revisions had been created, one Cloud Build had failed and been rebuilt, one extractor execution had failed and been rerun, and the production environment was verified end to end. The dashboard showed the right number. The trust layer held.
None of this was complex in isolation. Each fix was a few lines. But the compounding effect of build-time vs. runtime secrets, schema drift, stale job retries, and deploy script side effects turned a "push the fix" operation into a three-hour production incident.
The takeaway
The next time someone tells you the fix is ready, ask them where it's running. A branch is a plan. A local test is a hypothesis. Production is the only truth. And production has opinions about your code that your development environment will never share with you.