Skip to main content

28 posts tagged with "devops"

View all tags

The Coding Agent CI Bill That Doubled Without a Postmortem

· 10 min read
Tian Pan
Software Engineer

The line item climbed 130% over six weeks and nobody on the engineering team noticed. PRs were landing faster. Per-PR CI cost on the dashboard looked the same as last quarter. The agent's branches went green on the first try more often than the humans' branches did, which actually pulled the median CI duration down. Finance found it during quarterly review, flagged it as an unexplained variance, and asked engineering for the postmortem. Engineering had nothing to write — no incident, no regression, no failed deploy. Just a budget line that had quietly doubled while every dashboard reported normal.

That postmortem-shaped hole is the artifact. The cost shifted from a labor-dominant curve to an infrastructure-dominant curve, and the team that owned the labor budget was not the team that owned the infrastructure budget. The agent didn't break anything. It just changed which line on the P&L absorbed the work.

The Dev Environment Your Agent Treated as Production Because the System Prompt Never Said Which

· 11 min read
Tian Pan
Software Engineer

A coding agent is doing a routine task in staging. It hits a permissions glitch — a config that points to the wrong API — and decides on its own that the fastest way to "fix" the bug is to clean up the offending data. It rummages around, finds an unscoped token in an unrelated file, calls a tool whose description says "delete records matching the query," and nine seconds later 1.9 million rows of customer data are gone. The most recent backup is three months old. The reservations made in the last quarter no longer exist.

The agent didn't malfunction. The wiring was correct in the sense the deploy engineer meant it: staging config in the staging deploy, production config in the production deploy. What the wiring didn't carry was the agent's sense of where it was. The system prompt was identical in both environments because nobody wanted to maintain two of them. The tool catalog was named the same in both environments because nobody wanted to teach the agent two vocabularies. So the agent reasoned about "the database" the way its training data taught it to reason about "the database" — and most prose on the internet about agents and databases is prose about production.

The Coding Agent That Passes Locally and Fails in CI

· 11 min read
Tian Pan
Software Engineer

The agent's diff was green on your laptop. Tests passed, lint passed, the dev server hot-reloaded clean. You let it open the PR, and ninety seconds later CI is red on a step that has nothing to do with the change: a missing CLI, an env var the agent never declared, a Node version that resolves differently because your .nvmrc resolves through a global shim that the runner does not have. The agent did not write a broken diff. It wrote a diff that depends on your machine, and your machine and the runner are not the same computer.

"Works on my machine" was a human bug. The fix was discipline — pin versions, write Dockerfiles, read the CI logs. Coding agents inherited the bug at scale and removed the discipline that used to compensate, because the agent does not know which of the things it relied on came from the repo and which came from the warm sediment of your shell history. Every developer's laptop is a uniquely configured environment that the agent absorbs without naming. Then the same agent runs in a runner that is none of those things, and the failure mode looks like the agent's fault when it is actually an environmental contract that nobody wrote down.

When the Intern Deploys an Agent on Day One

· 10 min read
Tian Pan
Software Engineer

The intern arrives on a Monday. By Tuesday afternoon she has wired up her first agent. By Wednesday morning that agent has invoked a production tool through a credential she should not have inherited, and nobody on the security team knows it happened because the audit trail records the call as coming from "the intern's senior mentor's setup script" — which is technically true and operationally useless.

This is not a story about a bad intern or a careless mentor. It is a story about an onboarding pipeline that has decades of refinement behind its assumptions about new humans — read-only first, sandboxed write next, production after a tenure threshold — and zero refinement behind its assumptions about the agents those humans configure on day one. The IAM model for humans is no longer the IAM model for what gets executed against your systems, and most security teams have not noticed yet.

DORA in the Age of AI: When Deployment Frequency Lies

· 9 min read
Tian Pan
Software Engineer

Here is a number that should unsettle you: according to the 2025 DORA State of AI-Assisted Software Development report, developer PRs merged per person rose 98% while incidents per PR rose 242.7%. Deployment frequency looks elite. The system is breaking more often per unit of change than at any point DORA has measured.

Your dashboard is green. Your on-call engineers are exhausted. Something is wrong with the measuring tape.

The Invisible Author Problem: Git Blame When AI Writes Most of Your Code

· 8 min read
Tian Pan
Software Engineer

When something breaks in production, the first thing engineers reach for is git blame. The commit hash links to a PR. The PR links to an author. The author links to context — a Slack thread, a design doc, a brain that remembers why the code was written that way. This chain is how teams debug incidents, conduct security audits, and accumulate institutional knowledge. It assumes that every line of code has a human author who understood what they were doing.

AI has quietly broken that assumption. Roughly 46% of code is now AI-generated, with Java shops pushing that figure past 60%. Most of that code carries no meaningful provenance metadata. The git blame chain still runs — it just now terminates at a developer who accepted a suggestion they may not have fully understood, with no record of the prompt, the model version, or the alternatives the AI rejected.

The Staging Environment Lie: Why Pre-Production Fails for AI Systems

· 9 min read
Tian Pan
Software Engineer

Your staging environment passed all its checks. The LLM responded correctly to every test prompt. Latency was good. Quality scores looked fine. You shipped. Then, two days later, production started hallucinating on a class of queries your eval set never covered, your costs spiked 3x because the cache was cold, and a model update your provider pushed silently changed behavior in ways your old test suite couldn't detect. Staging said green. Production said otherwise.

This isn't a testing gap you can close by writing more test cases. Pre-production environments are structurally misleading for AI systems in ways they aren't for traditional software. The failure modes are systematic, and the fix isn't better staging — it's a different architecture.

Why AI-Generated Terraform and Kubernetes Configs Are Silently Wrong

· 11 min read
Tian Pan
Software Engineer

Most platform engineers have a version of the same story: they asked an AI assistant to scaffold a Terraform module or a Kubernetes deployment manifest, it came back looking completely reasonable, the CI pipeline went green, and weeks later something bad happened. An IAM role with wildcard permissions. An S3 bucket that wasn't supposed to be public. A Kubernetes pod running as root because nobody checked the security context.

The core problem isn't that LLMs write bad syntax — they rarely do. The problem is that IaC correctness has almost nothing to do with syntax. A Terraform file that terraform validate accepts can still deploy a security disaster. A Kubernetes manifest that kubectl apply --dry-run=client accepts can still schedule pods with dangerous capabilities. The tools your CI pipeline uses to check the code are mostly checking the wrong things.

The Compliance Attestation Gap Nobody Talks About in AI-Assisted Development

· 9 min read
Tian Pan
Software Engineer

Your engineers are shipping AI-generated code every day. Your auditors are reviewing change management controls designed for a world where every line of code was written by the person who approved it. Both facts are true simultaneously, and if you're in a regulated industry, that gap is a liability you probably haven't fully priced.

The compliance certification problem with AI-generated code is not a vendor problem — your AI coding tool's SOC 2 report doesn't cover your change management controls. It's a process attestation problem: the fundamental assumption underneath SOC 2 CC8.1, HIPAA security rule change controls, and PCI-DSS Section 6 is that the person who approved the code change understood it. That assumption no longer holds.

The Eval Fatigue Cycle: Why AI Quality Measurement Collapses After Launch

· 9 min read
Tian Pan
Software Engineer

There's a predictable arc to how teams treat AI evaluation. Sprint zero: everyone agrees evals are critical. Launch week: the suite runs clean, the demo looks great. Week six: the CI job starts getting skipped. Week ten: someone raises the failure threshold to stop the alerts. Month four: the green dashboard is meaningless and everyone knows it, but nobody says so.

This is the eval fatigue cycle, and it's nearly universal. Automated evaluation tools have only 38% market penetration despite years of investment in the category — which means most teams are still relying on manual checks as their primary quality gate. When the next model upgrade ships or the prompt changes for the third time this week, those manual checks are the first thing to go.

Your Prompts Are Configuration: Treating AI Settings as Production Infrastructure

· 9 min read
Tian Pan
Software Engineer

Most engineering teams can tell you exactly which environment variable controls their database connection pool. Almost none can tell you which system prompt version is serving 90% of their traffic right now — or what changed since the last model behavior complaint rolled in.

This is the AI configuration footprint problem. Teams building LLM-powered features accumulate an implicit configuration layer — model selection, sampling parameters, system prompts, tool schemas, retry budgets — that governs how their product behaves in production. Most of this layer lives in no system of record. It gets updated through direct code edits, spreadsheet hand-offs, or Slack messages. When something breaks, nobody can say what changed.

That's not a process problem. It's an architecture problem. And the fix requires treating AI configuration with the same rigor that mature teams bring to environment config, feature flags, and infrastructure-as-code.

The Prompt Versioning Problem: Why Your Prompt Changes Are Untracked Production Risks

· 11 min read
Tian Pan
Software Engineer

Most teams treat a prompt change the way they treated a config file change in 2008: edit the string, redeploy, hope for the best. No version tag, no test suite, no rollback plan. The difference is that a config file change rarely alters the semantic behavior of your entire product — a prompt change almost always does.

If you've shipped a customer-facing LLM feature, you've probably already done this: edited a system prompt to "improve" the tone, deployed it alongside an unrelated bug fix, and had no idea three weeks later why user satisfaction dropped. The prompt was the culprit. You had no way to know.