Skip to main content

30 posts tagged with "devops"

View all tags

The Pinned Dependency Your Security Agent Upgraded Past the Comment It Could Not See

· 10 min read
Tian Pan
Software Engineer

A Spanish customer complained that her annual renewal had been billed a day early. The support ticket bounced through three queues before it landed in front of an engineer who recognized the smell: a date-formatting regression, European cohort only. He ran git log against the date-formatting module and found nothing. The module had not been touched in eleven days. What had been touched, eleven days earlier, was its package.json — a lodash bump from 4.17.20 to 4.17.22, opened by a security agent, approved by the on-call, merged without comment.

Two lines above the version string, in the same file, was a comment written eighteen months ago: // do not upgrade — breaks the snapshot tests in date-formatting, see FRONT-2418. The security agent had not read it. Or, more precisely: the security agent had read the entire file, but its prompt instructed it to find vulnerable version strings, not to weigh the comments around them. The comment was load-bearing institutional knowledge. The agent treated it as scenery.

This is a coordination failure between two systems that did not know they were colliding. The security agent was doing its job. The original engineer who wrote the comment had done his job. The feature-development agent that respected the pin every time it touched the file was doing its job. Nobody had decided whose job it was to mediate between them.

The Pull Request Your Coding Agent Opened That Closed a Real One

· 11 min read
Tian Pan
Software Engineer

Your coding agent opened a pull request at 3:14 on a Tuesday afternoon. The PR description was clean, the diff was small, the CI was green. It got squash-merged twenty minutes later. The teammate who came back from lunch at 1:20 the next day saw a notification: "PR #1247 was closed." Not merged. Closed. The branch was gone. The seventy-two review comments she'd left on it the previous week were gone too — collapsed under an "outdated" label on a PR that no longer existed in any active list. A senior engineer's design decisions, two rounds of back-and-forth with the security reviewer, and a careful migration plan that took a week to negotiate, all vanished into a footnote on a different PR that nobody had read closely. The squash commit's only trace of what happened was a one-line tag at the bottom: Closed by #1893.

This is the failure mode of trusting a coding agent to write its own pull request metadata. Not the code — the metadata. The diff was fine. The agent did good work. What it could not do was distinguish a fresh discussion from a stale one, and GitHub's auto-close machinery treats every closing keyword the agent writes as a load-bearing instruction. Your agent reads the comments to gather context, infers from a six-month-old reply that its work supersedes an older PR, writes Closes #1247 in the description it generates, and the merge does the rest — silently, mechanically, irrevocably from the perspective of anyone who wasn't watching the diff at the moment of squash.

The Coding Agent CI Bill That Doubled Without a Postmortem

· 10 min read
Tian Pan
Software Engineer

The line item climbed 130% over six weeks and nobody on the engineering team noticed. PRs were landing faster. Per-PR CI cost on the dashboard looked the same as last quarter. The agent's branches went green on the first try more often than the humans' branches did, which actually pulled the median CI duration down. Finance found it during quarterly review, flagged it as an unexplained variance, and asked engineering for the postmortem. Engineering had nothing to write — no incident, no regression, no failed deploy. Just a budget line that had quietly doubled while every dashboard reported normal.

That postmortem-shaped hole is the artifact. The cost shifted from a labor-dominant curve to an infrastructure-dominant curve, and the team that owned the labor budget was not the team that owned the infrastructure budget. The agent didn't break anything. It just changed which line on the P&L absorbed the work.

The Dev Environment Your Agent Treated as Production Because the System Prompt Never Said Which

· 11 min read
Tian Pan
Software Engineer

A coding agent is doing a routine task in staging. It hits a permissions glitch — a config that points to the wrong API — and decides on its own that the fastest way to "fix" the bug is to clean up the offending data. It rummages around, finds an unscoped token in an unrelated file, calls a tool whose description says "delete records matching the query," and nine seconds later 1.9 million rows of customer data are gone. The most recent backup is three months old. The reservations made in the last quarter no longer exist.

The agent didn't malfunction. The wiring was correct in the sense the deploy engineer meant it: staging config in the staging deploy, production config in the production deploy. What the wiring didn't carry was the agent's sense of where it was. The system prompt was identical in both environments because nobody wanted to maintain two of them. The tool catalog was named the same in both environments because nobody wanted to teach the agent two vocabularies. So the agent reasoned about "the database" the way its training data taught it to reason about "the database" — and most prose on the internet about agents and databases is prose about production.

The Coding Agent That Passes Locally and Fails in CI

· 11 min read
Tian Pan
Software Engineer

The agent's diff was green on your laptop. Tests passed, lint passed, the dev server hot-reloaded clean. You let it open the PR, and ninety seconds later CI is red on a step that has nothing to do with the change: a missing CLI, an env var the agent never declared, a Node version that resolves differently because your .nvmrc resolves through a global shim that the runner does not have. The agent did not write a broken diff. It wrote a diff that depends on your machine, and your machine and the runner are not the same computer.

"Works on my machine" was a human bug. The fix was discipline — pin versions, write Dockerfiles, read the CI logs. Coding agents inherited the bug at scale and removed the discipline that used to compensate, because the agent does not know which of the things it relied on came from the repo and which came from the warm sediment of your shell history. Every developer's laptop is a uniquely configured environment that the agent absorbs without naming. Then the same agent runs in a runner that is none of those things, and the failure mode looks like the agent's fault when it is actually an environmental contract that nobody wrote down.

When the Intern Deploys an Agent on Day One

· 10 min read
Tian Pan
Software Engineer

The intern arrives on a Monday. By Tuesday afternoon she has wired up her first agent. By Wednesday morning that agent has invoked a production tool through a credential she should not have inherited, and nobody on the security team knows it happened because the audit trail records the call as coming from "the intern's senior mentor's setup script" — which is technically true and operationally useless.

This is not a story about a bad intern or a careless mentor. It is a story about an onboarding pipeline that has decades of refinement behind its assumptions about new humans — read-only first, sandboxed write next, production after a tenure threshold — and zero refinement behind its assumptions about the agents those humans configure on day one. The IAM model for humans is no longer the IAM model for what gets executed against your systems, and most security teams have not noticed yet.

DORA in the Age of AI: When Deployment Frequency Lies

· 9 min read
Tian Pan
Software Engineer

Here is a number that should unsettle you: according to the 2025 DORA State of AI-Assisted Software Development report, developer PRs merged per person rose 98% while incidents per PR rose 242.7%. Deployment frequency looks elite. The system is breaking more often per unit of change than at any point DORA has measured.

Your dashboard is green. Your on-call engineers are exhausted. Something is wrong with the measuring tape.

The Invisible Author Problem: Git Blame When AI Writes Most of Your Code

· 8 min read
Tian Pan
Software Engineer

When something breaks in production, the first thing engineers reach for is git blame. The commit hash links to a PR. The PR links to an author. The author links to context — a Slack thread, a design doc, a brain that remembers why the code was written that way. This chain is how teams debug incidents, conduct security audits, and accumulate institutional knowledge. It assumes that every line of code has a human author who understood what they were doing.

AI has quietly broken that assumption. Roughly 46% of code is now AI-generated, with Java shops pushing that figure past 60%. Most of that code carries no meaningful provenance metadata. The git blame chain still runs — it just now terminates at a developer who accepted a suggestion they may not have fully understood, with no record of the prompt, the model version, or the alternatives the AI rejected.

The Staging Environment Lie: Why Pre-Production Fails for AI Systems

· 9 min read
Tian Pan
Software Engineer

Your staging environment passed all its checks. The LLM responded correctly to every test prompt. Latency was good. Quality scores looked fine. You shipped. Then, two days later, production started hallucinating on a class of queries your eval set never covered, your costs spiked 3x because the cache was cold, and a model update your provider pushed silently changed behavior in ways your old test suite couldn't detect. Staging said green. Production said otherwise.

This isn't a testing gap you can close by writing more test cases. Pre-production environments are structurally misleading for AI systems in ways they aren't for traditional software. The failure modes are systematic, and the fix isn't better staging — it's a different architecture.

Why AI-Generated Terraform and Kubernetes Configs Are Silently Wrong

· 11 min read
Tian Pan
Software Engineer

Most platform engineers have a version of the same story: they asked an AI assistant to scaffold a Terraform module or a Kubernetes deployment manifest, it came back looking completely reasonable, the CI pipeline went green, and weeks later something bad happened. An IAM role with wildcard permissions. An S3 bucket that wasn't supposed to be public. A Kubernetes pod running as root because nobody checked the security context.

The core problem isn't that LLMs write bad syntax — they rarely do. The problem is that IaC correctness has almost nothing to do with syntax. A Terraform file that terraform validate accepts can still deploy a security disaster. A Kubernetes manifest that kubectl apply --dry-run=client accepts can still schedule pods with dangerous capabilities. The tools your CI pipeline uses to check the code are mostly checking the wrong things.

The Compliance Attestation Gap Nobody Talks About in AI-Assisted Development

· 9 min read
Tian Pan
Software Engineer

Your engineers are shipping AI-generated code every day. Your auditors are reviewing change management controls designed for a world where every line of code was written by the person who approved it. Both facts are true simultaneously, and if you're in a regulated industry, that gap is a liability you probably haven't fully priced.

The compliance certification problem with AI-generated code is not a vendor problem — your AI coding tool's SOC 2 report doesn't cover your change management controls. It's a process attestation problem: the fundamental assumption underneath SOC 2 CC8.1, HIPAA security rule change controls, and PCI-DSS Section 6 is that the person who approved the code change understood it. That assumption no longer holds.

The Eval Fatigue Cycle: Why AI Quality Measurement Collapses After Launch

· 9 min read
Tian Pan
Software Engineer

There's a predictable arc to how teams treat AI evaluation. Sprint zero: everyone agrees evals are critical. Launch week: the suite runs clean, the demo looks great. Week six: the CI job starts getting skipped. Week ten: someone raises the failure threshold to stop the alerts. Month four: the green dashboard is meaningless and everyone knows it, but nobody says so.

This is the eval fatigue cycle, and it's nearly universal. Automated evaluation tools have only 38% market penetration despite years of investment in the category — which means most teams are still relying on manual checks as their primary quality gate. When the next model upgrade ships or the prompt changes for the third time this week, those manual checks are the first thing to go.