763 posts tagged with "ai-engineering"

The PR Description Your Coding Agent Cannot Write

· 10 min read
Tian Pan
Software Engineer

Your coding agent finished the task. The diff is small, the tests are green, the lint is clean, and the PR body says, in its entirety, "Fixes the bug in module X." A reviewer six time zones away opens the page, reads the diff in isolation, sees nothing wrong with it, and approves a technically correct change that solves the wrong problem. The change ships. Two days later a customer asks why the workaround they had been relying on stopped working, and you discover that the bug your agent fixed was not the bug the ticket was about.

The code was fine. The reviewer was conscientious. The agent did exactly what it was asked. The artifact between them — the pull request — was empty of everything that would have caught the mistake.

The Verifier Loop That Couldn't Converge

· 11 min read
Tian Pan
Software Engineer

The most expensive bug in an agent system is the one with no error message. Worker proposes a draft. Verifier rejects it with a paragraph of feedback. Worker revises. Verifier rejects again. The loop keeps spinning, the trace keeps growing, the bill keeps climbing, and from the outside the system looks like it is working — diligently, in fact, because both models are doing their assigned job. What nobody priced in is that the verifier's acceptance criteria are not fixed across calls. The target the worker is chasing is moving, and the loop has no convergence guarantee.

You thought you shipped "iterate until satisfied." What you shipped is a search through a space whose extrema may not exist.
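One way to give such a loop a convergence guarantee is to freeze the acceptance criteria and cap the number of rounds. The sketch below is illustrative only: `run_bounded_loop`, `LoopResult`, and the idea of expressing the verifier as a fixed dictionary of named checks are assumptions made for this example, not something the post prescribes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class LoopResult:
    draft: str
    accepted: bool
    rounds: int


def run_bounded_loop(
    worker: Callable[[str, List[str]], str],
    checks: Dict[str, Callable[[str], bool]],
    task: str,
    max_rounds: int = 3,
) -> LoopResult:
    """Revise a draft against a FIXED set of checks, for at most max_rounds."""
    draft = worker(task, [])
    for round_no in range(1, max_rounds + 1):
        # The criteria are the same callables on every round, so the target
        # the worker is chasing cannot move between calls.
        failures = [name for name, check in checks.items() if not check(draft)]
        if not failures:
            return LoopResult(draft, True, round_no)
        if round_no < max_rounds:
            draft = worker(task, failures)
    # Budget exhausted: stop and report, instead of spinning silently.
    return LoopResult(draft, False, max_rounds)
```

With this shape, a loop that cannot converge ends as an explicit `accepted=False` result after a known number of rounds, which is something an operator can alert on.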

When Safety Training Collapses the Operator Into the User

· 10 min read
Tian Pan
Software Engineer

The on-call engineer is paged at 3am. A queue is backed up, the customer-facing API is throwing 503s, and the documented mitigation is to drain the affected node and force a failover. She types the command into the operations agent and waits for the confirmation. Instead she gets a paragraph about how draining production nodes could affect users, a suggestion to consult her manager, and a polite refusal to proceed without "additional authorization." It is 3:04am. The runbook she is following was approved by her director, her VP, and the compliance team. The agent has no idea who she is.

This is not a model alignment failure. The model is doing exactly what it was trained to do: refuse risky-sounding requests from unidentified callers. The failure is architectural. The compliance review that signed off on user-facing refusals also, without anyone noticing, signed off on blocking the on-call engineer.

The Perf Review Template That Cannot See AI Work

· 11 min read
Tian Pan
Software Engineer

Your strongest AI engineer spent the cycle curating an eval set, calibrating a judge prompt, and killing two features that turned out to be task-shape mismatched. None of that fits a single line on the review template. So the calibration meeting either inflates the artifacts the engineer cares least about — PR count, design docs, on-call shifts — or invents prose to justify a high rating the framework cannot defend. Either way, the rubric and the reality are pulling in different directions, and the engineer can tell.

The template was written for deterministic software. It rewards what you can count: lines of code shipped, services owned, incidents resolved, hours spent on-call. The AI roadmap is moved by a different shape of work: curating a representative eval slice, defending a behavioral envelope under model drift, refusing to ship a feature whose task shape doesn't fit the model, and patiently shrinking the gap between a judge prompt and human intent. Almost none of that produces the artifacts the rubric was built to count.

The Judge That Agreed With Itself Across A and B

· 10 min read
Tian Pan
Software Engineer

You run an A/B test on two prompt variants. Your LLM judge — same vendor as your candidate model, because it was the easy default — consistently prefers variant A by a margin large enough to call statistically meaningful. You ship A. A week later your satisfaction metric is down, your refund queue is up, and nobody can explain it. Somebody finally re-runs the eval with a judge from a different model family. The preference flips.

The judge was not measuring quality. The judge was measuring how much the candidate sounded like the judge.
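A cheap guard against this is to score the same comparisons with a judge from a different model family and flag the eval when the two judges diverge. The following is a minimal sketch under assumptions: verdicts are already collected as lists of "A" or "B" strings, and the function names and the 0.1 margin are hypothetical choices for illustration.

```python
from collections import Counter
from typing import List


def preference_rate(verdicts: List[str], variant: str = "A") -> float:
    """Fraction of decided comparisons in which the judge preferred `variant`."""
    counts = Counter(verdicts)
    decided = counts["A"] + counts["B"]
    return counts[variant] / decided if decided else 0.0


def judges_disagree(
    same_family: List[str], other_family: List[str], margin: float = 0.1
) -> bool:
    """True when the two judges pick different winners or differ by > margin."""
    a = preference_rate(same_family)
    b = preference_rate(other_family)
    picks_flip = (a - 0.5) * (b - 0.5) < 0  # one judge favors A, the other B
    return picks_flip or abs(a - b) > margin
```

A `True` result does not say which judge is right. It only says the preference may reflect the judge rather than the candidate, and that a human-labeled sample is needed before shipping.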

Your AI Disclosure Disappeared by Turn Three and Nobody Noticed Until the Regulator Did

· 11 min read
Tian Pan
Software Engineer

Your legal team spent four meetings negotiating the exact disclosure sentence. Engineering put it at the top of the system prompt. QA confirmed it appears in turn one of every session. Three months later a regulator forwards a transcript: turn fourteen of a complaint-handling conversation, an hour of substantive guidance about a refund dispute, and nowhere in those fourteen turns does the user see the words "I am an AI." The disclosure your single-turn compliance review approved is structurally incapable of surviving the conversations that need it.

This is disclosure decay, and it is the multi-turn agentic failure mode that the wave of 2025–2026 chatbot regulation was not designed to catch and your QA process is not configured to test for. The EU AI Act's Article 50 transparency obligations become enforceable on August 2, 2026, with fines for violating them of up to €15 million or 3% of global annual turnover (the Act's €35 million or 7% ceiling is reserved for prohibited practices). California's SB 243 took effect January 1, 2026, with a private right of action that lets consumers sue directly for at least $1,000 per violation. Washington requires recurring disclosures, with hourly cadences for minors. None of these regimes were written assuming the disclosure would silently drop out of a session after the third tool call, but that is what your runtime is doing right now, on every long-running conversation, in production.
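Disclosure decay can be tested for on logged transcripts rather than only on turn one. The sketch below is a simplified illustration: it assumes the disclosure is detectable by a literal phrase match, and the function name and default phrase are hypothetical. It does not encode what any particular regulation actually requires.

```python
from typing import List


def max_turns_without_disclosure(
    assistant_turns: List[str], phrase: str = "I am an AI"
) -> int:
    """Longest run of consecutive assistant turns that lack the disclosure."""
    longest = current = 0
    needle = phrase.lower()
    for text in assistant_turns:
        if needle in text.lower():
            current = 0  # disclosure seen, the run resets
        else:
            current += 1
            longest = max(longest, current)
    return longest
```

A QA check could then assert that this value stays under whatever cadence the legal team has agreed on, across sampled multi-turn sessions instead of a single-turn fixture.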

The AI Literacy Gap Inside Your Own Team Is the Biggest Delivery Risk on Your Roadmap

· 10 min read
Tian Pan
Software Engineer

Your hiring page asks for AI experience. Your launch announcement names the AI features. Your roadmap commits to two more this quarter. And on the team that has to ship and maintain all of it, one engineer actually knows how to debug an eval failure, two can edit a prompt confidently, and twelve treat the LLM call as a black box they hand off whenever it misbehaves.

That distribution is the delivery risk nobody on your leadership team has named, because the team's stated AI capability — the thing that goes on the slide — is the maximum of any individual member's skill, and the team's actual delivery velocity is the median. The slide says one number; production runs on the other.

The CI Agent With Merge Rights at 3 AM

· 12 min read
Tian Pan
Software Engineer

A flaky test gets quarantined at 3:17 AM. The on-call rotation does not page, because nothing failed — the agent decided the failure was noise, opened a small PR labeled chore: quarantine flaky test, marked the change as a self-merge under the ci-bot service account, and went back to watching the queue. Six days later a customer reports that a feature has been broken since Tuesday. The test was not flaky. It was the only thing standing between a real regression and production, and the agent's confidence threshold was set high enough to make a decision but low enough to be wrong.

This is the part of agentic CI that the marketing decks skip. Wiring an agent into your pipeline to triage failures, downgrade dependencies on security alerts, and propose dependency bumps is straightforward in 2026 — the tools exist, the integrations are one config file away, and the productivity story is real. The part that nobody writes a runbook for is the new operational class you just created: an actor with merge rights that runs at 3 AM with no human in the synchronous loop, and an SRE handbook that assumed humans were the source of intent.

The Streaming Response That Committed Before the User Said Yes

· 12 min read
Tian Pan
Software Engineer

The user is reading the agent's reasoning as it streams in. Around token 1200, the model decides to call send_email, then create_ticket, then kick_off_deploy. The user, watching the partial output and realizing the agent has misread the request, hits the stop button half a second too late. The email is already sent. The ticket is already filed. The deploy is already running. The stop button cancelled the next token, not the consequences of the last one.

The bug is not in the cancel handler. The bug is the assumption — borrowed from every other streaming UI on the team's roadmap — that an incrementally rendered output is an incrementally reversible one. Tool calls do not honor that contract. They are point-in-time commits that the streaming layer happily fires while the rest of the response is still being generated, and the cancel button has no way to chase them down the wire.

This is one of those failure modes that nobody owns because it lives in the seam between two teams that each shipped their half cleanly. The UX team shipped streaming because it tested better in user studies. The platform team shipped tool calls because the framework supports them. Neither team had a meeting where someone asked: what is "stop" supposed to mean when the response has already left the building?
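One possible answer to that question is to stage side-effecting tool calls while the response streams and execute them only once the user can no longer cancel. The class below is a hypothetical sketch of that idea. `StagedToolCalls` and its methods are invented for illustration and do not correspond to any specific framework's API.

```python
from typing import Callable, List, Tuple


class StagedToolCalls:
    """Holds side-effecting tool calls until the stream is confirmed."""

    def __init__(self) -> None:
        self._pending: List[Tuple[str, Callable[[], None]]] = []
        self.executed: List[str] = []

    def propose(self, name: str, run: Callable[[], None]) -> None:
        # Called when the model emits a tool call mid-stream. Nothing runs yet.
        self._pending.append((name, run))

    def cancel(self) -> List[str]:
        # Called by the stop button. Returns the names of the dropped calls.
        dropped = [name for name, _ in self._pending]
        self._pending.clear()
        return dropped

    def commit(self) -> None:
        # Called after the stream finishes without a cancel.
        pending, self._pending = self._pending, []
        for name, run in pending:
            run()
            self.executed.append(name)
```

Under this design "stop" means "nothing staged so far will happen." The trade-off is latency: effects are delayed until the end of the stream, so read-only tools may need a separate, unstaged path.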

Your Agent's Audit Log Records Everything Except the Reason

· 11 min read
Tian Pan
Software Engineer

Compliance forwards you a ticket. A customer was denied a refund by your support agent three weeks ago, they have escalated, and now someone needs to explain the decision. You feel calm about this, because you instrumented everything. Every prompt, every tool call, every retrieved chunk, every token count, every latency number — it is all in the trace, and you can pull it up in seconds.

You pull it up. You can see the agent received the refund request. You can see it called get_order_history, then check_return_window, then lookup_policy. You can see the exact policy text it retrieved. You can see the final message it sent: refund denied. The trace is complete. Every span is green. And you still cannot answer the question, because the trace shows you that the agent denied the refund and shows you everything it looked at, but it does not show you why those inputs added up to no. The reason lived in how the model weighed the context, and that weighing was never an artifact. It was never written down anywhere.

This is the gap between a trace and an explanation, and almost every team that says "we have full observability" has not noticed they only built the first half.
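The missing half can be made concrete as a decision record that is written at the moment of the decision, next to the trace. The sketch below assumes the refund decision can be expressed, or at least mirrored, by an explicit rule. `DecisionRecord`, `decide_refund`, and the `return_window` rule ID are hypothetical names for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DecisionRecord:
    decision: str   # what was decided
    rule_id: str    # which rule produced the decision
    evidence: dict  # the inputs the rule actually used
    rationale: str  # human-readable reason, stored as an artifact


def decide_refund(days_since_delivery: int, return_window_days: int) -> DecisionRecord:
    """Apply a simple return-window rule and record why it came out as it did."""
    within = days_since_delivery <= return_window_days
    side = "within" if within else "outside"
    return DecisionRecord(
        decision="approved" if within else "denied",
        rule_id="return_window",
        evidence={
            "days_since_delivery": days_since_delivery,
            "return_window_days": return_window_days,
        },
        rationale=(
            f"Order is {days_since_delivery} days past delivery, "
            f"{side} the {return_window_days}-day return window."
        ),
    )


def to_log_line(record: DecisionRecord) -> str:
    """Serialize the record so it can be stored alongside the trace spans."""
    return json.dumps(asdict(record), sort_keys=True)
```

The point of the shape is that the rationale and the evidence it rests on are stored fields, so the answer to "why" can be retrieved three weeks later instead of reconstructed.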

Your Embeddings Don't Know the Contractor Was Off-Boarded

· 9 min read
Tian Pan
Software Engineer

A contractor finished a six-month engagement last quarter. HR ran the off-boarding checklist: SSO disabled, laptop wiped, GitHub seat removed, Slack archived, Notion access revoked. Compliance signed off. Six weeks later, an internal RAG assistant answered a question by quoting a confidential strategy document the contractor had authored — and the chunk it cited was still tagged with the contractor's user ID in the vector store's allow-list. Nothing in the access logs of the source-of-truth ever recorded a read, because there was no read. The retrieval came from a copy of the data that nobody wired into the off-boarding flow.

This is the structural problem nobody puts on the architecture diagram. Your vector index is not just a similarity-search engine. It is a permission cache — a derived store of who-can-see-what, frozen at the moment you ran your embedding job — and almost nobody is invalidating it the way they invalidate everything else.
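A common mitigation for a stale permission cache is to re-check each retrieved chunk against the live source of truth at query time, rather than trusting the allow-list stored with the embedding. This is a minimal sketch under assumptions: `Hit`, `filter_by_live_acl`, and the `can_read` callback are hypothetical, and the callback stands in for whatever permission API the source system exposes.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Hit:
    doc_id: str
    text: str
    score: float


def filter_by_live_acl(
    hits: Iterable[Hit],
    user_id: str,
    can_read: Callable[[str, str], bool],
) -> List[Hit]:
    """Keep only hits the user may read right now, per the source of truth."""
    # Permissions frozen at embedding time are ignored on purpose here.
    return [hit for hit in hits if can_read(user_id, hit.doc_id)]
```

The live check adds a lookup per hit, so it is usually applied to the top-k results after similarity search, not to the whole index.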

Shadow Replay Punishes the Model That Would Have Changed the Conversation

· 9 min read
Tian Pan
Software Engineer

A team I worked with last quarter shipped a new model into shadow replay and watched the win rate sit at 47 percent against the incumbent. Same prompts, same retrieval, a model the vendor's own evals had ranked clearly higher. The shadow harness took last week's production traffic, pumped it through the candidate, fed both responses to an LLM judge, and declared the upgrade roughly a coin flip. The team almost reverted on the spot.

The problem was not the model. The problem was that every user message in the replay had already been conditioned on the old model's previous turn. The candidate wrote a better answer at turn one, the user in the log replied to a different answer that no longer existed, and from turn two onward the judge was scoring a conversation that was not happening. A genuinely better model that changes what the user does next has no ground truth to be scored against. The replay quietly rewards staying on the old rails.
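One conservative way to keep the replay honest is to score only the turns whose logged history is still valid for the candidate, and to stop at the first turn where the candidate's reply diverges from the incumbent's. The sketch below is illustrative. `scorable_prefix` is a hypothetical helper, and the default exact-match `same` is a stand-in assumption for whatever equivalence test a real harness would use.

```python
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (user_message, incumbent_reply)


def scorable_prefix(
    logged: List[Turn],
    candidate: Callable[[List[Turn], str], str],
    same: Callable[[str, str], bool] = lambda a, b: a == b,
) -> List[Tuple[str, str, str]]:
    """Return (user_msg, incumbent_reply, candidate_reply) for valid turns only."""
    history: List[Turn] = []
    scorable: List[Tuple[str, str, str]] = []
    for user_msg, incumbent_reply in logged:
        candidate_reply = candidate(history, user_msg)
        # This turn is valid: every earlier reply the user saw still matches.
        scorable.append((user_msg, incumbent_reply, candidate_reply))
        if not same(incumbent_reply, candidate_reply):
            # Later logged user messages respond to a reply the candidate
            # never gave, so nothing after this turn is fair to score.
            break
        history.append((user_msg, incumbent_reply))
    return scorable
```

In practice two models rarely produce identical replies, so this often reduces to scoring turn one only. That is a smaller sample, but it is one where the judge is scoring a conversation that actually happened.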