780 posts tagged with "ai-engineering"

The CI Agent With Merge Rights at 3 AM

· 12 min read
Tian Pan
Software Engineer

A flaky test gets quarantined at 3:17 AM. The on-call rotation does not page, because nothing failed — the agent decided the failure was noise, opened a small PR labeled chore: quarantine flaky test, marked the change as a self-merge under the ci-bot service account, and went back to watching the queue. Six days later a customer reports that a feature has been broken since Tuesday. The test was not flaky. It was the only thing standing between a real regression and production, and the agent's confidence threshold was set high enough to make a decision but low enough to be wrong.

This is the part of agentic CI that the marketing decks skip. Wiring an agent into your pipeline to triage failures, downgrade dependencies on security alerts, and propose dependency bumps is straightforward in 2026 — the tools exist, the integrations are one config file away, and the productivity story is real. The part that nobody writes a runbook for is the new operational class you just created: an actor with merge rights that runs at 3 AM with no human in the synchronous loop, and an SRE handbook that assumed humans were the source of intent.

The Streaming Response That Committed Before the User Said Yes

· 12 min read
Tian Pan
Software Engineer

The user is reading the agent's reasoning as it streams in. Around token 1200, the model decides to call send_email, then create_ticket, then kick_off_deploy. The user, watching the partial output and realizing the agent has misread the request, hits the stop button half a second too late. The email is already sent. The ticket is already filed. The deploy is already running. The stop button cancelled the next token, not the consequences of the last one.

The bug is not in the cancel handler. The bug is the assumption — borrowed from every other streaming UI on the team's roadmap — that an incrementally rendered output is an incrementally reversible one. Tool calls do not honor that contract. They are point-in-time commits that the streaming layer happily fires while the rest of the response is still being generated, and the cancel button has no way to chase them down the wire.

This is one of those failure modes that nobody owns because it lives in the seam between two teams that each shipped their half cleanly. The UX team shipped streaming because it tested better in user studies. The platform team shipped tool calls because the framework supports them. Neither team had a meeting where someone asked: what is "stop" supposed to mean when the response has already left the building?
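One way to make "stop" mean something is to stage tool calls while the response streams and execute them only once the stream has finished without a cancel. The sketch below is a minimal illustration under that assumption; `StagedToolCalls` and its methods are hypothetical names, not part of any framework mentioned here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class StagedToolCalls:
    """Collects tool calls during streaming; nothing runs until commit()."""

    tools: Dict[str, Callable[..., object]]
    _pending: List[Tuple[str, dict]] = field(default_factory=list)
    cancelled: bool = False

    def stage(self, name: str, args: dict) -> None:
        # Record the model's intent without performing the side effect.
        if name not in self.tools:
            raise KeyError(f"unknown tool: {name}")
        self._pending.append((name, args))

    def cancel(self) -> int:
        # Called by the stop button: drop every staged call.
        dropped = len(self._pending)
        self._pending.clear()
        self.cancelled = True
        return dropped

    def commit(self) -> list:
        # Called once the stream has ended; a cancelled run executes nothing.
        if self.cancelled:
            return []
        results = [self.tools[name](**args) for name, args in self._pending]
        self._pending.clear()
        return results
```

The trade-off is latency: side effects wait for the end of the stream, which may be acceptable for irreversible actions such as sending email and unnecessary for read-only lookups.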

Your Agent's Audit Log Records Everything Except the Reason

· 11 min read
Tian Pan
Software Engineer

Compliance forwards you a ticket. A customer was denied a refund by your support agent three weeks ago, they have escalated, and now someone needs to explain the decision. You feel calm about this, because you instrumented everything. Every prompt, every tool call, every retrieved chunk, every token count, every latency number — it is all in the trace, and you can pull it up in seconds.

You pull it up. You can see the agent received the refund request. You can see it called get_order_history, then check_return_window, then lookup_policy. You can see the exact policy text it retrieved. You can see the final message it sent: refund denied. The trace is complete. Every span is green. And you still cannot answer the question, because the trace shows you that the agent denied the refund and shows you everything it looked at, but it does not show you why those inputs added up to no. The reason lived in how the model weighed the context, and that weighing was never an artifact. It was never written down anywhere.

This is the gap between a trace and an explanation, and almost every team that says "we have full observability" has not noticed they only built the first half.
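A rough sketch of what "writing the reason down" could look like: the agent is required to emit a structured decision record alongside its final message, and the record is stored with the trace. The field names and the `to_audit_line` helper are illustrative assumptions, and a self-reported rationale is the model's own account of its reasoning, not independent proof of it.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class DecisionRecord:
    decision: str          # e.g. "refund_denied"
    policy_clause: str     # the clause the agent says it applied
    evidence: List[str]    # the tool outputs it says it relied on
    rationale: str         # one or two sentences linking evidence to decision


def to_audit_line(trace_id: str, record: DecisionRecord) -> str:
    # Refuse to log a decision that arrives without a stated reason.
    if not record.rationale.strip():
        raise ValueError("decision record has no rationale")
    return json.dumps({"trace_id": trace_id, **asdict(record)}, sort_keys=True)
```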

Your Embeddings Don't Know the Contractor Was Off-Boarded

· 9 min read
Tian Pan
Software Engineer

A contractor finished a six-month engagement last quarter. HR ran the off-boarding checklist: SSO disabled, laptop wiped, GitHub seat removed, Slack archived, Notion access revoked. Compliance signed off. Six weeks later, an internal RAG assistant answered a question by quoting a confidential strategy document the contractor had authored — and the chunk it cited was still tagged with the contractor's user ID in the vector store's allow-list. Nothing in the access logs of the source-of-truth ever recorded a read, because there was no read. The retrieval came from a copy of the data that nobody wired into the off-boarding flow.

This is the structural problem nobody puts on the architecture diagram. Your vector index is not just a similarity-search engine. It is a permission cache — a derived store of who-can-see-what, frozen at the moment you ran your embedding job — and almost nobody is invalidating it the way they invalidate everything else.
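One mitigation, sketched here under assumptions, is to treat the allow-list stored with each chunk as a hint and re-check access against the live source of truth at query time. `Chunk`, `filter_by_live_acl`, and the `can_read` callback are hypothetical names for illustration; a real system would call whatever service answers "can this user read this document right now?".

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Iterable, List


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    text: str
    # Snapshot taken when the embedding job ran; may be stale.
    allowed_user_ids: FrozenSet[str]


def filter_by_live_acl(
    chunks: Iterable[Chunk],
    user_id: str,
    can_read: Callable[[str, str], bool],
) -> List[Chunk]:
    # Ignore the frozen allow-list and ask the live permission system instead.
    return [c for c in chunks if can_read(user_id, c.doc_id)]
```

This adds a permission lookup to every retrieval, so the other option is to keep the cached allow-list and wire the off-boarding flow to invalidate it.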

Shadow Replay Punishes the Model That Would Have Changed the Conversation

· 9 min read
Tian Pan
Software Engineer

A team I worked with last quarter shipped a new model into shadow replay and watched the win rate sit at 47 percent against the incumbent. Same prompts, same retrieval, a model the vendor's own evals had ranked clearly higher. The shadow harness took last week's production traffic, pumped it through the candidate, fed both responses to an LLM judge, and declared the upgrade roughly a coin flip. The team almost reverted on the spot.

The problem was not the model. The problem was that every user message in the replay had already been conditioned on the old model's previous turn. The candidate wrote a better answer at turn one, the user in the log replied to a different answer that no longer existed, and from turn two onward the judge was scoring a conversation that was not happening. A genuinely better model that changes what the user does next has no ground truth to be scored against. The replay quietly rewards staying on the old rails.
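One partial workaround is to score each turn against the history that actually happened: the candidate is given the logged conversation up to that point, including the incumbent's earlier replies, and only its next reply is judged. The helper below is a minimal sketch of that slicing, with a hypothetical name and message shape. It does not solve the deeper problem, since it still cannot measure a model that would have steered the user somewhere else.

```python
from typing import Dict, Iterator, List, Tuple

Message = Dict[str, str]  # assumed shape: {"role": ..., "content": ...}


def teacher_forced_cases(
    conversation: List[Message],
) -> Iterator[Tuple[List[Message], str]]:
    """Yield (logged history, logged reply) for every assistant turn."""
    for i, msg in enumerate(conversation):
        if msg["role"] == "assistant":
            # The history is the log as it really unfolded, so the next user
            # message is never a reply to an answer that no longer exists.
            yield conversation[:i], msg["content"]
```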

The Agent That Wouldn't Stop: Scope Creep as a Runtime Failure Mode

· 9 min read
Tian Pan
Software Engineer

You asked the agent to fix a flaky test. At minute three, the test passes. At minute four, the agent is reading neighbouring files. At minute nine, it has "improved" a helper that the test never touched, renamed an unrelated parameter for clarity, and started a refactor of the fixture builder. The diff that lands is twelve files and four hundred lines. The original bug is fixed. So is some other code that wasn't broken.

This is not a model getting confused. This is a model doing exactly what its instructions left room for. The task said "fix the bug." It did not say "stop after the bug is fixed." Most agent loops have a defined start and a defined success criterion, and a very fuzzy answer to the third question: when are you done? In a chat session, "done" is whatever the user accepts. In an autonomous loop, "done" is whatever the stopping condition says, and if you didn't write one, the stopping condition is "the model lost interest." That isn't a failure mode you can debug. It's a failure mode you have to design out.
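A stopping condition can be written down as a check the loop runs after every step. The sketch below is one possible shape, assuming the task declares which paths it is allowed to touch; `stop_decision`, the glob list, and the file budget are illustrative choices rather than a prescribed design.

```python
from fnmatch import fnmatch
from typing import List, Tuple


def stop_decision(
    test_passed: bool,
    changed_files: List[str],
    allowed_globs: List[str],
    max_files: int = 3,
) -> Tuple[bool, str]:
    """Return (stop?, reason) for the current state of the agent's diff."""
    stray = [
        f for f in changed_files
        if not any(fnmatch(f, g) for g in allowed_globs)
    ]
    if stray:
        return True, f"out of scope: {', '.join(stray)}"
    if len(changed_files) > max_files:
        return True, "diff budget exceeded"
    if test_passed:
        return True, "success criterion met"
    return False, "keep working"
```

The order matters: scope and budget are checked before success, so a passing test does not excuse a diff that wandered outside the task.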

The Demo Worked Because You Were Watching: Session Length Is the Eval Dimension Your Suite Forgot

· 10 min read
Tian Pan
Software Engineer

The reliability number in your launch deck came from sessions that looked nothing like the ones your users actually run. The demo was five turns: open, ask, observe a tidy answer, refine once, conclude on a high note. The session your power user ran yesterday was thirty-one turns long, included two tool failures the agent papered over with optimism, and ended when the user gave up and opened a support ticket. Both sessions came out of the same model. The first one shipped a press release. The second one was filed under "edge case."

Session length is a dimension of evaluation, and demo culture systematically underweights it. We measure per-turn accuracy because per-turn accuracy is what fits on a slide, and then we are surprised when per-session success falls off a cliff that we never put on any chart. The cliff is not random and it is not a tail event — it is the predictable consequence of compounding error, attention drift, and committed assumptions that the model will not revisit. The question every team should be asking is not "how good is the model" but "how good is the model at turn twenty-eight, given everything we said at turns one through twenty-seven."
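The compounding-error part can be made concrete with a back-of-the-envelope model. Assume, as a simplification, that each turn succeeds independently with probability $p$ and that a session succeeds only if every one of its $n$ turns does. The value $p = 0.98$ is an illustrative number, not a measurement.

```latex
P(\text{session succeeds}) = p^{\,n}

p = 0.98:\quad
0.98^{5} \approx 0.90, \qquad
0.98^{31} \approx 0.53
```

Under these assumptions the same per-turn accuracy that looks like 90 percent on a five-turn demo is close to a coin flip on a thirty-one-turn session. Real sessions are not independent turn to turn, so this is a rough intuition rather than a prediction.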

The Kill Switch Nobody Wired Because the Feature Never Failed

· 10 min read
Tian Pan
Software Engineer

The launch flag worked perfectly. You shipped the AI summarizer behind it, ramped 1% to 10% to 50% to 100% over two weeks, watched the dashboards, saw nothing on fire, and at the end of the quarter the platform team's flag-hygiene bot opened a PR to delete the now-redundant gate. You approved it. The PR merged with the rest of the expired-flag cleanup, and the codebase got 200 lines lighter. Six weeks later, at 2 a.m., the provider rolls a fresh model snapshot, your summarizer starts confidently fabricating clauses into legal documents, and your on-call engineer discovers there is no fast lever to turn it off, only a deploy.

The flag did its job. The flag was the wrong artifact to keep. A launch flag answers "should this new code path be reachable?" and once everyone agrees yes, deleting it is the correct hygiene move. A kill switch answers "is the upstream model behaving today?" — and that question never expires, because the upstream model never stops changing. Cleaning them up together is the same category error as treating a smoke detector like a construction permit: the permit gets archived once the building is up, but the detector stays wired forever because the thing it watches for can still happen.
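A kill switch in this sense can be as small as a runtime setting that is read on every request and never scheduled for cleanup. The sketch below assumes a hypothetical `read_setting` lookup (for example a config service or key-value store) and a hypothetical setting name; neither comes from a specific flag product.

```python
from typing import Callable, Optional

KILL_SWITCH_KEY = "summarizer.kill_switch"  # hypothetical setting name


def summarizer_enabled(read_setting: Callable[[str], Optional[str]]) -> bool:
    # An unset switch means "on": the feature is fully launched.
    value = read_setting(KILL_SWITCH_KEY)
    return value is None or value.strip().lower() != "on"


def summarize_or_skip(
    text: str,
    read_setting: Callable[[str], Optional[str]],
    summarize: Callable[[str], str],
) -> Optional[str]:
    # Checked per request, so flipping the setting takes effect without a deploy.
    if not summarizer_enabled(read_setting):
        return None  # caller shows the document without an AI summary
    return summarize(text)
```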

What You Deleted Is Invisible to Your Coding Agent

· 10 min read
Tian Pan
Software Engineer

You spent Tuesday afternoon deleting a dead utility module. You cleaned up the imports, ran the type checker, watched CI go green, and merged the PR. Wednesday morning, a fresh agent session looks at the same code, decides the codebase is "missing" a small helper, and writes the dead module back in — same name, same shape, slightly different style. The reviewer who approved the deletion yesterday now has to remember why they killed it, find the conversation that justified it, and explain it again. The agent is not malfunctioning. It is doing exactly what its context says to do.

This is the structural reliability problem of coding agents that nobody is solving with prompt engineering: the agent's context starts from the repository's current state, but not from the history of why that state is what it is. The file you removed leaves no trace the agent can see. The dependency you migrated away from is just another package on npm. The flaky test you intentionally deleted is a coverage gap waiting to be "fixed." Absence — the negative space of decisions you made — is invisible.

The Nightly Batch Job That Quietly Became a Latency-Critical Service

· 10 min read
Tian Pan
Software Engineer

It started as a cron job. Every night at 2 a.m., a script woke up, pulled the day's records, ran them through a model, wrote the results to a table, and went back to sleep. It was the simplest possible shape for the problem, and for a year it was exactly the right shape. Nobody thought about it because nobody needed to.

Then someone asked if the results could be ready by 8 a.m. instead of noon. Then someone asked if a user could trigger a run for a single record on demand. Then a product manager asked if it could "feel instant" inside the app. Each request was reasonable. Each change was small. And at no point did anyone open a document titled "Re-architecting the inference pipeline," because at no point did any single change feel like a rewrite.

Eighteen months later you have a latency-critical online service wearing the body of a batch job. It has a p99 nobody measures, a queue nobody drains, and a failure mode where one bad record stalls a user-facing request because the pipeline was built to retry the whole batch. This is one of the most common architectural failures in AI systems, and it almost never shows up as a decision. It shows up as a slow accumulation of reasonable yeses.
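The "one bad record stalls everything" failure comes from retrying at batch granularity. A minimal sketch of the alternative, retrying and failing per record, is shown below; `process_records` and `run_model` are hypothetical names, and a real service would also need timeouts and backoff.

```python
from typing import Callable, Dict, Iterable, Tuple


def process_records(
    records: Iterable[Tuple[str, str]],
    run_model: Callable[[str], str],
    max_attempts: int = 2,
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Run each (record_id, payload) independently; return (results, failures)."""
    results: Dict[str, str] = {}
    failures: Dict[str, str] = {}
    for record_id, payload in records:
        for attempt in range(1, max_attempts + 1):
            try:
                results[record_id] = run_model(payload)
                break
            except Exception as exc:
                # Give up on this record only; the rest of the batch continues.
                if attempt == max_attempts:
                    failures[record_id] = repr(exc)
    return results, failures
```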

Build vs. Buy Is the Wrong Question for Your AI Feature

· 9 min read
Tian Pan
Software Engineer

Every planning meeting about an AI feature collapses into the same binary. One camp wants to "just wrap an API" and ship next sprint. The other wants to "own the model" so the company controls its destiny. The argument feels strategic. It is actually a category error.

Build vs. buy treats your AI feature as one indivisible thing that you either make or purchase. But an AI feature is not one thing. It is a stack of at least five distinct layers, and each layer has its own answer. The team that frames the decision as a single coin flip will almost always own the wrong layer and rent the wrong layer, because the question they asked could not distinguish between them.

The better question is not "can we build it?" Most things, you can build. The question is: which layer breaks our differentiation if a competitor buys the exact same thing tomorrow? That question sorts the stack for you.

The Carbon Line Item Nobody Puts in the AI Feature Spec

· 10 min read
Tian Pan
Software Engineer

Open any AI feature review and you will hear the same three numbers debated: latency, token cost, and accuracy. Someone pulls up the p95 chart, someone else does the math on cost-per-thousand-requests, and a third person argues the eval score is good enough to ship. Nobody mentions energy. Nobody mentions carbon. And because nobody mentions it, the environmental footprint of the feature still gets decided — implicitly, by whoever wins the argument about the dollar figure.

That is the quiet problem with AI sustainability. It is not that teams choose a high-carbon design on purpose. It is that they never choose at all. The footprint is a side effect of a cost decision, and cost only loosely tracks carbon. A routing rule that looks like a clean win on the spend dashboard can quietly double emissions, and no one in the room would know, because the number that would have told them was never on a dashboard.

This post treats energy and carbon as what they actually are: a measurable, ownable property of an AI system, on the same footing as latency and cost. Not a corporate-values footnote. A line item.
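A small worked example shows how cost and carbon can diverge. Emissions are estimated here as energy used multiplied by the carbon intensity of the grid supplying it. Every number and the `Route` type below are hypothetical, chosen only to illustrate the arithmetic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    name: str
    usd_per_1k_requests: float
    kwh_per_1k_requests: float
    grid_gco2_per_kwh: float  # carbon intensity of the serving region

    def gco2_per_1k_requests(self) -> float:
        # emissions = energy x grid carbon intensity
        return self.kwh_per_1k_requests * self.grid_gco2_per_kwh


# Hypothetical routes: same energy per request, different price and grid.
current = Route("region-a", usd_per_1k_requests=4.0,
                kwh_per_1k_requests=0.5, grid_gco2_per_kwh=200.0)
cheaper = Route("region-b", usd_per_1k_requests=3.0,
                kwh_per_1k_requests=0.5, grid_gco2_per_kwh=400.0)
```

In this made-up case the cheaper route saves a dollar per thousand requests and doubles the estimated emissions, from 100 g to 200 g of CO2, which is invisible unless the carbon figure is computed at all.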