
861 posts tagged with "insider"


Where You Defined 'First Token' Decided Whether Your Latency SLO Was Real

· 9 min read
Tian Pan
Software Engineer

A team I worked with last quarter shipped a reasoning-tier upgrade on a Tuesday and started getting support tickets on Wednesday. Users were saying the assistant felt "broken," "frozen," "hung." The on-call engineer pulled up the latency dashboard and found nothing unusual. p99 first-token latency was 612 ms — comfortably under the 800 ms SLO that the team had spent a quarter establishing. The dashboard was green. The phone was ringing.

The bug turned out to be a single instrumentation decision made fourteen months earlier, before reasoning models existed in production. The metric labeled "first token" measured the timestamp on the first chunk emitted by the provider. After the upgrade, the first chunk was a reasoning token — invisible to the user, never rendered, but counted as "first" by the SLO. The model was emitting four to seven seconds of internal thoughts before the first user-visible character streamed. Every chart stayed green. Every user waited in the dark.

This is not a story about a bad metric. The metric was correct for the model it was designed against. It is a story about what happens when the boundary you instrumented stops being the boundary your users feel — and how dangerously easy it is to ship that drift without noticing.
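To make the two boundaries concrete, here is a minimal sketch that records both timestamps from one stream. The `Chunk` shape and its `kind` field are assumptions for illustration; real provider SDKs label reasoning output in their own ways, and the timings below are invented to match the story, not measured.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Chunk:
    # Hypothetical chunk shape; not any specific provider's API.
    kind: str     # "reasoning" (never rendered) or "text" (rendered to the user)
    text: str
    t_ms: float   # milliseconds since the request was sent


@dataclass
class FirstTokenTimings:
    first_chunk_ms: Optional[float]    # what the original metric recorded
    first_visible_ms: Optional[float]  # what the user actually waits for


def first_token_timings(chunks: Iterable[Chunk]) -> FirstTokenTimings:
    first_chunk = None
    first_visible = None
    for chunk in chunks:
        if first_chunk is None:
            first_chunk = chunk.t_ms
        # Only non-empty, user-visible text counts as "first token" for the user.
        if chunk.kind == "text" and chunk.text:
            first_visible = chunk.t_ms
            break
    return FirstTokenTimings(first_chunk, first_visible)


# Illustrative stream: reasoning chunks arrive quickly, visible text arrives late.
stream = [
    Chunk("reasoning", "considering the request", 612.0),
    Chunk("reasoning", "still considering", 3100.0),
    Chunk("text", "Here", 5400.0),
]
timings = first_token_timings(stream)
```

Emitting both numbers side by side is the design choice that matters here: a dashboard that only shows `first_chunk_ms` stays green, while `first_visible_ms` is the one an 800 ms SLO was presumably meant to protect.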

The Branch State Your Coding Agent Forgot to Check

· 10 min read
Tian Pan
Software Engineer

Your coding agent does not know which branch it is on. It thinks it does. It saw a git status output twelve turns ago, it has a CLAUDE.md in its context that mentions the branch name the session opened against, and it watched a tool result list five files that were the right files at the time. The agent has been quietly reasoning against that snapshot ever since. Meanwhile, in a second terminal, you ran git checkout main. The agent's diff lands cleanly on the file system because the OS does not care which branch the bytes belong to. The diff is semantically wrong because the agent's mental model of the branch is stale by three hundred commits and the parent it was reasoning against no longer exists in your working tree.

This is branch-state drift, and it is the coding-agent analog of a read-modify-write race in a database. The agent reads the world at turn N, modifies its plan across turns N+1 through N+k, and writes back to disk at turn N+k+1 — and somewhere in that window the world changed underneath it. No exception fires. No tool returns an error. The patch applies. The harm shows up downstream: a PR opened against the wrong base, a hand-written commit that silently reverts an intervening fix, a feature implemented against a schema that was migrated yesterday.
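One way to narrow that window is to re-read the repository state immediately before writing and refuse to write if it differs from the snapshot the plan was built on. The sketch below is an assumption about how such a guard might look, not a description of any particular agent harness; `RepoState`, `guarded_write`, and `StaleRepoState` are hypothetical names. It shells out to `git rev-parse`, and it shrinks the race rather than closing it, since the branch can still change between the check and the write.

```python
import subprocess
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class RepoState:
    branch: str  # "HEAD" when the checkout is detached
    head: str    # full commit hash


def read_repo_state(cwd: str = ".") -> RepoState:
    def git(*args: str) -> str:
        result = subprocess.run(
            ["git", *args], cwd=cwd, check=True, capture_output=True, text=True
        )
        return result.stdout.strip()

    return RepoState(
        branch=git("rev-parse", "--abbrev-ref", "HEAD"),
        head=git("rev-parse", "HEAD"),
    )


class StaleRepoState(RuntimeError):
    pass


def guarded_write(
    snapshot: RepoState,
    write: Callable[[], None],
    read_state: Callable[[], RepoState] = read_repo_state,
) -> None:
    # Compare the state the plan was built against with the state right now.
    current = read_state()
    if current != snapshot:
        raise StaleRepoState(
            f"planned against {snapshot.branch}@{snapshot.head[:8]}, "
            f"but the working tree is now {current.branch}@{current.head[:8]}"
        )
    write()
```

In use, the agent would take the snapshot at the turn where it reads the files, carry it through planning, and pass it to `guarded_write` at the turn where it applies the patch.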

The Chunk Boundary That Bisected the Sentence Your Answer Depended On

· 9 min read
Tian Pan
Software Engineer

Your RAG pipeline chunks documents into 512-token spans with 50-token overlap. It is a clean industry default. Somewhere in your corpus there is a sentence — "Refunds are processed within five business days unless the order originated from the EU region, in which case the regulatory window is fourteen days" — that landed across a chunk boundary. Chunk N contains the first half. Chunk N+1 contains the second.

A user asks "how long do EU refunds take." Retrieval scores chunk N highest because the query embedding aligns with "EU region" in the first fragment. Chunk N+1, which contains the only actual answer, ranks too low to be retrieved alongside it. The agent answers "five business days" with a confident citation to chunk N. The customer is in Frankfurt. The answer is wrong. The pipeline behaved exactly as designed.

This is the failure mode that does not show up in your chunk-quality eval. The chunks are well-formed. The corpus is well-formed. The embedding model is well-formed. The boundaries between chunks — the lines you drew through your own documents — are where the answer lives.
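A toy reproduction makes the split visible. The sketch below uses whitespace-separated words as a stand-in for tokens and deliberately tiny sizes so the refund sentence is bisected; the neighbor expansion at the end is one possible mitigation among several, shown as an assumption rather than a recommended fix.

```python
from typing import List


def chunk_words(text: str, size: int, overlap: int) -> List[str]:
    # Fixed-size chunking with overlap, measured in words instead of tokens.
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break
        start += step
    return chunks


def expand_with_neighbors(chunks: List[str], hit: int, radius: int = 1) -> List[str]:
    # Return the retrieved chunk together with its adjacent chunks.
    lo = max(0, hit - radius)
    return chunks[lo:hit + radius + 1]


sentence = (
    "Refunds are processed within five business days unless the order "
    "originated from the EU region, in which case the regulatory window "
    "is fourteen days"
)
chunks = chunk_words(sentence, size=16, overlap=2)

# The first chunk mentions the EU and "five", but never "fourteen".
top_hit_only = chunks[0]
# Pulling in the neighbor restores the second half of the sentence.
with_neighbors = " ".join(expand_with_neighbors(chunks, hit=0))
```

Here `top_hit_only` carries the query's keywords and the wrong number, while `with_neighbors` contains the clause with the actual answer. The cost is a larger context per hit, which is the trade the boundary forces on you.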

The Coding Agent That Passes Locally and Fails in CI

· 11 min read
Tian Pan
Software Engineer

The agent's diff was green on your laptop. Tests passed, lint passed, the dev server hot-reloaded clean. You let it open the PR, and ninety seconds later CI is red on a step that has nothing to do with the change: a missing CLI, an env var the agent never declared, a Node version that resolves differently because your .nvmrc resolves through a global shim that the runner does not have. The agent did not write a broken diff. It wrote a diff that depends on your machine, and your machine and the runner are not the same computer.
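One way to surface those hidden dependencies is to write them down as a checkable list and run the check in both places. This is a minimal sketch under assumed names: `EnvContract` and `missing_requirements` are hypothetical, and the entries in the example contract are placeholders, not a claim about what any given repo needs.

```python
import os
import shutil
from dataclasses import dataclass
from typing import Callable, List, Mapping, Optional


@dataclass
class EnvContract:
    executables: List[str]  # CLIs the change expects on PATH
    env_vars: List[str]     # environment variables the change expects to be set


def missing_requirements(
    contract: EnvContract,
    which: Callable[[str], Optional[str]] = shutil.which,
    env: Optional[Mapping[str, str]] = None,
) -> List[str]:
    env = os.environ if env is None else env
    missing = [f"executable:{name}" for name in contract.executables if which(name) is None]
    missing += [f"env:{name}" for name in contract.env_vars if name not in env]
    return missing


# Placeholder contract; the names are illustrative only.
contract = EnvContract(executables=["node", "jq"], env_vars=["API_BASE_URL"])

# Simulated laptop: everything happens to be present.
laptop = missing_requirements(
    contract, which=lambda name: f"/usr/local/bin/{name}", env={"API_BASE_URL": "x"}
)
# Simulated runner: one CLI and the env var are absent.
runner = missing_requirements(
    contract, which=lambda name: "/usr/bin/node" if name == "node" else None, env={}
)
```

The laptop run reports nothing missing and the runner run names exactly what it lacks, which turns a red CI step with an unrelated-looking error into a one-line list.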

"Works on my machine" was a human bug. The fix was discipline — pin versions, write Dockerfiles, read the CI logs. Coding agents inherited the bug at scale and removed the discipline that used to compensate, because the agent does not know which of the things it relied on came from the repo and which came from the warm sediment of your shell history. Every developer's laptop is a uniquely configured environment that the agent absorbs without naming. Then the same agent runs in a runner that is none of those things, and the failure mode looks like the agent's fault when it is actually an environmental contract that nobody wrote down.

The Conversation Tree Your Server Stored As A Log

· 10 min read
Tian Pan
Software Engineer

A user types "actually, I meant fifty, not fifteen," hits the pencil icon on their last message, and edits it. The UI does what good UIs do: it shows them the corrected message, fades out the old one, scrolls the assistant's stale reply into a struck-through ghost, and presents a clean conversation that reads as if the original mistake never happened. The user, satisfied, sends the next turn. The agent answers using fifteen.

The bug is not in the model. The model received exactly what the server sent it, and the server sent it the original message, the original assistant response, the regret, the edited message, and the new request — all concatenated, all in order, all live. The user is having a conversation they edited. The agent is having a conversation that was never edited. The two transcripts diverge at turn three and never reconcile, and every subsequent turn pays interest on the gap.
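The difference between a log and a tree can be sketched in a few lines. In the version below, an edit creates a sibling message with the same parent instead of appending to the end, and the prompt is built by walking from the current leaf back to the root. The `Message` shape, the ids, and the example contents are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Message:
    id: str
    parent_id: Optional[str]
    role: str
    content: str


def active_path(messages: Dict[str, Message], leaf_id: str) -> List[Message]:
    # Walk parent pointers from the current leaf to the root, then reverse.
    path = []
    current: Optional[str] = leaf_id
    while current is not None:
        message = messages[current]
        path.append(message)
        current = message.parent_id
    path.reverse()
    return path


store = {
    m.id: m
    for m in [
        Message("s0", None, "system", "You are an ordering assistant."),
        Message("u1", "s0", "user", "Order fifteen units."),
        Message("a1", "u1", "assistant", "Ordering fifteen units."),
        # The edit is a sibling of u1, not a sixth entry at the end of a log.
        Message("u1-edit", "s0", "user", "Order fifty units."),
        Message("u2", "u1-edit", "user", "What is the total?"),
    ]
}

sent_to_model = [m.content for m in active_path(store, "u2")]
```

The abandoned branch (`u1`, `a1`) stays in storage for audit purposes but never reaches the model, so the transcript the agent sees matches the one the user sees.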

The Demo You Recorded in March Was the Last Time It Worked

· 8 min read
Tian Pan
Software Engineer

A sales engineer at a Series B AI company recorded a five-minute walkthrough on a Tuesday in March. The agent picked the right tool on the first try, framed the answer in the buyer's vocabulary, and refused a gnarly edge case with a politeness that landed as "thoughtful, not hedging." That recording went into the asset library. Over the next seven weeks it closed five deals.

By the time the sixth prospect watched it on an onboarding call in late May, the model had received a provider point-release that re-tuned its refusal phrasing, the prompt had been edited twice to fix an unrelated regression, the tool catalog had grown by three entries (one of which the model now preferred), and the RAG corpus had been re-indexed against a new chunker. The demo was no longer a recording of the product. It was a recording of a product that no longer existed.
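The four things that changed in that paragraph (model release, prompt, tool catalog, chunker) can be folded into a single fingerprint stored next to the recording. The sketch below is one possible shape, assuming you can name each component; every identifier and version string in it is made up for illustration.

```python
import hashlib
import json
from typing import List


def demo_fingerprint(
    model_version: str, prompt: str, tool_names: List[str], chunker: str
) -> str:
    # Hash the prompt separately so the payload stays small and order-stable.
    payload = json.dumps(
        {
            "model": model_version,
            "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
            "tools": sorted(tool_names),
            "chunker": chunker,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


# Hypothetical values standing in for the state at recording time and today.
recorded = demo_fingerprint(
    "model-march", "You are a helpful agent.", ["search", "quote"], "chunker-v1"
)
current = demo_fingerprint(
    "model-may",
    "You are a helpful agent. Be brief.",
    ["search", "quote", "crm", "calendar", "email"],
    "chunker-v2",
)
demo_is_stale = recorded != current
```

A fingerprint cannot say whether behavior changed, only that an input did. That is still enough to flag a recording for re-review before the next prospect watches it.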

The Eval Set That Started Leaking Into Your Prompt

· 10 min read
Tian Pan
Software Engineer

The benchmark number went up for four quarters in a row. User satisfaction did not. Nobody on the team could explain the gap until someone diffed the prompt template and noticed that the few-shot examples were being pulled from the same CSV that the evaluator was reading. The eval set had quietly become the in-context examples. The number was no longer measuring generalization. It was measuring how well the model could copy the nearest neighbor of a question whose answer it had just been shown.

This is the failure mode I want to name: eval-to-prompt leakage. It is structurally identical to test-set contamination in classical machine learning, but it happens through a back channel the team built deliberately. Few-shot retrieval is a reasonable engineering move. Eval banks are a reasonable engineering artifact. The contamination emerges when the two converge on the same storage layer without anyone naming the boundary.
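Naming the boundary can start with a check that fails the build when the two pools overlap. The sketch below only catches verbatim overlap after light normalization; near-duplicates and paraphrases would need something stronger, such as embedding similarity. The function names and sample questions are hypothetical.

```python
from typing import Iterable, List


def normalize(question: str) -> str:
    # Lowercase and collapse whitespace so trivial differences do not hide overlap.
    return " ".join(question.lower().split())


def leaked_examples(few_shot_pool: Iterable[str], eval_set: Iterable[str]) -> List[str]:
    eval_keys = {normalize(q) for q in eval_set}
    return [q for q in few_shot_pool if normalize(q) in eval_keys]


eval_questions = ["How do I reset my password?", "What is the refund window?"]
few_shot_questions = ["how do I reset   my password?", "How do I change my email?"]

leaks = leaked_examples(few_shot_questions, eval_questions)
```

Run as a pre-merge check, a non-empty `leaks` list is the signal that the few-shot retriever and the evaluator are reading from the same source.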

The Free Trial That Burned Your Quarterly Inference Budget in Eleven Hours

· 11 min read
Tian Pan
Software Engineer

Your trial offered "100 generations per day." Your pricing team modeled an interested user kicking the tires for a week. The first trialist who points an agent at the endpoint runs through the daily quota in seventy seconds, the weekly quota in nineteen minutes, and the quarterly inference budget by lunch the next day. Nobody was alerted, because the only alert wired up was the one that fires when a trial converts.

The trial limits were not wrong when they were written. They were calibrated for a usage distribution that no longer describes the modal user. Somewhere between the pricing review six months ago and the signup that arrived this morning, the population shifted from humans clicking buttons to programs that don't get tired. The numbers on the dashboard stopped meaning what they meant when you set them.
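A count-based quota says nothing about how fast it was consumed, and speed is the signal that separates the two populations. Here is a rough sketch of a burn-rate check; the ten-minute threshold is an arbitrary assumption, not a calibrated value, and the function names are hypothetical.

```python
from typing import List, Optional


def quota_burn_seconds(timestamps: List[float], quota: int) -> Optional[float]:
    # Seconds between the first request and the one that exhausts the quota.
    if len(timestamps) < quota:
        return None
    ordered = sorted(timestamps)
    return ordered[quota - 1] - ordered[0]


def looks_automated(
    timestamps: List[float], quota: int, min_human_seconds: float = 600.0
) -> bool:
    # Flag accounts that exhaust the quota faster than a person plausibly could.
    burn = quota_burn_seconds(timestamps, quota)
    return burn is not None and burn < min_human_seconds


# An agent firing a request every 0.7 seconds exhausts 100 generations in about 70 seconds.
agent_requests = [i * 0.7 for i in range(100)]
# A person sending one request a minute takes over an hour and a half.
human_requests = [i * 60.0 for i in range(100)]

agent_flagged = looks_automated(agent_requests, quota=100)
human_flagged = looks_automated(human_requests, quota=100)
```

Wiring `looks_automated` to an alert gives the team a signal on day one of a trial instead of at the end of the quarter.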

The Inner Loop Your Coding Agent Quietly Broke

· 8 min read
Tian Pan
Software Engineer

The productivity claim around coding agents is that they remove the typing bottleneck. The bottleneck the engineer actually hits in practice is different. The engineer can no longer hold the system in their head, because the agent is editing files faster than the engineer can read them, writing tests faster than the engineer can reason about coverage, and refactoring abstractions faster than the engineer can verify they still type-check at the design level rather than just the compiler level.

The tight inner loop — hypothesize, change, observe, refine — that defines competent engineering quietly collapses into a different loop. The engineer is now reviewing agent output rather than building intuition about the system. A METR randomized controlled trial from mid-2025 found experienced open-source developers were 19% slower on familiar codebases when using AI assistants, while reporting they felt 20% faster. The 39-point gap between perceived and actual productivity is not a measurement error. It is the sound of comprehension being silently traded for throughput.

The Latency Budget Your Agent Loop Stole from the Search Box

· 12 min read
Tian Pan
Software Engineer

The launch metrics looked clean. Answer quality up, citation rate up, the eval suite green. The team that replaced the old keyword search with an agent-backed retriever shipped, took the win, and moved on. Six weeks later somebody noticed the weekly active-user count on that surface had drifted down twelve percent, and nobody could find the regression. There was no regression. The agent worked. The users left because the box that used to answer in two hundred milliseconds now took four seconds, and nothing in the launch retro had a budget for that.

This is the latency-budget transfer problem, and almost nobody draws the org chart that catches it. A search box is not just a function call. It is a thirty-year contract with the user's nervous system: type, see results, scan, click. The 200-millisecond response is not a performance metric on a dashboard somewhere — it is the reason the user's attention is still on the screen when the results arrive. When the team underneath the box replaces a keyword index with an agent loop, the function-call surface looks identical and the SLA on the new call lives in a completely different regime. The latency budget moved from the team that owned the index to the team that owns the agent, and from the team that owns the agent to the user, and the only one who showed up to the meeting was the user.

The Multi-Agent Deadlock That Hangs on Two Calendars

· 10 min read
Tian Pan
Software Engineer

Agent A asks Agent B for a piece of data it needs to finish its task. Agent B, before answering, asks Agent A for a piece of context it needs to produce that data. Both requests cross a "human review required" boundary on the way out. The first request lands in a Slack approval channel watched by Priya. The second lands in a Jira queue watched by Marcus. Priya is at lunch. Marcus is in a customer call. Neither knows the other exists. The workflow hangs for nineteen hours, and nobody notices until a customer escalation forces somebody to ask why the rollup never landed.

This is not a novel failure. It is the oldest failure in distributed systems, wearing a new costume. The Coffman conditions (mutual exclusion, hold and wait, no preemption, circular wait) were named in 1971, and a multi-agent system with human-in-the-loop approval queues satisfies all four by default. The new wrinkle is that one of the "resources" in the deadlock is a person's attention, which means your liveness guarantee is now bounded by how quickly two humans who don't know they're paired can independently context-switch.
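Circular wait is the condition that is cheapest to detect mechanically. The sketch below runs a depth-first search over a wait-for graph and returns the first cycle it finds. Modeling the two approval queues as nodes in the graph is an assumption made for illustration, and the node names are invented.

```python
from typing import Dict, List, Optional


def find_cycle(waits_for: Dict[str, List[str]]) -> Optional[List[str]]:
    # Depth-first search; a node seen again while still on the stack closes a cycle.
    WHITE, GRAY, BLACK = 0, 1, 2
    color: Dict[str, int] = {}
    stack: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        color[node] = GRAY
        stack.append(node)
        for nxt in waits_for.get(node, []):
            state = color.get(nxt, WHITE)
            if state == GRAY:
                return stack[stack.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        stack.pop()
        color[node] = BLACK
        return None

    for node in list(waits_for):
        if color.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None


# Each edge reads "is waiting on". The approval queues sit between the agents.
waits_for = {
    "agent_a": ["slack_approval_priya"],
    "slack_approval_priya": ["agent_b"],
    "agent_b": ["jira_approval_marcus"],
    "jira_approval_marcus": ["agent_a"],
}
cycle = find_cycle(waits_for)
```

Run periodically against the live set of pending requests, a non-empty `cycle` tells both reviewers that they are paired, which is the one fact neither queue surfaces on its own.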

The On-Call Runbook That Assumed a Human Would Read the Page

· 11 min read
Tian Pan
Software Engineer

The page fired at 02:14. The runbook said "page the engineer." The engineer's name resolved to an on-call rotation. The rotation pointed at a Slack channel that the team had wired up six months ago as a unified triage surface. The first message in the channel was the alert. The second message, posted nineteen seconds later, was a calm three-sentence summary: the alerting service, the failing dependency, the last deploy. It was well-written. It ended with "Acknowledged."

The incident commander, watching from her phone in bed, read "Acknowledged" and went back to sleep. Nobody had acknowledged. The agent, subscribed to that channel as a first-line triage helper, had restated the alert back to the room and signed off with the verb the channel's other readers used to mean "I have the context to act on this." The incident ran unowned for forty-one minutes until a customer ticket woke a different engineer through a different surface.
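One way to keep the word from standing in for the act is to make ownership a structured event rather than something inferred from message text. The sketch below assumes a channel where every event carries an author kind and an event kind; both fields and all the names are hypothetical, not features of any specific paging or chat tool.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class ChannelEvent:
    author: str
    author_kind: str  # "human" or "agent"
    kind: str         # "message" or "ack"
    text: str = ""


def incident_owner(events: Iterable[ChannelEvent]) -> Optional[str]:
    # Only an explicit ack event from a human assigns ownership.
    # The text of a message is never inspected, whatever it says.
    for event in events:
        if event.kind == "ack" and event.author_kind == "human":
            return event.author
    return None


timeline = [
    ChannelEvent("alertmanager", "agent", "message", "Checkout error rate is elevated."),
    ChannelEvent("triage-bot", "agent", "message", "Summary of the alert. Acknowledged."),
]
owner_before = incident_owner(timeline)

timeline.append(ChannelEvent("oncall-engineer", "human", "ack"))
owner_after = incident_owner(timeline)
```

With this shape, the escalation timer keeps running while `incident_owner` returns `None`, no matter how reassuring the channel reads.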