907 posts tagged with "insider"

The Annotation Queue Your Humans Quietly Stopped Reading

· 10 min read
Tian Pan
Software Engineer

Your eval pipeline emits 800 traces per week for human review. Your annotators have about ninety minutes a week budgeted for it. They open the queue, grade the first three, mark a few more as "skip," and close the tab. The leaderboard you stare at on Monday morning is now a survey of which traces happened to land near the top of the list, not a measurement of system quality.

This is not a labeling problem. It is a throughput problem dressed up as a quality problem, and it is one of the quietest ways an evaluation program degrades. The traces still flow. The dashboards still render. The number still moves. What you do not see is that the denominator of your "human-graded eval score" silently shrank to a handful of items chosen by an ordering function nobody designed on purpose.
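One way to keep the ordering function from choosing the sample by accident is to draw the reviewed subset at random and report coverage next to the score. The sketch below is illustrative only: `sample_for_review` and `coverage` are hypothetical helper names, and the budget of 30 items is an assumption, not a figure from the post.

```python
import random


def sample_for_review(trace_ids, budget, seed=0):
    """Pick `budget` traces uniformly at random instead of the top of the queue."""
    ids = list(trace_ids)
    if budget >= len(ids):
        return ids
    # A seeded generator keeps the weekly sample reproducible for audits.
    return random.Random(seed).sample(ids, budget)


def coverage(graded, emitted):
    """Fraction of emitted traces that a human actually graded."""
    return graded / emitted if emitted else 0.0


weekly_sample = sample_for_review(range(800), budget=30)
print(len(weekly_sample), coverage(3, 800))
```

Publishing `coverage` beside the human-graded score makes a shrinking denominator visible instead of silent.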

The Are-You-Sure Confirmation Step Your Users Learned to Click Through

· 11 min read
Tian Pan
Software Engineer

The confirmation dialog is the cheapest safety layer in the AI agent toolbox. It's a string, a button, and a callback. The product manager who asked for it left the meeting believing the agent was now safe. The engineer who built it shipped it in an afternoon. The compliance reviewer who audited it ticked the box. And the user who saw it for the seventh time that morning had already moved their mouse to the Confirm button before their eyes finished reading the title.

Within a week, the confirmation step is no longer a decision point. It's a rhythm. The agent says "are you sure you want to send this email?" and the user says yes the way they say bless-you at a sneeze. The day the agent proposes an action that is actually wrong — wrong recipient, wrong amount, wrong tone — the user confirms it with the same automaticity they used for the six correct ones before it, and the email goes out, and the team writes a postmortem that says "user error."

It wasn't user error. It was a system that mistook the existence of a click for the existence of consent.

The Async Tool Call Your Agent Fired and Forgot

· 10 min read
Tian Pan
Software Engineer

The clearest sign that an agent's tool-call abstraction is broken is when the trace shows the step marked done and the downstream system shows nothing happened. The model called a tool, received a job ID back, treated the job ID as the answer, and moved on. Three minutes later the actual work either succeeded with nobody listening or failed with the error landing in a log nobody reads. The user sees a confident summary; the operations queue sees a stranded task.

This is the failure mode the function-calling abstraction quietly enables. JSON schemas describe parameters and return types, but they do not distinguish between "this tool returns a result" and "this tool returns a receipt for an operation whose result you will need to ask about later." The model treats both the same way, because to the planner they look the same — a successful tool call with a non-error payload.
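A minimal sketch of making that distinction explicit in the type of the tool's return value, so the planner cannot treat a receipt as an answer. `Result`, `Receipt`, `resolve`, and the shape of the `poll` callback are assumptions made for illustration, not part of any function-calling API.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """The tool finished and this is its answer."""
    value: object


@dataclass(frozen=True)
class Receipt:
    """The tool only accepted the work; the outcome must be fetched later."""
    job_id: str


def resolve(outcome, poll, max_polls=10, delay_s=0.0):
    """Return a final value. A Receipt is polled until the job settles."""
    if isinstance(outcome, Result):
        return outcome.value
    for _ in range(max_polls):
        # Hypothetical status shape: {"state": "...", "value": ..., "error": ...}
        status = poll(outcome.job_id)
        if status["state"] == "succeeded":
            return status["value"]
        if status["state"] == "failed":
            raise RuntimeError(f"job {outcome.job_id} failed: {status.get('error')}")
        time.sleep(delay_s)
    raise TimeoutError(f"job {outcome.job_id} did not finish after {max_polls} polls")
```

With this shape, a step is only marked done after `resolve` returns, and a failed job surfaces as an exception in the trace rather than in a log nobody reads.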

The autoscaler that scaled to zero mid-decode: when inference is treated like stateless web traffic

· 12 min read
Tian Pan
Software Engineer

The cluster did exactly what we told it to. Traffic dropped to zero for forty-five seconds, the queue-depth metric flatlined, KEDA flipped the replica count from one to zero, and the node autoscaler reclaimed the H100 node ninety seconds later. The graph looked clean. The Slack channel was quiet. The cost dashboard ticked down half a cent.

An hour and twelve minutes later, a customer support ticket arrived: a long-running document-analysis job — a 180k-token reasoning task that was budgeted for twenty-eight minutes of decode — had vanished. No error in their client SDK. No exception in our application logs. Only a single 499 line buried in the gateway access log, timestamped roughly when the scheduler had decided the pod was idle and reaped it.

The Budget Cap That Fires After the Action Already Shipped

· 9 min read
Tian Pan
Software Engineer

A single power user burns through your monthly token budget by 9am on day three. The kill-switch fires correctly — the gateway returns 429, the model calls stop, the bill flatlines. Meanwhile the agent has already booked the flight, sent the email confirmation, and closed the support ticket as resolved. The dashboard says "spend halted." The user says "why did you charge me for a trip I never asked for." Both are right. The budget cap stopped the model from thinking. It did not stop the world from changing.

This is the failure mode that almost every agent budget guardrail ships with: the cap is a signal in the spend plane, but the damage lives in the action plane, and the two planes were wired up with no shared transaction boundary. Telling the model to stop is not the same as telling the world to undo what the model just did.
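One way to give the two planes a shared boundary is to route every side-effecting action through a gate that checks the spend state at commit time. The sketch below is an assumption-heavy illustration: `ActionGate`, `SpendHalted`, and the cost units are hypothetical, and it only blocks actions that have not yet run. Compensating for actions that already shipped is a separate problem it does not address.

```python
class SpendHalted(Exception):
    """Raised when an action is attempted after the budget is exhausted."""


class ActionGate:
    def __init__(self, budget):
        self.remaining = budget
        self.executed = []

    def charge(self, cost):
        """Record model spend against the shared budget."""
        self.remaining -= cost

    def commit(self, name, action):
        """Run a side-effecting action only while budget remains."""
        if self.remaining <= 0:
            raise SpendHalted(f"refusing {name!r}: budget exhausted")
        result = action()
        self.executed.append(name)
        return result


gate = ActionGate(budget=10)
gate.commit("send_email", lambda: "sent")
gate.charge(10)  # the kill switch condition is now true
try:
    gate.commit("book_flight", lambda: "booked")
except SpendHalted as exc:
    print(exc)
```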

The Bug Report Against a Model Version You No Longer Serve

· 11 min read
Tian Pan
Software Engineer

A customer support ticket arrives on a Tuesday. The customer attached a screenshot of an output your product generated six weeks ago. They say it is wrong, or unsafe, or simply not what they expected, and they want it fixed. Your support engineer pastes the prompt back into the same API endpoint and gets a clean, reasonable answer. The bug, as far as the system can tell, does not exist.

The bug exists. The model that produced the screenshot does not. Since the customer filed the ticket, the weights behind your v1-chat endpoint have been swapped twice — once for a quality bump, once for a cost optimization — and the original checkpoint is no longer reachable. The customer's "this is broken" is now an unfalsifiable claim against a moving target, and the support team has no path to either confirm it or close it out.

This is not a quirky edge case. It is the predictable consequence of treating model versioning as an internal MLOps concern when it is actually a customer-visible product contract. The endpoint URL is stable. The artifact behind it is not. Until your support workflow, your retention policy, and your customer contract acknowledge that gap, every bug report against a rotated checkpoint will land in the same triage void.

The Compaction Strategy That Summarized Away the User's Original Question

· 10 min read
Tian Pan
Software Engineer

A user asked our support agent: "Why was invoice INV-2025-08-44719 charged twice on April 3rd?" Forty-five minutes and eighteen tool calls later, the agent confidently reported back: there was no evidence of any duplicate billing on the account that quarter. The user, understandably, escalated. When we replayed the trace, the answer became obvious. The agent had compacted its conversation at turn nine. The summary said the user was "asking about a duplicate charge in early April." It did not contain the string "INV-2025-08-44719." Every subsequent tool call — the ledger lookup, the chargeback API query, the audit log scan — was issued against a paraphrased intent, not the literal invoice number the user typed.

The bug was not in the tools. It was not in the model's reasoning. It was that our context manager had a contract with every downstream component, and nobody had written it down. The contract said: "I will preserve meaning." The components needed: "I will preserve strings."
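A minimal sketch of a "preserve strings" contract: extract literal identifiers before summarizing and re-attach any the summary dropped. The regular expression and the names `ID_PATTERN` and `compact_preserving_ids` are assumptions for illustration. A real system would need a pattern list tuned to its own identifier formats.

```python
import re

# Assumed identifier shape: uppercase prefix followed by hyphenated digit groups,
# for example "INV-2025-08-44719".
ID_PATTERN = re.compile(r"\b[A-Z]{2,}(?:-\d+)+\b")


def compact_preserving_ids(turns, summarize):
    """Summarize `turns`, then append any literal identifiers the summary lost."""
    text = "\n".join(turns)
    pinned = sorted(set(ID_PATTERN.findall(text)))
    summary = summarize(text)
    missing = [s for s in pinned if s not in summary]
    if missing:
        summary += "\nVerbatim identifiers: " + ", ".join(missing)
    return summary


turns = ["Why was invoice INV-2025-08-44719 charged twice on April 3rd?"]
lossy = lambda text: "User is asking about a duplicate charge in early April."
print(compact_preserving_ids(turns, lossy))
```

The summarizer is still free to paraphrase intent. The literal strings ride alongside it so downstream tool calls can use them verbatim.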

The Conversation Summarization That Erased the Consent Flag the User Gave You

· 11 min read
Tian Pan
Software Engineer

At turn 3, your user clicked "do not retain my code." At turn 7, they toggled off "use my conversations to improve the model." At turn 12, they opted out of cross-session memory. At turn 40, your context budget runs out. The compaction pass folds turns 1–30 into a tidy 200-token summary that reads beautifully: it captures what the user asked, what your agent did, and what came of it. At turn 41, your agent, armed with that summary and the most recent ten turns, confidently writes the user's code into a memory store, an action the user ruled out at turn 3 and again at turn 12.

Your audit log now contains a consent event at t=3, a violating action at t=41, and between them a paragraph of prose that has no field for why the action was permitted. The summarizer was trained to compress conversations, not to forward control state. Nobody told it the consent toggle was load-bearing. Nobody could have, because consent wasn't in the conversation — it was in a structured field next to it, and the structured field didn't survive the trip through summarization.
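A sketch of carrying consent as structured state that compaction copies rather than summarizes. The `Context` shape, the flag names `retain_code` and `cross_session_memory`, and the default-deny rule are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Context:
    summary: str
    recent_turns: list
    # Control state lives beside the conversation and is never summarized.
    consent: dict = field(default_factory=dict)


def compact_with_state(ctx, summarize, keep_last=10):
    """Fold older turns into the summary; copy consent flags through untouched."""
    older = ctx.recent_turns[:-keep_last]
    kept = ctx.recent_turns[-keep_last:]
    return Context(
        summary=summarize(ctx.summary, older),
        recent_turns=kept,
        consent=dict(ctx.consent),
    )


def may_write_memory(ctx):
    """Default deny: both flags must be explicitly granted."""
    return bool(ctx.consent.get("retain_code")) and bool(
        ctx.consent.get("cross_session_memory")
    )
```

The gate reads the flags, not the prose, so a well-written summary cannot silently grant a permission.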

The Data Labeler Whose Pricing Model Assumed Humans Wrote the Prompts

· 10 min read
Tian Pan
Software Engineer

Your labels-per-dollar dashboard is the most flattering line on the team review, and it is lying to you. The denominator is the per-task rate you negotiated with a labeling vendor in 2023, when a human research lead wrote each labeling prompt by hand, edited it twice, ran it past a teammate, and submitted maybe forty prompts a week. The numerator is the number of completed tasks coming back through the API. Sometime in the last three months, your team quietly stopped writing prompts by hand and started generating them with an LLM that emits a prompt every two seconds at a marginal cost rounding to zero. Your labels-per-dollar metric is going up, and the only person who knows the metric is meaningless is the account manager at the vendor who is watching their margin compress and is about to send a contract amendment your procurement team will read as a price hike.

The mismatch is not a vendor problem. It is a contract that encodes assumptions about your workflow that are no longer true, and the gap between those assumptions and your current behavior is the surplus value one side is silently absorbing until the renewal cycle forces a price-discovery conversation. The side that notices the mismatch first sets the new price.

The Embedding Model Rotation That Shadowed Your A/B Test for a Quarter

· 10 min read
Tian Pan
Software Engineer

You ran the experiment cleanly. Two arms, one feature flag, a clear metric, the stats team blessed the design. Twelve weeks later you ship the winner, and the lift quietly evaporates within a sprint. The post-mortem turns up nothing in the code, nothing in the flag rollout, nothing on the analytics side. The thing that moved was something nobody on your experimentation list owned: the hosted embedding model behind your retrieval call returned a slightly different vector for the same query in week three, in week seven, and again on the morning your readout meeting happened. Your A/B test was real. The substrate it ran on was not.

This is the failure mode every team running retrieval-augmented generation eventually walks into, and the one almost nobody designs against. The embedding endpoint is treated as a stable substrate the way Postgres is treated as a stable substrate. It is not. It is a model with a release cadence the vendor controls, a changelog you do not read, and a behavior surface that can shift without changing the dimension count, the SLA, or the API contract you signed against. The experiment you thought was measuring a feature change was measuring a retrieval regime change, with the feature flag's effect as noise on top.
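A minimal sketch of a canary check for that kind of drift: store vectors for a fixed set of queries once, re-embed them on a schedule, and flag any query whose vector moved. The function names and the 0.999 threshold are assumptions. The right threshold depends on how deterministic the endpoint is in practice.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def drifted_queries(baseline, current, threshold=0.999):
    """Return canary queries whose current vector diverges from the stored one."""
    return [
        query
        for query, vector in baseline.items()
        if cosine(vector, current[query]) < threshold
    ]


baseline = {"q1": [1.0, 0.0], "q2": [0.0, 1.0]}
today = {"q1": [1.0, 0.0], "q2": [0.5, 1.0]}
print(drifted_queries(baseline, today))
```

Logging the result of this check alongside experiment readouts gives the post-mortem something to look at besides the code and the flag rollout.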

The Escalation Path That Routes Back to the Agent

· 10 min read
Tian Pan
Software Engineer

The escalation tool was the safety net. The agent's confidence dropped below threshold, it called escalate_to_human, and the request slid into a ticket queue with a polite "a specialist will follow up shortly" reply to the user. Engineering closed the loop on the launch checklist. The on-call calendar listed humans on the receiving end.

Six months later, an audit walked the path. The escalation tool opened a Zendesk ticket. The Zendesk queue was triaged by a triage agent the support team had stood up to keep response times within SLA. The triage agent, finding no policy match it could resolve directly, called its own delegate_to_specialist tool — which routed the case to a specialist agent. The specialist agent, when uncertain, called escalate_to_human. The trace was a closed circuit. No human had touched any of the escalations the audit sampled. The human-in-the-loop the launch doc described did not exist.

The escalation interface had not failed. It had been honored at every hop. What failed was the assumption that the receiving system was a person.
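The audit described above amounts to a reachability check on the routing graph: starting from the escalation tool, can a case ever arrive at a node staffed by a person? The sketch below assumes the routing can be written down as an adjacency map. The node names are hypothetical labels for the hops in the story.

```python
def reaches_human(routes, start, human_nodes):
    """Depth-first search: True if any path from `start` ends at a human node."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        if node in human_nodes:
            return True
        if node in seen:
            continue  # already explored; this is where a closed circuit shows up
        seen.add(node)
        stack.extend(routes.get(node, []))
    return False


routes = {
    "support_agent": ["zendesk_queue"],        # escalate_to_human opens a ticket
    "zendesk_queue": ["triage_agent"],
    "triage_agent": ["specialist_agent"],      # delegate_to_specialist
    "specialist_agent": ["zendesk_queue"],     # escalate_to_human again
}
print(reaches_human(routes, "support_agent", human_nodes={"on_call_engineer"}))
```

Running a check like this whenever the routing configuration changes would catch the closed circuit before an audit does.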

The Eval Harness Whose Judge Model Was Upgraded Silently

· 11 min read
Tian Pan
Software Engineer

A six-point lift across every eval category arrives the same week you shipped a prompt change. The room reads it as proof the change worked. Three weeks later, someone notices the lift also showed up in categories the prompt change could not possibly have touched (a control set you keep specifically to detect this), and the lift is uniformly distributed, the kind of shape a real product improvement never has. A new judge model had been rolled out under the same endpoint name on a Tuesday. Your scores moved before your system did.

This is the failure mode that breaks LLM-as-a-judge eval pipelines more quietly than any of the failure modes the literature warns about. Not bias, not position effects, not self-preference — those are properties of a judge at a point in time, and your eval design probably already accounts for them. The one that gets you is the judge changing while you're not looking, while your endpoint name and your eval code and your dashboards all keep claiming nothing happened. The unit of measurement shifted under a stable label. Every comparison across the migration boundary is now confounded, and you cannot decompose the delta into "our system improved" and "the ruler got more generous" because you never built the instrument to do that decomposition.
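If a control set exists, a rough version of that decomposition is a difference-in-differences estimate: whatever lift appears on items the change could not have touched is attributed to the judge, and the remainder to the system. The sketch below assumes the judge shift is additive and uniform across categories, which will not always hold. The function name and score values are illustrative.

```python
from statistics import mean


def decompose_lift(treated_before, treated_after, control_before, control_after):
    """Split an observed score delta into a judge shift and a system effect."""
    observed = mean(treated_after) - mean(treated_before)
    # Control items were not affected by the change, so their movement
    # is attributed to the judge (assumption: additive, uniform shift).
    judge_shift = mean(control_after) - mean(control_before)
    return {
        "observed": observed,
        "judge_shift": judge_shift,
        "system": observed - judge_shift,
    }


print(decompose_lift([70, 72], [76, 78], [80, 80], [86, 86]))
```

In this made-up example the whole six-point lift is explained by the control set, leaving a system effect of zero.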