
578 posts tagged with "insider"


The Batch-Tier Inference Question: When 50% Off Reshapes Your Architecture

· 11 min read
Tian Pan
Software Engineer

The cheapest inference dollar in your bill is the one you're paying twice. Every major model provider now offers a batch tier at roughly half the price of synchronous inference in exchange for accepting a completion window measured in hours rather than milliseconds. Most engineering organizations either ignore the option entirely, or shove a single nightly cron at it and declare the savings booked. Both responses leave 30–50% of total inference spend on the floor — not because the discount is small, but because batch isn't a coupon. It is a different product surface with its own SLAs, its own retry semantics, and its own failure modes, and the teams that treat it as a billing optimization end up either underusing it or shipping subtle regressions that take weeks to attribute.

The technical question is not "should we use batch?" The technical question is which actions in your system are actually synchronous in the user-perceived sense, which ones the engineering org has accidentally treated as synchronous because the developer experience was easier, and which ones can be re-shaped into jobs without a downstream consumer assuming the result is fresh. Answering that requires a workload audit, an architectural shift from request-shaped to job-shaped contracts, and an honest mapping of every agent action to a latency tier based on user expectation rather than developer convenience.
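To make the audit concrete, here is a minimal sketch of the first step, assuming a hypothetical inventory of workloads: tag each one with the latency tier its users actually experience, then estimate how much spend the deferred tier could move to batch. The workload names, volumes, and the flat 50% discount are all illustrative.

```python
from dataclasses import dataclass
from enum import Enum

class LatencyTier(Enum):
    INTERACTIVE = "interactive"    # user is watching a spinner
    SAME_SESSION = "same_session"  # result needed before the session ends
    DEFERRED = "deferred"          # hours are fine; the user never sees the wait

@dataclass
class Workload:
    name: str
    tier: LatencyTier
    monthly_tokens: int  # rough volume, for estimating savings

# Hypothetical audit of one system's workloads; tiers reflect user
# expectation, not how the code happens to call the API today.
WORKLOADS = [
    Workload("chat_reply", LatencyTier.INTERACTIVE, 40_000_000),
    Workload("ticket_triage_tags", LatencyTier.DEFERRED, 25_000_000),
    Workload("nightly_doc_summaries", LatencyTier.DEFERRED, 60_000_000),
    Workload("draft_followup_email", LatencyTier.SAME_SESSION, 10_000_000),
]

BATCH_DISCOUNT = 0.5  # "roughly half price", per provider pricing pages

def batch_eligible(w: Workload) -> bool:
    # Only workloads whose consumers tolerate an hours-long completion
    # window can move to the batch tier without a product regression.
    return w.tier is LatencyTier.DEFERRED

def estimated_savings_fraction(workloads: list[Workload]) -> float:
    total = sum(w.monthly_tokens for w in workloads)
    saved = sum(w.monthly_tokens * BATCH_DISCOUNT for w in workloads if batch_eligible(w))
    return saved / total

if __name__ == "__main__":
    print(f"~{estimated_savings_fraction(WORKLOADS):.0%} of spend recoverable via batch")
```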

Cancel-Safe Agents: The Side Effects Your Stop Button Already Shipped

· 11 min read
Tian Pan
Software Engineer

A user clicks Stop because the agent misunderstood the request. The UI flashes "stopped." By the time the spinner disappears, the agent has already sent two emails, scheduled a Tuesday meeting on the user's calendar, opened a draft pull request against the wrong branch, and queued a Slack message that is racing the cancellation signal through the tool layer. The model has obediently stopped generating tokens. The world has not stopped reacting to the tokens it generated thirty seconds ago.

This is the failure mode nobody covered in the agent demo. Cancellation in synchronous code was already a hard problem with a generation of cooperative-cancellation theory behind it: Go contexts, Python's asyncio task cancellation, structured concurrency with task groups, the whole grammar of "ask politely, escalate carefully, don't leave resources behind." Agents take that already-hard problem and add a layer on top: the planner does not know that the user revoked authorization between step 4 and step 5, and the tools it kicked off in step 4 do not get a memo when step 5 is cancelled. Stop is a UI affordance. The system underneath stop has to be designed.
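As one illustration of what "designed" might mean, here is a minimal sketch of cooperative cancellation at the tool layer: every side-effecting tool checks a shared cancellation flag immediately before acting and registers a compensation step for effects that already landed. The RunContext and send_email names are hypothetical, and real compensation is rarely this clean.

```python
import threading
from typing import Callable

class Cancelled(Exception):
    pass

class RunContext:
    """Carries a cancellation flag and a compensation log for one agent run."""
    def __init__(self) -> None:
        self._cancelled = threading.Event()
        self.compensations: list[Callable[[], None]] = []  # undo actions

    def cancel(self) -> None:
        self._cancelled.set()

    def checkpoint(self) -> None:
        # Tools call this immediately before any side effect, so a Stop
        # that lands mid-plan is honored before the next email goes out.
        if self._cancelled.is_set():
            raise Cancelled()

def send_email(ctx: RunContext, to: str, body: str) -> None:
    ctx.checkpoint()
    print(f"sending email to {to}")               # the irreversible part
    ctx.compensations.append(lambda: print(f"send recall/apology to {to}"))

def run_plan(ctx: RunContext) -> None:
    try:
        send_email(ctx, "a@example.com", "draft 1")
        ctx.cancel()                               # user hits Stop here
        send_email(ctx, "b@example.com", "draft 2")
    except Cancelled:
        # Best-effort cleanup for effects that already landed.
        for undo in reversed(ctx.compensations):
            undo()

run_plan(RunContext())
```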

Chat History Is a Database. Stop Treating It Like Scrollback.

· 11 min read
Tian Pan
Software Engineer

The most common production complaint about agentic products is some version of "it forgot what we said." The complaint shows up at turn eight, or fifteen, or thirty — never at turn two — and the team's first instinct is always the same: bigger context window. Which is the wrong instinct, because the bug is not in the model. The bug is that the team is treating conversation history as scrollback in a terminal — append a line, render the tail, truncate when full — when what they actually built, without realizing it, is a read-heavy database with append-only writes, a hot working set, an eviction policy hiding inside their truncation rule, and a query pattern that depends on the kind of question being asked. Once you accept that, the entire shape of the problem changes.

The scrollback model is so seductive because the chat UI looks like a transcript. Messages flow downward, the user reads them top-to-bottom, and the natural way to feed the model is to splice the latest N turns into the prompt. The data structure feels free. There's no schema, no index, no query — just append, render, repeat. And for the first few turns, every architecture works. The model has the whole conversation in its context, the bill is small, and the demo is delightful.
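A minimal sketch of the database framing, with hypothetical names and a deliberately naive keyword index standing in for real retrieval: an append-only log, a hot tail that always ships with the prompt, and a recall path that depends on the question being asked rather than on recency alone.

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    role: str
    text: str
    turn_id: int

@dataclass
class ConversationStore:
    """Append-only log plus two read paths: a hot tail and a crude keyword index."""
    turns: list[Turn] = field(default_factory=list)
    hot_size: int = 6

    def append(self, role: str, text: str) -> None:
        self.turns.append(Turn(role, text, len(self.turns)))

    def hot_window(self) -> list[Turn]:
        # The working set: recent turns the model almost always needs.
        return self.turns[-self.hot_size:]

    def recall(self, query: str, k: int = 3) -> list[Turn]:
        # Naive stand-in for an index: score older turns by keyword overlap.
        # A real system would use embeddings; the point is that the read path
        # depends on the question, not on recency alone.
        words = set(query.lower().split())
        older = self.turns[:-self.hot_size] if len(self.turns) > self.hot_size else []
        scored = sorted(older,
                        key=lambda t: len(words & set(t.text.lower().split())),
                        reverse=True)
        return [t for t in scored[:k] if words & set(t.text.lower().split())]

def build_context(store: ConversationStore, user_msg: str) -> list[Turn]:
    # Recalled turns first (older context), then the hot tail, then the new turn.
    return store.recall(user_msg) + store.hot_window() + [Turn("user", user_msg, -1)]
```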

The Coding Agent Autonomy Curve: Reading Is Free, Merging Is Incident-Class

· 11 min read
Tian Pan
Software Engineer

The discourse on coding agents keeps collapsing to a binary: autonomous or supervised, YOLO mode or hand-on-the-wheel, --dangerously-skip-permissions or "approve every keystroke." That framing is a category error. A coding agent does not perform "an action." It performs a sequence of actions whose costs span at least seven orders of magnitude — from reading a file (free, undoable, no side effect) to merging to main (irreversible without a revert PR) to rolling out a binary to a fleet (six-figure incident-class). Treating that range with one autonomy switch is like setting a single speed limit for both a parking lot and a freeway.

The team that ships "the agent can do everything" without mapping each action to its blast radius is one prompt-injection-bearing GitHub comment away from a postmortem — and we already have public examples of that exact failure mode. Anthropic's Claude Code Security Review, Google's Gemini CLI Action, and GitHub Copilot Agent were all confirmed in 2026 to be hijackable through specially crafted PR titles and issue bodies, in an attack pattern the researchers named "Comment and Control." The agents weren't broken in some abstract sense. They executed a high-tier action — pushing code, opening a PR — on the basis of a low-trust input the autonomy tier had silently flattened into "all the same."

What follows is the discipline that has to land: a per-action curve, gates that scale with the tier, rollback velocity matched to blast class, and an eval program that tests for tool-composition escalation rather than single-action failure.
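Here is a minimal sketch of the per-action curve, assuming a hypothetical action registry: each tool maps to a blast-radius tier, and the gate considers both the tier and the trust level of the input that triggered the action, so a PR comment can never launder itself into a merge.

```python
from enum import IntEnum

class Blast(IntEnum):
    READ = 0         # read file, list tests: free, no side effect
    LOCAL_WRITE = 1  # edit working tree, run tests: undoable with git
    SHARED = 2       # push branch, open PR: visible to others, reversible
    MAIN = 3         # merge to main: needs a revert PR to undo
    FLEET = 4        # deploy or roll out: incident-class if wrong

# Hypothetical per-action map; the point is that autonomy is per action,
# not one global switch.
ACTION_TIERS = {
    "read_file": Blast.READ,
    "edit_file": Blast.LOCAL_WRITE,
    "run_tests": Blast.LOCAL_WRITE,
    "open_pr": Blast.SHARED,
    "merge_pr": Blast.MAIN,
    "deploy": Blast.FLEET,
}

def gate(action: str, *, input_trusted: bool,
         max_autonomous: Blast = Blast.LOCAL_WRITE) -> str:
    tier = ACTION_TIERS[action]
    if not input_trusted and tier >= Blast.SHARED:
        # Low-trust inputs (issue bodies, PR comments) never reach high tiers
        # without a human, regardless of the configured autonomy ceiling.
        return "require_human_approval"
    return "auto" if tier <= max_autonomous else "require_human_approval"

print(gate("run_tests", input_trusted=False))  # auto
print(gate("open_pr", input_trusted=False))    # require_human_approval
```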

Counterfactual Logging: Log Enough Today to Replay Yesterday's Traffic Against Next Year's Model

· 13 min read
Tian Pan
Software Engineer

Every LLM team eventually gets the same email from a director: "Anthropic shipped a new Sonnet. Run our traffic against it and tell me by Friday whether we should switch." The team opens the production trace store, pulls last month's requests, queues them against the new model — and three hours in, somebody asks why the diff scores look insane on tool-using turns. The answer: nobody captured the tool responses in their original form. The traces logged the model's reply faithfully and stored a one-line summary of what each tool returned. Replaying those requests doesn't replay what the old model actually saw; it replays a heavily compressed projection of it. The migration evaluation isn't measuring the new model. It's measuring the new model talking to a different reality.

This is the failure mode I want to talk about. Most production LLM logs are output-shaped: they answer "what did the model say?" reasonably well, and answer "what did the model see?" only sketchily. That asymmetry is invisible until the day you need to replay history against a new model — at which point it becomes the entire story, because the gap between what was logged and what was sent is exactly the gap between a real evaluation and a fake one.

Call it counterfactual logging: capture today the inputs you'd need to ask "what would that other model have done with this exact request?" tomorrow. The bar isn't "we logged the request." The bar is "we can re-execute the request against a different model and trust the result is meaningful."
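A minimal sketch of what a replay-grade trace might capture, with hypothetical field names: the rendered messages, the tool schemas offered, tool results in their original form, and the sampling parameters, which is enough to hand the exact same inputs to a candidate model later.

```python
import json
import time
from dataclasses import dataclass, asdict, field

@dataclass
class ReplayableTrace:
    """Everything needed to re-execute this call against a different model.

    The test is not "did we log the request id" but "could we reconstruct
    what the model saw": full messages, full tool results, the tool schemas
    offered, and the sampling parameters used.
    """
    request_id: str
    model: str
    params: dict                       # temperature, max_tokens, ...
    messages: list[dict]               # the rendered prompt, not the template
    tools_offered: list[dict]          # tool schemas included in the request
    tool_results_verbatim: list[dict]  # raw tool outputs, not summaries
    output: dict                       # what the model said (still worth keeping)
    ts: float = field(default_factory=time.time)

    def to_jsonl(self) -> str:
        return json.dumps(asdict(self), ensure_ascii=False)

def replay_request(trace: ReplayableTrace, new_model: str) -> dict:
    # Hypothetical: hand the exact logged inputs to the candidate model.
    return {
        "model": new_model,
        "messages": trace.messages,   # includes verbatim tool results
        "tools": trace.tools_offered,
        **trace.params,
    }
```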

Your Agent Has Two Release Pipelines, Not One

· 10 min read
Tian Pan
Software Engineer

A team I worked with shipped a "small prompt tweak" on a Wednesday afternoon. The same PR also added one new tool to the agent's registry — a convenience wrapper around an internal admin API that the prompt would now occasionally invoke. The eval suite passed. The canary looked clean. By Thursday morning a customer's billing record had been mutated by an agent acting on a prompt-injected support ticket, the audit trail showed the admin tool firing exactly as designed, and the on-call engineer's first instinct — roll back the prompt — did nothing useful, because the credential had already been used and the row had already been written.

The post-mortem framed it as a security review failure. It wasn't. It was a release-pipeline failure. The team had shipped two completely different asset classes — a behavioral nudge to the model and a new authority granted to the agent — through the same review, the same gate, and the same rollback story, as if they were the same kind of change. They aren't. And once you see them as two pipelines, most "agent governance" debates become much less mysterious.
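One way to make the two pipelines mechanical is a CI check that classifies each diff by asset class, assuming a hypothetical repo layout where prompts and the tool registry live in separate directories, and requires heavier gates whenever an authority change rides along.

```python
# Hypothetical repo layout: prompts live under prompts/, while the tool
# registry and policies (what the agent is *allowed* to do) live under agent/.
PROMPT_DIRS = ("prompts/",)
AUTHORITY_DIRS = ("agent/tools/", "agent/policies/")

def classify_change(changed_files: list[str]) -> set[str]:
    kinds = set()
    for f in changed_files:
        if f.startswith(PROMPT_DIRS):
            kinds.add("behavior")    # a nudge to the model
        if f.startswith(AUTHORITY_DIRS):
            kinds.add("authority")   # a new power granted to the agent
    return kinds

def required_gates(kinds: set[str]) -> list[str]:
    gates = ["eval_suite"]           # every change runs evals
    if "authority" in kinds:
        gates += ["security_review", "scoped_credential_check", "staged_rollout"]
    if kinds == {"behavior", "authority"}:
        # The Wednesday-afternoon failure mode: two asset classes in one PR.
        gates.append("split_pr_or_double_signoff")
    return gates

print(required_gates(classify_change(["prompts/support.md", "agent/tools/admin_api.py"])))
```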

Determinism Budgets: Treat Randomness as a Per-Surface Allocation, Not a Global Knob

· 11 min read
Tian Pan
Software Engineer

The temperature debate is the most religious argument in AI engineering, and one of the least productive. Two camps form on every team: the determinists who want temperature pinned at zero everywhere because they cannot debug a flaky system, and the creatives who want it cranked up because the outputs feel more "alive." Both are wrong, because both are answering the question at the wrong level. Temperature is not a global setting. It is a budget — and like any budget, it should be allocated, not declared.

The productive frame is simple: every model call in your system has a purpose, and randomness either earns its keep at that surface or it does not. A planner deciding which tool to call next has nothing to gain from variation; an off-by-one tool selection is a debugging nightmare and there is no creative upside. A response-synthesis surface that summarizes a search result for ten thousand users gets robotic in a hurry if every user sees the same phrasing — and the SEO team will eventually flag the boilerplate. A brainstorming surface where the model proposes alternatives for a human to pick from is worse at temperature 0; the diversity is the feature.

If you cannot articulate what randomness is for at a given call site, you should not be paying for it.
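A minimal sketch of a per-surface allocation, with hypothetical surface names: every call site declares its temperature and the reason it earns it, and a call site with no declared budget fails loudly instead of inheriting a global default.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SamplingBudget:
    temperature: float
    reason: str  # forces the "what is randomness for here?" conversation

# Hypothetical per-surface allocation; the key property is that no call
# site inherits a global knob it cannot justify.
BUDGETS = {
    "planner.tool_choice": SamplingBudget(0.0, "variation here is only a debugging cost"),
    "extractor.json_fields": SamplingBudget(0.0, "exact-match evals require determinism"),
    "synthesis.user_reply": SamplingBudget(0.7, "identical phrasing across users reads as boilerplate"),
    "brainstorm.alternatives": SamplingBudget(1.0, "diversity of options is the feature"),
}

def sampling_params(surface: str) -> dict:
    try:
        budget = BUDGETS[surface]
    except KeyError:
        raise KeyError(f"{surface} has no declared determinism budget") from None
    return {"temperature": budget.temperature}

print(sampling_params("planner.tool_choice"))
```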

Embedding Migrations Are the New Schema Migrations

· 12 min read
Tian Pan
Software Engineer

The first time most teams swap an embedding model in production, they treat it as a batch job. Re-run the embedder, build a new index, swap the alias, deploy. Latency stays normal. Error rates stay zero. Every query returns results. And retrieval quality silently regresses for weeks before anyone notices, because the symptom is "users complain the answers feel off," not a red dashboard.

This is not a deployment problem. It is a schema migration that the team has decided to run blind. The old embedding space and the new one are different reference frames; the cosine geometry that used to mean "these two paragraphs are about the same topic" no longer means that with the same numerical confidence. Documents and queries that used to cluster together drift apart non-uniformly. Re-rankers trained on the old distribution start firing on examples that no longer match what they learned. The eval suite that scores green on pointwise relevance misses all of it, because no individual document moved very far while the entire graph rotated.

Treat the swap like a database migration and almost everything that goes wrong becomes preventable. Treat it like a batch job and the regressions arrive on a schedule that nobody owns.
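A minimal sketch of one preventive control, using random vectors as stand-ins for real old-space and new-space embeddings: shadow-read a sample of production queries against both indexes and measure top-k overlap before the alias swap, so a rotated geometry shows up as a number instead of a vibe.

```python
import numpy as np

def top_k(query_vec: np.ndarray, doc_matrix: np.ndarray, k: int = 10) -> list[int]:
    # Cosine similarity via normalized dot product.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    return list(np.argsort(d @ q)[::-1][:k])

def shadow_overlap(queries_old, queries_new, docs_old, docs_new, k: int = 10) -> float:
    """Fraction of top-k results shared between old and new embedding spaces,
    averaged over a sample of production queries. A low score flags the
    'whole graph rotated' regression before users do."""
    overlaps = []
    for q_old, q_new in zip(queries_old, queries_new):
        old_ids = set(top_k(q_old, docs_old, k))
        new_ids = set(top_k(q_new, docs_new, k))
        overlaps.append(len(old_ids & new_ids) / k)
    return float(np.mean(overlaps))

# Illustrative smoke test: 50 sampled queries, 500 docs, old vs new model dims.
rng = np.random.default_rng(0)
docs_old, docs_new = rng.normal(size=(500, 384)), rng.normal(size=(500, 768))
qs_old, qs_new = rng.normal(size=(50, 384)), rng.normal(size=(50, 768))
print(f"mean top-10 overlap: {shadow_overlap(qs_old, qs_new, docs_old, docs_new):.2f}")
```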

Your Eval Rubric Is the Real Product Spec — and No PM Signed Off on It

· 11 min read
Tian Pan
Software Engineer

A product manager writes a paragraph: "The assistant should be helpful, accurate, and concise, and should never make the customer feel rushed." An engineer reads that paragraph, opens a YAML file, and writes 47 weighted criteria so the LLM-as-judge can produce a number on every trace. Six months later, that YAML file is the actual specification of the product. Every release is gated on it. Every regression alert fires on it. Every "this is shipping quality" decision routes through it. The PM has never read it.

This is the most common form of unintentional product ownership transfer in AI engineering today. The rubric is not a measurement of the spec — it is the spec, in the same way that a compiler is not a description of your language but the operational truth of it. And like compilers, rubrics have implementation details that silently determine semantics. Which failure mode gets a 0 versus a 0.5? Which criterion is weighted 0.3 versus 0.05? Which behavior is absent from the rubric and therefore goes uncounted entirely? Each of these is a product decision. None of them lived in the original brief.
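To see how implementation details become semantics, here is a small, hypothetical slice of such a rubric expressed as data, with the weights, the score bands, and one uncounted behavior made explicit, plus the sign-off field that is usually missing.

```python
# A slice of a hypothetical judge rubric, expressed as data. Every number
# here is a product decision wearing an engineering costume.
RUBRIC = {
    "criteria": [
        {"id": "answers_question", "weight": 0.30,
         "scores": {"no": 0.0, "partially": 0.5, "fully": 1.0}},
        {"id": "no_fabricated_facts", "weight": 0.30,
         "scores": {"fabrication": 0.0, "clean": 1.0}},
        {"id": "concise", "weight": 0.05,
         "scores": {"rambling": 0.0, "tight": 1.0}},
        # Absent entirely: "does not make the customer feel rushed."
        # Uncounted behaviors are unowned behaviors.
    ],
    "signed_off_by": None,  # if this stays None, engineering owns the spec
}

def overall(criterion_labels: dict) -> float:
    # Weighted sum over the labels the judge assigned to one trace.
    return sum(c["weight"] * c["scores"][criterion_labels[c["id"]]]
               for c in RUBRIC["criteria"])

print(overall({"answers_question": "partially",
               "no_fabricated_facts": "clean",
               "concise": "tight"}))
```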

The Eval-Set-as-Simulator Drift: When Offline Scores Improve and Production Gets Worse

· 11 min read
Tian Pan
Software Engineer

The most expensive failure mode in an LLM product is not a bad release. It is six consecutive good releases — by every internal scoreboard — while user trust quietly bleeds out. The offline eval score climbs every Friday demo. The CSAT line in the weekly business review goes flat, then dips, then nobody knows when it started dipping because nobody was triangulating the two charts. By the time a postmortem names it, the team has spent two quarters tuning a prompt against a dataset that stopped resembling reality somewhere around month three.

This is the eval-set-as-simulator drift, and it is the cleanest example I know of an old machine-learning lesson being rediscovered at full retail price by a generation of LLM teams who skipped the reading list. An eval suite is not a fixture. It is a simulator, and a simulator that is never re-calibrated against the system it claims to predict will eventually predict a different system.
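A minimal sketch of one re-calibration check, with illustrative numbers: track how well per-release offline scores rank-correlate with the production metric they claim to predict, and treat a weak correlation as a signal to re-sample the eval set from live traffic.

```python
import statistics

def rank_correlation(offline: list[float], online: list[float]) -> float:
    """Spearman-style rank correlation between per-release offline scores and
    the production metric they claim to predict (e.g., CSAT, resolution rate)."""
    def ranks(xs: list[float]) -> list[float]:
        order = sorted(range(len(xs)), key=lambda i: xs[i])
        r = [0.0] * len(xs)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    return statistics.correlation(ranks(offline), ranks(online))  # Pearson on ranks

# Illustrative last six releases: offline climbs every Friday, production does not follow.
offline_scores = [0.71, 0.73, 0.76, 0.78, 0.81, 0.84]
prod_csat      = [4.3, 4.3, 4.2, 4.2, 4.1, 4.0]

if rank_correlation(offline_scores, prod_csat) < 0.5:
    print("offline eval no longer predicts production; re-sample the eval set from live traffic")
```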

Few-Shot Rot: Why Yesterday's Examples Hurt Today's Model

· 10 min read
Tian Pan
Software Engineer

A team I worked with had a JSON-extraction prompt with eleven hand-tuned few-shot examples. On the previous model, those examples lifted exact-match accuracy by six points. After the model upgrade, the same eleven examples dragged accuracy down by two. Nobody changed the prompt. Nobody changed the eval set. The examples simply stopped working — and worse, started actively misdirecting.

That regression is not a bug in the new model. It is a rot pattern in the prompt itself, and it shows up every time a team migrates between model versions while treating the prompt as a fixed asset. Few-shot examples are not part of the prompt. They are part of the model-prompt pair. Migrating one without re-evaluating the other produces a regression that no eval suite tied to a single model version will catch.
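A minimal sketch of a migration harness that treats the few-shot block as a variable in the model-prompt pair rather than a fixed asset; the accuracy numbers here are illustrative stand-ins for real eval runs.

```python
FEW_SHOT_VARIANTS = ["none", "legacy_11_examples", "regenerated_4_examples"]
MODELS = ["old-model", "new-model"]

# Illustrative stand-ins for real eval runs, mirroring the six-point lift
# that becomes a two-point drag after the upgrade.
FAKE_RESULTS = {
    ("old-model", "none"): 0.78,
    ("old-model", "legacy_11_examples"): 0.84,
    ("old-model", "regenerated_4_examples"): 0.83,
    ("new-model", "none"): 0.86,
    ("new-model", "legacy_11_examples"): 0.84,
    ("new-model", "regenerated_4_examples"): 0.89,
}

def run_eval(model: str, variant: str) -> float:
    # In a real harness this renders the prompt with the chosen few-shot
    # block, runs the extraction eval set, and returns exact-match accuracy.
    return FAKE_RESULTS[(model, variant)]

best = {}
for model in MODELS:
    scores = {v: run_eval(model, v) for v in FEW_SHOT_VARIANTS}
    best[model] = max(scores, key=scores.get)
    print(model, scores)

if best["old-model"] != best["new-model"]:
    # The migration check: if the winning few-shot block changes, the prompt
    # is not a fixed asset and must ship alongside the model swap.
    print("few-shot block needs re-tuning for the new model")
```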

Found Capabilities: When Users Ship Features Your Team Never Roadmapped

· 10 min read
Tian Pan
Software Engineer

A customer emails support to ask why your CRM agent stopped drafting their NDAs. You did not know your CRM agent drafted NDAs. A power user complains that your support bot's Tagalog translations have gotten worse since last week. You did not know your support bot did Tagalog. A forum thread spreads a prompt that turns your code-review assistant into a passable security scanner, and within a quarter you are getting CVE reports filed against findings the assistant produced. Each of these is a feature with adoption, business impact, and zero institutional ownership — no eval, no SLA, no surface in the UX, no roadmap entry, and a quiet bus factor of one: the customer who figured it out.

This is what happens once your product is wrapped around a model whose capability surface is wider than the surface you scoped. Users explore the wider surface, find behaviors that solve their problems, build workflows on top of those behaviors, and then experience your next model upgrade as a regression even though nothing on your roadmap moved. The contract between you and your users is no longer the one you wrote down. It includes everything the model happened to do for them that you happened not to break.

Treating this as an engineering surprise — "we will harden the prompt, we will add a guardrail, we will catch it next time" — is a category error. Found capabilities are a product-management problem. The discipline is not preventing them; it is detecting them, deciding what to do with them, and remembering that you decided.
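Detection can start very small. Here is a minimal sketch, assuming a hypothetical intent classifier and a hypothetical list of scoped intents: label incoming requests, and surface anything with real volume that falls outside the scope you wrote down.

```python
from collections import Counter

# Hypothetical scoped intents: what the roadmap says the product does.
SCOPED_INTENTS = {"update_crm_record", "summarize_account", "draft_followup"}

def classify_intent(request_text: str) -> str:
    # Stand-in for a cheap classifier (or the model itself, asked to label
    # the request). Returns an intent string, scoped or not.
    text = request_text.lower()
    if "nda" in text:
        return "draft_nda"
    if "translate" in text:
        return "translate"
    return "update_crm_record"

def found_capability_report(requests: list[str], min_count: int = 2) -> list[tuple[str, int]]:
    counts = Counter(classify_intent(r) for r in requests)
    # Anything with real volume outside the scoped set is a feature users
    # shipped for you; it needs an owner, an eval, or a deliberate "no".
    return [(intent, n) for intent, n in counts.most_common()
            if intent not in SCOPED_INTENTS and n >= min_count]

sample = ["Draft an NDA for Acme", "translate this reply to Tagalog",
          "update the deal stage", "draft an NDA for Initech"]
print(found_capability_report(sample))
```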