639 posts tagged with "llm"

When to Skip Real-Time LLM Inference: The Production Case for Async Batch Pipelines

· 10 min read
Tian Pan
Software Engineer

There's a team somewhere right now watching their LLM spend grow 10x month-over-month while their p99 latency hovers around four seconds. The engineers added more retries. The retries hit rate limits. The rate limits triggered fallbacks. The fallbacks are also LLM calls. Nobody paused to ask: does this feature actually need to respond in real time?

Most AI product teams architect for the happy path — user sends a message, model responds, user sees it. The synchronous call pattern is what the API SDK demonstrates in its first code sample, and so that's what ships. But a surprisingly large share of production LLM workloads have nothing to do with a user waiting at a keyboard. They're document enrichment jobs, content classification pipelines, embedding generation tasks, nightly digest generation, and background quality scoring. For those workloads, real-time inference is the wrong tool — and the price you pay for using it anyway is real money, cascading failures, and operational complexity you'll spend months untangling.

Snapshot Tests Lie When Your Model Is Stochastic

· 11 min read
Tian Pan
Software Engineer

The first time a junior engineer on your team types --update-snapshots and pushes to main, your test suite stops being a test suite. It becomes a transcript. The diffs still render in green and red, the CI badge still flips to passing, but the signal has quietly inverted: instead of telling you whether the code is correct, the suite now tells you whether anyone bothered to look at the output. With deterministic code that risk is tolerable, because most diffs really are intentional. With a stochastic model on the other end of a network call, the same workflow turns every PR into a coin flip, and every reviewer into a rubber stamp.

Snapshot testing was a beautiful idea for a deterministic world. You record what render(<Button />) produced last Tuesday, you assert that this Tuesday it produces the same string, and any diff is, by definition, a behavior change worth a human eyeball. The pattern survived Jest, Vitest, Pytest, the whole React ecosystem, and a generation of UI snapshot extensions, because the underlying contract held: same input plus same code equals same output. The contract does not hold for an LLM call. Same input plus same code plus same prompt produces a different string, and the difference is not a bug — it is the product working as designed.
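One common alternative to an exact-string snapshot is to assert on the properties the output must hold, whatever the wording. The sketch below assumes a hypothetical feature whose model returns JSON with a `sentiment` label and a short `summary`; the field names, the label set, and the 280-character limit are illustrative assumptions, not part of any particular framework.

```python
import json

# Hypothetical contract for an LLM-backed summarizer (assumed for illustration).
ALLOWED_SENTIMENTS = {"positive", "neutral", "negative"}
MAX_SUMMARY_CHARS = 280


def check_summary_contract(raw: str) -> list[str]:
    """Return a list of contract violations; an empty list means the output passes."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(data, dict):
        return ["output is not a JSON object"]

    problems = []
    sentiment = data.get("sentiment")
    if not isinstance(sentiment, str) or sentiment not in ALLOWED_SENTIMENTS:
        problems.append("sentiment outside allowed labels")
    summary = data.get("summary")
    if not isinstance(summary, str) or not (1 <= len(summary) <= MAX_SUMMARY_CHARS):
        problems.append("summary missing or too long")
    return problems
```

Two differently worded but valid responses both pass this check, which is the behavior an exact-match snapshot cannot express.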

The Tail-Tolerant Retry Policy Your LLM Gateway Doesn't Have

· 12 min read
Tian Pan
Software Engineer

Pull up your gateway's retry config. Three attempts. Exponential backoff with jitter. Retry on 5xx and timeout. Maximum delay capped at a few seconds. It looks reasonable, and someone copied it from a microservices runbook two years ago. It is also the single largest reason your P99 is twice your P50, your token bill spikes during provider incidents, and a meaningful slice of your users see a thirty-second spinner before silently bouncing.

A retry policy designed for 50ms RPCs does not survive contact with an 8-second LLM call. The shape of the failure is different, the cost of every attempt is different, and the user-perceived clock is different. The default is not safe, it is just familiar. Most teams discover this the same way: a postmortem where the gateway logs a successful response and the customer screenshot shows a frozen UI.
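One way to make the user-perceived clock explicit is to retry against a deadline rather than an attempt count alone. This is a minimal sketch, not the policy from the post: `call_with_deadline`, `min_useful_s`, and the injected `clock` are hypothetical names, and the `attempt` callable is assumed to raise `TimeoutError` when the provider call times out.

```python
import time
from typing import Callable, Optional


def call_with_deadline(
    attempt: Callable[[float], str],
    deadline_s: float,
    max_attempts: int = 3,
    min_useful_s: float = 2.0,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Retry `attempt` only while enough of the user-facing deadline remains.

    `attempt` receives the remaining time so it can set its own request timeout.
    """
    start = clock()
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        remaining = deadline_s - (clock() - start)
        if remaining < min_useful_s:
            break  # another attempt could not finish in time; stop spending tokens
        try:
            return attempt(remaining)
        except TimeoutError as exc:
            last_error = exc
    raise TimeoutError("user-facing deadline exhausted") from last_error
```

With a 10-second deadline and attempts that each burn 5 seconds, this makes two calls and then fails fast instead of starting a third that the user would never see.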

Tool Schema Design Is Your Blast Radius: When Function Definitions Become Security Boundaries

· 10 min read
Tian Pan
Software Engineer

The most dangerous file in your agent codebase is the one you've been writing as if it were API documentation. The tool registry — that JSON or Pydantic schema that tells the model what functions exist and what arguments they take — is no longer a docstring. It is your authorization layer. And if you designed it the way most teams do, you handed the LLM a master key and called it good engineering.

Consider the canonical first cut at a tool: query_database(sql: string). The intent is reasonable — let the model formulate the right SQL for the user's question. The reality is that the model is now an untrusted client with unlimited DDL and DML rights to whatever database the connection string points at. The system prompt that says "only run SELECTs on the orders table" is a suggestion, not a control. When a prompt-injected tool result — an email body, a webpage, a PDF — tells the model to run DROP TABLE users, your authorization model is the model's instruction-following discipline. That is not authorization. That is hope.
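A narrower tool moves the control out of the prompt and into code. The sketch below assumes a hypothetical `orders` table and a `list_orders` tool; the model supplies only typed arguments, and the SQL text is fixed and parameterized. The table layout, status values, and row cap are assumptions for illustration.

```python
import sqlite3

# Hypothetical allow-list; the model can never supply SQL, only these values.
ALLOWED_STATUSES = {"pending", "shipped", "delivered"}
MAX_ROWS = 100


def list_orders(conn: sqlite3.Connection, customer_id: int, status: str, limit: int = 20) -> list:
    """Read-only tool: fixed query, validated arguments, bounded result size."""
    if not isinstance(customer_id, int):
        raise ValueError("customer_id must be an integer")
    if status not in ALLOWED_STATUSES:
        raise ValueError(f"status must be one of {sorted(ALLOWED_STATUSES)}")
    limit = max(1, min(int(limit), MAX_ROWS))
    cursor = conn.execute(
        "SELECT id, status, total FROM orders "
        "WHERE customer_id = ? AND status = ? ORDER BY id LIMIT ?",
        (customer_id, status, limit),
    )
    return cursor.fetchall()
```

An injected instruction can still ask the model to call this tool, but the worst outcome is a bounded read of one customer's orders rather than arbitrary DDL. Pairing it with a read-only database credential closes the remaining gap.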

The Abandon Primitive: Why Your Agent Loop Needs a First-Class Way to Quit a Plan

· 11 min read
Tian Pan
Software Engineer

Look at the loop primitives most agent frameworks ship: continue, return, retry, and a step budget that hard-stops the run. Notice what is missing. There is a path that says "the work succeeded," a path that says "the model wants to keep going," and a path that says "we ran out of money or patience and shot the loop in the head." There is no first-class path that says "the plan I am executing is wrong, and I want to throw it away and start a different one." The abandon primitive — an explicit, structured way for the planner to declare its current trajectory hopeless — is the missing verb in the agent loop's grammar, and its absence is responsible for a category of failures that are usually misdiagnosed as "the model needs more reasoning."

A planner three steps into a doomed branch keeps refining the same wrong plan because the loop's only exits are succeed, retry the last step, or hit the budget. None of those are "give up on the strategy and try a different one." So the agent does what the loop allows: it edits its plan in place, calls one more tool, asks for one more clarification, and burns through its step budget converging on a non-solution. When the agent finally hits the wall, the user sees a polite failure message that is not an answer to their question. The cost of those wasted steps is real: production data suggests 5–10% of token spend on agent systems goes into retries that produce nothing usable, and that figure is dominated by long doomed branches, not isolated tool errors.
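To make the missing verb concrete, here is a minimal sketch of a loop with an explicit abandon outcome. Everything in it is hypothetical: the `Outcome` enum, the `Step` record, and the `run` function are illustrative names, and the candidate plans are assumed to be supplied up front rather than generated by a planner.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Callable, List, Optional, Tuple


class Outcome(Enum):
    CONTINUE = auto()
    RETRY = auto()
    DONE = auto()
    ABANDON = auto()  # the plan itself is wrong; switch strategy


@dataclass
class Step:
    outcome: Outcome
    note: str = ""


def run(
    plans: List[str],
    execute_step: Callable[[str], Step],
    step_budget: int,
) -> Tuple[Optional[str], List[Tuple[str, str]]]:
    """Try plans in order; return (winning plan or None, abandoned plans with reasons)."""
    steps_used = 0
    abandoned: List[Tuple[str, str]] = []
    for plan in plans:
        while steps_used < step_budget:
            steps_used += 1
            step = execute_step(plan)
            if step.outcome is Outcome.DONE:
                return plan, abandoned
            if step.outcome is Outcome.ABANDON:
                abandoned.append((plan, step.note))
                break  # move to the next plan, keeping the remaining budget
            # CONTINUE and RETRY both take another step on the same plan
        else:
            return None, abandoned  # budget exhausted mid-plan
    return None, abandoned
```

The design choice worth noting is that abandoning records a reason and preserves the unspent budget for the next strategy, instead of letting a doomed branch consume it.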

Why Your AI Roadmap Shouldn't Have a 12-Month Plan

· 9 min read
Tian Pan
Software Engineer

A team I worked with last quarter spent six weeks building a "smart document classifier" — fine-tuned model, eval harness, custom UI, the whole production pipeline. It shipped on a Tuesday. The following Monday, a new general-purpose model dropped that beat their fine-tune on the same eval, zero-shot, with no infrastructure investment. Their entire Q2 OKR became a wrapper around a one-line API call. The roadmap had committed twelve months earlier to "owning the classification stack." That commitment was wrong before the ink dried.

This is not an isolated story. Industry trackers logged 255 model releases from major labs in Q1 2026 alone, with roughly three meaningful frontier launches per week through March. Costs have collapsed: API pricing is down 97% since GPT-3, and the gap between top providers has narrowed to within statistical noise on most benchmarks. When the underlying substrate changes this fast, a twelve-month feature roadmap is not a plan — it is a list of bets you cannot revisit, made with information that will be stale before you ship the second item.

The Air-Gapped LLM Blueprint: What Egress-Free Deployments Actually Need

· 11 min read
Tian Pan
Software Engineer

The cloud AI playbook assumes one primitive that nobody writes down: outbound HTTPS. Vendor APIs, hosted judges, telemetry pipelines, model registries, vector stores, dashboard SaaS, secret managers — every one of them quietly resolves to a domain on the public internet. Pull that one cable and the stack does not degrade gracefully. It collapses.

That is the moment most teams discover their architecture has an egress dependency they never accounted for. A "small" prompt update needs to call out to a hosted classifier. The eval suite hits an LLM judge over the wire. The observability agent phones home. The model registry pulls weights from a CDN. None of it is malicious, and none of it is unusual. It is just what the cloud-native stack looks like when you stop noticing the cable.

Clarification Budgets: When Your Agent Should Ask Instead of Guess

· 10 min read
Tian Pan
Software Engineer

The two worst agent failure modes feel like opposites, but they originate from the same broken policy. The first agent asks four follow-up questions before doing anything and trains its users to abandon it. The second agent never asks, confidently produces output the user has to redo, and trains its users to mistrust it. Same policy, different settings of one missing parameter: the cost of a question relative to the cost of a wrong answer.

Most agents do not have a policy at all. The model is asked to "be helpful" and is left to negotiate ambiguity on its own. Because next-token prediction rewards committing to an answer, the agent leans toward guessing. Because RLHF rewards politeness, the agent occasionally over-corrects and asks a question for safety. The result is unprincipled behavior that varies from session to session, with no team-level intuition about when the agent will pause and when it will charge ahead.

A clarification budget is the missing parameter. It is a per-task allowance for how much friction the agent is permitted to impose, paired with a decision rule for when a question is worth spending that budget on. Think of it as the conversational analog of a latency budget — every product has one, even if no one wrote it down, and the team that writes it down stops shipping confused agents.
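One simple way to write the parameter down is an expected-cost comparison: ask only when the expected cost of guessing wrong exceeds the cost of the question, and only while allowance remains. The sketch below is an illustration under stated assumptions; `ClarificationBudget`, `question_cost`, `p_wrong`, and `redo_cost` are hypothetical names, and where the probability estimate comes from is left open.

```python
from dataclasses import dataclass


@dataclass
class ClarificationBudget:
    questions_left: int   # per-task allowance of questions
    question_cost: float  # friction of one question, in the same units as redo_cost

    def should_ask(self, p_wrong: float, redo_cost: float) -> bool:
        """Ask only if budget remains and a wrong guess is expected to cost more."""
        if self.questions_left <= 0:
            return False
        return p_wrong * redo_cost > self.question_cost

    def spend(self) -> None:
        """Record that one question was asked."""
        if self.questions_left <= 0:
            raise RuntimeError("clarification budget already exhausted")
        self.questions_left -= 1
```

For example, with `question_cost=1.0`, a 50% chance of a redo that costs 10 units justifies a question, while a 5% chance does not.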

Eval as a Pull Request Comment, Not a Job: Embedding LLM Quality Gates in Code Review

· 11 min read
Tian Pan
Software Engineer

Most teams that say "we have evals" mean: there is a dashboard, somebody runs the suite weekly, and the numbers get pasted into a Slack channel that nobody reads. Reviewers approve a prompt change without ever seeing whether it moved the suite, and the regression shows up two weeks later in a customer ticket. The eval exists; the eval is not in the loop.

The fix is structural, not motivational. Evals only gate quality when they live where the change lives — in the pull request comment, next to the diff, with a per-PR delta and a regression callout that the reviewer cannot scroll past. Anywhere else, they are a performative artifact: real work was done to build them, and they catch nothing.
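As a rough sketch of what the per-PR delta could look like, the functions below compare eval scores from the base branch and the PR branch and render a comment body with regressions flagged. The metric names, the 0.01 tolerance, and the comment layout are assumptions; posting the text to the code host is left out.

```python
from typing import Dict


def regressions(base: Dict[str, float], head: Dict[str, float], tolerance: float = 0.01) -> Dict[str, float]:
    """Return metrics whose score dropped by more than `tolerance`, with their deltas."""
    dropped = {}
    for metric, base_score in base.items():
        delta = round(head.get(metric, 0.0) - base_score, 4)
        if delta < -tolerance:
            dropped[metric] = delta
    return dropped


def render_comment(base: Dict[str, float], head: Dict[str, float], tolerance: float = 0.01) -> str:
    """Build the text of a PR comment showing each metric's before/after and delta."""
    bad = regressions(base, head, tolerance)
    lines = ["Eval delta vs. base branch:"]
    for metric in sorted(base):
        new_score = head.get(metric, 0.0)
        delta = round(new_score - base[metric], 4)
        flag = "  <-- REGRESSION" if metric in bad else ""
        lines.append(f"  {metric}: {base[metric]:.2f} -> {new_score:.2f} ({delta:+.2f}){flag}")
    return "\n".join(lines)
```

A CI step could fail the check whenever `regressions(...)` is non-empty, which turns the comment from a report into a gate.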

Hierarchical Memory Compaction: The Four Tiers Your Agent Memory Is Missing

· 11 min read
Tian Pan
Software Engineer

Most agent memory systems collapse a four-layer problem into two layers and then act surprised when the seams show. There is the conversation buffer that gets truncated when it overflows the context window, and there is the vector store of "long-term memory" that everything older than the buffer gets dumped into. That is not a memory architecture. That is a queue and a junk drawer.

The agent that re-asks a regular user the same onboarding question three Mondays in a row is not failing because the model is bad. It is failing because there is no place in the system that holds "things this user has told me across sessions" with a different lifetime than "things every user has ever told me about how the product works." Those are different memories. They have different access patterns, different privacy contracts, and different rules for when to forget. Conflating them is the architectural mistake — and it has a fix.

Tool Call Ordering Is a Partial Order, Not a Set

· 10 min read
Tian Pan
Software Engineer

A "create then notify" sequence works in dev. A "notify then create" sequence emits a webhook for an entity that doesn't exist yet, the consumer 404s, and your team spends a week debugging what looks like a flaky integration test. The flake isn't flaky. It's deterministic given a hidden ordering invariant your tool set has and your planner doesn't know about.

This is the shape of most tool-call-ordering bugs in production agents: a tool set that secretly composes as a partial order — some operations must happen before others, others can run in any order — being treated by the planner as an unordered set of capabilities. The model picks an order that worked yesterday. A prompt edit, a model upgrade, or even a different temperature sample picks a different order tomorrow. Both look reasonable to anyone reading the trace. Only one is correct.

The team that doesn't declare the order is shipping a bug surface that the model's prompt sensitivity will eventually find.
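Declaring the order can be as small as a prerequisite map that is checked before a planned sequence runs. The sketch below uses hypothetical tool names (`create_entity`, `notify_webhook`, `fetch_profile`) and the standard-library `graphlib.TopologicalSorter` to derive one valid order; it is an illustration of the idea, not a specific framework's API.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Tuple

# Hypothetical declaration: tool -> tools that must already have run.
# Tools with no entry (or an empty tuple) can run at any point.
PREREQUISITES: Dict[str, Tuple[str, ...]] = {
    "create_entity": (),
    "notify_webhook": ("create_entity",),
    "fetch_profile": (),
}


def ordering_violations(calls: List[str]) -> List[Tuple[str, str]]:
    """Return (missing_prerequisite, call) pairs for a planned call sequence."""
    seen = set()
    violations = []
    for call in calls:
        for prereq in PREREQUISITES.get(call, ()):
            if prereq not in seen:
                violations.append((prereq, call))
        seen.add(call)
    return violations


def one_valid_order() -> List[str]:
    """Produce an order that satisfies every declared prerequisite."""
    return list(TopologicalSorter(PREREQUISITES).static_order())
```

Running `ordering_violations` on the planner's proposed sequence before execution turns the hidden invariant into a rejected plan instead of a 404 in a downstream consumer.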

Abstention as a Routing Decision: Why 'I Don't Know' Belongs in the Router, Not the Prompt

· 10 min read
Tian Pan
Software Engineer

Most teams handle abstention with a single sentence in the system prompt: "If you are not confident, say you don't know." The model occasionally honors it, frequently doesn't, and the failure mode is asymmetric. A confidently-wrong answer ships at full velocity — it lands in the user's hands, gets quoted in a Slack thread, gets cited in a downstream summary. An honest abstention triggers a customer-success escalation because the user expected the agent to handle the request and now somebody has to explain why it didn't. Six months in, the team has learned which kind of failure costs less to ship, and the system prompt edit that nominally controls abstention has been quietly tuned for compliance, not for honesty.

The discipline that fixes this isn't better wording. It's recognizing that abstention is a routing decision, not a prompt pattern. It deserves a first-class output channel, its own SLO, its own evaluation harness, and its own place in the system topology: somewhere outside the prompt, where it can be tested, owned, and scaled.
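A first-class output channel might look like the following sketch, where the router, not the prompt, decides whether an answer ships. The names `ModelResult` and `route`, the 0.7 threshold, and the `human_review` destination are hypothetical, and the confidence score is assumed to come from some separately calibrated signal rather than the model's own say-so.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ModelResult:
    answer: Optional[str]  # None means the model produced no usable answer
    confidence: float      # assumed to come from a separate, calibrated scorer


def route(result: ModelResult, threshold: float = 0.7) -> Tuple[str, str]:
    """Return (destination, payload); abstention is a route, not a phrase in the answer."""
    if result.answer is None:
        return ("human_review", "no answer produced")
    if result.confidence < threshold:
        return ("human_review", f"confidence {result.confidence:.2f} below {threshold:.2f}")
    return ("respond", result.answer)
```

Because the abstention path is a plain function with a threshold, it can be unit-tested, measured against its own SLO, and tuned without touching the system prompt.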