118 posts tagged with "llm-ops"

The Batch-Tier Inference Question: When 50% Off Reshapes Your Architecture

April 27, 2026 · 11 min read

Tian Pan

Software Engineer

The cheapest inference dollar in your bill is the one you're paying twice. Every major model provider now offers a batch tier at roughly half the price of synchronous inference in exchange for accepting a completion window measured in hours rather than milliseconds. Most engineering organizations either ignore the option entirely, or shove a single nightly cron at it and declare the savings booked. Both responses leave 30–50% of total inference spend on the floor — not because the discount is small, but because batch isn't a coupon. It is a different product surface with its own SLAs, its own retry semantics, and its own failure modes, and the teams that treat it as a billing optimization end up either underusing it or shipping subtle regressions that take weeks to attribute.

The technical question is not "should we use batch?" The technical question is which actions in your system are actually synchronous in the user-perceived sense, which ones the engineering org has accidentally treated as synchronous because the developer experience was easier, and which ones can be re-shaped into jobs without a downstream consumer assuming the result is fresh. Answering that requires a workload audit, an architectural shift from request-shaped to job-shaped contracts, and an honest mapping of every agent action to a latency tier based on user expectation rather than developer convenience.

The Eval-Rig Latency Lie: Why Your p95 Doubles in Production

April 27, 2026 · 10 min read

Tian Pan

Software Engineer

The eval team puts a number on the deck: "p95 latency is 1.2s." The launch ships. A week later, oncall posts a graph: production p95 is 4.8s and climbing through the dinner-time peak. Engineers spend the next five days arguing about whether something regressed, instrumenting model versions, opening tickets with the provider — and eventually discover that nothing changed except where the number was measured. The eval rig was reporting the latency of a quiet machine running serial calls against a warm cache. Production is a different system. The p95 was never wrong; it was answering a different question.

This is the eval-rig latency lie. It is not about bad benchmarks — most teams use reasonable tools and report the numbers honestly. It is about the gap between "the latency of the model" and "the latency a user experiences," and the fact that the rig you build for development almost always measures the first while implying the second. Once you internalize this, latency SLOs derived from a benchmark stop looking like product commitments and start looking like claims about a private testing environment that nobody else can reproduce.

The Model Deprecation Treadmill: Discipline That Has to Exist Before the Sunset Email

April 27, 2026 · 13 min read

Tian Pan

Software Engineer

The team that treats "we use the latest model" as a virtue is one sunset email away from a quarter of unplanned work. By the time the deprecation notice lands, the architectural decisions that determine whether you can absorb it have already been made, months ago, by people who weren't thinking about migrations at all. The eval suite was implicitly tuned against a specific checkpoint. The prompts were tuned against a specific refusal style. The cost projections assumed a specific token-per-task baseline. The router has a hardcoded fallback to a model that is itself about to disappear. None of these decisions look like risks until the email arrives, and then all of them look like the same risk.

Model deprecation is now the most predictable surprise in the AI stack. Anthropic gives a minimum of 60 days' notice on publicly released models. OpenAI's notice windows range from three months for specialized snapshots to 18 months for foundational models, but in practice a recent batch of ChatGPT model retirements landed with as little as two weeks' warning for some teams. GitHub deprecated a slate of Anthropic and OpenAI models in February 2026 in a single coordinated changelog entry. The pattern is no longer "if a model retires" — it's "every quarter, at least one model your stack depends on enters a retirement window, and the calendar isn't synchronized to your roadmap."

The RAG Read-After-Write Race: When Your Vector Index Cites a Document That No Longer Exists

April 27, 2026 · 10 min read

Tian Pan

Software Engineer

A user asks your assistant a question at 14:32:07. Your retriever fires at 14:32:08 and pulls back five chunks from the policy handbook. The model thinks for a few seconds, drafts a response, and at 14:32:12 streams back an answer that confidently cites section 4.3 — the section that an admin deleted at 14:32:10 because it was wrong. The user reads an authoritative quotation from a document that no longer exists, complete with a clickable link that returns 404.

Nothing in your stack errored. The retriever returned a valid hit. The model produced fluent, grounded prose. The citation pointed at a real chunk ID that was real when the retrieval happened. And yet the answer is, by every reasonable definition, a hallucination — not because the model made something up, but because the world changed underneath the pipeline between the moment it looked and the moment it spoke.

This is the RAG read-after-write race, and most production pipelines have no defense against it.
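
One possible defense, sketched here under assumptions rather than as a complete fix: re-check every cited chunk against the source of truth immediately before the answer leaves the pipeline. The `Chunk`, `DocumentStore`, `is_live`, and `filter_live_citations` names are hypothetical, and the version field assumes your index stores a document version alongside each chunk.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    version: int
    text: str


@dataclass
class DocumentStore:
    """Hypothetical source of truth: maps chunk_id -> current version."""
    live_versions: dict[str, int] = field(default_factory=dict)

    def is_live(self, chunk: Chunk) -> bool:
        # A chunk is citable only if it still exists at the version we retrieved.
        return self.live_versions.get(chunk.chunk_id) == chunk.version


def filter_live_citations(
    store: DocumentStore, cited: list[Chunk]
) -> tuple[list[Chunk], list[Chunk]]:
    """Split cited chunks into (still valid, stale) just before responding."""
    valid = [c for c in cited if store.is_live(c)]
    stale = [c for c in cited if not store.is_live(c)]
    return valid, stale


store = DocumentStore({"handbook-4.2": 3, "handbook-4.3": 7})
retrieved = [Chunk("handbook-4.2", 3, "..."), Chunk("handbook-4.3", 7, "...")]

# An admin deletes section 4.3 while the model is still generating.
del store.live_versions["handbook-4.3"]

valid, stale = filter_live_citations(store, retrieved)
```

A check like this narrows the window without closing it: a delete can still land between the check and the moment the user clicks the link, so the stale list is best treated as a signal to regenerate or to drop the citation, not as proof that everything else is current.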

The kWh Column Missing From Your Inference Span: Carbon Attribution Per Request

April 26, 2026 · 10 min read

Tian Pan

Software Engineer

Your inference flame graph has a cost axis. It does not have an energy axis. That gap is fine right up until the morning a customer's procurement team sends you a spreadsheet with twenty-three columns of vendor sustainability disclosures, and one of them is kgCO2e per 1,000 inferences. You have no way to fill that cell, your provider's answer is a methodology paper, and the deal closes in nine days. The token-cost dashboard your platform team has been polishing for two years suddenly looks like it was solving the wrong problem.

The shift here is not abstract. Sustainability disclosure is moving from corporate aggregate to product-level granularity. The first wave of that movement landed inside CSRD and ESRS in 2025, and the second wave is landing in B2B procurement contracts right now. Engineering organizations that built observability for cost are about to discover they need observability for carbon, and the two are not the same column on the same span.

DLP Belongs in Your AI Gateway, Not Bolted Into Every App

April 26, 2026 · 11 min read

Tian Pan

Software Engineer

The first internal LLM gateway is almost always built for the boring reasons: cost attribution so finance can answer "which team spent the inference budget," rate limiting so one runaway script doesn't burn the monthly quota, and provider failover so an OpenAI hiccup doesn't take down the assistant. Data loss prevention shows up on the slide deck, but it ships as "each app team should redact sensitive fields before they call the model." Six months later there are nine apps in production, three half-maintained redaction libraries with subtly different regex sets, two prototypes that bypass the gateway entirely "just for testing," and a customer-data-in-prompt incident that everyone's middleware was supposed to prevent and nobody's did, because nobody's middleware was the canonical egress point.

This is not a tooling problem. It is an architectural mistake. DLP is an egress control, and egress controls only work when the path is mandatory. The moment you let app teams own redaction, you've ceded the property that makes DLP function — that there is exactly one place sensitive data can leave, and you can prove what crossed it. The 2025 LayerX Security report puts the scale of the problem in numbers most teams haven't internalized: GenAI-related DLP incidents more than doubled in early 2025 and now make up 14% of all data-security incidents across SaaS traffic, with employees averaging 6.8 pastes into GenAI tools per day, more than half of which contain corporate information. The shadow path is winning by default.

Your Accuracy Went Up and Your Calibration Collapsed

April 23, 2026 · 10 min read

Tian Pan

Software Engineer

A team ships a prompt refactor. The offline eval shows accuracy up three points. The PM posts the graph in Slack. Two weeks later, support tickets spike with a pattern nobody has a dashboard for: users trusted an answer they should not have, acted on it, and got burned. The model is right more often than it used to be. Trust in the model has gotten worse.

This is the calibration collapse. The model's confidence no longer matches its error rate, but the accuracy number went up, so the team thinks they shipped a win. They did not. They shipped a system that is more confidently wrong, and users — who calibrate trust on the model's voice (hedges, certainty, refusals) rather than on an accuracy number they never see — are now being misled on the exact fraction of queries where being misled matters most.

Accuracy and calibration are independent axes. You can move one without touching the other. You can improve one while destroying the other. Most teams measure only the first axis and ship against it, and most production incidents in LLM systems live on the second.
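
To make the two axes concrete, here is a minimal sketch using expected calibration error (ECE), one common way to score the gap between stated confidence and observed accuracy. The numbers are invented for illustration, and the sketch assumes you can extract a per-answer confidence in [0, 1] at all, which many LLM pipelines cannot do without extra work.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean gap between confidence and accuracy over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        bucket_acc = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - bucket_acc)
    return ece


def accuracy(correct):
    return sum(correct) / len(correct)


# Illustrative data: before the refactor, 7/10 right at 0.70 stated confidence.
before_correct = [1] * 7 + [0] * 3
before_conf = [0.70] * 10

# After the refactor, 8/10 right, but every answer is delivered at 0.99.
after_correct = [1] * 8 + [0] * 2
after_conf = [0.99] * 10

acc_before, acc_after = accuracy(before_correct), accuracy(after_correct)
ece_before = expected_calibration_error(before_conf, before_correct)
ece_after = expected_calibration_error(after_conf, after_correct)
```

In this toy case accuracy rises from 0.70 to 0.80 while ECE goes from roughly 0 to roughly 0.19: the model is right more often and also much more overconfident about the answers it gets wrong.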

Agent Idempotency Is an Orchestration Contract, Not a Tool Property

April 23, 2026 · 10 min read

Tian Pan

Software Engineer

The support ticket arrives at 9:41 a.m.: "I was charged three times." The trace looks clean. One user message, one planner turn, three calls to charge_card — each with a distinct tool-use ID, each returning 200 OK, each writing a different Stripe charge. The tool has an idempotency key. The backend has a dedup table. The payment processor honors Idempotency-Key. Every layer is idempotent. The customer still paid three times.

This is the shape of the bug that will land on your desk if you build agents long enough. It is not a bug in any tool. It is a bug in the contract between the agent loop and the tools, and that contract almost always lives only in a senior engineer's head.
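
One way a trace like that can arise, assuming the idempotency key is derived from the per-call tool-use ID: every planner retry mints a new ID, so every layer dutifully dedups against a key it has never seen. The sketch below contrasts that with a key derived from the orchestration run and the logical action. `intent_key`, `FakeProcessor`, and the ID values are all hypothetical stand-ins, not any real payment API.

```python
import hashlib
import json


def intent_key(run_id: str, tool: str, args: dict) -> str:
    """Key derived from the orchestration run and the logical action, not the call ID."""
    canonical = json.dumps({"run": run_id, "tool": tool, "args": args}, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class FakeProcessor:
    """Stand-in for a payment backend that dedups on an idempotency key."""

    def __init__(self):
        self.charges = {}

    def charge(self, key: str, amount_cents: int) -> str:
        # A repeated key returns the original charge instead of creating a new one.
        if key not in self.charges:
            self.charges[key] = f"ch_{len(self.charges) + 1}"
        return self.charges[key]


args = {"customer": "cus_123", "amount_cents": 4200}
tool_use_ids = ["toolu_a", "toolu_b", "toolu_c"]  # the planner retried twice

# Keyed per tool call: every retry looks like a brand-new charge.
per_call = FakeProcessor()
for tool_use_id in tool_use_ids:
    per_call.charge(tool_use_id, args["amount_cents"])

# Keyed per intent: the retries collapse onto a single charge.
per_intent = FakeProcessor()
for _ in tool_use_ids:
    per_intent.charge(intent_key("run_42", "charge_card", args), args["amount_cents"])
```

The trade-off is that two legitimately separate charges with identical arguments in the same run would also collapse, so a real contract would likely need a step or intent counter in the key. Deciding that is exactly the orchestration-level choice no individual tool can make.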

Silent Success: When Your Agent Says Done and Nothing Actually Happened

April 23, 2026 · 10 min read

Tian Pan

Software Engineer

The most dangerous line in an agent transcript is the confident one. "I've updated the record." "The invite is sent." "Permissions are applied." Every one of those sentences is a claim, not a fact, and when the tool call behind it rate-limited, timed out, or returned a 500 that the summarization step over-compressed into something reassuring, the claim is all you have. Your telemetry logs the turn as successful because success is whatever the model typed at the top of its final message. The downstream write never committed. Nobody notices for three weeks.

This is the failure class that separates agents from every system that came before them. A traditional service fails with a status code. A traditional batch job fails with a stack trace. An agent fails by continuing to talk. It absorbs the error into its running narrative, rounds it off to make the story coherent, and hands you a paragraph that reads like completion. The user reads the paragraph. Your observability platform indexes the paragraph. The record in the database does not change.
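
A minimal sketch of the alternative, under the assumption that your agent loop records a structured result for every tool call: classify the turn from that evidence and deliberately ignore the model's prose. `ToolResult` and `turn_outcome` are hypothetical names, and using an HTTP-style status code is a simplification (timeouts, for instance, would need their own representation).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolResult:
    tool: str
    status_code: int


def turn_outcome(final_message: str, results: list[ToolResult]) -> str:
    """Classify a turn from tool evidence; final_message is ignored on purpose."""
    if any(r.status_code >= 400 for r in results):
        return "failed"
    if not results:
        # The model claimed something but no tool call backs the claim.
        return "unverified"
    return "succeeded"


claimed = "I've updated the record."
outcome_on_500 = turn_outcome(claimed, [ToolResult("update_record", 500)])
outcome_no_calls = turn_outcome(claimed, [])
outcome_ok = turn_outcome(claimed, [ToolResult("update_record", 200)])
```

Even the "succeeded" branch only says the tool reported success. Confirming that the write actually committed would take a read-back against the system of record, which this sketch does not attempt.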

Your Eval Harness Runs Single-User. Your Agents Don't.

April 23, 2026 · 9 min read

Tian Pan

Software Engineer

Your agent passes 92% of your eval suite. You ship it. Within an hour of real traffic, something that never appeared in any trace is happening: agents are stalling on rate-limit retry storms, a customer sees another customer's draft email in a tool response, and your provider connection pool is sitting at 100% utilization while CPU is idle. None of these failures live in the model. They live in the gap between how you tested and how production runs.

The gap has a single shape. Your eval harness loops one agent at a time through a fixed dataset. Production runs many agents at once through shared infrastructure. Sequential evaluation hides every bug whose precondition is "two things touching the same resource." Until you build adversarial concurrency into the harness itself, those bugs will only surface as on-call pages.
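
A toy illustration of why sequential runs hide this class of bug, assuming a hypothetical `DraftTool` that (incorrectly) keeps one shared buffer for all callers. The same agents and the same tool produce zero leaks when looped one at a time and a cross-user leak when run together.

```python
import asyncio


class DraftTool:
    """Hypothetical tool with a bug: a single buffer shared by every caller."""

    def __init__(self):
        self._current = None

    async def save_and_read(self, user: str, draft: str) -> str:
        self._current = draft
        await asyncio.sleep(0)  # yield control, as a real network call would
        return self._current


async def run_agent(tool: DraftTool, user: str):
    seen = await tool.save_and_read(user, f"draft-for-{user}")
    return user, seen


async def sequential(users):
    tool = DraftTool()
    return [await run_agent(tool, u) for u in users]


async def concurrent(users):
    tool = DraftTool()
    return await asyncio.gather(*(run_agent(tool, u) for u in users))


def leaks(results):
    """Return every (user, draft) pair where a user saw someone else's draft."""
    return [(u, seen) for u, seen in results if seen != f"draft-for-{u}"]


users = ["alice", "bob"]
seq_leaks = leaks(asyncio.run(sequential(users)))
con_leaks = leaks(asyncio.run(concurrent(users)))
```

Here the sequential pass reports no leaks, while the concurrent pass shows alice receiving bob's draft. A real harness would need many more agents, realistic delays, and shared rate limits to surface retry storms and pool exhaustion, but the principle is the same: the bug only exists when two runs overlap.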

Your Gold Labels Learned From Your Model: Eval-Set Contamination via Production Leakage

April 23, 2026 · 10 min read

Tian Pan

Software Engineer

Your eval suite passed. Quality dashboards are green. A week later, users are quietly churning and nobody can explain why. The eval set did not lie by being wrong — it lied by being a mirror. The labels you graded against were, traceably, produced or filtered by the very model family you were trying to evaluate. Passing that eval is not evidence of quality. It is evidence that your model agrees with its own past outputs.

This is the quiet failure mode of mature LLM pipelines: eval-set contamination via production leakage. Not the famous benchmark contamination where a model trained on GSM8K also gets graded on GSM8K — that story is well told. The subtler one is downstream. Your gold labels come from user feedback, from human annotators who saw the model's draft first, from RLHF reward traces, from LLM-as-judge preference data. Each of those pipelines carries a fingerprint of the current model's idiom back into your "ground truth." Over a few quarters, the test set quietly memorizes your model's biases, and the eval becomes a self-congratulation loop.

First-Touch Tool Burn: Why Your Agent Reads Twelve Files Before Doing What You Asked

April 23, 2026 · 11 min read

Tian Pan

Software Engineer

Your agent just spent ninety seconds and a few dollars to change a three-line function. Before the edit landed, it listed two directories, opened the test file, ran a grep for callers, read the config module, checked the CI workflow, and pulled up a type definition it never used. The diff it produced was four lines. The trace that produced it was forty-three tool calls.

This is first-touch tool burn: the pattern where an agent, handed a well-scoped task, behaves as if every request is a research problem. The exploration happens first and it happens hard — sixty to eighty percent of the token budget spent on listing, grepping, and reading before a single character is written to a file. Teams discover this the first time they look at a trace and realize the agent did the equivalent of a two-hour onboarding for a two-minute task.

The behavior isn't a bug in any specific model. It's the predictable output of how these systems were trained and evaluated, colliding with a production environment that measures something training never did: whether the work was cheap enough to bother doing at all.
