
27 posts tagged with "llm-ops"


Your Accuracy Went Up and Your Calibration Collapsed

· 10 min read
Tian Pan
Software Engineer

A team ships a prompt refactor. The offline eval shows accuracy up three points. The PM posts the graph in Slack. Two weeks later, support tickets spike with a pattern nobody has a dashboard for: users trusted an answer they should not have, acted on it, and got burned. The model is right more often than it used to be. Trust in the model has gotten worse.

This is the calibration collapse. The model's confidence no longer matches its error rate, but the accuracy number went up, so the team thinks they shipped a win. They did not. They shipped a system that is more confidently wrong, and users — who calibrate trust on the model's voice (hedges, certainty, refusals) rather than on an accuracy number they never see — are now being misled on the exact fraction of queries where being misled matters most.

Accuracy and calibration are independent axes. You can move one without touching the other. You can improve one while destroying the other. Most teams measure only the first axis and ship against it, and most production incidents in LLM systems live on the second.
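
To make the second axis visible, calibration has to be measured on the same run as accuracy. Below is a minimal sketch, assuming each eval example records the model's stated confidence alongside whether it was correct; expected calibration error (ECE) is one common way to summarize the gap between stated confidence and empirical error rate.

```python
import numpy as np

def accuracy_and_ece(confidences, correct, n_bins=10):
    """Accuracy and expected calibration error (ECE) from one eval run.

    confidences: model-stated confidence per example, in [0, 1]
    correct:     1 if the answer was right, 0 otherwise
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    accuracy = correct.mean()

    # Bucket examples by stated confidence; ECE is the usage-weighted gap
    # between each bucket's average confidence and its empirical accuracy.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i in range(n_bins):
        lo, hi = edges[i], edges[i + 1]
        in_bin = (confidences > lo) & (confidences <= hi)
        if i == 0:
            in_bin |= confidences == lo   # keep exact zeros in the first bucket
        if in_bin.any():
            gap = abs(confidences[in_bin].mean() - correct[in_bin].mean())
            ece += in_bin.mean() * gap
    return accuracy, ece

# Gate the release on both axes, not just the first:
# acc_new, ece_new = accuracy_and_ece(conf_new, correct_new)
# acc_old, ece_old = accuracy_and_ece(conf_old, correct_old)
# ship = acc_new >= acc_old and ece_new <= ece_old
```

Gating on both numbers is what catches the "more confidently wrong" regression that the accuracy graph hides.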

Agent Idempotency Is an Orchestration Contract, Not a Tool Property

· 10 min read
Tian Pan
Software Engineer

The support ticket arrives at 9:41 a.m.: "I was charged three times." The trace looks clean. One user message, one planner turn, three calls to charge_card — each with a distinct tool-use ID, each returning 200 OK, each writing a different Stripe charge. The tool has an idempotency key. The backend has a dedup table. The payment processor honors Idempotency-Key. Every layer is idempotent. The customer still paid three times.

This is the shape of the bug that will land on your desk if you build agents long enough. It is not a bug in any tool. It is a bug in the contract between the agent loop and the tools, and that contract almost always lives only in a senior engineer's head.
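
To make that contract concrete, here is a minimal sketch of an orchestration-layer wrapper, assuming a hypothetical dispatch_to_tool dispatcher: the idempotency key is derived from the user turn plus the tool arguments, not from the per-call tool-use ID, so three planner retries of the same charge collapse into one execution.

```python
import hashlib
import json

_executed: dict[str, dict] = {}  # in production: a shared store with TTLs, not a process-local dict

def run_tool(tool_name: str, args: dict, turn_id: str) -> dict:
    # Key on intent (turn + tool + args), not on the tool-use ID the planner
    # mints fresh on every retry. Identical intent -> identical key -> one execution.
    key = hashlib.sha256(
        json.dumps({"turn": turn_id, "tool": tool_name, "args": args},
                   sort_keys=True).encode()
    ).hexdigest()

    if key in _executed:
        return _executed[key]   # replay the recorded result instead of charging again

    result = dispatch_to_tool(tool_name, args, idempotency_key=key)  # hypothetical dispatcher
    _executed[key] = result
    return result
```

The point is where the dedup lives: in the loop that decides to call the tool, not inside the tool that each retry reaches with a fresh ID.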

Your Eval Harness Runs Single-User. Your Agents Don't.

· 9 min read
Tian Pan
Software Engineer

Your agent passes 92% of your eval suite. You ship it. Within an hour of real traffic, something that never appeared in any trace is happening: agents are stalling on rate-limit retry storms, a customer sees another customer's draft email in a tool response, and your provider connection pool is sitting at 100% utilization while CPU is idle. None of these failures live in the model. They live in the gap between how you tested and how production runs.

The gap has a single shape. Your eval harness loops one agent at a time through a fixed dataset. Your production system loops many agents at once through shared infrastructure. Sequential evaluation hides every bug whose precondition is "two things touching the same resource." Until you build adversarial concurrency into the harness itself, those bugs will only surface as on-call pages.
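
A minimal sketch of what adversarial concurrency in the harness can look like, assuming a hypothetical async agent.run interface: the same cases, launched in overlapping sessions against the same shared client, so contention bugs get a chance to appear before production finds them.

```python
import asyncio
import random

async def run_case(agent, case, shared_client):
    # Jitter the start so sessions overlap in unsynchronized, realistic ways.
    await asyncio.sleep(random.uniform(0, 0.5))
    return await agent.run(case, client=shared_client)   # assumed agent interface

async def run_suite(agent, cases, shared_client, concurrency=32):
    sem = asyncio.Semaphore(concurrency)

    async def bounded(case):
        async with sem:
            return await run_case(agent, case, shared_client)

    # return_exceptions=True keeps the suite alive so you can count retry storms,
    # cross-session leaks, and pool exhaustion instead of dying on the first one.
    return await asyncio.gather(*(bounded(c) for c in cases), return_exceptions=True)
```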

Your Gold Labels Learned From Your Model: Eval-Set Contamination via Production Leakage

· 10 min read
Tian Pan
Software Engineer

Your eval suite passed. Quality dashboards are green. A week later, users are quietly churning and nobody can explain why. The eval set did not lie by being wrong — it lied by being a mirror. The labels you graded against were, traceably, produced or filtered by the very model family you were trying to evaluate. Passing that eval is not evidence of quality. It is evidence that your model agrees with its own past outputs.

This is the quiet failure mode of mature LLM pipelines: eval-set contamination via production leakage. Not the famous benchmark contamination where a model trained on GSM8K also gets graded on GSM8K — that story is well told. The subtler one is downstream. Your gold labels come from user feedback, from human annotators who saw the model's draft first, from RLHF reward traces, from LLM-as-judge preference data. Each of those pipelines carries a fingerprint of the current model's idiom back into your "ground truth." Over a few quarters, the test set quietly memorizes your model's biases, and the eval becomes a self-congratulation loop.
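
One hedge against this drift is to track label provenance explicitly and exclude anything whose lineage passes through the model family under evaluation. A minimal sketch, assuming hypothetical provenance tags attached at labeling time:

```python
from dataclasses import dataclass

@dataclass
class GoldLabel:
    example_id: str
    label: str
    provenance: list[str]   # e.g. ["human_annotator", "saw_model_draft:gpt-4o"]

def uncontaminated(labels: list[GoldLabel], model_family: str) -> list[GoldLabel]:
    # Any label whose lineage touched the model family being evaluated is excluded
    # from the eval set (or at minimum reported separately).
    tainted = (
        f"saw_model_draft:{model_family}",
        f"llm_judge:{model_family}",
        f"user_accepted_output:{model_family}",
    )
    return [lab for lab in labels if not any(tag in lab.provenance for tag in tainted)]
```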

First-Touch Tool Burn: Why Your Agent Reads Twelve Files Before Doing What You Asked

· 11 min read
Tian Pan
Software Engineer

Your agent just spent ninety seconds and a few dollars to change a three-line function. Before the edit landed, it listed two directories, opened the test file, ran a grep for callers, read the config module, checked the CI workflow, and pulled up a type definition it never used. The diff it produced was four lines. The trace that produced it was forty-three tool calls.

This is first-touch tool burn: the pattern where an agent, handed a well-scoped task, behaves as if every request is a research problem. The exploration happens first and it happens hard — sixty to eighty percent of the token budget spent on listing, grepping, and reading before a single character is written to a file. Teams discover this the first time they look at a trace and realize the agent did the equivalent of a two-hour onboarding for a two-minute task.

The behavior isn't a bug in any specific model. It's the predictable output of how these systems were trained and evaluated, colliding with a production environment that measures something training never did: whether the work was cheap enough to bother doing at all.
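
The pattern is easy to quantify once you look for it. A minimal sketch of a trace audit, assuming hypothetical tool names and a per-call token count in the trace: measure the share of tokens spent on read-only exploration before the first write.

```python
READ_ONLY_TOOLS = {"list_dir", "read_file", "grep", "view_definition"}  # assumed tool names
WRITE_TOOLS = {"edit_file", "write_file", "run_tests"}

def first_touch_burn(trace: list[dict]) -> float:
    """trace: ordered tool calls, each {"tool": str, "tokens": int}."""
    burn, total, seen_write = 0, 0, False
    for call in trace:
        total += call["tokens"]
        if call["tool"] in WRITE_TOOLS:
            seen_write = True
        if not seen_write and call["tool"] in READ_ONLY_TOOLS:
            burn += call["tokens"]
    return burn / total if total else 0.0

# A ratio of 0.6 to 0.8 on a well-scoped task is the pattern described above.
```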

The Model-of-the-Week Roadmap: When Vendor Promises Become Committed Dependencies

· 9 min read
Tian Pan
Software Engineer

A product manager pulls up the next-quarter roadmap. Three features are marked "depends on next-gen model." Nobody asks what happens if next-gen slips, arrives 20% smaller than the demo suggested, or ships gated behind an enterprise tier your customers do not qualify for. Six months later, all three of those scenarios have happened, and the team is now rebuilding two quarters of architecture against the model that actually shipped — a different shape from the one they planned for.

This is the model-of-the-week roadmap: treating unreleased capability claims as committed dependencies. It is one of the most reliable ways to turn a twelve-month plan into a thirty-month plan, and it rarely looks risky in the moment because every vendor demo feels inevitable. The schedule damage is invisible until the slip compounds.

The Prompt Ownership Problem: When Conway's Law Comes for Your Prompts

· 11 min read
Tian Pan
Software Engineer

Every non-trivial AI product eventually develops a prompt that nobody is allowed to touch. It has three conditional branches, two inline examples pasted in during a customer-reported incident, and a sentence that begins with "IMPORTANT:" followed by a tone instruction nobody remembers writing. The prompt is 1,400 tokens. The PR that last modified it was reviewed by an engineer who has since changed teams. When a new model comes out, nobody is confident the prompt will still work. When evals regress, nobody is sure whether the prompt, the model, the retrieval pipeline, or a downstream tool caused it. The string is shared across four services. Every team has a local override. None of the overrides are documented.

This is the prompt ownership problem, and it is the single most under-discussed failure mode in multi-team AI engineering. It is not a technical problem. It is Conway's law reasserting itself at the token level. An organization's prompts end up mirroring its org chart, its RACI gaps, and its coordination tax — and the model, which does not care about your Jira hierarchy, produces correspondingly incoherent behavior for end users who do not care either.

Semantic Diff for Prompts: Why Git Diff Lies About What Your Prompt Change Will Do

· 10 min read
Tian Pan
Software Engineer

A teammate opens a pull request that rewrites your agent's system prompt from 420 lines to 380. The diff is green-and-red carnage: deleted paragraphs, moved sections, tightened language. You approve it because the cleanup looks sensible. A week later, refund-request accuracy has dropped eight points and nobody can say which line did it.

A different teammate adds the word "concise" to one instruction. One word of diff. Nobody reviews it closely because there is almost nothing to review. That edit flips tool-call behavior on 22% of queries.
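
The review unit that would have caught both regressions is not the text delta but the behavior delta. A minimal sketch, assuming a hypothetical run_model(prompt, query) helper and a fixed probe set: run both prompt versions side by side and attach the behavioral diff to the PR next to the textual one.

```python
def behavior_diff(run_model, probes, prompt_old, prompt_new):
    """run_model(prompt, query) -> {"answer": str, "tool_calls": list[str]} (assumed)."""
    changed = []
    for query in probes:
        old, new = run_model(prompt_old, query), run_model(prompt_new, query)
        if old["tool_calls"] != new["tool_calls"] or old["answer"] != new["answer"]:
            changed.append({"query": query, "old": old, "new": new})
    return {
        "probes": len(probes),
        "behavior_changed": len(changed),
        "changed_rate": len(changed) / len(probes),
        "examples": changed[:10],   # a sample for the reviewer, next to the text diff
    }
```

A 380-line cleanup and a one-word edit produce wildly different text diffs; the behavioral diff is the one that tells you which is the risky change.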

The Ship-and-Pin Trap: How Model Version Stability Becomes Deprecation Debt

· 9 min read
Tian Pan
Software Engineer

Pinning a model version in production feels like engineering discipline. You lock claude-opus-4-0 or gpt-4o-2024-08-06 into config, write a note in the README about why, and move on to shipping features. The output distribution stops shifting under you, the evals stay green, and the prompt tuning you did last quarter keeps working. What you've actually done is start a silent timer. Twelve to fifteen months later the deprecation email arrives, and three sprints of undocumented behavioral dependencies — prompt tuning, eval calibration, output shape assumptions, temperature quirks — all come due at once.

This is the ship-and-pin trap. Pinning is correct in the short term and catastrophic in the long term, because the cost of stability compounds in places you aren't looking. The prompt that was "good enough" a year ago is now load-bearing in ways nobody documented. The JSON schema your downstream service expects was shaped to one model's tokenization habits. The few-shot examples you hand-tuned were tuned against a specific model's notion of helpfulness. When the provider retires the version string, none of these dependencies migrate automatically, and the work to re-qualify them always lands under deadline pressure.
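
One way to keep the timer visible is to register every pin alongside its announced retirement date and surface the re-qualification work on a schedule you control. A minimal sketch with illustrative routes and dates:

```python
from datetime import date

# Illustrative registry: every pinned model carries the provider's announced
# retirement date (routes and dates below are made up for the example).
PINNED_MODELS = {
    "summarizer":   {"model": "gpt-4o-2024-08-06", "retirement": date(2025, 10, 1)},
    "refund-agent": {"model": "claude-opus-4-0",   "retirement": None},  # not announced; re-check quarterly
}

def migration_warnings(today: date, lead_days: int = 120) -> list[str]:
    warnings = []
    for route, info in PINNED_MODELS.items():
        sunset = info["retirement"]
        if sunset and (sunset - today).days <= lead_days:
            warnings.append(
                f"{route}: {info['model']} retires {sunset}; start re-qualifying prompts, "
                f"few-shot examples, and output-shape assumptions now"
            )
    return warnings
```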

Token Spend Is a Security Signal Your SOC Isn't Watching

· 10 min read
Tian Pan
Software Engineer

The fastest-moving breach signal in your stack isn't in your SIEM. It's in a spreadsheet someone in finance opens on the first of the month. When an attacker steals an LLM API key, exploits a prompt injection to exfiltrate data, or rides a compromised tenant session to query an adjacent customer's memory, the footprint shows up first as a token-usage anomaly — long before any DLP rule fires, any auth alert trips, or any endpoint agent notices something weird. Billing sees it. Security doesn't.

That gap is not theoretical. Sysdig's threat research team coined "LLMjacking" after watching attackers rack up five-figure daily bills on stolen cloud credentials, and the category has since matured into an organized criminal industry with $30-per-account marketplaces and documented campaigns pushing victim costs past $100,000 per day. OWASP catalogued a startup that ate a $200,000 bill in 48 hours from a leaked key. A Stanford research group burned $9,200 in 12 hours on a forgotten token in a Jupyter notebook. The common thread in every one of these incidents: the billing graph told the story hours or days before anyone in security noticed.
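
Closing the gap does not require a new product, just treating per-key spend as a signal. A minimal sketch, assuming you can pull daily token counts per API key from your usage records: flag any key whose spend today sits far outside its own recent baseline.

```python
import statistics

def spend_anomalies(usage_by_key: dict[str, list[int]], z_threshold: float = 4.0):
    """usage_by_key: api_key -> daily token counts, oldest first, today last."""
    alerts = []
    for key, daily in usage_by_key.items():
        history, today = daily[:-1], daily[-1]
        if len(history) < 7:
            continue                         # not enough baseline to judge
        mean = statistics.fmean(history)
        stdev = statistics.pstdev(history) or 1.0
        z = (today - mean) / stdev
        if z >= z_threshold:
            alerts.append((key, today, round(z, 1)))
    return alerts   # route these to the SOC, not just to the billing dashboard
```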

Time-to-First-Token Is the Latency SLO You Aren't Instrumenting

· 11 min read
Tian Pan
Software Engineer

Pull the last week of production traces and look at your latency dashboard. You almost certainly have p50 and p99 on total request latency. You probably have token throughput. You may even have a tokens-per-second chart, because a provider benchmark talked you into it. What you almost certainly do not have is a per-model, per-route, per-tenant histogram of time to first token — the single number that governs how fast your product feels.

This is not a small oversight. For any streaming interface — chat, code completion, agent sidebars, voice — perceived speed is set by how long the user stares at a blinking cursor before anything appears. Once the first token lands, the user is reading; subsequent tokens compete with their reading speed, not with their patience. Total latency matters for throughput planning and budget. TTFT matters for whether the product feels alive.

The gap between these two numbers is widening. Reasoning models can produce identical total latency to their non-reasoning siblings while pushing TTFT from 400 ms to 30 seconds. A routing change that "keeps latency flat" can silently turn a snappy assistant into a hanging window. If you are not graphing TTFT, you are shipping UX regressions you cannot see.
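
Instrumenting it is cheap if you wrap the stream once. A minimal sketch, assuming a generic streaming call and an emit_metric hook of your own (both hypothetical): record TTFT separately from total latency, tagged by model, route, and tenant.

```python
import time

def stream_with_ttft(stream_fn, *, model, route, tenant, emit_metric):
    """stream_fn() -> iterator of token chunks (assumed); emit_metric is your metrics hook."""
    start = time.monotonic()
    first_token_at = None
    for chunk in stream_fn():
        if first_token_at is None:
            first_token_at = time.monotonic()
            emit_metric("ttft_seconds", first_token_at - start,
                        tags={"model": model, "route": route, "tenant": tenant})
        yield chunk
    emit_metric("total_seconds", time.monotonic() - start,
                tags={"model": model, "route": route, "tenant": tenant})
```

With both series on the same dashboard, a routing change that "keeps latency flat" but quadruples TTFT stops being invisible.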

The AI Changelog Problem: Why Your Prompt Updates Are Breaking Other Teams

· 11 min read
Tian Pan
Software Engineer

A platform team ships a one-line tweak to the system prompt of their summarization service. No code review, no migration guide, no version bump — it's "just a prompt." Two weeks later, the legal product team finds out their compliance auto-redaction has been silently letting names through. The investigation eats a sprint. The fix is trivial. The damage is the trust.

This is the AI changelog problem in miniature. Behavior is now a first-class output of your system, and behavior changes when prompts, models, retrievers, or tool schemas change — none of which show up in the consuming application's git diff. Teams that treat AI updates like backend deploys, where a Slack message in #releases is enough, end up reinventing the worst parts of the early-2010s "we'll just push and tell QA later" workflow.
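
A minimal sketch of what a consumable changelog entry might carry, with hypothetical field and service names: a version bump, the observed eval delta, and the downstream consumers who have to re-qualify before the change lands.

```python
from dataclasses import dataclass, field

@dataclass
class PromptRelease:
    service: str
    prompt_version: str            # e.g. "v14 -> v15"
    change_summary: str
    eval_delta: dict[str, float]   # metric name -> change observed on the release eval
    affected_consumers: list[str] = field(default_factory=list)
    breaking: bool = False

release = PromptRelease(
    service="summarization",
    prompt_version="v14 -> v15",
    change_summary="Tightened summary-length instruction; removed one inline example.",
    eval_delta={"redaction_recall": -0.06, "summary_accuracy": +0.02},
    affected_consumers=["legal-autoredaction", "support-digest"],
    breaking=True,   # a negative redaction delta is exactly the change that needs consumer sign-off
)
```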