108 posts tagged with "llm-ops"

Token-Aware Logging: When Your Traces Cost More Than the Inference They Observe

· 12 min read
Tian Pan
Software Engineer

A team I talked to last quarter spent six weeks chasing a memory pressure alert on their agent platform. The agents were cheap — a few cents a run. The traces were not. Their telemetry pipeline was eating three times the budget of the LLM calls it was instrumenting, and most of the spend went to fields nobody had read in months: full prompt bodies stored on every span, tool outputs duplicated across parent and child traces, and an LLM-judge evaluator that re-paid the inference bill on every captured trace.

This is the AI observability cost crisis in miniature. A 2026 industry write-up modeled a customer support bot with 10,000 conversations and five turns each. That is 50,000 user turns, but the write-up counts 200,000 LLM invocations (which implies roughly four model calls per turn), 400 million tokens, and roughly a million trace spans per day. Datadog users widely report observability bills jumping 40-200% after they instrument AI workloads on the same backend that handled their REST APIs. The pipeline is paying twice for the same tokens: once to generate them, once to remember them.

The fix is not "log less." The fix is to treat observability for AI systems as a workload with its own unit economics, separate from the request-response telemetry traditional services emit. Traditional logging is structured fields you can compress and forget; AI logging is unbounded text bodies that re-enter the inference budget every time something reads them. That distinction is what "token-aware logging" means.
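One way to picture the idea is to keep each distinct text body once and let spans carry only a hash and a length. The sketch below is a minimal illustration under assumptions, not a description of any particular tracing backend: `BodyStore`, `make_span`, and the span field names are hypothetical, and the store is an in-memory dict.

```python
import hashlib


class BodyStore:
    """Hypothetical content-addressed store: each distinct body is kept once."""

    def __init__(self):
        self._bodies = {}

    def put(self, text: str) -> str:
        # Identical bodies hash to the same key, so repeats cost nothing extra.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        self._bodies.setdefault(digest, text)
        return digest

    def get(self, digest: str) -> str:
        return self._bodies[digest]

    def stored_chars(self) -> int:
        return sum(len(body) for body in self._bodies.values())


def make_span(name: str, prompt: str, store: BodyStore) -> dict:
    """Build a span that references the prompt by hash instead of embedding it."""
    return {
        "name": name,
        "prompt_sha256": store.put(prompt),
        "prompt_chars": len(prompt),
    }


store = BodyStore()
prompt = "You are a support assistant. " * 100
spans = [make_span(f"turn-{i}", prompt, store) for i in range(5)]
```

Here five spans share one stored copy of the prompt, so storage grows with the number of distinct bodies rather than the number of spans. Whether a reader ever pays to rehydrate a body is then an explicit lookup, not a default.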

The Eval Overcrowding Problem: Why Your Bigger Test Suite Is Catching Fewer Regressions

· 9 min read
Tian Pan
Software Engineer

Your AI eval suite has 800 test cases. You add 200 more. Your model now scores 94% on evals and you ship with confidence. Three days later, a user finds a regression that none of your 1,000 tests caught.

This isn't bad luck — it's structural. The regression exists precisely because of how you grew your test suite, not despite it. The instinct to add more evals when something breaks is correct in theory and counterproductive in practice. More tests do not automatically mean better coverage of what matters. They mean better coverage of what's easy to test, which is a different thing entirely.

The Invisible Handoff: Why Production AI Failures Cluster at Component Boundaries

· 9 min read
Tian Pan
Software Engineer

When your AI feature ships a wrong answer, the first question is always: "Was it the model?" Most engineers reach for model evaluation, run a few test prompts, and conclude the model looks fine. They're usually right. The model is fine. The breakage happened somewhere else—at one of the invisible seams where your components talk to each other.

The evidence for this is consistent. Analysis of production RAG deployments shows 73% of failures are retrieval failures, not generation failures. In multi-agent systems, the most common failure modes are message ordering violations, state synchronization gaps, and schema mismatches—none of which show up in any per-component health check. GPT-4 produces invalid responses on complex extraction tasks nearly 12% of the time, not because the model is broken, but because the output format contract between the model and the downstream parser was never enforced.

The model gets blamed. The boundary is the culprit.
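Enforcing the output format contract at the boundary can be as small as a validating parser between the model and whatever consumes its output. The following is a minimal sketch using only the standard library; the field names in `REQUIRED_FIELDS` and the `ContractViolation` exception are hypothetical, chosen purely for illustration.

```python
import json

# Hypothetical contract for an extraction task: field name -> accepted types.
REQUIRED_FIELDS = {
    "invoice_id": (str,),
    "total": (int, float),
    "currency": (str,),
}


class ContractViolation(ValueError):
    """Raised when model output does not match the agreed format."""


def parse_model_output(raw: str) -> dict:
    """Parse raw model text and reject anything that breaks the contract."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ContractViolation(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ContractViolation("expected a JSON object")
    for field, accepted in REQUIRED_FIELDS.items():
        if field not in data:
            raise ContractViolation(f"missing field: {field}")
        if not isinstance(data[field], accepted):
            raise ContractViolation(
                f"{field} has wrong type: {type(data[field]).__name__}"
            )
    return data
```

Raising a dedicated exception at the seam makes the failure countable as a boundary failure, separate from model quality, which is the distinction the excerpt argues for.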

AI Procurement Clauses Your Lawyers Haven't Learned to Ask For Yet

· 11 min read
Tian Pan
Software Engineer

The 14-month-old AI vendor contract on your shared drive was drafted from a SaaS template. It guarantees uptime, names a security contact, and caps liability at twelve months of fees. It says nothing about whether your prompts get fed into the next training run, what happens when the model you depend on is quietly swapped for a smaller variant, or which region your inference logs sit in when a regulator asks. The lawyer who drafted it did a competent job with the vocabulary they had. The vocabulary is a generation behind the surface area.

Procurement teams are still optimizing for the wrong contract. The standard MSA fights battles from the 2010s — outage credits, breach notification windows, indemnification for IP that makes it into the source repository. AI vendor relationships have a different attack surface, and the clauses that matter most are the ones that don't have a heading in your existing template. The team that lets last year's procurement playbook handle this year's vendor stack is signing away leverage they will need within a year.

Cohort-Aware Fine-Tuning: When One Model Isn't Enough But Per-User Is Too Much

· 11 min read
Tian Pan
Software Engineer

A team I talked to last quarter shipped a fine-tuned model that beat their base by four points on their internal eval, then watched their top three customers churn over the following six weeks. The eval was fine. The aggregate was fine. The fine-tune just happened to win on the median user, who was a small-business buyer asking short factual questions, while silently regressing on the enterprise legal cohort whose long, citation-heavy queries had been the actual revenue driver. Nobody had sliced the eval by customer tier because nobody on the modeling side knew the customer tier mattered.
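Slicing the eval by cohort is a small amount of code once each eval case carries a cohort label. The sketch below assumes results arrive as `(cohort, passed)` pairs; the cohort names and the toy numbers are invented for illustration and are not the team's data.

```python
from collections import defaultdict


def accuracy_by_cohort(results):
    """results: iterable of (cohort, passed) pairs. Returns {cohort: pass rate}."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for cohort, passed in results:
        totals[cohort] += 1
        passes[cohort] += int(passed)
    return {cohort: passes[cohort] / totals[cohort] for cohort in totals}


def regressions(base, tuned, tolerance=0.0):
    """Return cohorts where the tuned model scores below the base model."""
    base_acc = accuracy_by_cohort(base)
    tuned_acc = accuracy_by_cohort(tuned)
    return {
        cohort: tuned_acc[cohort] - base_acc[cohort]
        for cohort in base_acc
        if cohort in tuned_acc and tuned_acc[cohort] - base_acc[cohort] < -tolerance
    }


# Toy data: the aggregate improves from 8/10 to 9/10 while one cohort regresses.
base = [("smb", True)] * 6 + [("smb", False)] * 2 + [("enterprise_legal", True)] * 2
tuned = [("smb", True)] * 8 + [("enterprise_legal", True), ("enterprise_legal", False)]
flagged = regressions(base, tuned)
```

In this toy case `flagged` contains only the enterprise legal cohort, even though the overall pass rate went up, which is the pattern the aggregate score hides.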

Most fine-tuning conversations live at one of two extremes. On one end, the "one fine-tune to rule them all" approach trains a single specialized model on a mix of all customer data and washes out the cohort-specific behavior that actually distinguished segments in the base model. On the other end, the "per-customer fine-tune" approach trains a separate adapter for each tenant, which is operationally tolerable below a hundred customers and falls apart somewhere around a few hundred. The interesting middle ground — where a small number of cohort-aware fine-tunes serve a segmented user base — is missing from most production playbooks.

The First 100 Tickets After You Launch an AI Feature

· 12 min read
Tian Pan
Software Engineer

The bug count after an AI launch is not a quality problem. It is a discovery sequence — a sequence so predictable that you can sketch it on a whiteboard before the launch announcement goes out, week by week, ticket by ticket, and be embarrassingly close to right by the time the dashboards catch up. Every team that ships an AI feature runs this sequence. The only choice is whether you run it with a runbook or with a series of unscheduled all-hands.

I have watched enough launches now to believe the sequence is not really about engineering quality. It is about an information gap. Pre-launch, the team has a synthetic traffic mix, a curated eval set, a happy-path demo, and a board deck. Post-launch, real users arrive with intents the synthetic traffic never modeled, a marketing team that runs campaigns engineering hears about secondhand, a model provider that ships changes the team did not authorize, and a privacy reviewer who was on vacation when the feature shipped. The sequence below is the friction that happens when those two worlds collide.

The Model Provider Webhook Surface You Forgot to Subscribe To

· 11 min read
Tian Pan
Software Engineer

The first time my team found out a model we depended on was being retired, we found out from a customer. The deprecation email had landed in a shared inbox three engineers had unsubscribed from. The provider's status page had a banner up. The webhook event had fired into a void because we never wired up the receiver. Sixty days of warning, used by us as zero days of warning, ending with an outage and a calendar full of "emergency migration" syncs.

Most teams I talk to are running this exact setup right now and don't know it. Every major LLM provider has been quietly building out a notification surface — webhooks for incidents, deprecation events in changelogs, account warnings sent by email, billing anomaly pings, region failover signals — and most teams have it disabled or routed to a mailing list nobody reads. The provider has been telling you the bad news in advance. You've been choosing not to listen.

MTBF Is Dead When Your Agent Self-Heals

· 10 min read
Tian Pan
Software Engineer

A team I talked to last quarter had every dashboard green. Tool error rate flat at 0.3%. End-to-end success at 98%. SLO budget barely touched. They were also burning four times their projected token spend, and nobody could explain it. When they finally instrumented retry depth per trace, the picture inverted: the median successful request was making 2.7 tool calls instead of the 1.0 the architecture diagram promised. The agent was not failing. It was failing and recovering, over and over, inside the same span, and the success rate metric had no way to tell them.
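Instrumenting retry depth per trace can start as a simple aggregate over trace records. The sketch below assumes each trace is a dict with a `success` flag and a `tool_calls` count; that record shape and the function name `retry_depth_report` are hypothetical.

```python
from statistics import median


def retry_depth_report(traces):
    """Summarize tool-call depth among successful traces.

    traces: list of dicts with 'success' (bool) and 'tool_calls' (int).
    """
    successful = [t["tool_calls"] for t in traces if t["success"]]
    if not traces or not successful:
        raise ValueError("need at least one successful trace")
    return {
        "success_rate": len(successful) / len(traces),
        "median_tool_calls": median(successful),
        # Share of successes that needed more than the single planned call.
        "share_with_retries": sum(1 for n in successful if n > 1) / len(successful),
    }


traces = [{"success": True, "tool_calls": n} for n in (1, 2, 3, 3, 4)]
report = retry_depth_report(traces)
```

On this toy input the success rate is 100% while the median successful request makes three tool calls, so the two numbers belong side by side on the same dashboard.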

This is the part of agentic reliability that the old reliability vocabulary cannot reach. MTBF (mean time between failures) assumes failures are discrete, observable events with measurable gaps between them. You measure the gap, you compute the mean, you alert when the gap shrinks. It worked for hard drives, networks, and deterministic services. It does not work for systems that retry, reroute, fall back, and recover silently inside a single user-visible operation.

Silent Quantization: Why the Model You Pay For Today Isn't the Model You Paid For Last Quarter

· 11 min read
Tian Pan
Software Engineer

The model name on your invoice is the same as it was last quarter. The version string in the API response hasn't changed. The model card and pricing page read identically. And yet your eval scores have drifted half a point downward, your refusal patterns shifted in ways your prompts didn't ask for, and a handful of customer complaints came in last Tuesday about output that "feels different." You debug your code. You don't find anything. The code didn't change. The weights did.

Silent quantization is the gap between the model you contracted for and the model the provider is actually serving. It happens because inference economics keep tightening — every dollar of GPU capacity has to feed more requests this quarter than last — and the cheapest way to absorb that pressure is to re-host the same model name on cheaper precision tiers. FP16 becomes FP8. FP8 becomes FP4 in some routes. Mixed-precision shards get swapped in. The version string doesn't move because the version string was never a precision contract; it was a marketing contract.

Why Token Forecasts Drift After Launch — and How to Catch the Spike Before Finance Does

· 10 min read
Tian Pan
Software Engineer

The pre-launch cost model is a beautiful spreadsheet. It assumes a synthetic traffic mix run through a representative prompt at a tested cache hit rate and a clean tool-call path. The post-launch reality is that none of those assumptions survive the moment the feature actually starts working. The intents your synthetic traffic didn't cover are precisely the ones that stick. The surge from a marketing campaign that engineering never got the meeting invite for lands on the highest-cost branch in your routing tree. The heavy-user cohort that uses 40× the median doesn't show up until week three.

The industry-wide version of this problem is now well-documented: surveys put the share of enterprises missing their AI cost forecasts by more than 25% at around 80%, and report routine cost increases of 5–10× in the months immediately after a successful launch. The crucial detail in those numbers is the word successful. Failed AI features stay on budget. The drift is driven by the feature working, not by the team doing something wrong. That makes it a planning artifact problem, not an engineering problem — and the planning artifact most teams reach for, the monthly bill, is the worst possible detector.

The Two Clocks Problem: When Your Model Provider's Cadence Breaks Your Roadmap

· 10 min read
Tian Pan
Software Engineer

There are two clocks ticking on your AI product, and they are not synchronized. The model providers run on a roughly quarterly heartbeat — Claude Opus 4.6 in February 2026, GPT-5.4 in March, Claude Opus 4.7 in April, GPT-5.5 a week later. Your product roadmap was committed in January and does not look up again until July. Somewhere in between, a capability you spent eight engineer-weeks building gets shipped as a one-line API parameter, and nobody on the team has a process for noticing.

This is not a forecasting problem. The releases were widely telegraphed — anyone who reads the changelog could have seen each of them coming. It is a planning-artifact problem. Roadmaps were invented for a world where the platform underneath your product changed once a decade. The platform now changes once a quarter, and the artifact has not been updated to match.

The Brownout Pattern: When Your LLM Provider Is Slow but Not Down

· 10 min read
Tian Pan
Software Engineer

The pager that wakes you at 3 a.m. for an outage is the easy one. The provider returned 503 for forty minutes, your fallback kicked in, your runbook fired, your post-mortem writes itself. The pager that does not wake you — the one that lets your support queue fill up over six hours while every dashboard stays green — is the brownout. The provider's API still answers. The status page still says "operational." Your p99 latency has quietly drifted from 2.1 seconds to 14 seconds, your error rate from 0.1% to 4%, and the only people who noticed are the users who already left.

Provider availability is not binary. The fallback story most teams write — "if provider is down, switch to backup" — is a state machine with two states for a continuous variable, and it does not fire when the provider is sad rather than dead. Building for brownouts is a different design problem than building for outages, and almost every production agent harness I have seen ships without solving it.
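A first step past the two-state fallback is to classify provider health into at least three states from continuous signals. The sketch below is an assumption-laden illustration: the `Health` enum, the `classify` function, and every threshold value are hypothetical and would need tuning against a real baseline.

```python
from enum import Enum


class Health(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"  # the brownout state a two-state check cannot see
    DOWN = "down"


def classify(p99_seconds, error_rate, *,
             p99_degraded=5.0, err_degraded=0.01, err_down=0.5):
    """Map latency and error rate to a health state (thresholds are illustrative)."""
    if error_rate >= err_down:
        return Health.DOWN
    if p99_seconds >= p99_degraded or error_rate >= err_degraded:
        return Health.DEGRADED
    return Health.HEALTHY
```

With these illustrative thresholds, the excerpt's baseline of 2.1 seconds and 0.1% errors classifies as healthy, while 14 seconds and 4% errors classifies as degraded even though the API still answers. What the harness does in the degraded state (shed load, shorten prompts, route a share of traffic elsewhere) is a separate design decision.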