
89 posts tagged with "llm-ops"


Diurnal Latency: Why Your AI Feature Is Slowest at 9am ET

· 8 min read
Tian Pan
Software Engineer

Sometime in the last quarter, an engineer on your team opened a Slack thread that started with "the model got slow." They had a graph: p95 latency for your assistant feature climbed steadily from 7am, peaked around 10am Eastern, plateaued through lunch, and quietly recovered after 5pm. The shape repeated the next day, and the day after that. The team retraced their deploys, blamed a tokenizer change, then a context-length regression, then nothing in particular. The fix never landed because the bug never lived in your code.

Frontier model providers run shared inference fleets. When your users wake up, so does the rest of North America, plus the European afternoon, plus every internal tool at every other company that bought into the same API. Queue depth at the provider doubles, GPU contention rises, and your p95 doubles with it — without a single line of your codebase changing. It is the most predictable production incident in your stack and almost no team builds a dashboard for it.
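You do not need a vendor product to see the shape. A minimal sketch, assuming your request logs already carry a timestamp and an end-to-end latency for each model call (the field names and units here are placeholders), is to bucket latencies by hour of day and look for the hump:

```python
from collections import defaultdict
from datetime import datetime, timezone
import math

def hourly_p95(records):
    """Group request latencies by hour of day (UTC) and return the p95 per hour.

    `records` is an iterable of (timestamp, latency_ms) pairs, where timestamp
    is a unix epoch in seconds. Field names and units are assumptions about
    your own logging.
    """
    buckets = defaultdict(list)
    for ts, latency_ms in records:
        hour = datetime.fromtimestamp(ts, tz=timezone.utc).hour
        buckets[hour].append(latency_ms)

    p95_by_hour = {}
    for hour, values in sorted(buckets.items()):
        values.sort()
        idx = max(0, math.ceil(0.95 * len(values)) - 1)
        p95_by_hour[hour] = values[idx]
    return p95_by_hour

# A flat line means you are provider-bound around the clock; a hump centered
# on the North American morning is the diurnal signature described above.
```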

Fallback Path Atrophy: Your Graceful Degradation Stopped Working Three Months Ago

· 9 min read
Tian Pan
Software Engineer

The fallback path you wrote nine months ago — the one that catches model timeouts, swaps to a cheaper provider, returns a templated message when both are down — has not actually run in production for the last twelve weeks. It was exercised once during the original launch, the integration tests still pass against it, and the runbook still references it. None of that means it works. A refactor in week six changed the shape of the upstream context object. A library bump in week nine quietly moved a config key. The code still compiles. The tests still pass because they run against the same stale fixtures the fallback was written for. The next time your primary path 504s, your "graceful degradation" will throw a NullPointerException into a user's face, and the postmortem will note — for the third time this year — that the fallback was never re-tested after the upstream contract changed.

This is the quiet failure mode of resilience engineering in AI systems. The fallback path is the part of your application that exists specifically to be ignored. Production traffic flows around it for ninety-nine days out of a hundred. CI never exercises it because no test was ever wired up to it. The team that owns it forgets it exists between incidents. Then on day one hundred, when the primary model provider has a regional outage and you finally need it, you discover the bit rot in front of a paying customer.
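The cheapest countermeasure is a scheduled drill that forces the fallback against current, real-shaped inputs instead of launch-day fixtures. A rough sketch, assuming your primary and fallback are callable as functions of a prompt (all names here are hypothetical):

```python
class PrimaryTimeout(Exception):
    """Simulated provider timeout, raised to force the fallback path."""

def call_assistant(prompt, primary, fallback):
    """Call the primary model; on timeout, fall back to the cheaper provider."""
    try:
        return primary(prompt)
    except PrimaryTimeout:
        return fallback(prompt)

def fallback_drill(fallback, sample_prompts):
    """Exercise the real fallback against today's inputs on a schedule.

    The primary is stubbed to always time out, so the fallback runs against
    the current upstream context shape, not the fixtures it was written for.
    """
    def always_times_out(prompt):
        raise PrimaryTimeout()

    failures = []
    for prompt in sample_prompts:
        try:
            reply = call_assistant(prompt, always_times_out, fallback)
            if not isinstance(reply, str) or not reply.strip():
                failures.append((prompt, "empty or non-string reply"))
        except Exception as exc:  # the NullPointerException moment, caught in a drill
            failures.append((prompt, repr(exc)))
    return failures
```

Run it weekly against a sample of recent production prompts and page on a non-empty result; the point is that the path degrades on the calendar you choose, not the provider's.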

Provider-Side Safety Drift: When Your Product Regresses Without a Deploy

· 9 min read
Tian Pan
Software Engineer

A prompt that worked on Tuesday returns "I can't help with that" on Thursday. The CI eval is green. The model name in your config didn't change. The prompt is byte-identical, hashed and pinned in source control. And yet a customer support thread is forming around the new refusal — the AI team won't see it for two weeks because it has to bubble through tier-one support, get triaged, and finally land on someone who can read the trace.

This is provider-side safety drift, and it is the most underbuilt monitoring gap in production AI today. Frontier providers tune safety filters, refusal thresholds, and content classifiers server-side on a cadence that is not on your release calendar. Your team isn't subscribed to it. There is often no release note. And the regressions are asymmetric in a way that is genuinely hard to detect: refusals creep up for legitimate intents while harmful queries you assumed the provider was filtering quietly start slipping through. The boundary moves on both sides, independently, without warning.
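Catching the drift before the support queue does mostly means probing the boundary yourself. A minimal sketch, assuming a pinned set of known-benign canary prompts and whatever client wrapper you already use to call the provider; the refusal patterns below are placeholders to tune against your own traffic:

```python
import re

# Hypothetical refusal markers; adjust to your provider's actual phrasing.
REFUSAL_PATTERNS = [
    re.compile(r"I can(?:'|no)t help with that", re.IGNORECASE),
    re.compile(r"I(?:'m| am) sorry, but I can(?:'|no)t", re.IGNORECASE),
]

def refusal_rate(call_model, canary_prompts):
    """Send pinned, known-benign prompts and measure how many come back refused.

    `call_model` is whatever wrapper you already use to hit the provider.
    Run this daily and alert on a jump relative to the trailing baseline,
    since the provider will not announce that the threshold moved.
    """
    refusals = 0
    for prompt in canary_prompts:
        reply = call_model(prompt)
        if any(p.search(reply) for p in REFUSAL_PATTERNS):
            refusals += 1
    return refusals / max(1, len(canary_prompts))
```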

Reading the Agent Stack Trace: Triangulating Failures Across Model, Tool, and Harness

· 10 min read
Tian Pan
Software Engineer

A user reports that the agent gave a wrong answer. You open the trace. The model's reasoning looks fine. The tool calls all returned 200 OK. The harness logs show no retries, no truncation, no anomalies. And yet the answer is wrong. So you spend the next two hours stitching together three separate log streams in three different formats with three different clocks, and you eventually discover that a tool quietly returned {"result": null} for one specific query shape, the model rationalized the null into a plausible-sounding fact, and the harness happily forwarded the hallucination to the user. None of the three layers logged anything alarming on its own. The failure lived in the joints.

This is the dominant failure pattern in production agent systems, and most teams are debugging it with single-layer tools. The model team blames the tool. The tool team blames the model. The platform team blames the harness. Everyone is partially right, because an agent failure is almost never a single-component bug — it is a misalignment between three components that each operate on different mental models of "a step." Until your tracing infrastructure reflects that reality, you will keep paying for the same incident in different costumes.
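The first fix is mechanical: join the streams on one correlation key and look specifically at the joints. A reduced sketch of the model-tool seam, assuming each log entry is a dict carrying the fields named below (all of which are assumptions about your own schemas; the harness stream joins the same way):

```python
import json

def model_tool_joints(model_log, tool_log):
    """Join model and tool log streams on a shared (trace_id, step) key.

    The check that matters here: a tool call that returned 200 with a null
    result, which the model is then free to rationalize into a
    plausible-sounding answer.
    """
    tools = {(e["trace_id"], e["step"]): e for e in tool_log}
    flagged = []
    for m in model_log:
        key = (m["trace_id"], m["step"])
        tool = tools.get(key)
        if tool is None:
            flagged.append((key, "model step with no matching tool record"))
            continue
        body = tool.get("body")
        try:
            payload = json.loads(body) if isinstance(body, str) else body
        except json.JSONDecodeError:
            flagged.append((key, "tool body is not valid JSON"))
            continue
        null_result = payload is None or (
            isinstance(payload, dict) and payload.get("result") is None
        )
        if tool.get("status") == 200 and null_result:
            flagged.append((key, "tool returned 200 OK with a null result"))
    return flagged
```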

The Session Boundary Problem: Where a Conversation Ends for Billing, Eval, and Memory

· 11 min read
Tian Pan
Software Engineer

Three teams are looking at the same event stream, each with a column called session_id, and each with a different definition of what a session is. Billing inherited a 30-minute idle window from the auth library. Eval inherited "everything until the user says 'bye' or stops typing for 10 minutes" from a chatbot framework. Memory uses a thread ID that the UI generates whenever the user clicks "New chat" — which most users never do. Three columns, three semantics, one rolled-up dashboard, three unrelated bugs that share a root cause.

This is the session boundary problem. It looks like an instrumentation nit, but it is actually a product question wearing infrastructure clothes: where does a conversation end? The honest answer is that there is no single answer — a session for billing is not the same object as a session for eval is not the same object as a session for memory — and a team that picks one default and lets the other two inherit it is shipping a billing dispute, an eval bias, and a memory leak with the same root cause.
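The way out is not to pick a winner but to derive all three boundaries explicitly from the same event stream, so each column's semantics are written down rather than inherited. A sketch, with field names as assumptions:

```python
def assign_sessions(events, idle_minutes):
    """Assign idle-gap sessions over an event stream, per user.

    `events` is a list of dicts with `user_id`, `ts` (epoch seconds), and
    `thread_id`; field names are assumptions. Returns session labels aligned
    with the events in time order.
    """
    last_seen = {}   # user_id -> (last_ts, session_index)
    counters = {}    # user_id -> next session index
    labels = []
    gap = idle_minutes * 60
    for e in sorted(events, key=lambda e: e["ts"]):
        uid = e["user_id"]
        prev = last_seen.get(uid)
        if prev is None or e["ts"] - prev[0] > gap:
            idx = counters.get(uid, 0)
            counters[uid] = idx + 1
        else:
            idx = prev[1]
        last_seen[uid] = (e["ts"], idx)
        labels.append(f"{uid}:{idx}")
    return labels

# Three boundaries, derived explicitly rather than inherited:
# billing_session = assign_sessions(events, idle_minutes=30)
# eval_session    = assign_sessions(events, idle_minutes=10)  # plus a "bye" cut, if you keep it
# memory_session  = [e["thread_id"] for e in events]          # whatever "New chat" produces
```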

Token-Aware Logging: When Your Traces Cost More Than the Inference They Observe

· 12 min read
Tian Pan
Software Engineer

A team I talked to last quarter spent six weeks chasing a memory pressure alert on their agent platform. The agents were cheap — a few cents a run. The traces were not. Their telemetry pipeline was eating three times the budget of the LLM calls it was instrumenting, and most of the spend went to fields nobody had read in months: full prompt bodies stored on every span, tool outputs duplicated across parent and child traces, and an LLM-judge evaluator that re-paid the inference bill on every captured trace.

This is the AI observability cost crisis in miniature. A 2026 industry write-up modeled a customer support bot handling 10,000 conversations of five turns each, which it put at roughly 200,000 LLM invocations, 400 million tokens, and a million trace spans per day. Datadog users widely report observability bills jumping 40-200% after they instrument AI workloads on the same backend that handled their REST APIs. The pipeline is paying twice for the same tokens: once to generate them, once to remember them.

The fix is not "log less." The fix is to treat observability for AI systems as a workload with its own unit economics, separate from the request-response telemetry traditional services emit. Traditional logging is structured fields you can compress and forget; AI logging is unbounded text bodies that re-enter the inference budget every time something reads them. That distinction is what "token-aware logging" means.
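In practice that means deciding, per text field, what you actually need to keep. One sketch of the idea, with the budget and sampling rate as assumptions rather than recommendations: always keep a hash and a length, keep a truncated head, and keep the full body only for a deterministic sample of traces.

```python
import hashlib

MAX_LOGGED_TOKENS = 256      # per-field budget; an assumption, tune per field
FULL_CAPTURE_RATE = 0.02     # keep complete bodies on 2% of traces

def token_aware_field(text, trace_id, approx_chars_per_token=4):
    """Decide what to log for one unbounded text field on a span.

    The hash and length are enough to detect drift and dedupe; the truncated
    head keeps the field readable; the full body survives only for a
    deterministic sample keyed on trace_id so all spans in a trace agree.
    """
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
    record = {"sha": digest, "chars": len(text)}
    budget_chars = MAX_LOGGED_TOKENS * approx_chars_per_token
    keep_full = (
        int(hashlib.sha256(trace_id.encode("utf-8")).hexdigest(), 16) % 10_000
        < FULL_CAPTURE_RATE * 10_000
    )
    record["body"] = text if keep_full else text[:budget_chars]
    record["truncated"] = not keep_full and len(text) > budget_chars
    return record
```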

The Eval Overcrowding Problem: Why Your Bigger Test Suite Is Catching Fewer Regressions

· 9 min read
Tian Pan
Software Engineer

Your AI eval suite has 800 test cases. You add 200 more. Your model now scores 94% on evals and you ship with confidence. Three days later, a user finds a regression that none of your 1,000 tests caught.

This isn't bad luck — it's structural. The regression exists precisely because of how you grew your test suite, not despite it. The instinct to add more evals when something breaks is correct in theory and counterproductive in practice. More tests do not automatically mean better coverage of what matters. They mean better coverage of what's easy to test, which is a different thing entirely.
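One crude way to see this in your own suite is to ask how many cases ever tell you something the rest of the suite did not. A sketch, assuming you keep pass/fail results per case across the model versions you have evaluated (the input shape is an assumption):

```python
def redundancy_report(results):
    """Count eval cases whose pass/fail pattern duplicates another case's.

    `results` maps case_id -> tuple of booleans (pass/fail across model
    versions). A suite where most cases share a handful of patterns is big
    but not wide: it keeps re-testing what is easy to test.
    """
    seen = {}
    redundant = []
    for case_id, pattern in results.items():
        if pattern in seen:
            redundant.append((case_id, seen[pattern]))
        else:
            seen[pattern] = case_id
    return {
        "cases": len(results),
        "distinct_patterns": len(seen),
        "redundant_cases": len(redundant),
    }
```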

The Invisible Handoff: Why Production AI Failures Cluster at Component Boundaries

· 9 min read
Tian Pan
Software Engineer

When your AI feature ships a wrong answer, the first question is always: "Was it the model?" Most engineers reach for model evaluation, run a few test prompts, and conclude the model looks fine. They're usually right. The model is fine. The breakage happened somewhere else—at one of the invisible seams where your components talk to each other.

The evidence for this is consistent. Analysis of production RAG deployments shows 73% of failures are retrieval failures, not generation failures. In multi-agent systems, the most common failure modes are message ordering violations, state synchronization gaps, and schema mismatches—none of which show up in any per-component health check. GPT-4 produces invalid responses on complex extraction tasks nearly 12% of the time, not because the model is broken, but because the output format contract between the model and the downstream parser was never enforced.

The model gets blamed. The boundary is the culprit.
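The cheapest boundary to harden is the one the data already points at: the output contract between the model and the downstream parser. A sketch of enforcing it explicitly, with a hypothetical extraction schema standing in for whatever your task actually expects:

```python
import json

# Hypothetical contract; replace with the fields your downstream parser needs.
REQUIRED_FIELDS = {"invoice_id": str, "total_cents": int, "currency": str}

def parse_extraction(raw):
    """Validate the model's output at the model-parser boundary.

    Rejects loudly instead of letting a malformed response drift downstream
    and surface as a "model problem" three components later.
    """
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model returned non-JSON output: {exc}") from exc
    if not isinstance(payload, dict):
        raise ValueError("model output is JSON but not an object")
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in payload:
            raise ValueError(f"missing required field: {field}")
        if not isinstance(payload[field], expected_type):
            raise ValueError(
                f"{field} has type {type(payload[field]).__name__}, "
                f"expected {expected_type.__name__}"
            )
    return payload
```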

AI Procurement Clauses Your Lawyers Haven't Learned to Ask For Yet

· 11 min read
Tian Pan
Software Engineer

The 14-month-old AI vendor contract on your shared drive was drafted from a SaaS template. It guarantees uptime, names a security contact, and caps liability at twelve months of fees. It says nothing about whether your prompts get fed into the next training run, what happens when the model you depend on is quietly swapped for a smaller variant, or which region your inference logs sit in when a regulator asks. The lawyer who drafted it did a competent job with the vocabulary they had. The vocabulary is a generation behind the surface area.

Procurement teams are still optimizing for the wrong contract. The standard MSA fights battles from the 2010s — outage credits, breach notification windows, indemnification for IP that makes it into the source repository. AI vendor relationships have a different attack surface, and the clauses that matter most are the ones that don't have a heading in your existing template. The team that lets last year's procurement playbook handle this year's vendor stack is signing away leverage they will need within a year.

Cohort-Aware Fine-Tuning: When One Model Isn't Enough But Per-User Is Too Much

· 11 min read
Tian Pan
Software Engineer

A team I talked to last quarter shipped a fine-tuned model that beat their base by four points on their internal eval, then watched their top three customers churn over the following six weeks. The eval was fine. The aggregate was fine. The fine-tune just happened to win on the median user, who was a small-business buyer asking short factual questions, while silently regressing on the enterprise legal cohort whose long, citation-heavy queries had been the actual revenue driver. Nobody had sliced the eval by customer tier because nobody on the modeling side knew the customer tier mattered.

Most fine-tuning conversations live at one of two extremes. On one end, the "one fine-tune to rule them all" approach trains a single specialized model on a mix of all customer data and washes out the cohort-specific behavior that actually distinguished segments in the base model. On the other end, the "per-customer fine-tune" approach trains a separate adapter for each tenant, which is operationally tolerable below a hundred customers and falls apart somewhere around a few hundred. The interesting middle ground — where a small number of cohort-aware fine-tunes serve a segmented user base — is missing from most production playbooks.
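Whatever point you pick on that spectrum, the prerequisite is the eval slice that was missing in the story above. A sketch of a cohort-sliced comparison between base and fine-tune, with field names and the regression threshold as assumptions:

```python
from collections import defaultdict

def cohort_deltas(rows, min_regression=0.02):
    """Slice eval results by cohort and flag where the fine-tune loses to base.

    `rows` is an iterable of dicts with `cohort`, `base_score`, and `ft_score`
    per eval example. The aggregate can improve while a revenue-critical
    cohort regresses, which is exactly the failure described above.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for r in rows:
        s = sums[r["cohort"]]
        s[0] += r["base_score"]
        s[1] += r["ft_score"]
        s[2] += 1
    report = {}
    for cohort, (base, ft, n) in sums.items():
        delta = (ft - base) / n
        report[cohort] = {
            "n": n,
            "delta": round(delta, 4),
            "regressed": delta < -min_regression,
        }
    return report
```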

The First 100 Tickets After You Launch an AI Feature

· 12 min read
Tian Pan
Software Engineer

The bug count after an AI launch is not a quality problem. It is a discovery sequence — a sequence so predictable that you can sketch it on a whiteboard before the launch announcement goes out, week by week, ticket by ticket, and be embarrassingly close to right by the time the dashboards catch up. Every team that ships an AI feature runs this sequence. The only choice is whether you run it with a runbook or with a series of unscheduled all-hands.

I have watched enough launches now to believe the sequence is not really about engineering quality. It is about an information gap. Pre-launch, the team has a synthetic traffic mix, a curated eval set, a happy-path demo, and a board deck. Post-launch, real users arrive with intents the synthetic traffic never modeled, a marketing team that runs campaigns engineering hears about secondhand, a model provider that ships changes the team did not authorize, and a privacy reviewer who was on vacation when the feature shipped. The sequence below is the friction that happens when those two worlds collide.

The Model Provider Webhook Surface You Forgot to Subscribe To

· 11 min read
Tian Pan
Software Engineer

The first time my team found out a model we depended on was being retired, we found out from a customer. The deprecation email had landed in a shared inbox three engineers had unsubscribed from. The provider's status page had a banner up. The webhook event had fired into a void because we never wired up the receiver. Sixty days of warning, used by us as zero days of warning, ending with an outage and a calendar full of "emergency migration" syncs.

Most teams I talk to are running this exact setup right now and don't know it. Every major LLM provider has been quietly building out a notification surface — webhooks for incidents, deprecation events in changelogs, account warnings sent by email, billing anomaly pings, region failover signals — and most teams have it disabled or routed to a mailing list nobody reads. The provider has been telling you the bad news in advance. You've been choosing not to listen.
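The remedy is unglamorous: stand up a receiver, subscribe it to everything the provider will send, and route the urgent event types somewhere a human will see them. A sketch using Flask, with the event names and payload fields as placeholders since every provider shapes its notifications differently; a real receiver should also verify the provider's webhook signature:

```python
from flask import Flask, request

app = Flask(__name__)

# Placeholder event types; map your provider's actual payloads onto these.
URGENT_EVENTS = {"model.deprecation", "incident.created", "billing.anomaly"}

def page_on_call(summary):
    """Stub for whatever your team actually uses (PagerDuty, Slack, email)."""
    print(f"ALERT: {summary}")

@app.route("/provider-webhook", methods=["POST"])
def provider_webhook():
    event = request.get_json(force=True, silent=True) or {}
    kind = event.get("type", "unknown")
    if kind in URGENT_EVENTS:
        page_on_call(f"{kind}: {event.get('summary', 'no summary provided')}")
    # Always acknowledge so the provider does not retry forever; store the
    # rest for the weekly review nobody will skip once it has paged them.
    return {"received": True}, 200
```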