Skip to main content

33 posts tagged with "inference"

View all tags

The Streaming Abort Your Provider Billed Anyway: A 14% Gap Hiding in Your Invoice

· 10 min read
Tian Pan
Software Engineer

Your finance team filed a dispute and lost. The line item is "output tokens" and it exceeds your sum-of-delivered-tokens metric by fourteen percent. The provider's support engineer closed the ticket as "expected behavior under streaming cancellation," with a link to a documentation page that says "cancellation stops billing at the last delivered token." Both sentences are true, and the gap between them is the line of code you have not written.

The contract you read says one thing. The inference scheduler does another. The mismatch is not a bug, not a billing error, and not malice — it is a layered system in which the cancellation signal travels through three boundaries (browser, edge, GPU) and the billing meter sits at the third boundary while your "stop generating" button sits at the first. Closing the gap is an engineering project with a finance owner.

The autoscaler that scaled to zero mid-decode: when inference is treated like stateless web traffic

· 12 min read
Tian Pan
Software Engineer

The cluster did exactly what we told it to. Traffic dropped to zero for forty-five seconds, the queue-depth metric flatlined, KEDA flipped the replica count from one to zero, and the node autoscaler reclaimed the H100 pod ninety seconds later. The graph looked clean. The Slack channel was quiet. The cost dashboard ticked down half a cent.

An hour and twelve minutes later, a customer support ticket arrived: a long-running document-analysis job — a 180k-token reasoning task that was budgeted for twenty-eight minutes of decode — had vanished. No error in their client SDK. No exception in our application logs. Only a single 499 line buried in the gateway access log, timestamped roughly when the scheduler had decided the pod was idle and reaped it.

The Deterministic Seed Your Eval Suite Set That Your Provider Quietly Ignored

· 11 min read
Tian Pan
Software Engineer

You set seed=42. You set temperature=0. You logged the run, posted the dashboard, signed off on the model swap. The next morning the rerun returned a different number on the same prompts, and the explanation you reached for — "must be sampling noise" — was wrong twice over: there was no sampling, and the noise was structural. The seed left your client, the gateway threw it away, the kernel batched your request next to seventeen unrelated ones, and the floating-point reduction order changed under you. Your "reproducible" benchmark was always within one batch of being a different benchmark.

This failure mode is quiet because every layer in the stack is technically correct. The SDK accepts the seed. The provider documents the seed. The model returns a system_fingerprint. The eval harness logs all three. Nothing 5xx's, nothing warns, nothing protests. The number on the dashboard just shifts, and the team rationalizes the shift as the kind of jitter that always existed — because they have no instrument that can tell them whether they're looking at stochastic decoding or at a backend rotation that invalidated three weeks of comparisons.

The Deterministic Seed Your Provider Treated as a Hint, Not a Contract

· 10 min read
Tian Pan
Software Engineer

The CI test was a single assertion: same model, same temperature, same prompt, same seed, same output string. It passed on every developer's laptop, passed on the first hundred CI runs, and then flaked once every fifty runs for three weeks before anyone admitted the pattern was real. The first hypothesis was the obvious one — a non-deterministic dependency somewhere in the test harness — and three days of investigation found nothing. The actual cause was sitting in a footnote on the provider's API reference: "seed provides best-effort determinism." The team had read the parameter name and assumed a contract. The provider had documented a hint.

This is a specific failure mode of hosted inference that catches teams who design test infrastructure around a single mental model: the model is a pure function of its inputs, and the seed is what makes the function reproducible. Both halves of that model are wrong in production, and the gap between the API surface and the underlying physics is wide enough that teams build entire eval and regression-test stacks on top of an assumption their provider explicitly disclaimed.

The GPU Reservation Your Batch Workload Starved Your Real-Time Path On

· 9 min read
Tian Pan
Software Engineer

The nightly fine-tune job starts at 02:00 UTC. It walks into the shared GPU pool, takes every slot it can find, and holds them. By 09:30, when the first inference traffic of the business day arrives, the autoscaler tries to claim capacity that has been continuously occupied for seven and a half hours. The first ninety minutes of the morning run at roughly four times the baseline p99 latency. The dashboard reports a "noisy morning tail" that the inference team attributes to user behavior, because the actual contention lives in a job queue nobody on the inference team owns.

This is the GPU-sharing failure mode that the cost-attribution slide in your capacity review does not capture. The sharing was sold as a utilization win — train at night, serve in the day, fill the trough. What actually shipped was a latency tail you cannot escape until the pool is partitioned by latency class, not by team or by clock.

The Slow Turn That Wasn't Yours: KV Cache Eviction Mid-Conversation

· 10 min read
Tian Pan
Software Engineer

A conversation has been moving along on a single Claude session for forty minutes. Eleven turns, each averaging 800ms time-to-first-token, each cheap because the 28,000-token prefix is hitting the prompt cache. Turn twelve arrives and TTFT is 3.4 seconds. The transcript hasn't changed shape. The model didn't switch. The network is fine. Cached input tokens drop from 27,800 to 0. The next turn's prefill bill is paid in full, from the first token.

You go looking for the cause in your traces and find nothing that names it. There is no event in your logs labeled "another tenant's burst evicted you." The only honest reading of the spike is that some other customer's prompt, somewhere on the same GPU pool, made the scheduler decide your warm prefix was the cheapest thing to drop. You cannot replay the turn. You cannot prove the eviction. The cache state at that moment was a function of strangers' traffic, and that traffic is not in your trace because it was never yours to see.

The Nightly Batch Job That Quietly Became a Latency-Critical Service

· 10 min read
Tian Pan
Software Engineer

It started as a cron job. Every night at 2 a.m., a script woke up, pulled the day's records, ran them through a model, wrote the results to a table, and went back to sleep. It was the simplest possible shape for the problem, and for a year it was exactly the right shape. Nobody thought about it because nobody needed to.

Then someone asked if the results could be ready by 8 a.m. instead of noon. Then someone asked if a user could trigger a run for a single record on demand. Then a product manager asked if it could "feel instant" inside the app. Each request was reasonable. Each change was small. And at no point did anyone open a document titled "Re-architecting the inference pipeline," because at no point did any single change feel like a rewrite.

Eighteen months later you have a latency-critical online service wearing the body of a batch job. It has a p99 nobody measures, a queue nobody drains, and a failure mode where one bad record stalls a user-facing request because the pipeline was built to retry the whole batch. This is one of the most common architectural failures in AI systems, and it almost never shows up as a decision. It shows up as a slow accumulation of reasonable yeses.

Warm Pools and Cold Truths: The Hidden Latency Floor of Serverless LLM Inference

· 9 min read
Tian Pan
Software Engineer

Autoscaling your GPU inference to zero looks like obvious cost discipline. The GPU is the most expensive line item on the bill, traffic is bursty, and the idle hours are pure waste. So you turn on scale-to-zero, watch the cloud invoice drop, and congratulate yourself.

Then a user shows up after a quiet stretch, and their first request takes sixty seconds to return a single token. Production deployments running serverless LLM inference routinely report cold starts exceeding 40 seconds before the first token appears — against roughly 30 milliseconds per token once the model is warm. That is a thousand-fold latency gap between the cold path and the warm path, and it is entirely a function of how idle your traffic happens to be.

This is the trade nobody puts on the slide. Scale-to-zero does not eliminate cost; it converts a steady dollar cost into a spiky latency cost, and then hides that latency cost in the p99 tail where the dashboard rarely looks.

LLM Tail Latency: Why Your P99 Is a Disaster When P50 Looks Fine

· 10 min read
Tian Pan
Software Engineer

Your LLM API returns a median (P50) latency of 800 milliseconds. Your dashboard is green. Your SLAs say "under two seconds." Then a user files a support ticket: "it just spins for thirty seconds and then gives up." You check the logs and see a P99 of 28 seconds.

That gap — a 35x ratio between median and tail latency — is not a fluke. It is a structural property of how LLMs work, and it will not go away by tuning your timeouts.

Diffusion Models in Production: The Engineering Stack Nobody Discusses After the Demo

· 10 min read
Tian Pan
Software Engineer

Your image generation feature just went viral. 100,000 requests are coming in daily. The API provider's rate limit technically accommodates it. Latency crawls to 12 seconds at p95. Your NSFW classifier is flagging legitimate medical illustrations. A compliance audit surfaces that California's AI Transparency Act required watermarking since September 2024. Support has 50 open tickets from users whose content was silently blocked. By the time you realize you need a real production stack, you've already burned two weeks in crisis mode.

This is the moment "just call the API" fails—not because the API is bad, but because the demo's success exposes every assumption you made about inference latency, content policy, moderation fairness, and regulatory compliance. The engineering work nobody shows you in tutorials lives here.

Thinking Budgets: When Extended Reasoning Models Actually Make Economic Sense

· 10 min read
Tian Pan
Software Engineer

A surprising number of AI teams default to extended thinking on every query once they gain access to an o3-class or Claude extended thinking model. The logic seems obvious: smarter reasoning equals better outputs, so why not always enable it? The problem is that this reasoning fails to account for a basic fact of how test-time compute scaling works in practice. Extended thinking dramatically improves performance on a specific class of tasks, degrades quality on others, and can inflate your inference costs by 5–30x across the board. The teams getting the most value from these models treat the reasoning budget as an explicit decision — one with the same weight as model selection or prompt engineering.

This post lays out the task taxonomy, the cost structure, and the routing decision framework that distinguishes teams who use thinking budgets strategically from teams who are just paying a premium for an illusion of quality.

The Budget Inversion Trap: Why Your Most Valuable AI Features Get the Cheapest Inference

· 8 min read
Tian Pan
Software Engineer

Most teams optimize AI inference costs by routing cheaper queries to cheaper models. That sounds reasonable — and it's backwards. The queries that go to cheap models first aren't the simple ones. They're the complex ones, because those are the expensive ones your FinOps dashboard flagged.

The result: your contract renewal workflow, the one that closes six-figure deals, runs on a model that hallucinates clause references. Your customer support triage — entry-level stuff, genuinely low-stakes — gets frontier model treatment because nobody complained about it yet.

This is the budget inversion trap. It's not caused by negligence. It's the predictable output of applying cost pressure without value context.