2 posts tagged with "llm-infra"

The Fine-Tune Cold Start Your Provider Bills as Idle Time

· 11 min read
Tian Pan
Software Engineer

Your fine-tuned variant serves a few hundred requests per minute on a steady weekday, and the p99 latency dashboard is mostly flat. Then, at 03:14 local time on a Tuesday, p99 spikes from 800ms to 4.6 seconds for a single request, then settles back. The next night, it happens again, roughly the same shape, roughly the same hour. You file a ticket against the provider asking about the spike. The response is correct and unhelpful: their dashboard shows nothing anomalous on their side, no rate limits, no incidents, your token usage at the moment of the spike was unremarkable. The 4.6 seconds happened. The bill does not reflect it.

That gap — between a latency event a user clearly experiences and a bill that registers nothing — is the shape of the fine-tune cold start tax. It is not a bug in your code. It is not a regression on the provider's side. It is the seam where two billing models meet: the provider charges you for active inference time on the adapter, and the cost of loading the adapter into a serving slot is hidden inside the provider's infrastructure layer, where it shows up as your latency but their cost. If your traffic shape ever falls below the provider's keep-warm threshold, you pay for the round trip in p99 every time it climbs back.
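One way to test this hypothesis against your own request logs is to check whether the slow requests are the ones that arrive after a quiet period. The sketch below is illustrative only: the `Request` record, the `cold_start_candidates` helper, and both thresholds are assumptions made for this example. The provider's actual keep-warm threshold is not published here, so `idle_gap_s` is a guess you would tune.

```python
from typing import List, NamedTuple, Tuple


class Request(NamedTuple):
    """Hypothetical log record: arrival time and observed latency."""
    timestamp_s: float
    latency_ms: float


def cold_start_candidates(
    requests: List[Request],
    idle_gap_s: float = 300.0,   # assumed idle window; the real threshold is unknown
    slow_ms: float = 2000.0,     # assumed cutoff for "slow enough to investigate"
) -> List[Tuple[Request, float]]:
    """Return slow requests that followed an idle gap, with the gap in seconds."""
    ordered = sorted(requests, key=lambda r: r.timestamp_s)
    flagged = []
    for prev, cur in zip(ordered, ordered[1:]):
        gap = cur.timestamp_s - prev.timestamp_s
        if gap >= idle_gap_s and cur.latency_ms >= slow_ms:
            flagged.append((cur, gap))
    return flagged


# Made-up sample: steady traffic, a 10-minute lull, then one slow request.
sample_log = [
    Request(0.0, 790.0),
    Request(1.0, 810.0),
    Request(2.0, 805.0),
    Request(602.0, 4600.0),  # first request after the lull
    Request(603.0, 800.0),
]

candidates = cold_start_candidates(sample_log)
for req, gap in candidates:
    print(f"t={req.timestamp_s:.0f}s latency={req.latency_ms:.0f}ms after {gap:.0f}s idle")
```

If most of the flagged requests cluster at the same hour each night, that points to traffic shape rather than to a regression on either side.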

Your Provider's 99.9% SLA Is Measured at the Wrong Boundary for Your Agent

· 11 min read
Tian Pan
Software Engineer

A model provider publishes a 99.9% availability SLA. The procurement team frames it as "three nines, under nine hours of downtime per year, acceptable for a non-tier-zero workload." Six months later the agent feature ships and the on-call dashboard shows a user-perceived task-success rate around 98%: a number nobody wrote into a contract, nobody can find on the provider's status page, and nobody owns. The provider is meeting their SLA. The product is missing its SLO. Both are true at the same time, and the gap is not a bug; it is arithmetic.

The arithmetic is the part most teams skip. A provider's 99.9% is measured against a synchronous-request workload — one user, one prompt, one response, one billing event. An agent does not generate that workload. A single user-perceived task fans out into 8 to 20 inference calls, retries on transient errors, hedges on slow ones, and aggregates partial outputs. Each of those calls is an independent draw against the provider's failure distribution, and the task fails if any essential call fails. The boundary the SLA covers and the boundary the user feels are not the same boundary.
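The compounding can be worked through in a few lines. This is a minimal sketch, assuming each call succeeds independently with probability equal to the SLA figure and that every call is essential. It ignores retries and hedging, which would recover some failures, so read it as a rough bound and not as a model of any particular agent. The function name `task_success_rate` is made up for this example.

```python
def task_success_rate(per_call_availability: float, essential_calls: int) -> float:
    """Probability that every one of `essential_calls` independent calls succeeds."""
    return per_call_availability ** essential_calls


# Fan-out sizes from the text: a single request, then 8 and 20 calls per task.
for n in (1, 8, 20):
    rate = task_success_rate(0.999, n)
    print(f"{n:>2} calls -> {rate:.4f}")
# Prints:
#  1 calls -> 0.9990
#  8 calls -> 0.9920
# 20 calls -> 0.9802
```

At 20 essential calls per task, a provider that is meeting 99.9% per call yields roughly 98% task success, which matches the number on the on-call dashboard.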