The Fine-Tune Cold Start Your Provider Bills as Idle Time
Your fine-tuned variant serves a few hundred requests per minute on a steady weekday, and the p99 latency dashboard is mostly flat. Then, at 03:14 local time on a Tuesday, p99 spikes from 800ms to 4.6 seconds for a single request, then settles back. The next night, it happens again, roughly the same shape, roughly the same hour. You file a ticket against the provider asking about the spike. The response is correct and unhelpful: their dashboard shows nothing anomalous on their side, no rate limits, no incidents, your token usage at the moment of the spike was unremarkable. The 4.6 seconds happened. The bill does not reflect it.
That gap — between a latency event a user clearly experiences and a bill that registers nothing — is the shape of the fine-tune cold start tax. It is not a bug in your code. It is not a regression on the provider's side. It is the seam where two billing models meet: the provider charges you for active inference time on the adapter, and the cost of loading the adapter into a serving slot is hidden inside the provider's infrastructure layer, where it shows up as your latency but their cost. If your traffic shape ever falls below the provider's keep-warm threshold, you pay for the round trip in p99 every time it climbs back.
