The Fine-Tune Cold Start Your Provider Bills as Idle Time
Your fine-tuned variant serves a few hundred requests per minute on a steady weekday, and the p99 latency dashboard is mostly flat. Then, at 03:14 local time on a Tuesday, a single request takes 4.6 seconds against a p99 that normally sits near 800 ms, and the chart shows a one-point spike before settling back. The next night, it happens again, roughly the same shape, roughly the same hour. You file a ticket against the provider asking about the spike. The response is correct and unhelpful: their dashboard shows nothing anomalous on their side, no rate limits, no incidents, and your token usage at the moment of the spike was unremarkable. The 4.6 seconds happened. The bill does not reflect it.
That gap — between a latency event a user clearly experiences and a bill that registers nothing — is the shape of the fine-tune cold start tax. It is not a bug in your code. It is not a regression on the provider's side. It is the seam where two billing models meet: the provider charges you for active inference time on the adapter, and the cost of loading the adapter into a serving slot is hidden inside the provider's infrastructure layer, where it shows up as your latency but their cost. If your traffic shape ever falls below the provider's keep-warm threshold, you pay for the round trip in p99 every time it climbs back.
The trap is that hosted fine-tunes feel like base models because the API surface is identical. You change a model identifier from provider/base-large-v3 to acct_123/ft-large-v3-customerA, and every other line of code stays the same. The base model's serving fleet is warm because every customer hits it; your fine-tuned adapter's serving slot is warm only as long as your traffic keeps it warm. Below a certain request rate, the provider scales the adapter to zero, and the next request — yours, the one a user is waiting on — pays the reload tax.
The deployment model that looks like serverless and isn't priced like it
There are two patterns most managed-fine-tune offerings settle into, and they have very different cold-start behavior.
The first is the shared multi-tenant fleet with hot-swappable adapters. A single base model lives on a GPU, and your LoRA adapter is one of dozens or hundreds the server can page in and out. When your request arrives and your adapter is not currently resident, the server loads it from host memory (fast, sub-second) or from object storage (slow, several seconds). This is the path Cloudflare Workers AI and several hyperscaler "bring your own LoRA" offerings use, and it is the cheapest from the provider's perspective because the base weights are amortized across tenants. Cold-start latency on this path is typically dominated by adapter weight transfer. Recent research papers report 23–55% reductions in that latency from techniques like proactive prefetching and KV-cache reuse, but the floor is still measured in hundreds of milliseconds, not the tens of milliseconds of a steady-state request.
The second is the dedicated provisioned-throughput slot. Bedrock fine-tuned models, OpenAI's reserved-capacity SKUs, and the equivalent enterprise offerings on Azure and Vertex AI fall here. Once you fine-tune a model in this lane, you cannot serve it through the on-demand pool at all: you must buy a model unit by the hour, starting around $7 per hour on the cheapest tier and climbing past $20 per hour on larger models. The latency floor is excellent because the slot is yours and it is always warm. The bill arrives whether the slot serves one request or one million. A single model unit at $20/hour, run continuously for a 30-day month, is $14,400 (20 × 24 × 30). That figure is what it costs to eliminate the cold start, and it is owed regardless of how much inference the slot actually serves.
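The arithmetic behind that monthly figure is simple enough to keep as a helper, since the hourly rate is the only input that changes between tiers. This is a minimal sketch, assuming a 30-day month and the approximate hourly rates quoted above; the function name is illustrative.

```python
HOURS_PER_DAY = 24


def monthly_slot_cost(hourly_rate_usd: float, days: int = 30) -> float:
    """Cost of one model unit kept running continuously for `days` days."""
    return hourly_rate_usd * HOURS_PER_DAY * days


if __name__ == "__main__":
    # Approximate hourly rates taken from the text, not from a price list.
    for rate in (7, 20):
        print(f"${rate}/hour -> ${monthly_slot_cost(rate):,.0f} per month")
```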
A team building a product that doesn't yet justify $14k/month in dedicated capacity ends up in the first lane by default. And the first lane has a cold start the team didn't budget for.
Why the bill doesn't show what the user feels
The accounting choice is reasonable from the provider's side and treacherous from yours. When the provider's scheduler decides to evict your adapter, that decision is invisible to you. When the next request triggers a reload, the wall-clock latency the user sees includes the load time, but the provider's meter only counts the inference compute once the adapter is resident. From the meter's point of view, you bought one request worth of inference. From the user's point of view, they waited four seconds.
This creates a specific shape of dashboard mismatch. Your application's request-duration histogram shows a long tail with sporadic seconds-long outliers. Your provider's cost dashboard shows the requests the outliers correspond to as unremarkable — a normal number of tokens at a normal price. The two views never reconcile, and the team running the application starts attributing the spikes to network jitter, scheduler hiccups, the user's connection, anything but the deployment model. It can take an embarrassingly long time to notice that the outliers concentrate at the end of every traffic trough, because the dashboards aren't built to ask that question.
Three things together make this hard to see:
- The reload tax hits the first request after a quiet period, not all requests in the period. By the time you're investigating, traffic looks normal again.
- Provider documentation almost never names the scale-to-zero threshold explicitly. The eviction policy is part of the platform's cost optimization, and they are not incentivized to publish it.
- The spike looks identical to other tail events (long generation, retry on a backend hiccup, a transient model error). Without a deployment-aware lens, it is one outlier among many.
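One way to ask the question the dashboards don't is to look at how long the deployment sat idle before each slow request. The sketch below is one possible approach, assuming you can export (timestamp, latency) pairs from your own request logs; the function name and the outlier threshold are illustrative, not taken from any provider.

```python
from typing import List, Tuple


def gaps_before_requests(
    records: List[Tuple[float, float]], outlier_ms: float
) -> Tuple[List[float], List[float]]:
    """Split idle gaps by whether the request that ended the gap was slow.

    `records` holds (timestamp_seconds, latency_ms) pairs. The first request
    has no preceding gap and is skipped.
    """
    ordered = sorted(records)
    before_outliers: List[float] = []
    before_normal: List[float] = []
    for (prev_ts, _), (ts, latency_ms) in zip(ordered, ordered[1:]):
        gap = ts - prev_ts
        if latency_ms > outlier_ms:
            before_outliers.append(gap)
        else:
            before_normal.append(gap)
    return before_outliers, before_normal
```

If the gaps that precede slow requests are consistently much longer than the gaps that precede normal ones, the outliers are tracking idle time rather than random jitter.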
Treat cold start as a deployment-level SLI, not a tail event
The discipline that fixes this is borrowed from how mature serverless platforms handle Lambda cold starts: measure cold-start latency as a distinct signal, separate from steady-state latency, and track it against deployment-level events rather than request-level events.
In practice, this means three instrumentation moves.
Tag every request with whether the serving slot was warm. If your provider exposes a header or response field that indicates a cold serve, propagate it through your logging and split your latency histograms by it. If they don't, infer it: a request whose end-to-end latency is more than twice the steady-state p99, while its token count is normal, is a cold-start candidate. The signal will be noisy at the edges, but the concentration of those events at traffic-trough boundaries will be unmistakable once you look.
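The tagging rule can be sketched in a few lines. This is an illustration under assumptions: the `RequestRecord` fields, the `provider_cold_flag` name, and the token ceiling are all hypothetical, and the factor of 2 is the heuristic described above rather than a value any provider documents.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RequestRecord:
    latency_ms: float
    output_tokens: int
    # Hypothetical field: set only if the provider reports cold serves.
    provider_cold_flag: Optional[bool] = None


def is_cold_start_candidate(
    rec: RequestRecord,
    steady_p99_ms: float,
    normal_max_tokens: int,
    factor: float = 2.0,
) -> bool:
    """Prefer the provider's own signal; otherwise fall back to the heuristic."""
    if rec.provider_cold_flag is not None:
        return rec.provider_cold_flag
    slow = rec.latency_ms > factor * steady_p99_ms
    normal_size = rec.output_tokens <= normal_max_tokens
    # A slow request with a normal token count is unlikely to be a long generation.
    return slow and normal_size
```

Writing the result into your request logs as a boolean lets you split every latency histogram by it later.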
Track cold-start frequency as a rate, not a percentile. "What fraction of requests pay the cold-start tax?" is a more useful question than "what is the p99 of cold requests?" The former tells you whether the deployment model is fit for your traffic shape; the latter tells you how bad the tax is once it lands. Both matter, but most dashboards default to the second and never ask the first.
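The two questions map to two different computations. The sketch below assumes each request already carries a cold-start flag; it uses a simple nearest-rank percentile, which is one of several common definitions and may differ slightly from what your metrics backend reports.

```python
import math
from typing import Iterable, Sequence


def cold_start_rate(cold_flags: Iterable[bool]) -> float:
    """Fraction of requests that paid the cold-start tax (0.0 if no requests)."""
    flags = list(cold_flags)
    return sum(flags) / len(flags) if flags else 0.0


def nearest_rank_percentile(values: Sequence[float], q: float) -> float:
    """Nearest-rank percentile for 0 < q <= 100."""
    if not values:
        raise ValueError("no values")
    if not 0 < q <= 100:
        raise ValueError("q must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(q / 100 * len(ordered))
    return ordered[rank - 1]
```

The rate is computed over all requests, while the percentile is computed only over the latencies of requests flagged cold.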
Watch the calendar, not just the clock. Cold starts cluster around the quietest hours of your traffic (late nights, weekends, holidays) and around time-zone boundaries that the provider's UTC-defaulted dashboard hides. A team whose largest customers are in one time zone but whose traffic dashboard is in another will systematically misread when the tax is being paid. Roll cold-start frequency up by customer time zone if you serve a B2B product, not by the provider's clock.
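A rollup by customer-local hour might look like the following. It is a sketch, assuming cold-start events are logged with timezone-aware UTC timestamps and that you maintain a customer-to-time-zone mapping yourself; the customer name is made up, and a fixed UTC offset stands in for a real zone (in production, `zoneinfo.ZoneInfo` from the standard library would handle daylight saving).

```python
from collections import Counter
from datetime import datetime, timedelta, timezone, tzinfo
from typing import Dict, Iterable, Tuple


def cold_starts_by_local_hour(
    events: Iterable[Tuple[datetime, str]],
    customer_tz: Dict[str, tzinfo],
) -> Counter:
    """Count cold starts per (customer_id, local hour of day).

    `events` holds (timezone-aware UTC timestamp, customer_id) pairs.
    """
    counts: Counter = Counter()
    for ts_utc, customer_id in events:
        local = ts_utc.astimezone(customer_tz[customer_id])
        counts[(customer_id, local.hour)] += 1
    return counts


if __name__ == "__main__":
    tz_map = {"acme": timezone(timedelta(hours=-5))}  # hypothetical customer
    spike = datetime(2024, 3, 5, 8, 14, tzinfo=timezone.utc)
    print(cold_starts_by_local_hour([(spike, "acme")], tz_map))
```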
The patterns that contain the tax
Once cold start is a metric you watch, the choices to contain it become tractable.
Pinned-warm capacity, sized below the dedicated tier. If your provider offers a tier between shared-multi-tenant and full-provisioned — and increasingly they do, branded variously as "always warm," "reserved adapter," or "low-priority dedicated" — buying just enough of it to keep one warm slot during your traffic troughs is dramatically cheaper than full provisioned throughput. The unit economics are: you pay the keep-warm price for the quiet hours and let the shared pool absorb the busy hours. Many teams that move from shared to fully dedicated do so as a panic response to an incident and discover months later that a partial reservation would have absorbed 95% of the cold starts at a third of the cost.
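The unit economics can be put into numbers with a small calculation. This sketch rests on two loud assumptions that are not from any price list: that the keep-warm tier is billed by the hour at the same $20 rate used earlier, and that the quiet period is 8 hours a day. Real intermediate tiers may be priced quite differently.

```python
def reservation_monthly_cost(
    hourly_rate_usd: float, reserved_hours_per_day: float, days: int = 30
) -> float:
    """Monthly cost of reserving a warm slot for part of each day."""
    if not 0 <= reserved_hours_per_day <= 24:
        raise ValueError("reserved_hours_per_day must be between 0 and 24")
    return hourly_rate_usd * reserved_hours_per_day * days


if __name__ == "__main__":
    full = reservation_monthly_cost(20, 24)     # always-on dedicated slot
    partial = reservation_monthly_cost(20, 8)   # assumed 8 quiet hours per day
    print(f"full: ${full:,.0f}, partial: ${partial:,.0f}, ratio: {partial / full:.2f}")
```

Under those assumptions the partial reservation costs a third of the full one, simply because it covers a third of the day.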
Synthetic traffic to suppress eviction on critical paths. A heartbeat request every 30 to 60 seconds against the fine-tuned variant keeps the adapter resident under most providers' eviction policies. The cost is real but small: a tiny request, billed at on-demand rates, runs roughly $1–$3 per day depending on the model. The trick is to send the heartbeat with a payload that is cheap to process — a short prompt with max_tokens=1 is usually enough — and to scope it to the variants that matter (your production-customer-facing adapter, not every internal experiment). This is the same pattern as a Lambda warmer, with the same caveat: it is a workaround, not a contract, and the provider can change the eviction policy out from under you.
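A heartbeat warmer can be sketched without committing to any particular provider SDK. In the example below, `send` is a callable you supply that performs the actual API call; the payload field names and the per-request cost are hypothetical, and only `max_tokens=1` and the 30 to 60 second interval come from the description above.

```python
import time
from typing import Callable, Dict, Optional

SECONDS_PER_DAY = 86_400


def heartbeat_payload(model_id: str) -> Dict[str, object]:
    # Hypothetical request shape; adapt the field names to your provider's API.
    return {"model": model_id, "prompt": "ping", "max_tokens": 1}


def heartbeats_per_day(interval_s: float) -> float:
    return SECONDS_PER_DAY / interval_s


def daily_heartbeat_cost(interval_s: float, cost_per_request_usd: float) -> float:
    return heartbeats_per_day(interval_s) * cost_per_request_usd


def run_heartbeat(
    send: Callable[[Dict[str, object]], object],
    model_id: str,
    interval_s: float,
    max_beats: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Send a tiny request every `interval_s` seconds; return attempts made."""
    attempts = 0
    while max_beats is None or attempts < max_beats:
        try:
            send(heartbeat_payload(model_id))
        except Exception:
            # A failed heartbeat should not stop the warmer; try again next tick.
            pass
        attempts += 1
        sleep(interval_s)
    return attempts
```

Passing `sleep` and `max_beats` as parameters keeps the loop testable. In production you would leave both at their defaults and run the function in its own thread or process.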
Multi-region adapter prewarming. If you serve users in more than one region, the adapter has to be warm in each region the user might hit. Most managed fine-tune offerings deploy the adapter to one region by default, and the cross-region cold start is even more expensive than the same-region one. The fix is to explicitly pre-deploy the adapter to every region in your routing footprint, and to extend the synthetic-traffic heartbeat to each region. This is bookkeeping work, not engineering work, but the team that skipped it is paying the inter-region reload tax on every regional failover.
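The bookkeeping reduces to a set difference between where you route and where the adapter is deployed. A minimal sketch, assuming you can list both sets yourself; the region labels are placeholders, not any provider's real region names.

```python
from typing import Iterable, List


def regions_missing_adapter(
    routing_regions: Iterable[str], deployed_regions: Iterable[str]
) -> List[str]:
    """Regions that can receive traffic but have no deployed adapter."""
    return sorted(set(routing_regions) - set(deployed_regions))


if __name__ == "__main__":
    # Placeholder region labels for illustration only.
    print(regions_missing_adapter(["region-a", "region-b", "region-c"], ["region-a"]))
```

Running a check like this in CI or on a schedule catches the case where a new region is added to routing before the adapter is deployed there.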
Routing fallback to the base model. The strongest pattern, and the most architecturally honest, is to treat the fine-tuned variant as the preferred path and the base model as the available path. If a request arrives and the fine-tuned adapter is cold, route it to the base model with a system-prompt or few-shot fallback that approximates the fine-tune's behavior well enough to serve the request, and emit a metric so you know how often this happened. The fine-tuned variant is still your steady-state choice — the few-shot prompt is more expensive in tokens and slightly worse in quality — but no user waits four seconds for a slot to warm up. This is the closest analog to the "circuit breaker" pattern from classical service architecture, and it works for the same reason: a degraded path is almost always better than a slow path on the critical request flow.
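The fallback can be sketched as a small router around a warmth estimate. Everything here is an assumption for illustration: the provider does not tell you whether the adapter is resident, so `WarmTracker` guesses from the time since the last successful fine-tuned call, and `keep_warm_s` is a value you would have to measure. The callables stand in for your own client code.

```python
import time
from typing import Callable, Optional


class WarmTracker:
    """Guess adapter warmth from the time since the last fine-tuned call."""

    def __init__(self, keep_warm_s: float, clock: Callable[[], float] = time.monotonic):
        self.keep_warm_s = keep_warm_s  # assumed eviction window, measured by you
        self._clock = clock
        self._last_served: Optional[float] = None

    def probably_warm(self) -> bool:
        if self._last_served is None:
            return False
        return self._clock() - self._last_served < self.keep_warm_s

    def mark_served(self) -> None:
        self._last_served = self._clock()


def route(
    prompt: str,
    tracker: WarmTracker,
    call_finetuned: Callable[[str], str],
    call_base_fallback: Callable[[str], str],
    emit_fallback_metric: Callable[[], None],
    schedule_warmup: Callable[[], None],
) -> str:
    """Use the fine-tune when it is likely warm; otherwise degrade to the base model."""
    if tracker.probably_warm():
        result = call_finetuned(prompt)
        tracker.mark_served()
        return result
    emit_fallback_metric()  # so you know how often the degraded path is used
    schedule_warmup()       # e.g. fire a background request at the fine-tune
    return call_base_fallback(prompt)
```

In a real deployment `schedule_warmup` would send a background request to the fine-tuned variant and call `tracker.mark_served()` once it completes, so the next user request takes the preferred path.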
Serverless and hosted fine-tune do not share a cost curve
The deeper lesson is about the mental model the team brought to the deployment decision. "Serverless" in classical web infrastructure means: usage-based pricing, automatic scaling, no idle cost. It is a phenomenally good fit for spiky, low-utilization workloads where the alternative is paying for an always-on server.
Hosted fine-tunes look serverless because the on-demand-style API is the same. They are not priced like it. The provider's cost model for hosting your adapter is closer to a dedicated server than to a stateless function — there is a warm slot somewhere, and either you pay for it directly (provisioned throughput) or the provider pays for it and amortizes it across tenants (shared pool, with eviction). When your traffic falls into the latter regime, the provider's amortization decisions become your latency budget, and the cost-of-latency curve you assumed — flat, predictable, on-demand — turns out to be quite jagged.
The team that priced this correctly built two things into the deployment plan from the start. They knew their requests-per-minute floor by customer segment and time of day, and they knew which segments could tolerate a one-time multi-second latency on a quiet hour and which could not. They picked the deployment lane per segment, not per model. The team that priced this incorrectly assumed that "fine-tuned variant" and "base model" had the same operational characteristics because they had the same API surface, and they discovered the cost-of-latency curve in an incident where a customer's CEO saw the spike in a live demo.
The fix is not exotic. It is to treat fine-tune deployment as a capacity decision, not an API decision, and to put a number on the cold-start tax before the customer does.
- https://aws.amazon.com/blogs/machine-learning/efficient-and-cost-effective-multi-tenant-lora-serving-with-amazon-sagemaker/
- https://blog.cloudflare.com/fine-tuned-inference-with-loras/
- https://www.smiansh.com/blogs/aws-bedrock-on-demand-vs-provisioned-throughput/
- https://caylent.com/blog/amazon-bedrock-pricing-explained
- https://aws.amazon.com/bedrock/pricing/
- https://regolo.ai/scale-to-zero-cold-start-latency-why-serverless-gpu-breaks-real-time-ai-and-how-to-fix-it/
- https://acecloud.ai/blog/cold-start-latency-llm-inference/
- https://www.runpod.io/blog/llm-inference-optimization-techniques-reduce-latency-cost
- https://arxiv.org/abs/2502.15524
- https://arxiv.org/abs/2511.22880
- https://arxiv.org/abs/2512.20210
- https://www.inferless.com/learn/how-to-serve-multi-lora-adapters
