The On-Device LLM Problem Nobody Talks About: Model Update Propagation
Most engineers who build on-device LLM features spend their time solving the problems that are easy to see: quantization, latency, memory limits. The model fits on the phone, inference is fast enough, and the demo looks great. Then they ship to millions of devices and discover a harder problem that nobody warned them about: you now have millions of independent compute nodes running different versions of your AI model, and you have no reliable way to know which one any given user is running.
Cloud inference is boring in the best way. You update the model, redeploy the server, and within minutes the entire user base is running the new version. On-device inference breaks this assumption entirely. A user who last opened your app three months ago is still running the model that was current then — and there's no clean way to force an update, no server-side rollback, and no simple way to detect the mismatch without adding instrumentation you probably didn't build from the start.
This version fragmentation is the central operational challenge of on-device AI, and it has consequences that reach far beyond a slow rollout. It creates silent capability drift, complicates incident response, and turns your "AI feature" into a heterogeneous fleet of independently-behaving systems that you're responsible for but can't directly control.
What Makes On-Device Updates Fundamentally Different
A mobile app update and a model update look similar on paper: a new binary goes out, devices download it eventually, and users end up on the new version whenever they next update. In practice, the dynamics are completely different.
App updates are binary. Either the user has the new app or the old one. The behavioral difference between versions is visible in UI changes and release notes. Users and support teams can reason about it.
Model updates are continuous and invisible. The "behavior" of an LLM is not a set of discrete features — it's a probability distribution over outputs. Two model versions can produce superficially similar responses to most inputs while diverging dramatically on edge cases, adversarial prompts, or specific domains where the newer model was fine-tuned. Users experiencing this drift have no way to diagnose it. Your support team has no way to correlate it. Your monitoring dashboards show average quality metrics, which may not surface the degraded cohort at all.
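To make that monitoring gap concrete, here is a minimal sketch (in Kotlin, with made-up version strings and quality scores) of why a fleet-wide average can look healthy while a stale-version cohort is quietly degraded, and how slicing the same samples by model version surfaces it:

```kotlin
// Hypothetical quality-telemetry record; names are illustrative, not from any SDK.
data class QualitySample(val modelVersion: String, val score: Double)

fun main() {
    val samples = listOf(
        QualitySample("1.4.0", 0.91), QualitySample("1.4.0", 0.89),
        QualitySample("1.4.0", 0.90), QualitySample("1.4.0", 0.92),
        QualitySample("1.1.0", 0.61), QualitySample("1.1.0", 0.58),
    )

    // The fleet-wide average looks acceptable...
    val overall = samples.map { it.score }.average()
    println("overall quality: " + "%.2f".format(overall)) // ~0.80

    // ...but slicing by model version exposes the lagging cohort.
    samples.groupBy { it.modelVersion }
        .mapValues { (_, cohort) -> cohort.map { it.score }.average() }
        .forEach { (version, avg) -> println("$version -> " + "%.2f".format(avg)) }
}
```

None of this is possible unless the model version is recorded with every quality signal in the first place, which is exactly the instrumentation most teams discover they are missing.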
The update adoption curve makes this worse. For a typical mobile app, 50% of users update within a week of a release; 80% within a month. For a model embedded in an app, adoption follows the app's update curve, not a model-specific curve. If your app releases monthly and the model updates weekly, most users are always running a model that is at least a few weeks behind. For users who opt out of automatic updates — a meaningful percentage on iOS and Android — the gap can stretch to months or years.
Meta's ExecuTorch deployment across Instagram, WhatsApp, and Facebook illustrates the scale: billions of users, devices ranging from last year's flagship to six-year-old mid-range phones, each potentially running a different quantized variant. The operational surface is not a model version — it's a model-version × hardware-capability × OS-version matrix that grows combinatorially with every release.
The Artifact Ensemble Problem
Version fragmentation is worse than it initially appears because "the model" is not a single artifact. In production, a working on-device AI feature is a combination of:
- The base model weights (quantized to 4-bit, 2-bit, or a hardware-specific format)
- Any fine-tuned LoRA adapters layered on top
- The system prompt or prompt template
- The tokenizer and its vocabulary
- If applicable, a retrieval index or vector store
- The inference runtime itself (llama.cpp, MLC-LLM, ExecuTorch, MLX)
When any one of these components is updated and the others are not, the behavior of the combined system can change in ways that are hard to predict and harder to debug. A base model update that ships without a corresponding prompt template update may respond unpredictably to a prompt that was tuned for the previous model. A LoRA adapter trained against model version N applied to model version N+1 may silently produce degraded outputs — not crashes, just wrong answers.
Cloud inference sidesteps this because the entire ensemble is co-deployed atomically: update the container, update everything together. On-device, each component may be delivered through a different channel on a different schedule. The app store delivers the runtime and the base model. A dynamic asset delivery system might push updated adapters separately. The system prompt might be fetched from a remote config service. Coherence between these components is not guaranteed — it has to be engineered deliberately.
The practical recommendation: version the entire artifact ensemble as a single unit. Assign a single semantic version to the tuple of (model checkpoint, adapter weights, prompt template, tokenizer, runtime version). Validate compatibility as a bundle before shipping any component. Store the full version tuple in telemetry so that when a quality regression surfaces, you can identify which ensemble is responsible.
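As one way to realize that recommendation, here is a minimal Kotlin sketch of an ensemble manifest. The field names, the known-good-bundle set, and the helper functions are illustrative assumptions rather than any particular framework's API:

```kotlin
// A minimal sketch of ensemble versioning, assuming a simple manifest shipped
// alongside the model assets. Field and function names are illustrative.
data class ModelEnsemble(
    val bundleVersion: String,      // single semantic version for the whole tuple
    val baseCheckpoint: String,     // e.g. "llm-3b-q4-2025-06"
    val adapterVersion: String?,    // LoRA adapter, null if none
    val promptTemplateHash: String, // hash of the prompt template actually on device
    val tokenizerVersion: String,
    val runtimeVersion: String,     // e.g. an ExecuTorch or llama.cpp build identifier
)

// Validate that independently-delivered components still form a bundle that was
// tested together. `knownGoodBundles` would be produced by release CI.
fun isCoherent(installed: ModelEnsemble, knownGoodBundles: Set<ModelEnsemble>): Boolean =
    installed in knownGoodBundles

// Attach the full tuple to every telemetry event so a regression can be traced
// to a specific ensemble rather than to "the model".
fun telemetryTags(e: ModelEnsemble): Map<String, String> = mapOf(
    "bundle" to e.bundleVersion,
    "checkpoint" to e.baseCheckpoint,
    "adapter" to (e.adapterVersion ?: "none"),
    "prompt" to e.promptTemplateHash,
    "tokenizer" to e.tokenizerVersion,
    "runtime" to e.runtimeVersion,
)
```

The design choice that matters here is that coherence is checked against bundles validated together in CI, not inferred from the versions of individual components.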
Capability Negotiation and Graceful Degradation
The model update propagation problem has two failure modes that require different responses.
The first is capability drift under a single version: users on the same version produce inconsistent outputs because of hardware differences (thermal throttling, memory pressure, NPU availability). An iPhone 17 Pro on a fresh boot, in a cool environment, with 8GB of free RAM produces different outputs than the same model on the same phone after 20 minutes of heavy use, with thermal throttling reducing sustained throughput by 40%.
The second is version mismatch across the population: users on different versions produce outputs that reflect different model capabilities, with no user-visible signal that this is happening.
Both problems are addressed, imperfectly, by a pattern borrowed from network protocol design: capability negotiation. Before the model runs, the application queries what the device can actually do — not what it theoretically supports, but what it can currently sustain given available memory, thermals, and installed model version. The application then routes the request to match. If the device supports the latest 7B model, use it. If it can only reliably run the 1B variant, use that. If neither can run, fall back to cloud inference.
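A sketch of what that negotiation might look like in Kotlin. The probes, thresholds, and variant names are assumptions for illustration; on Android the real signals would come from APIs such as ActivityManager.MemoryInfo and PowerManager.getCurrentThermalStatus():

```kotlin
// Hypothetical capability negotiation: decide per-request what the device can
// sustain right now, not what it theoretically supports.
enum class Route { FULL_7B, LITE_1B, CLOUD_FALLBACK }

data class DeviceSnapshot(
    val freeMemoryMb: Long,
    val thermallyThrottled: Boolean,
    val installedVariants: Set<String>, // e.g. setOf("7b-q4", "1b-q4")
)

fun negotiateRoute(d: DeviceSnapshot): Route = when {
    // Thresholds are illustrative; tune them against real device measurements.
    "7b-q4" in d.installedVariants && d.freeMemoryMb >= 6_000 && !d.thermallyThrottled ->
        Route.FULL_7B
    "1b-q4" in d.installedVariants && d.freeMemoryMb >= 1_500 ->
        Route.LITE_1B
    else ->
        Route.CLOUD_FALLBACK
}

fun main() {
    val hotDevice = DeviceSnapshot(
        freeMemoryMb = 7_200,
        thermallyThrottled = true,
        installedVariants = setOf("7b-q4", "1b-q4"),
    )
    println(negotiateRoute(hotDevice)) // LITE_1B: enough memory, but throttled
}
```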
