The On-Device LLM Problem Nobody Talks About: Model Update Propagation
Most engineers who build on-device LLM features spend their time solving the problems that are easy to see: quantization, latency, memory limits. The model fits on the phone, inference is fast enough, and the demo looks great. Then they ship to millions of devices and discover a harder problem that nobody warned them about: you now have millions of independent compute nodes running different versions of your AI model, and you have no reliable way to know which one any given user is running.
Cloud inference is boring in the best way. You update the model, redeploy the server, and within minutes the entire user base is running the new version. On-device inference breaks this assumption entirely. A user who last opened your app three months ago is still running the model that was current then — and there's no clean way to force an update, no server-side rollback, and no simple way to detect the mismatch without adding instrumentation you probably didn't build from the start.
This version fragmentation is the central operational challenge of on-device AI, and it has consequences that reach far beyond a slow rollout. It creates silent capability drift, complicates incident response, and turns your "AI feature" into a heterogeneous fleet of independently-behaving systems that you're responsible for but can't directly control.
What Makes On-Device Updates Fundamentally Different
A mobile app update and a model update look similar on paper: a new binary goes out, devices download it eventually, users get the new version when they update. In practice, the dynamics are completely different.
App updates are binary. Either the user has the new app or the old one. The behavioral difference between versions is visible in UI changes and release notes. Users and support teams can reason about it.
Model updates are continuous and invisible. The "behavior" of an LLM is not a set of discrete features — it's a probability distribution over outputs. Two model versions can produce superficially similar responses to most inputs while diverging dramatically on edge cases, adversarial prompts, or specific domains where the newer model was fine-tuned. Users experiencing this drift have no way to diagnose it. Your support team has no way to correlate it. Your monitoring dashboards show average quality metrics, which may not surface the degraded cohort at all.
The update adoption curve makes this worse. For a typical mobile app, 50% of users update within a week of a release; 80% within a month. For a model embedded in an app, adoption follows the app's update curve, not a model-specific curve. If your app releases monthly and the model updates weekly, most users are always running a model that is at least a few weeks behind. For users who opt out of automatic updates — a meaningful percentage on iOS and Android — the gap can stretch to months or years.
Meta's ExecuTorch deployment across Instagram, WhatsApp, and Facebook illustrates the scale: billions of users, devices ranging from last year's flagship to six-year-old mid-range phones, each potentially running a different quantized variant. The operational surface is not a model version — it's a model-version × hardware-capability × OS-version matrix that grows combinatorially with every release.
The Artifact Ensemble Problem
Version fragmentation is worse than it initially appears because "the model" is not a single artifact. In production, a working on-device AI feature is a combination of:
- The base model weights (quantized to 4-bit, 2-bit, or a hardware-specific format)
- Any fine-tuned LoRA adapters layered on top
- The system prompt or prompt template
- The tokenizer and its vocabulary
- If applicable, a retrieval index or vector store
- The inference runtime itself (llama.cpp, MLC-LLM, ExecuTorch, MLX)
When any one of these components is updated and the others are not, the behavior of the combined system can change in ways that are hard to predict and harder to debug. A base model update that ships without a corresponding prompt template update may behave unpredictably against the old template. A LoRA adapter trained against model version N applied to model version N+1 may silently produce degraded outputs — not crashes, just wrong answers.
Cloud inference sidesteps this because the entire ensemble is co-deployed atomically: update the container, update everything together. On-device, each component may be delivered through a different channel on a different schedule. The app store delivers the runtime and the base model. A dynamic asset delivery system might push updated adapters separately. The system prompt might be fetched from a remote config service. Coherence between these components is not guaranteed — it has to be engineered deliberately.
The practical recommendation: version the entire artifact ensemble as a single unit. Assign a single semantic version to the tuple of (model checkpoint, adapter weights, prompt template, tokenizer, runtime version). Validate compatibility as a bundle before shipping any component. Store the full version tuple in telemetry so that when a quality regression surfaces, you can identify which ensemble is responsible.
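A minimal sketch of that version tuple in Python; every name here is illustrative, not taken from any particular framework:

```python
import hashlib
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class EnsembleVersion:
    """One semantic version for the whole artifact ensemble."""
    model_checkpoint: str   # e.g. "base-7b-q4@2025.06.1"
    adapter: str            # e.g. "support-lora@1.3.0"
    prompt_template: str    # e.g. "chat-v12"
    tokenizer: str          # e.g. "tok-v3"
    runtime: str            # e.g. "executorch-0.4.2"

    def bundle_id(self) -> str:
        """Deterministic hash over the full tuple, suitable for telemetry."""
        canonical = "|".join(
            f"{k}={v}" for k, v in sorted(asdict(self).items())
        )
        return hashlib.sha256(canonical.encode()).hexdigest()[:16]

bundle = EnsembleVersion(
    model_checkpoint="base-7b-q4@2025.06.1",
    adapter="support-lora@1.3.0",
    prompt_template="chat-v12",
    tokenizer="tok-v3",
    runtime="executorch-0.4.2",
)
# attach bundle.bundle_id() to every telemetry event
```

Attaching the single `bundle_id` to telemetry (rather than five separate version fields) makes the "which ensemble regressed?" query a simple group-by.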
Capability Negotiation and Graceful Degradation
The model update propagation problem has two failure modes that require different responses.
The first is capability drift under a single version: users on the same version produce inconsistent outputs because of hardware differences (thermal throttling, memory pressure, NPU availability). An iPhone 17 Pro on a fresh boot, in a cool environment, with 8GB of free RAM produces different outputs than the same model on the same phone after 20 minutes of heavy use, with thermal throttling reducing sustained throughput by 40%.
The second is version mismatch across the population: users on different versions produce outputs that reflect different model capabilities, with no user-visible signal that this is happening.
Both problems are addressed, imperfectly, by a pattern borrowed from network protocol design: capability negotiation. Before the model runs, the application queries what the device can actually do — not what it theoretically supports, but what it can currently sustain given available memory, thermals, and installed model version. The application then routes the request to match. If the device supports the latest 7B model, use it. If it can only reliably run the 1B variant, use that. If it can't sustain either, fall back to cloud inference.
Google's LiteRT-LM implements a practical version of this with its Engine/Session separation. A single foundation model engine serves multiple application features simultaneously, with session-level isolation. Context switching and copy-on-write KV-cache sharing allow features to share the loaded model without re-initializing it for each request. The engine exposes capability metadata that the application layer can query before dispatching.
The fallback to cloud is not a failure — it's the safety valve that makes on-device AI deployable at all. The architecture that works in practice: on-device handles routine, low-sensitivity, high-frequency requests; cloud handles complex reasoning, long-context operations, and requests that exceed local capability. The handoff should be transparent to users. The routing decision should be deterministic and testable so you can reason about which user cohort gets which backend.
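A deterministic, testable router along these lines might look like the following sketch; the `DeviceSnapshot` fields, model ids, and thresholds are assumptions, not any framework's real API:

```python
from dataclasses import dataclass
from enum import Enum

class Backend(Enum):
    LOCAL_7B = "local-7b"
    LOCAL_1B = "local-1b"
    CLOUD = "cloud"

@dataclass(frozen=True)
class DeviceSnapshot:
    free_ram_mb: int
    thermal_throttled: bool
    installed_models: frozenset  # model ids present and verified on disk

def route(snapshot: DeviceSnapshot) -> Backend:
    """Pure function of the snapshot: the same inputs always yield the
    same backend, so cohort behavior is testable in CI."""
    if (
        "7b-q4" in snapshot.installed_models
        and snapshot.free_ram_mb >= 5000
        and not snapshot.thermal_throttled
    ):
        return Backend.LOCAL_7B
    if "1b-q4" in snapshot.installed_models and snapshot.free_ram_mb >= 1500:
        return Backend.LOCAL_1B
    return Backend.CLOUD
```

Keeping `route` a pure function of an explicit snapshot is the design choice that makes the handoff testable: you can enumerate snapshots in unit tests and assert exactly which cohort lands on which backend.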
Delta Updates: Bandwidth and Coherence
If model update propagation is the problem, delta updates seem like the obvious solution. Instead of pushing a full multi-gigabyte model download, push only the changed parameters. DeltaLLM and similar approaches exploit the sparsity of inter-version weight deltas — the observation that across sequential model versions, most weights change little or not at all — to reduce update size by 60-80%.
The bandwidth math is compelling. A 4-bit quantized 7B model is roughly 4GB. Delta updates between fine-tune versions could be 400-800MB. For users on metered connections or with limited storage, this is the difference between a feature that works and one they disable.
But delta updates introduce a new coherence problem. The delta is valid only from a specific base version. If user A has version 1.0 and user B has version 1.2, the delta from 1.2 to 1.3 is valid for user B but not user A. You need either a delta from 1.0 to 1.3 (larger) or a full model download for the users on older bases. Managing this delta graph across a fragmented user population requires infrastructure that most teams don't build until they're already running on millions of devices.
The practical approach: use deltas for users on the previous one or two versions, full downloads for everyone else. Implement staged rollouts that prioritize the delta-eligible cohort first, validate quality metrics before widening to the full-download group. This limits the delta graph complexity to a manageable depth while capturing most of the bandwidth savings.
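That policy can be sketched as a small resolver; version numbers, artifact names, and `MAX_DELTA_DEPTH` are all illustrative:

```python
# Serve a chain of single-step deltas only if the user's base version is
# within MAX_DELTA_DEPTH of the target; otherwise fall back to a full
# download. available_deltas holds (from_version, to_version) pairs that
# were actually published.
MAX_DELTA_DEPTH = 2

def plan_update(user_version: int, target_version: int,
                available_deltas: set) -> list:
    """Return the list of artifact names the device should download."""
    gap = target_version - user_version
    if gap <= 0:
        return []  # already current
    if gap <= MAX_DELTA_DEPTH:
        steps = [(v, v + 1) for v in range(user_version, target_version)]
        if all(s in available_deltas for s in steps):
            return [f"delta-{a}-to-{b}" for a, b in steps]
    return [f"full-model-{target_version}"]

deltas = {(10, 11), (11, 12)}
print(plan_update(11, 12, deltas))  # ['delta-11-to-12']
print(plan_update(8, 12, deltas))   # ['full-model-12']
```

Capping the depth bounds the delta graph at publish time: each release needs only `MAX_DELTA_DEPTH` delta artifacts plus one full download, regardless of how fragmented the fleet is.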
Privacy: The Tradeoff You Agreed to Without Knowing
Every on-device deployment makes a privacy tradeoff, but not the one most teams think about.
The obvious tradeoff: user data never leaves the device, so it never appears in cloud logs, never gets retained for training, never gets exposed in a cloud breach. For healthcare, finance, and regulated industries, this is often the decision-making criterion. GDPR and HIPAA compliance becomes structurally simpler when the data doesn't move.
The less obvious tradeoff: privacy guarantees are only as strong as the update mechanism. If you push model updates through the same channel as app updates, users can inspect what changed and audit behavior differences. If you push model updates through a dynamic asset delivery system — which most teams do, because it decouples model iteration from app store review cycles — users have limited visibility into what's running on their device. The model update mechanism becomes a trusted but opaque channel that the user has implicitly granted execution privileges to.
This is a design choice that deserves explicit architectural review. The teams that handle it well document exactly what the dynamic asset delivery system can and cannot update, implement model integrity verification (hash the downloaded weights, verify before loading), and give users an opt-out path that degrades gracefully to cloud inference or disables the feature entirely.
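The integrity-verification step is the simplest to show concretely. A minimal sketch, assuming the expected digest is pinned in a (separately signed) release manifest:

```python
import hashlib
from pathlib import Path

def verify_weights(path: Path, expected_sha256: str,
                   chunk_size: int = 1 << 20) -> bool:
    """Stream-hash the downloaded weights file and compare against the
    digest from the release manifest. Refuse to load on mismatch."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest() == expected_sha256
```

Hashing in chunks keeps peak memory flat even for multi-gigabyte weight files, which matters on exactly the constrained devices this check protects.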
What Actually Fails in Production
The failure modes in on-device LLM deployments that don't appear in benchmarks:
OOM crashes during update: When a model update is downloaded and verified, both the old and new model may temporarily coexist in storage. During the swap, memory fragmentation can cause crashes on devices with limited RAM. The solution is aggressive pre-update cache clearing and staged atomic swaps rather than in-place overwrites.
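A staged atomic swap can lean on `os.replace`, which is atomic within a single filesystem; a minimal sketch with illustrative file names:

```python
import os
from pathlib import Path

def install_model(staged: Path, live: Path) -> None:
    """Promote a fully downloaded, already-verified model file into place.
    os.replace is atomic on the same filesystem, so a concurrent reader
    sees either the old file or the new one, never a torn write."""
    backup = live.parent / (live.name + ".prev")  # kept for rollback
    if live.exists():
        os.replace(live, backup)
    os.replace(staged, live)
```

The `.prev` copy is the local rollback path; deleting it only after the new version passes a first-load smoke test is the cautious ordering.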
Interrupted downloads leave corrupt state: A model download interrupted halfway through leaves an unusable partial file. Without integrity verification and resume-capable downloads, users end up in a broken state where the app reports the feature as available but inference fails at runtime. The fix is chunk-level verification with resumable downloads and a clear rollback to the previous version.
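Chunk-level verification with resume might be sketched like this; `fetch_chunk` stands in for a hypothetical transport callback (e.g. an HTTP Range request):

```python
import hashlib
from pathlib import Path

def resume_download(part: Path, chunk_hashes: list,
                    chunk_size: int, fetch_chunk) -> None:
    """Keep the verified prefix already on disk, then refetch everything
    from the first bad or missing chunk onward. chunk_hashes holds the
    expected SHA-256 hex digest of each chunk, from the release manifest."""
    data = part.read_bytes() if part.exists() else b""
    good = 0
    for i, expected in enumerate(chunk_hashes):
        chunk = data[i * chunk_size:(i + 1) * chunk_size]
        if hashlib.sha256(chunk).hexdigest() != expected:
            break  # torn or missing chunk: refetch from here
        good += 1
    with part.open("wb") as f:
        f.write(data[:good * chunk_size])
        for i in range(good, len(chunk_hashes)):
            f.write(fetch_chunk(i))
```

A torn chunk fails its hash check just like a missing one, so a half-written file interrupted mid-chunk resumes cleanly instead of accumulating corruption.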
Adapter/base model version mismatch: A LoRA adapter trained against base model version N silently produces degraded output when applied to version N+1 if weight dimensions or token embeddings have changed. This doesn't error — it runs to completion and produces plausible-looking but wrong output. The fix is versioned adapter packages that declare their base model dependency, with runtime checks before loading.
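A runtime compatibility gate can be as simple as the following sketch; the manifest fields are assumptions about what a versioned adapter package would declare:

```python
class AdapterMismatch(RuntimeError):
    """Raised instead of silently loading an incompatible adapter."""

def check_adapter(adapter_manifest: dict, base_manifest: dict) -> None:
    """Refuse to load a LoRA adapter whose declared base model does not
    match the installed checkpoint."""
    declared = adapter_manifest["base_model"]
    installed = base_manifest["model_id"]
    if declared != installed:
        raise AdapterMismatch(
            f"adapter built for {declared!r}, device has {installed!r}"
        )
    # dimension check catches wrong pairings even when ids were mislabeled
    if adapter_manifest["hidden_size"] != base_manifest["hidden_size"]:
        raise AdapterMismatch("hidden_size mismatch between adapter and base")
```

Failing loudly here converts a silent-wrong-answers bug into a crash-free, observable fallback decision (skip the adapter, or route to cloud).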
Thermal throttling cascades: On mobile devices, sustained inference generates heat. After 3-5 minutes of continuous inference on flagship iPhones, thermal throttling cuts throughput by 40% or more. Applications that don't account for this produce inconsistent latency and quality under load, with no clear signal to the user about why responses are getting slower. Monitoring per-request latency (not just average latency) surfaces this; the fix is inference session budgeting and explicit cool-down periods.
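Session budgeting plus tail-latency tracking can be sketched as below; the budget value and class name are illustrative, not from any SDK:

```python
class InferenceSessionBudget:
    """Cap sustained on-device inference time per session and track
    per-request latency, so throttling shows up in the tail (p95),
    not just the average."""
    def __init__(self, max_active_s: float = 180.0):
        self.max_active_s = max_active_s
        self.active_s = 0.0
        self.latencies = []

    def allow(self) -> bool:
        # once this returns False, route requests to cloud (or queue them)
        # until the device has had a cool-down period
        return self.active_s < self.max_active_s

    def record(self, latency_s: float) -> None:
        self.active_s += latency_s
        self.latencies.append(latency_s)

    def p95(self) -> float:
        ordered = sorted(self.latencies)
        if not ordered:
            return 0.0
        return ordered[min(int(0.95 * len(ordered)), len(ordered) - 1)]
```

Watching `p95()` rather than the mean is the point: a throttled device drags the tail long before it moves the average.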
Silent capability drift from quantization differences: Users on 2-bit quantized models and users on 4-bit models are, in effect, running different AI features. The 2-bit variant is faster and smaller but may produce noticeably different outputs on nuanced tasks. If your monitoring aggregates quality metrics across all users without controlling for quantization level, you'll miss regressions that affect only the most constrained devices.
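Segmenting a quality metric by version and quantization level is a one-line group-by; a minimal sketch with made-up numbers:

```python
from collections import defaultdict
from statistics import mean

def segmented_quality(events: list) -> dict:
    """Aggregate a quality score per (model_version, quantization) cohort
    instead of one global average."""
    buckets = defaultdict(list)
    for e in events:
        buckets[(e["model_version"], e["quant"])].append(e["quality"])
    return {cohort: mean(scores) for cohort, scores in buckets.items()}

events = [
    {"model_version": "1.3", "quant": "q4", "quality": 0.92},
    {"model_version": "1.3", "quant": "q2", "quality": 0.71},
    {"model_version": "1.3", "quant": "q4", "quality": 0.90},
]
# A single global mean (~0.84) would hide the q2 cohort sitting at 0.71.
```

The same segmentation keys should appear on every telemetry event, so regressions confined to the most constrained devices surface as their own row rather than as noise in the fleet-wide average.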
The Operational Posture
On-device LLM is not cloud LLM with a smaller model. It's a different operational paradigm with different failure modes, different update dynamics, and different privacy implications.
The teams making it work treat the model as a fleet, not a service. They version the entire artifact ensemble atomically. They monitor output quality segmented by model version, hardware tier, and quantization level. They build capability negotiation that routes requests to the appropriate backend — local or cloud — based on what's actually available, not what's theoretically installed. They implement fallback paths that are transparent to users and testable by engineers.
The teams that struggle treat on-device AI as just another deployment target alongside cloud, assuming the operational model is the same. It isn't. The moment you push model weights to a user's device, you've created a distributed system with partial updates, inconsistent versions, and no central rollback mechanism. Managing that system requires the same rigor you'd apply to any distributed infrastructure — the kind where you monitor the fleet, test updates in canary cohorts, and design every component with graceful degradation in mind.
The hardware is ready. The frameworks are mature. The hard part is the operations.
- https://v-chandra.github.io/on-device-llms/
- https://machinelearning.apple.com/research/apple-foundation-models-2025-updates
- https://developers.googleblog.com/on-device-genai-in-chrome-chromebook-plus-and-pixel-watch-with-litert-lm/
- https://medium.com/data-science-collective/on-device-llm-or-cloud-api-a-practical-checklist-for-product-owners-and-architects-30386f00f148
- https://arxiv.org/html/2412.15803v2
- https://blog.agentailor.com/posts/chrome-builtin-ai-cloud-hybrid-system
- https://arxiv.org/html/2507.16731v1
- https://arxiv.org/html/2507.19608
- https://www.callstack.com/blog/local-llms-on-mobile-are-a-gimmick
- https://rickytakkar.com/blog_russet_mlx_benchmark.html
- https://machinelearning.apple.com/research/exploring-llms-mlx-m5
- https://dl.acm.org/doi/pdf/10.1145/3696630.3728714
