On-Device LLM Inference in Production: When Edge Models Are Right and What They Actually Cost
Most teams decide to use on-device LLM inference the same way they decide to rewrite their database: impulsively, in response to a problem that a cheaper solution could have solved. The pitch is always compelling—no network round-trips, full privacy, zero inference costs—and the initial prototype validates it. Then six months post-ship, the model silently starts returning worse outputs, a new OS update breaks quantization compatibility, and your users on budget Android phones are running a version you can't push an update to.
This guide is about making that decision with eyes open. On-device inference is genuinely the right call in specific situations, but the cost structure is different from what teams expect, and the production failure modes are almost entirely unlike cloud LLM deployment.
The Three Cases Where On-Device Inference Actually Wins
On-device models make economic and technical sense in three specific scenarios. Outside these, you are almost always better served by a cloud API or a hosted open-weight endpoint.
Privacy that cannot be satisfied by contractual guarantees. Healthcare, legal, and financial applications often need data to never leave the device—not because cloud providers are untrustworthy, but because compliance regimes (HIPAA, GDPR, sector-specific regulations) require it as a hard architectural property, not a policy commitment. On-device inference provides this without negotiation. No transmission, no storage, no audit surface outside the device itself.
Offline-first functionality. Field technicians, pilots, maritime operators, and enterprise tools for factory floors need inference to work without a network. A 7B model that runs at 15 tokens per second locally is infinitely better than a frontier API call that never completes. This is also the right pattern for apps that need synchronous, low-latency behavior—typing assistants, search autocomplete, real-time translation—where even a 200ms network round-trip feels broken to users.
Scale where per-token cloud costs become prohibitive. This is the most misunderstood case. On-device inference shifts cost from per-token to fixed hardware cost on the user's device. If your feature runs millions of short inferences per day per user, and the task doesn't require frontier-model reasoning, routing those to the device eliminates a cost center entirely. The math only works when utilization is sustained; idle GPU hours are expensive on your hardware but free on your users' phones.
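The shift from per-token to fixed cost can be sanity-checked with back-of-envelope math. A minimal sketch; the user count, call volume, token count, and price per million tokens below are illustrative assumptions, not quotes from any provider:

```python
# Illustrative only: the monthly cloud bill for high-volume short inferences
# that on-device routing would eliminate. All inputs are assumptions.

def monthly_cloud_cost(users: int, calls_per_user_day: int,
                       tokens_per_call: int, usd_per_million_tokens: float) -> float:
    """Monthly API spend if every call goes to a cloud endpoint."""
    tokens = users * calls_per_user_day * tokens_per_call * 30
    return tokens / 1_000_000 * usd_per_million_tokens

# 1M users, 50 short calls/user/day, ~200 tokens each, $0.60 per 1M tokens
cost = monthly_cloud_cost(1_000_000, 50, 200, 0.60)
print(f"${cost:,.0f}/month")  # $180,000/month
```

At these assumed volumes the bill scales linearly with users, while the on-device cost per marginal user is zero, which is exactly why the case only applies when call volume is genuinely high.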
The Quality-Size Tradeoff Is Not What Benchmarks Suggest
Model benchmarks are optimized for tasks that behave well under standardized evaluation. Production tasks are not standardized. The practical capability cliff by parameter count looks like this:
1B models run fast (sub-200ms TTFT on modern mobile silicon) and fit comfortably on any device released in the last three years. They are useful for intent classification, spam detection, simple summarization of short texts, and language identification. They fail reliably on multi-step reasoning, complex instruction following, and anything requiring broad factual knowledge.
3B models (Gemini Nano, Llama 3.2 3B) hit a useful sweet spot for many consumer applications. Sentiment analysis, basic customer support routing, short-form content generation, and extraction tasks all become tractable. A 2025 benchmark comparing small models on Android found that Gemma2-2B provided meaningfully better answers than its 1B and 3B peers—suggesting that architecture matters as much as parameter count at this scale.
7B models are the most commonly deployed size for quality-sensitive tasks. They handle code completion, document summarization, and moderate reasoning. Storage is the hard constraint: a 7B model in Q4 quantization requires 4–5GB, which is feasible on premium devices but rules out large portions of the Android install base. Sustained inference on a 7B model also generates enough heat to trigger thermal throttling within minutes on most phones.
Quantization choices compound these constraints. INT8 is the production-safe default—roughly 75% size reduction from FP32, typically less than 1% accuracy degradation when properly calibrated. INT4 (AWQ format, specifically) cuts size further but introduces noticeable quality drops on reasoning and math tasks. The correct approach is to benchmark quantized models on your actual task distribution before committing, not on leaderboards. A model that benchmarks well in INT4 on MMLU may degrade substantially on your specific extraction or summarization workload.
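The storage figures above follow directly from bits per weight. A rough sizing sketch; the 10% overhead factor (embeddings, quantization scales, tensors kept at higher precision) is a rule-of-thumb assumption, not a format specification:

```python
# Rough on-disk size of a quantized model. The overhead factor is an
# assumption covering scales/zero-points and higher-precision tensors.

def model_size_gb(params_billion: float, bits_per_weight: float,
                  overhead: float = 0.10) -> float:
    raw_bytes = params_billion * 1e9 * bits_per_weight / 8
    return raw_bytes * (1 + overhead) / 1e9

print(f"7B @ FP16: {model_size_gb(7, 16):.1f} GB")   # ~15.4 GB
print(f"7B @ INT8: {model_size_gb(7, 8):.1f} GB")    # ~7.7 GB
print(f"7B @ ~Q4:  {model_size_gb(7, 4.5):.1f} GB")  # ~4.3 GB, in the 4-5GB range above
```

The 4.5 effective bits per weight reflects that practical 4-bit formats store extra metadata per weight group, which is why Q4 models land above the naive 3.5GB figure.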
Framework Selection Isn't About Performance—It's About Deployment Surface
The on-device inference framework you choose is largely determined by which platforms you need to support.
CoreML is the right choice if you are building exclusively for iOS and macOS. It provides native Neural Engine acceleration, and Apple silicon handles LLM inference well—Llama 3.1 8B achieves around 33 tokens/second on M1 Max under CoreML. The downside is total lock-in to Apple's ecosystem and conversion pipeline.
ExecuTorch (Meta's production framework, generally available as of late 2025) is the right choice for cross-platform mobile deployment, particularly if you need custom silicon optimization. It powers WhatsApp, Instagram, Messenger, and Facebook at production scale—handling billions of users' on-device inference across diverse hardware. The 50KB base footprint and support for NPU delegation make it the most production-proven option for consumer apps.
ONNX Runtime is the pragmatic choice for teams that need Android and iOS support with a single model format and moderate ops overhead. It lacks ExecuTorch's per-silicon optimization depth, but the interoperability story (exporting from PyTorch or TensorFlow directly, delegating to CoreML on iOS or NNAPI/QNN on Android) simplifies the model pipeline.
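The delegation story can be made concrete with a provider-priority sketch. The provider names match ONNX Runtime's conventions, but whether a given provider is actually available on a device is an assumption you must verify at runtime (via `ort.get_available_providers()`); ONNX Runtime falls back down the list for unsupported operators:

```python
# Sketch of per-platform execution-provider selection for ONNX Runtime.
# Provider availability on any specific device is an assumption to verify.

def provider_priority(platform: str) -> list[str]:
    if platform == "ios":
        # Delegate to Apple Neural Engine/GPU via CoreML, with CPU fallback
        return ["CoreMLExecutionProvider", "CPUExecutionProvider"]
    if platform == "android":
        # Prefer Qualcomm's QNN on Snapdragon, then NNAPI, then CPU
        return ["QNNExecutionProvider", "NnapiExecutionProvider",
                "CPUExecutionProvider"]
    return ["CPUExecutionProvider"]

# Typical use (commented out; requires the onnxruntime package and a model):
# session = onnxruntime.InferenceSession(
#     "model.onnx", providers=provider_priority("android"))
```

Ordering matters: the runtime tries providers left to right, so putting the CPU provider last guarantees the session still constructs on devices where the NPU path is missing.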
WebLLM/WebGPU is no longer experimental. As of early 2026, WebGPU is enabled by default in all major browsers, and WebLLM achieves up to 80% of native performance. This is the right path for web-first apps that can't afford app store friction or need model updates to deploy instantly without an app release cycle.
The Real Cost Structure
Cloud API costs are visible and per-request. On-device inference costs are invisible and borne by your users' hardware. This asymmetry distorts cost calculations in both directions.
For teams running moderate inference volumes on their own GPU infrastructure, the break-even against a mid-tier cloud API (GPT-4o-mini class) runs 18–24 months at sustained 70%+ GPU utilization. At 10% utilization, the same hardware costs 10x more per token than just calling the API. This is the math that kills self-hosted deployments: teams provision for peak and pay for idle.
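The utilization sensitivity is easy to see in a small model. A sketch with assumed numbers (GPU rental price and full-load throughput are illustrative, not vendor figures):

```python
# Illustrative self-hosted cost per million tokens as a function of
# GPU utilization. The $/hour and tokens/sec inputs are assumptions.

def self_hosted_cost_per_million(gpu_usd_per_hour: float,
                                 tokens_per_sec_full_load: float,
                                 utilization: float) -> float:
    tokens_per_hour = tokens_per_sec_full_load * 3600 * utilization
    return gpu_usd_per_hour / tokens_per_hour * 1_000_000

# $2/hr GPU serving 2,000 tok/s at full load
busy = self_hosted_cost_per_million(2.0, 2000, 0.70)  # ~$0.40 per 1M tokens
idle = self_hosted_cost_per_million(2.0, 2000, 0.10)  # ~$2.78 per 1M tokens
print(f"{idle / busy:.0f}x more expensive per token at 10% utilization")
```

The hardware bill is identical in both rows; only the denominator of served tokens changes, which is the whole mechanism behind the provision-for-peak trap.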
On-device inference sidesteps this entirely—the GPU is the user's phone. But it introduces costs that don't appear in API invoices: model size affects app store download conversion rates, inference affects battery life and thermal behavior, and on older devices you're competing with everything else the device is doing.
Google's Pixel 10 Pro achieves nearly 1,000 tokens per second for specific Gemini Nano tasks. Pixel 9 Pro runs at roughly 510 tokens per second. Samsung S24 Ultra on Snapdragon 8 Gen 3 benchmarks at around 10 tokens per second on larger models—roughly fifty times slower, though not a like-for-like comparison—because thermal throttling floors GPU frequency within minutes of sustained load. Your "average mobile device" is not an M-series chip; it's a mid-range Android phone that overheats under sustained LLM workloads.
The Update Problem Nobody Plans For
This is where most production on-device deployments quietly fail.
When you deploy a cloud API call, updating the underlying model is a configuration change. When you ship a model baked into an app, updating it requires a new app release, app store review, and user willingness to update. Getting a new version onto 90% of your install base typically takes weeks to months. During that period, you're serving multiple model versions simultaneously with no ability to hotfix the older ones.
Delta updates can reduce download size by 80% or more compared to full model replacements, but they require infrastructure to compare versions and serve diffs—most teams don't build this until it becomes a crisis. Android's hardware fragmentation amplifies the problem: 3.9 billion Android devices across 1,300+ manufacturers means behavior differs unpredictably across device families, and older devices may never receive your update at all.
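The delta-serving infrastructure reduces to a manifest lookup on the client side. A minimal sketch; the manifest schema, field names, URLs, and byte counts below are illustrative assumptions:

```python
# Sketch of a client-side update check against a hypothetical manifest:
# prefer a version-to-version delta, fall back to a full download.

def choose_download(installed: str, manifest: dict) -> dict:
    """Return the artifact the device should fetch, preferring deltas."""
    latest = manifest["latest_version"]
    if installed == latest:
        return {"action": "none"}
    delta = manifest["deltas"].get(f"{installed}->{latest}")
    if delta is not None:
        return {"action": "delta", "url": delta["url"], "bytes": delta["bytes"]}
    return {"action": "full", "url": manifest["full"]["url"],
            "bytes": manifest["full"]["bytes"]}

manifest = {
    "latest_version": "2.1.0",
    "full":   {"url": "https://cdn.example/model-2.1.0.bin", "bytes": 4_300_000_000},
    # A single hop delta, roughly the 80%+ reduction mentioned above
    "deltas": {"2.0.0->2.1.0": {"url": "https://cdn.example/d.bin",
                                "bytes": 700_000_000}},
}
print(choose_download("2.0.0", manifest)["action"])  # delta
print(choose_download("1.9.0", manifest)["action"])  # full
```

Note the fragmentation cost hiding in the schema: every supported source version needs its own delta entry, which is why teams cap how many versions back they serve diffs for.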
The subtler failure mode is capability drift. An LLM is a static snapshot of a distribution. Your users' queries evolve, your product surface area grows, and within 90 days of ship you typically see measurable degradation in output quality on the tasks that have shifted since training. Cloud deployments hide this because providers continuously update underlying models and you get implicit improvements. On-device deployments expose it because your model version is frozen in the hands of users who haven't updated.
Hardware-specific quantization failures are another trap. An INT4 model that was benchmarked on one generation of NPU may silently convert to FP16 at load time on a different chip—because the runtime doesn't support INT4 on that hardware—negating all memory savings without any error or warning. You discover this through user-reported battery drain, not through monitoring.
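One cheap defense is a load-time footprint check: if the runtime silently dequantized to FP16, resident size will be roughly 4x the expected INT4 footprint. A sketch; the tolerance factor is a rule-of-thumb assumption, and how you measure the model's resident bytes is platform-specific:

```python
# Sanity check for silent dequantization at model load time.
# The 1.5x tolerance is an assumption to absorb normal runtime overhead.

def precision_fallback_detected(params_billion: float, expected_bits: int,
                                measured_bytes: int,
                                tolerance: float = 1.5) -> bool:
    """True when the loaded model is far larger than its quantized spec."""
    expected_bytes = params_billion * 1e9 * expected_bits / 8
    return measured_bytes > expected_bytes * tolerance

# A "7B INT4" model should occupy ~3.5 GB; 14 GB means it was upconverted
print(precision_fallback_detected(7, 4, 14_000_000_000))  # True: emit telemetry
print(precision_fallback_detected(7, 4, 4_000_000_000))   # False: as expected
```

Firing a telemetry event here turns "users on chipset X report battery drain" into a dashboard signal you see on day one instead of a support thread you see in month three.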
What a Production-Hardened On-Device Deployment Looks Like
Teams that operate on-device inference reliably do several things that are rarely mentioned in framework documentation.
They version models explicitly and track which version each device is running. This is table stakes for detecting when a specific model version correlates with quality regressions on a device family.
They instrument for output quality, not just availability. A model that returns responses is not a healthy model. Sampling production traffic against a small held-out evaluation set continuously is the only reliable way to detect silent capability drift before users report it.
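The drift check itself can be very small once you have a task-specific scorer. A sketch; the score values and the 5-point threshold are placeholder assumptions, and the scorer that produces them is whatever metric fits your task:

```python
# Sketch of a drift alert over sampled eval scores. Baseline and
# threshold values are illustrative assumptions.

def drift_alert(scores: list[float], baseline_mean: float,
                max_drop: float = 0.05) -> bool:
    """Flag when the mean eval score falls more than max_drop below the
    baseline recorded for this model version at release time."""
    return (baseline_mean - sum(scores) / len(scores)) > max_drop

# A device cohort whose sampled scores sag ~8 points triggers the alert
print(drift_alert([0.78, 0.80, 0.76], baseline_mean=0.86))  # True
print(drift_alert([0.85, 0.87, 0.86], baseline_mean=0.86))  # False
```

Keyed by (model version, device family), this is what lets you see that v2.0.0 on one chipset family is degrading while the same weights elsewhere are fine.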
They plan for thermal budget. Running inference only while the device is plugged in, or building in cool-down periods between inference calls, prevents the degraded-performance-over-session pattern that manifests as "AI is slow on older phones."
They treat the web as the escape hatch. For tasks where model quality matters enough to require frequent updates, shipping WebGPU inference avoids app release cycles entirely. The browser update mechanism is faster and more reliable than app store distribution, and models can be fetched on demand rather than bundled.
They design a hybrid routing layer from day one. Simple, latency-sensitive, privacy-sensitive tasks route to the device. Complex reasoning, tasks requiring up-to-date information, and requests from devices without capable hardware route to the cloud. Getting this routing right—including fallback when the on-device model fails or thermal constraints prevent inference—is harder than building either endpoint independently.
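The routing decision itself can be sketched as a small pure function. The task labels, capability fields, and routing outcomes below are illustrative assumptions; a production router would also weigh model version, thermal state history, and per-task quality telemetry:

```python
from dataclasses import dataclass

# Sketch of a hybrid device/cloud router. Task names and device fields
# are hypothetical placeholders for your own product's categories.

@dataclass
class Device:
    has_capable_npu: bool
    thermal_ok: bool
    online: bool

def route(task: str, privacy_sensitive: bool, device: Device) -> str:
    on_device_tasks = {"classify", "extract", "autocomplete"}
    if privacy_sensitive:
        # Hard architectural requirement: the data never leaves the device
        return "device" if device.has_capable_npu else "refuse"
    if task in on_device_tasks and device.has_capable_npu and device.thermal_ok:
        return "device"
    # Complex reasoning, fresh information, or incapable/overheated hardware
    return "cloud" if device.online else "queue_for_later"

d = Device(has_capable_npu=True, thermal_ok=False, online=True)
print(route("autocomplete", privacy_sensitive=False, device=d))  # cloud
```

Keeping the router a pure function of explicit inputs is what makes the fallback behavior testable; the hard part the paragraph above describes is keeping those inputs accurate on 1,300+ device families.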
The Honest Decision Framework
On-device inference is the right architectural choice when you have strong privacy requirements, need reliable offline functionality, or are running high-volume low-complexity inference that makes cloud costs unsustainable. It is the wrong choice when you need frequent model updates, when your target devices span the full Android ecosystem, or when task complexity requires frontier-model reasoning.
The teams that succeed with it treat the model as infrastructure: versioned, monitored, with explicit deprecation timelines and a fallback path. The teams that fail treat it as an optimization—something to add after the cloud version is working—and discover six months post-ship that they've shipped a static model with no viable update path, on hardware they can't test, to users they can't reach.
The hardware is getting faster. Pixel 10 Pro and A18-class silicon handle 7B models at interactive latency. WebGPU has made browser inference a real option. The framework ecosystem is maturing. On-device LLM inference is no longer experimental—but the production discipline required to operate it reliably is still an unsolved problem for most teams.
