On-Device AI Needs a Fleet Manager, Not a Model Card
The on-device AI demo ran a single 4-bit Llama variant on a single test phone, and ran it well. Six months after launch, the same feature has a one-star tail of reviews complaining about heat, battery drain, or, worse, silent quality degradation that users only notice as "the AI got dumber on my old phone." The model didn't change. The fleet did. And the team that thought it was shipping a model has discovered, late, that it was actually shipping to a fleet.
This is the gap that sinks most on-device AI launches: the strategy is built around picking the model, when the actual hard problem is delivering the right model to each device class, observing whether it's working, and rolling it back when it isn't. The discipline that closes that gap looks far more like CDN operations than like ML research — manifest-driven delivery, per-cohort telemetry, decoupled rollout channels, and a model-variant pipeline that produces N quantization tiers from one trained checkpoint. Most teams don't have any of that. They have a model card and a build artifact.
The fleet is not a phone, it's a population
A modern Android install base spans something like five SoC vendors, four to five generations of NPU architecture, three or four supported OS versions, and three memory tiers: call it 4GB, 6–8GB, and 12GB+ for practical purposes. iOS is narrower but not narrow: the actively-supported iPhone fleet still spans A14 through A18, with materially different Neural Engine throughput, memory bandwidth, and thermal envelopes. Multiplying these out yields hundreds of meaningful device classes, and the multiplication isn't a thought experiment. Practitioners report testing matrices that look like 5 SoCs × 5 RAM tiers × 4 OS versions, which is already 100 configurations before NPU generation enters the picture.
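How quickly the axes multiply is easy to check. The sketch below uses placeholder axis labels (the SoC names, RAM tiers, OS versions, and NPU generations are illustrative assumptions, not a description of any real install base); only the counts matter.

```python
from itertools import product

# Hypothetical axis values; the labels are placeholders.
soc_families = ["soc_a", "soc_b", "soc_c", "soc_d", "soc_e"]
ram_tiers_gb = [4, 6, 8, 12, 16]
os_versions = [12, 13, 14, 15]

# One test configuration per (SoC, RAM tier, OS version) combination.
test_matrix = list(product(soc_families, ram_tiers_gb, os_versions))
print(len(test_matrix))  # 5 * 5 * 4 = 100

# Each extra axis multiplies the total again, e.g. four NPU generations.
npu_generations = ["gen1", "gen2", "gen3", "gen4"]
full_matrix = list(
    product(soc_families, npu_generations, ram_tiers_gb, os_versions)
)
print(len(full_matrix))  # 5 * 4 * 5 * 4 = 400
```

In practice many of these combinations never shipped, so a real matrix would be pruned against observed fleet data rather than enumerated blindly.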
Each combination has a different "best model" given quantization tolerance and thermal headroom. A flagship with 12GB of RAM and dedicated low-precision math units can comfortably run a 4B model at INT4, with an 8B variant at W4A8 sometimes performing better than a 3B at FP16. A 6GB midrange phone running an older NPU without low-precision support might fall back to CPU execution and run roughly 5x slower, which means the same 4B model that flies on the flagship throttles to unusability mid-paragraph. A 4GB device probably can't run a useful generative model at all and should route to a cloud fallback or a much smaller specialized model, though it can run a 1B classifier just fine.
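The routing logic implied by these three examples can be sketched in a few lines. Everything here is an assumption for illustration: the `Device` fields, the RAM cut-offs, and the variant names are hypothetical, and a real policy would come from measured eval and thermal data rather than hard-coded thresholds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    ram_gb: int
    npu_low_precision: bool  # assumed flag: NPU has INT4/INT8 math units


def choose_route(device: Device) -> str:
    """Return a hypothetical variant name or route for this device."""
    if device.ram_gb < 6:
        # Too little memory for a useful generative model.
        return "cloud_fallback"
    if not device.npu_low_precision:
        # Inference would land on the CPU, so prefer a much smaller model.
        return "small-1b-w4a16"
    if device.ram_gb >= 12:
        # Flagship: room for the larger variant on the low-precision path.
        return "gen-4b-w4a8"
    return "gen-3b-w4a16"


print(choose_route(Device(ram_gb=12, npu_low_precision=True)))
print(choose_route(Device(ram_gb=6, npu_low_precision=False)))
print(choose_route(Device(ram_gb=4, npu_low_precision=True)))
```

Hard-coding the policy in the client like this is exactly what a server-side manifest is meant to replace; the sketch only shows the shape of the decision.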
The team that ships one model for every device has chosen a spot on a Pareto frontier without knowing the frontier exists. Either the flagship is stuck with a model smaller than it could comfortably handle, or the long tail is throttling, overheating, and producing degraded output that the eval set never caught.
Quantization tolerance is per-task, not per-model
The standard recipe (train in 16-bit, quantize to 4-bit for deployment) gets you a single artifact, but not a deployment strategy. The fleet manifest needs more granular variants because quantization quality loss is non-uniform across tasks. An INT4 model that's perfectly fine for autocomplete may fall off a cliff for code generation. A summarization task that tolerates 8-bit weights may degrade noticeably at 4-bit on long-context inputs because attention error compounds across tokens. The W8A8 and W4A8 schemes that exploit NPU 8-bit activation paths can outperform weight-only W4 (W4A16) on devices with the right silicon support, but lose to it on devices without that support.
What this means for the variant pipeline: producing one quantized artifact per architecture isn't enough. The pipeline needs to produce, from a single trained checkpoint, a small matrix of quantization tiers — say FP16, W8A8, W4A8, W4A16 — and run the eval set against each variant on representative hardware emulators or reference devices. The output of the pipeline isn't just "here are four files," it's "here are four files plus an eval table showing which task qualities each variant clears, broken out by representative device class."
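A minimal sketch of what that eval table could look like as data, and how it answers "which variants are safe for this device class?". The scores, task names, thresholds, and the `passing_variants` helper are all made up for illustration; nothing here reflects measured results.

```python
from typing import Dict, List, Tuple

# (variant, device_class, task) -> eval score in [0, 1]. Values are invented.
Scores = Dict[Tuple[str, str, str], float]


def passing_variants(
    scores: Scores, thresholds: Dict[str, float], device_class: str
) -> List[str]:
    """Variants that clear every task threshold on one device class."""
    variants = sorted({v for (v, d, _task) in scores if d == device_class})
    passing = []
    for variant in variants:
        # A missing score counts as a failure, not as a pass.
        if all(
            scores.get((variant, device_class, task), 0.0) >= minimum
            for task, minimum in thresholds.items()
        ):
            passing.append(variant)
    return passing


scores: Scores = {
    ("fp16", "flagship", "autocomplete"): 0.95,
    ("fp16", "flagship", "tool_calling"): 0.93,
    ("w4a16", "flagship", "autocomplete"): 0.93,
    ("w4a16", "flagship", "tool_calling"): 0.78,
    ("w4a8", "flagship", "autocomplete"): 0.94,
    ("w4a8", "flagship", "tool_calling"): 0.90,
}
thresholds = {"autocomplete": 0.90, "tool_calling": 0.88}

print(passing_variants(scores, thresholds, "flagship"))  # ['fp16', 'w4a8']
```

Treating a missing measurement as a failure is a deliberate choice in this sketch: a variant that was never evaluated on a device class should not be eligible for it.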
A team without this matrix is flying blind: they cannot answer the question "if I push variant X to device class Y, will quality regress?" without doing the experiment in production. Most teams do exactly that. The first time they discover that INT4 broke their tool-calling reliability is when a Pixel 7 user files a support ticket and a senior engineer has to bisect across model versions and devices to figure out which combination caused it.
The fleet manifest is a real artifact, not a comment in a README
The manifest is the small piece of infrastructure that distinguishes a team that ships on-device AI as a product from a team that shipped it once. It maps device class — typically a tuple of (chipset family, NPU capability tier, RAM tier, OS version range) — to the model variant that should run on that class. It carries a version, an effective date range, and a fallback path. It's deployed independently of the app binary and can be updated without an app-store review cycle.
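As a rough sketch, the manifest can be modelled as a versioned list of entries with a first-match lookup. The class and field names below (`DeviceClass`, `ManifestEntry`, `Manifest.lookup`), the matching rule, and the placeholder hashes are assumptions for illustration, not a standard schema.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class DeviceClass:
    chipset_family: str
    min_npu_tier: int          # capability tier; higher means more capable
    min_ram_gb: int
    os_range: Tuple[int, int]  # inclusive (min, max) major OS version


@dataclass(frozen=True)
class ManifestEntry:
    device_class: DeviceClass
    variant_hash: str  # content hash of the model artifact
    fallback: str      # e.g. "cloud", "smaller_variant", "disabled"


@dataclass(frozen=True)
class Manifest:
    version: int
    effective_from: str   # ISO 8601 date
    effective_until: str  # ISO 8601 date
    entries: Tuple[ManifestEntry, ...]  # ordered most demanding first

    def lookup(
        self, chipset: str, npu_tier: int, ram_gb: int, os_major: int
    ) -> Optional[ManifestEntry]:
        """Return the first entry whose device class admits this device."""
        for entry in self.entries:
            c = entry.device_class
            if (
                c.chipset_family == chipset
                and npu_tier >= c.min_npu_tier
                and ram_gb >= c.min_ram_gb
                and c.os_range[0] <= os_major <= c.os_range[1]
            ):
                return entry
        return None


manifest = Manifest(
    version=3,
    effective_from="2025-01-01",
    effective_until="2025-06-30",
    entries=(
        ManifestEntry(DeviceClass("vendor_x", 3, 12, (14, 16)),
                      "sha256:<flagship-artifact>", "smaller_variant"),
        ManifestEntry(DeviceClass("vendor_x", 1, 6, (13, 16)),
                      "sha256:<midrange-artifact>", "cloud"),
    ),
)

entry = manifest.lookup("vendor_x", npu_tier=1, ram_gb=6, os_major=14)
print(entry.fallback if entry else "no on-device variant")  # cloud
```

Because the lookup is first-match, entry order carries meaning; a real implementation would probably validate that ordering when the manifest is published.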
What lives in the manifest, concretely:
- Device class definitions: not just "iPhone 15 Pro" but "device with ≥8GB RAM, ≥30 TOPS NPU, OS supports INT8 NPU acceleration, last-three-years generation." Anything less specific creates buckets too coarse to meaningfully target.
- Variant assignment: which artifact (by content hash) each class should pull. Hashes, not version names — version names drift, hashes don't.
- Rollout state: what percentage of each device class is currently on the new variant vs. the previous variant. This needs to be queryable, because incidents often look like "users in cohort X are degraded" and you need to know whether cohort X is on variant N or N-1 before you can diagnose anything.
- Fallback routing: when the on-device variant fails (OOM, repeated timeouts, NPU unavailable), where does the request go? Cloud? Smaller variant? Disabled with graceful UI? This belongs in the manifest because the answer differs per device class.
A surprising number of teams treat all four of these as implicit. The model is bundled with the app binary, the variant is whichever one shipped, the rollout state is "everyone who installed this version," and the fallback is "the feature crashes." Every one of those defaults will hurt later.
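Making rollout state and fallback routing explicit does not take much code. The sketch below assumes a stable per-device identifier, a percentage-based rollout, and a small per-class fallback table; the function names, failure labels, and table contents are hypothetical.

```python
import hashlib

FAILURE_KINDS = {"oom", "timeout", "npu_unavailable"}

# Hypothetical per-class fallback table; it would live in the manifest.
FALLBACKS = {
    "flagship": "smaller_variant",
    "midrange": "cloud",
    "low_ram": "disabled",
}


def rollout_bucket(device_id: str, salt: str) -> int:
    """Map a device to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(f"{salt}:{device_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def assigned_variant(
    device_id: str,
    rollout_percent: int,
    new_hash: str,
    previous_hash: str,
    salt: str = "manifest-v3",
) -> str:
    """Devices below the rollout line get the new variant, others stay put."""
    if rollout_bucket(device_id, salt) < rollout_percent:
        return new_hash
    return previous_hash


def route_on_failure(device_class: str, failure: str) -> str:
    """Where a request goes when the on-device variant fails."""
    if failure not in FAILURE_KINDS:
        raise ValueError(f"unknown failure kind: {failure}")
    # Unknown device classes get the safest option.
    return FALLBACKS.get(device_class, "disabled")


print(assigned_variant("device-123", 0, "sha256:<new>", "sha256:<old>"))
print(route_on_failure("midrange", "oom"))  # cloud
```

Because the bucket is a pure function of the device identifier and salt, "is cohort X on variant N or N-1?" can be answered offline from the manifest alone, without waiting on telemetry.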
OTA model delivery is a separate channel from app-binary delivery
