On-Device AI Needs a Fleet Manager, Not a Model Card
The on-device AI demo that shipped last quarter ran a single 4-bit Llama variant, ran it on a single test phone, and ran it well. Six months later, the same feature has a one-star tail of reviews complaining about heat, battery drain, or — worse — silent quality degradation that users only notice as "the AI got dumber on my old phone." The model didn't change. The fleet did. And the team that thought it was shipping a model has discovered, late, that it was actually shipping a fleet.
This is the gap that sinks most on-device AI launches: the strategy is built around picking the model, when the actual hard problem is delivering the right model to each device class, observing whether it's working, and rolling it back when it isn't. The discipline that closes that gap looks far more like CDN operations than like ML research — manifest-driven delivery, per-cohort telemetry, decoupled rollout channels, and a model-variant pipeline that produces N quantization tiers from one trained checkpoint. Most teams don't have any of that. They have a model card and a build artifact.
The fleet is not a phone, it's a population
A modern Android install base spans something like five SoC vendors, four to five generations of NPU architecture, three or four supported OS versions, and at least three memory tiers: call it 4GB, 6–8GB, and 12GB+ for practical purposes. iOS is narrower but not narrow: the actively-supported iPhone fleet still spans A14 through A18, with materially different Neural Engine throughput, memory bandwidth, and thermal envelopes. Multiplying these out yields hundreds of meaningful device classes, and the multiplication isn't a thought experiment — practitioners report testing matrices that look like 5 SoCs × 5 RAM tiers × 4 OS versions, 100 combinations before the NPU-generation and model-variant axes multiply the count well into the hundreds.
Each combination has a different "best model" given quantization tolerance and thermal headroom. A flagship with 12GB of RAM and dedicated low-precision math units can comfortably run a 7B model at INT4, with an 8B variant at W4A8 sometimes performing better than a 3B at FP16. A 6GB midrange phone running an older NPU without low-precision support might fall back to CPU execution and lose 5x throughput, which means the same 4B model that flies on the flagship throttles to unusability mid-paragraph. A 4GB device probably can't run a useful generative model at all and should route to a cloud fallback or a much smaller specialized model — but it can run a 1B classifier just fine.
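A minimal Kotlin sketch of that capability gate, with illustrative RAM and NPU thresholds rather than vendor-tested ones:

```kotlin
// Illustrative capability gate: which model tier a device class can sustain.
// The RAM and NPU thresholds are assumptions for this sketch, not guidance.
enum class ModelTier { CLOUD_FALLBACK, TINY_1B, MID_4B, FLAGSHIP_7B_INT4 }

data class DeviceCaps(val ramGb: Int, val npuLowPrecision: Boolean)

fun selectTier(caps: DeviceCaps): ModelTier = when {
    caps.ramGb >= 12 && caps.npuLowPrecision -> ModelTier.FLAGSHIP_7B_INT4
    caps.ramGb >= 6 && caps.npuLowPrecision -> ModelTier.MID_4B
    caps.ramGb >= 6 -> ModelTier.TINY_1B  // no low-precision NPU: avoid the ~5x CPU fallback
    else -> ModelTier.CLOUD_FALLBACK      // 4GB class: route generative work off-device
}

fun main() {
    println(selectTier(DeviceCaps(ramGb = 12, npuLowPrecision = true)))  // FLAGSHIP_7B_INT4
    println(selectTier(DeviceCaps(ramGb = 6, npuLowPrecision = false)))  // TINY_1B
}
```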
The team that ships one model for every device has chosen a spot on a Pareto frontier without knowing the frontier exists. Either the flagship is leaving quality on the table by running a model smaller than it could comfortably handle, or the long tail is throttling, overheating, and producing degraded output that the eval set never caught.
Quantization tolerance is per-task, not per-model
The standard recipe — train in 16-bit, quantize to 4-bit for deployment — gets you a single artifact, but not a deployment strategy. The fleet manifest needs more granular variants because quantization quality loss is non-uniform across tasks. An INT4 model that's perfectly fine for autocomplete may fall off a cliff for code generation. A summarization task that tolerates 8-bit weights may degrade noticeably at 4-bit on long-context inputs because attention error compounds across tokens. The W8A8 and W4A8 schemes that exploit NPU 8-bit activation paths can outperform pure W4 on devices with the right silicon support, but lose to W4 on devices without it.
What this means for the variant pipeline: producing one quantized artifact per architecture isn't enough. The pipeline needs to produce, from a single trained checkpoint, a small matrix of quantization tiers — say FP16, W8A8, W4A8, W4A16 — and run the eval set against each variant on representative hardware emulators or reference devices. The output of the pipeline isn't just "here are four files," it's "here are four files plus an eval table showing which task qualities each variant clears, broken out by representative device class."
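One way to picture that output as data; the task names, device-class labels, and the no-evidence-means-no-ship rule are illustrative assumptions:

```kotlin
// Sketch of the variant pipeline's output: a tier matrix plus the eval table
// that gates each (tier, device class) pair. Scores are hypothetical.
enum class QuantTier { FP16, W8A8, W4A8, W4A16 }

data class EvalRow(
    val tier: QuantTier,
    val deviceClass: String,   // e.g. "flagship-int8-npu"
    val task: String,          // e.g. "summarization", "tool-calling"
    val score: Double,
    val threshold: Double,
) {
    val passes: Boolean get() = score >= threshold
}

data class PipelineOutput(
    val artifactsByTier: Map<QuantTier, String>,  // tier -> content hash
    val evalTable: List<EvalRow>,
)

fun safeToShip(output: PipelineOutput, tier: QuantTier, deviceClass: String): Boolean {
    val rows = output.evalTable.filter { it.tier == tier && it.deviceClass == deviceClass }
    return rows.isNotEmpty() && rows.all { it.passes }  // no evidence means no ship
}
```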
A team without this matrix is flying blind in the most practical sense: they cannot answer the question "if I push variant X to device class Y, will quality regress?" without running the experiment in production. Most teams do exactly that. The first time they discover that INT4 broke their tool-calling reliability is when a Pixel 7 user files a support ticket and a senior engineer has to bisect across model versions and devices to figure out which combination caused it.
The fleet manifest is a real artifact, not a comment in a README
The manifest is the small piece of infrastructure that distinguishes a team that ships on-device AI as a product from a team that shipped it once. It maps device class — typically a tuple of (chipset family, NPU capability tier, RAM tier, OS version range) — to the model variant that should run on that class. It carries a version, an effective date range, and a fallback path. It's deployed independently of the app binary and can be updated without an app-store review cycle.
What lives in the manifest, concretely:
- Device class definitions: not just "iPhone 15 Pro" but "device with ≥8GB RAM, ≥30 TOPS NPU, OS supports INT8 NPU acceleration, released within the last three years." Anything less specific creates buckets too coarse to meaningfully target.
- Variant assignment: which artifact (by content hash) each class should pull. Hashes, not version names — version names drift, hashes don't.
- Rollout state: what percentage of each device class is currently on the new variant vs. the previous variant. This needs to be queryable, because incidents often look like "users in cohort X are degraded" and you need to know whether cohort X is on variant N or N-1 before you can diagnose anything.
- Fallback routing: when the on-device variant fails (OOM, repeated timeouts, NPU unavailable), where does the request go? Cloud? Smaller variant? Disabled with graceful UI? This belongs in the manifest because the answer differs per device class.
A surprising number of teams treat all four of these as implicit. The model is bundled with the app binary, the variant is whichever one shipped, the rollout state is "everyone who installed this version," and the fallback is "the feature crashes." Every one of those defaults will hurt later.
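For concreteness, a minimal sketch of what explicit looks like; the field names and example values are assumptions, not a standard schema:

```kotlin
// Minimal fleet-manifest sketch. A real manifest would be a versioned
// document served independently of the app binary; values are illustrative.
data class DeviceClassRule(
    val minRamGb: Int,
    val minNpuTops: Int,
    val requiresInt8Npu: Boolean,
    val osApiRange: IntRange,        // e.g. Android API levels 31..35
)

enum class Fallback { CLOUD, SMALLER_VARIANT, FEATURE_DISABLED }

data class Assignment(
    val variantHash: String,         // content hash, not a version name
    val rolloutPercent: Int,         // share of the class on this variant
    val previousHash: String,        // the instant-rollback target
    val fallback: Fallback,
)

data class FleetManifest(
    val version: Int,
    val effectiveFrom: String,       // ISO-8601 date
    val assignments: List<Pair<DeviceClassRule, Assignment>>,
)

val example = FleetManifest(
    version = 42,
    effectiveFrom = "2025-06-01",
    assignments = listOf(
        DeviceClassRule(minRamGb = 8, minNpuTops = 30, requiresInt8Npu = true, osApiRange = 31..35) to
            Assignment(
                variantHash = "3f9a…",    // hypothetical hash
                rolloutPercent = 5,
                previousHash = "b441…",   // hypothetical hash
                fallback = Fallback.SMALLER_VARIANT,
            ),
    ),
)
```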
OTA model delivery is a separate channel from app-binary delivery
Coupling the model to the app binary is the single biggest unforced error in on-device AI today. Models change at a different cadence than UI code. They have different size profiles (often hundreds of MB, sometimes a GB+). They need rollback semantics distinct from app rollbacks — pushing an old app binary to fix a bad model means downgrading every other improvement that shipped in between.
The right shape is a delivery channel that looks like a CDN with model-aware semantics. The app boots, asks the manifest service which variant it should run for its device class, downloads (or verifies it already has) that artifact, and runs it. The model is content-addressed, so two app versions that target the same variant share a single download. Rollouts can be staged independently of app rollouts — you can move 5% of flagship users to variant N+1 while the app version stays exactly the same. Rollbacks are instant: change the manifest, devices fetch the previous artifact's hash on next launch.
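A sketch of that boot-time handshake; the manifest lookup and download hooks are hypothetical stand-ins, and the content-addressing check is the part that matters:

```kotlin
import java.io.File
import java.security.MessageDigest

// Boot-time resolution sketch. `ManifestEntry` and the `download` lambda
// stand in for hypothetical manifest-service and CDN clients.
data class ManifestEntry(val variantHash: String, val url: String)

fun sha256Hex(bytes: ByteArray): String =
    MessageDigest.getInstance("SHA-256").digest(bytes)
        .joinToString("") { "%02x".format(it) }

fun resolveModel(cacheDir: File, entry: ManifestEntry, download: (String) -> ByteArray): File {
    val cached = File(cacheDir, entry.variantHash)
    // Content-addressed cache: a file named by its hash that verifies is
    // reusable by any app version targeting the same variant.
    if (cached.exists() && sha256Hex(cached.readBytes()) == entry.variantHash) return cached
    val bytes = download(entry.url)
    require(sha256Hex(bytes) == entry.variantHash) { "artifact hash mismatch; refusing to load" }
    cached.writeBytes(bytes)
    return cached
}
```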
Android's A/B partition system and the commercial OTA tools (Shorebird for Flutter, the React Native ecosystem's hot-update tools, Capgo, and others) have already demonstrated that the platforms support the mechanics of decoupled binary delivery. App-store policies are increasingly accommodating: model artifacts that don't change app behavior in ways requiring review are generally fine to ship via OTA. The blocker is rarely the platform; it's that teams haven't built the manifest infrastructure on top.
A practical hardening detail most teams miss: the OTA channel needs automatic rollback triggered by health signals, not just manual rollback. If a new variant pushes 95th-percentile inference latency from 800ms to 4 seconds in the first hour of rollout, the manifest should revert without a human in the loop. Production OTA systems for app code have learned this; production OTA systems for models often haven't, because the team treating models like static assets never built the auto-revert in the first place.
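A sketch of that guardrail, with an assumed latency budget and a hypothetical `revertCohort` hook standing in for whatever actually flips the manifest:

```kotlin
// Auto-revert sketch. The budget, the sample floor, and the revert hook are
// assumptions for illustration; the point is that the manifest flips back on
// a health signal without a human in the loop.
data class RolloutHealth(val cohort: String, val p95LatencyMs: Long, val samples: Int)

class RolloutGuard(
    private val p95BudgetMs: Long = 1_200,  // assumed budget, not a standard
    private val minSamples: Int = 500,      // don't revert on noise
    private val revertCohort: (String) -> Unit,
) {
    fun check(health: RolloutHealth) {
        if (health.samples >= minSamples && health.p95LatencyMs > p95BudgetMs) {
            revertCohort(health.cohort)     // manifest points back at previousHash
        }
    }
}
```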
Per-cohort telemetry is the invisible 80%
The telemetry needed to operate an on-device model fleet is not the telemetry you have. Server-side AI gives you full request logs, full prompt and response, latency, cost, error codes — all in one place. On-device gives you almost none of that for free. You need to design for it.
The minimum useful set:
- Inference latency, joined to device class. Not just p50/p95 globally — p95 per device class, because a regression on the flagship cohort is a different bug from a regression on the long tail, and a global p95 will hide one inside the other. This means events need to carry device class identifiers (or at least the chipset and RAM tier), and the analytics backend has to be willing to slice on them.
- Inference quality signals. This is the hard one. On-device models don't get observed by the same eval rubric you ran in CI; you have to bring quality observation to where the inference is happening. Implicit signals (user reformulation rate, abandon rate, follow-up edits) are usually the only thing available at scale, but they need to be collected and joined to model variant and device class. Explicit signals (thumbs up/down) are useful only if the rate is high enough to detect regressions before users churn.
- Resource impact. Battery drain attributable to inference, thermal events triggered, OOM kills caused by the model loading. Operating systems give you most of this, but you have to ask. The flagship cohort being twice as fast as the midrange cohort is fine; the flagship cohort being twice as fast and burning a third more battery to do it is a problem you need to know about.
- Variant attribution. Every event must carry the variant hash it was produced under, because the entire point of having multiple variants is to detect which ones are working. An event stream that doesn't tag the variant cannot be used for variant comparison, no matter how rich it is otherwise.
If you've built this telemetry well, you can answer the question that distinguishes operating a fleet from shipping a model: "is variant N+1 better, worse, or unchanged compared to variant N, broken out by device class?" If you can't answer that quickly, you don't have a fleet — you have an artifact and hope.
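A sketch of an event shape that can answer that question, with illustrative field names:

```kotlin
// Telemetry event sketch: every inference event carries the variant hash and
// the device-class fields so the backend can slice per cohort.
data class InferenceEvent(
    val variantHash: String,        // attribution: which artifact produced this
    val chipset: String,            // e.g. "tensor-g2"
    val ramTierGb: Int,
    val osVersion: String,
    val latencyMs: Long,
    val thermalThrottled: Boolean,
    val userReformulated: Boolean,  // implicit quality signal
)

data class Cohort(val variantHash: String, val chipset: String, val ramTierGb: Int)

// The fleet question in one query: p95 latency per (variant, device class).
fun p95ByCohort(events: List<InferenceEvent>): Map<Cohort, Long> =
    events.groupBy { Cohort(it.variantHash, it.chipset, it.ramTierGb) }
        .mapValues { (_, cohort) ->
            val sorted = cohort.map { it.latencyMs }.sorted()
            sorted[((sorted.size * 95) / 100).coerceAtMost(sorted.size - 1)]
        }
```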
The org failure mode: the model is somebody else's problem
The pattern that fails is depressingly stable across companies. The mobile team treats the model as a bundled asset — they ship it once, they don't own its quality, and they assume the model team will tell them when to update. The model team treats the device fleet as a deployment substrate — they trained the model, they validated it on a few reference devices, and they assume the mobile team owns the long tail. Neither team owns the fleet manifest because neither team thinks the manifest is their problem.
Six months later, a customer support escalation reveals that the on-device AI feature is materially worse on Pixel 6a than on Pixel 8 Pro. Nobody can say whether that's a quantization issue, a memory pressure issue, or a thermal issue, because nobody collected the telemetry that would distinguish them. The fix nominated by an executive — "ship a better model" — doesn't help, because the failure isn't in the model artifact; it's in the absence of the discipline that would have routed the right artifact to the right device class in the first place.
The org reframing that has to land: shipping on-device AI is a fleet operations role, not a modeling role or a mobile-engineering role. It needs an explicit owner, a manifest under version control, a variant pipeline that produces evidence per tier, an OTA channel decoupled from app binaries, and per-cohort telemetry. Teams that staff for this look more like infrastructure teams than ML teams; teams that don't staff for it tend to discover the role exists by hitting the on-call rotation that needs it.
On-device AI is a CDN problem with a model attached
The architectural realization that closes the loop: the hard part of on-device AI is not the model — it's the per-device delivery, the per-cohort observability, and the rollback channel that proves the system is working. Frontier modeling teams have spent years building the discipline that turns checkpoints into products. On-device AI inherits very little of that discipline, because the deployment substrate (a long-tailed fleet of phones) is not the substrate that frontier modeling tools were built for.
The teams that get this right will look, in 2027, less like ML organizations and more like the kind of platform-engineering organizations that ship to billions of devices: a manifest service, a variant pipeline, a CDN-shaped delivery channel, fleet-wide telemetry, and a small group of engineers who think about cohort regressions for a living. The teams that don't will keep shipping a model and hoping the fleet looks like the test phone.
The model is the artifact. The fleet is the product. Treat them differently.
