Serving AI at the Edge: A Decision Framework for Moving Inference Out of the Cloud
Most AI inference decisions get made the same way: the model lives in the cloud because that's where you can run it, full stop. But that calculus is changing fast. Flagship smartphones now carry neural engines capable of running 7B-parameter models at interactive speeds. A Snapdragon 8 Elite can generate tokens from a 3B model at around 10 tokens per second — fast enough for conversational use — while Qualcomm's Hexagon NPU reaches 690 tokens per second on prefill (prompt processing). The question is no longer "can we run this on device?" but "should we, and when?"
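To make those throughput numbers concrete, here is a rough back-of-the-envelope latency calculation. It is an illustrative sketch only: it assumes prefill and decode run at the constant rates cited above, and ignores model load time, sampling overhead, and thermal throttling.

```python
# Rough end-to-end latency for on-device generation, using the throughput
# figures cited above (illustrative; real rates vary by model and device).

PREFILL_TOKS_PER_S = 690   # prompt-processing (prefill) rate from the text
DECODE_TOKS_PER_S = 10     # token-generation (decode) rate from the text

def response_latency(prompt_tokens: int, output_tokens: int) -> tuple[float, float]:
    """Return (time_to_first_token, total_time) in seconds."""
    ttft = prompt_tokens / PREFILL_TOKS_PER_S          # process the whole prompt
    total = ttft + output_tokens / DECODE_TOKS_PER_S   # then emit tokens one by one
    return ttft, total

# A 500-token prompt with a 100-token reply:
ttft, total = response_latency(500, 100)
print(f"TTFT ≈ {ttft:.2f}s, total ≈ {total:.1f}s")  # → TTFT ≈ 0.72s, total ≈ 10.7s
```

The asymmetry matters: time to first token is dominated by the fast prefill stage, so even long prompts feel responsive, while total response time is governed almost entirely by the slower decode rate.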
The answer is rarely obvious. Moving inference to the edge introduces real tradeoffs: a quality tax from quantization, a maintenance burden for fleet updates, and hardware fragmentation across device SKUs. But staying in the cloud has its own costs: round-trip latency measured in hundreds of milliseconds, cloud GPU bills that compound at scale, and data sovereignty problems that no SLA can fully solve. This post lays out a practical framework for navigating those tradeoffs.
