
The Edge Inference Decision Framework: When to Run AI Models Locally Instead of in the Cloud

12 min read
Tian Pan
Software Engineer

Most teams make the cloud-vs-edge decision by gut instinct: cloud is easier, so they default to cloud. Then a HIPAA audit hits, or the latency SLO slips by 400ms, or the monthly invoice arrives. Only then do they ask whether some of that inference should have been local all along.

The answer is almost never "all cloud" or "all edge." The teams running production AI at scale have settled on a tiered architecture: an on-device or on-premise model handles the majority of requests, and a cloud frontier model catches what the smaller model can't. Getting that routing right is an engineering decision, not an intuition.
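A minimal sketch of what that routing can look like is below, with hypothetical local_generate and cloud_generate stubs standing in for an on-device runtime and a provider API; the confidence gate is one common routing signal, not the only one:

```python
from dataclasses import dataclass

@dataclass
class Answer:
    text: str
    confidence: float  # e.g. mean token probability mapped to [0, 1]
    source: str        # "edge" or "cloud"

# Placeholder backends: in a real system these would wrap an on-device
# runtime (llama.cpp, Core ML) and a cloud provider API respectively.
def local_generate(prompt: str) -> Answer:
    return Answer(text="local draft answer", confidence=0.9, source="edge")

def cloud_generate(prompt: str) -> Answer:
    return Answer(text="frontier model answer", confidence=0.99, source="cloud")

def route(prompt: str, threshold: float = 0.75) -> Answer:
    """Confidence-gated routing: edge first, cloud as the catch-all tier."""
    answer = local_generate(prompt)
    if answer.confidence >= threshold:
        return answer
    return cloud_generate(prompt)

print(route("Classify this support ticket: 'refund not received'").source)
```

The threshold is the tuning knob: raise it and more traffic escalates to the cloud tier, lower it and more stays local.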

This is the decision framework for making it rigorously.

What "Edge" Actually Means

"Edge inference" is not a single thing. It spans a spectrum of deployment targets, each with different tradeoffs:

  • On-device (phones, laptops, wearables): Model weights live on the end-user's hardware. iOS/macOS with Apple Silicon, Android via MediaPipe, laptops via llama.cpp. Available RAM: 4–16GB on phones, up to 128GB on M-series Macs.
  • On-premise servers: Dedicated GPU servers in a corporate data center. You control the hardware, data never leaves your network. Examples: a hospital running Llama 3 on an internal server, a law firm deploying Mistral behind a firewall.
  • Regional edge nodes: Inference co-located near users at CDN PoPs or telecom infrastructure. Latency advantage of 30–80ms vs. central cloud, without full on-device constraints.
  • Purpose-built edge accelerators: NVIDIA Jetson AGX Orin, Hailo-10H NPU, Qualcomm AI Box. Designed for sustained inference in industrial and automotive settings.

Cloud inference is the contrasting baseline: your app sends a network request to a provider-managed GPU cluster, waits for a response, and pays per token. No hardware cost, but every call crosses the network.

The Four Dimensions of the Decision

1. Latency Requirements

Network round-trip alone costs 20–300ms depending on geography. For most chat applications this is invisible. For voice AI, it is catastrophic.

A typical cloud voice pipeline looks like this: audio capture (40ms) → speech-to-text (350ms) → LLM (375ms) → text-to-speech (100ms) → network hops (50ms) = approximately 915ms total. Human perception of conversational lag begins around 500ms. At 915ms, the interaction feels broken.
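The same budget written out as a quick sanity check, using the illustrative stage timings above:

```python
# Cloud voice pipeline latency budget, in milliseconds (illustrative figures).
stages = {
    "audio capture": 40,
    "speech-to-text": 350,
    "LLM generation": 375,
    "text-to-speech": 100,
    "network hops": 50,
}

total = sum(stages.values())
print(f"end-to-end: {total} ms")                         # 915 ms
print(f"over the ~500 ms perception budget by {total - 500} ms")
```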

Edge inference eliminates the network component. Some providers achieve sub-200ms audio round-trips by co-locating inference inside the same data centers as their telephony infrastructure — not cloud, not on-device, but infrastructure edge.

For interactive text generation: a 7–8B quantized model running on Apple Silicon via llama.cpp generates 60–120 tokens/second and delivers a first token in under 100ms for most prompts. Cloud APIs for comparable models deliver 50–80 tokens/second with 200–400ms time-to-first-token. On paper the two look similar. In practice, the variance matters: cloud p99 latency can spike to 2–5 seconds under load, while local latency stays consistent because you are the only tenant on the hardware.
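A rough way to check those local numbers on your own hardware is to time the first streamed token and the overall token rate. A sketch using llama-cpp-python, assuming a local GGUF file at a hypothetical path models/llama-3-8b-q4.gguf:

```python
import time
from llama_cpp import Llama  # pip install llama-cpp-python

llm = Llama(model_path="models/llama-3-8b-q4.gguf", n_ctx=4096, verbose=False)

start = time.perf_counter()
first_token_at = None
n_tokens = 0

# Stream the completion so time-to-first-token is observable separately.
for chunk in llm("Summarize the tradeoffs of edge inference.",
                 max_tokens=256, stream=True):
    if first_token_at is None:
        first_token_at = time.perf_counter()
    n_tokens += 1

elapsed = time.perf_counter() - start
if first_token_at is not None:
    print(f"time to first token: {(first_token_at - start) * 1000:.0f} ms")
    print(f"throughput: {n_tokens / elapsed:.1f} tokens/s")
```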

Rule of thumb: If your latency SLO is below 300ms end-to-end, cloud is marginal. Below 150ms, edge is almost certainly required.

2. Data Privacy and Residency

HIPAA, GDPR, the EU AI Act (in force since August 2024), and sector-specific regulations in finance and legal all create hard constraints on where data can travel.

Zero-data-retention architectures — where user inputs never leave the device — are only achievable with on-device inference. You cannot architect a GDPR zero-retention guarantee if your inference call traverses a third-party API.

A nuanced development from 2025: the EU is now treating models that have memorized personal data as potentially constituting personal data themselves. If your cloud provider fine-tunes on your users' data without explicit consent mechanisms, the model weights themselves may carry compliance exposure.

On-premise inference sidesteps these problems entirely. For regulated industries, the technical architecture question and the compliance question are the same question.

3. Cost Structure

Cloud inference is priced per token. Edge inference has fixed hardware cost with near-zero marginal cost per inference. The crossover point depends on request volume and utilization.

A concrete comparison: a cloud API call running a 7B-equivalent task costs approximately $1.65 per response at current frontier pricing. The same inference on a Jetson Nano running a 3B quantized model costs approximately $0.0017 per response, roughly 970x cheaper at sustained load. At 1,000 requests per day, the arithmetic produces roughly $1.70/day in edge cost versus $1,650/day in cloud cost.

That is a dramatic edge case (literally). The real comparison for most teams involves smaller volume differences and higher quality requirements that push toward larger models. The key question: at what request volume does owning a GPU server cost less than cloud API calls? That calculation breaks in cloud's favor below roughly 200,000 tokens/day; it breaks heavily in edge's favor above 10 million tokens/day with predictable load.
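A back-of-the-envelope version of that crossover calculation is sketched below. The dollar figures are illustrative assumptions, not vendor quotes; the point is the shape of the curve, not the exact break-even number:

```python
# Break-even sketch for owning inference hardware vs. paying per request.
# All dollar figures are illustrative assumptions, not vendor quotes.
CLOUD_COST_PER_REQUEST = 0.01      # assumed mid-size model API pricing
EDGE_COST_PER_REQUEST = 0.0005     # assumed power + wear per request
EDGE_FIXED_COST_PER_DAY = 50.00    # assumed amortized hardware + ops

def daily_cost(requests_per_day: int) -> tuple[float, float]:
    """Return (cloud, edge) daily cost at a given sustained volume."""
    cloud = requests_per_day * CLOUD_COST_PER_REQUEST
    edge = EDGE_FIXED_COST_PER_DAY + requests_per_day * EDGE_COST_PER_REQUEST
    return cloud, edge

break_even = EDGE_FIXED_COST_PER_DAY / (CLOUD_COST_PER_REQUEST - EDGE_COST_PER_REQUEST)
print(f"break-even volume: ~{break_even:,.0f} requests/day")

for volume in (1_000, 5_000, 10_000, 100_000):
    cloud, edge = daily_cost(volume)
    winner = "edge" if edge < cloud else "cloud"
    print(f"{volume:>7,} req/day: cloud ${cloud:>8,.2f}  edge ${edge:>8,.2f}  -> {winner}")
```

Plug in your own volume and the numbers from your cloud bill; the crossover moves, but the structure of the comparison (a fixed cost amortized against a widening per-request gap) does not.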

The catch: edge requires capital expenditure, dedicated ops, and utilization discipline. Cloud scales to zero; edge hardware depreciates whether you use it or not. Bursty workloads that periodically spike 10x should stay on cloud, or use cloud as the overflow tier.

4. Model Capability and Update Frequency

This is where most teams underestimate edge limitations.

The capability ceiling: The best models you can run on edge hardware today are 7B–13B parameters (INT4 quantized). They perform well on classification, intent detection, simple RAG, instruction following, and structured extraction. They fail on complex multi-step reasoning, frontier-quality code synthesis, long-context synthesis over 32K tokens, and tasks requiring world knowledge past their training cutoff.

On MMLU, a well-tuned 7B model scores around 69%. GPT-4 class models score 86–90%. For many enterprise tasks, 69% is sufficient. For high-stakes decisions, it is not.

The capability freeze: This is the most underappreciated operational risk in on-device AI. When a model is compiled and shipped inside an app (ExecuTorch AOT, Core ML .mlpackage, GGUF bundled in an APK), it freezes at that capability level until the next app update cycle. Meanwhile, open-source models advance rapidly: Llama 3 → 3.1 → 3.2 → 4 all shipped within 18 months. A model shipped in Q1 may be two capability generations behind by Q4, with no path to update users who haven't accepted the app update.

Model-level security issues (jailbreaks, adversarial prompts) can be mitigated in the cloud within hours by updating filters, system prompts, or the model itself. On-device, patches follow app store review cycles: days to weeks. Safety updates cannot be pushed to a bundled edge model the way they can be to a cloud endpoint.

If your use case requires staying current with model capabilities, or if your threat model includes adversarial prompt attacks, cloud is architecturally safer than edge for that component.

The Quantization Tradeoff

Edge deployment requires model compression. The core tradeoff is size reduction versus quality loss: INT8 roughly halves memory relative to FP16 with minimal quality impact, while INT4 cuts it to a quarter at a measurable but usually acceptable cost for 7B-class models.

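The size side of that tradeoff follows directly from parameter count times bits per weight. A quick sketch for a 7B-parameter model (KV cache and runtime overhead excluded; the quality side is workload-dependent and not captured here):

```python
# Approximate weight memory for a 7B-parameter model at common precisions.
# Real GGUF quant formats add per-block scales, so files run slightly larger.
PARAMS = 7e9

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    gib = PARAMS * bits / 8 / 2**30
    print(f"{name}: ~{gib:.1f} GiB")   # FP16 ~13.0, INT8 ~6.5, INT4 ~3.3
```

That arithmetic is why INT4 is the default for on-device targets: it is often the difference between a model that fits a phone's memory budget and one that does not.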