
11 posts tagged with "edge-ai"


On-Device AI Needs a Fleet Manager, Not a Model Card

· 12 min read
Tian Pan
Software Engineer

The on-device AI demo that shipped last quarter ran a single 4-bit Llama variant, ran it on a single test phone, and ran it well. Six months later, the same feature has a one-star tail of reviews complaining about heat, battery drain, or — worse — silent quality degradation that users only notice as "the AI got dumber on my old phone." The model didn't change. The fleet did. And the team that thought it was shipping a model has discovered, late, that it was actually shipping a fleet.

This is the gap that sinks most on-device AI launches: the strategy is built around picking the model, when the actual hard problem is delivering the right model to each device class, observing whether it's working, and rolling it back when it isn't. The discipline that closes that gap looks far more like CDN operations than like ML research — manifest-driven delivery, per-cohort telemetry, decoupled rollout channels, and a model-variant pipeline that produces N quantization tiers from one trained checkpoint. Most teams don't have any of that. They have a model card and a build artifact.
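As a concrete, simplified picture of what manifest-driven delivery can look like, here is a minimal sketch of a per-device-class model manifest and the client-side resolution step. The field names, device thresholds, and quantization tiers are illustrative assumptions, not a prescription from the post.

```python
# Hypothetical sketch of manifest-driven model delivery: one trained checkpoint,
# several quantization tiers, resolved per device class. Names and thresholds
# are assumptions for illustration.
from dataclasses import dataclass

@dataclass
class ModelVariant:
    checkpoint: str        # every tier derives from the same trained checkpoint
    quantization: str      # "fp16", "int8", "int4", ...
    min_ram_gb: int        # minimum device RAM for this tier
    requires_npu: bool

MANIFEST = {
    "manifest_version": "2026-02-01",
    "rollout_channel": "stable",   # decoupled from the app's release channel
    "variants": [                  # ordered best-first
        ModelVariant("assistant-3b-v7", "fp16", min_ram_gb=12, requires_npu=True),
        ModelVariant("assistant-3b-v7", "int8", min_ram_gb=8,  requires_npu=True),
        ModelVariant("assistant-3b-v7", "int4", min_ram_gb=4,  requires_npu=False),
    ],
}

def resolve_variant(device_ram_gb: int, has_npu: bool) -> ModelVariant | None:
    """Pick the highest-quality tier the device can actually run."""
    for variant in MANIFEST["variants"]:
        if device_ram_gb >= variant.min_ram_gb and (has_npu or not variant.requires_npu):
            return variant
    return None  # fall back to cloud inference, or ship no AI feature at all

print(resolve_variant(device_ram_gb=6, has_npu=False))   # -> the int4 tier
```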

The On-Device LLM Problem Nobody Talks About: Model Update Propagation

· 12 min read
Tian Pan
Software Engineer

Most engineers who build on-device LLM features spend their time solving the problems that are easy to see: quantization, latency, memory limits. The model fits on the phone, inference is fast enough, and the demo looks great. Then they ship to millions of devices and discover a harder problem that nobody warned them about: you now have millions of independent compute nodes running different versions of your AI model, and you have no reliable way to know which one any given user is running.

Cloud inference is boring in the best way. You update the model, redeploy the server, and within minutes the entire user base is running the new version. On-device inference breaks this assumption entirely. A user who last opened your app three months ago is still running the model that was current then — and there's no clean way to force an update, no server-side rollback, and no simple way to detect the mismatch without adding instrumentation you probably didn't build from the start.

This version fragmentation is the central operational challenge of on-device AI, and it has consequences that reach far beyond a slow rollout. It creates silent capability drift, complicates incident response, and turns your "AI feature" into a heterogeneous fleet of independently behaving systems that you're responsible for but can't directly control.
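One concrete piece of that missing instrumentation is having every inference report which model version actually ran, so the backend can see the fleet's version distribution instead of guessing. The sketch below is illustrative only: the event fields and aggregation are assumptions, not a specific SDK.

```python
# Illustrative sketch of per-inference version telemetry. Field names are
# assumptions; the point is that model identity rides along with every event.
from collections import Counter

def inference_event(device_id: str, app_version: str,
                    model_version: str, latency_ms: float) -> dict:
    """Attach model identity to every inference event, not just to crashes."""
    return {
        "device_id": device_id,
        "app_version": app_version,
        "model_version": model_version,   # the field cloud inference never needed
        "latency_ms": latency_ms,
    }

def version_distribution(events: list[dict]) -> Counter:
    """Server-side view: how fragmented is the fleet right now?"""
    return Counter(e["model_version"] for e in events)

events = [
    inference_event("a1", "4.2.0", "assistant-int4-v7", 180.0),
    inference_event("b2", "4.2.0", "assistant-int4-v7", 210.0),
    inference_event("c3", "3.9.1", "assistant-int4-v5", 350.0),  # three months stale
]
print(version_distribution(events))
# Counter({'assistant-int4-v7': 2, 'assistant-int4-v5': 1})
```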

On-Device LLM Inference in Production: When Edge Models Are Right and What They Actually Cost

· 10 min read
Tian Pan
Software Engineer

Most teams decide to use on-device LLM inference the same way they decide to rewrite their database: impulsively, in response to a problem that a cheaper solution could have solved. The pitch is always compelling—no network round-trips, full privacy, zero inference costs—and the initial prototype validates it. Then six months post-ship, the model silently starts returning worse outputs, a new OS update breaks quantization compatibility, and your users on budget Android phones are running a version you can't push an update to.

This guide is about making that decision with eyes open. On-device inference is genuinely the right call in specific situations, but the cost structure is different from what teams expect, and the production failure modes are almost entirely unlike cloud LLM deployment.

Serving AI at the Edge: A Decision Framework for Moving Inference Out of the Cloud

· 10 min read
Tian Pan
Software Engineer

Most AI inference decisions get made the same way: the model lives in the cloud because that's where you can run it, full stop. But that calculus is changing fast. Flagship smartphones now carry neural engines capable of running 7B-parameter models at interactive speeds. A Snapdragon 8 Elite can generate tokens from a 3B model at around 10 tokens per second — fast enough for conversational use — while a Qualcomm Hexagon NPU hits 690 tokens per second on prefill. The question is no longer "can we run this on device?" but "should we, and when?"

The answer is rarely obvious. Moving inference to the edge introduces real tradeoffs: a quality tax from quantization, a maintenance burden for fleet updates, and hardware fragmentation across device SKUs. But staying in the cloud has its own costs: round-trip latency measured in hundreds of milliseconds, cloud GPU bills that compound at scale, and data sovereignty problems that no SLA can fully solve. This post lays out a practical framework for navigating those tradeoffs.

The Compression Decision: Quantization, Distillation, and On-Device Inference for Latency-Critical AI Features

· 10 min read
Tian Pan
Software Engineer

Model routing is the first optimization most teams reach for. Route simple queries to a small cheap model, complex ones to a large capable model. It works well for managing cost and throughput. What it cannot fix is the wall you hit when the physics of cloud inference collide with a latency requirement of 100ms or less. A network round-trip from a mid-tier data center already consumes 30–80ms before a single token is generated. At that point, routing is irrelevant — you need to either run the model closer to the user or run a substantially smaller model. Both paths require compression decisions that most teams approach without a framework.

This is a guide for making those decisions. The three techniques — quantization, knowledge distillation, and on-device deployment — solve overlapping problems but have very different cost structures, quality profiles, and operational consequences.
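To make the quantization side of those cost structures concrete, the back-of-the-envelope below estimates weight memory as parameter count times bits per weight. It ignores activations, KV cache, and runtime overhead, and the model sizes are examples rather than recommendations.

```python
# Rough weight-memory estimate per quantization tier: params * bits / 8.
# Ignores activations, KV cache, and runtime overhead; purely illustrative.
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9  # decimal GB

for params in (3, 7):
    for bits in (16, 8, 4):
        print(f"{params}B @ {bits}-bit ≈ {weight_memory_gb(params, bits):.1f} GB of weights")

# 7B @ 16-bit ≈ 14.0 GB  -> out of reach for nearly every phone
# 7B @  4-bit ≈  3.5 GB  -> plausible on flagship devices, tight everywhere else
```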

On-Device LLM Inference: When to Move AI Off the Cloud

· 11 min read
Tian Pan
Software Engineer

Most teams discover that running AI inference in the cloud has sharp edges only after they've already hit them: a HIPAA audit that traces back to PHI crossing API boundaries, latency numbers in staging that look fine until a user on a spotty connection reports "it just spins," or a per-inference API bill that looked reasonable at 10,000 requests per day and catastrophic at 10 million. On-device inference is often the right answer — but the reasons teams reach for it, and the problems they hit when they do, are rarely the same ones that show up in blog post comparisons.

This is a practical guide to the decision: when local execution beats cloud APIs, which small models actually deliver, and what the deployment lifecycle looks like once the benchmark demo is over.
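The cost cliff in that story is simple scale arithmetic. The per-request price below is a placeholder assumption rather than any provider's actual pricing; the point is how linearly the bill tracks volume.

```python
# Illustrative scale arithmetic for a per-inference API bill.
# The per-request price is a placeholder assumption, not real pricing.
PRICE_PER_REQUEST_USD = 0.002   # hypothetical blended cost per call

for requests_per_day in (10_000, 10_000_000):
    monthly = requests_per_day * 30 * PRICE_PER_REQUEST_USD
    print(f"{requests_per_day:>12,} req/day -> ${monthly:>10,.0f}/month")

#       10,000 req/day -> $       600/month   (a rounding error)
#   10,000,000 req/day -> $   600,000/month   (a line item someone will ask about)
```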

Edge LLM Inference: When Latency, Privacy, or Cost Force You Off the Cloud

· 9 min read
Tian Pan
Software Engineer

A fine-tuned 7B parameter model running on a single RTX 4090 can outperform GPT-4 on domain-specific tasks while costing you nothing per token after the initial hardware investment. That is not a theoretical claim — Diabetica-7B, a diabetes-focused model, hit 87.2% accuracy on clinical queries, beating both GPT-4 and Claude 3.5 on the same benchmark. The catch? Getting there requires understanding exactly when edge inference makes sense and when it is an expensive distraction.

Most teams default to cloud APIs because they are easy — make an HTTP call, get tokens back. But that simplicity has costs that scale in ways engineers do not anticipate until it is too late, and those costs are not always measured in dollars.

Hybrid Cloud-Edge LLM Architecture: Routing Inference Where It Actually Belongs

· 9 min read
Tian Pan
Software Engineer

Most teams pick a side: run everything in the cloud, or compress a model to fit on-device. Both choices leave money and performance on the table. The teams getting the best results in 2025-2026 are doing neither — they're building hybrid architectures that route each inference request to the right tier based on complexity, latency budget, and data sensitivity.

The core insight is simple but underappreciated: 70-80% of production queries don't need a frontier model. They need a fast answer from a small model that sits close to the user. The remaining 20-30% genuinely benefit from a cloud-hosted heavyweight. The engineering challenge is building the routing layer that makes this split invisible.
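A minimal sketch of what that routing layer can look like follows. The complexity heuristic, thresholds, and tier names are assumptions made for illustration; a production router would be tuned or learned, but the decision inputs are the ones named above: complexity, latency budget, and data sensitivity.

```python
# Minimal sketch of a cloud/edge router. The heuristic and thresholds are
# illustrative assumptions; real routers are tuned against production traffic.
from dataclasses import dataclass

@dataclass
class Request:
    prompt: str
    latency_budget_ms: int
    contains_sensitive_data: bool

def estimate_complexity(prompt: str) -> float:
    """Crude proxy: long, multi-step prompts lean toward the cloud tier."""
    return min(1.0, len(prompt.split()) / 500)

def route(req: Request) -> str:
    if req.contains_sensitive_data:
        return "edge"    # the data never leaves the device
    if req.latency_budget_ms < 150:
        return "edge"    # a cloud round-trip alone would blow the budget
    if estimate_complexity(req.prompt) > 0.6:
        return "cloud"   # genuinely benefits from the frontier model
    return "edge"        # the 70-80% a small local model handles fine

print(route(Request("summarize this meeting note", 100, False)))   # -> edge
print(route(Request(" ".join(["clause"] * 400), 2000, False)))     # -> cloud
```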

Hybrid Cloud-Edge LLM Inference: The Routing Layer That Determines Your Cost, Latency, and Privacy Profile

· 10 min read
Tian Pan
Software Engineer

Most teams pick a side: run everything in the cloud, or push everything to the edge. Both are wrong for the majority of production workloads. The interesting engineering happens in the routing layer between them — the component that decides, per-request, whether a query deserves a 70B frontier model on an H100 or a 3B quantized model running on local silicon.

This routing decision isn't just about latency. It's a three-variable optimization across cost, privacy, and capability — and the optimal split changes based on your traffic patterns, regulatory environment, and what "good enough" means for each query type. Teams that get the routing right cut inference costs 60–80% while improving p95 latency. Teams that get it wrong either overspend on cloud GPUs for trivial queries or ship degraded answers from edge models that can't handle the complexity.
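As a rough illustration of why that split moves the cost number so much, here is the blended-cost arithmetic under placeholder per-query prices. The figures are assumptions, not measurements; only the shape of the savings is the point.

```python
# Blended-cost arithmetic for a hybrid split. Per-query costs are placeholder
# assumptions; the takeaway is the shape of the curve, not the exact numbers.
CLOUD_COST_PER_QUERY = 0.004    # hypothetical frontier-model cost per query
EDGE_COST_PER_QUERY = 0.0002    # hypothetical amortized on-device cost per query

def blended_cost(edge_fraction: float) -> float:
    return (edge_fraction * EDGE_COST_PER_QUERY
            + (1 - edge_fraction) * CLOUD_COST_PER_QUERY)

for frac in (0.0, 0.7, 0.8):
    print(f"{frac:.0%} routed to edge -> ${blended_cost(frac) * 1_000_000:,.0f} per 1M queries")

#  0% routed to edge -> $4,000 per 1M queries
# 70% routed to edge -> $1,340 per 1M queries  (~66% cheaper)
# 80% routed to edge -> $960 per 1M queries    (~76% cheaper)
```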

Hybrid Cloud-Edge LLM Inference: The Latency-Privacy-Cost Triangle That Determines Where Your Model Runs

· 11 min read
Tian Pan
Software Engineer

Most teams run every LLM call through a cloud API. It's the path of least resistance: no hardware to manage, no models to optimize, and the latest frontier capabilities are one HTTP request away. But as AI moves deeper into production — processing sensitive documents, powering real-time interactions, running on mobile devices — the assumption that cloud is always the right answer starts to crack.

The cracks show up in three places simultaneously. Latency: a 200ms network round-trip that's invisible in a chatbot becomes unacceptable in voice AI or real-time code completion. Privacy: data that leaves the device creates compliance surface area that legal teams increasingly won't sign off on. Cost: at high request volumes with low utilization variance, you're paying a significant premium for infrastructure you could own.

Hybrid Cloud-Edge LLM Inference: When On-Device Models Beat the Cloud

· 11 min read
Tian Pan
Software Engineer

Every token your LLM generates in the cloud costs money, adds latency, and sends user data across a network boundary. Every token generated on-device avoids all three—but caps out at what a phone or laptop GPU can handle. The interesting engineering happens at the boundary: deciding which queries deserve the cloud's frontier capabilities and which are better served by a 3B parameter model running locally in under 20 milliseconds.

The hybrid cloud-edge inference pattern isn't theoretical. Apple Intelligence routes between on-device models and Private Cloud Compute. Google's Gemini Nano runs directly on Pixel and Samsung devices while escalating complex requests to cloud Gemini. These aren't demos—they're shipping at billion-device scale. And the underlying architecture is now accessible to any team willing to think carefully about the latency-privacy-cost triangle.