5 posts tagged with "model-compression"

The Distillation That Lost a Capability Your Eval Suite Never Measured

June 2, 2026 · 9 min read

Software Engineer

A team shrinks a 200B teacher into a 7B student because the eval suite (fifty thousand examples covering everything the product launched with) shows the student trailing the teacher by less than two points, while inference cost drops by an order of magnitude. The migration ships. The cost graph drops. The customer-satisfaction graph holds. Three weeks later, support starts seeing a class of failures the team cannot reproduce in eval.

The student no longer recognizes a corner-case input format the teacher had silently handled. It no longer recovers from a particular ambiguous instruction the teacher had reliably disambiguated. It no longer produces the rare-but-load-bearing "ask a clarifying question instead of guessing" behavior — because the eval set was scrubbed of ambiguous prompts on the grounds that they were "bad data."

The eval said the distillation was faithful. The eval was wrong about what faithfulness means.
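One way to see how an aggregate score can hide this kind of loss is to break the teacher-student gap down by input slice. The sketch below is illustrative only: the slice names, the record layout, and the counts are hypothetical, not taken from any real eval suite.

```python
from collections import defaultdict


def aggregate_gap(records):
    """Overall accuracy gap (teacher minus student) across all records."""
    teacher_hits = sum(int(t) for _, t, _ in records)
    student_hits = sum(int(s) for _, _, s in records)
    return (teacher_hits - student_hits) / len(records)


def slice_gaps(records):
    """Accuracy gap (teacher minus student) computed separately per slice."""
    totals = defaultdict(lambda: [0, 0, 0])  # [count, teacher hits, student hits]
    for slice_name, teacher_ok, student_ok in records:
        entry = totals[slice_name]
        entry[0] += 1
        entry[1] += int(teacher_ok)
        entry[2] += int(student_ok)
    return {
        name: (t_hits - s_hits) / count
        for name, (count, t_hits, s_hits) in totals.items()
    }


# Hypothetical results: (slice, teacher_correct, student_correct).
# The "ambiguous" slice is tiny, so it barely moves the aggregate.
records = (
    [("standard", True, True)] * 985
    + [("standard", True, False)] * 5
    + [("ambiguous", True, False)] * 10
)

overall = aggregate_gap(records)
per_slice = slice_gaps(records)
```

With these made-up counts the overall gap stays under two points, while the student fails every example in the small "ambiguous" slice. A suite with that slice scrubbed out would report only the first number.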

Hybrid Cloud-Edge LLM Architecture: Routing Inference Where It Actually Belongs

April 10, 2026 · 9 min read

Tian Pan

Software Engineer

Most teams pick a side: run everything in the cloud, or compress a model to fit on-device. Both choices leave money and performance on the table. The teams getting the best results in 2025-2026 are doing neither — they're building hybrid architectures that route each inference request to the right tier based on complexity, latency budget, and data sensitivity.

The core insight is simple but underappreciated: 70-80% of production queries don't need a frontier model. They need a fast answer from a small model that sits close to the user. The remaining 20-30% genuinely benefit from a cloud-hosted heavyweight. The engineering challenge is building the routing layer that makes this split invisible.
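A minimal sketch of such a routing rule, assuming each request arrives with a complexity score, a latency budget, and a sensitivity flag. The `Request` fields, the thresholds, and the upstream complexity scorer are all hypothetical; a production router would likely be learned or tuned from traffic rather than hard-coded.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    EDGE = "edge"
    CLOUD = "cloud"


@dataclass(frozen=True)
class Request:
    complexity: float       # 0.0 (trivial) to 1.0 (hard); assumed to come from an upstream scorer
    latency_budget_ms: int  # how long the caller is willing to wait
    sensitive: bool         # True if the payload must not leave the device


def route(req, complexity_threshold=0.7, cloud_round_trip_ms=200):
    """Pick a tier for one request. Both thresholds are illustrative assumptions."""
    # Data sensitivity is a hard constraint: sensitive requests stay local.
    if req.sensitive:
        return Tier.EDGE
    # If the budget cannot absorb a network round trip, the cloud is not an option.
    if req.latency_budget_ms < cloud_round_trip_ms:
        return Tier.EDGE
    # Only requests judged hard enough are escalated to the large model.
    if req.complexity >= complexity_threshold:
        return Tier.CLOUD
    return Tier.EDGE
```

The ordering is a design choice: constraints that cannot be traded off (privacy, latency budget) are checked before the capability question, so a hard but sensitive request still runs locally.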

Hybrid Cloud-Edge LLM Inference: The Routing Layer That Determines Your Cost, Latency, and Privacy Profile

April 10, 2026 · 10 min read

Tian Pan

Software Engineer

Most teams pick a side: run everything in the cloud, or push everything to the edge. Both are wrong for the majority of production workloads. The interesting engineering happens in the routing layer between them — the component that decides, per-request, whether a query deserves a 70B frontier model on an H100 or a 3B quantized model running on local silicon.

This routing decision isn't just about latency. It's a three-variable optimization across cost, privacy, and capability — and the optimal split changes based on your traffic patterns, regulatory environment, and what "good enough" means for each query type. Teams that get the routing right cut inference costs 60–80% while improving p95 latency. Teams that get it wrong either overspend on cloud GPUs for trivial queries or ship degraded answers from edge models that can't handle the complexity.
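The cost side of that trade-off reduces to a weighted average, which the following sketch works through. The per-query prices and the edge fraction are placeholder assumptions chosen for illustration, not measured figures.

```python
def blended_cost_per_query(edge_fraction, edge_cost, cloud_cost):
    """Average cost per query when edge_fraction of traffic is served on the edge tier."""
    if not 0.0 <= edge_fraction <= 1.0:
        raise ValueError("edge_fraction must be between 0 and 1")
    return edge_fraction * edge_cost + (1.0 - edge_fraction) * cloud_cost


def savings_vs_cloud_only(edge_fraction, edge_cost, cloud_cost):
    """Fractional cost reduction relative to sending every query to the cloud."""
    blended = blended_cost_per_query(edge_fraction, edge_cost, cloud_cost)
    return 1.0 - blended / cloud_cost


# Placeholder numbers: 75% of queries routed to the edge,
# with hypothetical per-query costs in dollars.
savings = savings_vs_cloud_only(edge_fraction=0.75, edge_cost=0.0001, cloud_cost=0.002)
```

With these placeholder inputs the reduction comes out at roughly 71%. The result is driven mostly by the edge fraction, which is why the quality of the routing decision matters more than small differences in per-query price.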

Hybrid Cloud-Edge LLM Inference: The Latency-Privacy-Cost Triangle That Determines Where Your Model Runs

April 10, 2026 · 11 min read

Tian Pan

Software Engineer

Most teams run every LLM call through a cloud API. It's the path of least resistance: no hardware to manage, no models to optimize, and the latest frontier capabilities are one HTTP request away. But as AI moves deeper into production — processing sensitive documents, powering real-time interactions, running on mobile devices — the assumption that cloud is always the right answer starts to crack.

The cracks show up in three places simultaneously. Latency: a 200ms network round-trip that's invisible in a chatbot becomes unacceptable in voice AI or real-time code completion. Privacy: data that leaves the device creates compliance surface area that legal teams increasingly won't sign off on. Cost: at high request volumes with low utilization variance, you're paying a significant premium for infrastructure you could own.

Hybrid Cloud-Edge LLM Inference: When On-Device Models Beat the Cloud

April 10, 2026 · 11 min read

Tian Pan

Software Engineer

Every token your LLM generates in the cloud costs money, adds latency, and sends user data across a network boundary. Every token generated on-device avoids all three—but caps out at what a phone or laptop GPU can handle. The interesting engineering happens at the boundary: deciding which queries deserve the cloud's frontier capabilities and which are better served by a 3B parameter model running locally in under 20 milliseconds.

The hybrid cloud-edge inference pattern isn't theoretical. Apple Intelligence routes between on-device models and Private Cloud Compute. Google's Gemini Nano runs directly on Pixel and Samsung devices while escalating complex requests to cloud Gemini. These aren't demos—they're shipping at billion-device scale. And the underlying architecture is now accessible to any team willing to think carefully about the latency-privacy-cost triangle.
