The Compression Decision: Quantization, Distillation, and On-Device Inference for Latency-Critical AI Features
Model routing is the first optimization most teams reach for. Route simple queries to a small cheap model, complex ones to a large capable model. It works well for managing cost and throughput. What it cannot fix is the wall you hit when the physics of cloud inference collides with a latency requirement of 100ms or less. A network round-trip from a mid-tier data center already consumes 30–80ms before a single token is generated. At that point, routing is irrelevant — you need to either run the model closer to the user or run a substantially smaller model. Both paths require compression decisions that most teams approach without a framework.
This is a guide for making those decisions. The three techniques — quantization, knowledge distillation, and on-device deployment — solve overlapping problems but have very different cost structures, quality profiles, and operational consequences.
The 100ms Ceiling and Why It Changes Everything
The sub-100ms threshold is not arbitrary. User perception research consistently places 100ms as the boundary where interactions feel immediate rather than responsive. Beyond it, users perceive latency as a noticeable pause. For autocomplete, voice interfaces, real-time document analysis, and manufacturing inspection systems, crossing that threshold noticeably degrades the product.
Cloud inference for a 70B parameter model sitting on an overloaded GPU cluster easily takes 200–400ms for a medium-length generation. A 7B model on a dedicated GPU can hit 50–80ms. An INT4-quantized 7B model can hit 30–50ms. A distilled 1–3B model running on-device can hit 8–15ms per token with no network overhead.
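A quick budget calculation makes this concrete. The sketch below uses illustrative round numbers drawn from the ranges above, not measured figures, and focuses on time-to-first-token, which is what the user actually perceives:

```python
def first_token_latency_ms(rtt_ms: float, per_token_ms: float) -> float:
    # Time until the user sees anything: one network round trip
    # plus a single decode step (prefill ignored for simplicity).
    return rtt_ms + per_token_ms

# Illustrative numbers in the ranges quoted above.
cloud_7b  = first_token_latency_ms(rtt_ms=55, per_token_ms=10)  # 65 ms: most of a 100ms budget gone
on_device = first_token_latency_ms(rtt_ms=0, per_token_ms=12)   # 12 ms: no network hop at all
```

The network round trip dominates the cloud number before the model does any work, which is why compression alone cannot rescue a sub-50ms target served from a distant data center.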
The compression ladder runs from "quantize your existing model" at one end to "build and deploy a purpose-built smaller model entirely on the edge device" at the other. Each rung adds engineering complexity but unlocks new latency targets.
Quantization: The Fastest Path to Smaller
Quantization shrinks a model by reducing the numerical precision of its weights — from 32-bit floats (FP32) to 16-bit (FP16), 8-bit integers (INT8), or 4-bit integers (INT4). The model architecture stays identical and no retraining is required. Rolling back is also trivial: the original checkpoint is untouched, so you can redeploy full precision at any time.
The practical tradeoffs break down like this:
- FP16/BF16 is now the effective baseline for production inference. It halves memory footprint with no measurable quality loss on almost all tasks. If you're still serving in FP32, start here.
- INT8 halves memory again relative to FP16, with typically less than 1% accuracy degradation. For most chat, summarization, and retrieval-augmented generation tasks, it is indistinguishable from FP16. Tools like LLM.int8() and SmoothQuant handle the calibration. This is the safe default for memory-constrained deployments.
- INT4 is where the tradeoffs become task-dependent. Advanced methods like AWQ and GPTQ preserve accuracy far better than naive INT4 by identifying the small fraction of weights most sensitive to quantization and protecting them during conversion. Even so, field benchmarks show 10–15% drops in real-world task success rates on agent and tool-use workloads. Code generation and multi-step reasoning degrade noticeably. INT4 works well for factual recall, summarization, and short-form generation; it is risky for reasoning-heavy tasks.
One often-missed decision: calibration dataset matters. Quantization accuracy depends on the statistics of the calibration set used to compute scale factors. If your calibration data does not represent your production query distribution, you will see accuracy drops that your standard benchmark suite will not catch. Always calibrate on a sample of real production traffic.
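A toy sketch of symmetric per-tensor INT8 quantization shows why the calibration sample matters. The values and the clipping behavior are illustrative only, not tied to any particular library:

```python
def int8_scale(calibration_values):
    # Symmetric per-tensor scale: map the largest observed magnitude to 127.
    return max(abs(v) for v in calibration_values) / 127.0

def quantize(values, scale):
    # Round to the nearest level and clamp to the INT8 range.
    return [max(-128, min(127, round(v / scale))) for v in values]

def dequantize(quantized, scale):
    return [q * scale for q in quantized]

# Scale computed from a calibration sample that resembles production inputs.
scale = int8_scale([-0.8, 0.1, 0.5, 0.9])
production = [0.85, -0.4, 2.5]   # 2.5 never appeared during calibration
restored = dequantize(quantize(production, scale), scale)
# 0.85 and -0.4 round-trip almost exactly; 2.5 silently clips to ~0.9 --
# exactly the failure mode a mismatched calibration set produces.
```

Every value the calibration set never saw gets clamped to the largest magnitude it did see, and no aggregate benchmark score will point you at the specific queries where that happens.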
For hardware, if your deployment target is NVIDIA Hopper or Blackwell architecture, FP8 has emerged as the pragmatic sweet spot — it's natively supported in hardware, delivers throughput close to INT8 with quality close to FP16, and requires less careful calibration than INT4.
Distillation: Task-Specific Speed at Training Cost
Knowledge distillation creates a new smaller model (the student) trained to mimic the behavior of a larger model (the teacher). Unlike quantization, this is not a post-hoc transformation — it requires a full training run. The payoff is a model that is architecturally smaller, not just numerically compressed, meaning it executes fewer operations per token and can hit fundamentally lower latency targets.
DeepSeek R1 is a widely-studied example: distilled from 671B parameters down to variants ranging from 1.5B to 70B, enabling deployment on hardware ranging from a single CPU to a multi-GPU workstation. A distilled 8B Llama model with task-specific fine-tuning can retain 90–95% of a 70B model's accuracy on targeted tasks while running 3–4x faster.
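The core training signal in classic soft-label distillation can be sketched in a few lines: a temperature-scaled softmax over teacher and student logits, then cross-entropy between the two distributions. The logits below are toy values, not from any real model:

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature > 1 softens the distribution, exposing the teacher's
    # relative preferences among non-top tokens ("dark knowledge").
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    # Cross-entropy between the teacher's soft targets and the student's
    # temperature-scaled distribution; minimized when the two match.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

teacher = [4.0, 1.0, 0.5]                                # toy next-token logits
aligned  = distillation_loss(teacher, teacher)           # student matches teacher
diverged = distillation_loss(teacher, [0.5, 1.0, 4.0])   # student disagrees
```

Production recipes typically blend this soft-target term with a standard hard-label loss, but the soft targets are what transfer the teacher's behavior rather than just its answers.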
The decision criteria for distillation:
- Your task domain is narrow and stable. Distillation specializes a model. The student excels in the teacher's distribution but degrades on out-of-scope queries. If your feature handles one class of user request, this is a strength. If users ask diverse questions, it is a liability.
- You have training infrastructure. A distillation run costs GPU time, requires curated training data, and needs an evaluation pipeline capable of measuring output quality — not just perplexity. This is a non-trivial investment compared to running a quantization script.
- Target latency requires a genuinely smaller architecture. If you need 20ms end-to-end on a mobile device, quantizing a 7B model will not get you there. A 1B distilled model might.
Research consistently shows that knowledge-intensive tasks (factual recall, entity extraction) survive distillation well, while complex reasoning, instruction-following chains, and multilingual tasks degrade substantially. This is the same fragility profile as INT4 quantization, but for a different reason: fewer parameters mean less capacity for the generalization required by reasoning tasks.
Compression Ordering: When You Apply Both
Quantization and distillation are not mutually exclusive. When you need maximum compression, you apply both. A recent systematic study of compression ordering found that the sequence Pruning → Distillation → Quantization produces the best balance of size reduction and preserved capability. The intuition is sound: prune to remove redundant structure, distill to rebuild specialized capability in the smaller model, then quantize to extract the final efficiency gain from numerical precision reduction.
What the research makes clear is that order matters significantly. Applying quantization before distillation forces the student to learn from an already-degraded teacher signal. The quality loss compounds.
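A toy sketch of the first and last stages makes the ordering concrete. The middle stage, distillation, is a full training run and appears here only as a placeholder comment; the weight values are illustrative:

```python
def magnitude_prune(weights, sparsity=0.5):
    # Stage 1: zero out the smallest-magnitude fraction of weights,
    # removing redundant structure before any retraining.
    k = int(len(weights) * sparsity)
    if k == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[k - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]

def quantize_int8(weights):
    # Stage 3: symmetric INT8 quantization of whatever survives.
    scale = max(abs(w) for w in weights) / 127.0
    return [round(w / scale) for w in weights], scale

pruned = magnitude_prune([0.05, -0.9, 0.1, 0.7, -0.02, 0.3], sparsity=0.5)
# Stage 2 -- distillation -- would sit here: a full training run that
# rebuilds specialized capability in the pruned model before quantizing.
quantized, scale = quantize_int8(pruned)
```

Running quantization last means the precision loss lands on weights that have already been retrained to carry the task, rather than forcing the student to learn from a degraded teacher.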
On-Device Deployment: When Latency Meets Sovereignty
Moving inference to the edge — mobile devices, embedded systems, on-premise servers — addresses a different set of constraints than cloud optimization. The drivers usually fall into one of three categories:
- Hard latency requirements: Manufacturing defect detection at 30fps, real-time voice transcription, or autocomplete that must respond before the user stops typing. Cloud round-trips introduce 50–200ms of unavoidable overhead regardless of how fast the GPU cluster is.
- Data sovereignty: Healthcare, finance, and defense workloads frequently cannot send raw inputs to third-party cloud infrastructure. Running on-premise or on-device eliminates the compliance surface.
- Cost at scale: At 100M+ daily inference requests, per-token cloud pricing becomes the dominant line item, while a $10,000 edge device amortized over three years costs under $0.10 per million tokens at that volume.
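The amortization arithmetic behind that cost claim is easy to sanity-check. The throughput figure below is a hypothetical assumption for illustration, not a benchmark:

```python
def edge_cost_per_million_tokens(device_cost_usd, lifetime_days, tokens_per_day):
    # Amortize the hardware purchase over its lifetime token throughput.
    total_million_tokens = lifetime_days * tokens_per_day / 1_000_000
    return device_cost_usd / total_million_tokens

# Hypothetical: a $10,000 edge server pushing 100M tokens/day for three years.
cost = edge_cost_per_million_tokens(10_000, 3 * 365, 100_000_000)
# roughly $0.09 per million tokens -- consistent with "under $0.10" above
```

The sensitivity is in the utilization assumption: a device that sits idle half the day doubles its effective per-token cost, which is why this math only favors the edge at sustained high volume.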
Hardware options have matured substantially. NVIDIA Jetson Orin handles 8B parameter models with INT8 quantization at 8–12ms per token, supporting 50–200 concurrent requests depending on sequence length. Google Edge TPU v3 targets INT8 quantized smaller models with sub-10ms latency for targeted inference blocks. Apple Silicon (M-series chips) runs llama.cpp with 7B models at competitive speeds and is increasingly used for professional workstation deployments.
The inference runtime layer matters here. llama.cpp remains the most widely deployed option for its portability and GGUF format support. TensorRT-LLM delivers best-in-class throughput on NVIDIA hardware. vLLM with PagedAttention handles high-concurrency edge server deployments. The runtime decision locks you into an optimization and format ecosystem; switching later is painful.
The Hybrid Pattern: Not All Requests Are Equal
Most production AI features do not need sub-100ms latency for every query. A pragmatic architecture routes requests by complexity and latency requirement:
- Fast path (on-device or local): Autocomplete, intent classification, form extraction, short-form generation. Use a distilled + quantized 1–3B model. Target latency under 50ms.
- Standard path (cloud, optimized): Multi-turn dialogue, document summarization, code review comments. Use a quantized 7–14B model on a dedicated inference cluster. Target latency 100–300ms.
- Deep path (cloud, full quality): Complex analysis, long-form generation, reasoning-heavy tasks where quality cannot be compromised. Use a full-precision or FP16 model. Accept latency 500ms+.
The routing decision itself should be lightweight — a rule-based classifier or a tiny classification model, not another LLM call. The overhead of routing must not consume the latency budget of the fast path.
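A minimal sketch of such a rule-based router, with hypothetical intent labels and thresholds chosen to mirror the three paths above:

```python
FAST_INTENTS = {"autocomplete", "classify", "extract"}   # hypothetical labels

def route(query: str, intent: str, latency_budget_ms: int) -> str:
    # Rule-based routing: cheap string and metadata checks only, never an
    # LLM call, so the router costs microseconds rather than milliseconds.
    if intent in FAST_INTENTS and latency_budget_ms <= 50:
        return "fast"        # on-device distilled + quantized 1-3B model
    if len(query) < 2000 and latency_budget_ms <= 300:
        return "standard"    # cloud quantized 7-14B model
    return "deep"            # full-precision or FP16 cloud model
```

Upstream callers declare intent and latency budget explicitly, so routing behavior stays auditable and changing a threshold is a one-line diff rather than a model retrain.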
Progressive rollout to the edge is non-negotiable. Fleet deployments should use canary releases that validate accuracy on a production traffic sample before full rollout, since quantization and distillation quality losses often only surface on the long tail of production queries that your evaluation set does not cover.
Building Your Evaluation Set Before You Compress
The most common mistake teams make is running standard benchmarks against compressed models and reporting the numbers as meaningful. MMLU and HumanEval measure broad capability. Your product feature has a specific task distribution. The model that scores 2% lower on MMLU may be 15% worse on your actual user queries if your task happens to stress the exact capability that compression degrades.
Build a task-specific evaluation set before you compress anything:
- Sample 500–1000 real production queries, stratified by query type.
- Generate golden responses with your uncompressed baseline model (or by hand for critical cases).
- Define automated metrics appropriate to your task — ROUGE for summarization, exact match for extraction, pass rate for code, LLM-as-judge for open-ended quality.
- Run every compression candidate against this set and compare regression against your baseline.
This set becomes your compression regression suite. It runs in CI before every deployment and tells you which model version is safe to ship.
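A minimal sketch of that regression gate, using exact match (the simplest of the metrics above) and a hypothetical 2% regression budget:

```python
def exact_match_rate(predictions, golden):
    return sum(p == g for p, g in zip(predictions, golden)) / len(golden)

def safe_to_ship(candidate_preds, baseline_preds, golden, max_regression=0.02):
    # Gate the deploy: block any compressed candidate that regresses more
    # than the budget relative to the uncompressed baseline on the same set.
    regression = (exact_match_rate(baseline_preds, golden)
                  - exact_match_rate(candidate_preds, golden))
    return regression <= max_regression

# Golden set mixes baseline outputs and hand-written answers, so even the
# baseline model does not score a perfect 1.0 against it.
golden   = ["a", "b", "c", "d"]
baseline = ["a", "b", "c", "x"]   # baseline replays at 0.75
cand_ok  = ["a", "b", "c", "y"]   # 0.75: within budget, ships
cand_bad = ["a", "z", "y", "x"]   # 0.25: blocked
```

In practice the gate would aggregate several task-appropriate metrics rather than one, but the structure is the same: compare against the baseline on identical inputs, and fail the build on regression, not on an absolute score.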
Deciding Without Regret
The compression decision is not a one-time call. It is a recurring engineering tradeoff as models improve, hardware evolves, and your task distribution shifts. The framework that holds up over time is simple:
- Start with quantization. Low engineering cost, reversible, handles the majority of latency and cost problems. Use INT8 by default; move to AWQ INT4 only if INT8 still fails your latency target and your task tolerates reasoning degradation.
- Add distillation when quantization is not enough and your task is narrow. Budget the training cost, build the evaluation suite first, and treat the distilled model as a separate versioned artifact with its own deployment lifecycle.
- Move to the edge when latency or compliance demands it. Start with a cloud model running on-premise before committing to embedded hardware — the operational complexity grows significantly at each step.
- Always benchmark on your task distribution, not public leaderboards. The model that wins general benchmarks is not necessarily the one that survives compression on your specific workload.
The teams that get this right treat compression as part of the model's production lifecycle, not a deployment-day optimization. The benchmark suite, the distillation training pipeline, and the edge deployment infrastructure are all built before they are needed, because adding them under deadline pressure is where shortcuts get taken and quality regressions ship to users.
