Blog

Page 87

12 articles

The Inference Optimization Trap: Why Making One Model Faster Can Slow Down Your System
Swapping a model component for a faster version often increases end-to-end latency and cost. Here's why—and the profiling discipline that prevents it.
insider, ai-engineering
Apr 18 · 9 min
What Your Inference Provider Is Hiding From You: KV Cache, Batching, and the Latency Floor
The decisions made inside LLM inference infrastructure—KV cache eviction, continuous batching, chunked prefill—set your application's performance envelope before you write a line of code. Here's what's actually happening and the few knobs you control.
llm, inference
Apr 18 · 11 min
Invisible Model Drift: How Silent Provider Updates Break Production AI
LLM providers update models without changelogs. Your prompt regressions are real, they're silent, and they're your problem to detect. Here's how.
insider, llm
Apr 18 · 10 min
Knowledge Distillation for Production: Teaching Small Models to Do Big Model Tasks
How to use frontier model outputs as supervision signal to build task-specific small models—covering the dataset curation pipeline, quality collapse detection, and the benchmarking methodology that tells you when the distilled model is ready for production.
ai-engineering, llms
Apr 18 · 9 min
Knowledge Distillation Without Fine-Tuning: Extracting Frontier Model Capabilities Into Cheaper Inference Paths
A practical decision framework for AI engineers on when distilling frontier model capabilities into smaller student models actually pays off—and when it silently fails on out-of-distribution inputs.
ai-engineering, llm
Apr 18 · 10 min
The Latent Capability Ceiling: When a Bigger Model Won't Fix Your Problem
Frontier models plateau on domain-specific tasks well before teams expect it. Here's how to diagnose whether you've hit a true capability ceiling or a prompt, eval, or data problem — and which technique actually breaks through.
llm, fine-tuning
Apr 18 · 10 min
The Idempotency Crisis: LLM Agents as Event Stream Consumers
At-least-once delivery assumes reprocessing an event produces the same result. LLMs don't. A practical guide to idempotency keys, deduplication windows, and compensating read-models for AI-powered Kafka consumers.
insider, ai-engineering
Apr 18 · 11 min
LLM-Powered Data Pipelines: The ETL Tier Nobody Benchmarks
Most LLM benchmarks measure chatbot quality. But the bulk of enterprise LLM spend is going into batch pipelines — and almost nobody is measuring whether those pipelines actually work.
insider, data-engineering
Apr 18 · 10 min
LLM Vendor Lock-In Is a Spectrum, Not a Binary
Not all LLM dependencies are created equal. Some are acceptable engineering tradeoffs; others are technical debt from day one. Here's how to tell them apart across six distinct lock-in layers.
insider, llm
Apr 18 · 10 min
Long-Session Context Degradation: How Multi-Turn Conversations Go Stale
Sessions beyond 50 turns accumulate contradictions, user intent drift, and sycophancy loops. Here's the engineering playbook for detecting degradation and keeping long conversations useful.
llm, context-engineering
Apr 18 · 8 min
The Long-Tail Coverage Problem: Why Your AI System Fails Where It Matters Most
Aggregate metrics like accuracy and F1 can look great while your AI system silently fails on the minority inputs that matter most. How to detect, measure, and fix long-tail coverage gaps before users find them.
evaluation, testing
Apr 18 · 10 min
LoRA Adapter Composition in Production: Running Multiple Fine-Tuned Skills Without Model Wars
Teams build separate LoRA adapters for tone, format, domain knowledge, and safety — then hit conflicts when composing them. Here's how to detect interference, choose the right merge strategy, and serve mixed adapters per-request without reloading weights.
insider, fine-tuning
Apr 18 · 9 min

About Tian Pan

I'm Tian Pan, an engineer-founder focused on agentic engineering — building autonomous AI systems and scaling engineering teams. I write practical guides on system design, technical leadership, and shipping with AI agents. Previously an early engineer at Uber, Brex, and IoTeX.
