Traditional code review instincts don't map to prompt edits. Here's the checklist, the tooling, and the reviewer-author dialog that turn a prompt PR into a behavioral contract.
Most production LLM systems track accuracy but ignore variance. Measuring the distribution of outputs over identical inputs — your prompt entropy budget — is the missing metric that determines UX consistency at scale.
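A minimal sketch of what "measuring the distribution" can mean in practice: sample the same prompt repeatedly and compute the Shannon entropy of the normalized outputs. `call_model` is a hypothetical stand-in for whatever client you use, not a specific API.

```python
import math
from collections import Counter

def output_entropy(call_model, prompt: str, n_samples: int = 50) -> float:
    """Estimate Shannon entropy (in bits) of a model's outputs for one fixed prompt.

    call_model is a hypothetical stand-in: a callable that returns the model's
    text completion for the prompt.
    """
    # Light normalization so trivial whitespace/case differences don't inflate entropy.
    outputs = [call_model(prompt).strip().lower() for _ in range(n_samples)]
    counts = Counter(outputs)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Usage sketch: compare the estimate against a per-feature budget, e.g. flag a
# regression if a prompt that should be deterministic exceeds ~1 bit of entropy.
```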
Reasoning models like o1, o3, and Claude with extended thinking process prompts fundamentally differently from instruction-following models. The patterns that work for GPT-4 actively hurt performance on thinking models — here's a framework for adapting them.
A practical playbook for engineers and product teams facing a public AI hallucination incident — covering triage, root cause classification, user-facing communications, and the post-incident eval work that actually prevents recurrence.
Five carefully crafted documents in a corpus of millions can manipulate a RAG system's responses 90% of the time — and your input validation layer never sees them coming. Here's why the threat model for RAG is fundamentally different, and the defenses that actually work.
Most RAG tuning effort goes into chunking strategies and embedding models. The highest-leverage intervention is earlier in the pipeline: transforming user queries before they hit the vector index.
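As a rough illustration of that earlier intervention, here is a sketch of rewriting the user's query with an LLM before it reaches retrieval. `llm_complete` and `vector_index` are assumed interfaces, not a particular library's API.

```python
REWRITE_PROMPT = """Rewrite the user's question as a standalone search query.
Expand acronyms, resolve pronouns from the chat history, and drop filler words.
History: {history}
Question: {question}
Search query:"""

def retrieve(question: str, history: str, llm_complete, vector_index, k: int = 5):
    """Transform the raw user query before it ever hits the vector index.

    llm_complete and vector_index are hypothetical stand-ins for your
    completion client and retrieval layer.
    """
    rewritten = llm_complete(REWRITE_PROMPT.format(history=history, question=question))
    return vector_index.search(rewritten.strip(), top_k=k)
```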
Vector search always returns top-k regardless of match quality, turning absent information into confident fiction. Fixing it takes more than raising a threshold — abstention has to be a first-class output.
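One way to make abstention a first-class output rather than a buried threshold check is to return an explicit result type the rest of the pipeline must handle. The score cutoff and the `vector_index` interface below are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class RetrievalResult:
    passages: list
    abstained: bool
    reason: str | None = None

MIN_SCORE = 0.75  # illustrative cutoff; tune against your own eval set

def retrieve_or_abstain(query: str, vector_index, k: int = 5) -> RetrievalResult:
    """Return an explicit abstention instead of low-quality top-k filler."""
    hits = vector_index.search(query, top_k=k)  # assumed shape: [(passage, score), ...]
    good = [(p, s) for p, s in hits if s >= MIN_SCORE]
    if not good:
        return RetrievalResult([], abstained=True, reason="no passage above threshold")
    return RetrievalResult([p for p, _ in good], abstained=False)
```

Downstream, the generator either answers from `passages` or tells the user it doesn't know — it never gets handed weak matches as if they were evidence.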
Coding agents converge toward a single correct answer. Research agents must explore open-ended hypothesis spaces where success is undefined upfront. Here's what that difference demands architecturally.
A 20% per-step retry rate on a chained LLM agent rarely costs just 20% more — with context replay it climbs toward ~2x. Here's how to bound retries with a budget, catch cost explosions in CI, and stop paying twice for failure.
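The compounding comes from replay: each retry re-sends the whole accumulated context, so a late-step failure costs far more than the failed step alone. A minimal sketch of a per-run budget that makes that spend explicit (names and the CI usage note are assumptions, not a specific framework):

```python
class RetryBudgetExceeded(RuntimeError):
    pass

class RunBudget:
    """Track tokens spent across one agent run, including replayed context on retries."""

    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.spent = 0

    def charge(self, prompt_tokens: int, completion_tokens: int) -> None:
        # Every call, including a retry that replays prior context, draws on the same budget.
        self.spent += prompt_tokens + completion_tokens
        if self.spent > self.max_tokens:
            raise RetryBudgetExceeded(
                f"run spent {self.spent} tokens, budget is {self.max_tokens}"
            )

# CI sketch: replay a fixed scenario with injected step failures and assert the run
# either stays under budget or raises RetryBudgetExceeded instead of silently paying 2x.
```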
Serial safety checks compound into hundreds of milliseconds of overhead before a response reaches users. Here's how to design guardrails that maintain safety posture without destroying the user experience.
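A rough sketch of the core idea: run independent checks concurrently so total overhead is the slowest check, not the sum of all of them. The check names and interface here are illustrative assumptions.

```python
import asyncio

async def moderate(text: str, checks) -> list[str]:
    """Run independent safety checks concurrently; latency ~= max(check), not sum(check).

    checks is a list of async callables that return a violation string or None.
    """
    results = await asyncio.gather(*(check(text) for check in checks))
    return [r for r in results if r is not None]

# Usage sketch (hypothetical check functions):
# violations = asyncio.run(moderate(response_text, [pii_check, jailbreak_check, policy_check]))
# if violations:
#     block_or_redact(response_text, violations)
```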
A practical decision framework for choosing between supervised fine-tuning, RLHF, and DPO when aligning LLMs for narrow domain applications — including how to diagnose whether your alignment gap is a data problem, a reward problem, or a missing capability.
Prompts run production AI features but have no code review, deploy pipeline, or owner. A practical governance stack to build — registry, change review, model compatibility, audit trails — before regulators force one on you.
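To make the registry piece concrete, here is one possible shape for a registry entry; every field name below is an illustrative assumption rather than a prescribed schema.

```python
from dataclasses import dataclass, field

@dataclass
class PromptVersion:
    """One reviewed, deployable prompt revision; field names are illustrative."""
    prompt_id: str            # stable identifier, e.g. "support.summarize_ticket"
    version: str              # bumped on every reviewed change
    owner: str                # accountable team or individual
    template: str             # the prompt text itself
    compatible_models: list[str] = field(default_factory=list)  # models it was evaluated against
    approved_by: str = ""     # reviewer recorded for the audit trail
    changelog: str = ""       # why this revision exists
```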