Blog

Page 65

12 articles

The Golden Dataset Decay Problem: When Your Eval Set Becomes a Liability
Curated eval sets silently drift from production reality over months. Learn how to detect when your evals are measuring the wrong thing, the rotation strategies that keep benchmarks honest, and the monitoring triggers that tell you it's time to rebuild.
insider · ai-engineering
Apr 19 · 9 min
Goodhart's Law Is Now an AI Agent Problem
AI agents are mathematically exhaustive optimizers — when a proxy metric becomes the training target, capable models reliably find and exploit it. Here's how to audit your reward signals before they become attack surfaces.
insider · ai-agents
Apr 19 · 11 min
Graceful Tool-Call Failure: The Error Contract Your Agent UI Is Missing
Most agent UIs handle the happy path and nothing else. Here's the error contract and UX patterns that turn tool-call failure from a crash into a recoverable moment.
insider · ai-engineering
Apr 19 · 11 min
Defining Escalation Criteria That Actually Work in Human-AI Teams
Most AI teams treat escalation as an afterthought. Here's how to define structured escalation specs, pick the right confidence thresholds, and build feedback loops that improve over time.
ai-engineering · human-in-the-loop
Apr 19 · 10 min
Idempotency Is Not Optional in LLM Pipelines
Traditional idempotency breaks when outputs are stochastic. Here's the architectural rethink that prevents duplicate actions, cost explosions, and corrupted state machines in production LLM systems.
llm · production
Apr 19 · 10 min
The Prompt Made Sense Last Year: Institutional Knowledge Decay in AI Systems
When the engineers who built your AI system leave, the system doesn't break immediately — it rots slowly. Here's how to prevent the decay with prompt rationale files, eval provenance logs, and guardrail justification comments.
insider · ai-engineering
Apr 19 · 10 min
When Vector Search Fails: Why Knowledge Graphs Handle Queries Embeddings Can't
Vector search fails silently on multi-hop queries, entity disambiguation, and cross-document reasoning. Here's when knowledge graphs and hybrid retrieval are the right architecture.
insider · rag
Apr 19 · 9 min
The Last-Mile Reliability Problem: Why 95% Accuracy Often Means 0% Usable
95% per-step accuracy sounds great until you realize that a 20-step AI workflow then succeeds only about 36% of the time (0.95^20 ≈ 0.36, assuming the steps fail independently). Here's the failure taxonomy and the architectural fixes that actually close the last mile.
insider · llm
Apr 19 · 9 min
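
The 36% figure in the teaser above can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming every step has the same 95% accuracy and that steps succeed or fail independently; the teaser states neither assumption, and the function name `workflow_success_rate` is hypothetical.

```python
def workflow_success_rate(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step of a workflow succeeds.

    Assumes identical, independent steps (an illustrative assumption,
    not something stated in the article teaser).
    """
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    # Independent successes multiply, so the end-to-end rate is p ** n.
    return per_step_accuracy ** steps


rate = workflow_success_rate(0.95, 20)
print(f"{rate:.1%}")  # prints 35.8%
```

Under the same assumptions, raising per-step accuracy to 99% lifts the 20-step success rate to roughly 82%, which is why small per-step gains matter so much in long chains.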
The Latency Perception Gap: Why a 3-Second Stream Feels Faster Than a 1-Second Batch
A 3-second streaming response often feels faster than a 1-second batch response. Here's the psychology behind it and the engineering patterns that exploit it.
insider · ai-engineering
Apr 19 · 11 min
Why Your LLM Alerting Is Always Two Weeks Late
LLM quality degrades silently while your infrastructure metrics stay green. Learn the specific signals — semantic drift score, output schema conformance, user-repair rate — and anomaly detection patterns that catch model degradation 11 days before users start filing tickets.
observability · monitoring
Apr 19 · 10 min
Your Model Is Most Wrong When It Sounds Most Sure: LLM Calibration in Production
LLMs trained with RLHF are systematically miscalibrated — highest verbal confidence often marks incorrect outputs. How to measure calibration error on your task and fix the routing logic that depends on it.
llm · production-ai
Apr 19 · 9 min
LLM Cost Forecasting Before You Ship: The Estimation Problem Most Teams Skip
Token counts in production depend on user behavior you can't predict at design time. Here's how to build a cost model that bounds variance before launch—through simulation, canary traffic, and framework-level budget enforcement.
llm · cost-optimization
Apr 19 · 9 min

About Tian Pan

I'm Tian Pan, an engineer-founder focused on agentic engineering — building autonomous AI systems and scaling engineering teams. I write practical guides on system design, technical leadership, and shipping with AI agents. Previously an early engineer at Uber, Brex, and IoTeX.
