Blog

Page 72

12 articles

The Eval-Set Poison Pill: When Your Benchmark Becomes a Backdoor
Most teams trust the eval because nobody owns auditing it. The labeling pipeline is a human supply chain — and the gold set inherits whatever distortion the humans introduce.
ai-engineering · evaluation
Apr 27 · 10 min
Eval Sets Have Seasons: Why Quality Drops on the First Monday of Tax Season
Production traffic is not stationary. An eval set sampled in March, run in October, scores an October-shaped customer who never appeared in the gold rows. Here is how to keep the gate honest.
evals · llm
Apr 27 · 12 min
Your Gold Eval Set Has Drifted and Its Pass Rate Is the Reason You Can't See It
Gold eval pass rates stay green while production drifts away from them. Run a shadow eval built from current traffic in parallel — the disagreement metric is the drift detector your dashboard is missing.
insider · ai-engineering
Apr 27 · 12 min
The Human Attention Budget Is the Constraint Your HITL System Silently Overspends
HITL systems treat reviewer time as infinite, but vigilance decrement and automation bias quietly turn the safety net into a rubber stamp. Design for the real human limits.
hitl · ai-agents
Apr 27 · 10 min
The Idle Agent Tax: What Your AI Session Costs While the User Is in a Meeting
Long-lived AI agent sessions keep accruing cost even when the user is in a meeting. Here is what those idle minutes actually pay for, and how to design hibernation tiers that hold latency without burning the bill.
llm · ai-engineering
Apr 27 · 11 min
Your Inference Chargeback Is Quietly Taxing Eval Discipline
Pricing inference tokens but not eval coverage rewards model upgrades and punishes evaluation, so eval coverage shrinks while the bill grows — exactly opposite to FinOps intent.
finops · llm-eval
Apr 27 · 12 min
Inference Cost Forecasting: The Capacity Plan Your Finance Team Wants and You Can't Write
Classical capacity planning assumes a measurable workload and a stable unit cost. AI workloads break both — and the SaaS-style forecast you hand finance is the reason they keep calling for a re-baseline. Here's the four-discipline FinOps shape it should take instead.
finops · ai-engineering
Apr 27 · 12 min
The LLM-Judge Ceiling: Why Your Auto-Eval Stops Correlating With Users at the Score That Matters
LLM-as-judge agreement with humans is highest in the muddy middle and collapses at the decision boundary. The discipline that keeps the unlock honest: per-slice kappa, drift dashboards, cross-family ensembles for high-stakes slices, and an explicit ceiling past which humans grade.
llm-eval · ai-engineering
Apr 27 · 10 min
The LLM SDK Upgrade Tax: Why a Patch Bump Is a Model Rollout in Disguise
A patch bump on your model SDK can quietly rewrite prompt behavior, break JSON parsing, and ship regressions past your eval gate. Here is the discipline that catches it.
insider · llm
Apr 27 · 10 min
Your APM Is Quietly Dropping LLM Telemetry, and the Bug Lives in the Gap
Traditional APMs were built for bounded dimensions and stateless services. LLM workloads have a cardinality profile closer to product analytics, and the mismatch silently strips the only signal that would surface a broken prompt.
insider · ai-engineering
Apr 27 · 11 min
The Model-Preference Fork: Why Your Prompt Library Has Three Versions and No One Is Tracking the Drift
A shared prompt library quietly accretes model-specific forks that nobody tracks, breaking the contract between your eval suite and your routing layer at every model upgrade.
llm · prompt-engineering
Apr 27 · 11 min
Model Rollback Velocity: The Seven-Hour Gap Between 'This Upgrade Is Wrong' and 'Old Model Fully Restored'
Rolling back an LLM upgrade isn't a button press — it's a partial, hysteretic operation closer to a database migration. Here's the control plane your incident playbook needs before the next bad model rolls out.
llmops · mlops
Apr 27 · 12 min

About Tian Pan

I'm Tian Pan, an engineer-founder focused on agentic engineering — building autonomous AI systems and scaling engineering teams. I write practical guides on system design, technical leadership, and shipping with AI agents. Previously an early engineer at Uber, Brex, and IoTeX.
