Blog

Page 10

12 articles

The Dev Environment Your Agent Treated as Production Because the System Prompt Never Said Which
Agents inherit your wiring but not your sense of place. When the prompt is the same in staging and prod, the model fills in 'where it is' from training data — and 'production database' is the default. Here is how to ground an agent in its environment.
ai-engineering, agents
Jun 1 · 11 min
The Distillation That Lost a Capability Your Eval Suite Never Measured
Distillation optimizes a divergence over a finite sample, then ships against a finite eval. Behaviors the eval never measured are free entropy the student is licensed to drop — and the ones it drops first are usually the rare-but-load-bearing ones.
distillation, evals
Jun 1 · 9 min
The Embedding Model Rotation That Shadowed Your A/B Test for a Quarter
Why vendor-side embedding upgrades silently break your A/B tests on retrieval features, and the experimentation discipline that closes the gap.
insider, embeddings
Jun 1 · 10 min
The Escalation Path That Routes Back to the Agent
An escalate_to_human tool stops being human-in-the-loop the moment the downstream queue grows its own automation. Why the contract has to outlive the consumer.
insider, ai-agents
Jun 1 · 10 min
The Eval Harness Whose Judge Model Was Upgraded Silently
An LLM judge whose endpoint silently updates is a measurement instrument with no calibration contract. Pin snapshots, build anchor sets, and run dual-judge windows so a six-point lift means your system improved — not the ruler.
insider, llm-eval
Jun 1 · 11 min
The Eval Rubric Pulled By Two Drift Vectors
An eval rubric read by humans and an LLM judge drifts on two axes at once. Composite scores hide the motion. Here is the measurement protocol that keeps each drift attributable.
insider, evals
Jun 1 · 9 min
The Eval Set That Sampled Production Traffic at 3am EST
An offline eval built from a nightly 3am cron quietly becomes a survey of overnight batch retries and APAC traffic — and the leaderboard cannot tell you whose model it is.
insider, llm-evals
Jun 1 · 10 min
The Eval That Converges, Then Quietly Collapses
A plateaued eval score does not always mean a model ceiling. When labelers homogenize, agreement metrics climb and the eval stops measuring what the team thinks it does.
evals, llm-judge
Jun 1 · 11 min
The Feature Flag Your Model Already Learned to Predict From the Inputs It Could See
An LLM prompt experiment leaks assignment whenever the routing hash and the prompt assembler share an input — a walk through how the lift gets manufactured, the symptoms your dashboard does not surface, and the disciplines that close the gap.
insider, experimentation
Jun 1 · 10 min
The Fine-Tune Cold Start Your Provider Bills as Idle Time
Hosted fine-tunes share an API surface with base models but not a cost-of-latency curve. Here's why the cold start tax hides in your p99 and never shows up on the bill.
llm-infra, fine-tuning
Jun 1 · 11 min
The Fine-Tune Dataset You Accidentally Built While Debugging
When the thumbs-down button in your staging UI silently doubles as a training pipeline, you fine-tune on six months of personal taste, customer text, and engineer venting. Separate the debug surface from the curation surface — or ship a model trained on whatever your team was feeling that week.
insider, fine-tuning
Jun 1 · 9 min
The Fine-Tune That Erased the Alignment You Inherited
Supervised fine-tuning quietly strips the refusal training your base model came with. Why task-only evals miss it, and the four practices that catch the regression before customers do.
fine-tuning, alignment
Jun 1 · 9 min

About Tian Pan

I'm Tian Pan, an engineer-founder focused on agentic engineering — building autonomous AI systems and scaling engineering teams. I write practical guides on system design, technical leadership, and shipping with AI agents. Previously an early engineer at Uber, Brex, and IoTeX.
