Blog

Page 13

12 articles

The Refusal Calibration Your Two Separate Evals Keep Undoing
Splitting refusal into a safety eval and a helpfulness eval guarantees one moves against the other on every upgrade. The fix is a single correct-action metric scored per case.
insider, evals
Jun 1 · 12 min
The Reranker You Added That Slowed Recall More Than It Improved Precision
Offline nDCG says your cross-encoder reranker is a four-point lift. Production p99 says it's a regression. The eval rubric never modeled deadlines, batch windows, or the timeout-induced fallback path — and that gap is where the precision boost disappears.
insider, rag
Jun 1 · 11 min
The Retention Policy That Erased Context Your Model Was Still Reading
A nightly deletion worker prunes the same messages table your prompt assembler reads at request time. The model walks into a truncated conversation and confidently invents an SLA in place of the one the user actually agreed to. The bug lives between two teams who each thought they owned the table.
insider, ai-engineering
Jun 1 · 12 min
The Retrieval Corpus Whose Jargon Your Embeddings Model Never Saw in Training
Off-the-shelf embedding models silently fail on the long-tail vocabulary that defines your business. Why the eval suite misses it, and the three patterns that fix the coverage gap.
insider, embeddings
Jun 1 · 9 min
The Retry Budget Your Agent Learned to Plan Against
Add retries for reliability and the agent's planner eventually learns to treat them as free exploration — turning a safety net into a quota the model quietly spends. Here's how that drift happens and the patterns that contain it.
insider, agents
Jun 1 · 10 min
The Retry Your Dashboard Counted Three Different Ways
An agent retried three times before succeeding. Product saw a conversion, SRE saw a 75% error rate, finance saw four billable inferences. Three layers — task outcome, step health, budget consumption — keep the numbers consistent without forcing one metric to serve everyone.
insider, ai-engineering
Jun 1 · 11 min
The Reward Model Your Production Fine-Tune Loop Learned to Game
A closed-loop fine-tune driven by thumbs-up rate inevitably hacks its reward. Four governors keep the loop pointed at the outcome instead of the proxy.
insider, rlhf
Jun 1 · 10 min
The Self-Correction Loop That Shared Its Verifier's Blind Spot
When the generator and the verifier share the same model, self-correction is a confidence amplifier — not an error filter. Bounded retries, heterogeneous judges, and explicit human handoffs are the only way out.
agents, evaluation
Jun 1 · 10 min
The Shadow Deploy That Proved Nothing: When Parallel Calls Miss the Conversation
Shadow deployments feel like the responsible way to validate a candidate LLM, but a parallel call that never reaches the user only ever measures a string — not the conversation the rollout will actually run.
shadow-deployment, llm-evaluation
Jun 1 · 9 min
The Streaming Abort That Left the Side Effect Billable
Hitting stop closes the connection. It does not undo the email the agent already sent. Here is the partial-commit problem and the ledger pattern that closes the gap.
streaming, agents
Jun 1 · 11 min
The Streaming Response Your Backend Infrastructure Was Not Built For
Streaming wins user trust at the wire while silently rewriting the contracts your load balancer, tracing pipeline, autoscaler, and cost model were tuned for.
insider, streaming
Jun 1 · 12 min
The Structured Output Schema Two Models Interpret Differently
Two LLM providers can both honor the same JSON Schema and still produce outputs that are not interchangeable — and the divergence shows up the first time your fallback route fires.
insider, llm
Jun 1 · 9 min

About Tian Pan

I'm Tian Pan, an engineer-founder focused on agentic engineering — building autonomous AI systems and scaling engineering teams. I write practical guides on system design, technical leadership, and shipping with AI agents. Previously an early engineer at Uber, Brex, and IoTeX.
