Blog

Page 74

12 articles

The Reranker Is the Silent Second Model Your RAG Eval Never Measures
Most RAG pipelines run two models in series — retriever and reranker — but eval suites only grade the generator's output. When the reranker drifts, the dashboard shows answer quality dropping with no causal arrow. Here's how to build a reranker eval that catches the silent regressions.
rag, evaluation
Apr 27 · 10 min
Retiring an AI Feature Is a Trust Event, Not a Deprecation
Sunsetting an AI assistant breaks differently than deprecating an API — the playbook needs cohort cuts, a maintenance-cost ledger, and comms calibrated to relationships, not contracts.
ai-engineering, product-management
Apr 27 · 13 min
Retries Aren't Free: The FinOps Math of LLM Retry Policies
Classical retry policies assume bounded cost and independent retries. LLM workloads break both — and the bill compounds on the worst inputs. A field guide to rebuilding retry budgets for token economics.
llm, finops
Apr 27 · 11 min
Retrieval Sprawl: When 'Just Add RAG' Becomes the Architectural Diversion
Adding a retrieval step to fix every model failure looks like progress until your system is a pile of retrievers gluing together a prompt that still has the original problem. A diagnostic framework, ablation discipline, and complexity budget for RAG.
rag, ai-engineering
Apr 27 · 11 min
Your Review Queue Is Where the Autonomy Promise Goes to Die
Human-in-the-loop AI fails quietly: the review queue grows, latency creeps, and the safety story breaks one item at a time. A field guide to SLOs, capacity tripwires, and tiered review for AI features.
ai-engineering, human-in-the-loop
Apr 27 · 10 min
The Same Prompt at 3 PM and 3 AM Is Not the Same Prompt: Diurnal Drift in LLM Evaluation
An LLM call's behavior depends on the wall clock — batch size, cache state, and routing tier shift with provider load. Evals that run at 2 AM calibrate on conditions production never sees. Five practices that close the gap between off-peak eval and peak-hour reality.
insider, ai-engineering
Apr 27 · 12 min
The 70% Reliability Uncanny Valley: Where AI Features Go to Lose User Trust
An AI feature that succeeds 70% of the time can be worse than one that fails 70% of the time — concentrated, unpredictable failures collapse user trust faster than consistent unreliability. Why aggregate accuracy lies, why users cannot self-calibrate, and how to design for the uncanny zone.
ai-product, trust-calibration
Apr 27 · 12 min
The Structured-Output Retry Loop Is Your Hidden Compute Waste
A 98.4% structured-output success rate hides a 2% retry loop that quietly eats 12–18% of your inference budget. A practical guide to retry-token budgets, per-field failure dashboards, and fall-through paths that keep the bill honest.
insider, llm
Apr 27 · 11 min
Token-Per-Watt: The AI Sustainability Metric Your Dashboard Cannot Compute
Total GWh on a slide is not an AI sustainability metric. Task-watts joined to product telemetry is — and the dashboard your CFO is about to ask for cannot compute it yet.
insider, ai-engineering
Apr 27 · 11 min
Tokenizer Drift: Your Local Counter Lies, the Bill Tells the Truth
Local tokenizers and provider billing counters disagree by 5–15% on the long-tail content your CI never tests. The gap eats your safety margin where your users live.
llm, tokenization
Apr 27 · 9 min
Tool Reentrancy Is the Bug Class Your Function-Calling Layer Doesn't Know Exists
Function-calling layers default to fire-and-forget, with no call stack and no cycle detector — and the cost shows up as per-request token counts that drift upward as the tool catalog grows.
insider, agents
Apr 27 · 11 min
Your Tool-Result Cache Is a Stale-Data Contract You Never Wrote
Cached tool results that look clean in the trace are quietly producing confidently-wrong agent answers. Treat the cache as a per-tool freshness contract — TTLs by volatility, freshness metadata in the result, bypass tiers, and a stale-cache eval slice.
ai-engineering, agents
Apr 27 · 11 min

About Tian Pan

I'm Tian Pan, an engineer-founder focused on agentic engineering — building autonomous AI systems and scaling engineering teams. I write practical guides on system design, technical leadership, and shipping with AI agents. Previously an early engineer at Uber, Brex, and IoTeX.
