Blog

Page 43

12 articles

The Refusal Latency Tax: Why Layered Guardrails Eat Your p95 Budget
Layered safety pipelines silently triple p95 latency and cost on the long tail. Treat guardrails as a budgeted resource with tiered classifiers, parallel checks, and an honest latency contract.
llm-safety, guardrails
Apr 27, 10 min
The Reranker Is the Silent Second Model Your RAG Eval Never Measures
Most RAG pipelines run two models in series — retriever and reranker — but eval suites only grade the generator's output. When the reranker drifts, the dashboard shows answer quality dropping with no causal arrow. Here's how to build a reranker eval that catches the silent regressions.
rag, evaluation
Apr 27, 10 min
Retiring an AI Feature Is a Trust Event, Not a Deprecation
Sunsetting an AI assistant breaks differently than deprecating an API — the playbook needs cohort cuts, a maintenance-cost ledger, and comms calibrated to relationships, not contracts.
ai-engineering, product-management
Apr 27, 13 min
Retries Aren't Free: The FinOps Math of LLM Retry Policies
Classical retry policies assume bounded cost and independent retries. LLM workloads break both — and the bill compounds on the worst inputs. A field guide to rebuilding retry budgets for token economics.
llm, finops
Apr 27, 11 min
Retrieval Sprawl: When 'Just Add RAG' Becomes the Architectural Diversion
Adding a retrieval step to fix every model failure looks like progress until your system is a pile of retrievers gluing together a prompt that still has the original problem. A diagnostic framework, ablation discipline, and complexity budget for RAG.
rag, ai-engineering
Apr 27, 11 min
Your Review Queue Is Where the Autonomy Promise Goes to Die
Human-in-the-loop AI fails quietly: the review queue grows, latency creeps, and the safety story breaks one item at a time. A field guide to SLOs, capacity tripwires, and tiered review for AI features.
ai-engineering, human-in-the-loop
Apr 27, 10 min
The Same Prompt at 3 PM and 3 AM Is Not the Same Prompt: Diurnal Drift in LLM Evaluation
An LLM call's behavior depends on the wall clock — batch size, cache state, and routing tier shift with provider load. Evals that run at 2 AM calibrate on conditions production never sees. Five practices that close the gap between off-peak eval and peak-hour reality.
insider, ai-engineering
Apr 27, 12 min
The 70% Reliability Uncanny Valley: Where AI Features Go to Lose User Trust
An AI feature that succeeds 70% of the time can be worse than one that fails 70% of the time — concentrated, unpredictable failures collapse user trust faster than consistent unreliability. Why aggregate accuracy lies, why users cannot self-calibrate, and how to design for the uncanny zone.
ai-product, trust-calibration
Apr 27, 12 min
The Structured-Output Retry Loop Is Your Hidden Compute Waste
A 98.4% structured-output success rate hides a 2% retry loop that quietly eats 12–18% of your inference budget. A practical guide to retry-token budgets, per-field failure dashboards, and fall-through paths that keep the bill honest.
insider, llm
Apr 27, 11 min
Token-Per-Watt: The AI Sustainability Metric Your Dashboard Cannot Compute
Total GWh on a slide is not an AI sustainability metric. Task-watts joined to product telemetry is — and the dashboard your CFO is about to ask for cannot compute it yet.
insider, ai-engineering
Apr 27, 11 min
Tokenizer Drift: Your Local Counter Lies, the Bill Tells the Truth
Local tokenizers and provider billing counters disagree by 5–15% on the long-tail content your CI never tests. The gap eats your safety margin where your users live.
llm, tokenization
Apr 27, 9 min
Tool Reentrancy Is the Bug Class Your Function-Calling Layer Doesn't Know Exists
Function-calling layers default to fire-and-forget, with no call stack and no cycle detector — and the cost shows up as per-request token counts that drift upward as the tool catalog grows.
insider, agents
Apr 27, 11 min

About Tian Pan

I'm Tian Pan, an engineer-founder focused on agentic engineering — building autonomous AI systems and scaling engineering teams. I write practical guides on system design, technical leadership, and shipping with AI agents. Previously an early engineer at Uber, Brex, and IoTeX.
