Blog

Page 126

12 articles

The Prompt Entropy Budget: Measuring Output Variance as a First-Class Production Metric
Most production LLM systems track accuracy but ignore variance. Measuring the distribution of outputs over identical inputs — your prompt entropy budget — is the missing metric that determines UX consistency at scale.
llm, production
Apr 15 · 11 min
Prompting Reasoning Models Differently: Why Your Existing Patterns Break on o1, o3, and Claude Extended Thinking
Reasoning models like o1, o3, and Claude with extended thinking process prompts fundamentally differently than instruction-following models. The patterns that work for GPT-4 actively hurt performance on thinking models — here's the framework for adapting.
prompt-engineering, reasoning-models
Apr 15 · 10 min
The Public Hallucination Playbook: What to Do When Your AI Says Something Stupid in Public
A practical playbook for engineers and product teams facing a public AI hallucination incident — covering triage, root cause classification, user-facing communications, and the post-incident eval work that actually prevents recurrence.
ai, llm
Apr 15 · 10 min
RAG-Specific Prompt Injection: How Adversarial Documents Hijack Your Retrieval Pipeline
Five carefully crafted documents in a corpus of millions can manipulate a RAG system's responses 90% of the time — and your input validation layer never sees them coming. Here's why the threat model for RAG is fundamentally different, and the defenses that actually work.
security, rag
Apr 15 · 9 min
The Query Rewrite Layer Your RAG System Is Missing
Most RAG tuning effort goes into chunking strategies and embedding models. The highest-leverage intervention is earlier in the pipeline: transforming user queries before they hit the vector index.
rag, retrieval
Apr 15 · 10 min
The Retrieval Emptiness Problem: Why Your RAG Refuses to Say 'I Don't Know'
Vector search always returns top-k regardless of match quality, turning absent information into confident fiction. Fixing it takes more than raising a threshold — abstention has to be a first-class output.
insider, rag
Apr 15 · 10 min
Research Agent Design: Why Scientific Workflows Break Coding Agent Assumptions
Coding agents converge toward a single correct answer. Research agents must explore open-ended hypothesis spaces where success is undefined upfront. Here's what that difference demands architecturally.
ai-agents, research
Apr 15 · 10 min
Retry Budgets for LLM Agents: Why 20% Per-Step Failure Doubles Your Token Bill
A 20% per-step retry rate on a chained LLM agent rarely costs 20% more — with context replay it climbs to ~2x. Here is how to bound retries with a budget, catch explosions in CI, and stop paying twice for failure.
insider, llm-agents
Apr 15 · 8 min
Designing AI Safety Layers That Don't Kill Your Latency
Serial safety checks compound into hundreds of milliseconds of overhead before a response reaches users. Here's how to design guardrails that maintain safety posture without destroying the user experience.
insider, guardrails
Apr 15 · 9 min
SFT, RLHF, and DPO: The Alignment Method Decision Matrix for Narrow Domain Applications
A practical decision framework for choosing between supervised fine-tuning, RLHF, and DPO when aligning LLMs for narrow domain applications — including how to diagnose whether your alignment gap is a data problem, a reward problem, or a missing capability.
insider, fine-tuning
Apr 15 · 11 min
The Shadow Prompt Library: Governance for an Asset Class Nobody Owns
Prompts run production AI features but have no code review, deploy pipeline, or owner. A practical governance stack — registry, change review, model compatibility, audit trails — before regulators force one on you.
prompt-engineering, ai-governance
Apr 15 · 12 min
Shipping AI in Regulated Industries: When Compliance Is an Engineering Constraint
The default AI stack fails in healthcare and fintech. Here's the technical architecture that lets you ship LLM features when auditability, explainability, and data residency are non-negotiable constraints.
compliance, healthcare
Apr 15 · 11 min

About Tian Pan

I'm Tian Pan, an engineer-founder focused on agentic engineering — building autonomous AI systems and scaling engineering teams. I write practical guides on system design, technical leadership, and shipping with AI agents. Previously an early engineer at Uber, Brex, and IoTeX.
