Blog

Page 56

12 articles

Async Agents Need an Inbox, Not a Chat
Chat UX collapses when agents run past thirty seconds. The inbox primitive — durable run IDs, completion notifications, result-over-progress framing — is the product shape long-running agents actually need.
insider, agent-ux
Apr 22 · 11 min
The benchmark leak: how your eval set quietly joins the training corpus
Public LLM benchmarks quietly become training data and inflate scores by 5–15 points. A practical contamination audit (n-gram, canary, held-out) and the organizational reasons your eval team won't run it.
llm-evaluation, benchmarks
Apr 22 · 11 min
The Cancellation Tax: Your Inference Bill After the User Hits Stop
Hitting stop halts your UI, not the GPU. Most providers finish generating and bill you for tokens no user ever read. Here's how to measure and shrink the gap.
llm-cost, finops
Apr 22 · 9 min
The Cascade Router Reliability Trap: When Cost Optimization Quietly Wrecks Your p95
Cascade routers cut LLM spend dramatically — and quietly degrade tail latency, poison your training data, and invalidate your A/B tests. Here's what to instrument before the cost win turns into a reliability bill.
insider, llm-routing
Apr 22 · 10 min
Your Chain-of-Thought Is a Story, Not an Audit Log
Reasoning traces read like audit evidence but only describe intent — not what executed. Why compliance needs a runtime-emitted sidecar action log.
insider, ai-agents
Apr 22 · 11 min
The Deadlock Your Agent Can't See: Circular Tool Dependencies in Generated Plans
LLM-authored agent plans routinely contain implicit cycles that classic deadlock detection cannot see. A static plan-graph pass plus a runtime watchdog catches them before tokens evaporate.
insider, ai-agents
Apr 22 · 11 min
Your Clock-in-Prompt Is a Correctness Boundary, Not a Log Field
LLM agents have no clock — they trust whatever timestamp you injected. Treat the time-in-prompt as a correctness contract, not a log field, or keep shipping the Tuesday-vs-Wednesday bug.
insider, ai-agents
Apr 22 · 10 min
Cold-Start Evaluation: How to Ship an AI Feature With Zero Production Traces
No production traces means no free eval signal — but waiting for real users is not the fix either. A four-layer cold-start eval stack: structured dogfooding, scenario simulation with personas, an expert-labeled seed set, and a public adversarial probe library, with explicit weights so the loudest internal user doesn't set the rubric.
ai-evaluation, llm
Apr 22 · 10 min
Conversation Branching as a First-Class Primitive: Why Linear Threads Force Users to Kill and Restart
Linear chat threads force users to kill-and-restart to explore alternatives. The copy-on-branch state model, DAG storage, and UI patterns that make divergence native instead of bolted-on.
insider, agent-ux
Apr 22 · 10 min
Conversation History Is a Liability Your Prompt Never Admits
Chat history is not free context. Every turn adds noise, poisons attention, and bends the per-turn accuracy curve downward — here is how to detect, compact, and curate it.
llm, context-management
Apr 22 · 10 min
Cost Per Feature, Not Cost Per Token: The Allocation Gap in AI Budgets
Token spend per endpoint hides which AI features make money. A tagging discipline that joins inference traces to product telemetry turns pricing, gating, and deprecation calls into decisions with numbers instead of vibes.
finops, llm
Apr 22 · 10 min
The Demo Loop Bias: How Your Dev Process Quietly Optimizes for Impressive Failures
Demos select for fluent, confident output — not correct output. Here is how the LLM dev loop quietly drifts toward charismatic failure, and the eval workflow that fixes it.
insider, llm-evals
Apr 22 · 10 min

About Tian Pan

I'm Tian Pan, an engineer-founder focused on agentic engineering — building autonomous AI systems and scaling engineering teams. I write practical guides on system design, technical leadership, and shipping with AI agents. Previously an early engineer at Uber, Brex, and IoTeX.
