Chat UX collapses when agents run past thirty seconds. The inbox primitive — durable run IDs, completion notifications, result-over-progress framing — is the product shape long-running agents actually need.
Public LLM benchmarks quietly become training data and inflate scores by 5–15 points. A practical contamination audit (n-gram overlap, canary strings, held-out sets) and the organizational reasons your eval team won't run it.
Hitting stop halts your UI, not the GPU. Most providers finish generating and bill you for tokens no user ever read. Here's how to measure and shrink the gap.
Cascade routers cut LLM spend dramatically — and quietly degrade tail latency, poison your training data, and invalidate your A/B tests. Here's what to instrument before the cost win turns into a reliability bill.
Reasoning traces read like audit evidence but only describe intent — not what executed. Why compliance needs a runtime-emitted sidecar action log.
LLM-authored agent plans routinely contain implicit cycles that classic deadlock detection cannot see. A static plan-graph pass plus a runtime watchdog catches them before tokens evaporate.
LLM agents have no clock — they trust whatever timestamp you injected. Treat the time-in-prompt as a correctness contract, not a log field, or keep shipping the Tuesday-vs-Wednesday bug.
No production traces means no free eval signal — but waiting for real users is not the fix either. A four-layer cold-start eval stack: structured dogfooding, scenario simulation with personas, an expert-labeled seed set, and a public adversarial probe library, with explicit weights so the loudest internal user doesn't set the rubric.
Linear chat threads force users to kill and restart a conversation to explore alternatives. The copy-on-branch state model, DAG storage, and UI patterns that make divergence native instead of bolted on.
Chat history is not free context. Every turn adds noise, poisons attention, and bends the per-turn accuracy curve downward. Here is how to detect the decay, then compact and curate the history.
Token spend per endpoint hides which AI features make money. A tagging discipline that joins inference traces to product telemetry turns pricing, gating, and deprecation calls into decisions with numbers instead of vibes.
Demos select for fluent, confident output — not correct output. Here is how the LLM dev loop quietly drifts toward charismatic failure, and the eval workflow that fixes it.