Pinning a model ID does not freeze behavior. Refusal thresholds and content classifiers move server-side without a release note, and the regression is asymmetric across the safety boundary.
Pure semantic retrieval ignores time, while recency-weighted retrieval rewards activity over correctness. A practical look at per-query time-sensitivity classifiers, per-document volatility scores, and the two-axis scoring that makes RAG correct on both stable and time-varying questions.
Most agent bugs live in the joints between model, tools, and harness — single-layer logs cannot see them. Build a unified trace, an OpenTelemetry GenAI span surface, a cause-hypothesis panel, and a reproducibility envelope to debug agents like the distributed systems they are.
Refusal rate is a two-sided distribution, but most safety dashboards plot only one side. Here is what to instrument, how to sample, and who should own the calibration.
When source documents disappear, their embeddings linger in the vector index and keep returning confidently wrong answers. A field guide to tombstones, cascade invalidation, and retrieval-time freshness checks.
One session_id column, three meanings: billing, eval, and memory each define a conversation differently, and a single default ships three seemingly unrelated bugs with the same root cause.
Most AI features ship with a visible reasoning trace because the model emits one and hiding it feels wasteful. Showing the trace is a product decision the team never made, and a measurable source of trust loss.
Switching to a smaller model to cut cost per token can quietly raise your LLM bill. The right unit is cost per successful task, and most dashboards never measure it.
Hand-curated LLM eval sets decay the moment user behavior shifts. Pin production traces instead, then assert semantic equivalence on outputs, structural equality on tool calls, and latency bands rather than point estimates.
Stop sequences chosen for clean engineering examples become silent ambient hazards once user content joins the prompt. How the bug manifests, why eval suites miss it, and the reserved-namespace fix that prevents recurrence.
Token-streaming and structured output are architecturally at odds. The naive try/catch JSON.parse loop is O(n²), the is_complete boolean is a lie, and partial enums are how a Delete tool fires on DeleteIfEmpty.
Long-running agents trigger summarization on overflow or hierarchically. At scale, the compaction passes quietly become the dominant inference cost, and the dashboard never tells you.