Fluent, on-topic LLM answers that solve the wrong problem are the hardest bug class in production. A practical playbook for detecting surface-feature overfitting and designing prompts that expose it.
Plan-and-execute agents emit plans that look like contracts but behave like forecasts. Treat plan adherence as an SLI with measurement, enforcement, and bounded re-planning budgets, not as a quality nice-to-have you grade once a quarter.
Scoping the tools list at execution time is too late. If the planner sees the full catalog, its refusals, clarifying questions, and reasoning trace leak capability existence to users who aren't authorized to know.
Why a few chunks dominate every RAG query: how high-dimensional hubness and ANN graph structure silently collapse retrieval diversity, plus the diagnostics and mitigations that keep the long tail alive.
Prompts live in four teams at once: authors, evaluators, deployers, and support. When no single role owns the whole loop, Conway's law guarantees silent quality leaks. A look at the RACI gaps and shared-library traps that cause them, and the steward role that actually keeps behavior coherent.
Foundation models arrive pre-loaded with strong opinions about your domain. Probe the prior, refute the default, and stop shipping prompts that compete with what the model already believes.
Treat your RAG chunker like preprocessing and every boundary tweak becomes a silent schema migration. Version it, stage it, and own the retrieval eval alongside it.
Between 50 and 90 percent of LLM citations do not fully support the claims they are attached to. Here is why post-hoc attribution makes RAG systems quietly untrustworthy, how to measure citation faithfulness with NLI, and the architectural fixes that actually help.
One user's agent fan-out can starve every other user of the same quota. Why flat token buckets collapse under agent workloads, and the four-layer hierarchy that keeps the platform honest.
Reasoning models win benchmarks but bleed latency and quality at tool-choice steps. A per-step hybrid routing pattern, attribution, and anti-patterns.
Single-model reflection loops mostly return the first plan with cosmetic edits while compounding the token bill. Here is how to measure the placebo and what actually produces divergent plans.
Refusal in language models is two distinct capabilities that training pipelines conflate, leaving models that block benign requests while confidently fabricating answers to questions they cannot reliably answer.