The Silent Personalization Layer Your Customers Could Not Reproduce
A platform team ships a quality improvement. An inference-time layer reads the user's recent interactions and silently nudges the response style: more formal here, more terse there, more technical when the history suggests an engineer is asking. The A/B test shows an aggregate satisfaction lift of a couple of points. The launch post goes out under the heading "smarter responses, no API changes required." Nobody flips a flag in the API. Nobody updates the docs. Nothing in the response payload indicates which persona the model just adopted.
Six weeks later an enterprise customer files a support ticket that says, "your model is worse than you advertised." Their internal eval suite — running the same prompts your team published benchmarks against — scores eight points lower. Your team's first move is to verify prompt parity. Prompts match exactly. Decoding parameters match. The model version string matches. The divergence traces to the personalization layer, which infers a "thin-history default persona" for the customer's freshly-provisioned test account and a richer one for the long-lived user accounts your benchmarks were measured against. The conversation about whether the personalization is a feature or a bug stops being a product decision and becomes a contract negotiation.
