2 posts tagged with "llm-api"

The Silent Personalization Layer Your Customers Could Not Reproduce

June 3, 2026 · 11 min read

Software Engineer

A platform team ships a quality improvement. An inference-time layer reads the user's recent interactions and silently nudges the response style: more formal here, more terse there, more technical when the history suggests an engineer is asking. The A/B test shows an aggregate satisfaction lift of a couple of points. The launch post goes out under the heading "smarter responses, no API changes required." Nobody flips a flag in the API. Nobody updates the docs. Nothing in the response payload indicates which persona the model just adopted.

Six weeks later an enterprise customer files a support ticket that says, "your model is worse than you advertised." Their internal eval suite, running the same prompts your team published benchmarks against, scores eight points lower. Your team's first move is to verify prompt parity. Prompts match exactly. Decoding parameters match. The model version string matches. The divergence traces to the personalization layer, which infers a "thin-history default persona" for the customer's freshly provisioned test account and a richer one for the long-lived user accounts your benchmarks were measured against. The conversation about whether the personalization is a feature or a bug stops being a product decision and becomes a contract negotiation.
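To make the failure mode concrete, here is a minimal sketch of how an identical prompt can come back under different personas depending only on account history. Everything in it is hypothetical: the threshold, the keyword heuristic, the persona names other than the thin-history default, and the `persona` field on the response, which is shown only as one way such a layer could surface what it did. It is not a description of any real provider's implementation.

```python
from dataclasses import dataclass

# Hypothetical cutoff: accounts with fewer prior messages get the default persona.
THIN_HISTORY_THRESHOLD = 5


@dataclass
class Response:
    text: str
    persona: str  # hypothetical field exposing which style the layer applied


def infer_persona(history: list[str]) -> str:
    """Pick a response style from recent interactions (toy heuristic)."""
    if len(history) < THIN_HISTORY_THRESHOLD:
        return "thin-history-default"
    # Count messages that look technical; the keywords are illustrative only.
    technical = sum(
        1 for msg in history if "api" in msg.lower() or "stack trace" in msg.lower()
    )
    return "technical" if technical * 2 >= len(history) else "general"


def respond(prompt: str, history: list[str]) -> Response:
    """Stand-in for the model call: the persona changes the output."""
    persona = infer_persona(history)
    return Response(text=f"[{persona}] {prompt}", persona=persona)


fresh_account = respond("Explain rate limits.", history=[])
long_lived = respond("Explain rate limits.", history=["How do I call the API?"] * 6)
```

The prompt, and in a real system the decoding parameters and model version, are the same in both calls, yet `fresh_account.text` and `long_lived.text` differ. Without something like the `persona` field, a customer comparing the two has nothing in the payload to explain the gap.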

Provider Rate Limits Are a Capacity Plan You Never Wrote

May 19, 2026 · 9 min read

Tian Pan

Software Engineer

The first time your application hits a 429 from a model provider, something important happens, and almost nobody notices it. Not the error itself — the line of code that runs next. Maybe your HTTP client retries with exponential backoff. Maybe it falls back to a smaller model. Maybe it queues the request, or drops it, or surfaces a spinner that never resolves. Whatever it does, that behavior is now your capacity policy. It decides which users get served and which get degraded when demand exceeds supply.

And you almost certainly didn't write it. It was authored by whoever wrote the SDK wrapper, the retry decorator, or the three-line try/except someone copied from a tutorial. The most consequential decision in your system under load — what to do when you can't do everything — is being made by code nobody reviewed as a capacity decision.

This post is an argument for treating that code as what it actually is: a load-shedding policy and a capacity plan. Not an error handler. The 429 is not the problem. The problem is that you have outsourced the design of your system's behavior under contention to a library default.
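As a rough illustration of what writing the policy down might look like, the sketch below turns the 429 handler into an explicit, reviewable decision function. The tiers, the retry budget, and the ordering of the rules are assumptions made for the example, not recommendations from any provider or SDK.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    RETRY = "retry"        # try the same model again after a backoff
    FALLBACK = "fallback"  # serve from a smaller model
    SHED = "shed"          # drop or defer the request


@dataclass(frozen=True)
class Request:
    user_tier: str     # hypothetical tiers: "paid" or "free"
    interactive: bool  # True if a user is waiting on the answer
    attempt: int       # retries already made for this request


MAX_RETRIES = 2  # assumed retry budget per request


def on_rate_limited(req: Request) -> Action:
    """Decide who gets served when the provider returns a 429."""
    if req.user_tier == "paid" and req.attempt < MAX_RETRIES:
        return Action.RETRY
    if req.interactive:
        return Action.FALLBACK
    return Action.SHED


def backoff_seconds(attempt: int, base: float = 0.5, cap: float = 8.0) -> float:
    """Capped exponential backoff delay for the given attempt number."""
    return min(cap, base * 2 ** attempt)
```

The specific rules matter less than the fact that they now live in one named function. A reviewer can read `on_rate_limited` and argue about whether free-tier background jobs should really be shed first, which is a capacity discussion rather than an error-handling one.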
