578 posts tagged with "insider"

Generative UI as a Production Discipline: When the Model Renders the Screen

· 12 min read
Tian Pan
Software Engineer

The button label that shipped to your users last Tuesday was never seen by a copywriter, never reviewed in Figma, never QA'd, and didn't exist until inference time. It was generated by a model that decided, mid-conversation, that the right way to collect a shipping address was a six-field form rendered inline rather than three more turns of prose. The form worked. The label was fine. Nobody on the team can tell you which model run produced it, because the trace was rotated out of hot storage and the eval suite tests text outputs, not component graphs.

This is generative UI in production: the model is no longer just a text generator that occasionally invokes a tool. It is a UI compiler whose output is a component tree, and the design system is now a contract the model is constrained to rather than a guideline a human loosely follows. The shift breaks an entire stack of assumptions — QA against static specs, accessibility audits of fixed layouts, copy review of finalized strings, design-system adherence checks at build time — and most teams ship the feature before they have replaced any of them.
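
If the design system is a contract, the most concrete version of that contract is a schema the model's output has to validate against before anything reaches the renderer. Here is a minimal sketch of that idea; the component registry, prop names, and tree shape are illustrative assumptions, not a real design system:

```python
# A minimal sketch: validate a model-emitted component tree against the design
# system before anything renders. Component names, prop names, and the tree
# shape are illustrative assumptions, not a real design system.

ALLOWED = {
    "Form":      {"required": {"title"}, "children_allowed": True},
    "TextField": {"required": {"label", "name"}, "children_allowed": False},
    "Button":    {"required": {"label", "action"}, "children_allowed": False},
}

def validate(node: dict, path: str = "$") -> list[str]:
    """Return contract violations; an empty list means the tree may render."""
    kind = node.get("component")
    spec = ALLOWED.get(kind)
    if spec is None:
        return [f"{path}: component {kind!r} is not in the design system"]
    errors = []
    missing = spec["required"] - set(node.get("props", {}))
    if missing:
        errors.append(f"{path}: {kind} missing required props {sorted(missing)}")
    children = node.get("children", [])
    if children and not spec["children_allowed"]:
        errors.append(f"{path}: {kind} may not have children")
    for i, child in enumerate(children):
        errors.extend(validate(child, f"{path}.children[{i}]"))
    return errors

# The inline shipping-address form from the opening anecdote, with one violation.
tree = {
    "component": "Form",
    "props": {"title": "Shipping address"},
    "children": [
        {"component": "TextField", "props": {"label": "Street", "name": "street"}},
        {"component": "Button", "props": {"label": "Save"}},  # model forgot "action"
    ],
}
print(validate(tree))  # ["$.children[1]: Button missing required props ['action']"]
```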

GPU Capacity Planning When Demand Is a Cliff, Not a Curve

· 10 min read
Tian Pan
Software Engineer

The first time an agent platform falls over, the postmortem usually contains a sentence that reads something like: "We had eight weeks of headroom on Friday. By Monday afternoon, we were at 140% of provisioned capacity." Nobody is lying. The capacity model was correct, just applied to a workload it was never designed for. Classical capacity planning assumes demand grows along a smooth curve where weekly seasonality is the dominant signal and the worst case is a Black Friday you can plan against six months out. Agent workloads break that assumption hard.

The shape of agent demand is not a curve. It is a cliff. Three things produce the cliff and they compound. A single enterprise customer onboarding can shift baseline by 10x overnight on a contractual notice you've already signed. An agent loop can amplify a tiny increase in user activity into a fanout-multiplied surge that hits inference 30x harder than the user-facing graph suggests. A single product change — enabling tool use, lengthening context, switching to a larger model — can move per-task token consumption by an order of magnitude with no change in user count.

If your capacity planning is in QPS and your headroom budget is "75% utilization is healthy," you are not planning. You are gambling that none of those three cliffs lands on the same week.
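
The compounding is easy to state and easy to underestimate. A back-of-the-envelope sketch, using the illustrative multipliers from above rather than measured numbers, shows how any one cliff alone overruns the usual headroom budget:

```python
# Back-of-the-envelope sketch: each cliff alone against a "75% utilization is
# healthy" headroom budget. Multipliers are the illustrative ones from the
# paragraph above, not measurements from a real fleet.

current_load = 1.0                   # normalize this week's inference demand to 1.0
provisioned = current_load / 0.75    # the headroom budget implies ~1.33x capacity

cliffs = {
    "enterprise onboarding (10x baseline overnight)": 10.0,
    "agent fanout (30x amplification of a 10% activity bump)": 1.0 + 0.10 * 30,
    "product change (10x tokens per task, same users)": 10.0,
}

for name, multiplier in cliffs.items():
    demand = current_load * multiplier
    status = "OVER" if demand > provisioned else "ok"
    print(f"{name}: {demand:.1f}x demand vs {provisioned:.2f}x capacity "
          f"-> {status} ({demand / provisioned:.0%} of provisioned)")

# None of these show up as a trend, so the smooth-curve model never sees them coming.
```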

The Internal LLM Gateway Is the New Service Mesh

· 10 min read
Tian Pan
Software Engineer

Walk into any company with fifty engineers writing LLM code in production and you will find seven gateway-shaped artifacts. The recommendations team built one to route between OpenAI and Anthropic. The support-bot team wrote one to attach their prompt registry. The platform team has a half-finished proxy that handles auth but not rate limiting. The growth team has a Lambda that does PII redaction on its way out. The data-science team is calling the vendor SDK directly and nobody has told them to stop. There is no shared gateway. There are seven shared problems, each solved poorly in isolation, and a CFO who is about to ask why the AI bill grew 40% quarter over quarter with no clear owner for any of it.

This is the same architectural beat the industry hit with microservices in 2016 and 2017. A thousand external dependencies, the same shared concerns at every team — auth, retries, observability, policy — and a choice between solving them once or rediscovering them everywhere. The answer then was the service mesh. The answer now is the internal LLM gateway, and most companies are still in the rediscovering-everywhere phase.
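
What "solving them once" can look like is less exotic than a mesh: one choke point where auth, redaction, retries, and cost attribution live together. A minimal sketch follows; the provider callables, the redaction stand-in, and the budget numbers are placeholders, not a recommended design:

```python
import logging
import random
import time

# A minimal sketch of the shared concerns living in one place: ownership, PII
# redaction, retries with fallback, and per-team cost attribution. The provider
# callables, the redaction stand-in, and the budgets are placeholders.

log = logging.getLogger("llm-gateway")

class LLMGateway:
    def __init__(self, providers, team_budgets_usd):
        self.providers = providers           # {"primary": fn, "fallback": fn}
        self.budgets = team_budgets_usd      # {"recommendations": 5000.0, ...}
        self.spend = {team: 0.0 for team in team_budgets_usd}

    def complete(self, team: str, prompt: str, est_cost_usd: float) -> str:
        if team not in self.budgets:                                  # ownership: every call has a team
            raise PermissionError(f"unknown team {team!r}")
        if self.spend[team] + est_cost_usd > self.budgets[team]:      # the CFO question, answered per team
            raise RuntimeError(f"{team} is over budget")
        prompt = prompt.replace("SSN:", "[REDACTED]:")                # stand-in for a real redaction pass
        for attempt, name in enumerate(("primary", "primary", "fallback")):
            try:
                reply = self.providers[name](prompt)                  # every vendor behind one interface
                self.spend[team] += est_cost_usd
                log.info("team=%s provider=%s attempt=%d", team, name, attempt)
                return reply
            except TimeoutError:
                time.sleep(0.1 * (2 ** attempt) + random.uniform(0, 0.05))  # jittered backoff
        raise RuntimeError("all providers failed")

gw = LLMGateway(
    providers={"primary": lambda p: "ok", "fallback": lambda p: "ok"},
    team_budgets_usd={"recommendations": 5000.0},
)
print(gw.complete("recommendations", "summarize this ticket", est_cost_usd=0.02))
```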

The Knowledge Cutoff Is a UX Surface, Not a Footnote

· 12 min read
Tian Pan
Software Engineer

The model has a knowledge cutoff. The user does not know what it is. The product, in almost every case, does not tell them. And on the day the user asks a question whose right answer changed three months ago, the assistant gives a confidently stated wrong one — not because the model failed, but because the product never gave it a way to flag the gap. The trust contract between your users and your assistant is implicit, asymmetric, and silently broken every time the world moves and your UX pretends it didn't.

The dominant pattern is to treat the cutoff as a footnote: a line of disclosure copy buried in a help center, a /about page no one reads, a one-time tooltip dismissed in week one. That framing is a bug. Knowledge cutoff is not a property of the model the way "context length" is. It is a UX surface — instrumented, designed, and evolved — and treating it as anything less ships a product that confabulates around its own ignorance in a register the user cannot audit.
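
Treating the cutoff as a surface can start very small: compare what the question is about with what the model can know, and say so inline instead of in a help center. A minimal sketch, where the cutoff date and the recency heuristic are both invented for illustration:

```python
import re
from datetime import date

# A minimal sketch of an inline disclosure: flag answers that plausibly depend on
# facts newer than the model's cutoff. The cutoff date and the recency heuristic
# are illustrative placeholders, not a real recency classifier.

MODEL_CUTOFF = date(2024, 6, 1)   # assumed; in practice, read this from model metadata

def cutoff_disclosure(question: str, today: date) -> str | None:
    years = [int(y) for y in re.findall(r"\b(20\d{2})\b", question)]
    mentions_post_cutoff_year = any(date(y, 1, 1) > MODEL_CUTOFF for y in years)
    volatile = any(w in question.lower() for w in ("latest", "current", "today", "price", "who is"))
    if mentions_post_cutoff_year or (volatile and today > MODEL_CUTOFF):
        return (f"Heads up: my knowledge ends around {MODEL_CUTOFF:%B %Y}; "
                "anything that changed since then may be missing or out of date.")
    return None

print(cutoff_disclosure("Who is the current CEO?", date(2026, 2, 1)))
```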

Your LLM Judge Has a Length Bias, a Position Bias, and a Format Bias — and Nobody Is Auditing Yours

· 11 min read
Tian Pan
Software Engineer

A team I worked with last quarter watched their LLM-as-judge score climb from 78% to 91% over six weeks of prompt iteration. They shipped. Users hated it. The new prompt produced longer, more formatted, more confident-sounding answers — and the judge loved every one of them. The team had not built a smarter prompt. They had reverse-engineered their judge's biases.

This is the failure mode nobody on the team is auditing. LLM-as-judge has well-documented systematic biases: longer answers score higher regardless of quality, the first option in pairwise comparisons wins more often than chance, and outputs that look like the judge's own training distribution outscore outputs that do not. If you wired up an LLM judge twelve months ago and have never re-validated it against humans, your scores are not a quality signal — they are a measurement of how well your prompt has learned to game its own evaluator.

The depressing part is that the audit methodology to catch this is straightforward, the calibration discipline that prevents it is cheap, and almost no team runs either.
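
Straightforward, concretely, means two checks: re-judge every pairwise comparison with the options swapped, and correlate pointwise scores with answer length. A sketch of both; the judge callables and thresholds are stand-ins for whatever your harness uses:

```python
import statistics

# A sketch of the two cheapest audits: position consistency on pairwise comparisons
# and length-score correlation on pointwise scores. judge_pair / judge_score are
# stand-ins for however you call your judge; the interpretation thresholds are yours.

def position_bias_rate(pairs, judge_pair):
    """Fraction of pairs where swapping the two options flips the verdict."""
    flips = 0
    for a, b in pairs:
        first = judge_pair(a, b)       # returns "A" or "B"
        swapped = judge_pair(b, a)
        # A position-consistent judge picks the same underlying answer both times.
        if (first == "A") != (swapped == "B"):
            flips += 1
    return flips / len(pairs)

def length_score_correlation(answers, judge_score):
    """Pearson correlation between answer length and judge score."""
    lengths = [len(a) for a in answers]
    scores = [judge_score(a) for a in answers]
    return statistics.correlation(lengths, scores)

# Toy demo: a judge that always prefers whichever option is presented first.
pairs = [("answer one", "answer two")] * 10
always_first = lambda a, b: "A"
print(position_bias_rate(pairs, always_first))                     # 1.0 -> pure position bias

answers = ["ok", "a longer answer", "an even longer, very confident answer"]
print(length_score_correlation(answers, lambda a: len(a) / 10))    # ~1.0 -> pure length bias
```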

Your SRE Postmortem Template Is Missing Six Fields That Decide Every LLM Incident

· 11 min read
Tian Pan
Software Engineer

The first time you run an LLM incident through a classic SRE postmortem template, the template wins and the incident loses. Timeline, contributing factors, mitigation, prevention — every field is filled in, every box ticked, and at the end of the document nobody can answer the only question that matters: which variable actually moved? Not the deploy event. Not the infra fault. Not the code change. The prompt revision, the model slice the router picked, the judge configuration scoring the eval that failed to fire, the retrieval index state that was serving when the quality complaints landed, the tool schema versions the planner was composing, the traffic mix that hit during the bad window. None of those have a row.

The SRE template wasn't designed for systems where the source of truth is an observed behavior rather than a code path. The variables that move silently in an LLM stack are the ones the template never had to enumerate. Borrowing the template anyway is what produces the "we don't know what changed" postmortem that files itself under "investigating" forever.
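
Most of the fix is a template change. A sketch of the six fields, named directly from the list above and expressed as a structured record your existing postmortem doc could absorb; the field names are suggestions, not a standard:

```python
# A sketch of the six extra fields as a structured record the postmortem template
# can carry next to timeline / contributing factors / mitigation. Field names are
# suggestions taken from the list above; descriptions say what each row should pin down.

llm_incident_fields = {
    "prompt_revision":       "which prompt version (registry tag or git sha) was in effect",
    "model_slice":           "which model/version the router actually served during the window",
    "judge_configuration":   "judge model and rubric version behind the eval that failed to fire",
    "retrieval_index_state": "index snapshot and embedding version serving when complaints landed",
    "tool_schema_versions":  "version of every tool schema the planner was composing",
    "traffic_mix":           "segment and intent distribution in the bad window versus baseline",
}

# Every row answers the question the classic template cannot: which variable moved?
for field, what_to_record in llm_incident_fields.items():
    print(f"{field}: {what_to_record}")
```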

Load Shedding Was Built for Humans. Agents Amplify the Storm You're Shedding

· 12 min read
Tian Pan
Software Engineer

A 503 to a human is a "try again later" page and a coffee break. A 503 to an agent is a 250-millisecond setback before retry one of seven, and the planner is already asking the LLM whether a different tool can sneak around the failed dependency. The first behavior gives an overloaded service room to recover. The second behavior is what an overloaded service has nightmares about: thousands of correlated retries, each one cheaper and faster than a human's, half of them fanning out into the next dependency over because the planner decided that was a creative workaround.

Load shedding — the discipline of dropping low-priority work to keep the high-priority path alive — was designed in an era when the principal sending traffic was a human at a keyboard or a well-behaved service with a hand-tuned retry policy. Both of those assumptions break the moment a fleet of agents shows up. The agent retries faster, retries from more places at once, replans around the failure, and treats your 503 as a load-balancing hint instead of as the cooperative back-pressure signal you meant it to be.

This piece is about why the standard load-shedding playbook doesn't survive contact with agentic clients, what primitives the upstream service needs in order to actually shed agent traffic, and what the agent itself has to do — at the tool layer and at the planner — to stop being the hostile traffic in someone else's incident report.
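
On the agent side, the minimum viable courtesy is a retry budget that honors the server's back-pressure signal instead of treating it as a routing hint. A sketch of what that looks like at the tool layer; the call_tool callable, the exception shape, and the budget numbers are illustrative:

```python
import random
import time

# A sketch of the tool layer treating a 503 as cooperative back-pressure: a hard
# retry budget, honoring Retry-After when present, jittered exponential backoff,
# and no replanning around the dependency while it is shedding.

class Overloaded(Exception):
    def __init__(self, retry_after: float | None = None):
        super().__init__("503: service is shedding load")
        self.retry_after = retry_after

def call_with_budget(call_tool, request, max_retries: int = 2, base_delay: float = 0.5):
    for attempt in range(max_retries + 1):
        try:
            return call_tool(request)
        except Overloaded as exc:
            if attempt == max_retries:
                # Surface this to the planner as "back off", not "find another path".
                raise
            delay = exc.retry_after if exc.retry_after is not None else base_delay * (2 ** attempt)
            time.sleep(delay + random.uniform(0, delay * 0.1))   # jitter to de-correlate the fleet
```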

Long-Context vs RAG in 2026: Why It Is a Per-Feature Decision, Not an Architecture Religion

· 13 min read
Tian Pan
Software Engineer

The economics of long-context vs RAG have flipped twice in two years, and the team that picked an architecture in either of those windows is now paying the wrong tax everywhere. In 2024 the trend line said stuff everything in the context window because the windows kept growing and the per-token price kept falling, so retrieval pipelines were dismissed as legacy plumbing. In 2025 the consensus reversed: context rot research showed that the effective recall on million-token prompts collapsed in the middle of the window, latency on full-window calls turned into a UX problem, and the bills came back loud, so retrieval was rehabilitated. By 2026 the right answer is neither slogan. It is a per-feature decision, made at design time with a four-axis trade-off written down, because picking one architecture for the whole product is the cheap way to be wrong on every feature at once.

The mental model that keeps biting teams is treating long-context vs RAG as a roadmap commitment instead of a per-surface choice. You read one influential blog, you pick a side, you hire engineers who specialize in that side, you write a platform doc that codifies it, and now every new feature gets the same architecture regardless of whether it fits. The features that need fresh data live with stale context. The features that need scalable corpora pay for retrieval infrastructure they will never use. The features that need citation provenance ship without it. None of these are bugs. They are the predictable cost of treating a feature-level decision as a product-level one.
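
Writing the trade-off down per feature can be as small as a scoring stub in the design doc. A sketch with four illustrative axes (corpus size, freshness, provenance, latency budget); the thresholds and the decision rule are assumptions, not a framework:

```python
# A sketch of the per-feature decision written down as data instead of doctrine.
# The four axes and the decision rule are illustrative; the point is that each
# feature gets its own row rather than the product getting one architecture.

def choose(feature: dict) -> str:
    needs_rag = (
        feature["corpus_tokens"] > 200_000        # will not fit (or survive) one window
        or feature["freshness_minutes"] < 60      # answers must reflect recent writes
        or feature["needs_citations"]             # provenance must point at a source chunk
    )
    latency_ok_for_long_context = feature["latency_budget_ms"] >= 5_000
    if needs_rag:
        return "RAG"
    return "long-context" if latency_ok_for_long_context else "RAG (for latency, not recall)"

features = [
    {"name": "contract Q&A over one uploaded doc", "corpus_tokens": 80_000,
     "freshness_minutes": 10_000, "needs_citations": False, "latency_budget_ms": 8_000},
    {"name": "support answers over the whole KB", "corpus_tokens": 40_000_000,
     "freshness_minutes": 30, "needs_citations": True, "latency_budget_ms": 2_000},
]
for f in features:
    print(f'{f["name"]}: {choose(f)}')
```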

Your Model Router Was Trained on Your Eval Set, Not Your Traffic

· 10 min read
Tian Pan
Software Engineer

A team I talked to last quarter shipped a model router that scored 96% routing accuracy on their offline benchmark and cut average inference cost by 58%. Three weeks in, support tickets started clustering around a specific user segment — enterprise admins running scripted bulk queries through their API. The cheap path was sending those users garbage answers. The router was working exactly as designed. The design was wrong.

That story is the rule, not the exception. The "send small-model what you can, save big-model for what you must" architecture is one of the most reliable cost levers in production LLM systems, with documented savings between 45% and 85% on standard benchmarks. But the savings number that gets quoted on every routing demo assumes a benchmark distribution. Production traffic doesn't have that shape, and the gap between the two is where quality regressions live — concentrated in segments your offline eval was never designed to surface.
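
The gap only becomes visible if you slice online quality by segment and by routed model, which is cheap to do and rarely done. A sketch of that check; the segment names, the quality proxy, and the alert threshold are all assumptions:

```python
from collections import defaultdict

# A sketch of the check that would have caught the enterprise-admin regression:
# slice an online quality proxy by segment and routed model, and alert when a
# segment's cheap-path quality falls well below its own big-model quality.

def segment_routing_report(events, min_gap: float = 0.10):
    """events: iterable of (segment, routed_model, quality_proxy in [0, 1])."""
    buckets = defaultdict(list)
    for segment, model, q in events:
        buckets[(segment, model)].append(q)
    alerts = []
    for seg in {s for s, _ in buckets}:
        cheap = buckets.get((seg, "small"), [])
        big = buckets.get((seg, "large"), [])
        if cheap and big:
            gap = sum(big) / len(big) - sum(cheap) / len(cheap)
            if gap > min_gap:
                alerts.append((seg, round(gap, 3)))
    return alerts

events = [
    ("consumer_chat", "small", 0.92), ("consumer_chat", "large", 0.94),
    ("enterprise_admin_bulk", "small", 0.55), ("enterprise_admin_bulk", "large", 0.90),
]
print(segment_routing_report(events))   # [('enterprise_admin_bulk', 0.35)]
```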

Multimodal Eval Drift: Why Your Image and Audio Paths Regress While Text Stays Green

· 11 min read
Tian Pan
Software Engineer

The dashboard says quality is up two points this release. The text-eval suite ran clean. Your model provider shipped a new checkpoint that beats the prior one on every public benchmark you track. You roll forward. A week later the support team flags a quiet but persistent uptick in tickets about uploaded screenshots — users say the model is "reading the wrong numbers from the chart" or "missing a row in the table." Audio transcription complaints follow a few days later, mostly from non-American English speakers. None of it shows up in your eval pipeline. The release looks healthy. It isn't.

This is multimodal eval drift, and almost every team that bolted vision and audio onto a text-first stack is shipping it. The eval discipline that worked for text — gold sets, LLM-as-judge, drift dashboards, an aggregate score that gates the release — extends to multimodal in name only. The failure rates per modality are not commensurable, the rubrics that catch text errors don't catch image errors, and the labeling pipeline that produced your text gold set is calibrated to a workload that ships every six months, not to a multimodal regression that arrives with every checkpoint update.

The right mental model is that multimodality is not a flag on the same model — it is a different product surface with a different failure distribution, and the eval discipline that ignored that distinction is shipping silent regressions every model release.
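
The minimal structural fix is to stop letting one aggregate number gate the release. A sketch of a per-modality gate, with suite names, scores, and thresholds invented for illustration:

```python
# A sketch of a release gate that treats each modality as its own surface with its
# own pass bar, instead of letting a strong text score average away an image or
# audio regression. Suite names, scores, and thresholds are invented.

def release_gate(scores: dict[str, float], thresholds: dict[str, float]) -> tuple[bool, list[str]]:
    failures = [
        f"{modality}: {scores.get(modality, 0.0):.2f} < {bar:.2f}"
        for modality, bar in thresholds.items()
        if scores.get(modality, 0.0) < bar
    ]
    return (not failures, failures)

scores = {"text": 0.93, "image_tables": 0.71, "audio_accented": 0.64}   # the aggregate looks fine
thresholds = {"text": 0.90, "image_tables": 0.80, "audio_accented": 0.75}

ok, failures = release_gate(scores, thresholds)
print("ship" if ok else "hold", failures)
# -> hold ['image_tables: 0.71 < 0.80', 'audio_accented: 0.64 < 0.75']
```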

The 30-Day Prompt Apprenticeship: Onboarding Engineers When 'Read the Code' Doesn't Work

· 12 min read
Tian Pan
Software Engineer

A senior engineer joins your team on Monday. By Friday they've shipped a TypeScript refactor that touches eleven files and passes review with two nits. The same engineer, two weeks later, opens the system prompt for your routing agent — 240 lines of instructions, three numbered example blocks, four "you must never" clauses, and a paragraph at the bottom that reads like an apology — and stares at it for an hour. They cannot tell you what would happen if you deleted lines 87–94. Neither can the engineer who wrote them six months ago.

This is the gap nobody puts on the onboarding doc. A prompt-heavy codebase looks like a codebase, lives in the same repo, runs through the same CI, and gets reviewed in the same PRs. But its semantics live somewhere else: in the observed behavior of a model that nobody on the team built, against a distribution of inputs nobody fully enumerated, with failure modes that surface as PRs to add a sentence rather than as bug reports. The traditional tools of code reading — types, signatures, tests, naming — do almost no work. A new hire who tries to "read the code" learns nothing about why each line is there, and a team that hands them a Notion doc and a Slack channel is implicitly outsourcing onboarding to the prompt's original author.
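
The "what happens if you delete lines 87–94" question does have a mechanical answer, and having the new hire produce it is better onboarding than any Notion doc. A sketch of a prompt-ablation pass; the run_eval callable and the section-splitting convention are assumptions about your stack:

```python
# A sketch of the ablation exercise: remove one section of the system prompt at a
# time, re-run the eval suite, and record how much each section actually carries.
# run_eval() and the section-splitting convention are assumptions about your stack.

def ablate_prompt(sections: list[str], run_eval) -> list[tuple[str, float]]:
    """sections: the system prompt split into labeled chunks. run_eval(prompt) -> score."""
    baseline = run_eval("\n\n".join(sections))
    report = []
    for i, section in enumerate(sections):
        without = sections[:i] + sections[i + 1:]
        delta = baseline - run_eval("\n\n".join(without))
        label = (section.splitlines() or [""])[0][:60]   # label each section by its first line
        report.append((label, round(delta, 3)))
    return sorted(report, key=lambda r: -r[1])           # biggest quality carriers first

# A section whose removal moves the score by ~0 is either dead weight or covering a
# case your eval set does not contain; both answers are worth a new hire learning.
```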

Prompt-Eligibility: The Missing Column in Your Data Classification

· 11 min read
Tian Pan
Software Engineer

Pull up your company's data classification policy. Public, internal, confidential, restricted — four neat tiers, each mapped to a set of access controls and a list of approved storage locations. Now ask a question the policy was never written to answer: which of these tiers are allowed to leave the corporate perimeter as a token sequence sent to a third-party model API?

The answer is almost always silence. Not because the policy is wrong, but because it is incomplete. Every classification scheme in use today was designed for an access vector that asks "is this employee allowed to read this row?" The prompt layer introduced a different vector entirely: an authorized service reads the row, transforms it into a prompt, and ships it across the network to a vendor that may log it, train on it, or hold it in plaintext for thirty days. None of that is read-access. None of it is covered.

This is the missing column. Until you add it, your data classification document is confidently asserting a control posture you do not have.
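
In practice the missing column is small: one more attribute per tier that names how far a prompt built from that data may travel. A sketch; the tier names match the four above, while the eligibility values and destination categories are invented:

```python
# A sketch of the added column: for each existing classification tier, an explicit
# statement of how far that data may travel as prompt tokens. The eligibility
# values and destination categories are invented for illustration.

PROMPT_ELIGIBILITY = {
    "public":       "any_vendor_api",
    "internal":     "zero_retention_vendors_only",
    "confidential": "self_hosted_only",
    "restricted":   "never_in_a_prompt",
}

DESTINATION_RANK = {            # how far outside the perimeter each destination sits
    "self_hosted": 0,
    "zero_retention_vendor": 1,
    "logging_vendor": 2,
}

ALLOWED_RANK = {                # the furthest each tier is allowed to travel
    "any_vendor_api": 2,
    "zero_retention_vendors_only": 1,
    "self_hosted_only": 0,
    "never_in_a_prompt": -1,
}

def may_prompt(tier: str, destination: str) -> bool:
    return DESTINATION_RANK[destination] <= ALLOWED_RANK[PROMPT_ELIGIBILITY[tier]]

print(may_prompt("internal", "zero_retention_vendor"))   # True
print(may_prompt("confidential", "logging_vendor"))      # False
print(may_prompt("restricted", "self_hosted"))           # False
```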