Skip to main content

18 posts tagged with "ai-infrastructure"

View all tags

Prompt Position Is Policy: The Silent Merge Conflict When Three Teams Co-Own a System Prompt

· 11 min read
Tian Pan
Software Engineer

The diff in your prompt repo says three lines changed. The behavioral diff in production says everything changed. The safety team moved a refusal rule from line 14 to line 87 to "group it with related guardrails," the product team didn't notice because the wording was identical, and a week later the eval suite is showing a 9-point drop on adversarial inputs. Nobody edited the rule. Somebody moved it. In a 2,400-token system prompt with primacy bias on guardrails and recency bias on instruction-following, moving a rule is a behavioral change as load-bearing as rewriting it — and your tooling surfaces neither.

This is the merge-conflict pattern that AI teams discover at the end of a regression review, not the beginning of one. The system prompt grew past 2K tokens sometime in late 2025. The safety team owns the top, the product team owns the middle, the agent team owns the bottom, and three months of "small edits" have silently rearranged everyone else's intent because the line-based diff tool that worked fine for code can't tell you that an instruction crossed a section boundary. The bug isn't in any single edit. The bug is that position is now policy, and you have no policy on position.

The Internal LLM Gateway Is the New Service Mesh

· 10 min read
Tian Pan
Software Engineer

Walk into any company with fifty engineers writing LLM code in production and you will find seven gateway-shaped artifacts. The recommendations team built one to route between OpenAI and Anthropic. The support-bot team wrote one to attach their prompt registry. The platform team has a half-finished proxy that handles auth but not rate limiting. The growth team has a Lambda that does PII redaction on its way out. The data-science team is calling the vendor SDK directly and nobody has told them to stop. There is no shared gateway. There are seven shared problems, each solved poorly in isolation, and a CFO who is about to ask why the AI bill grew 40% quarter over quarter with no clear owner for any of it.

This is the same architectural beat the industry hit with microservices in 2016 and 2017. A thousand external dependencies, the same shared concerns at every team — auth, retries, observability, policy — and a choice between solving them once or rediscovering them everywhere. The answer then was the service mesh. The answer now is the internal LLM gateway, and most companies are still in the rediscovering-everywhere phase.

The Indexing Policy Committee Nobody Convened: RAG Corpus Governance Beyond the One-Time Migration

· 9 min read
Tian Pan
Software Engineer

Two years ago, a team pointed their retrieval index at the wiki, the Zendesk export, and a snapshot of the public docs. Last week, that same index returned a deprecated runbook that told an SRE to restart a service that no longer exists. The runbook had been deprecated for eighteen months. Nobody owned its retirement, so nobody retired it. The agent confidently cited it. The model wasn't wrong; the corpus was.

This is the failure mode that doesn't show up in retrieval evals: the corpus is treated as a one-time engineering decision when it's actually an ongoing governance problem. The team that scoped the initial ingestion is long gone. The legal review that should have flagged the customer-confidential PDFs never happened, because nobody told legal there was a pipeline. The "freshness strategy" is a Slack message from someone who left in Q3. The retrieval index has become a shared inbox for every document anyone ever scraped, and the bar for inclusion has drifted to "whatever was easy to ingest."

Your AI Chat Transcripts Are Evidence: Retention Design for LLM Products Under Legal Hold

· 11 min read
Tian Pan
Software Engineer

On May 13, 2025, a federal magistrate judge in the Southern District of New York signed a preservation order that replaced a consumer AI company's retention policy with a single word: forever. OpenAI was directed to preserve and segregate every output log across Free, Plus, Pro, and Team tiers — including conversations users had explicitly deleted, including conversations privacy law would otherwise require to be erased. By November, the same court ordered 20 million of those de-identified transcripts produced to the New York Times and co-plaintiffs as sampled discovery. The indefinite retention obligation lasted until September 26 of that year. Five months of "delete" meaning "keep, in a segregated vault, for an opposing party to read later."

That order is the warning shot for every team building on top of LLMs. If your product stores chat, your retention policy is one plausible lawsuit away from being replaced by whatever the court thinks is reasonable. The engineering question is not whether this happens to you. It is whether your storage architecture can absorb it without turning your product into a liability engine for the legal department.

Email retention playbooks do not carry over cleanly. AI conversations contain more than what the user typed, and the "more" is where the discovery fights are starting.

The Cascade Router Reliability Trap: When Cost Optimization Quietly Wrecks Your p95

· 10 min read
Tian Pan
Software Engineer

The cost dashboard is a beautiful green. Spend per request is down 62% since the cascade router shipped. The CFO is happy. The platform team is celebrating. And meanwhile your p95 latency has crept up 40%, your hardest customer just churned because "the bot got dumber on the queries that matter," and the experimentation team has been chasing a phantom regression for two weeks that does not exist.

This is the cascade router reliability trap. It is the quiet failure mode of every "try the cheap model first, escalate if it doesn't work" architecture, and it is one of the most under-discussed second-order effects in production LLM systems. The cost wins are real, measurable, and easy to attribute. The reliability losses are diffuse, statistical, and almost impossible to attribute back to the router that caused them. So the cost wins get celebrated, the reliability losses get blamed on "the model getting worse," and the team optimizes itself into a hole.

Inference Is Faster Than Your Database Now

· 10 min read
Tian Pan
Software Engineer

Open any 2024-era AI feature's trace and the model call is the whale. Eight hundred milliseconds of generation surrounded by a thin crust of retrieval, auth, and a database lookup rounding to nothing. Every architecture decision that year — the caching, the prefetching, the streaming UX — was designed around hiding that whale.

Now pull the same trace for the same feature running on a 2026 inference stack. The whale is a dolphin. A cached prefill returns the first token in 180ms. Decode streams at 120 tokens per second. The model is no longer the slow node. Your own infrastructure is, and most of it hasn't noticed.

This reordering is the most important performance shift of the year, and it's the one teams keep under-reacting to. The p99 floor on an AI request is now set by the feature store call, the auth middleware, and the Postgres lookup that was always that slow — nobody just cared when the model was taking nine-tenths of the budget.

Tool Outputs Are an Untrusted Channel Your Agent Treats as Trusted

· 11 min read
Tian Pan
Software Engineer

The threat model most teams ship their agents with has one quiet assumption buried inside: when the model calls a tool, whatever comes back is safe to read. The user's prompt is the adversary, goes the story, and tool outputs are "just data" — search results, inbox summaries, database rows, RAG chunks, file contents, page scrapes. That story is the entire reason prompt injection keeps landing in production. Tool outputs are not data. They are another input channel into the planner, with the same privilege as the user prompt and none of the suspicion.

If that framing sounds abstract, consider what happened inside Microsoft 365 Copilot in June 2025. A researcher sent a single email with hidden instructions; the victim never clicked a link, never opened an attachment, never read the message themselves. A routine "summarize my inbox" query asked Copilot to read the email. The agent dutifully followed the instructions it found inside the body, reached into OneDrive, SharePoint, and Teams, and exfiltrated organizational data through a trusted Microsoft domain before anyone noticed. The CVE (2025-32711, "EchoLeak") earned a 9.3 CVSS and a server-side patch, but the class of bug did not go away. It cannot go away, because every read-tool on every production agent is a version of that email inbox.

This post is about the framing shift that gets you unstuck: stop thinking about "prompt injection" as a user-input problem, and start thinking about every tool output as an untrusted channel that happens to share a token stream with your system prompt.

Tool Output Compression: The Injection Decision That Shapes Context Quality

· 10 min read
Tian Pan
Software Engineer

Your agent calls a database tool. The query returns 8,000 tokens of raw JSON — nested objects, null fields, pagination metadata, and a timestamp on every row. Your agent needs three fields from that response. You just paid for 7,900 tokens of noise, and you injected all of them into context where they'll compete for attention against the actual task.

This is the tool output injection problem, and it's the most underrated architectural decision in agent design. Most teams discover it the hard way: the demo works, production degrades, and nobody can explain why the model started hedging answers it used to answer confidently.

Context Windows Aren't Free Storage: The Case for Explicit Eviction Policies

· 10 min read
Tian Pan
Software Engineer

Most engineering teams treat the LLM context window the way early web developers treated global variables: throw everything in, fix it later. The context is full of the last 40 conversation turns, three entire files from the repository, a dozen retrieved documents, and a system prompt that's grown by committee over six months. It works — until it doesn't, and by then it's hard to tell what's causing the degradation.

The context window is not heap memory. It is closer to a CPU register file: finite, expensive per unit, and its contents directly affect every computation the model performs. When you treat registers as scratch space and forget to manage them, programs crash in creative ways. When you treat context windows as scratch space, LLMs degrade silently and expensively.

Data Versioning for AI: The Dataset-Model Coupling Problem Teams Discover Too Late

· 9 min read
Tian Pan
Software Engineer

Your model's accuracy dropped 8% in production overnight. Nothing in the model code changed. No deployment happened. The eval suite is green. So you spend a week adjusting hyperparameters, tweaking prompts, comparing checkpoint losses — and eventually someone notices that a schema migration landed three days ago in the feature pipeline. A single field that switched from NULL to an empty string. That's it. That's the regression.

This is the most common failure mode in production ML systems, and it has almost nothing to do with model quality. It has everything to do with a structural gap most teams don't close until they've been burned: data versions and model versions are intimately coupled, but they're tracked by different tools and owned by different teams.

Your Annotation Pipeline Is the Real Bottleneck in Your AI Product

· 10 min read
Tian Pan
Software Engineer

Every team working on an AI product eventually ships a feedback widget. Thumbs up. Thumbs down. Maybe a star rating or a correction field. The widget launches. The data flows. And then nothing changes about the model — for weeks, then months — while the team remains genuinely convinced they have a working feedback loop.

The widget was the easy part. The annotation pipeline behind it is where AI products actually stall.

Prompt Cache Break-Even: The Exact Math on When Provider-Side Prefix Caching Actually Pays Off

· 9 min read
Tian Pan
Software Engineer

Prompt caching sounds like a clear win: Anthropic and OpenAI both advertise a 90% discount on cache hits, and the documentation shows impressive cost reduction charts. Teams implement it, monitor the cache hit rate counter going up, and assume they're saving money. Some of them are paying more than if they hadn't cached anything.

The issue is the write premium. Every time you cache a prefix, you pay a surcharge — 1.25× on a 5-minute cache window, 2× for a 1-hour window. If your hit rate is too low, those write premiums accumulate faster than the read discounts recover them. Caching is not free insurance; it's a bet you place against your own traffic patterns.