22 posts tagged with "ai-infrastructure"

Per-Tenant Prompt Compilation: When Your System Prompt Becomes a Build Artifact

· 10 min read
Tian Pan
Software Engineer

The day a multi-tenant SaaS team adds the third if tenant_industry == "healthcare" branch to its system prompt is the day it accidentally hires itself a compiler engineer. Nobody filed the headcount req. Nobody scoped the work. The team thinks it is shipping a feature; it is actually shipping a build system, and the build system is held together with f-strings.

Every team that scales an AI feature into a customer base with even mild heterogeneity hits the same wall. Tenant A is in healthcare and needs HIPAA-aware response framing. Tenant B is in legal and needs strict citation discipline. Tenant C is an enterprise that bought a custom safety rubric in the master agreement. Tenant D is on the free tier and gets the default. The first instinct is to handle the variance with runtime conditionals, and the conditionals nest until the prompt becomes unreadable to anyone who didn't write it. The second instinct — and the one most teams arrive at after the wall — is prompt compilation: the canonical "prompt" is no longer a string but a source artifact, and what reaches the model is a compiled output.
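To make the "source artifact versus compiled output" idea concrete, here is a minimal sketch in Python. Everything named here (`BASE`, `FRAGMENTS`, `TenantConfig`, `compile_prompt`) is hypothetical and invented for illustration; a real build would likely load fragments from versioned files rather than a dict.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional

# Hypothetical source fragments, standing in for versioned files in a repo.
BASE = "You are a helpful assistant."
FRAGMENTS = {
    "healthcare": "Frame responses with HIPAA-aware caution.",
    "legal": "Cite a source for every legal claim.",
}


@dataclass(frozen=True)
class TenantConfig:
    tenant_id: str
    industry: Optional[str] = None
    custom_rubric: Optional[str] = None  # e.g. contractually agreed safety text


@dataclass(frozen=True)
class CompiledPrompt:
    text: str
    digest: str  # content hash, so a deployed artifact can be identified later


def compile_prompt(cfg: TenantConfig) -> CompiledPrompt:
    """Assemble the prompt from declarative config instead of runtime branches."""
    parts = [BASE]
    if cfg.industry in FRAGMENTS:
        parts.append(FRAGMENTS[cfg.industry])
    if cfg.custom_rubric:
        parts.append(cfg.custom_rubric)
    text = "\n\n".join(parts)
    return CompiledPrompt(text, hashlib.sha256(text.encode("utf-8")).hexdigest())
```

The point of the digest is that two tenants with the same configuration get a byte-identical artifact, and any change to a fragment shows up as a new hash rather than as an invisible edit to a string.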

Prompts Don't Roll Back Like Code: Why git revert Is the Wrong Primitive

· 9 min read
Tian Pan
Software Engineer

A senior engineer ships a prompt change behind a 10% canary. By the next morning, the canary cohort's helpfulness score has dropped four points, the on-call notices, and the team does what every team does — they revert the commit and redeploy. The dashboard does not recover. It does not recover the next day either. Three days later, a postmortem reveals that the cohort that saw the bad prompt is still seeing degraded outputs because their conversation histories now contain assistant turns produced by the rolled-back prompt, and the model is conditioning on those turns. The commit is gone. The damage is not.

This is the part of LLMOps that the "treat prompts like code" advice quietly skips. Code rollback is a text replacement that restores a deterministic past state. Prompt rollback has to reconcile with a tail of side effects — caches, histories, eval baselines, experiment cohorts, downstream contracts — that the bad prompt has already imprinted on the production world. git revert flips the text. It does not flip the consequences.
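One way to make the history side effect visible is to stamp every assistant turn with the prompt version that produced it. The sketch below assumes a hypothetical `Turn` record and version labels such as `"v42"`; what to do with the flagged turns (drop, regenerate, or summarize them) is a policy decision this sketch does not make.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Turn:
    role: str                              # "user" or "assistant"
    content: str
    prompt_version: Optional[str] = None   # set only on assistant turns


def split_by_prompt_version(history: list[Turn], bad_version: str):
    """Separate turns produced under a rolled-back prompt from the rest."""
    kept = [t for t in history if t.prompt_version != bad_version]
    flagged = [t for t in history if t.prompt_version == bad_version]
    return kept, flagged
```

Without the version stamp, a rollback has no way to find the turns the bad prompt left behind, which is the situation the postmortem above describes.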

Conversation-Aware Rate Limiting: Why Per-Request Throttling Breaks Multi-Turn AI

· 10 min read
Tian Pan
Software Engineer

Your AI feature works in testing. Single-turn Q&A, perfect. Run it in production with a real user sitting in a 10-turn debugging session and it fails — not because the model broke, but because your rate limiter was designed for a completely different world.

The standard API rate limit is a blunt instrument built for stateless REST calls. Each request is treated as an independent, roughly equal unit of consumption. That model works fine for CRUD endpoints where every call is indeed comparable. It falls apart for multi-turn conversations, where each successive turn gets more expensive, a single user interaction can trigger dozens of internal model calls, and a mid-session cutoff is far more damaging than a failed single-shot query ever was.
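A rough sketch of what a conversation-aware limit could look like, assuming cost is metered in tokens and that an in-progress session gets a grace allowance so it is not cut off mid-flow. The class name, the `grace` parameter, and the cost model are all illustrative assumptions, not a description of any particular limiter.

```python
from dataclasses import dataclass


def turn_cost(history_tokens: int, new_tokens: int) -> int:
    """Each turn re-sends the history, so cost grows with conversation length."""
    return history_tokens + new_tokens


@dataclass
class ConversationBudget:
    capacity: int   # tokens allowed per window (hypothetical number)
    grace: int      # extra tokens reserved for finishing an active session
    used: int = 0

    def admit(self, tokens: int, session_in_progress: bool) -> bool:
        # New sessions are held to the base capacity; active ones may dip into grace.
        limit = self.capacity + (self.grace if session_in_progress else 0)
        if self.used + tokens > limit:
            return False
        self.used += tokens
        return True
```

The design choice is that the limiter rejects new conversations before it interrupts existing ones, since a mid-session cutoff is the more damaging failure.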

Lazy Evaluation in AI Pipelines: Stop Calling the LLM Until You Have To

· 11 min read
Tian Pan
Software Engineer

Most AI pipelines are written as if every request deserves a full LLM call. The user submits a message, the pipeline passes it to the model, waits for a response, and returns it — every time, unconditionally. This works, but it's expensive, slow, and often unnecessary.

The fraction of requests that actually require a full LLM inference is smaller than most engineers assume. Research on token-level routing shows that only about 11% of tokens differ between a 1.5B and a 32B parameter model, and only 4.9% of tokens are genuinely "divergent" — meaning they alter the reasoning path if handled by the smaller model. Production semantic caches show that 65% of incoming traffic is semantically similar to something the pipeline has already answered. These aren't edge cases. They're the majority of your traffic, and you're paying full price to handle them.

The fix is lazy evaluation: don't invoke the expensive model until you've confirmed that the expensive model is actually needed.
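As a minimal sketch of that idea, the pipeline below checks a cache, then a cheap handler, and only then the expensive model. It uses an exact-match cache on normalized text as a simplified stand-in for a semantic cache, and `cheap` and `expensive` are hypothetical callables you would supply.

```python
from typing import Callable, Optional


def normalize(query: str) -> str:
    """Collapse case and whitespace; a stand-in for semantic similarity."""
    return " ".join(query.lower().split())


class LazyPipeline:
    def __init__(
        self,
        cheap: Callable[[str], Optional[str]],   # returns None when not confident
        expensive: Callable[[str], str],
    ) -> None:
        self.cache: dict[str, str] = {}
        self.cheap = cheap
        self.expensive = expensive
        self.expensive_calls = 0

    def answer(self, query: str) -> str:
        key = normalize(query)
        if key in self.cache:            # 1. already answered
            return self.cache[key]
        result = self.cheap(query)       # 2. cheap path first
        if result is None:               # 3. escalate only when needed
            self.expensive_calls += 1
            result = self.expensive(query)
        self.cache[key] = result
        return result
```

Tracking `expensive_calls` separately makes it easy to measure what fraction of traffic actually reaches the large model.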

Prompt Position Is Policy: The Silent Merge Conflict When Three Teams Co-Own a System Prompt

· 11 min read
Tian Pan
Software Engineer

The diff in your prompt repo says three lines changed. The behavioral diff in production says everything changed. The safety team moved a refusal rule from line 14 to line 87 to "group it with related guardrails," the product team didn't notice because the wording was identical, and a week later the eval suite is showing a 9-point drop on adversarial inputs. Nobody edited the rule. Somebody moved it. In a 2,400-token system prompt with primacy bias on guardrails and recency bias on instruction-following, moving a rule is a behavioral change as load-bearing as rewriting it — and your tooling surfaces neither.

This is the merge-conflict pattern that AI teams discover at the end of a regression review, not the beginning of one. The system prompt grew past 2K tokens sometime in late 2025. The safety team owns the top, the product team owns the middle, the agent team owns the bottom, and three months of "small edits" have silently rearranged everyone else's intent because the line-based diff tool that worked fine for code can't tell you that an instruction crossed a section boundary. The bug isn't in any single edit. The bug is that position is now policy, and you have no policy on position.
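A line-based diff shows a move as an unrelated delete and insert. A small sketch of surfacing moves explicitly, using Python's standard `difflib`, might look like the following; `moved_lines` is a hypothetical helper, and it assumes each rule occupies one line and appears once.

```python
import difflib


def moved_lines(old: list[str], new: list[str]) -> list[tuple[str, int, int]]:
    """Return (text, old_index, new_index) for lines whose wording is
    unchanged but whose position changed between two prompt versions."""
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    removed: dict[str, int] = {}
    added: dict[str, int] = {}
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            for i in range(i1, i2):
                removed[old[i]] = i
        if tag in ("insert", "replace"):
            for j in range(j1, j2):
                added[new[j]] = j
    # A line that was both removed and added verbatim was moved, not edited.
    return sorted((text, removed[text], added[text]) for text in removed.keys() & added.keys())
```

A review gate could then treat any non-empty result as a behavioral change that needs the owning team's sign-off, even though no wording changed.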

The Internal LLM Gateway Is the New Service Mesh

· 10 min read
Tian Pan
Software Engineer

Walk into any company with fifty engineers writing LLM code in production and you will find seven gateway-shaped artifacts. The recommendations team built one to route between OpenAI and Anthropic. The support-bot team wrote one to attach their prompt registry. The platform team has a half-finished proxy that handles auth but not rate limiting. The growth team has a Lambda that does PII redaction on its way out. The data-science team is calling the vendor SDK directly and nobody has told them to stop. There is no shared gateway. There are seven shared problems, each solved poorly in isolation, and a CFO who is about to ask why the AI bill grew 40% quarter over quarter with no clear owner for any of it.

This is the same architectural beat the industry hit with microservices in 2016 and 2017. A thousand external dependencies, the same shared concerns at every team — auth, retries, observability, policy — and a choice between solving them once or rediscovering them everywhere. The answer then was the service mesh. The answer now is the internal LLM gateway, and most companies are still in the rediscovering-everywhere phase.

The Indexing Policy Committee Nobody Convened: RAG Corpus Governance Beyond the One-Time Migration

· 9 min read
Tian Pan
Software Engineer

Two years ago, a team pointed their retrieval index at the wiki, the Zendesk export, and a snapshot of the public docs. Last week, that same index returned a deprecated runbook that told an SRE to restart a service that no longer exists. The runbook had been deprecated for eighteen months. Nobody owned its retirement, so nobody retired it. The agent confidently cited it. The model wasn't wrong; the corpus was.

This is the failure mode that doesn't show up in retrieval evals: the corpus is treated as a one-time engineering decision when it's actually an ongoing governance problem. The team that scoped the initial ingestion is long gone. The legal review that should have flagged the customer-confidential PDFs never happened, because nobody told legal there was a pipeline. The "freshness strategy" is a Slack message from someone who left in Q3. The retrieval index has become a shared inbox for every document anyone ever scraped, and the bar for inclusion has drifted to "whatever was easy to ingest."

Your AI Chat Transcripts Are Evidence: Retention Design for LLM Products Under Legal Hold

· 11 min read
Tian Pan
Software Engineer

On May 13, 2025, a federal magistrate judge in the Southern District of New York signed a preservation order that replaced a consumer AI company's retention policy with a single word: forever. OpenAI was directed to preserve and segregate every output log across Free, Plus, Pro, and Team tiers — including conversations users had explicitly deleted, including conversations privacy law would otherwise require to be erased. By November, the same court ordered 20 million of those de-identified transcripts produced to the New York Times and co-plaintiffs as sampled discovery. The indefinite retention obligation lasted until September 26 of that year. Five months of "delete" meaning "keep, in a segregated vault, for an opposing party to read later."

That order is the warning shot for every team building on top of LLMs. If your product stores chat, your retention policy is one plausible lawsuit away from being replaced by whatever the court thinks is reasonable. The engineering question is not whether this happens to you. It is whether your storage architecture can absorb it without turning your product into a liability engine for the legal department.

Email retention playbooks do not carry over cleanly. AI conversations contain more than what the user typed, and the "more" is where the discovery fights are starting.

The Cascade Router Reliability Trap: When Cost Optimization Quietly Wrecks Your p95

· 10 min read
Tian Pan
Software Engineer

The cost dashboard is a beautiful green. Spend per request is down 62% since the cascade router shipped. The CFO is happy. The platform team is celebrating. Meanwhile, your p95 latency has crept up 40%, your hardest customer just churned because "the bot got dumber on the queries that matter," and the experimentation team has spent two weeks chasing a phantom regression.

This is the cascade router reliability trap. It is the quiet failure mode of every "try the cheap model first, escalate if it doesn't work" architecture, and it is one of the most under-discussed second-order effects in production LLM systems. The cost wins are real, measurable, and easy to attribute. The reliability losses are diffuse, statistical, and almost impossible to attribute back to the router that caused them. So the cost wins get celebrated, the reliability losses get blamed on "the model getting worse," and the team optimizes itself into a hole.
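One way to make the reliability cost attributable is to tag every request with the route it took and compute tail latency per route. This is a sketch under assumptions: the route labels `"cheap"` and `"escalated"` and the record shape are invented here, and the percentile uses the simple nearest-rank method.

```python
def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile, using integer math to avoid float rounding."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100   # ceil(0.95 * n)
    return ordered[rank - 1]


def p95_by_route(records: list[tuple[str, float]]) -> dict[str, float]:
    """Group (route, latency_ms) records and report p95 for each route."""
    groups: dict[str, list[float]] = {}
    for route, latency_ms in records:
        groups.setdefault(route, []).append(latency_ms)
    return {route: p95(values) for route, values in groups.items()}
```

With only the aggregate number, escalated requests (which pay for the cheap attempt and then the expensive one) disappear into the overall distribution; splitting by route shows where the tail comes from.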

Inference Is Faster Than Your Database Now

· 10 min read
Tian Pan
Software Engineer

Open any 2024-era AI feature's trace and the model call is the whale. Eight hundred milliseconds of generation surrounded by a thin crust of retrieval, auth, and a database lookup rounding to nothing. Every architecture decision that year — the caching, the prefetching, the streaming UX — was designed around hiding that whale.

Now pull the same trace for the same feature running on a 2026 inference stack. The whale is a dolphin. A cached prefill returns the first token in 180ms. Decode streams at 120 tokens per second. The model is no longer the slow node. Your own infrastructure is, and most of it hasn't noticed.

This reordering is the most important performance shift of the year, and it's the one teams keep under-reacting to. The p99 floor on an AI request is now set by the feature store call, the auth middleware, and the Postgres lookup that was always that slow; nobody cared while the model was taking nine-tenths of the budget.

Tool Outputs Are an Untrusted Channel Your Agent Treats as Trusted

· 11 min read
Tian Pan
Software Engineer

The threat model most teams ship their agents with has one quiet assumption buried inside: when the model calls a tool, whatever comes back is safe to read. The user's prompt is the adversary, goes the story, and tool outputs are "just data" — search results, inbox summaries, database rows, RAG chunks, file contents, page scrapes. That story is the entire reason prompt injection keeps landing in production. Tool outputs are not data. They are another input channel into the planner, with the same privilege as the user prompt and none of the suspicion.

If that framing sounds abstract, consider what happened inside Microsoft 365 Copilot in June 2025. A researcher sent a single email with hidden instructions; the victim never clicked a link, never opened an attachment, never read the message themselves. A routine "summarize my inbox" query asked Copilot to read the email. The agent dutifully followed the instructions it found inside the body, reached into OneDrive, SharePoint, and Teams, and exfiltrated organizational data through a trusted Microsoft domain before anyone noticed. The CVE (2025-32711, "EchoLeak") earned a 9.3 CVSS and a server-side patch, but the class of bug did not go away. It cannot go away, because every read-tool on every production agent is a version of that email inbox.

This post is about the framing shift that gets you unstuck: stop thinking about "prompt injection" as a user-input problem, and start thinking about every tool output as an untrusted channel that happens to share a token stream with your system prompt.
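One possible consequence of that framing, sketched below, is to track where each piece of context came from and to withhold privileged tools once any tool output has entered the token stream. The tool names, the `ContextItem` type, and the policy itself are illustrative assumptions; this is one mitigation pattern, not a complete defense.

```python
from dataclasses import dataclass

# Hypothetical tool inventory for illustration.
ALL_TOOLS = {"search", "read_inbox", "send_email", "share_file"}
PRIVILEGED = {"send_email", "share_file"}   # tools that can move data out
UNTRUSTED_SOURCES = {"tool"}


@dataclass(frozen=True)
class ContextItem:
    source: str   # "system", "user", or "tool"
    text: str


def allowed_tools(context: list[ContextItem]) -> set[str]:
    """Drop privileged tools once untrusted content is in the context."""
    tainted = any(item.source in UNTRUSTED_SOURCES for item in context)
    return ALL_TOOLS - PRIVILEGED if tainted else set(ALL_TOOLS)
```

In the inbox scenario above, reading the email would taint the context, so the follow-up step that reached into file sharing would not have been available to the planner in that turn.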

Tool Output Compression: The Injection Decision That Shapes Context Quality

· 10 min read
Tian Pan
Software Engineer

Your agent calls a database tool. The query returns 8,000 tokens of raw JSON — nested objects, null fields, pagination metadata, and a timestamp on every row. Your agent needs three fields from that response. You just paid for 7,900 tokens of noise, and you injected all of them into context where they'll compete for attention against the actual task.

This is the tool output injection problem, and it's the most underrated architectural decision in agent design. Most teams discover it the hard way: the demo works, production degrades, and nobody can explain why the model started hedging answers it used to answer confidently.
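A minimal sketch of compressing a tool result before it enters context: keep only the fields the task needs, drop nulls and pagination metadata, and cap the row count. The payload shape (a `"rows"` list of dicts) and the function names are assumptions made for this example.

```python
import json


def project_rows(rows: list[dict], fields: list[str]) -> list[dict]:
    """Keep only the requested fields, skipping missing and null values."""
    return [{k: row[k] for k in fields if row.get(k) is not None} for row in rows]


def compress_tool_output(payload: dict, fields: list[str], max_rows: int = 20) -> str:
    """Turn a raw tool payload into a compact JSON string for the model."""
    rows = payload.get("rows", [])
    out: dict = {"rows": project_rows(rows[:max_rows], fields)}
    if len(rows) > max_rows:
        # Tell the model that data was cut, rather than silently truncating.
        out["truncated_rows"] = len(rows) - max_rows
    return json.dumps(out, separators=(",", ":"))
```

Reporting the number of truncated rows is a deliberate choice: the model can then ask for more instead of reasoning as if it had seen the full result.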