7 posts tagged with "ai-architecture"

Long-Context vs RAG in 2026: Why It Is a Per-Feature Decision, Not an Architecture Religion

· 13 min read
Tian Pan
Software Engineer

The economics of long-context vs RAG have flipped twice in two years, and the team that picked an architecture in either of those windows is now paying the wrong tax everywhere. In 2024 the trend line said "stuff everything in the context window" because the windows kept growing and the per-token price kept falling, so retrieval pipelines were dismissed as legacy plumbing. In 2025 the consensus reversed: context rot research showed that the effective recall on million-token prompts collapsed in the middle of the window, latency on full-window calls turned into a UX problem, and the bills came back loud, so retrieval was rehabilitated. By 2026 the right answer is neither slogan. It is a per-feature decision, made at design time with a four-axis trade-off written down, because picking one architecture for the whole product is the cheap way to be wrong on every feature at once.

The mental model that keeps biting teams is treating long-context vs RAG as a roadmap commitment instead of a per-surface choice. You read one influential blog, you pick a side, you hire engineers who specialize in that side, you write a platform doc that codifies it, and now every new feature gets the same architecture regardless of whether it fits. The features that need fresh data live with stale context. The features that need scalable corpora pay for retrieval infrastructure they will never use. The features that need citation provenance ship without it. None of these are bugs. They are the predictable cost of treating a feature-level decision as a product-level one.
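
To make the per-feature framing concrete, here is a minimal sketch of what "writing the trade-off down" could look like. The four axis names (freshness, corpus size, provenance, latency budget), the thresholds, and the `FeatureProfile` record are illustrative assumptions, not a prescribed rubric:

```python
from dataclasses import dataclass

# Hypothetical per-feature decision record. The axes and thresholds below are
# illustrative, not an official rubric.
@dataclass
class FeatureProfile:
    name: str
    needs_fresh_data: bool      # does the corpus change between calls?
    corpus_tokens: int          # total tokens the feature must "know"
    needs_citations: bool       # must answers point back at sources?
    latency_budget_ms: int      # interactive surface vs. background job

def recommend(p: FeatureProfile, context_window: int = 200_000) -> str:
    """Toy heuristic: any axis long-context cannot satisfy pushes the feature to RAG."""
    if p.needs_fresh_data or p.needs_citations:
        return "rag"
    if p.corpus_tokens > context_window // 2:   # leave room for instructions and output
        return "rag"
    if p.latency_budget_ms < 2_000 and p.corpus_tokens > 50_000:
        return "rag"                            # full-window calls are too slow here
    return "long-context"

print(recommend(FeatureProfile("contract-qa", False, 30_000, False, 5_000)))      # long-context
print(recommend(FeatureProfile("support-search", True, 5_000_000, True, 1_500)))  # rag
```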

Your System Prompt Will Leak: Designing for Prompt Extraction

· 10 min read
Tian Pan
Software Engineer

The threat model for LLM features over-indexes on three failure modes: prompt injection, user-data exfiltration, and unauthorized tool calls. There is a quieter attack that lands more often, costs less to mount, and shows up in fewer postmortems because nobody filed one — prompt extraction. An adversarial user, sometimes a competitor, sometimes a curious researcher, walks the model into reciting its own system prompt over a handful of turns. The carefully tuned instructions that encode your team's product behavior, refusal policy, retrieval scaffolding, and brand voice land in a public GitHub repository within the week.

The repositories already exist. A widely-circulated GitHub project tracks extracted system prompts from Claude, ChatGPT, Gemini, Grok, Perplexity, Cursor, and v0.dev — updated as new model versions ship, often within hours of release. Anthropic's full Claude prompt clocks in at over 24,000 tokens including tools, and you can read it. The companies most invested in prompt secrecy are the ones whose prompts leak most reliably, because they are also the ones whose attackers are most motivated.
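
One design consequence is to treat the system prompt as if it were already public: nothing secret in the text, and nothing load-bearing that a persuadable model could be talked out of enforcing. A minimal sketch of that posture, with hypothetical secret patterns and a hypothetical tool handler:

```python
import re

# Patterns that must never appear in a prompt that can be extracted.
# These regexes are illustrative examples, not a complete list.
SECRET_PATTERNS = [
    r"sk-[A-Za-z0-9]{20,}",                 # API-key-shaped strings
    r"https?://[\w.-]*internal[\w./-]*",    # internal endpoints
    r"(?i)password|shared secret",
]

def assert_prompt_is_leak_safe(system_prompt: str) -> None:
    """Fail CI if the system prompt contains material that cannot afford to leak."""
    for pattern in SECRET_PATTERNS:
        if re.search(pattern, system_prompt):
            raise ValueError(f"system prompt contains material that cannot leak: {pattern}")

# Authorization lives in the tool handler, not in prompt instructions the model
# could be walked out of. Hypothetical handler; `user.is_admin` is assumed.
def delete_account(user, target_id: str) -> None:
    if not user.is_admin:        # enforced in code regardless of what the prompt says
        raise PermissionError("not allowed")
    ...
```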

Hot-Path vs. Cold-Path AI: The Architectural Decision That Decides Your p99

· 10 min read
Tian Pan
Software Engineer

Every AI feature you ship makes an architectural choice before it makes a product one: does this model call live inside the user's request, or does it run somewhere the user isn't waiting for it? The choice is usually made by whoever writes the first prototype, never revisited, and silently determines your p99 latency for the rest of the feature's life. When the post-mortem asks why a shipping dashboard became unusable at 10 a.m. every Monday, the answer is almost always that something which should have been cold-path got welded into the hot path — and a model that is fine at p50 becomes catastrophic at p99 when traffic fans out.

The hot-path / cold-path distinction is older than LLMs. CQRS, streaming architectures, lambda architectures — they all draw the same line between "must respond now" and "can arrive eventually." What's different about AI workloads is that the cost of crossing the line in the wrong direction is an order of magnitude higher than it used to be. A synchronous database query that takes 50 ms turning into 200 ms is a regression. A synchronous LLM call that takes 1.2 s at p50 turning into 11 s at p99 is a business decision you didn't know you made.
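
A compressed sketch of the two shapes, with the model call and the stores reduced to in-process stand-ins; the only point is where the call sits relative to the request:

```python
import queue
import threading
import time

SUMMARIES: dict[str, str] = {}        # stands in for a cache or summary table

def summarize_with_llm(order_id: str) -> str:
    time.sleep(1.2)                   # pretend p50 latency; p99 is far worse
    return f"summary of {order_id}"

# Hot path: the model call sits inside the request, so its p99 becomes your p99.
def get_order_hot(order_id: str) -> dict:
    return {"order": order_id, "summary": summarize_with_llm(order_id)}

# Cold path: the request only enqueues work and serves whatever was precomputed.
jobs: queue.Queue = queue.Queue()

def get_order_cold(order_id: str) -> dict:
    jobs.put(order_id)                # model runs later, off the request
    return {"order": order_id, "summary": SUMMARIES.get(order_id, "pending")}

def worker() -> None:
    while True:
        order_id = jobs.get()
        SUMMARIES[order_id] = summarize_with_llm(order_id)

threading.Thread(target=worker, daemon=True).start()
```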

Compound AI Systems: Why Your Best Architecture Uses Three Models, Not One

· 10 min read
Tian Pan
Software Engineer

The instinct is always to reach for the biggest model. Pick the frontier model, point it at the problem, and hope that raw capability compensates for architectural laziness. It works in demos. It fails in production.

The teams shipping the most reliable AI systems in 2025 and 2026 aren't using one model. They're composing three, four, sometimes five specialized models into pipelines where each component does exactly one thing well. A classifier routes. A generator produces. A verifier checks. The result is a system that outperforms any single model while costing a fraction of what a frontier-model-for-everything approach would.
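
As a rough sketch of that shape (the `call_model` helper and the model names below are placeholders, not any particular provider's API):

```python
def call_model(model: str, prompt: str) -> str:
    """Stand-in for a provider call; returns a canned response in this sketch."""
    return "..."

def answer(question: str) -> str:
    # 1. A small, cheap classifier decides which path the query takes.
    route = call_model("small-classifier", f"Label as 'faq' or 'complex': {question}")

    # 2. Only the hard path pays for the large generator.
    generator = "frontier-generator" if "complex" in route else "mid-tier-generator"
    draft = call_model(generator, question)

    # 3. A separate verifier checks the draft instead of trusting the generator.
    verdict = call_model("small-verifier", f"Does this answer the question?\n{question}\n{draft}")
    if "no" in verdict.lower():
        draft = call_model("frontier-generator", question)   # escalate only on failure
    return draft
```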

LLM Provider Lock-in: The Portability Patterns That Actually Work

· 8 min read
Tian Pan
Software Engineer

Everyone talks about avoiding LLM vendor lock-in. The advice usually boils down to "use an abstraction layer" — as if swapping openai.chat.completions.create for litellm.completion solves the problem. It doesn't. The API call is the easy part. The real lock-in is invisible: it lives in your prompts, your evaluation data, your tool-calling assumptions, and the behavioral quirks you've unconsciously designed around.

Provider portability isn't a boolean. It's a spectrum, and most teams are further from the portable end than they think. The good news is that the patterns for genuine portability are well understood — they just require more discipline than dropping in a wrapper library.
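
The wrapper-library point still leaves a useful kernel: application code talks to an interface, and the same eval set runs against every adapter so a provider switch is measured rather than assumed. A minimal sketch, with adapter internals stubbed out and all names illustrative:

```python
from typing import Protocol

class ChatModel(Protocol):
    def complete(self, messages: list[dict[str, str]]) -> str: ...

class ProviderA:
    def complete(self, messages: list[dict[str, str]]) -> str:
        return "stubbed response from provider A"   # real provider call lives only here

class ProviderB:
    def complete(self, messages: list[dict[str, str]]) -> str:
        return "stubbed response from provider B"

def run_evals(model: ChatModel, cases: list[tuple[str, str]]) -> float:
    """Score any adapter against the same eval set; portability is measured, not assumed."""
    hits = sum(expected in model.complete([{"role": "user", "content": q}])
               for q, expected in cases)
    return hits / len(cases)

print(run_evals(ProviderA(), [("What is 2+2?", "4")]))
```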

Stateful vs. Stateless AI Features: The Architectural Decision That Shapes Everything Downstream

· 12 min read
Tian Pan
Software Engineer

When a shopping assistant recommended baby products to a user who had mentioned a pregnancy two years earlier, nothing threw an exception. The system worked exactly as designed. The LLM returned a confident response with HTTP 200. The bug was in the data — a stale memory that was never invalidated — and it was completely invisible until a customer complained. That's the ghost that lives in stateful AI systems, and it behaves nothing like the bugs you're used to debugging.

The decision between stateful and stateless AI features looks deceptively simple on the surface. In practice, it's one of the earliest architectural choices you'll make for an AI product, and it propagates consequences through your storage layer, your debugging toolchain, your security posture, and your operational costs. Most teams make this decision implicitly, by defaulting to one pattern without examining the tradeoffs. This post is about making it explicitly.
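
One small illustration of the stateful tax: remembered facts need provenance and an expiry, or the pregnancy mention above lives forever. A sketch under assumed field names and an arbitrary 180-day window:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Every remembered fact carries a timestamp and a max age; expired facts never
# reach the prompt. Field names and the 180-day window are illustrative.
@dataclass
class MemoryFact:
    text: str
    recorded_at: datetime
    max_age: timedelta = timedelta(days=180)

    def is_fresh(self, now: datetime) -> bool:
        return now - self.recorded_at <= self.max_age

def memories_for_prompt(facts: list[MemoryFact]) -> list[str]:
    now = datetime.now(timezone.utc)
    return [f.text for f in facts if f.is_fresh(now)]   # stale facts are invalidated here

facts = [
    MemoryFact("expecting a baby", datetime(2024, 3, 1, tzinfo=timezone.utc)),
    MemoryFact("prefers organic groceries", datetime.now(timezone.utc)),
]
print(memories_for_prompt(facts))    # only the recent preference survives
```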

The Reasoning Model Premium in Agent Loops: When Thinking Pays and When It Doesn't

· 10 min read
Tian Pan
Software Engineer

Here is a number that should give you pause before adopting a reasoning model for your agent: a single query that costs 7 tokens with a standard fast model costs 255 tokens with Claude extended thinking and 603 tokens with an aggressively-configured reasoning model. For an isolated chatbot query, that is manageable. But inside an agent loop that calls the model twelve times per task, you are not paying a 10x premium — you are paying a 10x premium times twelve, compounded further by the growing context window that gets re-fed on every turn. Billing surprises have killed agent projects faster than accuracy problems.

The question is not whether reasoning models are better. On hard tasks, they clearly are. The question is whether they are better for your specific workload, at your specific position in the agent loop, and by a margin that justifies the cost. Most teams answer this incorrectly in both directions — they either apply reasoning models uniformly (burning budget on tasks that don't need them) or avoid them entirely (leaving accuracy gains on the table for the tasks that do).
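
A back-of-the-envelope sketch of that compounding, using the per-answer token counts and twelve-call loop quoted above; the starting context size and per-turn context growth are assumptions:

```python
# How the per-call premium compounds inside an agent loop when the whole
# history is re-fed on every turn. Growth numbers are illustrative.
TURNS = 12
CONTEXT_GROWTH_PER_TURN = 2_000        # tokens of tool output / history added each turn

def loop_tokens(answer_tokens: int) -> int:
    total = 0
    context = 1_000                    # starting prompt size (assumption)
    for _ in range(TURNS):
        total += context + answer_tokens     # entire history re-fed on every call
        context += CONTEXT_GROWTH_PER_TURN + answer_tokens
    return total

for label, per_answer in [("fast model", 7), ("extended thinking", 255), ("heavy reasoning", 603)]:
    print(f"{label:>18}: {loop_tokens(per_answer):,} tokens per task")
```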