7 posts tagged with "ai-architecture"

Long-Context vs RAG in 2026: Why It Is a Per-Feature Decision, Not an Architecture Religion

· 13 min read
Tian Pan
Software Engineer

The economics of long-context vs RAG have flipped twice in two years, and the team that picked an architecture in either of those windows is now paying the wrong tax everywhere. In 2024 the trend line said "stuff everything in the context window" because the windows kept growing and the per-token price kept falling, so retrieval pipelines were dismissed as legacy plumbing. In 2025 the consensus reversed: context rot research showed that the effective recall on million-token prompts collapsed in the middle of the window, latency on full-window calls turned into a UX problem, and the bills came back loud, so retrieval was rehabilitated. By 2026 the right answer is neither slogan. It is a per-feature decision, made at design time with a four-axis trade-off written down, because picking one architecture for the whole product is the cheap way to be wrong on every feature at once.

The mental model that keeps biting teams is treating long-context vs RAG as a roadmap commitment instead of a per-surface choice. You read one influential blog, you pick a side, you hire engineers who specialize in that side, you write a platform doc that codifies it, and now every new feature gets the same architecture regardless of whether it fits. The features that need fresh data live with stale context. The features that need scalable corpora pay for retrieval infrastructure they will never use. The features that need citation provenance ship without it. None of these are bugs. They are the predictable cost of treating a feature-level decision as a product-level one.
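
To make the per-feature framing concrete, here is a minimal sketch of what "writing the trade-off down" could look like. The four axis names (freshness, corpus size, provenance, latency budget), the thresholds, and the `FeatureProfile` record are illustrative assumptions, not a prescribed rubric:

```python
from dataclasses import dataclass

# Hypothetical per-feature decision record. The axes and thresholds below are
# illustrative, not an official rubric.
@dataclass
class FeatureProfile:
    name: str
    needs_fresh_data: bool      # does the corpus change between calls?
    corpus_tokens: int          # total tokens the feature must "know"
    needs_citations: bool       # must answers point back at sources?
    latency_budget_ms: int      # interactive surface vs. background job

def recommend(p: FeatureProfile, context_window: int = 200_000) -> str:
    """Toy heuristic: any axis long-context cannot satisfy pushes the feature to RAG."""
    if p.needs_fresh_data or p.needs_citations:
        return "rag"
    if p.corpus_tokens > context_window // 2:   # leave room for instructions and output
        return "rag"
    if p.latency_budget_ms < 2_000 and p.corpus_tokens > 50_000:
        return "rag"                            # full-window calls are too slow here
    return "long-context"

print(recommend(FeatureProfile("contract-qa", False, 30_000, False, 5_000)))      # long-context
print(recommend(FeatureProfile("support-search", True, 5_000_000, True, 1_500)))  # rag
```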

Your System Prompt Will Leak: Designing for Prompt Extraction

· 10 min read
Tian Pan
Software Engineer

The threat model for LLM features over-indexes on three failure modes: prompt injection, user-data exfiltration, and unauthorized tool calls. There is a quieter attack that lands more often, costs less to mount, and shows up in fewer postmortems because nobody filed one — prompt extraction. An adversarial user, sometimes a competitor, sometimes a curious researcher, walks the model into reciting its own system prompt over a handful of turns. The carefully tuned instructions that encode your team's product behavior, refusal policy, retrieval scaffolding, and brand voice land in a public GitHub repository within the week.

The repositories already exist. A widely-circulated GitHub project tracks extracted system prompts from Claude, ChatGPT, Gemini, Grok, Perplexity, Cursor, and v0.dev — updated as new model versions ship, often within hours of release. Anthropic's full Claude prompt clocks in at over 24,000 tokens including tools, and you can read it. The companies most invested in prompt secrecy are the ones whose prompts leak most reliably, because they are also the ones whose attackers are most motivated.
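
One design consequence is to treat the system prompt as if it were already public: nothing secret in the text, and nothing load-bearing that a persuadable model could be talked out of enforcing. A minimal sketch of that posture, with hypothetical secret patterns and a hypothetical tool handler:

```python
import re

# Patterns that must never appear in a prompt that can be extracted.
# These regexes are illustrative examples, not a complete list.
SECRET_PATTERNS = [
    r"sk-[A-Za-z0-9]{20,}",                 # API-key-shaped strings
    r"https?://[\w.-]*internal[\w./-]*",    # internal endpoints
    r"(?i)password|shared secret",
]

def assert_prompt_is_leak_safe(system_prompt: str) -> None:
    """Fail CI if the system prompt contains material that cannot afford to leak."""
    for pattern in SECRET_PATTERNS:
        if re.search(pattern, system_prompt):
            raise ValueError(f"system prompt contains material that cannot leak: {pattern}")

# Authorization lives in the tool handler, not in prompt instructions the model
# could be walked out of. Hypothetical handler; `user.is_admin` is assumed.
def delete_account(user, target_id: str) -> None:
    if not user.is_admin:        # enforced in code regardless of what the prompt says
        raise PermissionError("not allowed")
    ...
```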

Hot-Path vs. Cold-Path AI: The Architectural Decision That Decides Your p99

· 10 min read
Tian Pan
Software Engineer

Every AI feature you ship makes an architectural choice before it makes a product one: does this model call live inside the user's request, or does it run somewhere the user isn't waiting for it? The choice is usually made by whoever writes the first prototype, never revisited, and silently determines your p99 latency for the rest of the feature's life. When the post-mortem asks why a shipping dashboard became unusable at 10 a.m. every Monday, the answer is almost always that something which should have been cold-path got welded into the hot path — and a model that is fine at p50 becomes catastrophic at p99 when traffic fans out.

The hot-path / cold-path distinction is older than LLMs. CQRS, streaming architectures, lambda architectures — they all draw the same line between "must respond now" and "can arrive eventually." What's different about AI workloads is that the cost of crossing the line in the wrong direction is an order of magnitude higher than it used to be. A synchronous database query that takes 50 ms turning into 200 ms is a regression. A synchronous LLM call that takes 1.2 s at p50 turning into 11 s at p99 is a business decision you didn't know you made.
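
A compressed sketch of the two shapes, with the model call and the stores reduced to in-process stand-ins; the only point is where the call sits relative to the request:

```python
import queue
import threading
import time

SUMMARIES: dict[str, str] = {}        # stands in for a cache or summary table

def summarize_with_llm(order_id: str) -> str:
    time.sleep(1.2)                   # pretend p50 latency; p99 is far worse
    return f"summary of {order_id}"

# Hot path: the model call sits inside the request, so its p99 becomes your p99.
def get_order_hot(order_id: str) -> dict:
    return {"order": order_id, "summary": summarize_with_llm(order_id)}

# Cold path: the request only enqueues work and serves whatever was precomputed.
jobs: queue.Queue = queue.Queue()

def get_order_cold(order_id: str) -> dict:
    jobs.put(order_id)                # model runs later, off the request
    return {"order": order_id, "summary": SUMMARIES.get(order_id, "pending")}

def worker() -> None:
    while True:
        order_id = jobs.get()
        SUMMARIES[order_id] = summarize_with_llm(order_id)

threading.Thread(target=worker, daemon=True).start()
```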

Compound AI Systems: Why Your Best Architecture Uses Three Models, Not One

· 10 min read
Tian Pan
Software Engineer

The instinct is always to reach for the biggest model. Pick the frontier model, point it at the problem, and hope that raw capability compensates for architectural laziness. It works in demos. It fails in production.

The teams shipping the most reliable AI systems in 2025 and 2026 aren't using one model. They're composing three, four, sometimes five specialized models into pipelines where each component does exactly one thing well. A classifier routes. A generator produces. A verifier checks. The result is a system that outperforms any single model while costing a fraction of what a frontier-model-for-everything approach would.
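
As a rough sketch of that shape (the `call_model` helper and the model names below are placeholders, not any particular provider's API):

```python
def call_model(model: str, prompt: str) -> str:
    """Stand-in for a provider call; returns a canned response in this sketch."""
    return "..."

def answer(question: str) -> str:
    # 1. A small, cheap classifier decides which path the query takes.
    route = call_model("small-classifier", f"Label as 'faq' or 'complex': {question}")

    # 2. Only the hard path pays for the large generator.
    generator = "frontier-generator" if "complex" in route else "mid-tier-generator"
    draft = call_model(generator, question)

    # 3. A separate verifier checks the draft instead of trusting the generator.
    verdict = call_model("small-verifier", f"Does this answer the question?\n{question}\n{draft}")
    if "no" in verdict.lower():
        draft = call_model("frontier-generator", question)   # escalate only on failure
    return draft
```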

LLM Provider Lock-in: The Portability Patterns That Actually Work

· 8 min read
Tian Pan
Software Engineer

Everyone talks about avoiding LLM vendor lock-in. The advice usually boils down to "use an abstraction layer" — as if swapping openai.chat.completions.create for litellm.completion solves the problem. It doesn't. The API call is the easy part. The real lock-in is invisible: it lives in your prompts, your evaluation data, your tool-calling assumptions, and the behavioral quirks you've unconsciously designed around.

Provider portability isn't a boolean. It's a spectrum, and most teams are further from the portable end than they think. The good news is that the patterns for genuine portability are well understood — they just require more discipline than dropping in a wrapper library.
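
The wrapper-library point still leaves a useful kernel: application code talks to an interface, and the same eval set runs against every adapter so a provider switch is measured rather than assumed. A minimal sketch, with adapter internals stubbed out and all names illustrative:

```python
from typing import Protocol

class ChatModel(Protocol):
    def complete(self, messages: list[dict[str, str]]) -> str: ...

class ProviderA:
    def complete(self, messages: list[dict[str, str]]) -> str:
        return "stubbed response from provider A"   # real provider call lives only here

class ProviderB:
    def complete(self, messages: list[dict[str, str]]) -> str:
        return "stubbed response from provider B"

def run_evals(model: ChatModel, cases: list[tuple[str, str]]) -> float:
    """Score any adapter against the same eval set; portability is measured, not assumed."""
    hits = sum(expected in model.complete([{"role": "user", "content": q}])
               for q, expected in cases)
    return hits / len(cases)

print(run_evals(ProviderA(), [("What is 2+2?", "4")]))
```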

Stateful vs. Stateless AI Features: The Architectural Decision That Shapes Everything Downstream

· 12 min read
Tian Pan
Software Engineer

When a shopping assistant recommended baby products to a user who had mentioned a pregnancy two years earlier, nothing threw an exception. The system worked exactly as designed. The LLM returned a confident response with HTTP 200. The bug was in the data — a stale memory that was never invalidated — and it was completely invisible until a customer complained. That's the ghost that lives in stateful AI systems, and it behaves nothing like the bugs you're used to debugging.

The decision between stateful and stateless AI features looks deceptively simple on the surface. In practice, it's one of the earliest architectural choices you'll make for an AI product, and it propagates consequences through your storage layer, your debugging toolchain, your security posture, and your operational costs. Most teams make this decision implicitly, by defaulting to one pattern without examining the tradeoffs. This post is about making it explicitly.
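
One small illustration of the stateful tax: remembered facts need provenance and an expiry, or the pregnancy mention above lives forever. A sketch under assumed field names and an arbitrary 180-day window:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Every remembered fact carries a timestamp and a max age; expired facts never
# reach the prompt. Field names and the 180-day window are illustrative.
@dataclass
class MemoryFact:
    text: str
    recorded_at: datetime
    max_age: timedelta = timedelta(days=180)

    def is_fresh(self, now: datetime) -> bool:
        return now - self.recorded_at <= self.max_age

def memories_for_prompt(facts: list[MemoryFact]) -> list[str]:
    now = datetime.now(timezone.utc)
    return [f.text for f in facts if f.is_fresh(now)]   # stale facts are invalidated here

facts = [
    MemoryFact("expecting a baby", datetime(2024, 3, 1, tzinfo=timezone.utc)),
    MemoryFact("prefers organic groceries", datetime.now(timezone.utc)),
]
print(memories_for_prompt(facts))    # only the recent preference survives
```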

The Reasoning Model Premium in Agent Loops: When Thinking Pays and When It Doesn't

· 10 min read
Tian Pan
Software Engineer

Here is a number that should give you pause before adopting a reasoning model for your agent: a single query that costs 7 tokens with a standard fast model costs 255 tokens with Claude extended thinking and 603 tokens with an aggressively-configured reasoning model. For an isolated chatbot query, that is manageable. But inside an agent loop that calls the model twelve times per task, you are not paying a 10x premium — you are paying a 10x premium times twelve, compounded further by the growing context window that gets re-fed on every turn. Billing surprises have killed agent projects faster than accuracy problems.

The question is not whether reasoning models are better. On hard tasks, they clearly are. The question is whether they are better for your specific workload, at your specific position in the agent loop, and by a margin that justifies the cost. Most teams answer this incorrectly in both directions — they either apply reasoning models uniformly (burning budget on tasks that don't need them) or avoid them entirely (leaving accuracy gains on the table for the tasks that do).
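
A back-of-the-envelope sketch of that compounding, using the per-answer token counts and twelve-call loop quoted above; the starting context size and per-turn context growth are assumptions:

```python
# How the per-call premium compounds inside an agent loop when the whole
# history is re-fed on every turn. Growth numbers are illustrative.
TURNS = 12
CONTEXT_GROWTH_PER_TURN = 2_000        # tokens of tool output / history added each turn

def loop_tokens(answer_tokens: int) -> int:
    total = 0
    context = 1_000                    # starting prompt size (assumption)
    for _ in range(TURNS):
        total += context + answer_tokens     # entire history re-fed on every call
        context += CONTEXT_GROWTH_PER_TURN + answer_tokens
    return total

for label, per_answer in [("fast model", 7), ("extended thinking", 255), ("heavy reasoning", 603)]:
    print(f"{label:>18}: {loop_tokens(per_answer):,} tokens per task")
```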