
86 posts tagged with "architecture"


LLM Vendor Lock-In Is a Spectrum, Not a Binary

· 10 min read
Tian Pan
Software Engineer

A team builds a production feature on GPT-4. Months later, they decide to evaluate Claude for cost reasons. They spend two weeks "migrating"—but the core API swap takes an afternoon. The remaining ten days go toward fixing broken system prompts, re-testing refusal edge cases, debugging JSON parsers that choke on unexpected prose, and re-tuning tool-calling schemas that behave differently across providers. Migration estimates that assumed a simple connector swap balloon into a multi-layer rebuild.

This is the LLM vendor lock-in problem in practice. And the teams that get burned aren't the ones who chose the wrong provider—they're the ones who didn't recognize that lock-in exists on multiple axes, each with a different risk profile.

Multi-Tenant AI Systems: Isolation, Customization, and Cost Attribution at Scale

· 10 min read
Tian Pan
Software Engineer

Most teams building SaaS products on top of LLMs discover the multi-tenancy problem the hard way: they ship fast using a single shared prompt config, then watch in horror as one customer's system prompt leaks into another's response, one enterprise client burns through everyone's rate limit, or the monthly AI bill arrives with no way to determine which customer caused 40% of the spend. The failure mode isn't theoretical—a 2025 paper at NDSS demonstrated that prefix caching in vLLM, SGLang, LightLLM, and DeepSpeed could be exploited to reconstruct another tenant's prompt with 99% accuracy using nothing more than timing signals and crafted requests.

Building multi-tenant AI infrastructure is not the same as multi-tenanting a traditional database. The shared components—inference servers, KV caches, embedding pipelines, retrieval indexes—each present distinct isolation challenges. This post covers the four problems you actually have to solve: isolation, customization, cost attribution, and per-tenant quality tracking.

The Orchestration Framework Trap: When LangChain Makes You Slower to Ship

· 8 min read
Tian Pan
Software Engineer

At some point in 2024, a pattern started appearing in engineering postmortems across AI teams: "We rewrote it without LangChain and shipping became significantly faster." The teams in these postmortems hadn't made a technical mistake in adopting the framework — they'd made a timing mistake. LangChain was the right tool for the prototype and the wrong tool for month seven.

The same story played out enough times that it has a name now: the orchestration framework trap. You adopt a framework that genuinely accelerates early work, and the productivity gain masks a growing structural debt. By the time the debt is visible, you're deep in internals that were never meant to be touched.

Chatbot, Copilot, or Agent: The Taxonomy That Changes Your Architecture

· 10 min read
Tian Pan
Software Engineer

The most expensive architectural mistake in AI engineering is not picking the wrong model. It's picking the wrong interaction paradigm. Teams that should be building an agent spend six months refining a chatbot, then wonder why users can't get anything done. Teams that should be building a copilot wire up full agentic autonomy and spend the next quarter firefighting unauthorized actions and runaway costs.

The taxonomy matters before you write a single line of code, because chatbots, copilots, and agents have fundamentally different trust models, context-window strategies, and error-recovery requirements. Getting this wrong doesn't just produce a worse product — it produces a product that cannot be fixed by tuning prompts or swapping models.

The LLM Pipeline Monolith vs. Chain Trade-off: When Task Decomposition Helps and When It Hurts

· 8 min read
Tian Pan
Software Engineer

Most teams building LLM pipelines reach for chaining almost immediately. A complex task gets split into steps — extract, then classify, then summarize, then format — and each step gets its own prompt. It feels right: smaller prompts are easier to write, easier to debug, and easier to iterate on. But here's what rarely gets asked: is a chain actually more accurate than doing the whole thing in one call? In most codebases I've seen, nobody measured.

The monolith vs. chain trade-off is one of the most consequential architectural decisions in AI engineering, and it's almost always made by instinct. This post breaks down what the empirical evidence says, when decomposition genuinely helps, when it quietly makes things worse, and what signals to watch for in production.

The AI-Everywhere Antipattern: When Adding LLMs Makes Your Pipeline Worse

· 9 min read
Tian Pan
Software Engineer

There is a type of architecture that emerges at almost every company that ships an AI feature and then keeps shipping: a pipeline where every transformation, every routing decision, every classification, every formatting step passes through an LLM call. It usually starts with a legitimate use case. The LLM actually helps with one hard problem. Then the team, having internalized the pattern, reaches for it again. And again. Until the whole system is an LLM-to-LLM chain where a string of words flows in at one end and a different string of words comes out the other, with twelve API calls in between and no determinism anywhere.

This is the AI-everywhere antipattern, and it is now one of the most reliable ways to build a production system that is slow, expensive, and impossible to debug.

Browser-Native LLM Inference: The WebGPU Engineering You Didn't Know You Needed

· 10 min read
Tian Pan
Software Engineer

Most AI features are architected the same way: user input travels to an API, a cloud GPU processes it, and a response travels back. That round trip is so normalized that engineers rarely question it. But it carries a hidden tax: 200–800ms of network latency on every interaction, an API key that must live somewhere accessible (and therefore vulnerable), and a hard dependency on uptime you don't control.

Browser-native LLM inference via WebGPU breaks all three of those assumptions. The model runs on the user's GPU, inside a browser sandbox, with no network round-trip. This isn't a future capability — as of late 2025, WebGPU ships by default across Chrome, Firefox, Edge, and Safari, covering roughly 82.7% of global browser traffic. The engineering question has shifted from "can we do this?" to "when does it beat the cloud, and how do we route intelligently between the two?"

The Edge Inference Decision Framework: When to Run AI Models Locally Instead of in the Cloud

· 12 min read
Tian Pan
Software Engineer

Most teams make the cloud-vs-edge decision by gut instinct: cloud is easier, so they default to cloud. Then a HIPAA audit hits, or the latency SLO slips by 400ms, or the monthly invoice arrives. Only then do they ask whether some of that inference should have been local all along.

The answer is almost never "all cloud" or "all edge." The teams running production AI at scale have settled on a tiered architecture: an on-device or on-premise model handles the majority of requests, and a cloud frontier model catches what the smaller model can't. Getting that routing right is an engineering decision, not an intuition.

This is the decision framework for making it rigorously.
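As a rough illustration of the tiered routing described above, here is a minimal sketch. It assumes the local model returns a confidence score alongside its answer, which is an assumption of this example rather than something the post specifies; `route`, `tiny_local`, `frontier_cloud`, and the `min_confidence` threshold are all hypothetical names.

```python
from typing import Callable, Tuple

# Hypothetical signatures: the local model returns (answer, confidence in [0, 1]),
# the cloud model returns just an answer.
LocalModel = Callable[[str], Tuple[str, float]]
CloudModel = Callable[[str], str]


def route(
    prompt: str,
    local_model: LocalModel,
    cloud_model: CloudModel,
    min_confidence: float = 0.8,  # assumed threshold, tune against real traffic
) -> Tuple[str, str]:
    """Try the local tier first; escalate to the cloud tier when confidence is low."""
    answer, confidence = local_model(prompt)
    if confidence >= min_confidence:
        return answer, "local"
    return cloud_model(prompt), "cloud"


# Stub models standing in for real inference backends.
def tiny_local(prompt: str) -> Tuple[str, float]:
    # Pretend the small model is only confident on short prompts.
    return prompt.upper(), 0.9 if len(prompt) < 20 else 0.3


def frontier_cloud(prompt: str) -> str:
    return f"cloud answer for: {prompt}"


print(route("short question", tiny_local, frontier_cloud))
print(route("a much longer and more complicated question", tiny_local, frontier_cloud))
```

Returning the tier name with the answer makes it possible to log what fraction of requests escalates, which is the number the routing threshold would be tuned against.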

Compound Failure Modes in AI Pipelines: When Partial Success Isn't Enough

· 9 min read
Tian Pan
Software Engineer

Most engineers building AI pipelines think about each component in isolation: how often does retrieval succeed, how often does the LLM do the right thing, how often does the downstream tool call land. If each answer comes back "95%," the system feels solid.

It isn't. Three components at 95% each give you an 86% reliable system. Add a fourth at 95% and you're at 81%. Add a fifth and you're at about 77%. What felt like a solid stack of high-quality components produces a pipeline that fails more than one in five requests before you've shipped a single feature.

That's the compound failure problem, and it's the calculation most AI engineering teams skip until users start filing tickets.
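The compounding arithmetic can be checked directly. This small sketch assumes each stage fails independently, so the end-to-end success rate is the product of the per-stage rates; `pipeline_reliability` is a name made up for this example.

```python
from math import prod


def pipeline_reliability(stage_success_rates):
    """Probability that every stage succeeds, assuming independent failures."""
    return prod(stage_success_rates)


# Three, four, and five stages at 95% each.
for stages in range(3, 6):
    rate = pipeline_reliability([0.95] * stages)
    print(f"{stages} stages: {rate:.4f}")
# prints:
# 3 stages: 0.8574
# 4 stages: 0.8145
# 5 stages: 0.7738
```

Real pipelines rarely have fully independent failures, so treat the product as a first estimate rather than a measurement.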

The Dependency Injection Pattern for AI Applications: Writing Code That Survives Model Swaps

· 9 min read
Tian Pan
Software Engineer

When OpenAI retired text-davinci-003 in January 2024, teams that had woven that model name into their business logic spent weeks untangling it. Not because swapping a model is technically hard — it's a string and an API call — but because the model was entangled with everything else: prompt construction, response parsing, error handling, retry logic, all intertwined with the assumption that one specific provider would answer. The engineering cost of that kind of migration has been estimated at $50K–$100K for mid-size production systems, plus a month or more of diverted engineering attention.

The fix isn't exotic. It's a pattern every backend engineer already knows: dependency injection. The insight is that your business logic should depend on an abstraction of a language model, not a concrete client from OpenAI or Anthropic. Inject the concrete implementation at startup. The rest of the code never knows which provider is behind the interface.
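A minimal sketch of that shape, using only the standard library. The names here (`LanguageModel`, `FakeModel`, `TicketSummarizer`) are hypothetical and chosen for illustration; a real adapter would wrap a provider SDK behind the same `complete` method.

```python
from dataclasses import dataclass
from typing import Protocol


class LanguageModel(Protocol):
    """The abstraction business logic depends on (hypothetical interface)."""

    def complete(self, prompt: str) -> str: ...


@dataclass
class FakeModel:
    """Stand-in implementation; a provider-specific adapter would replace it."""

    reply: str

    def complete(self, prompt: str) -> str:
        return self.reply


class TicketSummarizer:
    """Business logic: knows about prompts, not about providers."""

    def __init__(self, model: LanguageModel) -> None:
        self._model = model  # injected at startup

    def summarize(self, ticket_text: str) -> str:
        prompt = f"Summarize this support ticket in one sentence:\n{ticket_text}"
        return self._model.complete(prompt).strip()


# Wiring happens in one place; swapping providers means changing this line only.
summarizer = TicketSummarizer(FakeModel(reply="  Customer was double-billed.  "))
print(summarizer.summarize("I was charged twice this month."))
```

Because `LanguageModel` is a structural `Protocol`, any object with a matching `complete` method satisfies it, which also makes the business logic testable without network calls.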

The Provider Abstraction Tax: Building LLM Applications That Can Swap Models Without Rewrites

· 10 min read
Tian Pan
Software Engineer

A healthcare startup migrated from one major frontier model to a newer version of the same provider's offering. The result: 400+ engineering hours to restore feature parity. The new model emitted five times as many tokens per response, eliminating projected cost savings. It started offering unsolicited diagnostic opinions—a liability problem. And it broke every JSON parser downstream because it wrapped responses in markdown code fences. Same provider, different model, total rewrite.

This is the provider abstraction tax: not the cost of switching providers, but the cumulative cost of not planning for it. It is not a single migration event. It is an ongoing drain—the behavioral regressions you discover three weeks after an upgrade, the prompt engineering work that does not transfer across models, the retry logic that silently fails because one provider measures rate limits by input tokens separately from output tokens. Teams that build directly on a single provider accumulate this debt invisibly, until a deprecation notice or a pricing change makes the bill come due all at once.
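The markdown-fence failure mentioned above is one of the cheaper ones to defend against. The following is a sketch of a tolerant parser, assuming the model's whole response is either raw JSON or JSON wrapped in a single code fence; `parse_model_json` is a hypothetical helper, not part of any provider SDK.

```python
import json
import re

# Built this way so the example does not contain a literal fence marker.
TICKS = "`" * 3

# Matches a response that is entirely one fenced block, with an optional language tag.
_FENCED = re.compile(
    rf"^{TICKS}[A-Za-z0-9_-]*[ \t]*\n(.*?)\n?{TICKS}\s*$", re.DOTALL
)


def parse_model_json(raw: str):
    """Parse JSON from a model response, unwrapping a markdown fence if present."""
    text = raw.strip()
    match = _FENCED.match(text)
    if match:
        text = match.group(1)
    # Still raises json.JSONDecodeError on anything that is not valid JSON.
    return json.loads(text)


fenced = f"{TICKS}json\n" + '{"status": "ok"}' + f"\n{TICKS}"
print(parse_model_json(fenced))
print(parse_model_json('{"status": "ok"}'))
```

This handles only the fence case. Responses that mix prose and JSON would need a stricter contract with the model, such as a provider's structured-output mode where one is available.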

Multi-Model Consistency: When Your Pipeline's Sequential LLM Calls Contradict Each Other

· 9 min read
Tian Pan
Software Engineer

Your summarization step decides a customer complaint is about billing. Your extraction step pulls "subscription tier: Pro." Your generation step writes a follow-up email referencing their "Enterprise plan." Three LLM calls, one pipeline, one completely broken output — and no error was raised anywhere along the way.

This is multi-model consistency failure: the silent killer of compound AI systems. It doesn't look like an exception. It doesn't trigger your error rate SLO. It just ships confidently wrong content to users.