
46 posts tagged with "architecture"


The LLM Pipeline Monolith vs. Chain Trade-off: When Task Decomposition Helps and When It Hurts

· 8 min read
Tian Pan
Software Engineer

Most teams building LLM pipelines reach for chaining almost immediately. A complex task gets split into steps — extract, then classify, then summarize, then format — and each step gets its own prompt. It feels right: smaller prompts are easier to write, easier to debug, and easier to iterate on. But here's what rarely gets asked: is a chain actually more accurate than doing the whole thing in one call? In most codebases I've seen, nobody measured.

The monolith vs. chain trade-off is one of the most consequential architectural decisions in AI engineering, and it's almost always made by instinct. This post breaks down what the empirical evidence says, when decomposition genuinely helps, when it quietly makes things worse, and what signals to watch for in production.
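A rough way to stop deciding by instinct, sketched below with a hypothetical `call_llm` helper and exact-match scoring standing in for whatever client and eval metric you actually use: run both shapes of the pipeline over the same labeled set and compare.

```python
# Illustrative only: call_llm() and the exact-match metric are placeholders.
def call_llm(prompt: str) -> str:
    raise NotImplementedError  # your OpenAI/Anthropic/local client goes here

def run_monolith(document: str) -> str:
    # The whole task in one prompt.
    return call_llm(f"Extract, classify, summarize, and format this as JSON:\n{document}")

def run_chain(document: str) -> str:
    # The same task decomposed into one prompt per step.
    fields = call_llm(f"Extract the key fields:\n{document}")
    label = call_llm(f"Classify this record:\n{fields}")
    summary = call_llm(f"Summarize this {label} record:\n{fields}")
    return call_llm(f"Format as JSON:\n{summary}")

def accuracy(pipeline, eval_set) -> float:
    # eval_set: (document, expected_output) pairs you already trust.
    return sum(pipeline(doc) == expected for doc, expected in eval_set) / len(eval_set)

# Compare accuracy(run_monolith, eval_set) against accuracy(run_chain, eval_set).
```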

The AI-Everywhere Antipattern: When Adding LLMs Makes Your Pipeline Worse

· 9 min read
Tian Pan
Software Engineer

There is a type of architecture that emerges at almost every company that ships an AI feature and then keeps shipping: a pipeline where every transformation, every routing decision, every classification, every formatting step passes through an LLM call. It usually starts with a legitimate use case. The LLM actually helps with one hard problem. Then the team, having internalized the pattern, reaches for it again. And again. Until the whole system is an LLM-to-LLM chain where a string of words flows in at one end and a different string of words comes out the other, with twelve API calls in between and no determinism anywhere.

This is the AI-everywhere antipattern, and it is now one of the most reliable ways to build a production system that is slow, expensive, and impossible to debug.

Browser-Native LLM Inference: The WebGPU Engineering You Didn't Know You Needed

· 10 min read
Tian Pan
Software Engineer

Most AI features are architected the same way: user input travels to an API, a cloud GPU processes it, and a response travels back. That round trip is so normalized that engineers rarely question it. But it carries a hidden tax: 200–800ms of network latency on every interaction, an API key that must live somewhere accessible (and therefore vulnerable), and a hard dependency on uptime you don't control.

Browser-native LLM inference via WebGPU breaks all three of those assumptions. The model runs on the user's GPU, inside a browser sandbox, with no network round-trip. This isn't a future capability — as of late 2025, WebGPU ships by default across Chrome, Firefox, Edge, and Safari, covering roughly 82.7% of global browser traffic. The engineering question has shifted from "can we do this?" to "when does it beat the cloud, and how do we route intelligently between the two?"

The Edge Inference Decision Framework: When to Run AI Models Locally Instead of in the Cloud

· 12 min read
Tian Pan
Software Engineer

Most teams make the cloud-vs-edge decision by gut instinct: cloud is easier, so they default to cloud. Then a HIPAA audit hits, or the latency SLO slips by 400ms, or the monthly invoice arrives. Only then do they ask whether some of that inference should have been local all along.

The answer is almost never "all cloud" or "all edge." The teams running production AI at scale have settled on a tiered architecture: an on-device or on-premise model handles the majority of requests, and a cloud frontier model catches what the smaller model can't. Getting that routing right is an engineering decision, not an intuition.

This is the decision framework for making it rigorously.
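To make the tiered idea concrete, here is a minimal sketch of that routing, assuming a local model that can report some confidence signal and a deterministic check for requests the small model shouldn't attempt; all names are illustrative.

```python
# Illustrative sketch: local_model, cloud_model, and the confidence signal are assumptions.
MAX_LOCAL_CONTEXT = 8_000  # e.g. the local model's usable context, in characters

def needs_frontier(request: str) -> bool:
    # Deterministic policy, written down rather than left to intuition.
    return len(request) > MAX_LOCAL_CONTEXT

def route(request: str, local_model, cloud_model, threshold: float = 0.8) -> str:
    if not needs_frontier(request):
        draft, confidence = local_model.generate_with_confidence(request)
        if confidence >= threshold:
            return draft                  # served locally: no round trip, no per-token bill
    return cloud_model.generate(request)  # escalate the hard minority to the cloud model
```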

Compound Failure Modes in AI Pipelines: When Partial Success Isn't Enough

· 9 min read
Tian Pan
Software Engineer

Most engineers building AI pipelines think about each component in isolation: how often does retrieval succeed, how often does the LLM do the right thing, how often does the downstream tool call land. If each answer comes back "95%," the system feels solid.

It isn't. Three components at 95% each give you an 86% reliable system. Add a fourth at 95% and you're at 81%. Add a fifth and you're down to roughly 77%. What felt like a solid stack of high-quality components produces a pipeline that fails more than one request in five before you've shipped a single feature.

That's the compound failure problem, and it's the calculation most AI engineering teams skip until users start filing tickets.
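The arithmetic is worth writing down once, if only to make the assumption explicit: multiply per-stage success rates, which treats failures as roughly independent.

```python
from math import prod

def pipeline_reliability(stage_success_rates: list[float]) -> float:
    # End-to-end success is the product of per-stage success rates
    # (assumes failures are roughly independent across stages).
    return prod(stage_success_rates)

print(pipeline_reliability([0.95] * 3))  # ~0.857
print(pipeline_reliability([0.95] * 4))  # ~0.815
print(pipeline_reliability([0.95] * 5))  # ~0.774 -> more than one request in five fails
```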

The Dependency Injection Pattern for AI Applications: Writing Code That Survives Model Swaps

· 9 min read
Tian Pan
Software Engineer

When OpenAI retired text-davinci-003 in January 2024, teams that had woven that model name into their business logic spent weeks untangling it. Not because swapping a model is technically hard — it's a string and an API call — but because the model was entangled with everything else: prompt construction, response parsing, error handling, retry logic, all intertwined with the assumption that one specific provider would answer. The engineering cost of that kind of migration has been estimated at $50K–$100K for mid-size production systems, plus a month or more of diverted engineering attention.

The fix isn't exotic. It's a pattern every backend engineer already knows: dependency injection. The insight is that your business logic should depend on an abstraction of a language model, not a concrete client from OpenAI or Anthropic. Inject the concrete implementation at startup. The rest of the code never knows which provider is behind the interface.
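A minimal sketch of the pattern in Python, with illustrative names; the adapter is written against the OpenAI SDK's chat-completions interface, but the point is that the business logic never imports it.

```python
from typing import Protocol

class LanguageModel(Protocol):
    def complete(self, prompt: str) -> str: ...

class OpenAIModel:
    """Adapter; Anthropic, a local model, or a test fake implements the same method."""
    def __init__(self, client, model: str = "gpt-4o"):
        self.client, self.model = client, model
    def complete(self, prompt: str) -> str:
        resp = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

class TicketTriager:
    def __init__(self, llm: LanguageModel):  # injected at startup, not imported here
        self.llm = llm
    def triage(self, ticket: str) -> str:
        return self.llm.complete(f"Classify this support ticket:\n{ticket}")

# Composition root: the only place a concrete provider appears.
# triager = TicketTriager(OpenAIModel(OpenAI()))
```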

The Provider Abstraction Tax: Building LLM Applications That Can Swap Models Without Rewrites

· 10 min read
Tian Pan
Software Engineer

A healthcare startup migrated from one major frontier model to a newer version of the same provider's offering. The result: 400+ engineering hours to restore feature parity. The new model emitted five times as many tokens per response, eliminating projected cost savings. It started offering unsolicited diagnostic opinions—a liability problem. And it broke every JSON parser downstream because it wrapped responses in markdown code fences. Same provider, different model, total rewrite.

This is the provider abstraction tax: not the cost of switching providers, but the cumulative cost of not planning for it. It is not a single migration event. It is an ongoing drain—the behavioral regressions you discover three weeks after an upgrade, the prompt engineering work that does not transfer across models, the retry logic that silently fails because one provider measures rate limits by input tokens separately from output tokens. Teams that build directly on a single provider accumulate this debt invisibly, until a deprecation notice or a pricing change makes the bill come due all at once.
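One small, concrete layer of that tax is the code-fence problem from the anecdote. A hedged sketch of a response normalizer, handling only the single-fence case, looks something like this:

```python
import json
import re

# Unwrap ```json ... ``` fences before parsing; naive on purpose, one fence only.
_FENCE = re.compile(r"^```(?:json)?\s*(.*?)\s*```$", re.DOTALL)

def parse_model_json(raw: str) -> dict:
    text = raw.strip()
    if (m := _FENCE.match(text)):
        text = m.group(1)
    return json.loads(text)

print(parse_model_json('```json\n{"tier": "Pro"}\n```'))  # {'tier': 'Pro'}
```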

Multi-Model Consistency: When Your Pipeline's Sequential LLM Calls Contradict Each Other

· 9 min read
Tian Pan
Software Engineer

Your summarization step decides a customer complaint is about billing. Your extraction step pulls "subscription tier: Pro." Your generation step writes a follow-up email referencing their "Enterprise plan." Three LLM calls, one pipeline, one completely broken output — and no error was raised anywhere along the way.

This is multi-model consistency failure: the silent killer of compound AI systems. It doesn't look like an exception. It doesn't trigger your error rate SLO. It just ships confidently wrong content to users.
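There is no single fix, but the cheapest countermeasure is to carry extracted facts forward as the single source of truth and check generations against them before shipping. A naive substring check, purely illustrative:

```python
def tier_contradiction(facts: dict, text: str) -> str | None:
    """Flags text that names a subscription tier other than the one extracted.
    Substring matching is crude; a real check would be field- and entity-aware."""
    tier = facts.get("subscription_tier")
    for other in {"Free", "Pro", "Enterprise"} - {tier}:
        if other.lower() in text.lower():
            return f"output mentions {other!r} but extraction said {tier!r}"
    return None

facts = {"subscription_tier": "Pro"}
email = "Thanks for reaching out about billing on your Enterprise plan."
print(tier_contradiction(facts, email))
# output mentions 'Enterprise' but extraction said 'Pro'
```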

Research Agent Design: Why Scientific Workflows Break Coding Agent Assumptions

· 10 min read
Tian Pan
Software Engineer

Most teams that build LLM-powered scientific tools make the same architectural mistake: they reach for a coding agent framework, swap in domain-specific tools, and call it a research agent. It isn't. Coding agents and research agents share surface-level mechanics — both call tools, both iterate — but their fundamental assumptions about success, state, and termination are almost perfectly inverted. Deploying a coding agent architecture in a scientific workflow doesn't just produce worse results; it produces confidently wrong results, and does so in ways that are nearly impossible to catch after the fact.

The distinction matters urgently now because research agent benchmarks are proliferating, teams are racing to build scientific AI, and the "just use a coding agent" shortcut is generating a wave of plausible-sounding tools that fail in production scientific contexts for reasons their builders don't fully understand.

The Hybrid Automation Stack: A Decision Framework for Mixing Rules and LLMs

· 9 min read
Tian Pan
Software Engineer

Teams that replace all their Zapier flows and RPA scripts with LLM agents tend to discover the same thing six months later: they've traded brittle-but-auditable for flexible-but-unmaintainable. The Zapier flows broke in predictable ways—step 14 failed because the API changed. The LLM workflows break invisibly—the model quietly routes support tickets to the wrong queue, and nobody finds out until a customer escalates. The audit log says "AI decision," which is lawyer-speak for "no one knows."

The answer isn't to avoid LLMs in automation. It's to be deliberate about which tasks go to which system, and to architect the seam between them so failures don't cross over.
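In code, the seam can be as small as a router that lets rules claim everything they can and records which system decided, with the model only seeing the leftovers; `classify_with_llm` here is a placeholder for your own call.

```python
RULES = {  # keyword -> queue; the auditable, deterministic tier
    "invoice": "billing",
    "refund": "billing",
    "password": "account-security",
}

def classify_with_llm(text: str) -> str:
    raise NotImplementedError  # stand-in for the model call

def route_ticket(text: str) -> dict:
    lowered = text.lower()
    for keyword, queue in RULES.items():
        if keyword in lowered:
            # The audit log can name the exact rule that fired.
            return {"queue": queue, "decided_by": f"rule:{keyword}"}
    # Only genuinely ambiguous tickets cross the seam, and they're labeled as such.
    return {"queue": classify_with_llm(text), "decided_by": "llm"}

print(route_ticket("I was charged twice on my last invoice"))
# {'queue': 'billing', 'decided_by': 'rule:invoice'}
```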

The Ambient AI Coherence Problem: When Every Feature Is AI-Powered, Nothing Feels Like One Product

· 9 min read
Tian Pan
Software Engineer

Most AI products get the individual features right and the product wrong. Search returns plausible results. The summary is coherent. The chat assistant gives reasonable advice. But when a user searches for "best plan for small teams," gets a recommendation in the sidebar, asks the assistant a follow-up question, and then reads an auto-generated summary of their options — and all four contradict each other — none of the features feel trustworthy anymore. This is the ambient AI coherence problem: not hallucination in isolation, but contradiction at the product level.

The failure mode is subtle enough that teams often miss it entirely. Individual feature evals look fine. The search team measures recall and precision. The summarization team measures faithfulness. The chat team measures task completion. Nobody measures whether the AI-powered features of the product tell the same story about the same facts.

The Inference Gateway Pattern: Why Every Production AI Team Builds the Same Middleware

· 8 min read
Tian Pan
Software Engineer

Every team shipping LLM-powered features goes through the same arc. First, you hardcode an OpenAI API call. Then you add a retry loop. Then someone asks how much you're spending. Then a provider goes down on a Friday afternoon, and suddenly you're building a gateway.

This isn't accidental. The inference gateway is an emergent architectural pattern — a middleware layer between your application and LLM providers that consolidates rate limiting, failover, cost tracking, prompt logging, and routing into a single chokepoint. It's the load balancer of the AI era, and if you're running models in production, you either have one or you're building one without realizing it.
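What that chokepoint minimally looks like, sketched with illustrative provider adapters and a crude word-count cost estimate standing in for real token accounting:

```python
import time

class Provider:
    def __init__(self, name: str, client, cost_per_1k_tokens: float):
        self.name, self.client, self.cost_per_1k = name, client, cost_per_1k_tokens
    def complete(self, prompt: str) -> str:
        return self.client.complete(prompt)  # thin adapter over the vendor SDK

class InferenceGateway:
    def __init__(self, providers: list[Provider]):
        self.providers = providers  # primary first, fallbacks after
        self.spend_usd = 0.0
    def complete(self, prompt: str, retries: int = 2) -> str:
        for provider in self.providers:           # failover across providers
            for attempt in range(retries):        # retry within a provider
                try:
                    text = provider.complete(prompt)
                    tokens = len(prompt.split()) + len(text.split())  # rough estimate
                    self.spend_usd += tokens / 1000 * provider.cost_per_1k
                    return text
                except Exception:
                    time.sleep(2 ** attempt)      # simple exponential backoff
        raise RuntimeError("all providers exhausted")
```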