Skip to main content

79 posts tagged with "architecture"

View all tags

AI System Design Advisor: What It Gets Right, What It Gets Confidently Wrong, and How to Tell the Difference

· 9 min read
Tian Pan
Software Engineer

A three-person team spent a quarter implementing event sourcing for an application serving 200 daily active users. The architecture was technically elegant. It was operationally ruinous. The design came from an AI recommendation, and the team accepted it because the reasoning was fluent, the tradeoff analysis sounded rigorous, and the system they ended up with looked exactly like the kind of thing you'd see on a senior engineer's architecture diagram.

That story is now a cautionary pattern, not an edge case. AI produces genuinely useful architectural input in specific, identifiable situations — and produces confidently wrong advice in situations that look nearly identical from the outside. The gap between them is not obvious if you approach AI as an answer machine. It becomes navigable if you approach it as a sparring partner.

Disconnected Agent Mode: Designing for the Network You Don't Have

· 11 min read
Tian Pan
Software Engineer

A flight attendant asks you to switch to airplane mode. The customer-support agent your team shipped last quarter is mid-conversation in a tab, and the next user turn returns a spinner that never resolves. The agent isn't broken in any interesting way. It just assumed, in a hundred unwritten places, that the network exists.

That assumption is the most expensive line of code your product team never wrote down. It governs how you store conversation state, how you call tools, how you surface errors, what you eval against, and what your users do when the connection drops in the middle of work that mattered to them. Disconnected agent mode is the discipline of pulling that assumption out of the foundation, looking at it, and deciding — explicitly — what should happen when the round trip to a hosted API isn't available.

Runtime Prompt Hot-Reload: Why Your Prompts Shouldn't Be Locked Behind a Build

· 11 min read
Tian Pan
Software Engineer

The first AI incident at most companies follows a script: a prompt-engineer notices the model is misclassifying a category that just started showing up in real traffic, opens a PR with a one-line tweak to the system prompt, and watches the build queue for the next 23 minutes while the model continues to misclassify in production. The fix is a string. The deployment is a binary. The mismatch is not a tooling oversight — it is an architectural decision the team made implicitly the day they put the system prompt in a .py file alongside the application code.

Coupling prompt changes to the deploy pipeline is a constraint you imposed on yourself. There is no law of distributed systems that says the model's behavior contract has to ship inside the same artifact as the orchestration code. The runtime prompt hot-reload pattern severs that coupling by treating prompts the way you already treat feature flags, routing rules, and pricing tables — as configuration pulled from a versioned store at request time, with a short-lived local cache and well-defined safety primitives around it. The payoff is incident-response measured in seconds rather than build minutes, and the cost is an honest accounting of a third deployment surface your release process probably ignores.

Skills as Modules: When Your Agent Stack Needs an Import System

· 10 min read
Tian Pan
Software Engineer

A team I talked to last month hit a bug that any seasoned package-manager user would recognize on sight. Two skills in their agent shipped the same search_orders capability — one came from a billing toolpack, one came from a CRM toolpack. Whichever had been added to the manifest most recently won. The agent silently called the wrong one for three weeks. Refunds went to the wrong customer IDs. Their fix, they told me, was a meeting with the CRM and billing engineers to "agree on naming." A meeting. To resolve a name conflict between two installable modules.

That's the moment I realized what's happening in agent runtimes right now. The runtime-loadable capability pattern — skills, tool packs, prompt fragments, retrieval providers, MCP servers — is converging on the same problem languages solved with import systems decades ago. Name resolution. Version pinning. Dependency graphs. Conflict detection. Lazy loading. And most agent runtimes are reinventing each one badly, or not at all, and shipping the bill to their users in the form of meetings.

When to Skip Real-Time LLM Inference: The Production Case for Async Batch Pipelines

· 10 min read
Tian Pan
Software Engineer

There's a team somewhere right now watching their LLM spend grow 10x month-over-month while their p99 latency hovers around four seconds. The engineers added more retries. The retries hit rate limits. The rate limits triggered fallbacks. The fallbacks are also LLM calls. Nobody paused to ask: does this feature actually need to respond in real time?

Most AI product teams architect for the happy path — user sends a message, model responds, user sees it. The synchronous call pattern is what the API SDK demonstrates in its first code sample, and so that's what ships. But a surprisingly large share of production LLM workloads have nothing to do with a user waiting at a keyboard. They're document enrichment jobs, content classification pipelines, embedding generation tasks, nightly digest generation, and background quality scoring. For those workloads, real-time inference is the wrong tool — and the price you pay for using it anyway is real money, cascading failures, and operational complexity you'll spend months untangling.

Hierarchical Memory Compaction: The Four Tiers Your Agent Memory Is Missing

· 11 min read
Tian Pan
Software Engineer

Most agent memory systems collapse a four-layer problem into two layers and then act surprised when the seams show. There is the conversation buffer that gets truncated when it overflows the context window, and there is the vector store of "long-term memory" that everything older than the buffer gets dumped into. That is not a memory architecture. That is a queue and a junk drawer.

The agent that re-asks a regular user the same onboarding question three Mondays in a row is not failing because the model is bad. It is failing because there is no place in the system that holds "things this user has told me across sessions" with a different lifetime than "things every user has ever told me about how the product works." Those are different memories. They have different access patterns, different privacy contracts, and different rules for when to forget. Conflating them is the architectural mistake — and it has a fix.

Agents as Cron Jobs: When Scheduled Triggers Beat Conversational Loops

· 10 min read
Tian Pan
Software Engineer

Most "agents" in production today are background jobs wearing a chat interface. They do not need a user typing into them. They need a trigger, a state file, and a way to resume after the inevitable timeout. The conversational loop — request, tool call, request, tool call, indefinitely — is a demo affordance that quietly became the default execution model, and it is the wrong model for the majority of agentic work that ships.

The decision is not philosophical. It shows up on the bill, in the on-call pager, and in the percentage of runs that finish at all. A conversational loop holds a model session open across many turns, accumulates context, and dies if any link in the chain fails. A scheduled trigger fires at a deterministic boundary, runs to completion or to a checkpoint, and writes its state somewhere durable before exiting. One is a phone call. The other is a job queue. Treating the two as interchangeable is how a $200/month feature becomes a $40,000/month feature without anyone changing the prompt.

Browser-Native AI Is a Per-Feature Decision: Four Axes Your Team Hasn't Priced

· 12 min read
Tian Pan
Software Engineer

The model-in-the-tab story used to be easy to dismiss: small models, novelty demos, a cute Whisper transcription that ran for thirty seconds before the laptop fan turned on. That story is dead. Quantization improved, WebGPU shipped in every major browser, on-device caches got a persistent quota, and 4-bit 3B models now stream tokens at a rate users perceive as "snappy" on a $500 laptop. The "should this run server-side?" question is no longer a default — it is a load-bearing architectural decision your product team is making by accident every time they accept the platform team's first answer.

The mistake that follows is bigger than the demo getting worse. Teams pick one backend — usually server inference, sometimes browser inference — for the entire product, and then pay the wrong tax on every feature that doesn't fit. The privacy-sensitive feature loses to the latency-sensitive one because the architecture forces a single answer. Or worse, the team picks browser-native because the demo was magical, then ships a fleet experience where 30% of users on the long-tail device population get a degraded product the dashboard can't see.

Browser-native AI is not a faster TensorFlow.js. It is a different runtime with a different SRE story, a different cost model, and a four-axis trade-off that does not collapse into a single answer. Treating it as "the cheap version of the API call" is the architectural mistake of 2026.

Retrieval Sprawl: When 'Just Add RAG' Becomes the Architectural Diversion

· 11 min read
Tian Pan
Software Engineer

The pattern is so familiar it's invisible. The model hallucinates a fact, so the team adds a retrieval step. Three weeks later, the model picks the wrong tool from a growing inventory, so they add a retrieval step on the tool catalog. The model's answers feel too generic, so they add a retrieval step on past good answers. A quarter passes, and the system is now a pile of retrievers gluing together a prompt that, fundamentally, still has the original problem.

What changed isn't the failure rate — it's the failure mode's name. "Model wrong" became "retrieval missed," which sounds more tractable but isn't. The eval suite scores higher because the retrieved context is, by construction, in-distribution for the test set. Production tells a different story, but by then the architecture has three retrieval layers, each with its own embedding model, index refresh cadence, and on-call rotation, and nobody wants to be the engineer who proposes ripping them out.

This is retrieval sprawl. It's an architectural diversion: a way of moving a hard problem (prompt design, model capability, ambiguous specifications) into a more comfortable problem (information retrieval engineering) without actually solving anything.

The Agent Finished Into an Empty Room: Stale-Context Delivery for Async Background Tasks

· 10 min read
Tian Pan
Software Engineer

A background agent that takes ninety seconds to finish a task is operating on a snapshot of the world from ninety seconds ago. By the time it returns, the user may have navigated to a different view, started a new conversation, archived the original request, or closed the tab entirely. Most agent frameworks ship the result anyway, mutate state to reflect it, and treat the round trip as a success. It is not a success. It is the agent finishing into an empty room.

The failure mode is uglier than dropping the result. A dropped result is a missed delivery — annoying but recoverable. An applied stale result is an answer to a question the user is no longer asking, written against state that no longer matches, often overwriting the work the user moved on to. The user notices that something they did not ask for has happened, cannot reconstruct why, and loses trust in the system in a way that a simple timeout never would.

The fix is not faster agents. It is a delivery-time relevance gate that treats the moment of return as a fresh decision, not the foregone conclusion of the moment of dispatch.

Agent Memory Drift: Why Reconciliation Is the Loop You're Missing

· 11 min read
Tian Pan
Software Engineer

The most dangerous thing your long-running agent does is also the thing it does most confidently: answer from memory. The customer's address changed last Tuesday. The ticket the agent thinks is "open" was closed yesterday by a human. The product feature the agent has tidy explanatory notes about shipped in a different shape than the spec the agent read three weeks ago. None of this is hallucination in the textbook sense — the model is recalling exactly what it stored. The world simply moved while the agent was looking elsewhere.

Most teams treat memory like a write problem: what should the agent remember, how do we summarize, what's the embedding strategy, how do we keep the store from blowing up. That framing produces architectures that grow more confident as they grow more wrong. The harder problem — the one that determines whether your agent stays useful past week three — is reconciliation: the explicit, ongoing loop that compares what the agent thinks is true against what the underlying systems say is true right now.

The Batch-Tier Inference Question: When 50% Off Reshapes Your Architecture

· 11 min read
Tian Pan
Software Engineer

The cheapest inference dollar in your bill is the one you're paying twice. Every major model provider now offers a batch tier at roughly half the price of synchronous inference in exchange for accepting a completion window measured in hours rather than milliseconds. Most engineering organizations either ignore the option entirely, or shove a single nightly cron at it and declare the savings booked. Both responses leave 30–50% of total inference spend on the floor — not because the discount is small, but because batch isn't a coupon. It is a different product surface with its own SLAs, its own retry semantics, and its own failure modes, and the teams that treat it as a billing optimization end up either underusing it or shipping subtle regressions that take weeks to attribute.

The technical question is not "should we use batch?" The technical question is which actions in your system are actually synchronous in the user-perceived sense, which ones the engineering org has accidentally treated as synchronous because the developer experience was easier, and which ones can be re-shaped into jobs without a downstream consumer assuming the result is fresh. Answering that requires a workload audit, an architectural shift from request-shaped to job-shaped contracts, and an honest mapping of every agent action to a latency tier based on user expectation rather than developer convenience.