87 posts tagged with "architecture"

When to Skip Real-Time LLM Inference: The Production Case for Async Batch Pipelines

May 2, 2026 · 10 min read

Software Engineer

There's a team somewhere right now watching their LLM spend grow 10x month-over-month while their p99 latency hovers around four seconds. The engineers added more retries. The retries hit rate limits. The rate limits triggered fallbacks. The fallbacks are also LLM calls. Nobody paused to ask: does this feature actually need to respond in real time?

Most AI product teams architect for the happy path — user sends a message, model responds, user sees it. The synchronous call pattern is what the API SDK demonstrates in its first code sample, and so that's what ships. But a surprisingly large share of production LLM workloads have nothing to do with a user waiting at a keyboard. They're document enrichment jobs, content classification pipelines, embedding generation tasks, nightly digest generation, and background quality scoring. For those workloads, real-time inference is the wrong tool — and the price you pay for using it anyway is real money, cascading failures, and operational complexity you'll spend months untangling.

Hierarchical Memory Compaction: The Four Tiers Your Agent Memory Is Missing

May 1, 2026 · 11 min read

Tian Pan

Software Engineer

Most agent memory systems collapse a four-layer problem into two layers and then act surprised when the seams show. There is the conversation buffer that gets truncated when it overflows the context window, and there is the vector store of "long-term memory" that everything older than the buffer gets dumped into. That is not a memory architecture. That is a queue and a junk drawer.

The agent that re-asks a regular user the same onboarding question three Mondays in a row is not failing because the model is bad. It is failing because there is no place in the system that holds "things this user has told me across sessions" with a different lifetime than "things every user has ever told me about how the product works." Those are different memories. They have different access patterns, different privacy contracts, and different rules for when to forget. Conflating them is the architectural mistake — and it has a fix.

Agents as Cron Jobs: When Scheduled Triggers Beat Conversational Loops

April 30, 2026 · 10 min read

Tian Pan

Software Engineer

Most "agents" in production today are background jobs wearing a chat interface. They do not need a user typing into them. They need a trigger, a state file, and a way to resume after the inevitable timeout. The conversational loop — request, tool call, request, tool call, indefinitely — is a demo affordance that quietly became the default execution model, and it is the wrong model for the majority of agentic work that ships.

The decision is not philosophical. It shows up on the bill, in the on-call pager, and in the percentage of runs that finish at all. A conversational loop holds a model session open across many turns, accumulates context, and dies if any link in the chain fails. A scheduled trigger fires at a deterministic boundary, runs to completion or to a checkpoint, and writes its state somewhere durable before exiting. One is a phone call. The other is a job queue. Treating the two as interchangeable is how a $200/month feature becomes a $40,000/month feature without anyone changing the prompt.

Browser-Native AI Is a Per-Feature Decision: Four Axes Your Team Hasn't Priced

April 28, 2026 · 12 min read

Tian Pan

Software Engineer

The model-in-the-tab story used to be easy to dismiss: small models, novelty demos, a cute Whisper transcription that ran for thirty seconds before the laptop fan turned on. That story is dead. Quantization improved, WebGPU shipped in every major browser, on-device caches got a persistent quota, and 4-bit 3B models now stream tokens at a rate users perceive as "snappy" on a $500 laptop. The "should this run server-side?" question is no longer a default — it is a load-bearing architectural decision your product team is making by accident every time they accept the platform team's first answer.

The mistake that follows is bigger than the demo getting worse. Teams pick one backend — usually server inference, sometimes browser inference — for the entire product, and then pay the wrong tax on every feature that doesn't fit. The privacy-sensitive feature loses to the latency-sensitive one because the architecture forces a single answer. Or worse, the team picks browser-native because the demo was magical, then ships a fleet experience where 30% of users on the long-tail device population get a degraded product the dashboard can't see.

Browser-native AI is not a faster TensorFlow.js. It is a different runtime with a different SRE story, a different cost model, and a four-axis trade-off that does not collapse into a single answer. Treating it as "the cheap version of the API call" is the architectural mistake of 2026.

Retrieval Sprawl: When 'Just Add RAG' Becomes the Architectural Diversion

April 28, 2026 · 11 min read

Tian Pan

Software Engineer

The pattern is so familiar it's invisible. The model hallucinates a fact, so the team adds a retrieval step. Three weeks later, the model picks the wrong tool from a growing inventory, so they add a retrieval step on the tool catalog. The model's answers feel too generic, so they add a retrieval step on past good answers. A quarter passes, and the system is now a pile of retrievers gluing together a prompt that, fundamentally, still has the original problem.

What changed isn't the failure rate — it's the failure mode's name. "Model wrong" became "retrieval missed," which sounds more tractable but isn't. The eval suite scores higher because the retrieved context is, by construction, in-distribution for the test set. Production tells a different story, but by then the architecture has three retrieval layers, each with its own embedding model, index refresh cadence, and on-call rotation, and nobody wants to be the engineer who proposes ripping them out.

This is retrieval sprawl. It's an architectural diversion: a way of moving a hard problem (prompt design, model capability, ambiguous specifications) into a more comfortable problem (information retrieval engineering) without actually solving anything.

The Agent Finished Into an Empty Room: Stale-Context Delivery for Async Background Tasks

April 27, 2026 · 10 min read

Tian Pan

Software Engineer

A background agent that takes ninety seconds to finish a task is operating on a snapshot of the world from ninety seconds ago. By the time it returns, the user may have navigated to a different view, started a new conversation, archived the original request, or closed the tab entirely. Most agent frameworks ship the result anyway, mutate state to reflect it, and treat the round trip as a success. It is not a success. It is the agent finishing into an empty room.

The failure mode is uglier than dropping the result. A dropped result is a missed delivery — annoying but recoverable. An applied stale result is an answer to a question the user is no longer asking, written against state that no longer matches, often overwriting the work the user moved on to. The user notices that something they did not ask for has happened, cannot reconstruct why, and loses trust in the system in a way that a simple timeout never would.

The fix is not faster agents. It is a delivery-time relevance gate that treats the moment of return as a fresh decision, not the foregone conclusion of the moment of dispatch.

Agent Memory Drift: Why Reconciliation Is the Loop You're Missing

April 27, 2026 · 11 min read

Tian Pan

Software Engineer

The most dangerous thing your long-running agent does is also the thing it does most confidently: answer from memory. The customer's address changed last Tuesday. The ticket the agent thinks is "open" was closed yesterday by a human. The product feature the agent has tidy explanatory notes about shipped in a different shape than the spec the agent read three weeks ago. None of this is hallucination in the textbook sense — the model is recalling exactly what it stored. The world simply moved while the agent was looking elsewhere.

Most teams treat memory like a write problem: what should the agent remember, how do we summarize, what's the embedding strategy, how do we keep the store from blowing up. That framing produces architectures that grow more confident as they grow more wrong. The harder problem — the one that determines whether your agent stays useful past week three — is reconciliation: the explicit, ongoing loop that compares what the agent thinks is true against what the underlying systems say is true right now.

The Batch-Tier Inference Question: When 50% Off Reshapes Your Architecture

April 27, 2026 · 11 min read

Tian Pan

Software Engineer

The cheapest inference dollar in your bill is the one you're paying twice. Every major model provider now offers a batch tier at roughly half the price of synchronous inference in exchange for accepting a completion window measured in hours rather than milliseconds. Most engineering organizations either ignore the option entirely, or shove a single nightly cron at it and declare the savings booked. Both responses leave 30–50% of total inference spend on the floor — not because the discount is small, but because batch isn't a coupon. It is a different product surface with its own SLAs, its own retry semantics, and its own failure modes, and the teams that treat it as a billing optimization end up either underusing it or shipping subtle regressions that take weeks to attribute.

The technical question is not "should we use batch?" The technical question is which actions in your system are actually synchronous in the user-perceived sense, which ones the engineering org has accidentally treated as synchronous because the developer experience was easier, and which ones can be re-shaped into jobs without a downstream consumer assuming the result is fresh. Answering that requires a workload audit, an architectural shift from request-shaped to job-shaped contracts, and an honest mapping of every agent action to a latency tier based on user expectation rather than developer convenience.

Build vs Buy for Guardrails: The Moderation API Is Now on Your Safety-Critical Path

April 27, 2026 · 10 min read

Tian Pan

Software Engineer

The hosted moderation API you bought to ship faster is now a synchronous external dependency on your safety-critical path. That sentence isn't an opinion — it's the architecture diagram, redrawn honestly. On the day the vendor degrades, you have two choices and both of them are bad: fail open and the guardrail is useless precisely when something is probably wrong, or fail closed and a guardrail outage becomes a feature outage. Most teams discover which one they picked during the incident, not before.

The reason teams reach for a vendor here isn't laziness. Building a content classifier, a prompt-injection detector, and a PII redactor in-house looks like a six-month detour from the actual product, and the vendor has a free tier and a five-minute integration. The integration is genuinely fast. The architectural consequence is that a third party now sits in the request path of every user-facing generation, with availability, latency, and behavioral characteristics you don't control and didn't model.

This post is about treating that decision as an architectural one rather than a procurement one.

Chat History Is a Database. Stop Treating It Like Scrollback.

April 27, 2026 · 11 min read

Tian Pan

Software Engineer

The most common production complaint about agentic products is some version of "it forgot what we said." The complaint shows up at turn eight, or fifteen, or thirty — never at turn two — and the team's first instinct is always the same: bigger context window. Which is the wrong instinct, because the bug is not in the model. The bug is that the team is treating conversation history as scrollback in a terminal — append a line, render the tail, truncate when full — when what they actually built, without realizing it, is a read-heavy database with append-only writes, a hot working set, an eviction policy hiding inside their truncation rule, and a query pattern that depends on the kind of question being asked. Once you accept that, the entire shape of the problem changes.

The scrollback model is so seductive because the chat UI looks like a transcript. Messages flow downward, the user reads them top-to-bottom, and the natural way to feed the model is to splice the latest N turns into the prompt. The data structure feels free. There's no schema, no index, no query — just append, render, repeat. And for the first few turns, every architecture works. The model has the whole conversation in its context, the bill is small, and the demo is delightful.

Your Agent Has Two Release Pipelines, Not One

April 27, 2026 · 10 min read

Tian Pan

Software Engineer

A team I worked with shipped a "small prompt tweak" on a Wednesday afternoon. The same PR also added one new tool to the agent's registry — a convenience wrapper around an internal admin API that the prompt would now occasionally invoke. The eval suite passed. The canary looked clean. By Thursday morning a customer's billing record had been mutated by an agent acting on a prompt-injected support ticket, the audit trail showed the admin tool firing exactly as designed, and the on-call engineer's first instinct — roll back the prompt — did nothing useful, because the credential had already been used and the row had already been written.

The post-mortem framed it as a security review failure. It wasn't. It was a release-pipeline failure. The team had shipped two completely different asset classes — a behavioral nudge to the model and a new authority granted to the agent — through the same review, the same gate, and the same rollback story, as if they were the same kind of change. They aren't. And once you see them as two pipelines, most "agent governance" debates become much less mysterious.

Determinism Budgets: Treat Randomness as a Per-Surface Allocation, Not a Global Knob

April 27, 2026 · 11 min read

Tian Pan

Software Engineer

The temperature debate is the most religious argument in AI engineering, and one of the least productive. Two camps form on every team: the determinists who want temperature pinned at zero everywhere because they cannot debug a flaky system, and the creatives who want it cranked up because the outputs feel more "alive." Both are wrong, because both are answering the question at the wrong level. Temperature is not a global setting. It is a budget — and like any budget, it should be allocated, not declared.

The productive frame is simple: every model call in your system has a purpose, and randomness either earns its keep at that surface or it does not. A planner deciding which tool to call next has nothing to gain from variation; an off-by-one tool selection is a debugging nightmare and there is no creative upside. A response-synthesis surface that summarizes a search result for ten thousand users gets robotic in a hurry if every user sees the same phrasing — and the SEO team will eventually flag the boilerplate. A brainstorming surface where the model proposes alternatives for a human to pick from is worse at temperature 0; the diversity is the feature.

If you cannot articulate what randomness is for at a given call site, you should not be paying for it.

About Tian Pan