96 posts tagged with "architecture"

Browser-Native AI Is a Per-Feature Decision: Four Axes Your Team Hasn't Priced

April 28, 2026 · 12 min read

Software Engineer

The model-in-the-tab story used to be easy to dismiss: small models, novelty demos, a cute Whisper transcription that ran for thirty seconds before the laptop fan turned on. That story is dead. Quantization improved, WebGPU shipped in every major browser, on-device caches got a persistent quota, and 4-bit 3B models now stream tokens at a rate users perceive as "snappy" on a $500 laptop. The "should this run server-side?" question is no longer a default — it is a load-bearing architectural decision your product team is making by accident every time they accept the platform team's first answer.

The mistake that follows is bigger than the demo getting worse. Teams pick one backend — usually server inference, sometimes browser inference — for the entire product, and then pay the wrong tax on every feature that doesn't fit. The privacy-sensitive feature loses to the latency-sensitive one because the architecture forces a single answer. Or worse, the team picks browser-native because the demo was magical, then ships a fleet experience where 30% of users on the long-tail device population get a degraded product the dashboard can't see.

Browser-native AI is not a faster TensorFlow.js. It is a different runtime with a different SRE story, a different cost model, and a four-axis trade-off that does not collapse into a single answer. Treating it as "the cheap version of the API call" is the architectural mistake of 2026.

Retrieval Sprawl: When 'Just Add RAG' Becomes the Architectural Diversion

April 28, 2026 · 11 min read

Tian Pan

Software Engineer

The pattern is so familiar it's invisible. The model hallucinates a fact, so the team adds a retrieval step. Three weeks later, the model picks the wrong tool from a growing inventory, so they add a retrieval step on the tool catalog. The model's answers feel too generic, so they add a retrieval step on past good answers. A quarter passes, and the system is now a pile of retrievers gluing together a prompt that, fundamentally, still has the original problem.

What changed isn't the failure rate — it's the failure mode's name. "Model wrong" became "retrieval missed," which sounds more tractable but isn't. The eval suite scores higher because the retrieved context is, by construction, in-distribution for the test set. Production tells a different story, but by then the architecture has three retrieval layers, each with its own embedding model, index refresh cadence, and on-call rotation, and nobody wants to be the engineer who proposes ripping them out.

This is retrieval sprawl. It's an architectural diversion: a way of moving a hard problem (prompt design, model capability, ambiguous specifications) into a more comfortable problem (information retrieval engineering) without actually solving anything.

The Agent Finished Into an Empty Room: Stale-Context Delivery for Async Background Tasks

April 27, 2026 · 10 min read

Tian Pan

Software Engineer

A background agent that takes ninety seconds to finish a task is operating on a snapshot of the world from ninety seconds ago. By the time it returns, the user may have navigated to a different view, started a new conversation, archived the original request, or closed the tab entirely. Most agent frameworks ship the result anyway, mutate state to reflect it, and treat the round trip as a success. It is not a success. It is the agent finishing into an empty room.

The failure mode is uglier than dropping the result. A dropped result is a missed delivery — annoying but recoverable. An applied stale result is an answer to a question the user is no longer asking, written against state that no longer matches, often overwriting the work the user moved on to. The user notices that something they did not ask for has happened, cannot reconstruct why, and loses trust in the system in a way that a simple timeout never would.

The fix is not faster agents. It is a delivery-time relevance gate that treats the moment of return as a fresh decision, not the foregone conclusion of the moment of dispatch.

Agent Memory Drift: Why Reconciliation Is the Loop You're Missing

April 27, 2026 · 11 min read

Tian Pan

Software Engineer

The most dangerous thing your long-running agent does is also the thing it does most confidently: answer from memory. The customer's address changed last Tuesday. The ticket the agent thinks is "open" was closed yesterday by a human. The product feature the agent has tidy explanatory notes about shipped in a different shape than the spec the agent read three weeks ago. None of this is hallucination in the textbook sense — the model is recalling exactly what it stored. The world simply moved while the agent was looking elsewhere.

Most teams treat memory like a write problem: what should the agent remember, how do we summarize, what's the embedding strategy, how do we keep the store from blowing up. That framing produces architectures that grow more confident as they grow more wrong. The harder problem — the one that determines whether your agent stays useful past week three — is reconciliation: the explicit, ongoing loop that compares what the agent thinks is true against what the underlying systems say is true right now.

The Batch-Tier Inference Question: When 50% Off Reshapes Your Architecture

April 27, 2026 · 11 min read

Tian Pan

Software Engineer

The cheapest inference dollar in your bill is the one you're paying twice. Every major model provider now offers a batch tier at roughly half the price of synchronous inference in exchange for accepting a completion window measured in hours rather than milliseconds. Most engineering organizations either ignore the option entirely, or shove a single nightly cron at it and declare the savings booked. Both responses leave 30–50% of total inference spend on the floor — not because the discount is small, but because batch isn't a coupon. It is a different product surface with its own SLAs, its own retry semantics, and its own failure modes, and the teams that treat it as a billing optimization end up either underusing it or shipping subtle regressions that take weeks to attribute.

The technical question is not "should we use batch?" The technical question is which actions in your system are actually synchronous in the user-perceived sense, which ones the engineering org has accidentally treated as synchronous because the developer experience was easier, and which ones can be re-shaped into jobs without a downstream consumer assuming the result is fresh. Answering that requires a workload audit, an architectural shift from request-shaped to job-shaped contracts, and an honest mapping of every agent action to a latency tier based on user expectation rather than developer convenience.

Build vs Buy for Guardrails: The Moderation API Is Now on Your Safety-Critical Path

April 27, 2026 · 10 min read

Tian Pan

Software Engineer

The hosted moderation API you bought to ship faster is now a synchronous external dependency on your safety-critical path. That sentence isn't an opinion — it's the architecture diagram, redrawn honestly. On the day the vendor degrades, you have two choices and both of them are bad: fail open and the guardrail is useless precisely when something is probably wrong, or fail closed and a guardrail outage becomes a feature outage. Most teams discover which one they picked during the incident, not before.

The reason teams reach for a vendor here isn't laziness. Building a content classifier, a prompt-injection detector, and a PII redactor in-house looks like a six-month detour from the actual product, and the vendor has a free tier and a five-minute integration. The integration is genuinely fast. The architectural consequence is that a third party now sits in the request path of every user-facing generation, with availability, latency, and behavioral characteristics you don't control and didn't model.

This post is about treating that decision as an architectural one rather than a procurement one.

Chat History Is a Database. Stop Treating It Like Scrollback.

April 27, 2026 · 11 min read

Tian Pan

Software Engineer

The most common production complaint about agentic products is some version of "it forgot what we said." The complaint shows up at turn eight, or fifteen, or thirty — never at turn two — and the team's first instinct is always the same: bigger context window. Which is the wrong instinct, because the bug is not in the model. The bug is that the team is treating conversation history as scrollback in a terminal — append a line, render the tail, truncate when full — when what they actually built, without realizing it, is a read-heavy database with append-only writes, a hot working set, an eviction policy hiding inside their truncation rule, and a query pattern that depends on the kind of question being asked. Once you accept that, the entire shape of the problem changes.

The scrollback model is so seductive because the chat UI looks like a transcript. Messages flow downward, the user reads them top-to-bottom, and the natural way to feed the model is to splice the latest N turns into the prompt. The data structure feels free. There's no schema, no index, no query — just append, render, repeat. And for the first few turns, every architecture works. The model has the whole conversation in its context, the bill is small, and the demo is delightful.

Your Agent Has Two Release Pipelines, Not One

April 27, 2026 · 10 min read

Tian Pan

Software Engineer

A team I worked with shipped a "small prompt tweak" on a Wednesday afternoon. The same PR also added one new tool to the agent's registry — a convenience wrapper around an internal admin API that the prompt would now occasionally invoke. The eval suite passed. The canary looked clean. By Thursday morning a customer's billing record had been mutated by an agent acting on a prompt-injected support ticket, the audit trail showed the admin tool firing exactly as designed, and the on-call engineer's first instinct — roll back the prompt — did nothing useful, because the credential had already been used and the row had already been written.

The post-mortem framed it as a security review failure. It wasn't. It was a release-pipeline failure. The team had shipped two completely different asset classes — a behavioral nudge to the model and a new authority granted to the agent — through the same review, the same gate, and the same rollback story, as if they were the same kind of change. They aren't. And once you see them as two pipelines, most "agent governance" debates become much less mysterious.

Determinism Budgets: Treat Randomness as a Per-Surface Allocation, Not a Global Knob

April 27, 2026 · 11 min read

Tian Pan

Software Engineer

The temperature debate is the most religious argument in AI engineering, and one of the least productive. Two camps form on every team: the determinists who want temperature pinned at zero everywhere because they cannot debug a flaky system, and the creatives who want it cranked up because the outputs feel more "alive." Both are wrong, because both are answering the question at the wrong level. Temperature is not a global setting. It is a budget — and like any budget, it should be allocated, not declared.

The productive frame is simple: every model call in your system has a purpose, and randomness either earns its keep at that surface or it does not. A planner deciding which tool to call next has nothing to gain from variation; an off-by-one tool selection is a debugging nightmare and there is no creative upside. A response-synthesis surface that summarizes a search result for ten thousand users gets robotic in a hurry if every user sees the same phrasing — and the SEO team will eventually flag the boilerplate. A brainstorming surface where the model proposes alternatives for a human to pick from is worse at temperature 0; the diversity is the feature.

If you cannot articulate what randomness is for at a given call site, you should not be paying for it.

Replan, Don't Retry: Why Most Agent Errors Aren't Transient

April 27, 2026 · 10 min read

Tian Pan

Software Engineer

A calendar-write returns 409 Conflict. The framework's default error handler kicks in: backoff 200ms, retry. Same conflict. Backoff 400ms, retry. Same conflict. Backoff 800ms, retry. By the time the agent gives up and tells the user "I couldn't book the meeting," it has burned three seconds of latency budget proving something the very first response already told it: the slot is taken. The world has not changed. It will not change in 800 milliseconds. Retrying was never going to work, because nothing about this error was transient.

This is the most common error-handling bug in agent systems, and it is hiding in plain sight inside almost every framework that ships today. The retry-with-exponential-backoff pattern was imported wholesale from stateless HTTP clients — where it is exactly correct — into stateful planning loops where it is actively wrong. The right default for a tool error in an agent is not retry. It is replan.

Speculative Decoding Is a Streaming Protocol Decision, Not an Inference Optimization

April 27, 2026 · 12 min read

Tian Pan

Software Engineer

The "identical output" guarantee that ships with every speculative decoding paper is a guarantee about token distributions, not about what your user sees. Read the proofs carefully and you find a clean mathematical equivalence: the rejection-sampling acceptance criterion is designed so that the output distribution after speculation is exactly the distribution the target model would have produced on its own. That guarantee binds the bytes that leave the inference engine. It says nothing about the bytes that arrived on the user's screen five hundred milliseconds ago and have to be taken back.

If you stream draft tokens to the client the moment the small model emits them, you are running an A/B test on your own users every time the verifier rejects a suffix. Half a paragraph rewrites itself. A function name changes after the IDE has already syntax-highlighted it. A TTS voice has already pronounced "the answer is likely no" before the verifier swaps in "the answer is yes, with caveats." The math says the final distribution is the same as the slow path. The user's experience says they watched the model change its mind in public.

This is the part of speculative decoding that doesn't make it into the speedup numbers. It is also the part that turns "free 3× throughput" into a half-quarter of streaming-protocol work that nobody scoped.

System Prompts as Code, Config, or Data: The Architecture Decision That Cascades Into Everything

April 27, 2026 · 12 min read

Tian Pan

Software Engineer

A team I talked to last quarter shipped a customer-support agent with the system prompt living in a Postgres row, one row per tenant. The pitch was sensible: enterprise customers had asked for tone customization, and "make the prompt editable" was the cheapest way to deliver it. Six months later, three things had happened. The eval suite had ballooned from 200 cases to 11,000 because every tenant's prompt now needed its own regression set. The prompt-update workflow had quietly become a write path with no review, because product owners had been given direct access to the table. And a single broken UTF-8 character in a Korean-language tenant prompt had taken that tenant's chatbot offline for two days before anyone noticed, because the deploy pipeline had no idea the prompt had changed.

None of these outcomes were forced by the requirements. They were forced by an architecture decision that nobody made deliberately: where does the system prompt live? In the code? In a config file? In a database row? The team picked "database" because it was the fastest path to a feature, and the consequences cascaded into every adjacent system over the following months.

About Tian Pan