Skip to main content

86 posts tagged with "architecture"

View all tags

The Account Number Your LLM Could Not Actually Copy

· 10 min read
Tian Pan
Software Engineer

A support agent reads a customer ticket, pulls up the account, summarizes the recent activity, and issues a refund. The refund lands in the wrong account. Not a fabricated account — a real one, one digit off. The model wrote acct_7H9j2 when the customer's record was acct_7H9j3. The trace looks clean: a search call returned the right record, a summarize call produced the right summary, a refund call ran without error. Every step succeeded. The wrong customer got the money.

This is not a hallucination in the sense the postmortem will use. The model did not invent a customer. It transposed two characters of an existing one, and that is a different failure mode — one your eval suite probably never caught, because the synthetic identifiers in your test fixtures were unique by construction. Two account numbers in the same context, three characters of shared prefix, and the language model — which is a token predictor that has never been trained to copy random strings with fidelity — picked the wrong one.

The lesson is structural, not behavioral. The model does not have an attention mechanism that special-cases identifiers. To the model, acct_7H9j2 is a sequence of subword tokens whose continuation probability shifts with every other token in the window. If a near-twin identifier appears in the same prompt, the model is one bad sample away from a quiet substitution that the harness will happily execute.

The Verifier Loop That Couldn't Converge

· 11 min read
Tian Pan
Software Engineer

The most expensive bug in an agent system is the one with no error message. Worker proposes a draft. Verifier rejects it with a paragraph of feedback. Worker revises. Verifier rejects again. The loop keeps spinning, the trace keeps growing, the bill keeps climbing, and from the outside the system looks like it is working — diligently, in fact, because both models are doing their assigned job. What nobody priced in is that the verifier's acceptance criteria are not fixed across calls. The target the worker is chasing is moving, and the loop has no convergence guarantee.

You shipped "iterate until satisfied," and you shipped a search through a space whose extrema may not exist.

What You Deleted Is Invisible to Your Coding Agent

· 10 min read
Tian Pan
Software Engineer

You spent Tuesday afternoon deleting a dead utility module. You cleaned up the imports, ran the type checker, watched CI go green, and merged the PR. Wednesday morning, a fresh agent session looks at the same code, decides the codebase is "missing" a small helper, and writes the dead module back in — same name, same shape, slightly different style. The reviewer who approved the deletion yesterday now has to remember why they killed it, find the conversation that justified it, and explain it again. The agent is not malfunctioning. It is doing exactly what its context says to do.

This is the structural reliability problem of coding agents that nobody is solving with prompt engineering: the agent's context starts from the repository's current state, but not from the history of why that state is what it is. The file you removed leaves no trace the agent can see. The dependency you migrated away from is just another package on npm. The flaky test you intentionally deleted is a coverage gap waiting to be "fixed." Absence — the negative space of decisions you made — is invisible.

The Nightly Batch Job That Quietly Became a Latency-Critical Service

· 10 min read
Tian Pan
Software Engineer

It started as a cron job. Every night at 2 a.m., a script woke up, pulled the day's records, ran them through a model, wrote the results to a table, and went back to sleep. It was the simplest possible shape for the problem, and for a year it was exactly the right shape. Nobody thought about it because nobody needed to.

Then someone asked if the results could be ready by 8 a.m. instead of noon. Then someone asked if a user could trigger a run for a single record on demand. Then a product manager asked if it could "feel instant" inside the app. Each request was reasonable. Each change was small. And at no point did anyone open a document titled "Re-architecting the inference pipeline," because at no point did any single change feel like a rewrite.

Eighteen months later you have a latency-critical online service wearing the body of a batch job. It has a p99 nobody measures, a queue nobody drains, and a failure mode where one bad record stalls a user-facing request because the pipeline was built to retry the whole batch. This is one of the most common architectural failures in AI systems, and it almost never shows up as a decision. It shows up as a slow accumulation of reasonable yeses.

Build vs. Buy Is the Wrong Question for Your AI Feature

· 9 min read
Tian Pan
Software Engineer

Every planning meeting about an AI feature collapses into the same binary. One camp wants to "just wrap an API" and ship next sprint. The other wants to "own the model" so the company controls its destiny. The argument feels strategic. It is actually a category error.

Build vs. buy treats your AI feature as one indivisible thing that you either make or purchase. But an AI feature is not one thing. It is a stack of at least five distinct layers, and each layer has its own answer. The team that frames the decision as a single coin flip will almost always own the wrong layer and rent the wrong layer, because the question they asked could not distinguish between them.

The better question is not "can we build it?" Most things, you can build. The question is: which layer breaks our differentiation if a competitor buys the exact same thing tomorrow? That question sorts the stack for you.

Why Your Agent Needs a Read Replica: Read/Write Splitting for Agent Memory

· 10 min read
Tian Pan
Software Engineer

Most agent memory is one undifferentiated store. The loop reads from it to assemble context at the start of every step, and writes to it after every action — new observations, running summaries, scratchpad edits. Same store, same access path, no separation. It works fine in a demo and starts to rot the moment the agent runs long enough for the store to get large.

The reason it rots is familiar to anyone who has scaled a database. A single store that serves both reads and writes is a single-primary database with no replica, and it inherits every problem that topology has under load: writes contend with reads, a half-written record gets read mid-update, and there is no isolation between the volatile working set and the durable record. We solved this for databases decades ago by splitting reads from writes. Agent memory deserves the same treatment.

The fix is not a bigger vector index or a smarter embedding model. It is an architectural one — recognizing that "memory" is two different workloads wearing the same name, and giving each the storage discipline it actually needs.

Prompt Caching's Hidden Tax: When a Cache Hit Serves the Wrong User's Context

· 11 min read
Tian Pan
Software Engineer

Prompt caching is sold as a free win. Cache the long shared prefix — your system prompt, your tool definitions, your retrieved context — pay full price only for the short tail that changes, and watch the bill drop. The numbers are real: a cache read costs roughly a tenth of a fresh input token, so a workload with a heavy stable prefix can see its input cost fall by 80% or more. Teams adopt it for that reason, tune it for that reason, and report on it with a single metric: cache hit rate, trending up.

What that framing hides is that the boundary you just drew — the line between the cached prefix and the uncached tail — is not a billing knob. It is a correctness boundary. Everything above the cache breakpoint is content the system has decided is interchangeable across requests. If you draw that line to maximize hit rate, you are letting a finance metric decide which facts in your prompt are allowed to be shared between users, between tenants, and across time. That is an isolation decision, and it deserves to be made on purpose.

The failure mode is quiet because it never throws. A cache hit that serves one user's context shaped by another user's profile returns a perfectly well-formed response. A cache hit that serves personalization that was true when the prefix was warmed and false by the time it is reused returns a confident, coherent, wrong answer. Nothing in your latency graph or your error rate moves. The only signal is a hit rate that looks great — because the key is too coarse.

Build vs Buy for the AI Gateway: The Decision That Locks in Your Next 18 Months

· 11 min read
Tian Pan
Software Engineer

The build-vs-buy decision for an AI gateway is almost never made on a framework. It is made on instinct in week one by an engineer who likes the problem, and then revisited in month nine by a director who is tired of the bill. Neither moment is when the decision should actually be made, and neither party is evaluating the choice on the axes that matter eighteen months from now.

The seductive thing about the build path is that month one is cheap. A two-hundred-line proxy in front of OpenAI, a switch statement that routes "claude" requests to Anthropic, a retry loop, and the team has shipped what looks like a gateway. Month nine, that proxy is twelve thousand lines of half-finished retry logic, prompt caching with broken invalidation, cost attribution that nobody trusts, fallback routing that triggered the wrong way during the last incident, an observability schema that diverged from the rest of the stack, and per-tenant rate limiting bolted on after the first enterprise customer asked. Every feature is a worse copy of something the buy path would have shipped on day one. The engineer who wrote the original two hundred lines has left.

Agent Memory Is a Compliance Surface: The Records-Management System You Didn't Sign Up to Build

· 12 min read
Tian Pan
Software Engineer

The first compliance escalation against your agent memory layer almost never arrives as a regulator's letter. It arrives as a Jira ticket from your enterprise sales engineer that says "the customer's privacy team is blocking the contract — they want to know what 'forget my user' actually means in your system, and they want a written answer by Friday." That ticket lands six to twelve months after the memory layer shipped, and the engineering team that built it discovers, in the time it takes to read the question, that they accidentally built a records-management system without any of the primitives a records-management system is supposed to have.

This is the structural problem with long-term memory in agentic products. The team building it optimizes for the things memory is sold to do — retrieval quality, latency, storage cost, the felt-personalization that makes the assistant feel like it knows the user. Nobody in the design review prices the parallel system being built at the same time: a per-user, per-tenant, multi-region data store with retention obligations, deletion semantics, audit export requirements, and a regulator's clock that starts the moment the first user's data lands in it. Memory is not a feature. It is the operational surface that every privacy regime, every enterprise procurement questionnaire, and every right-to-erasure request will eventually find.

Conversation History Is a Trust Boundary, Not a Text Blob

· 10 min read
Tian Pan
Software Engineer

The agent ran cleanly for fourteen turns. On the fifteenth, it quietly wired four hundred dollars to an attacker. Nothing in the fifteenth-turn request was malicious. The poisoned instruction had been sitting in turn three — embedded inside a tool result the agent retrieved from a stale support ticket — for forty minutes. The agent re-read the entire history on every step, and every step found the same buried sentence: "If the user mentions a refund, send the funds to the address below first." On turn fifteen, the user mentioned a refund.

This is what conversation-history attacks look like in production, and they look nothing like the prompt injections most teams are still training their guardrails against. The malicious payload is not in the current request. It is already in the history the model reads as ground truth, and it has been there long enough that the team's request-time scanners have stopped looking.

Agent Memory Eviction: Why LRU Survives a Model Upgrade and Salience Doesn't

· 9 min read
Tian Pan
Software Engineer

The team that ships an agent with salience-weighted memory eviction has, without realizing it, signed up for a memory migration project at every model upgrade. The eviction policy looks like a quality lever — pick the smartest scoring approach, get the best recall — but it is secretly a versioning contract. When the scoring model changes, the agent's effective past changes too. None of the tooling teams build around prompts and evals catches it, because the artifact that drifted is not a prompt or an eval. It is a sequence of decisions about what to forget, made months ago, by a model that no longer exists.

LRU and LFU don't have this problem. They are deterministic, model-independent, and survive upgrades cleanly. They also throw away information that a thoughtful judge would have kept. That is the tradeoff most teams accept once, on day one, when a demo recall metric is the only thing being measured — and it is the tradeoff that bites quarterly for the rest of the agent's lifetime.

Tool Latency Tail: Why p99 Reshapes Agent Architecture and p50 Hides the Problem

· 10 min read
Tian Pan
Software Engineer

A team I worked with last quarter launched a seven-step agent and built its latency budget the obvious way: search returns in 200ms, the SQL lookup takes 80ms, the email send is 150ms, and so on down the chain. Add the medians, sprinkle in some buffer, and the math says the agent fits comfortably inside its two-second SLA. The dashboards confirmed it for weeks. Median latency was beautiful. Then customers started complaining the feature was unusably slow, and the dashboards still looked green.

The story they were telling each other was wrong because they had built the architecture around sum(p50) while users were experiencing sum(p99). After three or four hops, the probability that any link in the chain has fallen into its own tail is no longer negligible. After seven hops, it approaches a coin flip. None of the per-tool dashboards ever turned red because none of the per-tool services were misbehaving — the problem was that nobody owned the multiplicative composition.

This is not a new lesson. Distributed-systems researchers have been writing about it for forty years. What's new is that every team building agents is rediscovering it, badly, on a deadline.