Skip to main content

79 posts tagged with "architecture"

View all tags

Build vs Buy for the AI Gateway: The Decision That Locks in Your Next 18 Months

· 11 min read
Tian Pan
Software Engineer

The build-vs-buy decision for an AI gateway is almost never made on a framework. It is made on instinct in week one by an engineer who likes the problem, and then revisited in month nine by a director who is tired of the bill. Neither moment is when the decision should actually be made, and neither party is evaluating the choice on the axes that matter eighteen months from now.

The seductive thing about the build path is that month one is cheap. A two-hundred-line proxy in front of OpenAI, a switch statement that routes "claude" requests to Anthropic, a retry loop, and the team has shipped what looks like a gateway. Month nine, that proxy is twelve thousand lines of half-finished retry logic, prompt caching with broken invalidation, cost attribution that nobody trusts, fallback routing that triggered the wrong way during the last incident, an observability schema that diverged from the rest of the stack, and per-tenant rate limiting bolted on after the first enterprise customer asked. Every feature is a worse copy of something the buy path would have shipped on day one. The engineer who wrote the original two hundred lines has left.

Agent Memory Is a Compliance Surface: The Records-Management System You Didn't Sign Up to Build

· 12 min read
Tian Pan
Software Engineer

The first compliance escalation against your agent memory layer almost never arrives as a regulator's letter. It arrives as a Jira ticket from your enterprise sales engineer that says "the customer's privacy team is blocking the contract — they want to know what 'forget my user' actually means in your system, and they want a written answer by Friday." That ticket lands six to twelve months after the memory layer shipped, and the engineering team that built it discovers, in the time it takes to read the question, that they accidentally built a records-management system without any of the primitives a records-management system is supposed to have.

This is the structural problem with long-term memory in agentic products. The team building it optimizes for the things memory is sold to do — retrieval quality, latency, storage cost, the felt-personalization that makes the assistant feel like it knows the user. Nobody in the design review prices the parallel system being built at the same time: a per-user, per-tenant, multi-region data store with retention obligations, deletion semantics, audit export requirements, and a regulator's clock that starts the moment the first user's data lands in it. Memory is not a feature. It is the operational surface that every privacy regime, every enterprise procurement questionnaire, and every right-to-erasure request will eventually find.

Conversation History Is a Trust Boundary, Not a Text Blob

· 10 min read
Tian Pan
Software Engineer

The agent ran cleanly for fourteen turns. On the fifteenth, it quietly wired four hundred dollars to an attacker. Nothing in the fifteenth-turn request was malicious. The poisoned instruction had been sitting in turn three — embedded inside a tool result the agent retrieved from a stale support ticket — for forty minutes. The agent re-read the entire history on every step, and every step found the same buried sentence: "If the user mentions a refund, send the funds to the address below first." On turn fifteen, the user mentioned a refund.

This is what conversation-history attacks look like in production, and they look nothing like the prompt injections most teams are still training their guardrails against. The malicious payload is not in the current request. It is already in the history the model reads as ground truth, and it has been there long enough that the team's request-time scanners have stopped looking.

Agent Memory Eviction: Why LRU Survives a Model Upgrade and Salience Doesn't

· 9 min read
Tian Pan
Software Engineer

The team that ships an agent with salience-weighted memory eviction has, without realizing it, signed up for a memory migration project at every model upgrade. The eviction policy looks like a quality lever — pick the smartest scoring approach, get the best recall — but it is secretly a versioning contract. When the scoring model changes, the agent's effective past changes too. None of the tooling teams build around prompts and evals catches it, because the artifact that drifted is not a prompt or an eval. It is a sequence of decisions about what to forget, made months ago, by a model that no longer exists.

LRU and LFU don't have this problem. They are deterministic, model-independent, and survive upgrades cleanly. They also throw away information that a thoughtful judge would have kept. That is the tradeoff most teams accept once, on day one, when a demo recall metric is the only thing being measured — and it is the tradeoff that bites quarterly for the rest of the agent's lifetime.

Tool Latency Tail: Why p99 Reshapes Agent Architecture and p50 Hides the Problem

· 10 min read
Tian Pan
Software Engineer

A team I worked with last quarter launched a seven-step agent and built its latency budget the obvious way: search returns in 200ms, the SQL lookup takes 80ms, the email send is 150ms, and so on down the chain. Add the medians, sprinkle in some buffer, and the math says the agent fits comfortably inside its two-second SLA. The dashboards confirmed it for weeks. Median latency was beautiful. Then customers started complaining the feature was unusably slow, and the dashboards still looked green.

The story they were telling each other was wrong because they had built the architecture around sum(p50) while users were experiencing sum(p99). After three or four hops, the probability that any link in the chain has fallen into its own tail is no longer negligible. After seven hops, it approaches a coin flip. None of the per-tool dashboards ever turned red because none of the per-tool services were misbehaving — the problem was that nobody owned the multiplicative composition.

This is not a new lesson. Distributed-systems researchers have been writing about it for forty years. What's new is that every team building agents is rediscovering it, badly, on a deadline.

Cost-Per-Conversation as a Product Contract: When Pricing Drives Architecture

· 10 min read
Tian Pan
Software Engineer

The cleanest way to find out your AI feature's pricing model is wrong is to look at which engineer is currently rewriting the truncation logic at midnight. They aren't shipping a capability — they're patching a unit-economics leak that the PRD never named, and the patch is necessarily user-hostile because the product spec told them the budget was infinite. On a flat-fee SaaS plan, every conversation that runs longer than the median pulls margin out of the company in real time. The only real question is whether the product team admits it before finance does.

Traditional SaaS economics rest on near-zero marginal cost per user: once the software is built, serving the next customer barely moves the infrastructure line. AI features break that assumption. Every turn in a conversation consumes inference compute that scales with prompt size, output length, tool-call fan-out, and retrieval volume — and conversations don't have a natural stopping point. A heavy user can consume 50× the median in a billing period without leaving the happy path of the product. Under flat pricing, that user is funded by the rest of the user base, and the company finds out only when COGS reporting catches up a quarter later.

This is why pricing on AI features is not a finance problem to be handled after launch. It is an architecture input that decides what the product is allowed to do, and refusing to make it visible in the spec just means it gets resolved later, in worse ways, by people without product authority.

Cross-Channel Memory: When Your Agent Forgets the Email Thread

· 10 min read
Tian Pan
Software Engineer

A customer asks your assistant in Slack on Monday how to enable a feature, gets a clean answer, and goes about their day. On Friday they email asking to confirm what was decided, and the assistant — running off a different session store, with no idea Monday's chat ever happened — gives a contradictory recommendation. The customer doesn't file two tickets against two products. They file one ticket against your AI, and they're right to. To them there is one assistant. The fact that you wrote three of them, glued to three surface-specific session stores, is an implementation detail you weren't supposed to leak.

This is the cross-channel memory problem, and it sits at the intersection of two things teams underestimate: how aggressively users assume continuity, and how aggressively channel teams write their own session stores because it was the path of least resistance to ship. Recent industry data puts the gap in stark terms — only 13% of organizations successfully carry full conversation context across channels, and CSAT for fragmented multichannel support sits at 28% versus 67% for true omnichannel. The 39-point delta isn't a model quality gap. It's a memory architecture gap.

The Policy File: Why Your Refusal Rules Don't Belong in Your System Prompt

· 11 min read
Tian Pan
Software Engineer

A safety reviewer at a fintech startup pushed a four-line addition to the system prompt last quarter. The change: a refusal rule preventing the assistant from giving specific tax advice for a jurisdiction the company didn't have a license to operate in. Reasonable, narrow, audit-clean. The rule landed on Tuesday. By Friday the eval suite was showing a 7-point drop on a customer-onboarding flow that had nothing to do with tax — the model had started hedging on every question that mentioned a country, including "what currency does this account hold." The product team backed out the change. The safety team re-shipped it the following week with slightly different wording. Three weeks later, the same regression appeared in a different shape, and the next safety edit broke a different unrelated flow.

The bug here isn't the wording. The bug is that the refusal rule is in the wrong place. It's wedged inside a 2,400-token artifact that also contains the assistant's conversational voice, its formatting contract, its task instructions, and a half-dozen other policy clauses — and every edit to any of those concerns is a behavioral edit to all of them, because the model can't tell which sentence is policy and which is style. Production system prompts grow into a tangled monolith because three orthogonal concerns are pretending to be one. The teams who haven't factored them out are paying the integration tax on every edit.

We Already Have That: When AI Features Reinvent Code You Already Own

· 11 min read
Tian Pan
Software Engineer

A team I worked with shipped a "smart" date extractor last quarter. The model parsed natural-language phrases like "next Tuesday" and "two weeks from the 14th," ran in production behind a feature flag, and cost about three cents per request at the chosen tier. Six weeks later, a backend engineer wandered into a design review and mentioned, casually, that the company already had a date parser. It had been written in 2019, lived in a utility module nobody on the AI team had read, handled 99.4% of the same inputs at sub-millisecond latency, and ran for free. The AI feature did not get pulled. It got rationalized — "the model handles the long tail" — and the team moved on, having shipped a more expensive, slower, less accurate version of something the company already owned.

This is not a one-off story. It is the dominant failure mode for AI features inside companies older than the AI team. The pattern repeats: a smart classifier duplicates a regex pipeline written years ago, a retrieval system fetches a vendor list that an internal service has been maintaining as a typed table, an agent learns to extract entities a parser already extracts deterministically. The AI feature ships with a quality bar lower than the deterministic system it didn't know existed, and the team who built the deterministic system finds out at a cross-team meeting.

HIPAA, SOC2, and Your Agent: The Architectural Constraints Compliance Actually Imposes

· 12 min read
Tian Pan
Software Engineer

The typical AI team's encounter with compliance goes like this: the agent is in production, users love it, and someone from legal forwards an email asking whether the system is HIPAA-compliant. The engineer assigned to answer discovers that context windows contain PHI, that there are no audit logs with sufficient granularity, that the LLM provider doesn't have a signed Business Associate Agreement, and that the agent's tool permissions are broader than the minimum necessary standard allows. The fix takes three months and requires a partial rewrite.

This pattern is not an edge case. According to a 2024 industry survey, 78% of business executives cannot pass an AI governance audit within 90 days, and 42% of companies abandoned AI initiatives in 2025 primarily due to compliance and governance failures — not technical ones. The gap between what gets built and what compliance actually requires is architectural, and it forms in sprint one.

Rate Limits Are a Design Constraint, Not an Error Code

· 10 min read
Tian Pan
Software Engineer

A team I know built a financial assistant with an agentic loop. Week one, API spend was 127.Weekeleven,itwas127. Week eleven, it was 47,000 — same system, same feature, no intentional change in scope. The agent hit a rate limit, the retry logic dutifully retried, the loop had no circuit breaker, and the costs compounded in silence until someone noticed the billing alert they had set too high.

This isn't a story about a bug. It's a story about architecture. The team's mental model treated rate limits as an error to handle reactively. The system they built reflected that model exactly. The $47,000 week was the system working as designed.

Timeout-Aware Agent Design: How to Deliver Partial Results Instead of Silent Failure

· 10 min read
Tian Pan
Software Engineer

An agent successfully creates a GitHub issue, opens a Jira ticket, and updates a shared spreadsheet. Then it times out before sending the Slack announcement. The framework records the run as delivered. The user never gets notified. The side effects exist in three systems; the result that matters to the human doesn't.

This is the most common timeout failure mode in production agent systems, and it's almost never the one teams prepare for. Most agent implementations treat a timeout like any other exception: catch it, log it, return an error. The user gets nothing, even though the agent completed 90% of the work. The question isn't whether to set timeouts — every production system needs them. The question is what an agent does when the clock runs out.