Skip to main content

86 posts tagged with "architecture"

View all tags

Speculative Decoding Is a Streaming Protocol Decision, Not an Inference Optimization

· 12 min read
Tian Pan
Software Engineer

The "identical output" guarantee that ships with every speculative decoding paper is a guarantee about token distributions, not about what your user sees. Read the proofs carefully and you find a clean mathematical equivalence: the rejection-sampling acceptance criterion is designed so that the output distribution after speculation is exactly the distribution the target model would have produced on its own. That guarantee binds the bytes that leave the inference engine. It says nothing about the bytes that arrived on the user's screen five hundred milliseconds ago and have to be taken back.

If you stream draft tokens to the client the moment the small model emits them, you are running an A/B test on your own users every time the verifier rejects a suffix. Half a paragraph rewrites itself. A function name changes after the IDE has already syntax-highlighted it. A TTS voice has already pronounced "the answer is likely no" before the verifier swaps in "the answer is yes, with caveats." The math says the final distribution is the same as the slow path. The user's experience says they watched the model change its mind in public.

This is the part of speculative decoding that doesn't make it into the speedup numbers. It is also the part that turns "free 3× throughput" into a half-quarter of streaming-protocol work that nobody scoped.

System Prompts as Code, Config, or Data: The Architecture Decision That Cascades Into Everything

· 12 min read
Tian Pan
Software Engineer

A team I talked to last quarter shipped a customer-support agent with the system prompt living in a Postgres row, one row per tenant. The pitch was sensible: enterprise customers had asked for tone customization, and "make the prompt editable" was the cheapest way to deliver it. Six months later, three things had happened. The eval suite had ballooned from 200 cases to 11,000 because every tenant's prompt now needed its own regression set. The prompt-update workflow had quietly become a write path with no review, because product owners had been given direct access to the table. And a single broken UTF-8 character in a Korean-language tenant prompt had taken that tenant's chatbot offline for two days before anyone noticed, because the deploy pipeline had no idea the prompt had changed.

None of these outcomes were forced by the requirements. They were forced by an architecture decision that nobody made deliberately: where does the system prompt live? In the code? In a config file? In a database row? The team picked "database" because it was the fastest path to a feature, and the consequences cascaded into every adjacent system over the following months.

The 95% Reliability Illusion: Why Your 10-Step Agent Fails 40% of the Time

· 12 min read
Tian Pan
Software Engineer

There is a moment in almost every agent project review that ends the conversation. Someone draws a small chart: end-to-end task success rate on the y-axis, number of tool-using steps on the x-axis. The line slopes down hard. The room goes quiet because everyone in it had been arguing about prompts, models, and retrieval strategies — and the chart is saying that none of those debates matter as much as the simple fact that the chain has too many links.

The math is one of the oldest results in reliability engineering, ported into a domain that pretends it is new. If every step in a pipeline succeeds independently with probability p, then n steps in series succeed with probability p to the n. Plug in numbers that sound healthy on a status report: 95% per-step reliability, ten steps, end-to-end success rate of 60%. Twenty steps gets you 36%. Thirty steps gets you 21%. The agent that "works 95% of the time" is the same agent that fails on a third of real user requests, because real user requests are not single steps.

DLP Belongs in Your AI Gateway, Not Bolted Into Every App

· 11 min read
Tian Pan
Software Engineer

The first internal LLM gateway is almost always built for the boring reasons: cost attribution so finance can answer "which team spent the inference budget," rate limiting so one runaway script doesn't burn the monthly quota, provider failover so an OpenAI hiccup doesn't take down the assistant. Data loss prevention shows up on the slide deck, but it ships as "each app team should redact sensitive fields before they call the model." Six months later there are nine apps in production, three half-maintained redaction libraries with subtly different regex sets, two prototypes that bypass the gateway entirely "just for testing," and a customer-data-in-prompt incident that everyone's middleware was supposed to prevent because nobody's middleware was the canonical egress point.

This is not a tooling problem. It is an architectural mistake. DLP is an egress control, and egress controls only work when the path is mandatory. The moment you let app teams own redaction, you've ceded the property that makes DLP function — that there is exactly one place sensitive data can leave, and you can prove what crossed it. The 2025 LayerX Security report puts the scale of the problem in numbers most teams haven't internalized: GenAI-related DLP incidents more than doubled in early 2025 and now make up 14% of all data-security incidents across SaaS traffic, with employees averaging 6.8 pastes into GenAI tools per day, more than half of which contain corporate information. The shadow path is winning by default.

Inference Is Faster Than Your Database Now

· 10 min read
Tian Pan
Software Engineer

Open any 2024-era AI feature's trace and the model call is the whale. Eight hundred milliseconds of generation surrounded by a thin crust of retrieval, auth, and a database lookup rounding to nothing. Every architecture decision that year — the caching, the prefetching, the streaming UX — was designed around hiding that whale.

Now pull the same trace for the same feature running on a 2026 inference stack. The whale is a dolphin. A cached prefill returns the first token in 180ms. Decode streams at 120 tokens per second. The model is no longer the slow node. Your own infrastructure is, and most of it hasn't noticed.

This reordering is the most important performance shift of the year, and it's the one teams keep under-reacting to. The p99 floor on an AI request is now set by the feature store call, the auth middleware, and the Postgres lookup that was always that slow — nobody just cared when the model was taking nine-tenths of the budget.

Mid-Flight Steering: Redirecting a Long-Running Agent Without Killing the Run

· 10 min read
Tian Pan
Software Engineer

Watch a developer use an agentic IDE for twenty minutes and you will see the same micro-drama play out three times. The agent starts a long task. Two tool calls in, the user realizes they want a functional component instead of a class, or a v2 endpoint instead of v1, or tests written in Vitest instead of Jest. They have exactly one lever: the red stop button. They press it. The agent dies mid-edit. They copy-paste the last prompt, append the correction, and pay for the first eight minutes of work twice.

The abort button is the wrong affordance. It treats "I want to adjust the plan" and "I want to throw away the run" as the same gesture. In practice they are as different as a steering wheel and an ejector seat, and conflating them is why so many agent products feel brittle the moment a task takes longer than a single screen of output.

Design Your Agent State Machine Before You Write a Single Prompt

· 10 min read
Tian Pan
Software Engineer

Most engineers building their first LLM agent follow the same sequence: write a system prompt, add a loop that calls the model, sprinkle in some tool-calling logic, and watch it work on a simple test case. Six weeks later, the agent is an incomprehensible tangle of nested conditionals, prompt fragments pasted inside f-strings, and retry logic scattered across three files. Adding a feature requires reading the whole thing. A production bug means staring at a thousand-token context window trying to reconstruct what the model was "thinking."

This is the spaghetti agent problem, and it's nearly universal in teams that start with a prompt rather than a design. The fix isn't a better prompting technique or a different framework. It's a discipline: design your state machine before you write a single prompt.

EU AI Act Compliance Is an Engineering Problem: The Audit Trail You Have to Ship

· 10 min read
Tian Pan
Software Engineer

Most engineering teams building AI systems in 2026 understand that the EU AI Act exists. Very few understand what it actually requires them to build. The regulation's core obligations for high-risk AI systems — automatic event logging, human oversight mechanisms, risk management systems, technical documentation — are not policy artifacts that a legal team can produce on a deadline. They are engineering deliverables that require architectural decisions made at the start of a project, not in the final sprint before a compliance audit.

The hard deadline is August 2, 2026. High-risk AI systems deployed in the EU must be in full compliance with Articles 9 through 15. Organizations deploying AI in employment screening, credit scoring, benefits allocation, healthcare prioritization, biometric identification, or critical infrastructure management are in scope. If your system makes decisions that materially affect people in those domains and touches EU residents, it is almost certainly high-risk. And realistic compliance implementation timelines run 8 to 14 months — which means if you haven't started, you're already late.

The Model Portability Tax: How to Architect AI Systems You Can Actually Migrate

· 9 min read
Tian Pan
Software Engineer

You inherited an AI feature built on GPT-4-turbo. The model is being deprecated. Your manager wants to cut costs by switching to a newer, cheaper model. You run a quick test, metrics look passable, you ship it — and a week later, accuracy on your core use case drops 22%. Support tickets climb. You're now in a crisis migration rather than a planned one.

This is the model portability tax: the hidden engineering cost that accumulates every time you couple your application logic tightly to a specific foundation model. Every team pays it. Most don't realize how large the bill has gotten until the invoice arrives.

Multi-User AI Sessions: The Context Ownership Problem Nobody Designs For

· 9 min read
Tian Pan
Software Engineer

In August 2024, security researchers discovered that Slack AI would pull both public and private channel content into the same context window when answering a query. An attacker in a public channel could craft a message that, when ingested by Slack AI, would inject instructions into a victim's session — and since Slack AI doesn't cite its sources, the resulting data exfiltration was nearly untraceable. The attack could leak API keys embedded in private DMs. Slack patched it after responsible disclosure.

This wasn't a bug in the traditional sense. It was a consequence of treating context as a shared mutable resource with no per-user access control. And it's a mistake that most teams building shared AI assistants are making right now, just more quietly.

Agent Protocol Fragmentation: Designing for A2A, MCP, and What Comes Next

· 9 min read
Tian Pan
Software Engineer

Most teams picking an agent protocol are actually making three separate decisions at once — and treating them as one is why so many integrations break the moment a second framework enters the picture.

The three decisions are: how your agent talks to tools and data (vertical integration), how your agent collaborates with other agents (horizontal coordination), and how your agent surfaces state to a human interface (interaction layer). Google's A2A, Anthropic's MCP, and OpenAPI-based REST solve for different layers of this stack. When engineers conflate them, they either over-engineer a single-agent setup with multi-agent machinery, or under-engineer a multi-agent workflow with single-agent tooling. Both failures are expensive to refactor once in production.

The EU AI Act Is Now Your Engineering Backlog

· 12 min read
Tian Pan
Software Engineer

Most engineering teams discovered the GDPR through a legal email that arrived three weeks before the deadline. The EU AI Act is repeating that pattern, and the August 2, 2026 enforcement date for high-risk AI systems is close enough that "we'll deal with compliance later" is no longer an option. The difference between GDPR and the AI Act is that GDPR compliance was mostly about data handling policies. AI Act compliance requires building new system components — components that don't exist yet in most production AI systems.

What the regulation calls "human oversight obligations" and "audit trail requirements" are, translated into engineering language, a dashboard, an event log, and a data lineage system. This article treats the EU AI Act as an engineering specification rather than a legal document and walks through what you actually need to build.