
49 posts tagged with "system-design"


The Co-Pilot Trap: Why Full Autopilot Ships Faster but Fails Harder

· 9 min read
Tian Pan
Software Engineer

There's a pattern in how AI features die in production: they start as copilots and get promoted to autopilots. The promotion happens for obvious reasons—cost reduction, scale, reduced headcount—and the reasoning sounds solid at demo time. Then the edge cases accumulate. A user-facing recommendation becomes a user-facing decision. A suggestion becomes an action. And when the first systematic failure lands, the engineering team discovers that the error tolerance assumptions baked into the original design were never re-evaluated.

This is the co-pilot trap: building an AI feature for one tier of the automation spectrum, then promoting it to a higher tier without rebuilding the failure model that tier requires.

Dynamic System Prompt Assembly: Composable AI Behavior at Request Time

· 10 min read
Tian Pan
Software Engineer

Most teams start with a single, monolithic system prompt. It works fine in demos. Then the product grows: you add a power user tier, a compliance mode for enterprise customers, a new tool the model can call, and a feature-flag experiment your growth team wants to A/B test. You add all of that to the same prompt. Six months in, you have 4,000 words of instructions that nobody fully understands, behavior that changes unpredictably when you edit one section, and a debugging process that amounts to "change something and see what happens."

The answer most teams reach for is composable, dynamically assembled system prompts — building the prompt from modular components at request time rather than maintaining a static text file. It's a sound architectural instinct, but the implementation surface is larger than it looks. Composable prompts introduce a new class of failure modes that static prompts simply don't have.

Compound AI Systems: When Your Pipeline Is Smarter Than Any Single Model

· 9 min read
Tian Pan
Software Engineer

There is a persistent assumption in AI engineering that the path to better outputs is a better model. Bigger context window, fresher training data, higher benchmark scores. In practice, the teams shipping the most capable AI products are usually doing something different: they are assembling pipelines where multiple specialized components — a retriever, a reranker, a classifier, a code interpreter, and one or more language models — cooperate to handle a task that no single model could do reliably on its own.

This architectural pattern has a name — compound AI systems — and it is now the dominant paradigm for production AI. Understanding how to build these systems correctly, and where they fail when you don't, is one of the most important skills in applied AI engineering today.

Designing for Partial Completion: When Your Agent Gets 70% Done and Stops

· 10 min read
Tian Pan
Software Engineer

Every production agent system eventually ships a failure nobody anticipated: the agent that books the flight, fails to find a hotel, and leaves a user with half a confirmed itinerary and no clear way to finish. Not a crash. Not a refusal. Just a stopped agent with real-world side effects and no plan for what comes next.

The standard mental model for agent failure is binary — succeed or abort. Retry logic, exponential backoff, fallback prompts — all of these assume a clean boundary between "task running" and "task done." But real agents fail somewhere in the middle, and when they do, the absence of partial-completion design becomes the bug. You didn't need a smarter model. You needed a task state machine.
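As a minimal sketch of what a task state machine could look like, assuming each step is a plain callable that raises on failure: every step carries its own state, completed side effects are never repeated, and a resumed task picks up at the step that failed. The names `Step`, `Task`, `book_flight`, `book_hotel`, and `inventory` are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class StepState(Enum):
    PENDING = "pending"
    DONE = "done"
    FAILED = "failed"


@dataclass
class Step:
    name: str
    run: Callable[[], None]  # performs the side effect; raises on failure
    state: StepState = StepState.PENDING
    error: Optional[str] = None


@dataclass
class Task:
    steps: list[Step]

    def advance(self) -> None:
        """Run unfinished steps in order and stop at the first failure."""
        for step in self.steps:
            if step.state is StepState.DONE:
                continue  # a completed side effect is never repeated
            try:
                step.run()
                step.state, step.error = StepState.DONE, None
            except Exception as exc:
                step.state, step.error = StepState.FAILED, str(exc)
                return

    def summary(self) -> dict[str, str]:
        return {step.name: step.state.value for step in self.steps}


# Hypothetical side effects standing in for real booking calls.
calls = {"flight": 0, "hotel": 0}
inventory = {"hotel_rooms": 0}


def book_flight() -> None:
    calls["flight"] += 1


def book_hotel() -> None:
    calls["hotel"] += 1
    if inventory["hotel_rooms"] == 0:
        raise RuntimeError("no rooms available")


task = Task([Step("flight", book_flight), Step("hotel", book_hotel)])
task.advance()
print(task.summary())  # {'flight': 'done', 'hotel': 'failed'}
```

Because the state lives on the task rather than in the agent's context window, calling `task.advance()` again later retries only the hotel step, and the user can be shown exactly which parts of the itinerary are confirmed.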

When Code Beats the Model: A Decision Framework for Replacing LLM Calls with Deterministic Logic

· 8 min read
Tian Pan
Software Engineer

Most AI engineering teams have the same story. They start with a hard problem that genuinely needs an LLM. Then, once the LLM infrastructure is in place, every new problem starts looking like a nail for the same hammer. Six months later, they're calling GPT-4o to check whether an email address contains an "@" symbol — and they're paying for it.

The "just use the model" reflex is now the dominant driver of unnecessary complexity, inflated costs, and fragile production systems in AI applications. It's not that engineers are careless. It's that LLMs are genuinely impressive, the tooling has lowered the barrier to using them, and once you've built an LLM pipeline, adding another call feels trivially cheap. It isn't.

Model Routing Is a System Design Problem, Not a Config Option

· 11 min read
Tian Pan
Software Engineer

Most teams choose their LLM the way they choose a database engine: once, during architecture review, and never again. You pick GPT-4o or Claude 3.5 Sonnet, bake it into your config, and ship. The choice feels irreversible because changing it requires a redeployment, coordination across services, and regression testing against whatever your evals look like this week.

That framing is a mistake. Your traffic is not homogeneous. A "summarize this document" request and a "debug this cryptic stack trace" request hitting the same endpoint at the same time have radically different capability requirements — but with static model selection, they're indistinguishable from your infrastructure's perspective. You're either over-provisioning one or under-serving the other, and you're doing it on every single request.

Model routing treats LLM selection as a runtime dispatch decision. Every incoming query gets evaluated on signals that predict the right model for that specific request, and the call is dispatched accordingly. The routing layer doesn't exist in your config file — it runs in your request path.
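To make the idea concrete, here is a deliberately crude sketch of a router that runs in the request path. The model names, the keyword list, and the length threshold are all assumptions for illustration; a production router would more likely use a trained classifier or eval-derived signals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str
    reason: str


# Hypothetical model tiers; substitute whatever your provider offers.
SMALL_MODEL = "small-fast-model"
LARGE_MODEL = "large-reasoning-model"

# Crude signals assumed to correlate with harder requests.
HARD_MARKERS = ("traceback", "stack trace", "exception", "debug")


def route(query: str, max_small_chars: int = 2000) -> Route:
    """Pick a model per request instead of reading one from static config."""
    text = query.lower()
    if any(marker in text for marker in HARD_MARKERS):
        return Route(LARGE_MODEL, "debugging signal")
    if len(query) > max_small_chars:
        return Route(LARGE_MODEL, "long input")
    return Route(SMALL_MODEL, "default")


print(route("Summarize this document").model)         # small-fast-model
print(route("Debug this cryptic stack trace").model)  # large-reasoning-model
```

Returning the reason alongside the model is a small design choice that pays off later: it gives you a log field to audit when the router sends a hard request to the cheap tier.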

The Selective Abstention Problem: Why AI Systems That Always Answer Are Broken

· 10 min read
Tian Pan
Software Engineer

Here is a pattern that appears in almost every production AI deployment: the team ships a feature that handles 90% of queries well. Then they start getting complaints. A user asked something outside the training distribution; the model confidently produced a wrong answer. A RAG pipeline retrieved a stale document; the model answered as though it were current. A legal query hit an edge case the prompt didn't cover; the model speculated its way through it. The fix, in each case, wasn't a better model. It was teaching the system to say "I don't know."

Abstention — the principled decision to not answer — is one of the hardest and most undervalued capabilities in AI system design. Virtually all product effort goes toward making answers better. Almost none goes toward making the system reliably know when to withhold one. That asymmetry is a design debt that compounds in production.

The CAP Theorem for AI Agents: Why Your Agent Fails Completely When It Should Degrade Gracefully

· 9 min read
Tian Pan
Software Engineer

Your AI agent works perfectly until it doesn't. One tool goes down — maybe the search API is rate-limited, maybe the database is slow, maybe the code execution sandbox times out — and the entire agent collapses. Not a partial answer, not a degraded response. A complete failure. A blank screen or a hallucinated mess.

This is not a bug. It is a design choice, and almost nobody made it deliberately. The agent architectures we are building today implicitly choose "fail completely" because nobody designed the partial-availability path. If you have built distributed systems before, this pattern should feel painfully familiar. It is the CAP theorem, showing up in a new disguise.
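One way to picture the partial-availability path is a tool loop that records failures instead of propagating them, then answers from whatever succeeded. This is a sketch under simple assumptions: tools are zero-argument callables, and `search` and `database` are hypothetical stand-ins for real integrations.

```python
from typing import Callable


def gather_tool_results(
    tools: dict[str, Callable[[], str]],
) -> tuple[dict[str, str], list[str]]:
    """Call every tool; record failures instead of letting one abort the run."""
    results: dict[str, str] = {}
    unavailable: list[str] = []
    for name, tool in tools.items():
        try:
            results[name] = tool()
        except Exception:
            unavailable.append(name)
    return results, unavailable


def build_response(results: dict[str, str], unavailable: list[str]) -> str:
    """Answer from whatever succeeded and state plainly what is missing."""
    if not results:
        return "No tools were available, so no answer can be given right now."
    body = "; ".join(f"{name}: {value}" for name, value in results.items())
    if unavailable:
        body += f" (degraded: {', '.join(unavailable)} unavailable)"
    return body


# Hypothetical tools: one is down, one works.
def search() -> str:
    raise TimeoutError("search API rate-limited")


def database() -> str:
    return "3 matching orders"


results, unavailable = gather_tool_results({"search": search, "database": database})
response = build_response(results, unavailable)
print(response)  # database: 3 matching orders (degraded: search unavailable)
```

The important part is that the degraded state is explicit in the output, so the user sees a partial answer with a stated gap rather than a blank screen or a confident guess.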

The Caching Hierarchy for Agentic Workloads: Five Layers Most Teams Stop at Two

· 11 min read
Tian Pan
Software Engineer

Most teams deploying AI agents implement prompt caching, maybe add a semantic cache, and call it done. They're leaving 40-60% of their potential savings on the table. The reason isn't laziness — it's that agentic workloads create caching problems that don't exist in simple request-response LLM calls, and the solutions require thinking in layers that traditional web caching never needed.

A single agent task might involve a 4,000-token system prompt, three tool calls that each return different-shaped data, a multi-step plan that's structurally identical to yesterday's plan, and session context that needs to persist across a conversation but never across users. Each of these represents a different caching opportunity with different TTL requirements, different invalidation triggers, and different failure modes when the cache goes stale.

Coalesce Before You Call: The LLM Request Batching Pattern That Cuts Costs Without Slowing Users Down

· 11 min read
Tian Pan
Software Engineer

Most teams discover request coalescing the same way: through a surprisingly large invoice. They ship an LLM-backed feature, usage grows, and then the billing dashboard shows they're paying for fifty thousand requests a day when closer examination reveals that roughly thirty thousand of them were asking the same thing in slightly different words. Each paraphrase of "summarize this document" hit the model separately. Each near-duplicate triggered a full inference cycle. The cost scaled with traffic volume, not with the semantic diversity of what users actually wanted.

Request coalescing is the pattern that fixes this. It is not one technique but a layered architecture: in-flight deduplication to prevent concurrent duplicates, exact caching for repeated identical prompts, and semantic batching to catch the paraphrased variations in between. The order matters, the thresholds matter, and understanding where the pattern breaks down — particularly around streaming — is what separates a working implementation from one that saves money on a staging server but causes subtle bugs in production.
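The first two layers can be sketched in a few lines of `asyncio`. This is a minimal illustration, not a production implementation: `Coalescer` and `fake_model` are hypothetical names, the cache is an unbounded in-memory dict with no TTL, and the semantic layer (which would need embeddings and a similarity threshold) is left out entirely.

```python
import asyncio
import hashlib
from typing import Awaitable, Callable


class Coalescer:
    """Exact cache plus in-flight deduplication for identical prompts."""

    def __init__(self, call_model: Callable[[str], Awaitable[str]]) -> None:
        self._call_model = call_model
        self._cache: dict[str, str] = {}
        self._in_flight: dict[str, asyncio.Future] = {}

    @staticmethod
    def _key(prompt: str) -> str:
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    async def complete(self, prompt: str) -> str:
        key = self._key(prompt)
        if key in self._cache:      # layer 2: repeated identical prompt
            return self._cache[key]
        if key in self._in_flight:  # layer 1: concurrent duplicate joins the pending call
            return await self._in_flight[key]
        task = asyncio.ensure_future(self._call_model(prompt))
        self._in_flight[key] = task
        try:
            result = await task
        finally:
            del self._in_flight[key]
        self._cache[key] = result
        return result


async def demo() -> tuple[int, list[str]]:
    calls = 0

    async def fake_model(prompt: str) -> str:  # stand-in for a real LLM call
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.01)
        return f"summary of: {prompt}"

    coalescer = Coalescer(fake_model)
    prompt = "summarize this document"
    results = await asyncio.gather(*(coalescer.complete(prompt) for _ in range(5)))
    await coalescer.complete(prompt)  # served from the exact cache
    return calls, list(results)


calls, results = asyncio.run(demo())
print(calls)  # 1
```

Six requests for the same prompt produce one model call: the five concurrent ones share a single in-flight task, and the sixth hits the cache. Streaming responses would not fit this shape as written, since waiters here only receive the finished string.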

The $100M Telemetry Bug: What OpenAI's Outage Teaches Us About System Design

· 3 min read

On December 11, 2024, OpenAI experienced a catastrophic outage that took down ChatGPT, their API, and Sora for over four hours. While outages happen to every company, this one is particularly fascinating because it reveals a critical lesson about modern system design: sometimes the tools we add to prevent failures become the source of failures themselves.

The Billion-Dollar Irony

Here's the fascinating part: The outage wasn't caused by a hack, a failed deployment, or even a bug in their AI models. Instead, it was caused by a tool meant to improve reliability. OpenAI was adding better monitoring to prevent outages when they accidentally created one of their biggest outages ever.

It's like hiring a security guard who accidentally locks everyone out of the building.

The Cascade of Failures

The incident unfolded like this:

  1. OpenAI deployed a new telemetry service to better monitor their systems
  2. This service overwhelmed their Kubernetes control plane with API requests
  3. When the control plane failed, DNS resolution broke
  4. Without DNS, services couldn't find each other
  5. Engineers couldn't fix the problem because they needed the control plane to remove the problematic service

But the most interesting part isn't the failure itself – it's how multiple safety systems failed simultaneously:

  1. Testing didn't catch the issue because it only appeared at scale
  2. DNS caching masked the problem long enough for it to spread everywhere
  3. The very systems needed to fix the problem were the ones that broke

Three Critical Lessons

1. Scale Changes Everything

The telemetry service worked perfectly in testing. The problem only emerged when deployed to clusters with thousands of nodes. This highlights a fundamental challenge in modern system design: some problems only emerge at scale.

2. Safety Systems Can Become Risk Factors

OpenAI's DNS caching, meant to improve reliability, actually made the problem worse by masking the issue until it was too late. Their Kubernetes control plane, designed to manage cluster health, became a single point of failure.

3. Recovery Plans Need Recovery Plans

The most damning part? Engineers couldn't fix the problem because they needed working systems to fix the broken systems. It's like needing a ladder to reach the ladder you need.

The Future of System Design

OpenAI's response plan reveals where system design is headed:

  1. Decoupling Critical Systems: They're separating their data plane from their control plane, reducing interdependencies
  2. Improved Testing: They're adding fault injection testing to simulate failures at scale
  3. Break-Glass Procedures: They're building emergency access systems that work even when everything else fails

What This Means for Your Company

Even if you're not operating at OpenAI's scale, the lessons apply:

  1. Test at scale, not just functionality
  2. Build emergency access systems before you need them
  3. Question your safety systems – they might be hiding risks

The future of reliable systems isn't about preventing all failures – it's about ensuring we can recover from them quickly and gracefully.

Remember: The most dangerous problems aren't the ones we can see coming. They're the ones that emerge from the very systems we build to keep us safe.

Quick Intro to Optimism Architecture

· 4 min read

What is Optimism?

Optimism is an EVM-equivalent optimistic rollup protocol designed to scale Ethereum.

  • Scaling Ethereum means increasing the number of useful transactions the Ethereum network can process.
  • Optimistic rollup is a layer 2 scalability technique which increases the computation & storage capacity of Ethereum without sacrificing security or decentralization.
  • EVM Equivalence is complete compliance with the state transition function described in the Ethereum yellow paper, the formal definition of the protocol.

An optimistic rollup bundles many L2 transactions into a single batch and posts it to Ethereum, where smart contracts record the data and the resulting state commitments. This process is called "rolling up" because the individual transactions are combined into a larger unit that is submitted to the Ethereum network. The term "optimistic" refers to the fact that the system assumes the posted results are valid unless someone proves otherwise within a challenge period, so Ethereum does not have to re-execute every transaction, which makes processing faster and cheaper.

Overall Architecture

[Figure: Optimism architecture]

op-node + op-geth

The rollup node can run either in validator or sequencer mode:

  1. validator (aka verifier): Similar to running an Ethereum node, it simulates L2 transactions locally, without rate limiting. It also lets the validator verify the work of the sequencer by re-deriving output roots and comparing them against those submitted by the sequencer. In case of a mismatch, the validator can submit a fault proof.
  2. sequencer: The sequencer is a privileged actor that receives L2 transactions from L2 users, builds L2 blocks from them, and submits those blocks to the data availability provider (via a batcher). It also submits output roots to L1. For now there is only one sequencer in the entire stack, which is the main reason critics say the OP Stack is not decentralized.
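The validator's check can be illustrated with a toy model. To be clear about the assumptions: this is not the real OP Stack derivation, which is defined in the Optimism specs. `derive_output_root` and `verify_proposal` are hypothetical helpers, and SHA-256 over an ordered transaction list stands in for executing the transactions and committing to the resulting state.

```python
import hashlib


def derive_output_root(prev_root: bytes, transactions: list[bytes]) -> bytes:
    """Toy derivation: fold each transaction into a running SHA-256 digest."""
    root = prev_root
    for tx in transactions:
        root = hashlib.sha256(root + tx).digest()
    return root


def verify_proposal(
    prev_root: bytes, transactions: list[bytes], proposed_root: bytes
) -> bool:
    """Return True when the locally re-derived root matches the proposed one."""
    return derive_output_root(prev_root, transactions) == proposed_root


genesis = b"\x00" * 32
batch = [b"alice->bob:5", b"bob->carol:2"]

# An honest sequencer proposes the root that anyone can re-derive from the batch.
honest_root = derive_output_root(genesis, batch)
print(verify_proposal(genesis, batch, honest_root))  # True

# A root that does not follow from the published batch is detected as a mismatch.
bad_root = derive_output_root(genesis, [b"alice->bob:500", b"bob->carol:2"])
print(verify_proposal(genesis, batch, bad_root))  # False
```

The sketch shows why data availability matters: the validator can only re-derive the root if the batch itself has been published, which is the batcher's job.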

op-batcher

The batch submitter, also referred to as the batcher, is the entity submitting the L2 sequencer data to L1, to make it available for verifiers.

op-proposer

The proposer generates L2 output checkpoints and submits them to the L2 output oracle contract on Ethereum. After the finalization period has passed, this data enables withdrawals.

Both the batcher and the proposer submit data to L1. Why are they separate?

The batcher collects transaction data and submits it to L1 in batches, while the proposer submits commitments (output roots) to the L2 state, which finalize the view of L2 account states. They are decoupled so that they can work in parallel for efficiency.

contracts-bedrock

Various contracts for L2 to interact with the L1:

  • OptimismPortal: A feed of L2 transactions which originated as smart contract calls in the L1 state.
  • Batch inbox: An L1 address to which the Batch Submitter submits transaction batches.
  • L2 output oracle: A smart contract that stores L2 output roots for use with withdrawals and fault proofs.

[Figure: Optimism components]

How to deposit?

How to withdraw?

Feedback to Optimism's Documentation

Understanding the OP Stack can be challenging for several reasons. One is that many components are referred to by slightly different names across the code and documentation. For example, "op-batcher" and "batch-submitter" name the same component, as do "verifiers" and "validators", and the terms are used interchangeably, which makes it harder to pin down the exact function of each component.

Another challenge in understanding the OP stack is the evolving architecture, which may result in some design elements becoming deprecated over time. Unfortunately, the documentation may not always be updated to reflect these changes. This can lead to further confusion and difficulty in understanding the system, as users may be working with outdated or inaccurate information.

To overcome these challenges, it is important to review all available documentation carefully, to keep terminology consistent across sources, and to stay up to date with changes to the OP Stack. This may require additional research and collaboration with other users or developers, but it is essential for fully understanding and effectively using this complex system.