6 posts tagged with "llm-production"

Compound AI Systems: Why Your Best Architecture Uses Three Models, Not One

10 min read
Tian Pan
Software Engineer

The instinct is always to reach for the biggest model. Pick the frontier model, point it at the problem, and hope that raw capability compensates for architectural laziness. It works in demos. It fails in production.

The teams shipping the most reliable AI systems in 2025 and 2026 aren't using one model. They're composing three, four, sometimes five specialized models into pipelines where each component does exactly one thing well. A classifier routes. A generator produces. A verifier checks. The result is a system that outperforms any single model while costing a fraction of what a frontier-model-for-everything approach would.
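The shape of that pipeline is easier to see in code than in prose. A minimal sketch of the route-generate-verify pattern, where `call_model` and the model names are hypothetical stand-ins for real provider SDK calls:

```python
from dataclasses import dataclass

# Hypothetical stand-in for a real model API client; wire up your provider SDK here.
def call_model(model: str, prompt: str) -> str:
    raise NotImplementedError

@dataclass
class PipelineResult:
    route: str
    draft: str
    verified: bool

def compound_pipeline(user_input: str) -> PipelineResult:
    # 1. Route: a small, cheap classifier picks the path.
    route = call_model("tiny-classifier",
                       f"Label this request 'code' or 'prose':\n{user_input}").strip()
    # 2. Generate: a mid-size model does the bulk of the work on that path.
    draft = call_model("mid-generator", f"[{route} task]\n{user_input}")
    # 3. Verify: a small checker gates the output; only rejects escalate.
    ok = call_model("small-verifier",
                    f"Does this answer the request? yes/no\n{draft}").strip().lower() == "yes"
    if not ok:
        draft = call_model("frontier-model", user_input)  # expensive fallback, rarely taken
    return PipelineResult(route=route, draft=draft, verified=ok)
```

The frontier model only runs when the verifier rejects, which is what keeps the blended cost at a fraction of routing everything to it.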

The Hidden Token Tax: Where 30-60% of Your Context Window Disappears Before Users Say a Word

8 min read
Tian Pan
Software Engineer

You're paying for a 200K-token context window. Your users get maybe 80K of it. The rest vanishes before their first message arrives — consumed by system prompts, tool definitions, safety preambles, and chat history padding. This is the hidden token tax, and most teams don't realize they're paying it until they hit context limits in production.

The gap between advertised context window and usable context window is one of the most expensive blind spots in production LLM systems. It compounds across multi-turn conversations, inflates latency through attention overhead, and silently degrades output quality as useful information gets pushed into the "lost in the middle" zone where models stop paying attention.
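A back-of-envelope accounting makes the tax concrete. The overhead figures below are illustrative assumptions, not measurements from any particular provider, but the arithmetic is the point:

```python
# Illustrative token budget accounting; every overhead number here is an assumption.
ADVERTISED_CONTEXT = 200_000

overhead = {
    "system_prompt": 6_000,
    "tool_definitions": 25_000,    # schemas for a few dozen tools add up fast
    "safety_preamble": 3_000,
    "chat_history": 80_000,        # multi-turn padding grows every turn
    "reserved_for_output": 8_000,  # max_tokens you hold back for the reply
}

usable = ADVERTISED_CONTEXT - sum(overhead.values())
tax_pct = 100 * (1 - usable / ADVERTISED_CONTEXT)
print(f"usable context: {usable:,} tokens ({tax_pct:.0f}% tax)")
# usable context: 78,000 tokens (61% tax)
```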

The Retry Storm Problem in Agentic Systems: Why Every Failed Tool Call Burns Your Token Budget

10 min read
Tian Pan
Software Engineer

Every backend engineer knows that retries are essential. Every distributed systems engineer knows that retries are dangerous. When you put an LLM agent in charge of retrying tool calls, you get both problems at once — plus a new one: every retry burns tokens. A single flaky API endpoint can turn a $0.01 agent task into a $2 meltdown in under a minute.

The retry storm problem isn't new. Distributed systems have dealt with thundering herds and cascading failures for decades. But agentic systems amplify the problem in ways that microservice patterns don't fully address, because the retry logic lives inside a probabilistic reasoning engine that doesn't understand backpressure.
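One mitigation is to take retry policy away from the model entirely and put it in deterministic code with a hard token budget. A minimal sketch, where `call_tool` and the budget numbers are illustrative assumptions:

```python
import random
import time

# Hypothetical tool endpoint; assume each attempt costs roughly a known token amount.
def call_tool(name: str, args: dict) -> dict:
    raise NotImplementedError

def retry_with_budget(name: str, args: dict,
                      max_attempts: int = 3,
                      token_budget: int = 4_000,
                      tokens_per_attempt: int = 1_200) -> dict:
    """Deterministic retry wrapper: the agent never decides to retry itself."""
    spent = 0
    for attempt in range(max_attempts):
        if spent + tokens_per_attempt > token_budget:
            raise RuntimeError(f"token budget exhausted after {attempt} attempts")
        spent += tokens_per_attempt
        try:
            return call_tool(name, args)
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff with jitter: the classic distributed-systems fix,
            # applied outside the probabilistic reasoning loop.
            time.sleep((2 ** attempt) + random.random())
```

Because the budget check runs before each attempt, a flaky endpoint fails fast at a bounded cost instead of melting down the task's token spend.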

Why Your Thumbs-Down Data Is Lying to You: Selection Bias in Production AI Feedback Loops

9 min read
Tian Pan
Software Engineer

You shipped a thumbs-up/thumbs-down button on your AI feature six months ago. You have thousands of ratings. You built a dashboard. You even fine-tuned on the negative examples. And your product is getting worse in ways your feedback data cannot explain.

The problem isn't that users are wrong about what they dislike. The problem is that the users who click your feedback buttons are a systematically unrepresentative sample of your actual user base — and every decision you make from that data inherits their biases.
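One standard statistical correction (not specific to this post) is inverse propensity weighting: estimate how likely each user segment is to leave feedback at all, then down-weight the over-represented clickers. A toy sketch with invented segment names and rates:

```python
# Toy inverse-propensity correction; all segments and rates are made up for illustration.
segments = {
    # segment: (share of all users, P(leaves feedback), thumbs-down rate)
    "power_users":  (0.10, 0.40, 0.55),
    "casual_users": (0.70, 0.03, 0.20),
    "new_users":    (0.20, 0.05, 0.35),
}

# Naive estimate: averages over ratings, so it is dominated by whoever clicks most.
total_ratings = sum(share * p_fb for share, p_fb, _ in segments.values())
naive = sum(share * p_fb * down for share, p_fb, down in segments.values()) / total_ratings

# Reweighted estimate: each rating is weighted by 1 / P(feedback) for its segment,
# which recovers the population-level dissatisfaction rate.
corrected = sum(share * down for share, _, down in segments.values())

print(f"naive thumbs-down rate:     {naive:.1%}")    # ~41.8%, skewed by power users
print(f"corrected thumbs-down rate: {corrected:.1%}")  # 26.5%, the population truth
```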

Agent-to-Agent Communication Protocols: The Interface Contracts That Make Multi-Agent Systems Debuggable

10 min read
Tian Pan
Software Engineer

When a multi-agent pipeline starts producing garbage outputs, the instinct is to blame the model. Bad reasoning, wrong context, hallucination. But in practice, a large fraction of multi-agent failures trace back to something far more boring: agents that can't reliably communicate with each other. Malformed JSON that passes syntax validation but fails semantic parsing. An orchestrator that sends a task with status "partial" that the downstream agent interprets as completion. A retry that fires an operation twice because there's no idempotency key.

These aren't model failures. They're interface failures. And they're harder to debug than model failures because nothing in your logs will tell you the serialization contract broke.
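A minimal sketch of the kind of contract that closes those gaps: an explicit status enum (so "partial" can never be mistaken for completion), an idempotency key (so a double-fired retry can be deduplicated), and validation at the boundary. Field names here are illustrative:

```python
import json
import uuid
from dataclasses import dataclass, field
from enum import Enum

class TaskStatus(Enum):
    PENDING = "pending"
    PARTIAL = "partial"    # explicitly NOT completion
    COMPLETE = "complete"
    FAILED = "failed"

@dataclass
class AgentMessage:
    task_id: str
    status: TaskStatus
    payload: dict
    # Idempotency key lets a receiver drop duplicate deliveries caused by retries.
    idempotency_key: str = field(default_factory=lambda: str(uuid.uuid4()))

def parse_message(raw: str) -> AgentMessage:
    """Fail loudly at the boundary instead of letting bad messages flow downstream."""
    data = json.loads(raw)   # syntax validation
    return AgentMessage(     # semantic validation: an unknown status raises ValueError here
        task_id=data["task_id"],
        status=TaskStatus(data["status"]),
        payload=data["payload"],
        idempotency_key=data["idempotency_key"],
    )
```

When a message breaks the contract, the failure now shows up in your logs at the boundary that rejected it, not three agents downstream.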

The Anatomy of an Agent Harness

8 min read
Tian Pan
Software Engineer

There's a 100-line Python agent that scores 74–76% on SWE-bench Verified — only 4–6 percentage points behind state-of-the-art systems built by well-funded teams. The execution loop itself isn't where the complexity lives. World-class teams invest six to twelve months building the infrastructure around that loop. That infrastructure has a name: the harness.

The formula is simple: Agent = Model + Harness. The model handles reasoning. The harness handles everything else — tool execution, context management, safety enforcement, error recovery, state persistence, and human-in-the-loop workflows. If you've been spending months optimizing prompts and model selection while shipping brittle agents, you've been optimizing the wrong thing.
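A minimal sketch of the Model + Harness split, with hypothetical `llm` and `run_tool` stubs standing in for a provider SDK and a tool runtime. The loop itself is trivial; every harness responsibility listed above hangs off one of its lines:

```python
# Minimal Agent = Model + Harness sketch; `llm` and `run_tool` are illustrative stubs.
def llm(messages: list[dict]) -> dict:
    raise NotImplementedError

def run_tool(name: str, args: dict) -> str:
    raise NotImplementedError  # sandboxing, timeouts, and retries live here

def agent_loop(task: str, max_steps: int = 20) -> str:
    messages = [{"role": "user", "content": task}]  # context management lives here
    for _ in range(max_steps):                      # safety enforcement: hard step limit
        reply = llm(messages)
        if reply.get("tool_call") is None:
            return reply["content"]                 # the model decided it is done
        name, args = reply["tool_call"]["name"], reply["tool_call"]["args"]
        try:
            result = run_tool(name, args)
        except Exception as exc:                    # error recovery: feed failures back
            result = f"tool error: {exc}"
        messages.append({"role": "tool", "content": result})
    return "step limit reached"                     # state persistence would checkpoint here
```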