Skip to main content

458 posts tagged with "ai-engineering"

View all tags

The Streaming Infrastructure Behind Real-Time Agent UIs

· 12 min read
Tian Pan
Software Engineer

Most agent streaming implementations break in one of four ways: the proxy eats the stream silently, the user closes the tab and the agent runs forever burning tokens, the page refreshes and the task is simply gone, or a tool call fails mid-stream and the agent goes quietly idle. None of these are model problems. They are infrastructure problems that teams discover in production after their demo went fine on localhost.

This post is about that gap — the server-side architecture decisions that determine whether a real-time agent UI is actually reliable, not just impressive in a demo environment.

When the Generalist Beats the Specialists: The Case for Unified Single-Agent Architectures

· 9 min read
Tian Pan
Software Engineer

The prevailing wisdom in AI engineering is that complex tasks require specialized agents: a researcher agent, a writer agent, a critic agent, each handling its narrow domain and handing off to the next. This architectural instinct feels correct — it mirrors how human teams work, how microservices are built, and how we decompose problems in software engineering. The problem is that empirical data increasingly says otherwise.

A 2025 study from Google DeepMind and MIT evaluated 180 configurations across five agent architectures and three LLM families. For sequential reasoning tasks — the category that covers most real knowledge work — every single multi-agent coordination variant degraded performance by 39 to 70 percent compared to a well-configured single agent. Not broke-even. Degraded.

This is not an argument against multi-agent systems categorically. There are workloads where coordination yields genuine returns. But the default instinct to reach for specialization is costing production teams real money, real latency, and real reliability — often for no measurable accuracy gain.

The Agent Planning Module: A Hidden Architectural Seam

· 10 min read
Tian Pan
Software Engineer

Most agentic systems are built with a single architectural assumption that goes unstated: the LLM handles both planning and execution in the same inference call. Ask it to complete a ten-step task, and the model decides what to do, does it, checks the result, decides what to do next—all in one continuous ReAct loop. This feels elegant. It also collapses under real workloads in a way that's hard to diagnose because the failure mode looks like a model quality problem rather than a design problem.

The agent planning module—the component responsible purely for task decomposition, dependency modeling, and sequencing—is the seam most practitioners skip. It shows up only when things get hard enough that you can't ignore it.

Agent-to-Agent Communication Protocols: The Interface Contracts That Make Multi-Agent Systems Debuggable

· 10 min read
Tian Pan
Software Engineer

When a multi-agent pipeline starts producing garbage outputs, the instinct is to blame the model. Bad reasoning, wrong context, hallucination. But in practice, a large fraction of multi-agent failures trace back to something far more boring: agents that can't reliably communicate with each other. Malformed JSON that passes syntax validation but fails semantic parsing. An orchestrator that sends a task with status "partial" that the downstream agent interprets as completion. A retry that fires an operation twice because there's no idempotency key.

These aren't model failures. They're interface failures. And they're harder to debug than model failures because nothing in your logs will tell you the serialization contract broke.

CI/CD for LLM Applications: Why Deploying a Prompt Is Nothing Like Deploying Code

· 10 min read
Tian Pan
Software Engineer

Your code ships through a pipeline: feature branch → pull request → automated tests → staging → production. Every step is gated. Nothing reaches users without passing the checks you've defined. It's boring in the best way.

Now imagine you need to update a system prompt. You edit the string in your dashboard, hit save, and the change is live immediately — no tests, no staging, no diff in version control, no way to roll back except by editing it back by hand. This is how most teams operate, and it's the reason prompt changes are the primary source of unexpected production outages for LLM applications.

The challenge isn't that teams are careless. It's that the discipline of continuous delivery was built for deterministic systems, and LLMs aren't deterministic. The entire mental model needs to be rebuilt from scratch.

GraphRAG in Production: When Vector Search Hits Its Ceiling

· 9 min read
Tian Pan
Software Engineer

Your vector search looks great on benchmarks. Users are still frustrated.

The failure mode is subtle: a user asks "Which of our suppliers have been involved in incidents that affected customers in the same region as the Martinez account?" Your embeddings retrieve the incident records. They retrieve the supplier contracts. They retrieve the customer accounts. But they retrieve them as disconnected documents, and the LLM has to figure out the relationships in context — relationships that span three hops across your entity graph. At five or more entities per query, accuracy without relational structure drops toward zero. With it, performance stays stable.

This is the ceiling that knowledge graph augmented retrieval — GraphRAG — is built to address. It is not a drop-in replacement for vector search. It is a different system with a different cost structure, different failure modes, and a different class of queries where it wins decisively.

Long-Context Models vs. RAG: When the 1M-Token Window Is the Wrong Tool

· 9 min read
Tian Pan
Software Engineer

When Gemini 1.5 Pro launched with a 1M-token context window, a wave of engineers declared RAG dead. The argument seemed airtight: why build a retrieval pipeline with chunkers, embeddings, vector databases, and re-rankers when you can just dump your entire knowledge base into the prompt and let the model figure it out?

That argument collapses under production load. Gemini 1.5 Pro achieves 99.7% recall on the "needle in a haystack" benchmark — a single fact hidden in a document. On realistic multi-fact retrieval, average recall hovers around 60%. That 40% miss rate isn't a benchmarking artifact; it's facts your system silently fails to surface to users. And the latency for a 1M-token request runs 30–60x slower than a RAG pipeline at roughly 1,250x the per-query cost.

Long-context models are a powerful tool. They're just not the right tool for most production retrieval workloads.

Structured Concurrency for AI Pipelines: Why asyncio.gather() Isn't Enough

· 9 min read
Tian Pan
Software Engineer

When an LLM returns three tool calls in a single response, the obvious thing is to run them in parallel. You reach for asyncio.gather(), fan the calls out, collect the results, return them to the model. The code works in testing. It works in staging. Six weeks into production, you start noticing your application holding open HTTP connections it should have released. Token quota is draining faster than usage metrics suggest. Occasionally, a tool that sends an email fires twice.

The underlying issue is not the LLM or the tool — it's the concurrency primitive. asyncio.gather() was not designed for the failure modes that multi-step agent pipelines produce, and using it as the backbone of parallel tool execution creates problems that are invisible until they compound.

JSON Mode Won't Save You: Structured Output Failures in Production LLM Systems

· 9 min read
Tian Pan
Software Engineer

When developers first wire up JSON mode, the response feels like solving a problem. The LLM stops returning markdown fences, prose apologies, and curly-brace-adjacent gibberish. The output parses. The tests pass. Production ships.

Then, three weeks later, a background job silently fails because the model returned {"status": "complete"} when the schema expected {"status": "completed"}. A data pipeline crashes because a required field came back as null instead of being omitted. An agent tool-call loop terminates early because the model embedded a stray newline inside a string value and the downstream parser choked on it.

JSON mode guarantees syntactically valid JSON. It does not guarantee that the JSON means what you think it means, contains the fields your application expects, or maintains semantic consistency across requests. These are different problems, and they require different solutions.

Beyond JSON Mode: Getting Reliable Structured Outputs from LLMs in Production

· 9 min read
Tian Pan
Software Engineer

You deploy a pipeline that extracts customer intent from support tickets. You've tested it extensively. It works great. Three days after launch, an alert fires: the downstream service is crashing on KeyError: 'category'. The model started returning ticket_category instead of category — no prompt change, just a model update your provider rolled out silently.

This is the structured output problem. And JSON mode doesn't solve it.

What Your APM Dashboard Won't Tell You: LLM Observability in Production

· 10 min read
Tian Pan
Software Engineer

Your Datadog dashboard shows 99.4% uptime, sub-500ms P95 latency, and a 0.1% error rate. Everything is green. Meanwhile, your support queue is filling with users complaining the AI gave them completely wrong answers. You have no idea why, because every request returned HTTP 200.

This is the fundamental difference between traditional observability and what you actually need for LLM systems. A language model can fail in ways that leave no trace in standard APM tooling: hallucinating facts, retrieving documents from the wrong product version, ignoring the system prompt after a code change modified it, or silently degrading on a specific query type after a model update. All of these look fine on your latency graph.

LLM Routing and Model Cascades: How to Cut AI Costs Without Sacrificing Quality

· 9 min read
Tian Pan
Software Engineer

Most production AI systems fail at cost management the same way: they ship with a single frontier model handling every request, watch their API bill grow linearly with traffic, and then scramble to add caching or reduce context windows as a band-aid. The actual fix — routing different queries to different models based on what each query actually needs — sounds obvious in retrospect but is rarely implemented well.

The numbers make the case plainly. Current frontier models like Claude Opus cost roughly $5 per million input tokens and $25 per million output tokens. Efficient models in the same family cost $1 and $5 respectively — a 5x ratio. Research using RouteLLM shows that with proper routing, you can maintain 95% of frontier model quality while routing 85% of queries to cheaper models, achieving cost reductions of 45–85% depending on your workload. That's not a marginal improvement; it changes the unit economics of deploying AI at scale.