269 posts tagged with "ai-engineering"

Multimodal LLMs in Production: The Cost Math Nobody Runs Upfront

· 11 min read
Tian Pan
Software Engineer

Most teams add multimodal capabilities to an existing LLM pipeline without running the cost math first. They prototype with a few test images, it works, they ship — and then the first billing cycle arrives. The number is somewhere between embarrassing and catastrophic, depending on volume.

The problem isn't that multimodal AI is expensive in principle. It's that each modality has a distinct token arithmetic that compounds in ways that text-only intuition doesn't prepare you for. A single configuration parameter — video frame rate, image resolution mode, whether you're re-sending a system prompt every turn — can silently multiply your inference bill by 10x or more before you've noticed anything is wrong.
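The compounding is easiest to see as arithmetic. Here is a minimal sketch of that cost math; every rate below (tokens per frame, price per million tokens) is an illustrative assumption, not any provider's real pricing:

```python
# Back-of-envelope cost model for adding video to a text-only pipeline.
# All rates are illustrative assumptions, not a real rate card.

def monthly_inference_cost(
    requests_per_day: int,
    text_tokens_per_request: int,
    video_seconds: float = 0.0,
    frames_per_second: float = 1.0,
    tokens_per_frame: int = 258,           # assumed per-frame token cost
    usd_per_million_tokens: float = 0.50,  # assumed input-token price
) -> float:
    """Estimate monthly input-token spend for one endpoint."""
    video_tokens = video_seconds * frames_per_second * tokens_per_frame
    tokens_per_request = text_tokens_per_request + video_tokens
    monthly_tokens = tokens_per_request * requests_per_day * 30
    return monthly_tokens / 1e6 * usd_per_million_tokens

# Text-only vs. the "same" pipeline with a 60-second clip per request:
text_only = monthly_inference_cost(10_000, 2_000)
with_video = monthly_inference_cost(10_000, 2_000, video_seconds=60)
print(f"text only:  ${text_only:,.0f}/mo")
print(f"with video: ${with_video:,.0f}/mo  ({with_video / text_only:.1f}x)")
```

Under these assumed numbers, attaching one minute of video at 1 fps multiplies the monthly input bill by roughly 9x before a single config knob has been touched. Doubling the frame rate doubles it again.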

Why the Chunking Problem Isn't Solved: How Naive RAG Pipelines Hallucinate on Long Documents

· 9 min read
Tian Pan
Software Engineer

Most RAG tutorials treat chunking as a footnote: split your documents into 512-token chunks, embed them, store them in a vector database, and move on to the interesting parts. This works well enough on toy examples — Wikipedia articles, clean markdown docs, short PDFs. It falls apart in production.

A recent study deploying RAG for clinical decision support found that the fixed-size baseline achieved 13% fully accurate responses across 30 clinical questions. An adaptive chunking approach on the same corpus: 50% fully accurate (p=0.001). The documents were the same. The LLM was the same. Only the chunking changed. That gap is not a tuning problem or a prompt engineering problem. It is a structural failure in how most teams split documents.
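The structural difference is visible even in a toy sketch. Below, a fixed-size splitter cuts on a hard token budget wherever it lands, while a structure-aware splitter packs whole paragraphs up to the same budget — so a clinical finding and its qualifier can never land in different chunks. The whitespace tokenizer is a stand-in for a real one:

```python
# Fixed-size vs. structure-aware chunking, minimal contrast.
# Whitespace split stands in for a real tokenizer.

def fixed_size_chunks(text: str, size: int = 512) -> list[str]:
    """Split on a hard token budget, ignoring document structure."""
    tokens = text.split()
    return [" ".join(tokens[i:i + size]) for i in range(0, len(tokens), size)]

def structure_aware_chunks(text: str, size: int = 512) -> list[str]:
    """Pack whole paragraphs up to the budget, never splitting one."""
    chunks, current, current_len = [], [], 0
    for para in text.split("\n\n"):
        if not para.strip():
            continue
        n = len(para.split())
        if current and current_len + n > size:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks

doc = "Dose: 5 mg daily.\n\nContraindicated in renal impairment.\n\n" * 100
print(len(fixed_size_chunks(doc, 64)), "fixed chunks")
print(len(structure_aware_chunks(doc, 64)), "structure-aware chunks")
```

The fixed-size variant will happily put "Contraindicated in renal" at the end of one chunk and "impairment" at the start of the next — and a retriever that surfaces only the first half is exactly how a RAG pipeline hallucinates a safe dosage.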

The Reasoning Trace Privacy Problem: What Your CoT Logs Are Leaking

· 8 min read
Tian Pan
Software Engineer

Most teams building on reasoning models treat privacy as a two-surface problem: sanitize the prompt going in, sanitize the response coming out. The reasoning trace in between gets logged wholesale for observability, surfaced to downstream systems for debugging, and sometimes passed back to users who asked to "see the thinking." That middle layer is where the real exposure lives — and most production deployments are not treating it like the liability it is.

Research from early 2026 quantified what practitioners have been observing anecdotally: large reasoning models (LRMs) leak personally identifiable information in their intermediate reasoning steps more often than in their final answers. In one study testing five open-source models across medical and financial scenarios, the finding was unambiguous — intermediate reasoning reliably surfaces PII that the final response had successfully withheld. The final answer is sanitized; the trace is not.
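One practical response is to treat the trace as its own privacy surface, scrubbed before it reaches logs or observability tooling. Here is a minimal sketch of that idea; the regex patterns are illustrative, nowhere near a complete PII taxonomy, and a production system would use a proper PII detector:

```python
import re

# Trace-level redaction sketch: the reasoning trace is scrubbed before it
# ever reaches logs. Patterns below are illustrative, not exhaustive.

PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def redact_trace(trace: str) -> str:
    """Replace matched spans with typed placeholders before logging."""
    for label, pattern in PII_PATTERNS.items():
        trace = pattern.sub(f"[{label}]", trace)
    return trace

def log_completion(prompt: str, trace: str, answer: str) -> dict:
    """Log all three surfaces — but never the raw trace."""
    return {"prompt": prompt, "reasoning": redact_trace(trace), "answer": answer}

record = log_completion(
    "Summarize the case.",
    "Patient john.doe@example.com (SSN 123-45-6789) reports chest pain...",
    "The patient reports chest pain.",
)
print(record["reasoning"])
```

The point is architectural, not the regexes: the redaction sits between the model and every downstream consumer of the trace, so the sanitized final answer is no longer the only sanitized surface.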

The Streaming Infrastructure Behind Real-Time Agent UIs

· 12 min read
Tian Pan
Software Engineer

Most agent streaming implementations break in one of four ways: the proxy eats the stream silently, the user closes the tab and the agent runs forever burning tokens, the page refreshes and the task is simply gone, or a tool call fails mid-stream and the agent goes quietly idle. None of these are model problems. They are infrastructure problems that teams discover in production after their demo went fine on localhost.

This post is about that gap — the server-side architecture decisions that determine whether a real-time agent UI is actually reliable, not just impressive in a demo environment.
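The refresh-and-the-task-is-gone failure, for instance, comes down to whether stream events survive the connection that produced them. A sketch of the usual fix: append events to a durable per-run log with monotonic ids, so a reconnecting client can replay from its SSE `Last-Event-ID`. In production this log lives in Redis or a database, not process memory:

```python
import itertools
from collections import defaultdict

class RunEventLog:
    """Durable-ish event log per agent run; in-memory here for illustration."""

    def __init__(self):
        self._events: dict[str, list[tuple[int, str]]] = defaultdict(list)
        self._counters = defaultdict(lambda: itertools.count(1))

    def append(self, run_id: str, data: str) -> int:
        event_id = next(self._counters[run_id])
        self._events[run_id].append((event_id, data))
        return event_id

    def replay(self, run_id: str, last_event_id: int = 0) -> list[tuple[int, str]]:
        """Everything the client missed, in order — the core of SSE resume."""
        return [e for e in self._events[run_id] if e[0] > last_event_id]

log = RunEventLog()
for chunk in ["Searching...", "Found 3 results", "Drafting answer"]:
    log.append("run-42", chunk)

# The client saw event 1, refreshed, and reconnects with Last-Event-ID: 1
print(log.replay("run-42", last_event_id=1))
```

Because the agent appends to the log rather than writing to the socket directly, the agent's lifetime is decoupled from the connection's — which is also the hook for killing runs whose clients never come back.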

When the Generalist Beats the Specialists: The Case for Unified Single-Agent Architectures

· 9 min read
Tian Pan
Software Engineer

The prevailing wisdom in AI engineering is that complex tasks require specialized agents: a researcher agent, a writer agent, a critic agent, each handling its narrow domain and handing off to the next. This architectural instinct feels correct — it mirrors how human teams work, how microservices are built, and how we decompose problems in software engineering. The problem is that empirical data increasingly says otherwise.

A 2025 study from Google DeepMind and MIT evaluated 180 configurations across five agent architectures and three LLM families. For sequential reasoning tasks — the category that covers most real knowledge work — every single multi-agent coordination variant degraded performance by 39 to 70 percent compared to a well-configured single agent. Not break-even. Degraded.

This is not an argument against multi-agent systems categorically. There are workloads where coordination yields genuine returns. But the default instinct to reach for specialization is costing production teams real money, real latency, and real reliability — often for no measurable accuracy gain.

The Agent Planning Module: A Hidden Architectural Seam

· 10 min read
Tian Pan
Software Engineer

Most agentic systems are built with a single architectural assumption that goes unstated: the LLM handles both planning and execution in the same inference call. Ask it to complete a ten-step task, and the model decides what to do, does it, checks the result, decides what to do next—all in one continuous ReAct loop. This feels elegant. It also collapses under real workloads in a way that's hard to diagnose because the failure mode looks like a model quality problem rather than a design problem.

The agent planning module—the component responsible purely for task decomposition, dependency modeling, and sequencing—is the seam most practitioners skip. It shows up only when things get hard enough that you can't ignore it.
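What that seam looks like as code: the planner emits an explicit task graph, and a separate executor walks it in dependency order — two components rather than one fused ReAct loop. This is a minimal sketch with hypothetical step contents, not a full planner:

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    """One node in the plan: what to do and what must finish first."""
    id: str
    action: str
    depends_on: list[str] = field(default_factory=list)

def executable_order(steps: list[Step]) -> list[str]:
    """Topologically order steps so dependencies always run first."""
    done: set[str] = set()
    order: list[str] = []
    remaining = {s.id: s for s in steps}
    while remaining:
        ready = [s for s in remaining.values() if set(s.depends_on) <= done]
        if not ready:
            raise ValueError("cycle or missing dependency in plan")
        for s in sorted(ready, key=lambda s: s.id):
            order.append(s.id)
            done.add(s.id)
            del remaining[s.id]
    return order

plan = [
    Step("fetch", "download quarterly reports"),
    Step("extract", "pull revenue tables", depends_on=["fetch"]),
    Step("compare", "diff against last quarter", depends_on=["extract"]),
    Step("summarize", "write the memo", depends_on=["compare", "fetch"]),
]
print(executable_order(plan))
```

The payoff is diagnostic: when a run fails, you can see whether the plan was wrong (bad decomposition) or the execution was wrong (bad tool call) — a distinction the fused loop erases.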

Agent-to-Agent Communication Protocols: The Interface Contracts That Make Multi-Agent Systems Debuggable

· 10 min read
Tian Pan
Software Engineer

When a multi-agent pipeline starts producing garbage outputs, the instinct is to blame the model. Bad reasoning, wrong context, hallucination. But in practice, a large fraction of multi-agent failures trace back to something far more boring: agents that can't reliably communicate with each other. Malformed JSON that passes syntax validation but fails semantic parsing. An orchestrator that sends a task with status "partial" that the downstream agent interprets as completion. A retry that fires an operation twice because there's no idempotency key.

These aren't model failures. They're interface failures. And they're harder to debug than model failures because nothing in your logs will tell you the serialization contract broke.
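A sketch of what an explicit envelope buys you: a closed status vocabulary, so "partial" can never be misread as completion, and an idempotency key, so a retry can be deduplicated. The field names here are illustrative, not a published protocol:

```python
import json
import uuid
from dataclasses import dataclass, asdict

ALLOWED_STATUS = {"pending", "partial", "completed", "failed"}

@dataclass
class AgentMessage:
    """Inter-agent envelope with an enforced status vocabulary."""
    task_id: str
    status: str
    payload: dict
    idempotency_key: str

    def __post_init__(self):
        if self.status not in ALLOWED_STATUS:
            raise ValueError(f"unknown status {self.status!r}")

def handle(msg_json: str, seen_keys: set[str]) -> str:
    """Downstream agent: reject replays, act only on 'completed'."""
    msg = AgentMessage(**json.loads(msg_json))
    if msg.idempotency_key in seen_keys:
        return "duplicate: ignored"
    seen_keys.add(msg.idempotency_key)
    return "execute" if msg.status == "completed" else "wait"

seen: set[str] = set()
msg = AgentMessage("t-1", "partial", {"rows": 40}, str(uuid.uuid4()))
wire = json.dumps(asdict(msg))
print(handle(wire, seen))  # a partial result is not an execution trigger
print(handle(wire, seen))  # the retried delivery is deduplicated
```

None of this is sophisticated — which is the point. The failures it prevents are boring, and so is the fix.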

CI/CD for LLM Applications: Why Deploying a Prompt Is Nothing Like Deploying Code

· 10 min read
Tian Pan
Software Engineer

Your code ships through a pipeline: feature branch → pull request → automated tests → staging → production. Every step is gated. Nothing reaches users without passing the checks you've defined. It's boring in the best way.

Now imagine you need to update a system prompt. You edit the string in your dashboard, hit save, and the change is live immediately — no tests, no staging, no diff in version control, no way to roll back except by editing it back by hand. This is how most teams operate, and it's the reason prompt changes are the primary source of unexpected production outages for LLM applications.

The challenge isn't that teams are careless. It's that the discipline of continuous delivery was built for deterministic systems, and LLMs aren't deterministic. The entire mental model needs to be rebuilt from scratch.
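Rebuilding it starts with treating the prompt as a deployable artifact: content-addressed versions you can diff and roll back, plus a regression gate before promotion. A minimal sketch, with a stub in place of the real model call and a placeholder eval suite:

```python
import hashlib

def prompt_version(prompt: str) -> str:
    """Content-addressed version id: every edit yields a new, diffable hash."""
    return hashlib.sha256(prompt.encode()).hexdigest()[:12]

def gate(prompt: str, eval_cases: list[dict], run_model, threshold: float = 0.9) -> bool:
    """Block promotion unless the candidate prompt passes the regression suite."""
    passed = sum(
        1 for case in eval_cases
        if case["expected"] in run_model(prompt, case["input"])
    )
    return passed / len(eval_cases) >= threshold

# Stub model for illustration — a real gate calls the actual LLM, ideally
# several times per case to account for nondeterminism.
def stub_model(prompt: str, user_input: str) -> str:
    return f"Refund policy: 30 days. ({user_input})"

cases = [{"input": "refund window?", "expected": "30 days"}]
candidate = "You are a support agent. Cite the refund policy exactly."
print(prompt_version(candidate), "passes gate:", gate(candidate, cases, stub_model))
```

This is deliberately the same shape as a code pipeline — versioned artifact, gated check, promotable or rejectable — just with a statistical pass criterion instead of a deterministic one.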

GraphRAG in Production: When Vector Search Hits Its Ceiling

· 9 min read
Tian Pan
Software Engineer

Your vector search looks great on benchmarks. Users are still frustrated.

The failure mode is subtle: a user asks "Which of our suppliers have been involved in incidents that affected customers in the same region as the Martinez account?" Your embeddings retrieve the incident records. They retrieve the supplier contracts. They retrieve the customer accounts. But they retrieve them as disconnected documents, and the LLM has to figure out the relationships in context — relationships that span three hops across your entity graph. At five or more entities per query, accuracy without relational structure drops toward zero. With it, performance stays stable.

This is the ceiling that knowledge graph augmented retrieval — GraphRAG — is built to address. It is not a drop-in replacement for vector search. It is a different system with a different cost structure, different failure modes, and a different class of queries where it wins decisively.
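The supplier question above is, structurally, a three-hop traversal. A toy sketch makes the contrast concrete: with explicit edges, the relationships the LLM would otherwise have to infer from disconnected documents become a direct walk. Entity names are fabricated for the example:

```python
from collections import defaultdict

# Toy entity graph: supplier -> incident -> customer -> region.
edges: dict[str, list[tuple[str, str]]] = defaultdict(list)

def add_edge(src: str, rel: str, dst: str) -> None:
    edges[src].append((rel, dst))

add_edge("AcmeParts", "caused", "incident-7")
add_edge("incident-7", "affected", "GloboCorp")
add_edge("GloboCorp", "in_region", "EMEA")
add_edge("Martinez", "in_region", "EMEA")

def neighbors(node: str, rel: str) -> list[str]:
    return [dst for r, dst in edges.get(node, []) if r == rel]

def suppliers_affecting_region_of(account: str) -> set[str]:
    """Three hops: account's region <- customers there <- incidents <- suppliers."""
    regions = set(neighbors(account, "in_region"))
    hits = set()
    for supplier in list(edges):
        for incident in neighbors(supplier, "caused"):
            for customer in neighbors(incident, "affected"):
                if set(neighbors(customer, "in_region")) & regions:
                    hits.add(supplier)
    return hits

print(suppliers_affecting_region_of("Martinez"))
```

Nothing here requires an embedding at all — which is exactly the class of query where vector similarity was never the right primitive in the first place. Production GraphRAG combines both: vectors to find entry points, edges to traverse from them.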

Long-Context Models vs. RAG: When the 1M-Token Window Is the Wrong Tool

· 9 min read
Tian Pan
Software Engineer

When Gemini 1.5 Pro launched with a 1M-token context window, a wave of engineers declared RAG dead. The argument seemed airtight: why build a retrieval pipeline with chunkers, embeddings, vector databases, and re-rankers when you can just dump your entire knowledge base into the prompt and let the model figure it out?

That argument collapses under production load. Gemini 1.5 Pro achieves 99.7% recall on the "needle in a haystack" benchmark — a single fact hidden in a document. On realistic multi-fact retrieval, average recall hovers around 60%. That 40% miss rate isn't a benchmarking artifact; it's facts your system silently fails to surface to users. And the latency for a 1M-token request runs 30–60x slower than a RAG pipeline at roughly 1,250x the per-query cost.

Long-context models are a powerful tool. They're just not the right tool for most production retrieval workloads.
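The cost gap is worth seeing as plain arithmetic. A sketch under stated assumptions — the input price and context sizes below are illustrative, not a provider's rate card, though the ratio they produce matches the rough 1,250x figure above:

```python
# Per-query input-cost comparison: full corpus dump vs. retrieved context.
# Prices and sizes are illustrative assumptions.

USD_PER_M_INPUT = 1.25             # assumed long-context input price
KNOWLEDGE_BASE_TOKENS = 1_000_000  # the whole corpus stuffed into the prompt
RAG_CONTEXT_TOKENS = 800           # top-k retrieved chunks per query

def per_query_cost(tokens: int, usd_per_m: float = USD_PER_M_INPUT) -> float:
    return tokens / 1e6 * usd_per_m

full_dump = per_query_cost(KNOWLEDGE_BASE_TOKENS)
rag = per_query_cost(RAG_CONTEXT_TOKENS)
print(f"full dump: ${full_dump:.4f}/query")
print(f"rag:       ${rag:.4f}/query  ({full_dump / rag:,.0f}x cheaper)")
```

The asymmetry compounds with traffic: the full-dump cost is paid per query, every query, while the RAG pipeline's fixed costs (embedding, indexing) are paid once per document.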

Structured Concurrency for AI Pipelines: Why asyncio.gather() Isn't Enough

· 9 min read
Tian Pan
Software Engineer

When an LLM returns three tool calls in a single response, the obvious thing is to run them in parallel. You reach for asyncio.gather(), fan the calls out, collect the results, return them to the model. The code works in testing. It works in staging. Six weeks into production, you start noticing your application holding open HTTP connections it should have released. Token quota is draining faster than usage metrics suggest. Occasionally, a tool that sends an email fires twice.

The underlying issue is not the LLM or the tool — it's the concurrency primitive. asyncio.gather() was not designed for the failure modes that multi-step agent pipelines produce, and using it as the backbone of parallel tool execution creates problems that are invisible until they compound.

JSON Mode Won't Save You: Structured Output Failures in Production LLM Systems

· 9 min read
Tian Pan
Software Engineer

When developers first wire up JSON mode, the response feels like solving a problem. The LLM stops returning markdown fences, prose apologies, and curly-brace-adjacent gibberish. The output parses. The tests pass. Production ships.

Then, three weeks later, a background job silently fails because the model returned {"status": "complete"} when the schema expected {"status": "completed"}. A data pipeline crashes because a required field came back as null instead of being omitted. An agent tool-call loop terminates early because the model embedded a stray newline inside a string value and the downstream parser choked on it.

JSON mode guarantees syntactically valid JSON. It does not guarantee that the JSON means what you think it means, contains the fields your application expects, or maintains semantic consistency across requests. These are different problems, and they require different solutions.
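The missing layer is semantic validation on the parsed object: enforce the contract your application actually depends on, and repair or reject near-misses like "complete" vs "completed". A minimal sketch with illustrative field names and an alias table of observed near-misses:

```python
import json
from enum import Enum

class Status(Enum):
    PENDING = "pending"
    COMPLETED = "completed"

# Near-miss values observed in the wild, mapped to canonical ones.
ALIASES = {"complete": "completed", "done": "completed"}

def parse_status_payload(raw: str) -> dict:
    obj = json.loads(raw)                # JSON mode guarantees this succeeds
    if "status" not in obj:
        raise ValueError("missing required field: status")
    value = ALIASES.get(obj["status"], obj["status"])
    obj["status"] = Status(value).value  # raises if still outside the enum
    return obj

print(parse_status_payload('{"status": "complete"}'))    # repaired
try:
    parse_status_payload('{"status": "finished"}')
except ValueError as e:
    print("rejected:", e)                # fail loudly, not silently downstream
```

The design choice worth arguing about is the alias table: repairing near-misses keeps the pipeline moving, but every repair is also a signal that the prompt or schema needs tightening — log them, don't just absorb them.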