702 posts tagged with "llm"

Graph Reasoning Gaps in LLMs: Scaffolding Relational Tasks That Fool Sequence-Trained Models

· 9 min read
Tian Pan
Software Engineer

A common mistake in AI system design is asking a language model to reason over a graph as if it were reading a document. The model will generate a confident, fluent answer. The answer will be wrong in a way that looks right — it will name real nodes, reference plausible paths, and describe relationships that almost exist. Then you discover your org-chart traversal hallucinates skip-level managers, your dependency resolution misses cycles in graphs over ten nodes, and your three-hop knowledge graph query has a 60% error rate at step two.

This is not a prompt quality problem. It is an architecture problem, and you can diagnose it before writing a single prompt.
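The excerpt does not say which scaffolding the full post recommends. One common option, assumed here, is to keep the traversal out of the model entirely and run it as ordinary code, letting the LLM only phrase the result. A minimal sketch of cycle detection on a dependency graph follows; the function name `find_cycle` and the adjacency-dict representation are hypothetical choices for illustration.

```python
from typing import Dict, List, Optional


def find_cycle(graph: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one cycle as a list of nodes (first node repeated at the end), or None.

    `graph` maps each node to the nodes it depends on (hypothetical format).
    """
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    color = {node: WHITE for node in graph}
    # Nodes that only appear as dependencies still need a color entry.
    for deps in graph.values():
        for dep in deps:
            color.setdefault(dep, WHITE)

    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        color[node] = GRAY
        path.append(node)
        for nxt in graph.get(node, []):
            if color[nxt] == GRAY:
                # Back edge: nxt is already on the current path, so we found a cycle.
                return path[path.index(nxt):] + [nxt]
            if color[nxt] == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in list(color):
        if color[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None
```

Calling `find_cycle({"a": ["b"], "b": ["c"], "c": ["a"]})` returns `["a", "b", "c", "a"]`, an answer the model can then describe rather than derive. The recursive form is fine for small graphs; very deep graphs would need an iterative version.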

The Inference Fleet: Applying SRE Discipline to Multi-Provider LLM Dependencies

· 11 min read
Tian Pan
Software Engineer

Here is a failure mode that does not show up on any dashboard until it is too late: your production system is silently degrading because a secondary LLM provider started returning malformed responses three days ago, nobody owns that provider in your on-call rotation, and the only signal is a slow uptick in user-reported errors that your support team has not yet escalated. You find out when a customer cancels.

This is not a model quality problem. It is an operational discipline problem. And it is becoming more common as production AI stacks grow from a single OpenAI integration into a multi-provider, multi-endpoint sprawl that nobody designed as a fleet — but that is what it has become.

Knowledge Age Routing: Matching Queries to the Right Temporal Layer in Production AI

· 9 min read
Tian Pan
Software Engineer

Here is a scenario that surfaces in production more often than anyone likes to admit. A user asks your AI assistant what the current interest rate policy is. Your RAG system fetches a highly relevant Federal Reserve document (semantic similarity score: 0.91), and the model confidently returns an answer. The answer is six months out of date. The RAG index was last refreshed in October. The parametric knowledge is older still. A live API call would have returned the correct current figure in 400 milliseconds, but nobody wired up the routing logic to ask: how old is this question's answer allowed to be?

That failure is not a retrieval failure. It is a temporal routing failure. The system had access to correct information somewhere in its stack. It just sent the query to the wrong layer.
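The routing question above ("how old is this answer allowed to be?") can be sketched as a small function. This is an illustration under stated assumptions, not the post's implementation: the `Layer` type, the `cost_rank` field, the layer names, and all dates are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Layer:
    name: str
    last_updated: datetime  # when this layer's knowledge was last refreshed
    cost_rank: int          # lower means cheaper/faster (hypothetical ordering)


def route(layers: List[Layer], max_age: timedelta, now: datetime) -> str:
    """Pick the cheapest layer whose knowledge is fresh enough for the query."""
    fresh = [layer for layer in layers if now - layer.last_updated <= max_age]
    if not fresh:
        raise LookupError("no layer satisfies the freshness requirement")
    return min(fresh, key=lambda layer: layer.cost_rank).name


# Illustrative setup: dates are made up for the example.
now = datetime(2025, 4, 1)
layers = [
    Layer("parametric", datetime(2023, 6, 1), cost_rank=0),
    Layer("rag_index", datetime(2024, 10, 15), cost_rank=1),
    Layer("live_api", now, cost_rank=2),  # a live call is treated as current
]

# A "current interest rate" query tolerates almost no staleness.
print(route(layers, timedelta(days=1), now))
# A background question can tolerate a year-old answer.
print(route(layers, timedelta(days=365), now))
```

The first call selects `live_api` and the second selects `rag_index`. How `max_age` is derived from a query (a classifier, rules, or metadata) is left open here.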

The Confident Hallucinator: Runtime Patterns for Knowledge Boundary Signaling in LLMs

· 10 min read
Tian Pan
Software Engineer

GPT-4 achieves an AUROC of roughly 0.62 when its own confidence scores are used to separate correct answers from incorrect ones. That's barely above the 0.5 baseline of flipping a coin. The model sounds certain and polished in both cases. If you're building a production system that assumes high-confidence responses are reliable, you're working with a signal that's nearly random.

This is the knowledge boundary signaling problem, and it sits at the center of most real-world LLM quality failures. The model doesn't know what it doesn't know — or more precisely, it knows internally but can't be trusted to express it. The engineering challenge isn't getting models to refuse more; it's designing systems that make uncertainty actionable without making your product feel broken.
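To make the AUROC figure concrete: AUROC is the probability that a randomly chosen correct answer received a higher confidence score than a randomly chosen incorrect one, with ties counted as half. A minimal pure-Python sketch is below; the function name and the sample data are illustrative, not taken from the post.

```python
from typing import Sequence


def auroc(confidences: Sequence[float], is_correct: Sequence[bool]) -> float:
    """Pairwise AUROC: P(confidence of a correct answer > confidence of an incorrect one)."""
    pos = [c for c, ok in zip(confidences, is_correct) if ok]
    neg = [c for c, ok in zip(confidences, is_correct) if not ok]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    # Each (correct, incorrect) pair scores 1 for a win, 0.5 for a tie, 0 for a loss.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Made-up example: confidence only partly tracks correctness.
score = auroc([0.9, 0.8, 0.7, 0.6], [True, False, True, False])
print(score)  # 0.75
```

A value of 0.5 means confidence carries no information about correctness; 1.0 means every correct answer outranks every incorrect one. This quadratic version is meant for reading, not for large evaluation sets.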

Why Your AI Sounds Wrong Even When It's Technically Correct

· 9 min read
Tian Pan
Software Engineer

A logistics chatbot received a message from a customer whose shipment had been lost for a week. The reply came back: "I'm not trained to care about that." Factually accurate. The system had correctly parsed the query, correctly identified that it lacked routing to address the issue, and correctly communicated its limitation. The answer was technically correct in every measurable sense. It was also a product disaster.

This is the register problem — and it's the failure mode your evals almost certainly aren't measuring.

LLM-as-Classifier in Production: Why Accuracy Is the Wrong Metric

· 11 min read
Tian Pan
Software Engineer

A team ships an LLM-based intent classifier. Evaluation accuracy: 94%. Two weeks into production, support volume is up 30% — not because the model is failing to classify, but because it's routing edge cases to the wrong queue with very high confidence. Nobody built a circuit breaker for "the model is wrong and doesn't know it." The 94% figure never surfaced that risk.

This failure pattern repeats across content moderation pipelines, routing systems, and entity extractors. The LLM gets a high score on the holdout set. The team ships. Something breaks quietly in production.

The issue isn't that accuracy is a bad metric. It's that accuracy answers the wrong question. Production classification has a different set of requirements, and most evaluation pipelines don't test for them.
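One way to see what a single accuracy number hides is to measure errors only among high-confidence predictions, since those are the ones a router acts on without review. The sketch below assumes each record is a `(predicted, actual, confidence)` tuple and uses an arbitrary 0.9 threshold; both are hypothetical choices for illustration.

```python
from typing import List, Tuple

Record = Tuple[str, str, float]  # (predicted label, true label, confidence)


def accuracy(records: List[Record]) -> float:
    """Plain accuracy over all records."""
    return sum(1 for pred, label, _ in records if pred == label) / len(records)


def high_confidence_error_rate(records: List[Record], threshold: float = 0.9) -> float:
    """Fraction of predictions at or above `threshold` confidence that are wrong."""
    confident = [r for r in records if r[2] >= threshold]
    if not confident:
        return 0.0
    wrong = sum(1 for pred, label, _ in confident if pred != label)
    return wrong / len(confident)


# Made-up evaluation records.
records = [
    ("billing", "billing", 0.95),
    ("billing", "refund", 0.97),  # wrong and very confident
    ("refund", "refund", 0.60),
    ("refund", "refund", 0.92),
]
print(accuracy(records))                    # 0.75
print(high_confidence_error_rate(records))  # one wrong out of three confident predictions
```

Tracking this rate alongside accuracy gives a threshold for a circuit breaker: if confident errors climb, route those cases to a fallback queue instead of trusting the label.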

The Output Coupling Trap: Why Multi-Agent Systems Fail Silently at Interface Boundaries

· 9 min read
Tian Pan
Software Engineer

Your multi-agent pipeline finished. No exceptions were raised. The orchestrator reported success. And yet, the answer is wrong in a way that makes no sense — the executor skipped two steps, the summarizer collapsed three sections into one non-sequitur, and the output looks like it came from a different task entirely. There's no stack trace to follow. No error code to search. Just a quietly incorrect result.

This is the output coupling trap. It's not a model quality problem. It's an interface engineering problem, and it's the leading cause of silent production failures in multi-agent systems.

The Consistency Gap: Why Parallel LLM Calls Contradict Each Other and How to Fix It

· 10 min read
Tian Pan
Software Engineer

Imagine a multi-agent pipeline that processes a user's support ticket. Agent A reads the ticket history and decides the user is a power user who needs an advanced response. Agent B reads the same ticket history in a parallel call and decides the user is a beginner who needs step-by-step guidance. Both agents finish at the same time and hand their outputs to a composer agent—which now has to reconcile two fundamentally incompatible mental models of the same person.

This isn't a rare edge case. Research analyzing production multi-agent failures found that 36.9% of failures are caused by inter-agent misalignment: conflicting outputs, context loss during handoffs, and incompatible conclusions reached independently. The consistency gap—the tendency for parallel LLM calls to contradict each other about shared entities—is one of the most underappreciated failure modes in agentic systems.

The Words You Choose in Your System Prompt Change What Your Agent Will Risk

· 8 min read
Tian Pan
Software Engineer

Here is something that shouldn't be surprising but is: when you tell an agent "avoid making mistakes" versus "prioritize accuracy," you are not giving it the same instruction. The observable behavior at ambiguous decision points diverges measurably — agents prompted with loss-avoidance framing hedge more, escalate more, and complete fewer tasks end-to-end. Agents prompted with gain-seeking framing complete more tasks but introduce more errors. The difference isn't philosophical; it shows up in eval logs.

This is the behavioral economics of agents, and most engineering teams haven't thought about it systematically. They write system prompts as documentation — a description of what the agent is — when system prompts are actually decision-shaping instruments that encode a risk posture whether the author intended that or not.

The Provider Behavioral Fingerprint: What Doesn't Survive a Model Switch

· 8 min read
Tian Pan
Software Engineer

When a cost spike, a model deprecation notice, or a competitor's benchmark forces a provider swap, engineering teams typically evaluate the candidate on capability benchmarks and call it a migration plan. That process catches about half the problems. The other half aren't capability problems; they're behavioral ones: the invisible layer of formatting habits, refusal patterns, serialization quirks, and output conventions your production code has silently wired itself to over months of iteration.

The capability benchmark tells you whether the new model can do the task. The behavioral fingerprint tells you whether your codebase can survive the replacement.

The Rollout Sequencing Problem: Why Co-Deploying Model and Infrastructure Changes Destroys Observability

· 9 min read
Tian Pan
Software Engineer

Three weeks into your quarter, a production alert fires. Accuracy on a core task dropped eight percentage points. You open the dashboard and immediately notice three things that all landed in the same deploy window: a context length increase from 8k to 32k tokens, a model version upgrade from gpt-4-turbo-preview to gpt-4o, and a batch size change your infrastructure team pushed to improve throughput. None of the three changes individually was considered high-risk. Combined, they've created a debugging problem no one can solve cleanly.

Welcome to the rollout sequencing problem.

The Shadow Compute Tax: Why Your AI Inference Bill Is Bigger Than Your Users Deserve

· 9 min read
Tian Pan
Software Engineer

You're being charged for tokens that no user ever read. Not because of bugs, not because of vendor pricing tricks — but because your system is working exactly as designed, firing off background inference work that looked smart on a whiteboard but burns real budget on every request.

This is the shadow compute tax: the fraction of your inference spend that goes toward AI work that is speculative, premature, or structurally guaranteed never to reach a user. It's invisible in your dashboards until suddenly it isn't, and by then it's baked into your cost model as an assumption.