Skip to main content

702 posts tagged with "llm"

View all tags

When AI Sounds Right but Isn't: LLM Confabulation in Technical and Scientific Domains

· 9 min read
Tian Pan
Software Engineer

The insidious thing about LLM confabulation in technical domains isn't that the model produces obviously wrong answers. It's that the model produces beautifully structured, confidently stated, technically plausible answers that are subtly wrong in ways that only domain experts catch — and often only after the damage is done.

A Monte Carlo physics simulation that initializes correctly but resamples particle positions from scratch at each step rather than making incremental updates. A chemical formula that follows the right naming conventions but has an incorrect oxidation state. An engineering specification that cites the right standard, references the right units, and has exactly the wrong load coefficient. Each output looks right. Each sounds authoritative. Each is wrong in ways that won't surface until someone runs the experiment, stress-tests the component, or critically reads the derivation.

The A/B Testing Trap: Why Standard Experiment Design Fails for AI Features

· 8 min read
Tian Pan
Software Engineer

A team ships an improved LLM prompt. The A/B test runs for two weeks. The metric ticks up 1.2%, p=0.03. They call it a win and roll it out to everyone. Six months later, a customer audit reveals the new prompt had been producing subtly incorrect summaries all along — the kind of semantic drift that click-through rates and session lengths can't see. The A/B test didn't lie exactly. It measured the wrong thing with a methodology that was never designed for what LLMs do.

Standard A/B testing was built for deterministic systems: a button changes color, a page loads faster, a recommendation algorithm shifts a ranking. The output is stable given the same input, variance is small and well-understood, and your sample size calculation from a textbook works. None of those properties hold for LLM-powered features. When teams don't account for this, they're not running experiments — they're generating noise with statistical significance attached.

Why AI Engineering Training Programs Are Perpetually Behind the Models

· 9 min read
Tian Pan
Software Engineer

In early 2023, a flood of corporate AI training programs launched with the same selling point: we will teach your engineers prompt engineering. By the time most of them finished their first cohort, the specific techniques they were teaching had already been automated away by the models themselves. By 2025, the role of "prompt engineer" — briefly advertised at $200,000 salaries — was effectively obsolete. The training programs are still running.

This is the AI curriculum trap. It is not a problem of effort or budget. Organizations invest heavily in structured AI training, certification programs, and hiring rubrics built around tool proficiency. But the tools change faster than any curriculum can track, and the result is a permanent, structural lag: training programs are always teaching the AI engineering of 18 months ago.

AI-Native Logging: Capture Decisions, Not Just I/O

· 10 min read
Tian Pan
Software Engineer

A customer support agent was generating hallucinated troubleshooting steps for 12% of tickets. The HTTP logs showed 200 OK across the board. Latency was normal. Error rates were flat. The system looked healthy by every conventional metric — and it was quietly fabricating answers at scale.

When engineers finally instrumented the decision layer, the root cause emerged in minutes: similarity scores for retrieved chunks were all below 0.4, confidence in the context was 0.28, and yet the model's stated output confidence read 0.91. A massive mismatch — invisible in traditional logs, obvious in a trace that captured the decision state.

This is the fundamental problem with applying conventional logging to LLM systems. I/O logs tell you your system ran. AI-native logging tells you whether it reasoned correctly.

The AI Onboarding Gap: Why Engineers Can't Learn What They Can't Test

· 11 min read
Tian Pan
Software Engineer

A new engineer joins an AI-heavy team. On their third day, they see a prompt with an awkward double negation in the system instructions. It looks like a bug. They clean it up — the kind of small polish any reasonable person would do. Two hours later, customer-facing classification accuracy on a critical pipeline drops from 91% to 74%. Nobody has any idea why.

This scenario plays out in some form at almost every team building on LLMs. The new engineer isn't careless. The prompt did look wrong. But that double negation was load-bearing in a way that only the person who wrote it — after weeks of experimentation — actually understood. And they never wrote that understanding down.

This is the AI onboarding gap: the chasm between what an AI codebase appears to do and what it actually does, and why that gap is invisible until someone falls into it.

AI Pipeline Exception Handling: Hallucinations, Refusals, and Format Violations Are First-Class Errors

· 10 min read
Tian Pan
Software Engineer

Your AI pipeline reported zero errors last night. The output was completely wrong.

That's not a hypothetical. A recent industry report found that roughly 1 in 20 production LLM requests fail in ways that never surface as exceptions — valid HTTP 200, well-formed JSON, fluent prose, factually wrong. The observability stack stays green while the pipeline quietly lies to its users.

The root cause is an architectural assumption borrowed from traditional service engineering: that HTTP status codes and parse errors cover the failure space. They don't. LLM pipelines have at least four failure types that the underlying infrastructure cannot see — hallucinations, refusals, format violations, and context overflow — and treating them as edge cases instead of first-class error types is how production AI systems ship invisible bugs at scale.

The Compound Hallucination Problem: How Multi-Stage AI Pipelines Amplify Errors

· 10 min read
Tian Pan
Software Engineer

Most hallucination research focuses on what comes out of a single model call. That framing misses the scarier problem: what happens in a four-stage pipeline where each stage unconditionally trusts the previous output. A single hallucinated fact in Stage 1 doesn't just persist—it becomes the load-bearing premise for every subsequent inference. By Stage 4, the pipeline delivers a confident, internally coherent answer that happens to be entirely wrong.

This isn't a capability problem that better models will solve. It's a systems architecture problem, and it requires a systems-level fix.

Context Compression Artifacts: What Your Summarization Middleware Is Silently Losing

· 10 min read
Tian Pan
Software Engineer

Your agent said "Do NOT use eval()" at turn three. By turn thirty, it called eval(). Your insurance processor said "Never approve claims without valid ID." After fifteen compression cycles, it approved one. These aren't model failures — they're compression failures. The agent's reasoning was fine. The summarization middleware threw away the one constraint that mattered.

Context compression is now standard infrastructure in long-running agent systems. When conversation history grows too large for the context window, you compress it — roll up older turns into a summary, trim, chunk, or distill. The problem is that modern summarizers don't destroy information randomly. They destroy it predictably, along specific fault lines, and most teams only discover those fault lines in production.

The Context Length Arms Race: Why Filling the Window Is the Wrong Goal

· 7 min read
Tian Pan
Software Engineer

Every six months, a model ships with a bigger context window. GPT-4.1 hit 1 million tokens. Gemini 2.5 followed at 2 million. Llama 4 is now advertising 10 million. The implicit promise is: dump everything in, stop worrying about what to include, let the model figure it out.

That promise does not hold up in production. A 2024 study evaluating 18 leading LLMs found that every single model showed performance degradation as input length increased. Not some models — every model. The context window is a ceiling, not a floor, and the teams that treat it as a floor are discovering that the hard way.

The Context Limit Is a UX Problem: Why Silent Truncation Erodes User Trust

· 8 min read
Tian Pan
Software Engineer

A user spends an hour in a long coding session with an AI assistant. They've established conventions, shared codebase context, described a multi-file refactor in detail. Then, about 40 messages in, the AI starts giving advice that ignores everything it "knows." It recommends an approach they already rejected twenty minutes ago. When pressed, it seems confused.

No error was shown. No warning appeared. The model just quietly dropped earlier messages to make room for newer ones — and the user concluded the AI was unreliable.

This is not a model failure. It is a product design failure.

The Context Window Is an API Surface: Treat Your Prompt Structure as a Contract

· 9 min read
Tian Pan
Software Engineer

Six months into a production LLM feature, an engineer files a bug: the model started giving incorrect output sometime last quarter. Nobody remembers changing the prompt. The git blame shows it was "cleaned up for readability." The previous version is gone. Debugging begins from scratch.

This is the moment teams discover that their context window was never really engineered — it was just assembled.

The context window is the contract between your system and the model. Every token that enters it — system instructions, retrieved documents, conversation history, tool schemas, the user query — is input to a function call that costs money, takes time, and produces non-deterministic output. Yet most teams treat context composition as an implementation detail rather than an API surface. Prompts get edited in place, without versioning. Sections grow by accumulation. Nobody owns the layout. Changes propagate silently. The debugging experience is worse than anything from the pre-LLM era, because at least stack traces tell you what changed.

Conversation-Aware Rate Limiting: Why Per-Request Throttling Breaks Multi-Turn AI

· 10 min read
Tian Pan
Software Engineer

Your AI feature works in testing. Single-turn Q&A, perfect. Run it in production with a real user sitting in a 10-turn debugging session and it fails — not because the model broke, but because your rate limiter was designed for a completely different world.

The standard API rate limit is a blunt instrument built for stateless REST calls. Each request is treated as an independent, roughly equal unit of consumption. That model works fine for CRUD endpoints where every call is indeed comparable. It falls apart for multi-turn conversations, where each successive turn gets more expensive, a single user interaction can trigger dozens of internal model calls, and a mid-session cutoff is far more damaging than a failed single-shot query ever was.