
639 posts tagged with "llm"


AI-Native Logging: Capture Decisions, Not Just I/O

· 10 min read
Tian Pan
Software Engineer

A customer support agent was generating hallucinated troubleshooting steps for 12% of tickets. The HTTP logs showed 200 OK across the board. Latency was normal. Error rates were flat. The system looked healthy by every conventional metric — and it was quietly fabricating answers at scale.

When engineers finally instrumented the decision layer, the root cause emerged in minutes: similarity scores for retrieved chunks were all below 0.4, confidence in the context was 0.28, and yet the model's stated output confidence read 0.91. A massive mismatch — invisible in traditional logs, obvious in a trace that captured the decision state.

This is the fundamental problem with applying conventional logging to LLM systems. I/O logs tell you your system ran. AI-native logging tells you whether it reasoned correctly.
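A minimal sketch of what capturing the decision state could look like, using the numbers from the incident above. The `DecisionTrace` fields, the 0.4 similarity floor, and the 0.3 gap threshold are illustrative assumptions, not a schema from the post.

```python
import json
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class DecisionTrace:
    """Decision state for one response. Field names are hypothetical."""
    retrieval_scores: List[float]  # similarity score per retrieved chunk
    context_confidence: float      # how well the context supports an answer
    output_confidence: float       # confidence the model states for its answer


def confidence_gap(trace: DecisionTrace) -> float:
    """How far stated output confidence exceeds confidence in the context."""
    return trace.output_confidence - trace.context_confidence


def is_suspect(trace: DecisionTrace, min_similarity: float = 0.4,
               max_gap: float = 0.3) -> bool:
    """Flag responses that sound confident despite weak retrieval.

    Both thresholds are assumed values. An empty score list counts as weak.
    """
    weak_retrieval = all(s < min_similarity for s in trace.retrieval_scores)
    return weak_retrieval and confidence_gap(trace) > max_gap


def log_line(trace: DecisionTrace) -> str:
    """Serialize the decision state as one structured log record."""
    record = asdict(trace)
    record["confidence_gap"] = round(confidence_gap(trace), 2)
    record["suspect"] = is_suspect(trace)
    return json.dumps(record)
```

A trace with scores of `[0.31, 0.38, 0.22]`, context confidence 0.28, and output confidence 0.91 would be flagged as suspect, even though the HTTP layer reports success.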

The AI Onboarding Gap: Why Engineers Can't Learn What They Can't Test

· 11 min read
Tian Pan
Software Engineer

A new engineer joins an AI-heavy team. On their third day, they see a prompt with an awkward double negation in the system instructions. It looks like a bug. They clean it up — the kind of small polish any reasonable person would do. Two hours later, customer-facing classification accuracy on a critical pipeline drops from 91% to 74%. Nobody has any idea why.

This scenario plays out in some form at almost every team building on LLMs. The new engineer isn't careless. The prompt did look wrong. But that double negation was load-bearing in a way that only the person who wrote it — after weeks of experimentation — actually understood. And they never wrote that understanding down.

This is the AI onboarding gap: the chasm between what an AI codebase appears to do and what it actually does, and why that gap is invisible until someone falls into it.

AI Pipeline Exception Handling: Hallucinations, Refusals, and Format Violations Are First-Class Errors

· 10 min read
Tian Pan
Software Engineer

Your AI pipeline reported zero errors last night. The output was completely wrong.

That's not a hypothetical. A recent industry report found that roughly 1 in 20 production LLM requests fail in ways that never surface as exceptions — valid HTTP 200, well-formed JSON, fluent prose, factually wrong. The observability stack stays green while the pipeline quietly lies to its users.

The root cause is an architectural assumption borrowed from traditional service engineering: that HTTP status codes and parse errors cover the failure space. They don't. LLM pipelines have at least four failure types that the underlying infrastructure cannot see — hallucinations, refusals, format violations, and context overflow — and treating them as edge cases instead of first-class error types is how production AI systems ship invisible bugs at scale.
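One way to make the four failure types first-class is to give each its own exception class and raise it from a validation step that runs after every model call. The sketch below is an assumption-heavy illustration: the class names, the `check_output` helper, the refusal markers, and the `cited_ids` field are all hypothetical, and the citation check is only a crude proxy for hallucination detection.

```python
import json


class LLMOutputError(Exception):
    """Base class for failures that still arrive as HTTP 200."""


class ContextOverflowError(LLMOutputError):
    pass


class RefusalError(LLMOutputError):
    pass


class FormatViolationError(LLMOutputError):
    pass


class HallucinationError(LLMOutputError):
    pass


# Illustrative markers only; real refusal detection needs more than substrings.
REFUSAL_MARKERS = ("i can't help with", "i cannot assist")


def check_output(raw, *, prompt_tokens, context_limit, required_keys, retrieved_ids):
    """Return the parsed output, or raise a typed error for each failure type."""
    if prompt_tokens > context_limit:
        raise ContextOverflowError(
            f"{prompt_tokens} tokens exceeds limit {context_limit}")
    if any(marker in raw.lower() for marker in REFUSAL_MARKERS):
        raise RefusalError("model declined to answer")
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise FormatViolationError("output is not valid JSON") from exc
    if not isinstance(data, dict) or not required_keys <= data.keys():
        raise FormatViolationError("output is missing required keys")
    # Crude grounding check: every cited ID must come from the retrieved set.
    unknown = set(data.get("cited_ids", [])) - retrieved_ids
    if unknown:
        raise HallucinationError(f"cites unknown sources: {sorted(unknown)}")
    return data
```

Because each failure has a type, it can be counted, alerted on, and retried with its own policy instead of disappearing behind a 200.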

The Compound Hallucination Problem: How Multi-Stage AI Pipelines Amplify Errors

· 10 min read
Tian Pan
Software Engineer

Most hallucination research focuses on what comes out of a single model call. That framing misses the scarier problem: what happens in a four-stage pipeline where each stage unconditionally trusts the previous output. A single hallucinated fact in Stage 1 doesn't just persist; it becomes the load-bearing premise for every subsequent inference. By Stage 4, the pipeline delivers a confident, internally coherent answer that happens to be entirely wrong.

This isn't a capability problem that better models will solve. It's a systems architecture problem, and it requires a systems-level fix.
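A back-of-the-envelope sketch of why this is a systems problem. It assumes, purely for illustration, that each stage is independently correct with a fixed probability and that no downstream stage catches an upstream error. Neither the independence assumption nor the 95% figure comes from the post.

```python
import math


def end_to_end_accuracy(stage_accuracies):
    """Probability that no stage introduces an error, assuming independence
    and no downstream correction (both simplifying assumptions)."""
    return math.prod(stage_accuracies)


# Four stages, each assumed to be right 95% of the time.
four_stage = end_to_end_accuracy([0.95] * 4)
```

Under these assumed numbers, four stages at 95% each are right together only about 81% of the time, which is why validation between stages matters more than marginal gains in any single stage.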

Context Compression Artifacts: What Your Summarization Middleware Is Silently Losing

· 10 min read
Tian Pan
Software Engineer

Your agent said "Do NOT use eval()" at turn three. By turn thirty, it called eval(). Your insurance processor said "Never approve claims without valid ID." After fifteen compression cycles, it approved one. These aren't model failures — they're compression failures. The agent's reasoning was fine. The summarization middleware threw away the one constraint that mattered.

Context compression is now standard infrastructure in long-running agent systems. When conversation history grows too large for the context window, you compress it — roll up older turns into a summary, trim, chunk, or distill. The problem is that modern summarizers don't destroy information randomly. They destroy it predictably, along specific fault lines, and most teams only discover those fault lines in production.
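One possible mitigation, sketched under stated assumptions: pull constraint-like sentences out of the turns before summarizing them, and carry them in a separate pinned list that the summarizer never touches. The regex, the function names, and the `summarize` callback are hypothetical, and a keyword pattern will miss constraints phrased in other ways.

```python
import re

# Illustrative heuristic: sentences with prohibitive or mandatory wording.
CONSTRAINT_PATTERN = re.compile(r"\b(do not|don't|never|must not|must)\b",
                                re.IGNORECASE)


def extract_constraints(turns):
    """Return sentences that look like hard constraints, in first-seen order."""
    found = []
    for turn in turns:
        for sentence in re.split(r"(?<=[.!?])\s+", turn):
            sentence = sentence.strip()
            if sentence and CONSTRAINT_PATTERN.search(sentence) and sentence not in found:
                found.append(sentence)
    return found


def compress(turns, pinned, summarize, keep_recent=2):
    """Summarize older turns but keep their constraints verbatim.

    Returns (pinned, turns). `summarize` is any callable that maps a list
    of turns to a summary string.
    """
    split = max(len(turns) - keep_recent, 0)
    older, recent = turns[:split], turns[split:]
    if not older:
        return list(pinned), list(turns)
    merged = list(pinned)
    for constraint in extract_constraints(older):
        if constraint not in merged:
            merged.append(constraint)
    return merged, ["[summary] " + summarize(older)] + recent
```

The pinned list would be rendered into the prompt ahead of the summary on every call, so a constraint stated at turn three survives any number of compression cycles.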

The Context Length Arms Race: Why Filling the Window Is the Wrong Goal

· 7 min read
Tian Pan
Software Engineer

Every six months, a model ships with a bigger context window. GPT-4.1 hit 1 million tokens. Gemini 2.5 followed at 2 million. Llama 4 is now advertising 10 million. The implicit promise is: dump everything in, stop worrying about what to include, let the model figure it out.

That promise does not hold up in production. A 2024 study evaluating 18 leading LLMs found that every single model showed performance degradation as input length increased. Not some models: every model. The context window is a ceiling, not a target, and the teams that treat it as a target are discovering that the hard way.

The Context Limit Is a UX Problem: Why Silent Truncation Erodes User Trust

· 8 min read
Tian Pan
Software Engineer

A user spends an hour in a long coding session with an AI assistant. They've established conventions, shared codebase context, described a multi-file refactor in detail. Then, about 40 messages in, the AI starts giving advice that ignores everything it "knows." It recommends an approach they already rejected twenty minutes ago. When pressed, it seems confused.

No error was shown. No warning appeared. The model just quietly dropped earlier messages to make room for newer ones — and the user concluded the AI was unreliable.

This is not a model failure. It is a product design failure.
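A minimal sketch of the alternative: when older messages have to be dropped, return that fact alongside the trimmed history so the product can show it. The `FitResult` type, the notice wording, and the whitespace token counter are assumptions for illustration; a real system would use the model's tokenizer.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FitResult:
    kept: List[str]          # messages that still fit, oldest first
    dropped_count: int       # how many earlier messages were cut
    notice: Optional[str]    # user-facing text, or None if nothing was cut


def fit_to_budget(messages, budget_tokens,
                  count_tokens=lambda m: len(m.split())):
    """Keep the most recent messages that fit and report what was dropped."""
    kept = []
    used = 0
    for message in reversed(messages):
        cost = count_tokens(message)
        if used + cost > budget_tokens:
            break
        kept.append(message)
        used += cost
    kept.reverse()
    dropped = len(messages) - len(kept)
    notice = None
    if dropped:
        notice = f"{dropped} earlier message(s) are no longer in context."
    return FitResult(kept, dropped, notice)
```

The truncation logic is the same as the silent version. The only difference is that `notice` gives the interface something to display, so the user learns why earlier context stopped applying.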

The Context Window Is an API Surface: Treat Your Prompt Structure as a Contract

· 9 min read
Tian Pan
Software Engineer

Six months into a production LLM feature, an engineer files a bug: the model started giving incorrect output sometime last quarter. Nobody remembers changing the prompt. The git blame shows it was "cleaned up for readability." The previous version is gone. Debugging begins from scratch.

This is the moment teams discover that their context window was never really engineered — it was just assembled.

The context window is the contract between your system and the model. Every token that enters it — system instructions, retrieved documents, conversation history, tool schemas, the user query — is input to a function call that costs money, takes time, and produces non-deterministic output. Yet most teams treat context composition as an implementation detail rather than an API surface. Prompts get edited in place, without versioning. Sections grow by accumulation. Nobody owns the layout. Changes propagate silently. The debugging experience is worse than anything from the pre-LLM era, because at least stack traces tell you what changed.
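As a rough sketch of treating the layout as a contract, the snippet below fingerprints each section of the context so that any change shows up as a diff in logs or CI. The section names follow the list above; `SECTION_ORDER`, `context_manifest`, and `changed_sections` are hypothetical names, not an existing tool.

```python
import hashlib

# Fixed layout of the context window; this order is an assumed convention.
SECTION_ORDER = ("system", "tools", "documents", "history", "query")


def context_manifest(sections):
    """Map each section name to a short content hash.

    Missing sections hash as empty strings; unknown names are rejected so
    the layout cannot grow by accident.
    """
    unknown = set(sections) - set(SECTION_ORDER)
    if unknown:
        raise ValueError(f"unknown sections: {sorted(unknown)}")
    return {
        name: hashlib.sha256(sections.get(name, "").encode("utf-8")).hexdigest()[:12]
        for name in SECTION_ORDER
    }


def changed_sections(old_manifest, new_manifest):
    """List the sections whose content differs between two manifests."""
    return [name for name in SECTION_ORDER
            if old_manifest[name] != new_manifest[name]]
```

Logging the manifest with every request would turn "nobody remembers changing the prompt" into a lookup: find the first request where the `system` hash differs.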

Conversation-Aware Rate Limiting: Why Per-Request Throttling Breaks Multi-Turn AI

· 10 min read
Tian Pan
Software Engineer

Your AI feature works in testing. Single-turn Q&A, perfect. Run it in production with a real user sitting in a 10-turn debugging session and it fails — not because the model broke, but because your rate limiter was designed for a completely different world.

The standard API rate limit is a blunt instrument built for stateless REST calls. Each request is treated as an independent, roughly equal unit of consumption. That model works fine for CRUD endpoints where every call is indeed comparable. It falls apart for multi-turn conversations, where each successive turn gets more expensive, a single user interaction can trigger dozens of internal model calls, and a mid-session cutoff is far more damaging than a failed single-shot query ever was.
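A simplified sketch of a limiter that budgets tokens rather than requests and gives conversations already in progress extra headroom, so the limit is more likely to block a new session than to cut off an existing one. The class, its parameters, and the grace policy are illustrative assumptions. Window resets and concurrency are deliberately left out.

```python
class ConversationLimiter:
    """Per-user token budget with headroom for sessions already underway."""

    def __init__(self, user_budget, grace_tokens):
        self.user_budget = user_budget    # tokens a user may spend per window
        self.grace_tokens = grace_tokens  # extra allowance for active sessions
        self.used = {}                    # user -> tokens spent so far
        self.active = set()               # (user, conversation_id) pairs seen

    def admit(self, user, conversation_id, estimated_tokens):
        """Charge the turn and return True, or return False without charging."""
        key = (user, conversation_id)
        limit = self.user_budget
        if key in self.active:
            limit += self.grace_tokens
        spent = self.used.get(user, 0)
        if spent + estimated_tokens > limit:
            return False
        self.used[user] = spent + estimated_tokens
        self.active.add(key)
        return True
```

With a budget of 1000 tokens and 500 of grace, a user who has spent 600 tokens in one conversation can still send a 700-token turn there, while a brand-new conversation is refused once the base budget is exhausted.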

Data-Sensitivity-Tier Model Routing: Governing Which Model Sees Which Data

· 11 min read
Tian Pan
Software Engineer

Your AI system routed a patient query to a self-hosted model at 9 AM. At 11 AM, that model's pod restarted during a deployment. The request queue backed up, the router detected a timeout, and it fell back to the cloud LLM you use for generic queries. The query completed successfully. No alerts fired. Your monitoring dashboard showed green. Somewhere in that exchange, protected health information traveled to a vendor with whom you have no Business Associate Agreement.

That's not a hypothetical. It's the default behavior of nearly every AI routing stack that wasn't explicitly designed to prevent it.
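The fix is a router that fails closed. In the sketch below, fallback only considers models cleared for the data's sensitivity tier, and raises when none is available rather than downgrading. The tier names, the `MODEL_CLEARANCE` table, and the model identifiers are hypothetical.

```python
from enum import IntEnum


class Tier(IntEnum):
    PUBLIC = 0
    INTERNAL = 1
    PHI = 2


# Highest data tier each model may see. Entries are illustrative.
MODEL_CLEARANCE = {
    "self-hosted": Tier.PHI,
    "cloud-generic": Tier.PUBLIC,
}


class NoEligibleModel(Exception):
    """Raised instead of falling back to a model that lacks clearance."""


def route(data_tier, preferred, healthy):
    """Return the first healthy model cleared for `data_tier`.

    Models missing from the clearance table are treated as not cleared.
    """
    for model in preferred:
        clearance = MODEL_CLEARANCE.get(model)
        if clearance is None or clearance < data_tier:
            continue  # never eligible for this data, healthy or not
        if model in healthy:
            return model
    raise NoEligibleModel(f"no healthy model cleared for {data_tier.name}")
```

In the scenario above, the pod restart would surface as a `NoEligibleModel` error on the patient query, which is loud and recoverable, instead of a successful request that sent PHI to an uncleared vendor.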

End-to-End Latency Is Not P99 of Your LLM Call: The Multipliers Nobody Measures in Agentic Systems

· 9 min read
Tian Pan
Software Engineer

Your LLM API call completes in 500ms at P99. Your users are waiting 12 seconds. Both numbers are accurate, and neither is lying to you — they're just measuring completely different things. The gap between them is where most agentic systems silently bleed performance, and most teams never instrument it.

The problem is structural: P99 LLM latency is a single-call metric applied to a multi-step execution model. A ReAct agent making five sequential tool calls, retrying a hallucinated function, assembling a growing context, and generating a 300-token reasoning chain is not one LLM call. It's a distributed workflow where the LLM is just one node, and every other node has its own latency tax.
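A small sketch of instrumenting the whole workflow rather than the single call: every step of a request is wrapped in a named span, and the LLM's share of the total is computed from the result. `LatencyTrace` and the span names are hypothetical, and the sketch assumes sequential, non-nested spans (nested spans would be double counted).

```python
import time
from contextlib import contextmanager


class LatencyTrace:
    """Collects wall-clock time per named step of one request."""

    def __init__(self, clock=time.perf_counter):
        self.clock = clock
        self.spans = []  # list of (name, seconds) in completion order

    @contextmanager
    def span(self, name):
        start = self.clock()
        try:
            yield
        finally:
            self.spans.append((name, self.clock() - start))

    def total(self):
        """Sum of all recorded spans, in seconds."""
        return sum(seconds for _, seconds in self.spans)

    def by_name(self):
        """Total seconds per span name, so repeated steps and retries add up."""
        totals = {}
        for name, seconds in self.spans:
            totals[name] = totals.get(name, 0.0) + seconds
        return totals
```

Wrapping each tool call, retry, and context-assembly step in `trace.span(...)` makes the gap visible: `by_name()` shows where the seconds went, and the LLM entry divided by `total()` shows how little of the user's wait the P99 number describes.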

The Eval Fatigue Cycle: Why AI Quality Measurement Collapses After Launch

· 9 min read
Tian Pan
Software Engineer

There's a predictable arc to how teams treat AI evaluation. Sprint zero: everyone agrees evals are critical. Launch week: the suite runs clean, the demo looks great. Week six: the CI job starts getting skipped. Week ten: someone raises the failure threshold to stop the alerts. Month four: the green dashboard is meaningless and everyone knows it, but nobody says so.

This is the eval fatigue cycle, and it's nearly universal. Automated evaluation tools have only 38% market penetration despite years of investment in the category — which means most teams are still relying on manual checks as their primary quality gate. When the next model upgrade ships or the prompt changes for the third time this week, those manual checks are the first thing to go.