861 posts tagged with "insider"

Building Trust Recovery Flows: What Happens After Your AI Makes a Visible Mistake

· 9 min read
Tian Pan
Software Engineer

When Google's AI Overview told users to add glue to pizza sauce and eat rocks for digestive health, it didn't just embarrass a product team — it exposed a systemic gap in how we think about AI reliability. The failure wasn't just that the model was wrong. The failure was that the model was confidently wrong, in a high-visibility context, with no recovery path for the users it misled.

Trust in AI systems doesn't erode gradually. Research suggests it collapses like a cliff: a single noticeable error can produce a disproportionately large, measurable drop in trust. Only 29% of developers say they trust AI tools, an 11-point drop from the previous year, even as adoption climbs to 84%. We're building systems that people use but don't trust. That gap matters when your product ships agentic features that act on behalf of users.

This post is about what engineers and product builders should do after the mistake happens — not just how to prevent it.

Chunking for Agents vs. RAG: Why One Strategy Breaks Both

· 9 min read
Tian Pan
Software Engineer

Most teams pick a chunk size, tune it for retrieval quality, and call it done. Then they build an agent on the same index and wonder why the agent fails in strange ways — it executes half a workflow, ignores conditional logic, or confidently acts on incomplete instructions. The chunk size that maximized your NDCG score is exactly what's making your agent unreliable.

RAG retrieval and agent execution are not the same problem. They have different goals, different failure modes, and fundamentally different definitions of what a "good chunk" looks like. When you optimize chunking for one, you systematically degrade the other. Most teams don't realize this until they've already built on the wrong foundation.

The Compound Hallucination Problem: How Multi-Stage AI Pipelines Amplify Errors

· 10 min read
Tian Pan
Software Engineer

Most hallucination research focuses on what comes out of a single model call. That framing misses the scarier problem: what happens in a four-stage pipeline where each stage unconditionally trusts the previous output. A single hallucinated fact in Stage 1 doesn't just persist—it becomes the load-bearing premise for every subsequent inference. By Stage 4, the pipeline delivers a confident, internally coherent answer that happens to be entirely wrong.

This isn't a capability problem that better models will solve. It's a systems architecture problem, and it requires a systems-level fix.
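
To get a feel for how quickly errors compound, here is a back-of-the-envelope sketch. It assumes each stage is correct independently with some fixed probability and that no later stage catches an earlier mistake. The 95% figure and the function name `pipeline_accuracy` are illustrative assumptions, not measurements from any real pipeline.

```python
from typing import Iterable


def pipeline_accuracy(stage_accuracies: Iterable[float]) -> float:
    """Probability that every stage is correct.

    Assumes stages fail independently and that each stage trusts the
    previous output unconditionally (no validation in between).
    """
    result = 1.0
    for p in stage_accuracies:
        result *= p
    return result


# Four stages, each assumed to be right 95% of the time (illustrative).
print(round(pipeline_accuracy([0.95] * 4), 3))  # 0.815
```

Under these assumptions, four stages that each look reliable in isolation produce a correct end-to-end answer only about 81% of the time. Real pipelines can do worse, because downstream stages tend to build on an upstream error rather than fail independently of it.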

The Context Length Arms Race: Why Filling the Window Is the Wrong Goal

· 7 min read
Tian Pan
Software Engineer

Every six months, a model ships with a bigger context window. GPT-4.1 hit 1 million tokens. Gemini 2.5 followed at 2 million. Llama 4 is now advertising 10 million. The implicit promise is: dump everything in, stop worrying about what to include, let the model figure it out.

That promise does not hold up in production. A 2024 study evaluating 18 leading LLMs found that every single model showed performance degradation as input length increased. Not some models — every model. The context window is a ceiling, not a floor, and the teams that treat it as a floor are discovering that the hard way.

The Context Limit Is a UX Problem: Why Silent Truncation Erodes User Trust

· 8 min read
Tian Pan
Software Engineer

A user spends an hour in a long coding session with an AI assistant. They've established conventions, shared codebase context, described a multi-file refactor in detail. Then, about 40 messages in, the AI starts giving advice that ignores everything it "knows." It recommends an approach they already rejected twenty minutes ago. When pressed, it seems confused.

No error was shown. No warning appeared. The model just quietly dropped earlier messages to make room for newer ones — and the user concluded the AI was unreliable.

This is not a model failure. It is a product design failure.
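
One way to make truncation visible is to have the trimming step report what it dropped, so the product can tell the user. The sketch below is a minimal illustration under stated assumptions: `fit_to_budget` and `truncation_notice` are hypothetical helpers, and counting whitespace-separated words is a crude stand-in for a real tokenizer.

```python
from typing import List, Tuple


def fit_to_budget(messages: List[str], budget: int) -> Tuple[List[str], int]:
    """Keep the newest messages that fit in the budget.

    Returns (kept_messages, dropped_count) so the caller can surface the
    truncation instead of hiding it.
    """
    kept: List[str] = []
    used = 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = len(message.split())  # assumption: word count approximates tokens
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept, len(messages) - len(kept)


def truncation_notice(dropped: int) -> str:
    """Human-readable warning; empty string when nothing was dropped."""
    if dropped == 0:
        return ""
    return f"{dropped} earlier message(s) are no longer in the assistant's context."


history = ["use tabs not spaces", "refactor auth module", "now fix the login test"]
kept, dropped = fit_to_budget(history, budget=8)
print(truncation_notice(dropped))
```

The design point is the second return value: the same trimming logic runs either way, but the interface now has something to show the user when earlier context falls out of the window.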

The Context Window Is an API Surface: Treat Your Prompt Structure as a Contract

· 9 min read
Tian Pan
Software Engineer

Six months into a production LLM feature, an engineer files a bug: the model started giving incorrect output sometime last quarter. Nobody remembers changing the prompt. The git blame shows it was "cleaned up for readability." The previous version is gone. Debugging begins from scratch.

This is the moment teams discover that their context window was never really engineered — it was just assembled.

The context window is the contract between your system and the model. Every token that enters it — system instructions, retrieved documents, conversation history, tool schemas, the user query — is input to a function call that costs money, takes time, and produces non-deterministic output. Yet most teams treat context composition as an implementation detail rather than an API surface. Prompts get edited in place, without versioning. Sections grow by accumulation. Nobody owns the layout. Changes propagate silently. The debugging experience is worse than anything from the pre-LLM era, because at least stack traces tell you what changed.
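
A lightweight way to start treating context layout as a versioned surface is to fingerprint it and log that fingerprint with every model call. The following is a sketch under assumptions: `context_fingerprint` is a hypothetical helper, and representing the context as an ordered list of (section name, text) pairs is one possible layout, not a standard.

```python
import hashlib
import json
from typing import List, Tuple


def context_fingerprint(sections: List[Tuple[str, str]]) -> str:
    """Short, stable hash of the ordered context sections.

    Order is part of the fingerprint on purpose: moving a section is a
    layout change even when no text was edited.
    """
    canonical = json.dumps(sections, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


v1 = [("system", "You are a billing assistant."), ("tools", "lookup_invoice")]
v2 = [("system", "You are a helpful billing assistant."), ("tools", "lookup_invoice")]

# Any edit to the text or the ordering yields a different fingerprint.
print(context_fingerprint(v1) != context_fingerprint(v2))  # True
```

If each logged response carries this fingerprint, "the output changed sometime last quarter" becomes a query: find the first response whose fingerprint differs, then diff the two versions.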

The Data Flywheel Assumption: When AI Features Compound and When They Just Accumulate Noise

· 9 min read
Tian Pan
Software Engineer

Every AI pitch deck includes a slide about the data flywheel. The story is appealing: users interact with your AI feature, that interaction generates data, the data trains a better model, the better model attracts more users, and the cycle repeats. Scale long enough and you have an insurmountable competitive moat.

The problem is that most teams shipping AI features don't have a flywheel. They have a log file. A very large, expensive-to-store log file that has never improved their model and never will—because the three preconditions for a real flywheel are missing and nobody has asked whether they're present.

Dead Letters for Agents: What to Do When No Agent Can Complete a Task

· 10 min read
Tian Pan
Software Engineer

A team building a multi-agent research tool discovered, on day eleven of a runaway job, that two of their agents had been cross-referencing each other's outputs in a loop the entire time. The bill: $47,000. No human had seen the results. No alarm had fired. The system simply kept running, confident it was making progress, because nothing in the architecture asked the question: what happens when a task genuinely cannot be completed?

Message queues solved this problem decades ago with the dead-letter queue (DLQ). A message that exceeds its delivery retry limit gets routed to a holding area where operators can inspect it, fix the root cause, and replay it when the system is ready. The pattern is simple, battle-tested, and almost entirely missing from production agent systems today.
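
Applied to agents, the pattern might look something like the sketch below. It is a minimal in-memory illustration, not a production design: `Task`, `DeadLetterQueue`, and the retry limit of 3 are assumptions made for the example, and a real system would persist the dead-letter store and alert an operator.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Task:
    task_id: str
    payload: str
    attempts: int = 0
    last_error: str = ""


@dataclass
class DeadLetterQueue:
    max_attempts: int = 3  # hypothetical retry limit
    dead: List[Task] = field(default_factory=list)

    def run(self, task: Task, handler: Callable[[Task], str]) -> Optional[str]:
        """Try the handler up to max_attempts times, then dead-letter the task."""
        while task.attempts < self.max_attempts:
            task.attempts += 1
            try:
                return handler(task)
            except Exception as exc:  # record the failure and retry
                task.last_error = str(exc)
        self.dead.append(task)  # park it for a human instead of looping forever
        return None

    def replay(self, handler: Callable[[Task], str]) -> Dict[str, str]:
        """Re-run dead-lettered tasks once the root cause is fixed."""
        pending, self.dead = self.dead, []
        results: Dict[str, str] = {}
        for task in pending:
            task.attempts = 0
            result = self.run(task, handler)
            if result is not None:
                results[task.task_id] = result
        return results


def broken_agent(task: Task) -> str:
    raise RuntimeError("no agent can complete this")


dlq = DeadLetterQueue()
dlq.run(Task("t1", "summarize report"), broken_agent)
print(len(dlq.dead), dlq.dead[0].attempts)  # 1 3
```

The important property is the bounded loop: after the retry limit the task stops consuming budget and lands somewhere an operator can inspect it, along with the last error it hit.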

Diffusion Models in Production: The Engineering Stack Nobody Discusses After the Demo

· 10 min read
Tian Pan
Software Engineer

Your image generation feature just went viral. 100,000 requests are coming in daily. The API provider's rate limit technically accommodates it. Latency crawls to 12 seconds at p95. Your NSFW classifier is flagging legitimate medical illustrations. A compliance audit surfaces that California's AI Transparency Act has required watermarking since September 2024. Support has 50 open tickets from users whose content was silently blocked. By the time you realize you need a real production stack, you've already burned two weeks in crisis mode.

This is the moment "just call the API" fails—not because the API is bad, but because the demo's success exposes every assumption you made about inference latency, content policy, moderation fairness, and regulatory compliance. The engineering work nobody shows you in tutorials lives here.

Epistemic Trust in Agent Chains: How Uncertainty Compounds Through Multi-Step Delegation

· 10 min read
Tian Pan
Software Engineer

Most teams building multi-agent systems spend a lot of time thinking about authorization trust: what is Agent B allowed to do, which tools can it call, what data can it access. That's an important problem. But there's a second trust problem that doesn't get nearly enough attention, and it's the one that actually kills production systems.

The problem is epistemic: when Agent A delegates a task to Agent B and gets back an answer, how much should A believe what B returned?

This isn't a question of whether B was authorized to answer. It's a question of whether B actually could.

Feature Interaction Failures in AI Systems: When Two Working Pieces Break Together

· 10 min read
Tian Pan
Software Engineer

Your streaming works. Your retry logic works. Your safety filter works. Your personalization works. Deploy them together, and something strange happens: a rate-limit error mid-stream leaves the user staring at a truncated response that the system records as a success. The retry mechanism fires, but the stream is already gone. The personalization layer serves a customized response that the safety filter would have blocked — except the filter saw a sanitized version of the prompt, not the one the personalization layer acted on.

Each feature passed every test you wrote. The system failed the user anyway.

This is the feature interaction failure, and it is the most underdiagnosed class of production bug in AI systems today.

The Federated AI Team: Why Centralizing AI Expertise Creates the Problems It Was Supposed to Solve

· 10 min read
Tian Pan
Software Engineer

The central AI team was supposed to be the answer. Hire the best ML engineers into a single group, standardize the tooling, establish governance, and let product teams consume AI capabilities without needing to understand them. It's a compelling architecture — clean on an org chart, defensible in a board presentation. In practice, it reliably produces a failure mode that looks exactly like the fragmentation it was created to eliminate.

The central AI team becomes a bottleneck. Product teams queue behind it. The AI it ships feels generic to every domain that needs something specific. The ML engineers who built the platform don't know the product metrics. The product engineers who need help can't debug AI behavior without filing a ticket. A 3-month pilot succeeds; a 9-month security review buries it.

In 2025, the share of companies that reported abandoning the majority of their AI initiatives was more than double the 2024 figure. Many of those failures happened at the transition from proof of concept to production, precisely where an overstretched, disconnected central team shows its seams.