722 posts tagged with "insider"

AI Output Volatility Is a Business Risk You're Probably Underpricing

· 9 min read
Tian Pan
Software Engineer

When companies talk about AI risk, the conversation usually gravitates toward the obvious failures: hallucinated facts, biased outputs, legal liability from generated content. What gets far less attention is a quieter structural problem: you've made commercial commitments — pricing tiers, SLAs, customer-facing accuracy claims — on top of a system whose outputs are inherently probabilistic. Every time the model generates a response, it's sampling from a distribution. The contract doesn't mention distributions.

This is a business risk that most teams discover late, when a customer complains that the same document review workflow gave completely different results on Monday and Friday. Or when a regulator asks for reproducibility guarantees that the system architecturally cannot provide.

Your AI Feature's Quiet Quitters: How to Detect Silent User Distrust

· 10 min read
Tian Pan
Software Engineer

The McDonald's drive-thru AI didn't fail because users complained. It failed because users stopped using the drive-thru. For three years the system logged healthy "acceptance rates" while viral videos showed customers pleading with it to remove 260 chicken nuggets from their order. When the partnership ended, the official reason was that the technology "wasn't yet ready." The real signal had been sitting in foot traffic data the whole time — unread, unmeasured, unreported.

This is the shape of most AI feature failures in production. Users don't disable your feature. They don't file tickets. They don't leave one-star reviews. They quietly route around it, and your dashboards keep showing green.

Training Your AI on Production Data Without Triggering a Legal Blocker

· 11 min read
Tian Pan
Software Engineer

Your AI feature launched. Users are engaging with it. The gap between what it does and what it should do is visible in every session replay, every thumbs-down, every request that returns a wrong answer. You have the signal. The question is whether you can legally act on it.

This is where teams hit the compliance wall. Not a theoretical wall, but a concrete one. In 2024 alone, European regulators issued over €1.2 billion in GDPR fines, with OpenAI, Meta, and LinkedIn among the companies fined. The common thread across most enforcement actions: using behavioral data in ways that weren't explicitly scoped at collection time, or collecting more than was necessary to operate the feature. The fact that your intent is model improvement rather than advertising doesn't move regulators the way engineers assume it does.

API Documentation Is Reliability Infrastructure: How Your Docs Determine Agent Success Rates

· 10 min read
Tian Pan
Software Engineer

Most engineering teams think of API documentation as a developer experience concern — something you improve to reduce support tickets and onboarding time. That framing made sense when your primary consumer was a human reading docs in a browser. It is no longer adequate.

When an AI agent calls your API via tool use, your documentation stops being a guide and becomes runtime behavior. A vague parameter description isn't a UX inconvenience — it is a direct instruction to the model that produces hallucinated values. A missing error code isn't a gap in your reference docs — it is an ambiguous signal that can send an agent into a retry loop with no exit condition. The documentation you wrote three years ago for a human audience is now being parsed by a stateless language model that will execute confidently regardless of whether it understood correctly.

Code-Specific RAG: Why General Retrieval Fails for Codebases

· 10 min read
Tian Pan
Software Engineer

Most teams building AI coding assistants reach for the same off-the-shelf RAG pipeline they use for document retrieval: chunk the source files by token count, embed the chunks, store them in a vector database, query by semantic similarity. The pipeline works well enough on prose. On code, it quietly fails — and the failures are hard to see in aggregate metrics, because the retrieved chunks look plausible right up until the model generates code with the wrong return type, calls a function with the wrong signature, or misses a dependency that only exists three hops down the call graph.

The problem isn't the embedding model or the vector database. It's the chunking strategy. Code is not prose. It has structural properties — dependency graphs, call chains, type signatures, scope hierarchies — that token-based chunking destroys before the retriever ever sees them. Fixing this requires rethinking how you decompose code before it ever reaches the embedding step.
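As a rough illustration of structure-aware decomposition, here is a minimal sketch that splits Python source into one chunk per top-level function or class using the standard-library `ast` module, and records which plain function names each chunk calls. The function name `chunk_by_definition`, the chunk fields, and the `EXAMPLE` source are all hypothetical, chosen for illustration. This is not the post's method, only one way the idea could look, and it handles Python only.

```python
import ast


def chunk_by_definition(source: str) -> list[dict]:
    """Return one chunk per top-level function or class in `source`."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Collect names of directly called functions (e.g. `load(x)`),
            # ignoring method calls such as `text.split()`.
            calls = sorted({
                n.func.id
                for n in ast.walk(node)
                if isinstance(n, ast.Call) and isinstance(n.func, ast.Name)
            })
            chunks.append({
                "name": node.name,
                "kind": type(node).__name__,
                "start_line": node.lineno,
                "end_line": node.end_lineno,
                "text": ast.get_source_segment(source, node),
                "calls": calls,
            })
    return chunks


# Hypothetical input used only to exercise the sketch.
EXAMPLE = '''
def load(path):
    return open(path).read()

def parse(path):
    text = load(path)
    return text.split()
'''

chunks = chunk_by_definition(EXAMPLE)
for chunk in chunks:
    print(chunk["name"], chunk["start_line"], chunk["end_line"], chunk["calls"])
```

Each chunk keeps a whole definition intact, and the `calls` list gives a retriever a starting point for following the call graph instead of relying on embedding similarity alone.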

The Cross-User Consistency Problem: When Your AI Gives Different Answers to the Same Question

· 9 min read
Tian Pan
Software Engineer

Two analysts at the same company both ask your AI assistant: "What was our Q3 churn rate?" One gets 4.2%. The other gets 4.8%. Neither is wrong — they just queried at different times, in different session contexts, against a retrieval index that ranked slightly different chunks. The AI answered both confidently, without hedging, without flagging the discrepancy. The analysts go into the same meeting with different numbers and your tool has just become a liability.

This is the cross-user consistency problem, and it's one of the most common reasons enterprise AI deployments quietly lose trust. The failure isn't a hallucination in the classic sense — no facts were invented. The failure is that your system is non-deterministic at scale, and that non-determinism is invisible until two users compare notes.

The Dev-to-Prod Cost Shock: Why Your AI Feature Costs Pennies in Staging and Dollars in Production

· 8 min read
Tian Pan
Software Engineer

A proof-of-concept costs you $200 in API tokens. You get the green light to ship. Six weeks later, the invoice is $18,000. This is not a pricing change or a billing mistake — it is a failure of cost modeling, and it is the most predictable surprise in AI engineering.

The gap between staging and production costs for AI features is not random. It follows a consistent pattern: staging is structurally designed, often by accident, to hide every single cost driver that matters in production. Understanding those drivers is how you avoid the first invoice being a crisis.

The Domain Expert Bottleneck in RAG: Why Knowledge Curation Breaks Production AI

· 7 min read
Tian Pan
Software Engineer

Most teams building RAG systems spend their first month on the pipeline — chunking strategy, embedding model selection, vector store configuration, retrieval tuning. They get that working. The demo passes. Stakeholders are impressed.

Then six months later, the system starts quietly degrading. Support tickets reference wrong procedures. The bot cites a pricing tier that was retired in Q3. A customer gets a confident answer about a product feature that was deprecated before they even signed up. The pipeline is fine. The knowledge base is the problem.

Ensemble vs. Debate: The Two Multi-Model Verification Paradigms and When Each Fails

· 9 min read
Tian Pan
Software Engineer

When a single LLM gives you the wrong answer, the instinct is to ask more models. Run three in parallel and take the majority — that's ensemble. Or put them in a room and let them argue it out — that's debate. Both feel rigorous. Both have peer-reviewed results behind them. And both fail in exactly the same way when the conditions aren't right, which is the part practitioners rarely discuss.

The failure mode isn't subtle: when all your models learned from the same data, carry the same biases, or were trained by people with the same worldview, asking more of them doesn't give you more signal. It gives you more confident noise. Recent research has put a number on this: the pairwise error correlation between top frontier models sits around r = 0.77. That means roughly 60% of error variance is shared. Three models from different providers are effectively 1.3 independent models, not 3.0.
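The arithmetic behind these figures can be sketched in a few lines. The shared-variance number is simply the square of the correlation. The excerpt does not say how the effective-model count was derived, so the formula below, n / (1 + (n - 1) * rho), is an assumption borrowed from the standard effective-sample-size approximation for equally correlated estimates. The function names are hypothetical.

```python
def shared_variance(r: float) -> float:
    """Fraction of variance shared between two variables with correlation r."""
    return r * r


def effective_count(n: int, rho: float) -> float:
    """Assumed approximation: effective number of independent voters among n
    voters whose errors have a common pairwise correlation rho."""
    return n / (1 + (n - 1) * rho)


r = 0.77
print(round(shared_variance(r), 2))                      # 0.59
print(round(effective_count(3, r), 2))                   # 1.18
print(round(effective_count(3, shared_variance(r)), 2))  # 1.37
```

Under this assumed formula the result depends on whether the raw correlation or the shared variance is plugged in, but both land well below 2 and in the same range as the figure quoted above.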

Enterprise AI's Last Mile Problem: Why Most Pilots Never Reach Production

· 8 min read
Tian Pan
Software Engineer

A model that scores 94% on your internal benchmark, impresses stakeholders in a demo, and passes every offline evaluation can still reach production and drop to 7% effective accuracy on real customer data. This isn't a hypothetical. It's a documented outcome from multiple enterprise AI deployments, and it's one symptom of a broader pattern: the gap between "pilot success" and "production value" is where most enterprise AI quietly dies.

Across industries, roughly 85–88% of enterprise AI pilots never reach production. For every 33 PoCs an organization starts, only four ship. That ratio has barely moved in three years despite massive increases in model capability. The failure mode has nothing to do with whether the model is good enough — it's almost always about what happens between the successful demo and the moment a real user relies on the system to do real work.

Explanation Debt: Why Users Deserve to Know What Your AI Did

· 8 min read
Tian Pan
Software Engineer

A loan application gets rejected. A candidate gets filtered out of a hiring pipeline. A medical imaging tool flags a scan as abnormal. In each case, an AI system made a decision that matters—and the user has no idea why.

Teams building these systems often spent months tuning precision, recall, and output quality. They ran A/B tests, iterated on prompts, and shipped a model that gets the right answer 94% of the time. But they never built the layer that tells users what happened. This is explanation debt: the accumulated cost of shipping AI decisions without the attribution, confidence signals, and recourse affordances that make those decisions interpretable.

The Feedback Signal Timing Problem: Why Your AI Metrics Are Lying to You

· 9 min read
Tian Pan
Software Engineer

When Klarna deployed its AI customer service chatbot in early 2024, it processed 2.3 million conversations in the first month. Satisfaction scores matched human agents. Executives declared victory. By 2025, the company was quietly hiring back the human agents it had replaced.

What went wrong? The metrics told one story while users experienced another. The chatbot aced simple, transactional queries—order status, payment questions—but fell apart on complex disputes, fraud claims, and emotionally difficult conversations. CSAT scores averaged across all interaction types couldn't detect this. The system appeared to be working even as it was slowly eroding user trust.
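To see how an overall average can hide a failing segment, here is a small sketch with made-up data. The records, interaction-type labels, and 1-to-5 scale are hypothetical and are not Klarna's figures. The point is only that the blended score looks healthy while the per-type breakdown does not.

```python
from collections import defaultdict

# Hypothetical CSAT records: (interaction_type, score on a 1-5 scale).
records = (
    [("order_status", 5)] * 12
    + [("payment_question", 4)] * 6
    + [("dispute", 2)]
    + [("fraud_claim", 1)]
)


def mean(values: list[int]) -> float:
    return sum(values) / len(values)


def csat_by_type(rows: list[tuple[str, int]]) -> dict[str, float]:
    """Group scores by interaction type and average each group separately."""
    buckets: dict[str, list[int]] = defaultdict(list)
    for kind, score in rows:
        buckets[kind].append(score)
    return {kind: mean(scores) for kind, scores in buckets.items()}


overall = mean([score for _, score in records])
by_type = csat_by_type(records)

print(overall)  # 4.35
print(by_type)  # {'order_status': 5.0, 'payment_question': 4.0, 'dispute': 2.0, 'fraud_claim': 1.0}
```

Because the simple, high-volume queries dominate the sample, the blended 4.35 says nothing about the two categories where the scores are worst.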

This isn't a Klarna-specific failure. It's a pattern that repeats across AI product development: teams collect satisfaction signals, optimize against them, and discover too late that the signals were measuring something other than actual value. The problem isn't the tools—it's the timing mismatch between when feedback arrives and when the consequences of a response become clear.