Skip to main content

702 posts tagged with "llm"

View all tags

AI Output Volatility Is a Business Risk You're Probably Underpricing

· 9 min read
Tian Pan
Software Engineer

When companies talk about AI risk, the conversation usually gravitates toward the obvious failures: hallucinated facts, biased outputs, legal liability from generated content. What gets far less attention is a quieter structural problem: you've made commercial commitments — pricing tiers, SLAs, customer-facing accuracy claims — on top of a system whose outputs are inherently probabilistic. Every time the model generates a response, it's sampling from a distribution. The contract doesn't mention distributions.

This is a business risk that most teams discover late, when a customer complains that the same document review workflow gave completely different results on Monday and Friday. Or when a regulator asks for reproducibility guarantees that the system architecturally cannot provide.

Your System Prompts Are Still in English: The Silent Cost of Incomplete AI Localization

· 8 min read
Tian Pan
Software Engineer

Your team ships an AI feature. You celebrate the localization work: every button label, tooltip, and error message has been translated into twelve languages. The product manager signs off. The feature goes live globally.

Then, six weeks later, a user in Germany posts a screenshot. The AI's response has the right words but wrong register — awkward formality for a casual support context. A Japanese user reports that structured outputs contain dates formatted as MM/DD/YYYY, confusing their downstream tooling. A Brazilian support engineer notices the AI occasionally slips into English mid-sentence when reasoning through complex queries. These aren't infrastructure failures. Your dashboards show green. But for non-English users, the product is quietly worse.

The root cause is almost always the same: teams translate UI strings but leave system prompts in English. It feels like localization. It isn't.

The Context Format Decision Most Teams Make Accidentally: JSON vs Markdown vs Plain Text

· 9 min read
Tian Pan
Software Engineer

Most teams pick a context format once, early in development, and never revisit it. A developer reaches for JSON because it looks structured and machine-readable. Another grabs markdown because it's what they use in README files. Plain text gets chosen when nothing else seems necessary. These are not engineering decisions — they're habits. And they silently shape how your model reasons.

The format you pass to an LLM is not inert packaging. It is an instruction. Structured JSON context primes the model toward schema-following behavior. Markdown encourages hierarchical synthesis. Plain text opens up more flexible inference. Getting this wrong by even one format category can degrade accuracy by 40% or more — and you won't see the error in your logs.

The AI Code Feedback Loop: How Today's Generated Code Trains Tomorrow's Models

· 9 min read
Tian Pan
Software Engineer

About 41% of all new code merged globally in 2025 was AI-generated. Most of that code flows into production repositories that are publicly indexed, scraped, and eventually fed back into the next round of training data for AI coding tools. The implication is straightforward but its consequences are still unfolding: AI models are increasingly being trained on the outputs of prior AI models, with no structured record of which code came from where.

This is the context pollution problem. It is not hypothetical. The feedback loop is already operating at scale, the quality effects are measurable, and the failure mode is unusual enough that it can look like improvement in the short term while the underlying distribution quietly degrades.

The Cross-User Consistency Problem: When Your AI Gives Different Answers to the Same Question

· 9 min read
Tian Pan
Software Engineer

Two analysts at the same company both ask your AI assistant: "What was our Q3 churn rate?" One gets 4.2%. The other gets 4.8%. Neither is wrong — they just queried at different times, in different session contexts, against a retrieval index that ranked slightly different chunks. The AI answered both confidently, without hedging, without flagging the discrepancy. The analysts go into the same meeting with different numbers and your tool has just become a liability.

This is the cross-user consistency problem, and it's one of the most common reasons enterprise AI deployments quietly lose trust. The failure isn't a hallucination in the classic sense — no facts were invented. The failure is that your system is non-deterministic at scale, and that non-determinism is invisible until two users compare notes.

The Dev-to-Prod Cost Shock: Why Your AI Feature Costs Pennies in Staging and Dollars in Production

· 8 min read
Tian Pan
Software Engineer

A proof-of-concept costs you $200 in API tokens. You get the green light to ship. Six weeks later, the invoice is $18,000. This is not a pricing change or a billing mistake — it is a failure of cost modeling, and it is the most predictable surprise in AI engineering.

The gap between staging and production costs for AI features is not random. It follows a consistent pattern: staging is structurally designed, often by accident, to hide every single cost driver that matters in production. Understanding those drivers is how you avoid the first invoice being a crisis.

Ensemble vs. Debate: The Two Multi-Model Verification Paradigms and When Each Fails

· 9 min read
Tian Pan
Software Engineer

When a single LLM gives you the wrong answer, the instinct is to ask more models. Run three in parallel and take the majority — that's ensemble. Or put them in a room and let them argue it out — that's debate. Both feel rigorous. Both have peer-reviewed results behind them. And both fail in exactly the same way when the conditions aren't right, which is the part practitioners rarely discuss.

The failure mode isn't subtle: when all your models learned from the same data, carry the same biases, or were trained by people with the same worldview, asking more of them doesn't give you more signal. It gives you more confident noise. Recent research has put a number on this: the pairwise error correlation between top frontier models sits around r = 0.77. That means roughly 60% of error variance is shared. Three models from different providers are effectively 1.3 independent models, not 3.0.

The Feedback Signal Timing Problem: Why Your AI Metrics Are Lying to You

· 9 min read
Tian Pan
Software Engineer

When Klarna deployed its AI customer service chatbot in early 2024, it processed 2.3 million conversations in the first month. Satisfaction scores matched human agents. Executives declared victory. By 2025, the company was quietly hiring back the human agents it had replaced.

What went wrong? The metrics told one story while users experienced another. The chatbot aced simple, transactional queries—order status, payment questions—but fell apart on complex disputes, fraud claims, and emotionally difficult conversations. CSAT scores averaged across all interaction types couldn't detect this. The system appeared to be working even as it was slowly eroding user trust.

This isn't a Klarna-specific failure. It's a pattern that repeats across AI product development: teams collect satisfaction signals, optimize against them, and discover too late that the signals were measuring something other than actual value. The problem isn't the tools—it's the timing mismatch between when feedback arrives and when the consequences of a response become clear.

Gradual Context Replacement: Managing Long AI Conversations Without Losing Quality

· 9 min read
Tian Pan
Software Engineer

Your chatbot works perfectly for the first fifteen turns. Then something goes wrong. It contradicts an earlier decision. It asks for information the user already provided. It loses the thread of a multi-step task that was clearly defined at the start. The conversation history is technically all there—you haven't deleted anything—but the model is behaving as if it wasn't.

This is context rot: the gradual degradation of output quality as conversation histories grow. A 2024 evaluation of 18 state-of-the-art models across nearly 200,000 controlled calls found that reliability decreases significantly beyond 30,000 tokens, even in models with much larger nominal windows. High-performing models become as unreliable as much smaller ones in extended dialogues. The problem isn't that your context window ran out. It's that transformer attention is quadratic—100,000 tokens means 10 billion pairwise relationships—and the model is forced to distribute focus so thinly that important earlier content gets effectively ignored.

When teams hit this wall, they usually reach for one of two fixes: truncation or summarization. Both make things worse in predictable ways.

The Helpful AI Paradox: Why Instruction-Following Is a Security Vulnerability

· 9 min read
Tian Pan
Software Engineer

There's an uncomfortable truth about LLMs that doesn't get discussed enough in product reviews: the property that makes them useful is identical to the property that makes them exploitable. An LLM that obediently follows instructions — any instructions, from any source, delivered in any format — will follow malicious instructions with the same cheerful compliance it applies to legitimate ones. The model cannot tell the difference.

This isn't a bug that will be patched away. It's an architectural reality. And as these systems take on more agentic roles — reading emails, browsing the web, executing code, calling APIs — the exposure surface grows in ways that most engineering teams haven't mapped.

LLM-as-Judge Adversarial Failures: When Your Eval Harness Gets Gamed

· 9 min read
Tian Pan
Software Engineer

Your LLM-as-judge gave your new model a clean bill of health. Win rates are up, rubric scores improved across the board, and the automated eval pipeline ran green. Then you shipped — and user satisfaction dropped.

This is not an edge case. Researchers built constant-output "null models" that produce the exact same response regardless of input and gamed AlpacaEval 2.0 to an 86.5% length-controlled win rate. The verified state of the art at the time was 57.5%. When a model with no task capability at all can top your leaderboard, your eval harness has a problem that's worth understanding systematically.

Your Load Tests Are Lying: LLM Provider Capacity Contention in Production

· 11 min read
Tian Pan
Software Engineer

You ran a load test. Your p95 latency was 450ms. You felt good about it, shipped the feature, and then your on-call rotation lit up two weeks later because users were seeing 25-second response times at 9 AM on a Tuesday.

Nothing changed in your code. No deployment, no config change. The provider's status page said "operational." And yet your app was unusable for 20 minutes during peak business hours.

This is the LLM capacity contention problem, and it's one of the most common failure modes engineers don't see coming until they've already been burned.