
720 posts tagged with "llm"


The Ambient AI Coherence Problem: When Every Feature Is AI-Powered, Nothing Feels Like One Product

· 9 min read
Tian Pan
Software Engineer

Most AI products get the individual features right and the product wrong. Search returns plausible results. The summary is coherent. The chat assistant gives reasonable advice. But when a user searches for "best plan for small teams," gets a recommendation in the sidebar, asks the assistant a follow-up question, and then reads an auto-generated summary of their options — and all four contradict each other — none of the features feel trustworthy anymore. This is the ambient AI coherence problem: not hallucination in isolation, but contradiction at the product level.

The failure mode is subtle enough that teams often miss it entirely. Individual feature evals look fine. The search team measures recall and precision. The summarization team measures faithfulness. The chat team measures task completion. Nobody measures whether the AI-powered features of the product tell the same story about the same facts.
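One way to make that missing measurement concrete is a cross-feature consistency check. The sketch below is illustrative only: the feature names, the `plan_recommendation` fact key, and the `find_contradictions` helper are hypothetical, and it assumes each feature's output has already been reduced to a comparable value.

```python
from collections import defaultdict

def find_contradictions(claims):
    """claims: iterable of (feature, fact_key, value) tuples.

    Returns {fact_key: {value: [features]}} for facts where features disagree.
    """
    by_fact = defaultdict(lambda: defaultdict(list))
    for feature, fact_key, value in claims:
        by_fact[fact_key][value].append(feature)
    # Keep only the facts for which more than one distinct value was claimed.
    return {
        fact: dict(values)
        for fact, values in by_fact.items()
        if len(values) > 1
    }

# Hypothetical claims extracted from four AI-powered surfaces.
claims = [
    ("search", "plan_recommendation", "Team"),
    ("sidebar", "plan_recommendation", "Team"),
    ("chat", "plan_recommendation", "Business"),
    ("summary", "price_currency", "USD"),
]
contradictions = find_contradictions(claims)
```

Here only `plan_recommendation` is flagged, because the chat feature disagrees with search and the sidebar.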

The Good Enough Model Selection Trap: Why Your Team Is Overpaying for AI

· 9 min read
Tian Pan
Software Engineer

Most teams ship their first AI feature on the best model available, because that's what the demo ran on and nobody had time to think harder about it. Then a second feature ships on the same model. Then a third. Six months later, every call across every feature routes to the frontier tier — and the bill is five to ten times higher than it needs to be.

The uncomfortable truth is that 40–60% of the requests your production system processes don't require frontier-level reasoning at all. They require competent text processing. Competent text processing is dramatically cheaper to buy.
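A minimal sketch of what routing by task type might look like, assuming requests already carry a task label. The labels and the `FRONTIER_TASKS` set are hypothetical; which tasks truly need the frontier tier is something each team would have to measure.

```python
# Hypothetical task labels; this split is an assumption, not a recommendation.
FRONTIER_TASKS = {"multi_step_reasoning", "open_ended_analysis"}

def pick_tier(task_type: str) -> str:
    """Send only tasks on the allow-list to the expensive tier."""
    return "frontier" if task_type in FRONTIER_TASKS else "small"

def frontier_share(task_types):
    """Fraction of requests that would still hit the frontier tier."""
    tiers = [pick_tier(t) for t in task_types]
    return tiers.count("frontier") / len(tiers)

# Made-up traffic sample for illustration.
traffic = ["classification", "extraction", "multi_step_reasoning", "rewrite"]
share = frontier_share(traffic)
```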

The Inference Cost Paradox: Why Your AI Bill Goes Up as Models Get Cheaper

· 10 min read
Tian Pan
Software Engineer

In 2021, GPT-3 cost $60 per million tokens. By early 2026, you could buy equivalent performance for $0.06. That is a 1,000x reduction in three years. During the same period, enterprise AI spending grew 320%, from $11.5 billion to $37 billion. The organizations spending the most on AI are overwhelmingly the ones that benefited most from falling prices.

This is not a contradiction. It is the Jevons Paradox, and it is running your AI budget.
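The arithmetic behind that claim can be sketched as follows. It uses only the figures quoted above and makes one loud simplifying assumption: that spending scales as price times token volume, which ignores everything else in an enterprise AI budget.

```python
# Figures quoted in the excerpt above.
price_before = 60.0      # dollars per million tokens
price_after = 0.06
spend_before = 11.5e9    # dollars
spend_after = 37e9

price_drop = price_before / price_after      # roughly 1,000x cheaper
spend_growth = spend_after / spend_before    # roughly 3.2x more spend

# Assumption: spend = price * volume, so volume growth = spend growth * price drop.
implied_volume_growth = spend_growth * price_drop
```

Under that assumption, usage would have had to grow by a factor in the low thousands for spending to rise while unit prices fell this far.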

The Inference-Time Personalization Trap: When User Context Costs More Than It Earns

· 9 min read
Tian Pan
Software Engineer

There's a pattern that shows up in nearly every AI product once it hits a few hundred thousand active users: the team adds personalization (injecting user history, preference signals, and behavioral data into every prompt) and watches the product get slightly better while the infrastructure bill gets significantly worse. When they finally pull the logs and measure the quality delta per added token, the curve is almost always the same shape: steep gains early, then diminishing returns, then a long plateau you're paying full price for.

Most teams don't run that analysis until they're already in the hole. This post is about why the trap exists, where personalization stops paying, and what the architectures that actually work look like in production.
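The "quality delta per added token" analysis can be sketched in a few lines. The measurements, the quality scale, and the threshold below are all invented for illustration; the helper names are hypothetical.

```python
def marginal_gains(points):
    """points: list of (context_tokens, quality_score), sorted by tokens.

    Returns (tokens, quality gained per 1,000 added tokens) for each step.
    """
    gains = []
    for (t0, q0), (t1, q1) in zip(points, points[1:]):
        gains.append((t1, (q1 - q0) / (t1 - t0) * 1000))
    return gains

def cutoff(points, min_gain_per_1k):
    """Largest token budget reached before a step falls below the threshold."""
    best = points[0][0]
    for tokens, gain in marginal_gains(points):
        if gain < min_gain_per_1k:
            break
        best = tokens
    return best

# Made-up measurements: (personalization tokens injected, eval quality score).
measurements = [(0, 0.60), (500, 0.70), (1000, 0.74), (2000, 0.75), (4000, 0.755)]
budget = cutoff(measurements, min_gain_per_1k=0.05)
```

With these numbers the budget stops at 1,000 tokens, because the step to 2,000 tokens buys only about 0.01 quality per extra 1,000 tokens.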

The Instruction Position Problem: Where You Place Things in Your Prompt Is an Architecture Decision

· 9 min read
Tian Pan
Software Engineer

You wrote a clear system prompt. You tested it in the playground and it worked. You deployed it. Three weeks later, a user figures out that your safety constraint doesn't reliably fire — not because of a clever jailbreak, but because you placed the constraint after a 400-token context block that you added in the last sprint. The model just… forgot it was there.

This is the instruction position problem, and it's not a bug in your prompt. It's a structural property of how transformer-based models process sequences. Not every token in your prompt receives equal attention. Where you place an instruction determines, in a measurable way, whether the model will follow it.

The LLM Forgery Problem: When Your Model Builds a Convincing Case for the Wrong Answer

· 10 min read
Tian Pan
Software Engineer

Your model wrote a detailed, well-structured analysis. Every sentence was grammatically correct and internally consistent. The individual facts it cited were accurate. And yet the conclusion was wrong — not because the model lacked the information to get it right, but because it had already decided on the answer before it started reasoning.

This is not hallucination. Hallucination is when a model fabricates facts. The forgery problem is subtler and, in production systems, harder to catch: the model reaches a conclusion first, then constructs a plausible-sounding chain of evidence to support it. The facts are real. The synthesis is a lie.

Engineers who haven't encountered this failure mode yet will. It shows up in every domain where LLMs are asked to do analysis — code review, document summarization, risk assessment, question answering over a knowledge base. The model sounds authoritative. It cites real evidence. And it has quietly ignored everything that pointed the other way.

The Requirements Gap: How to Write Specs for AI Features When 'Correct' Is a Distribution

· 10 min read
Tian Pan
Software Engineer

Here is a spec that ships broken AI features on a predictable schedule: "The assistant should accurately answer customer questions and maintain a helpful tone." Every stakeholder nodded, the PRD was approved, and six months later the team is arguing in a post-mortem about whether an 87% accuracy rate was acceptable — a question nobody thought to answer before launch.

The failure is not technical. The model may have been fine. The failure is that the requirements format imported directly from traditional software left no room for the defining property of AI outputs: they are probabilistic. "Correct" is not a state; it is a distribution. And you cannot specify a distribution with a user story.
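One possible way to turn "correct is a distribution" into an acceptance criterion is to require a lower confidence bound on the pass rate instead of a point estimate. The sketch below uses the Wilson score interval; the sample size of 100 and the required rates are hypothetical, chosen only to match the 87% figure above.

```python
import math

def wilson_lower_bound(successes: int, n: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval (z=1.96 is roughly 95%)."""
    p = successes / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / (1 + z * z / n)

def meets_spec(successes: int, n: int, required_rate: float) -> bool:
    """Pass only if we are confident the true rate clears the bar."""
    return wilson_lower_bound(successes, n) >= required_rate

# Hypothetical eval run: 87 correct answers out of 100 graded samples.
lower = wilson_lower_bound(87, 100)
```

On 100 samples, an observed 87% supports a claim of roughly 79% or better, so a spec that demands 85% would not be met by this run.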

The Second Opinion Economy: When Dual-Model Verification Actually Pays Off

· 10 min read
Tian Pan
Software Engineer

The most seductive idea in AI engineering is that you can make any LLM system more reliable by running a second LLM to check the first one's work. On paper, it's obvious. In practice, teams that deploy this pattern naively often end up with 2x inference costs and a false sense of confidence — their "verification" is just the original model's biases running twice.

Done right, dual-model verification produces real accuracy gains: 6–18% on reasoning tasks, measurable improvements in RAG faithfulness, and meaningful catches in code correctness. Done wrong, two models agreeing on the same wrong answer is worse than one model failing, because now you've also disabled your uncertainty signal.

This post is about knowing the difference.
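A minimal sketch of the pattern, assuming the two models are passed in as plain callables (the stubs below stand in for real API calls, and `cross_check` is a hypothetical helper). Disagreement is treated as the uncertainty signal; agreement is not treated as proof, since two models with shared biases can agree on the same wrong answer.

```python
from typing import Callable, Dict

def cross_check(question: str,
                primary: Callable[[str], str],
                verifier: Callable[[str], str]) -> Dict[str, object]:
    """Ask both models independently and flag disagreement for escalation."""
    answer = primary(question)
    second = verifier(question)
    agree = answer.strip().lower() == second.strip().lower()
    return {"answer": answer, "agree": agree, "escalate": not agree}

# Stub models for illustration only.
model_a = lambda q: "Paris"
model_b = lambda q: "paris "
model_c = lambda q: "Lyon"

same = cross_check("Capital of France?", model_a, model_b)
differ = cross_check("Capital of France?", model_a, model_c)
```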

Treating Your LLM Provider as an Unreliable Upstream: The Distributed Systems Playbook for AI

· 11 min read
Tian Pan
Software Engineer

Your monitoring dashboard is green. Response times look fine. Error rates are near zero. And yet your users are filing tickets about garbage answers, your agent is making confidently wrong decisions, and your support queue is filling up with complaints that don't correlate with any infrastructure alert you have.

Welcome to the unique hell of depending on an LLM API in production. It's an upstream service that can fail you while returning a perfectly healthy 200 OK.
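A sketch of what checking beyond the status code could look like, assuming the application expects a JSON object with a non-empty `answer` string. That response shape and the `is_healthy` helper are assumptions for illustration, not any provider's actual schema.

```python
import json

def is_healthy(status_code: int, body: str, required_keys=("answer",)) -> bool:
    """Treat a 200 with an unusable body as a failure, not a success."""
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    if not isinstance(payload, dict):
        return False
    # Every required key must hold a non-blank string.
    return all(
        isinstance(payload.get(key), str) and payload[key].strip() != ""
        for key in required_keys
    )
```

A structural check like this only catches the crudest failures; judging whether the answer is actually right needs evaluation on top.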

The Inference Gateway Pattern: Why Every Production AI Team Builds the Same Middleware

· 8 min read
Tian Pan
Software Engineer

Every team shipping LLM-powered features goes through the same arc. First, you hardcode an OpenAI API call. Then you add a retry loop. Then someone asks how much you're spending. Then a provider goes down on a Friday afternoon, and suddenly you're building a gateway.

This isn't accidental. The inference gateway is an emergent architectural pattern — a middleware layer between your application and LLM providers that consolidates rate limiting, failover, cost tracking, prompt logging, and routing into a single chokepoint. It's the load balancer of the AI era, and if you're running models in production, you either have one or you're building one without realizing it.
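The core of such a gateway can be sketched as ordered failover plus bookkeeping. Everything here is a simplified assumption: providers are plain callables, the `Gateway` class is hypothetical, and rate limiting and real cost accounting are left out.

```python
from typing import Callable, Dict, List, Tuple

class Gateway:
    """Tries providers in order and records which one served each prompt."""

    def __init__(self, providers: List[Tuple[str, Callable[[str], str]]]):
        self.providers = providers
        self.calls: Dict[str, int] = {}          # per-provider success counts
        self.log: List[Tuple[str, str]] = []     # (provider, prompt) pairs

    def complete(self, prompt: str) -> str:
        errors = []
        for name, call in self.providers:
            try:
                result = call(prompt)
            except Exception as exc:  # fail over on any provider error
                errors.append(f"{name}: {exc}")
                continue
            self.calls[name] = self.calls.get(name, 0) + 1
            self.log.append((name, prompt))
            return result
        raise RuntimeError("all providers failed: " + "; ".join(errors))

# Stub providers for illustration only.
def flaky(prompt: str) -> str:
    raise TimeoutError("upstream timed out")

def backup(prompt: str) -> str:
    return "echo: " + prompt

gateway = Gateway([("primary", flaky), ("secondary", backup)])
reply = gateway.complete("hello")
```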

Knowledge Graphs Are Back: Why RAG Teams Are Adding Structure to Their Retrieval

· 8 min read
Tian Pan
Software Engineer

Your RAG pipeline answers single-fact questions beautifully. Ask it "What is our refund policy?" and it nails it every time. But ask "Which customers on the enterprise plan filed support tickets about the billing API within 30 days of their contract renewal?" and it falls apart. The answer exists in your data — scattered across three different document types, connected by relationships that cosine similarity cannot see.

This is the multi-hop reasoning problem, and it's the reason a growing number of production RAG teams are grafting knowledge graphs onto their vector retrieval pipelines. Not because graphs are trendy again, but because they've hit a concrete accuracy ceiling that no amount of chunk-size tuning or reranking can fix.
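To see why explicit relationships help, here is a toy version of the example question answered by following links between records instead of comparing embeddings. The customers, tickets, and dates are invented, and plain dictionaries stand in for a real graph store.

```python
from datetime import date

# Invented data: each ticket links to a customer, each customer to a plan.
customers = {
    "acme": {"plan": "enterprise", "renewal": date(2025, 3, 1)},
    "bolt": {"plan": "starter", "renewal": date(2025, 3, 10)},
    "cato": {"plan": "enterprise", "renewal": date(2025, 1, 1)},
}
tickets = [
    {"customer": "acme", "topic": "billing_api", "filed": date(2025, 3, 15)},
    {"customer": "acme", "topic": "sso", "filed": date(2025, 3, 5)},
    {"customer": "bolt", "topic": "billing_api", "filed": date(2025, 3, 12)},
    {"customer": "cato", "topic": "billing_api", "filed": date(2025, 6, 1)},
]

def enterprise_billing_tickets_near_renewal(window_days: int = 30):
    """Hop ticket -> customer -> plan and renewal date, then filter."""
    hits = set()
    for ticket in tickets:
        customer = customers[ticket["customer"]]
        if customer["plan"] != "enterprise" or ticket["topic"] != "billing_api":
            continue
        if abs((ticket["filed"] - customer["renewal"]).days) <= window_days:
            hits.add(ticket["customer"])
    return sorted(hits)

matches = enterprise_billing_tickets_near_renewal()
```

Each condition lives on a different record type, which is why no single retrieved chunk contains the answer.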

LLM Provider Lock-in: The Portability Patterns That Actually Work

· 8 min read
Tian Pan
Software Engineer

Everyone talks about avoiding LLM vendor lock-in. The advice usually boils down to "use an abstraction layer" — as if swapping openai.chat.completions.create for litellm.completion solves the problem. It doesn't. The API call is the easy part. The real lock-in is invisible: it lives in your prompts, your evaluation data, your tool-calling assumptions, and the behavioral quirks you've unconsciously designed around.

Provider portability isn't a boolean. It's a spectrum, and most teams are further from the portable end than they think. The good news is that the patterns for genuine portability are well understood — they just require more discipline than dropping in a wrapper library.
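One way to see where a system sits on that spectrum is to run the same behavioral checks against every provider behind the wrapper. The sketch below assumes providers are exposed as plain callables; the stubs, the two test cases, and the helper names are hypothetical.

```python
def pass_rate(model, cases):
    """cases: list of (prompt, check) where check(output) returns a bool."""
    passed = sum(1 for prompt, check in cases if check(model(prompt)))
    return passed / len(cases)

def portability_report(models, cases):
    """Same eval cases, every provider: gaps show hidden behavioral coupling."""
    return {name: pass_rate(fn, cases) for name, fn in models.items()}

# Checks that encode assumptions the application silently relies on.
cases = [
    ("Reply with only the word OK.", lambda out: out.strip() == "OK"),
    ("Return JSON with key 'id'.", lambda out: out.strip().startswith("{")),
]

# Stub providers: the second one adds chatty preamble to short answers.
models = {
    "provider_a": lambda p: "OK" if "OK" in p else '{"id": 1}',
    "provider_b": lambda p: "Sure! OK" if "OK" in p else '{"id": 1}',
}
report = portability_report(models, cases)
```

Both stubs sit behind an identical call signature, yet the second fails half the checks, which is the kind of lock-in a wrapper library does not remove.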