Skip to main content

639 posts tagged with "llm"

View all tags

The Two-Sided Cost of AI Content Filters: Why Over-Refusal Is a Business Problem Too

· 9 min read
Tian Pan
Software Engineer

Most AI content moderation systems are built around a single question: did harmful content get through? False negatives — the bad stuff you missed — show up in screenshots shared on social media, in incident post-mortems, in regulatory inquiries. False positives — legitimate content you blocked — tend to disappear quietly, absorbed as user frustration, abandoned sessions, and churned accounts. This asymmetry in visibility drives a systematic miscalibration: teams build filters that are too aggressive, then wonder why professional users find their product "completely useless."

The engineering reality is that every threshold decision creates two error rates, not one. Optimizing only for the rate you can measure most easily produces filters that work well in demos but create real business damage at scale.

Your Prompts Are Configuration: Treating AI Settings as Production Infrastructure

· 9 min read
Tian Pan
Software Engineer

Most engineering teams can tell you exactly which environment variable controls their database connection pool. Almost none can tell you which system prompt version is serving 90% of their traffic right now — or what changed since the last model behavior complaint rolled in.

This is the AI configuration footprint problem. Teams building LLM-powered features accumulate an implicit configuration layer — model selection, sampling parameters, system prompts, tool schemas, retry budgets — that governs how their product behaves in production. Most of this layer lives in no system of record. It gets updated through direct code edits, spreadsheet hand-offs, or Slack messages. When something breaks, nobody can say what changed.

That's not a process problem. It's an architecture problem. And the fix requires treating AI configuration with the same rigor that mature teams bring to environment config, feature flags, and infrastructure-as-code.

AI Documentation Debt: How Stochastic Systems Break Your Technical Knowledge Base

· 9 min read
Tian Pan
Software Engineer

Your AI feature shipped cleanly. The documentation looked good: input schema, expected outputs, a worked example. Three months later, a model update arrives silently. The outputs shift. Your docs are wrong but nobody knows it yet — because they still look right.

This is the core of AI documentation debt, and it compounds faster than any other kind of technical debt because the failure is invisible until a user finds it.

AI System Design Advisor: What It Gets Right, What It Gets Confidently Wrong, and How to Tell the Difference

· 9 min read
Tian Pan
Software Engineer

A three-person team spent a quarter implementing event sourcing for an application serving 200 daily active users. The architecture was technically elegant. It was operationally ruinous. The design came from an AI recommendation, and the team accepted it because the reasoning was fluent, the tradeoff analysis sounded rigorous, and the system they ended up with looked exactly like the kind of thing you'd see on a senior engineer's architecture diagram.

That story is now a cautionary pattern, not an edge case. AI produces genuinely useful architectural input in specific, identifiable situations — and produces confidently wrong advice in situations that look nearly identical from the outside. The gap between them is not obvious if you approach AI as an answer machine. It becomes navigable if you approach it as a sparring partner.

Behavioral Cloning for System Prompts: Preserving Expert Judgment Before It Walks Out the Door

· 9 min read
Tian Pan
Software Engineer

Your best system prompt was written by someone who no longer works here.

That sentence lands differently depending on where you sit in the organization. If you're an engineer who inherited an undocumented 3,000-token prompt that governs a production AI feature, you've already lived this. You've stared at a clause like "Do not include supplementary data unless context warrants it" and had no idea what "context" means, what triggered this rule, or whether removing it would cause a 5% quality improvement or a catastrophic regression. If you're a team lead, you've watched institutional knowledge walk out the door every time a senior engineer or prompt specialist changes jobs — and that knowledge didn't go into the documentation because nobody knew there was anything to document.

This is the system prompt knowledge problem, and it's worse than most teams realize. The fix borrows an idea from robotics research and applies it to a deeply human engineering challenge: behavioral cloning — capturing what an expert does, and why, before they're no longer there to ask.

The Budget Inversion Trap: Why Your Most Valuable AI Features Get the Cheapest Inference

· 8 min read
Tian Pan
Software Engineer

Most teams optimize AI inference costs by routing cheaper queries to cheaper models. That sounds reasonable — and it's backwards. The queries that go to cheap models first aren't the simple ones. They're the complex ones, because those are the expensive ones your FinOps dashboard flagged.

The result: your contract renewal workflow, the one that closes six-figure deals, runs on a model that hallucinates clause references. Your customer support triage — entry-level stuff, genuinely low-stakes — gets frontier model treatment because nobody complained about it yet.

This is the budget inversion trap. It's not caused by negligence. It's the predictable output of applying cost pressure without value context.

Chain-of-Thought Has Two Failure Modes Nobody Talks About

· 9 min read
Tian Pan
Software Engineer

Chain-of-thought prompting was supposed to solve the black-box problem with language models. Show the work, verify the steps, understand how the model reached its conclusion. The idea is intuitively right — and that's the problem. It feels so obviously correct that practitioners deploy visible reasoning chains into production systems without asking a harder question: what if showing the work makes things worse?

Recent research from 2024–2026 has started to systematically document what that "worse" looks like. Visible reasoning chains cause two distinct failure modes that often go unnoticed until something breaks in production. The first is a user-side problem: intermediate reasoning steps anchor users to potentially wrong conclusions before they've seen the final answer. The second is a systems problem: reasoning traces create the illusion of an audit trail while being fundamentally unreliable as explanations of how the model actually decided.

Dynamic System Prompt Assembly: Composable AI Behavior at Request Time

· 10 min read
Tian Pan
Software Engineer

Most teams start with a single, monolithic system prompt. It works fine in demos. Then the product grows: you add a power user tier, a compliance mode for enterprise customers, a new tool the model can call, and a feature-flag experiment your growth team wants to A/B test. You add all of that to the same prompt. Six months in, you have 4,000 words of instructions that nobody fully understands, behavior that changes unpredictably when you edit one section, and a debugging process that amounts to "change something and see what happens."

The answer most teams reach for is composable, dynamically assembled system prompts — building the prompt from modular components at request time rather than maintaining a static text file. It's a sound architectural instinct, but the implementation surface is larger than it looks. Composable prompts introduce a new class of failure modes that static prompts simply don't have.

The Eval-Prod Gap: Detecting Behavioral Mode Switching in Production LLMs

· 9 min read
Tian Pan
Software Engineer

Your eval suite is green. Your benchmark scores are strong. Your staging environment looks clean. And yet — your users are reporting subtly wrong answers, inconsistent tone, and outputs that feel off in ways that are hard to pinpoint.

This is the behavioral mode switching problem: a production LLM that performs well when it knows it's being evaluated and drifts noticeably when it doesn't. It's not a hypothetical. It's the quiet majority failure mode of LLM deployments that teams discover late, after they've shipped confidence to stakeholders that the model's behavior was verified.

The problem isn't that your eval harness is lazy. It's that most eval harnesses are structurally incapable of detecting this class of failure.

Graph Reasoning Gaps in LLMs: Scaffolding Relational Tasks That Fool Sequence-Trained Models

· 9 min read
Tian Pan
Software Engineer

A common mistake in AI system design is asking a language model to reason over a graph as if it were reading a document. The model will generate a confident, fluent answer. The answer will be wrong in a way that looks right — it will name real nodes, reference plausible paths, and describe relationships that almost exist. Then you discover your org-chart traversal hallucinates skip-level managers, your dependency resolution misses cycles in graphs over ten nodes, and your three-hop knowledge graph query has a 60% error rate at step two.

This is not a prompt quality problem. It is an architecture problem, and you can diagnose it before writing a single prompt.

The Inference Fleet: Applying SRE Discipline to Multi-Provider LLM Dependencies

· 11 min read
Tian Pan
Software Engineer

Here is a failure mode that does not show up on any dashboard until it is too late: your production system is silently degrading because a secondary LLM provider started returning malformed responses three days ago, nobody owns that provider in your on-call rotation, and the only signal is a slow uptick in user-reported errors that your support team has not yet escalated. You find out when a customer cancels.

This is not a model quality problem. It is an operational discipline problem. And it is becoming more common as production AI stacks grow from a single OpenAI integration into a multi-provider, multi-endpoint sprawl that nobody designed as a fleet — but that is what it has become.

Knowledge Age Routing: Matching Queries to the Right Temporal Layer in Production AI

· 9 min read
Tian Pan
Software Engineer

Here is a scenario that surfaces in production more often than anyone likes to admit. A user asks your AI assistant what the current interest rate policy is. Your RAG system fetches a highly relevant Federal Reserve document—semantically it scores 0.91 similarity—and the model confidently returns an answer. The answer is six months out of date. The RAG index was last refreshed in October. The parametric knowledge is older still. A live API call would have returned the correct current figure in 400 milliseconds, but nobody wired up the routing logic to ask: how old is this question's answer allowed to be?

That failure is not a retrieval failure. It is a temporal routing failure. The system had access to correct information somewhere in its stack. It just sent the query to the wrong layer.