578 posts tagged with "insider"

The Noisy Neighbor Problem in Shared LLM Infrastructure: Tenancy Models for AI Features

· 12 min read
Tian Pan
Software Engineer

The pager goes off at 2:47 AM. The customer-facing chat assistant is returning 429s for half of paying users. Engineers scramble through dashboards, looking for the bug they shipped that afternoon. They find nothing: the code is fine. The actual culprit is a batch summarization job a different team launched that evening, sharing the same provider API key, which will keep the account's per-minute token budget saturated for the next four hours. Nobody owns the shared key. Nobody owns the limit.

This is the noisy-neighbor problem, and it has a particular cruelty in LLM systems that classic API quota incidents do not. A REST endpoint that hits its rate ceiling fails fast and gets retried; an LLM token-per-minute bucket is consumed asymmetrically by request content, so a single feature emitting 8K-token completions can starve a feature making cheap 200-token classification calls without ever appearing in request-count graphs. The traffic isn't noisy in the dimension you're measuring.

Most teams discover this the way the team above did: an unrelated team's job collides with a paying user's session, and the only thing both have in common is a string in an environment variable.
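
One way out of that trap is to partition the shared budget per feature and meter it in tokens rather than requests. A minimal sketch of the idea follows; the feature names and per-minute allowances are illustrative assumptions, not values from any real provider.

```python
import time

class TokenBucket:
    """Per-feature token budget that refills continuously at tpm/60 tokens per second."""

    def __init__(self, tokens_per_minute: int):
        self.capacity = tokens_per_minute
        self.tokens = float(tokens_per_minute)
        self.refill_rate = tokens_per_minute / 60.0
        self.last_refill = time.monotonic()

    def try_consume(self, estimated_tokens: int) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= estimated_tokens:
            self.tokens -= estimated_tokens
            return True
        return False  # this feature is out of budget; its neighbors are unaffected

# Hypothetical partition of a 100K-tpm account: the chat assistant gets a
# guaranteed slice, the batch job gets its own, so neither can starve the other.
budgets = {
    "chat-assistant": TokenBucket(tokens_per_minute=60_000),
    "batch-summarizer": TokenBucket(tokens_per_minute=40_000),
}

def admit(feature: str, prompt_tokens: int, max_completion_tokens: int) -> bool:
    # Meter in tokens, not requests: an 8K-token completion costs 40x a 200-token one.
    return budgets[feature].try_consume(prompt_tokens + max_completion_tokens)
```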

PII in the Prompt Layer: The Privacy Engineering Gap Most Teams Ignore

· 12 min read
Tian Pan
Software Engineer

Your organization has a privacy policy. It says something reasonable about user data being handled carefully, retention limits, and compliance with GDPR and HIPAA. What it almost certainly does not say is whether a user's name, email address, or medical history was transmitted verbatim to a hosted LLM API before any policy control was applied.

That gap — between the privacy policy you can point to and the privacy guarantee you can actually prove — is where most production LLM systems are silently failing. Research shows roughly 8.5% of prompts submitted to tools like ChatGPT and Copilot contain sensitive information, including PII, credentials, and internal file references. In enterprise environments where users paste emails, customer data, and support tickets into AI-assisted workflows, that number almost certainly runs higher.

The problem is not that developers are careless. It is that the LLM prompt layer was never designed as a data processing boundary. It inherits content from upstream systems — user input, RAG retrievals, agent context — without enforcing the data classification rules that govern every other part of the stack.
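
One concrete shape that boundary can take is a redaction pass applied before any provider call. The sketch below uses naive regexes purely for illustration; a production boundary would use an NER model or a dedicated DLP service, and every pattern and name here is hypothetical.

```python
import re

# Illustrative patterns only; real systems need far more robust detection.
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def redact(text: str) -> str:
    """Enforce classification rules at the prompt boundary, before any provider call."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

def call_llm(prompt: str) -> str:
    safe_prompt = redact(prompt)               # the boundary most stacks never apply
    # return provider.complete(safe_prompt)    # hypothetical provider call
    return safe_prompt

print(call_llm("Reach the patient at jane@example.com, SSN 123-45-6789."))
```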

Pricing Your AI Product: Escaping the Compute Cost Trap

· 10 min read
Tian Pan
Software Engineer

There is a company charging £50 per month per user. Their AI feature consumes £30 in API fees. That leaves £20 to cover hosting, support, and profit — before accounting for a single refund or churned seat. They built a product users love, grew to thousands of subscribers, and unknowingly constructed a business where more customers means more losses.

This is not a cautionary tale about a bad idea. It is a cautionary tale about a pricing architecture imported from a world where the marginal cost of serving the next user was effectively zero. That world no longer fully applies when your product calls a language model.

Traditional SaaS gross margins run 70–90%. AI-forward companies are reporting 50–60% — and the gap is mostly explained by one line item: inference. When tokens are 20–40% of your cost of goods sold, the standard SaaS playbook inverts.
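
The arithmetic behind that inversion is worth writing down. A back-of-the-envelope sketch using the £50/£30 example above; the other-COGS figure is an assumption for illustration.

```python
# Back-of-the-envelope unit economics for the example above (figures illustrative).
price_per_seat = 50.0   # monthly revenue per user
inference_cost = 30.0   # API fees per user: the marginal cost SaaS pricing assumed away
other_cogs = 5.0        # hosting, support, etc. (assumed)

gross_margin = (price_per_seat - inference_cost - other_cogs) / price_per_seat
print(f"Gross margin: {gross_margin:.0%}")  # 30%, far below the 70-90% SaaS norm

# Here inference is ~86% of COGS, well past the 20-40% band where the playbook inverts.
print(f"Inference share of COGS: {inference_cost / (inference_cost + other_cogs):.0%}")
```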

Proactive Agents: Event-Driven and Scheduled Automation for Background AI

· 11 min read
Tian Pan
Software Engineer

Almost every tutorial on building AI agents starts the same way: user types a message, agent reasons, agent responds. That model works fine for chatbots and copilots. It fails to describe the majority of production AI work that organizations are now deploying.

The agents that quietly matter most in enterprise environments don't wait for a message. They wake up when a database row changes, when a queue crosses a depth threshold, when a scheduled cron fires at 3 AM, or when monitoring detects that a metric drifted outside bounds. They act without a user present. When they fail, nobody notices until the damage has compounded.

Building these proactive agents requires a substantially different design vocabulary than building reactive assistants. The session-scoped mental model that works for conversational AI breaks down when your agent runs in a loop, retries in the background, and has no human to catch its mistakes.
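
To make the contrast concrete, here is a minimal sketch of a threshold-triggered background agent: it wakes on a metric, not a message, and has to page a human on failure because nobody is watching. All names and numbers are hypothetical.

```python
import time

def alert_oncall(exc: Exception) -> None:
    print(f"ALERT: background agent failed: {exc}")

def run_agent(trigger: dict) -> None:
    """Hypothetical agent entry point: no session, no user present."""
    try:
        print(f"agent fired: {trigger}")  # stand-in for the real agent loop
    except Exception as exc:
        alert_oncall(exc)  # with no human in the loop, failure must page someone

def poll_forever(read_queue_depth, threshold: int = 100, interval_s: int = 60) -> None:
    """A threshold trigger: the agent wakes on a metric, not a message."""
    while True:
        depth = read_queue_depth()
        if depth > threshold:
            run_agent({"event": "queue_depth_exceeded", "depth": depth})
        time.sleep(interval_s)

# One manual tick for illustration; a real deployment would run poll_forever under
# a supervisor, or hang the same handler off a cron schedule or message-bus event.
run_agent({"event": "queue_depth_exceeded", "depth": 120})
```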

The Retrieval Emptiness Problem: Why Your RAG Refuses to Say 'I Don't Know'

· 10 min read
Tian Pan
Software Engineer

Ask a production RAG system a question your corpus cannot answer and watch what happens. It rarely says "I don't have that information." Instead, it retrieves the five highest-ranked chunks — which, having nothing better to match, are the five least-bad chunks of unrelated content — and hands them to the model with a prompt that reads something like "answer the user's question using the context below." The model, trained to be helpful and now holding text that sort of resembles the topic, produces a confident answer. The answer is wrong in a way that's architecturally invisible: the retrieval succeeded, the generation succeeded, every span was grounded in a retrieved document, and the user walked away misled.

This is the retrieval emptiness problem. It isn't a bug in any single layer. It's the emergent behavior of a pipeline that treats "top-k" as a contract and never asks whether the top-k is any good. Research published at ICLR 2025 on "sufficient context" quantified the effect: when Gemma receives sufficient context, its hallucination rate on factual QA is around 10%. When it receives insufficient context — retrieved documents that don't actually contain the answer — that rate jumps to 66%. Adding retrieved documents to an under-specified query makes the model more confidently wrong, not less.
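
One minimal countermeasure is to stop treating top-k as a contract: gate generation on retrieval quality and abstain when the evidence is weak. The sketch below assumes a hypothetical `retriever`/`llm` interface, and the 0.75 score threshold is an illustrative value you would calibrate on labeled queries.

```python
def answer(query: str, retriever, llm, min_score: float = 0.75) -> str:
    """Treat top-k as a proposal, not a contract: gate generation on evidence quality."""
    hits = retriever.search(query, k=5)  # hypothetical retriever interface
    confident = [h for h in hits if h.score >= min_score]
    if not confident:
        # Abstain instead of handing the model least-bad chunks to be helpful with.
        return "I don't have that information in the indexed documents."
    context = "\n\n".join(h.text for h in confident)
    return llm.complete(
        "Answer ONLY from the context below. If the context does not contain "
        f"the answer, say you don't know.\n\nContext:\n{context}\n\nQuestion: {query}"
    )
```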

Retry Budgets for LLM Agents: Why 20% Per-Step Failure Doubles Your Token Bill

· 8 min read
Tian Pan
Software Engineer

Most teams discover their retry problem when the invoice shows up. The agent "worked"; latency dashboards stayed green; error rates looked fine. Then finance asks why inference spend doubled this month, and someone finally reads the logs. It turns out that 20% of the tool calls in a 3-step agent were quietly retrying, each retry replayed the full prompt history, and the bill had been ramping for weeks.

The math on this is not mysterious, but it is aggressively counterintuitive. A 20% per-step retry rate sounds tolerable — most engineers would glance at it and move on. The actual token cost, once you factor in how modern agent frameworks retry, lands much closer to 2x than 1.2x. And the failure mode is invisible to every metric teams typically watch.

Retry budgets — an old idea from Google SRE work — are the cleanest fix. But the LLM version of the pattern needs tweaking, because tokens don't behave like RPCs.
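
For flavor, here is what the token-denominated version of a retry budget might look like: retries may spend at most a fixed fraction of the tokens consumed by first attempts, so a replayed 8K-token prompt history is charged at its real cost. The 10% ratio is an illustrative default, not a recommendation.

```python
class TokenRetryBudget:
    """An SRE-style retry budget, metered in tokens instead of requests: retries may
    spend at most `ratio` of the tokens already consumed by first attempts."""

    def __init__(self, ratio: float = 0.1):
        self.ratio = ratio
        self.first_attempt_tokens = 0
        self.retry_tokens = 0

    def record_attempt(self, tokens: int, is_retry: bool) -> None:
        if is_retry:
            self.retry_tokens += tokens
        else:
            self.first_attempt_tokens += tokens

    def allow_retry(self, estimated_tokens: int) -> bool:
        # A retry replays the full prompt history, so charge it at its real token
        # cost rather than counting it as "one more request".
        return (self.retry_tokens + estimated_tokens
                <= self.ratio * self.first_attempt_tokens)

budget = TokenRetryBudget(ratio=0.1)
budget.record_attempt(tokens=8_000, is_retry=False)
print(budget.allow_retry(estimated_tokens=8_000))  # False: one replay would blow 10%
```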

Designing AI Safety Layers That Don't Kill Your Latency

· 9 min read
Tian Pan
Software Engineer

Most teams reach for guardrails the same way they reach for logging: bolt it on, assume it's cheap, move on. It isn't cheap. A content moderation check takes 10–50ms. Add PII detection, another 20–80ms. Throw in output schema validation and a toxicity classifier and you're looking at 200–400ms of overhead stacked serially before a single token reaches the user. Combine that with a 500ms model response and your "fast" AI feature now feels sluggish.

The instinct to blame the LLM is wrong. The guardrails are the bottleneck. And the fix isn't to remove safety — it's to stop treating safety checks as an undifferentiated pile and start treating them as an architecture problem.
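
The most common architectural move is to run independent checks concurrently instead of stacking them serially. A minimal asyncio sketch, with stand-in latencies drawn from the ranges above:

```python
import asyncio

# Stand-in checks with illustrative latencies.
async def moderation(text: str) -> str:
    await asyncio.sleep(0.05)   # ~50ms content moderation
    return "ok"

async def pii_scan(text: str) -> str:
    await asyncio.sleep(0.08)   # ~80ms PII detection
    return "ok"

async def toxicity(text: str) -> str:
    await asyncio.sleep(0.15)   # ~150ms toxicity classifier
    return "ok"

async def guarded_input(text: str) -> str:
    # Independent checks run concurrently: wall time ~= the slowest check (~150ms),
    # not the ~280ms you pay when they are stacked serially.
    results = await asyncio.gather(moderation(text), pii_scan(text), toxicity(text))
    if any(r != "ok" for r in results):
        raise ValueError("input blocked by safety layer")
    return text

asyncio.run(guarded_input("hello"))
```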

SFT, RLHF, and DPO: The Alignment Method Decision Matrix for Narrow Domain Applications

· 11 min read
Tian Pan
Software Engineer

Most teams that decide to fine-tune a model spend weeks debating which method to use before they've written a single line of training code. The debate rarely surfaces the right question. The real question is not "SFT or DPO?" — it's "what kind of gap am I trying to close?"

Supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), and direct preference optimization (DPO) are not competing answers to the same problem. Each targets a different failure mode. Reaching for RLHF when SFT would have sufficed wastes months. Reaching for SFT when the problem is actually a preference mismatch produces a model that's fluent but wrong in ways that are hard to detect until they surface in production.

This post is a decision framework. It maps each method to the specific problem it solves, explains what signals indicate which method will dominate, and provides a diagnostic methodology for identifying where your actual gap lives before you commit to a training run.

Why SQL Agents Fail in Production: Grounding LLMs Against Live Relational Databases

· 11 min read
Tian Pan
Software Engineer

The Spider benchmark looks great. GPT-4 scores above 85% on text-to-SQL translation across hundreds of test queries. Teams read those numbers, wire up a LangChain SQLDatabaseChain, and ship an "ask your data" feature. Two weeks later, an analyst's innocent question about revenue by region triggers a full table scan that takes down reporting for thirty minutes.

The benchmark number was real. The problem is that benchmarks don't use your schema.

Spider 1.0 tests models on databases with 5–30 tables and 50–100 columns. Your production data warehouse has 200 tables, 700+ columns, three dialects of SQL depending on which system you're querying, and column names that made sense to the engineer who wrote them four years ago but are meaningless to anyone else. When researchers introduced Spider 2.0—a benchmark with enterprise-scale schemas and real-world complexity—GPT-4o dropped from 86.6% to 10.1% success rate. That collapse is what production actually looks like.
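
Grounding starts with refusing to execute generated SQL blindly. The sketch below shows the shape of a pre-execution guard: reject writes, cap result size, and inspect the query plan first. It uses SQLite as a stand-in; real warehouses need dialect-specific checks, and the full-scan detector here is deliberately crude.

```python
import sqlite3

def run_generated_sql(conn: sqlite3.Connection, sql: str, row_limit: int = 1000):
    """Minimal guards for model-generated SQL (a sketch, not an exhaustive defense)."""
    if not sql.lstrip().lower().startswith("select"):
        raise ValueError("only SELECT statements are allowed")
    plan = conn.execute(f"EXPLAIN QUERY PLAN {sql}").fetchall()
    if any("SCAN" in str(row) for row in plan):  # crude full-table-scan detector
        raise ValueError(f"rejected: query plan contains a full table scan: {plan}")
    return conn.execute(sql).fetchmany(row_limit)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE revenue (region TEXT PRIMARY KEY, amount REAL)")
conn.execute("INSERT INTO revenue VALUES ('emea', 100.0)")
# An indexed point lookup passes; an unindexed scan would be rejected above.
print(run_generated_sql(conn, "SELECT amount FROM revenue WHERE region = 'emea'"))
```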

Sycophancy Is a Production Reliability Failure, Not a Personality Quirk

· 10 min read
Tian Pan
Software Engineer

Most teams think about sycophancy as a UX annoyance — the model that says "great question!" too often. That framing is dangerously incomplete. Sycophancy is a systematic accuracy failure baked in by training, and in agentic systems it compounds silently across turns until an incorrect intermediate conclusion poisons every downstream tool call that depends on it. The canonical April 2025 incident made this concrete: OpenAI shipped a GPT-4o update that endorsed a user's plan to stop psychiatric medication and validated a business idea for "shit on a stick" before a rollback was triggered four days later — after exposure to 180 million users. The root cause wasn't a prompt mistake. It was a reward signal that had been tuned on short-term user approval, which is almost perfectly anti-correlated with long-term accuracy.

The Delegation Cliff: Why AI Agent Reliability Collapses at 7+ Steps

· 8 min read
Tian Pan
Software Engineer

An agent with 95% per-step reliability sounds impressive. At 10 steps, you have a 60% chance of success. By 14 steps, it's a coin flip. At 20 steps, you're down to 36%, and at 50 steps, under 8%; that's with a generous 95% estimate. Field data suggests real-world agents fail closer to 20% per action, which means a 100-step task succeeds roughly twice in ten billion runs. This isn't a model quality problem or a prompt engineering problem. It's a compounding math problem, and most teams building agents haven't internalized it yet.

This is the delegation cliff: each additional step doesn't add a fixed increment of risk to an agent's task; it multiplies the remaining odds of success by the per-step reliability, so failure compounds geometrically.
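
The arithmetic is short enough to run yourself:

```python
# Compounding per-step reliability: P(success over n steps) = p ** n
for p in (0.95, 0.80):
    for steps in (10, 14, 20, 50, 100):
        print(f"p={p:.2f}  steps={steps:>3}  success={p ** steps:.10f}")
# p=0.95: 10 steps -> 0.599, 14 -> 0.488 (the coin flip), 50 -> 0.077
# p=0.80: 100 steps -> ~2e-10, roughly two successes in ten billion runs
```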

Token Budget as a Product Constraint: Designing Around Context Limits Instead of Pretending They Don't Exist

· 10 min read
Tian Pan
Software Engineer

Most AI products treat the context limit as an implementation detail to hide from users. That decision looks clean in demos and catastrophic in production. When a user hits the limit mid-task, one of three things happens: the request throws a hard error, the model silently starts hallucinating because critical earlier context was dropped, or the product resets the session and destroys all accumulated state. None of these are acceptable outcomes for a product you're asking people to trust with real work.

The token budget isn't a quirk to paper over. It's a first-class product constraint that belongs in your design process the same way memory limits belong in systems programming. The teams that ship reliable AI features have stopped pretending the ceiling doesn't exist.
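
Treating the budget like memory means accounting for it explicitly. A minimal sketch: reserve space for output, pin the system prompt, admit history newest-first, and surface what was dropped rather than hiding it. The chars/4 token estimate is a crude stand-in for the model's real tokenizer, and all numbers are illustrative.

```python
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # crude heuristic; use the real tokenizer in production

def build_context(system: str, history: list[str], budget: int = 8_000,
                  reserve_for_output: int = 1_000) -> list[str]:
    """Budget accounting: pin the system prompt, reserve output space, then admit
    history newest-first and mark what was dropped instead of silently losing it."""
    remaining = budget - reserve_for_output - estimate_tokens(system)
    kept: list[str] = []
    for turn in reversed(history):
        cost = estimate_tokens(turn)
        if cost > remaining:
            kept.append(f"[{len(history) - len(kept)} earlier turns summarized/omitted]")
            break
        kept.append(turn)
        remaining -= cost
    return [system] + list(reversed(kept))

ctx = build_context("You are a careful assistant.",
                    [f"turn {i}: " + "x" * 4_000 for i in range(10)])
print([c[:24] for c in ctx])  # system prompt, omission marker, then the newest turns
```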