Skip to main content

678 posts tagged with "ai-engineering"

View all tags

Prompt Injection Is Not Primarily an Attacker Problem

· 9 min read
Tian Pan
Software Engineer

Most teams defending against prompt injection picture an attacker: someone crafting a carefully engineered string to override an AI's instructions. That framing is wrong, and it's costing them. The harder version of this problem doesn't require attackers at all.

Every time your AI application ingests user-generated content — a product review, a support ticket, a document upload, a CRM note — it faces the same structural vulnerability. No malicious intent needed. The ordinary text that ordinary users produce for ordinary reasons can, at scale, behave identically to a deliberate injection. If your application is only defended against the adversarial case, you're defended against the minority case.

The Prompt Surface Area Problem: Why Adding a Tool Is Never Just Adding a Tool

· 10 min read
Tian Pan
Software Engineer

Every engineer who has shipped an LLM-powered agent has been tempted by a simple mental model: a tool is a function. Adding a tool means the agent can do one more thing. The cost is a few lines of documentation in the system prompt, maybe a schema definition, maybe one new entry in a tool registry. It feels additive — linear.

It isn't. Each new tool doesn't expand what the agent can do in isolation; it expands what the agent can do in combination with every tool already there. That distinction is the source of a class of production failures that no amount of prompt tweaking can fix after the fact, because the problem is architectural. The prompt surface area problem is real, it compounds quickly, and most teams don't see it until they're already deep in it.

Provenance Debt in AI Knowledge Bases: When Your RAG System Learns From Itself

· 8 min read
Tian Pan
Software Engineer

Your RAG system is probably indexing its own outputs. You just don't know it yet.

It starts innocuously: someone adds a quarterly summary document to the knowledge base. That summary was written by the same LLM that queries the knowledge base. Six months later, a developer adds AI-generated release notes. Then auto-generated support FAQs. Then a synthesized onboarding guide. None of these documents are labeled as AI-generated. To the retrieval system, they look identical to human-written primary sources. Now when your model retrieves context to answer a question, a significant portion of that context is the compressed, possibly-distorted output of a prior model run — and your accuracy metrics are still green.

This is provenance debt: the accumulation of AI-generated content in retrieval corpora without source markers, creating a feedback loop where each generation of model outputs becomes raw material for the next.

Quota Starvation: When Your AI Features Eat Each Other's Rate Limits

· 11 min read
Tian Pan
Software Engineer

At 2 AM, a scheduled report-generation job spins up fifty parallel LLM requests against your shared API key. By the time the 9 AM product demo starts, every real-time chat completion is silently timing out. Your error dashboards are green. No 429s in the logs. The model is returning responses — just ten seconds late, on a feature with a two-second SLA.

This is quota starvation. It does not look like an outage. It looks like the AI is "slow today."

Your Refusal Logs Are a Product Backlog in Disguise

· 9 min read
Tian Pan
Software Engineer

Every AI product team has a security dashboard somewhere showing refused requests. Filters triggered, jailbreaks blocked, policy violations caught. The operational teams look at it to make sure the guardrails are holding. Nobody else looks at it at all.

That's a mistake. The requests your AI refuses are the most concentrated, honest user research signal you have access to. A user who tries three different phrasings to get your product to do something it won't do is telling you, with extraordinary clarity, exactly what they want and can't have. Treating that signal as a security artifact rather than a product artifact is leaving the richest feedback you'll ever collect on the floor.

The Retrograde Accuracy Problem: Why AI Features Degrade as Your Product Grows

· 10 min read
Tian Pan
Software Engineer

Your AI feature ships clean. Accuracy on the eval set: 91%. Latency: acceptable. The team is proud. Six months later, users are complaining that the feature feels "dumb," support tickets are climbing, and your aggregate metrics are quietly 8% worse than launch day. Nobody changed the model. The underlying data pipeline is intact. What happened?

This is the retrograde accuracy problem. As your product grows — new features, new user segments, new edge cases, new flows — the input distribution your AI sees in production quietly drifts away from the distribution it was trained on. No model update. No data pipeline failure. The product itself outgrew the model.

Your Eval Harness Is a Museum: How Production Failures Should Write Tomorrow's Tests

· 9 min read
Tian Pan
Software Engineer

Most AI teams build their eval suite once — carefully, thoughtfully, during the sprint before launch. They write cases for the edge scenarios they can imagine, document the expected outputs, get sign-off, and ship. Six months later, the suite still passes. The model has quietly gotten worse on the actual traffic hitting production, but the eval harness was authored before any of that traffic existed. It's still grading the answers to questions the author asked, not the questions users are asking.

That's the museum problem: an eval suite curated at one point in time accumulates relics. It proves the system handles the cases someone anticipated, not the cases that actually break it.

The SLA Illusion: Why 99.9% Uptime Means Nothing for AI-Powered Features

· 9 min read
Tian Pan
Software Engineer

Your dashboards are green. Latency is nominal. Error rate is 0.2%. Uptime is 99.97% for the month. And your AI assistant is confidently telling users the wrong thing, in the wrong format, at twice the expected length — and has been doing so for eleven days.

This is the SLA illusion: the infrastructure contract that covers the pipe, not the water flowing through it. For AI-powered features, the gap between "is it responding?" and "is it responding well?" is the gap where product quality quietly dies.

The First Token Lies: Why Context Loading—Not Inference—Controls Your AI Feature's Latency

· 9 min read
Tian Pan
Software Engineer

Most AI latency conversations focus on the wrong thing. Teams obsess over GPU utilization, model quantization, and batch sizes. Meanwhile, the latency that actually annoys users—the pause before the AI says anything at all—is determined almost entirely by what happens before inference starts. The bottleneck is context, not compute.

Time-to-first-token (TTFT) is the metric that determines whether your AI feature feels responsive or sluggish. And TTFT is dominated by the prefill phase: the time it takes to process the full input context before a single output token is generated. On a 128K-token context, prefill can take seconds. The GPU is working hard, but the user sees nothing.

The solution isn't a better GPU. It's pre-loading the context before the user asks anything.

When LLMs Grade Their Own Homework: The Feedback Loops Breaking AI Evaluation

· 10 min read
Tian Pan
Software Engineer

Here is a finding most AI teams don't want to sit with: in a large-scale study that generated over 150,000 evaluation instances across 22 tasks, roughly 40% of LLM-as-judge comparisons showed measurable bias. That bias wasn't random noise—it was systematic, reproducible, and correlated with how models were trained. When you use a model to generate your eval set and then use the same model (or a close relative) to grade it, you're not measuring quality. You're measuring how well a system agrees with itself.

Synthetic eval data has become standard practice for good reasons. Human annotation is slow, expensive, and hard to scale. LLM-generated test cases let teams spin up thousands of examples overnight. The problem surfaces when the generator and the judge share a common ancestor—which, in 2025, is almost always the case. The result is an eval pipeline that confidently reports high scores while hiding the exact failure modes you built it to catch.

Tool Call Convergence: Designing Agents That Know When to Stop

· 10 min read
Tian Pan
Software Engineer

A LangChain analyzer/verifier agent pair ran for 264 hours straight and racked up $47,000 in API costs. It produced nothing useful. The verifier kept rejecting the analyzer's output without saying what was wrong. The analyzer defaulted to trying again. No one had written a stopping criterion. The loop ran until someone noticed the invoice.

This is the failure mode that doesn't make it into architecture diagrams: agents that know how to call tools but don't know when to stop. The canonical agent loop is a while True that asks the model "should I call a tool?" — but that question has no built-in answer for "I've seen enough." Without convergence logic, you're not building an agent. You're building an expensive polling function.

AI Writes Code in Seconds. Your Team Reviews It for Hours. The Math Isn't Working.

· 8 min read
Tian Pan
Software Engineer

The ROI pitch for AI coding tools is irresistible on paper: developers complete tasks 55% faster in controlled experiments, ship 98% more pull requests, and report saving 3.6 hours per week. But when organizations look at their actual delivery metrics — bug rates, release cycle times, incident frequency — the numbers barely move. Something is absorbing all those gained hours, and it's not hard to find.

AI generates code in seconds. Engineers still review it at the same pace they always have.