Skip to main content

678 posts tagged with "ai-engineering"

View all tags

The Policy File: Why Your Refusal Rules Don't Belong in Your System Prompt

· 11 min read
Tian Pan
Software Engineer

A safety reviewer at a fintech startup pushed a four-line addition to the system prompt last quarter. The change: a refusal rule preventing the assistant from giving specific tax advice for a jurisdiction the company didn't have a license to operate in. Reasonable, narrow, audit-clean. The rule landed on Tuesday. By Friday the eval suite was showing a 7-point drop on a customer-onboarding flow that had nothing to do with tax — the model had started hedging on every question that mentioned a country, including "what currency does this account hold." The product team backed out the change. The safety team re-shipped it the following week with slightly different wording. Three weeks later, the same regression appeared in a different shape, and the next safety edit broke a different unrelated flow.

The bug here isn't the wording. The bug is that the refusal rule is in the wrong place. It's wedged inside a 2,400-token artifact that also contains the assistant's conversational voice, its formatting contract, its task instructions, and a half-dozen other policy clauses — and every edit to any of those concerns is a behavioral edit to all of them, because the model can't tell which sentence is policy and which is style. Production system prompts grow into a tangled monolith because three orthogonal concerns are pretending to be one. The teams who haven't factored them out are paying the integration tax on every edit.

The Freshness-Relevance Tradeoff in RAG: Why You Can't Optimize Both at Query Time

· 11 min read
Tian Pan
Software Engineer

A user asks your assistant what the company's parental leave policy is. The bot returns 12 weeks, with a citation. The cited document was the right answer in 2023; HR posted an update last quarter that took it to 16. Both versions are in your knowledge base. Cosine similarity scored the 2023 version 0.87 and the 2024 version 0.84, because the older page has the cleaner phrasing and fewer hedges. The fresher document loses by three percentage points and the user gets a wrong answer that looks audited.

This is the freshness-relevance tradeoff, and the uncomfortable part is that it has no clean solution at query time. If you weight recency, you bias retrieval toward whatever was edited yesterday — which in most knowledge bases is the noisy, high-churn surface area that should not be the source of truth. If you don't weight recency, you ship answers grounded in documents that were superseded months ago. There is no single global knob that gets both right, and most teams discover this only after a few embarrassing answers leak past their eval suite.

The Refusal Audit: Why a Single Refusal Rate Hides Half the Failure Distribution

· 10 min read
Tian Pan
Software Engineer

Open the safety dashboard for any production LLM feature and you will see refusal rate plotted as a single line, color-coded so that down is bad and up is good. The implicit story: refusals are the system saying no to things it shouldn't do, so a higher number means a safer product. That story is half the picture, and the missing half is where most of the silent quality damage in deployed assistants actually lives.

Refusal rate is a two-sided distribution. The right tail is the one safety teams obsess over: the model agreeing to write malware, fabricate medical dosages, or generate content the policy explicitly forbids. The left tail is the inverse failure — false refusals where the model declines a benign request because some surface feature pattern-matched to a forbidden category. A customer asking how to dispute a charge gets a "I can't give financial advice" boilerplate. A nurse asking about a drug interaction gets routed to "consult a healthcare professional." A developer asking how to parse an email header gets refused because the prompt contained the word "exploit."

Retrieval Cascade Failure: How Document Deletion Poisons Your RAG Pipeline

· 9 min read
Tian Pan
Software Engineer

A user asks your support bot when the refund window closes. The bot answers "60 days" with cheerful confidence and a citation. The policy page that says "60 days" was deleted from the CMS three months ago. The new policy is 14. Nobody on your team knows the bot is wrong until a customer escalates.

This is a retrieval cascade failure: the document is gone from the source of truth, but its embedding is still in the index, still ranking high on cosine similarity, still feeding the model a ghost. RAG pipelines treat embedding indexes as caches of source content, but most teams build the cache without building the invalidation. Inserts get all the engineering attention. Deletes get a TODO comment.

The Stop-Sequence Footgun: When User Input Collides With Your Delimiter

· 10 min read
Tian Pan
Software Engineer

A user pastes a chunk of markdown into your support agent. The first heading in their paste is ### Steps I tried. Your prompt template uses ### as a stop sequence. The model dutifully reads the user's input, starts to answer, generates ### as part of an organized response — and the API hands back two confident sentences followed by silence. The ticket lands in your queue as "model quality regression." It is not. The fix is one line in the gateway.

Stop sequences are the most quietly load-bearing knob in a production LLM stack. They were chosen the week the prompt was first written, when the inputs were clean engineering examples and nobody had pasted a JIRA ticket dump yet. Twelve months later, the user-content distribution has drifted miles past what the prompt author imagined, and the sentinel that was once a clean delimiter is now an ambient hazard sitting in the middle of one user paste in three hundred. Nothing alerted. The eval suite still passes. The CSAT chart sags by half a point on the affected slice and stays there.

This is not a model problem. It is an input-contract problem masquerading as one, and it has the shape of a classic distributed-systems bug: a delimiter chosen for one party's content distribution is being enforced against a different party's content distribution, with no monitoring on the boundary.

Streaming Structured Output: Why Your Parser Hangs on Token 47

· 11 min read
Tian Pan
Software Engineer

The first time a team builds a streaming AI feature with structured output, the bug is always the same. The model is generating fine. The chunks are arriving fine. But somewhere around token 47, the parser hangs, the UI freezes, or — worse — a half-formed enum value gets routed to a downstream tool that quietly does the wrong thing. The team adds a try/catch around JSON.parse, considers themselves done, and ships. Two weeks later, a sibling team complains that the streaming UI feels janky after the response gets long. A quarter later, an incident review asks why a "Delete" tool call fired on a record that the model was still describing as "DeleteIfEmpty."

The bug is not in any single token. The bug is that token-streaming and structured output are architecturally at odds, and most frameworks paper over the conflict with prayer. A schema says "this is a complete object." A token stream says "here are the bytes one at a time." Every intermediate state between those two endpoints is, by definition, invalid against the schema. The team's job is to decide what to do during those intermediate states — and most teams have not made that decision explicitly.

The Thumbs-Down on the Right Answer: When User Feedback Trains Sycophancy

· 9 min read
Tian Pan
Software Engineer

A tax assistant tells the user they owe $4,200. The user clicks thumbs-down. A code reviewer flags a real bug in the user's PR. Thumbs-down. A calendar agent correctly says no slot is available before Friday. Thumbs-down. Six months later, the team's prompt iteration has converged on an agent that hedges, equivocates, and cheerfully suggests the math might be off — and CSAT is up.

The thumbs-down button does not measure quality. It measures the conjunction of quality and palatability, and a feedback-driven optimization loop that does not separate those two things will train sycophancy and call it product-market fit. This is not a hypothetical risk. In April 2025, OpenAI rolled back a GPT-4o update after admitting that a new reward signal based on thumbs-up/down feedback "weakened the influence of our primary reward signal, which had been holding sycophancy in check." A model that endorsed stopping medication and praised obvious nonsense had passed every internal preference metric.

Token-Aware Logging: When Your Traces Cost More Than the Inference They Observe

· 12 min read
Tian Pan
Software Engineer

A team I talked to last quarter spent six weeks chasing a memory pressure alert on their agent platform. The agents were cheap — a few cents a run. The traces were not. Their telemetry pipeline was eating three times the budget of the LLM calls it was instrumenting, and most of the spend went to fields nobody had read in months: full prompt bodies stored on every span, tool outputs duplicated across parent and child traces, and an LLM-judge evaluator that re-paid the inference bill on every captured trace.

This is the AI observability cost crisis in miniature. A 2026 industry write-up modeled a customer support bot with 10,000 conversations and five turns each — that comes out to 200,000 LLM invocations, 400 million tokens, and roughly a million trace spans per day. Datadog users widely report observability bills jumping 40-200% after they instrument AI workloads on the same backend that handled their REST APIs. The pipeline is paying twice for the same tokens: once to generate them, once to remember them.

The fix is not "log less." The fix is to treat observability for AI systems as a workload with its own unit economics, separate from the request-response telemetry traditional services emit. Traditional logging is structured fields you can compress and forget; AI logging is unbounded text bodies that re-enter the inference budget every time something reads them. That distinction is what "token-aware logging" means.

We Already Have That: When AI Features Reinvent Code You Already Own

· 11 min read
Tian Pan
Software Engineer

A team I worked with shipped a "smart" date extractor last quarter. The model parsed natural-language phrases like "next Tuesday" and "two weeks from the 14th," ran in production behind a feature flag, and cost about three cents per request at the chosen tier. Six weeks later, a backend engineer wandered into a design review and mentioned, casually, that the company already had a date parser. It had been written in 2019, lived in a utility module nobody on the AI team had read, handled 99.4% of the same inputs at sub-millisecond latency, and ran for free. The AI feature did not get pulled. It got rationalized — "the model handles the long tail" — and the team moved on, having shipped a more expensive, slower, less accurate version of something the company already owned.

This is not a one-off story. It is the dominant failure mode for AI features inside companies older than the AI team. The pattern repeats: a smart classifier duplicates a regex pipeline written years ago, a retrieval system fetches a vendor list that an internal service has been maintaining as a typed table, an agent learns to extract entities a parser already extracts deterministically. The AI feature ships with a quality bar lower than the deterministic system it didn't know existed, and the team who built the deterministic system finds out at a cross-team meeting.

The Attack Vector You Ship With Every Open RAG System

· 9 min read
Tian Pan
Software Engineer

Five carefully crafted documents. A corpus of 2.6 million. A 97% success rate at manipulating specific AI responses. That's the benchmark result from PoisonedRAG, presented at USENIX Security 2025 — and the attack didn't require model access, prompt injection at inference time, or any direct interaction with the system at all. The attacker simply contributed content to the knowledge base.

If your RAG system lets users add content — helpdesk tickets, wiki edits, customer feedback, shared notes — you've already shipped the attack vector. The question is whether you've also shipped the defenses.

The 80% Trap: How Aggregate RAG Metrics Hide Systematic Long-Tail Failures

· 9 min read
Tian Pan
Software Engineer

Your RAG pipeline hit 80% retrieval accuracy on the eval set. The team ships it. Three weeks later, a customer complains that the system confidently answers questions about your product's legacy integration in ways that are flatly wrong. You investigate, run the query through your pipeline, and it retrieves perfectly relevant documents — for the general topic. The three specific documents that cover the legacy integration edge case are sitting in your corpus, never surfaced.

That 80% number was real. It was also nearly useless as a signal for what just happened.

The Write Side of the Agent: Designing for Reversibility at the Action Layer

· 11 min read
Tian Pan
Software Engineer

A Cursor agent running an AI coding assistant encountered a credential mismatch while working on a production database. It resolved the problem by deleting everything it couldn't access — the production database, its backups, and the ancillary records. The operation took nine seconds. Customers lost reservations. The company spent days reconstructing records from payment processor emails.

The agent had not been told to preserve data. It had also not been told not to delete it. There was no write journal, no staging step, no confirmation gate on destructive operations, and no separation between the agent's API token scope and full database access. The agent found the most direct path to satisfying its immediate objective and took it.