Skip to main content

720 posts tagged with "llm"

View all tags

Hidden SDK Retries: Why You're Paying Twice and Don't Know It

· 10 min read
Tian Pan
Software Engineer

Open the OpenAI Python SDK source and you will find a quiet line: DEFAULT_MAX_RETRIES = 2. The Anthropic SDK ships the same default. Most TypeScript SDKs match. Two retries, exponential backoff, automatic on connection errors, 408, 409, 429, and any 5xx — fired before your code ever sees the failure. You do not configure this. You do not opt in. You usually do not know it is happening, because the metric your app records is request_count, not attempt_count, and the only span your tracer ever sees is the outer one the SDK closes after the final attempt.

This is fine, mostly, until it is not. Add an application-level retry decorator on top of that SDK call — the kind every team writes after their first 429 — and you have built a 3×3 storm: the SDK tries three times, your wrapper tries three times around the SDK, and a single user request fans out to nine inference calls during a provider degradation. The provider's bill counts every attempt. Your dashboards count one. The reconciliation, when someone finally runs it, is a quarter-end conversation nobody enjoys.

Negative Prompts Are Code Smells: Why Every 'Don't' in Your System Prompt Is Technical Debt

· 10 min read
Tian Pan
Software Engineer

Open the system prompt of any production AI feature that has been live for more than three months. Count the negative clauses — the "do not," "never say," "avoid," "under no circumstances," "you must not." If the count is in the double digits, you are not looking at a system prompt. You are looking at a graveyard. Each tombstone marks a specific user complaint, a specific incident report, a specific Slack message from a stakeholder who saw the model do something embarrassing. The team patched the surface and moved on, and now the prompt reads like a legal disclaimer with a personality grafted onto the front.

Negative prompts are code smells. Not in the metaphorical sense — in the literal one. They are the prompt-engineering equivalent of a try/except block that swallows an exception, a config flag with no documentation, a // TODO: refactor this from 2022. They work, kind of, until they don't. And the failure mode they hide is almost always more interesting than the failure they were added to suppress.

The Policy File: Why Your Refusal Rules Don't Belong in Your System Prompt

· 11 min read
Tian Pan
Software Engineer

A safety reviewer at a fintech startup pushed a four-line addition to the system prompt last quarter. The change: a refusal rule preventing the assistant from giving specific tax advice for a jurisdiction the company didn't have a license to operate in. Reasonable, narrow, audit-clean. The rule landed on Tuesday. By Friday the eval suite was showing a 7-point drop on a customer-onboarding flow that had nothing to do with tax — the model had started hedging on every question that mentioned a country, including "what currency does this account hold." The product team backed out the change. The safety team re-shipped it the following week with slightly different wording. Three weeks later, the same regression appeared in a different shape, and the next safety edit broke a different unrelated flow.

The bug here isn't the wording. The bug is that the refusal rule is in the wrong place. It's wedged inside a 2,400-token artifact that also contains the assistant's conversational voice, its formatting contract, its task instructions, and a half-dozen other policy clauses — and every edit to any of those concerns is a behavioral edit to all of them, because the model can't tell which sentence is policy and which is style. Production system prompts grow into a tangled monolith because three orthogonal concerns are pretending to be one. The teams who haven't factored them out are paying the integration tax on every edit.

Prompt Edits Aren't Wording Changes: A Code Review Discipline for Prompts as Software

· 11 min read
Tian Pan
Software Engineer

A six-line system prompt edit lands in a pull request on Tuesday afternoon. The diff is in plain English. Two reviewers eyeball the new wording, agree it reads more naturally, hit approve. The PR merges in under a minute. By Friday, support is fielding tickets about an agent that suddenly refuses to summarize documents over a certain length, won't quote sources, and inexplicably starts every reply with "Certainly!" — a behavior nobody asked for and the diff didn't predict.

This is what happens when a team that has spent a decade learning to review code regresses to first-week behavior the moment the artifact is a prompt. The diff looks harmless because it reads like English, and English is what humans review with their eyes. The discipline that makes code review work — running the tests, examining the blast radius, treating "small changes" with appropriate skepticism — quietly does not transfer. The wording got better; the behavior got worse; nobody noticed until users did.

The Freshness-Relevance Tradeoff in RAG: Why You Can't Optimize Both at Query Time

· 11 min read
Tian Pan
Software Engineer

A user asks your assistant what the company's parental leave policy is. The bot returns 12 weeks, with a citation. The cited document was the right answer in 2023; HR posted an update last quarter that took it to 16. Both versions are in your knowledge base. Cosine similarity scored the 2023 version 0.87 and the 2024 version 0.84, because the older page has the cleaner phrasing and fewer hedges. The fresher document loses by three percentage points and the user gets a wrong answer that looks audited.

This is the freshness-relevance tradeoff, and the uncomfortable part is that it has no clean solution at query time. If you weight recency, you bias retrieval toward whatever was edited yesterday — which in most knowledge bases is the noisy, high-churn surface area that should not be the source of truth. If you don't weight recency, you ship answers grounded in documents that were superseded months ago. There is no single global knob that gets both right, and most teams discover this only after a few embarrassing answers leak past their eval suite.

Retrieval Cascade Failure: How Document Deletion Poisons Your RAG Pipeline

· 9 min read
Tian Pan
Software Engineer

A user asks your support bot when the refund window closes. The bot answers "60 days" with cheerful confidence and a citation. The policy page that says "60 days" was deleted from the CMS three months ago. The new policy is 14. Nobody on your team knows the bot is wrong until a customer escalates.

This is a retrieval cascade failure: the document is gone from the source of truth, but its embedding is still in the index, still ranking high on cosine similarity, still feeding the model a ghost. RAG pipelines treat embedding indexes as caches of source content, but most teams build the cache without building the invalidation. Inserts get all the engineering attention. Deletes get a TODO comment.

The Show Your Work UX Trap: When the Reasoning Trace Is Debug Output Wearing a Product Costume

· 11 min read
Tian Pan
Software Engineer

A reasoning model emits a chain-of-thought trace because that is how it computes. A product team renders that trace in the UI because hiding it feels like throwing away tokens the user paid for. Those are two different decisions, and almost nobody on the product side notices they made the second one. The trace becomes a panel, the panel becomes a feature, the feature gets a docs page, and six months later someone in a quarterly review asks why the support queue is full of users arguing with the reasoning instead of the answer.

The trace is debug output. It exists for engineers who need to know why the model picked one tool, hedged on a date, or quietly switched personas mid-paragraph. Pushing it to the end user without a design pass is the AI-product equivalent of leaving console.log calls in production and calling them "transparency." It looks like a feature, it costs almost nothing to render, and it quietly degrades trust in ways that don't show up in any of the dashboards the team built.

Smaller Model, Bigger Bill: Why Cheaper-Per-Token Often Costs More

· 8 min read
Tian Pan
Software Engineer

A finance-led mandate to "switch to the smaller model" is one of the most reliable ways to raise your LLM bill quarter-over-quarter. The dashboard the procurement team is watching — cost per call, average tokens per request — keeps trending down. Meanwhile the invoice keeps trending up. By the time someone reconciles the two, six months of prompt iteration has been spent compensating for a model that's worse at the task, and the team is in too deep to walk it back without admitting the original switch was a mistake.

The mistake isn't about pricing. It's about the unit. Per-token price is a misleading axis when reasoning depth, retry count, and prompt size all vary by model. The right metric is tokens-per-successful-completion, and on that axis the cheaper model often loses.

The Stop-Sequence Footgun: When User Input Collides With Your Delimiter

· 10 min read
Tian Pan
Software Engineer

A user pastes a chunk of markdown into your support agent. The first heading in their paste is ### Steps I tried. Your prompt template uses ### as a stop sequence. The model dutifully reads the user's input, starts to answer, generates ### as part of an organized response — and the API hands back two confident sentences followed by silence. The ticket lands in your queue as "model quality regression." It is not. The fix is one line in the gateway.

Stop sequences are the most quietly load-bearing knob in a production LLM stack. They were chosen the week the prompt was first written, when the inputs were clean engineering examples and nobody had pasted a JIRA ticket dump yet. Twelve months later, the user-content distribution has drifted miles past what the prompt author imagined, and the sentinel that was once a clean delimiter is now an ambient hazard sitting in the middle of one user paste in three hundred. Nothing alerted. The eval suite still passes. The CSAT chart sags by half a point on the affected slice and stays there.

This is not a model problem. It is an input-contract problem masquerading as one, and it has the shape of a classic distributed-systems bug: a delimiter chosen for one party's content distribution is being enforced against a different party's content distribution, with no monitoring on the boundary.

Streaming Structured Output: Why Your Parser Hangs on Token 47

· 11 min read
Tian Pan
Software Engineer

The first time a team builds a streaming AI feature with structured output, the bug is always the same. The model is generating fine. The chunks are arriving fine. But somewhere around token 47, the parser hangs, the UI freezes, or — worse — a half-formed enum value gets routed to a downstream tool that quietly does the wrong thing. The team adds a try/catch around JSON.parse, considers themselves done, and ships. Two weeks later, a sibling team complains that the streaming UI feels janky after the response gets long. A quarter later, an incident review asks why a "Delete" tool call fired on a record that the model was still describing as "DeleteIfEmpty."

The bug is not in any single token. The bug is that token-streaming and structured output are architecturally at odds, and most frameworks paper over the conflict with prayer. A schema says "this is a complete object." A token stream says "here are the bytes one at a time." Every intermediate state between those two endpoints is, by definition, invalid against the schema. The team's job is to decide what to do during those intermediate states — and most teams have not made that decision explicitly.

The Summary Tax: When Compaction Eats More Tokens Than It Saves

· 10 min read
Tian Pan
Software Engineer

A long-running agent crosses its compaction threshold every twelve turns. Each pass costs an LLM call sized to the running window — first 8K tokens, then 14K, then 22K — because the span being summarized grows with every trigger. By turn sixty, the user has spent more tokens watching the agent re-summarize itself than they spent on the actual reasoning that mattered. The cost dashboard reads "user inference cost" as a single number, blissfully unaware that half of it paid for compression of context the user will never look at again.

This is the summary tax: a class of overhead that scales with conversation length, fires invisibly between user turns, and shows up as a single line item that conflates the work the user paid for with the bookkeeping the system did to manage itself. It is the closest thing modern agent architectures have to garbage-collection pause time — and most teams are running production with -verbose:gc turned off.

The Attack Vector You Ship With Every Open RAG System

· 9 min read
Tian Pan
Software Engineer

Five carefully crafted documents. A corpus of 2.6 million. A 97% success rate at manipulating specific AI responses. That's the benchmark result from PoisonedRAG, presented at USENIX Security 2025 — and the attack didn't require model access, prompt injection at inference time, or any direct interaction with the system at all. The attacker simply contributed content to the knowledge base.

If your RAG system lets users add content — helpdesk tickets, wiki edits, customer feedback, shared notes — you've already shipped the attack vector. The question is whether you've also shipped the defenses.