Skip to main content

702 posts tagged with "llm"

View all tags

Structured Concurrency for Parallel Tool Fanout: Who Owns Partial Failure?

· 11 min read
Tian Pan
Software Engineer

The moment your agent fans out five parallel tool calls — search across three indexes, query two databases, hit one external API — you have crossed an invisible line. You are no longer writing prompt-and-response code. You are writing a concurrent program. Most agent frameworks pretend you are not, and the bill arrives at 2 AM.

The pretense is comfortable. The planner emits a list of tool calls, the runtime fires them off, the runtime collects whatever comes back, the planner consumes the aggregate. From a thousand feet up it looks like a fan-out / fan-in pipeline, and most teams treat it that way until production teaches them otherwise. The problem is that twenty years of concurrent-programming research — partial-failure semantics, structured cancellation, backpressure, deterministic error attribution — already solved the failure modes you are about to rediscover. Your agent framework, by default, did not import any of it.

System Prompts as Code, Config, or Data: The Architecture Decision That Cascades Into Everything

· 12 min read
Tian Pan
Software Engineer

A team I talked to last quarter shipped a customer-support agent with the system prompt living in a Postgres row, one row per tenant. The pitch was sensible: enterprise customers had asked for tone customization, and "make the prompt editable" was the cheapest way to deliver it. Six months later, three things had happened. The eval suite had ballooned from 200 cases to 11,000 because every tenant's prompt now needed its own regression set. The prompt-update workflow had quietly become a write path with no review, because product owners had been given direct access to the table. And a single broken UTF-8 character in a Korean-language tenant prompt had taken that tenant's chatbot offline for two days before anyone noticed, because the deploy pipeline had no idea the prompt had changed.

None of these outcomes were forced by the requirements. They were forced by an architecture decision that nobody made deliberately: where does the system prompt live? In the code? In a config file? In a database row? The team picked "database" because it was the fastest path to a feature, and the consequences cascaded into every adjacent system over the following months.

Tokenizer Churn: The Silent Breaking Change Inside Your 'Compatible' Model Upgrade

· 11 min read
Tian Pan
Software Engineer

The vendor said the upgrade was a drop-in replacement. The API contract held. The model name in your config barely changed. A week later, your context-window guard starts triggering on prompts it never tripped on before, your stop-sequence regex matches in the wrong place, and one of your few-shot examples started producing a confidently wrong answer that your eval suite happens not to cover. Nobody touched the prompt. Nobody touched the temperature. Somebody quietly retrained the tokenizer.

Tokenizer changes are the breaking change vendors don't call breaking. The API surface is byte-stable, the SDK didn't bump a major version, and the release notes mention "improved instruction following" — but the function from your input string to the integer sequence the model actually sees has been replaced. Every assumption your code made about how text becomes tokens is now subtly wrong. The cost of that invisibility is two weeks of "the model feels different" before someone re-runs a canonical prompt through count_tokens and finds the answer.

Abstain or Escalate: The Two-Threshold Problem in Confidence-Gated AI

· 13 min read
Tian Pan
Software Engineer

Most production AI features ship with a single confidence threshold. Above the line, the model answers. Below it, the user gets a flat "I'm not sure." That single number is doing two completely different jobs at once, and it's why your trust metric has been sliding for two quarters even though your accuracy on answered queries looks fine.

The right design has at least two cutoffs. An abstain threshold sits low: below it, the model declines because no answer is worth more than silence. An escalate threshold sits in the middle: between the two cutoffs, the system hands the case to a human reviewer instead of dropping it on the floor. Collapse them into a single dial and you ship a product that feels equally useless when it's wrong and when it's uncertain — which is the worst possible position to occupy in a market where users have a free alternative one tab away.

This isn't a new idea. The reject-option classifier literature has been arguing for split thresholds since the 1970s, distinguishing ambiguity rejects (the input is between known classes) from distance rejects (the input is far from any training data). Production AI teams keep rediscovering the same lesson the hard way, usually about six months after their first launch, when the support queue is full of people typing "is this thing broken or what."

The Vendor-Portability Tax: Why 'We Can Swap Models' Is a Quarterly Cost Line, Not a Checkbox

· 11 min read
Tian Pan
Software Engineer

Every team I have audited in the last six months claims to be vendor-agnostic. None of them are. The system prompt that scored highest on the eval suite did so because it leaned into a single vendor's tokenizer behavior, JSON-mode contract, refusal cadence, and stop-sequence handling — and the team that wrote it could not name which of those biases were doing the work. When the CFO asks why the cheaper model on the procurement deck cannot just be dropped in, the honest answer is two engineer-quarters of prompt re-tuning and a complete re-baseline of every eval. That is not a checkbox. It is a quarterly cost line.

The mental model that keeps biting teams is treating vendor portability as a one-time architecture decision. You add an abstraction layer, you write a model: field in your config, you congratulate yourself, and you move on. Then a year later the vendor raises prices, ships a deprecation notice, or has a bad week of refusals on a category you care about, and you discover that the abstraction was a thin wrapper around a prompt that only works on one model. The portability you bought was syntactic. The portability you needed was behavioral, and behavioral portability decays the moment you stop paying for it.

Your Model Update Is a Breaking Change: The Behavioral Changelog You Owe Your Integrators

· 12 min read
Tian Pan
Software Engineer

A vendor pushes a "minor refresh" to a model alias on a Tuesday afternoon. By Thursday, four customer companies are running incident response. None of them deployed code that week. None of their dashboards show a regression in latency, error rate, or any other infra-shaped metric. What changed is that the model behind their pinned alias started returning slightly different sentences, slightly different JSON, and slightly different refusals — and every prompt their team wrote against the old behavior is now a contract that nobody honored.

The asymmetry is the entire story. The provider treated the rollout as a deploy: tested internally, gated on a few aggregate evals, ramped to 100% within a maintenance window. The consumer surface received it as a semver violation: a dependency upgraded itself in production without changing its version string, and the bug reports started rolling in from end users with the cheerful subject line "nothing changed on our side."

The Inference Budget Committee: Governance When Token Spend Crosses Seven Figures

· 12 min read
Tian Pan
Software Engineer

At $50,000 a month, the "compute + tokens" line on your infra bill is rounding error. At $5,000,000 a month, it is a CFO question. The transition between those two states is not gradual — it is a phase change in how an organization talks about model spend, and most engineering orgs are unprepared for the social and political work that follows. The bill stays a single line; the conversation around it does not.

What changes is who has standing to ask "why." When three product teams share one API key and one capacity reservation, every quota argument has the same structure: someone is currently winning at the expense of someone else, and there is no neutral party to call it. The first time a team's launch is throttled because another team shipped a chatty agent, the absence of a governance body is felt by the entire engineering org at once. Calling a meeting and inventing a process under pressure is the worst time to design one.

The Local-Maximum Trap in Prompt Iteration: How to Tell You're Tweaking the Wrong Thing

· 10 min read
Tian Pan
Software Engineer

There is a moment, six weeks into a serious LLM project, where the prompt iteration log starts to look like a therapy journal. Each tweak swaps one failure mode for another. Add a stricter "do not" clause and the model becomes evasive on cases it used to handle. Soften the tone and a different category of hallucination returns. The eval scoreboard hovers in a band three or four points wide, refusing to break out. Someone says, "let me try one more reordering," and another half day evaporates.

This is the local-maximum trap. The team is climbing a hill, but the hill does not go higher. The cruel part is that the hill is real — every prompt change does produce a measurable delta on some subset of cases, which is exactly the signal that keeps everyone tweaking. What's missing is the recognition that the ceiling above is not a prompt ceiling at all.

Sovereignty Collapse: Logging Where Your Prompt Actually Went

· 9 min read
Tian Pan
Software Engineer

A regulator asks a simple question. "For this specific user prompt, submitted at 14:32 UTC last Tuesday, prove which jurisdictions the request and its derived state passed through."

Your application logs say model=claude-sonnet-4-5, region=eu-west-1, latency=2.1s. Your gateway logs say the same. Your provider's invoice confirms the request happened. None of these answer the question. The request entered an EU-hosted gateway, was forwarded to a US-region primary endpoint that failed over to Singapore during a regional incident, and warmed a KV cache on a third-party GPU pool whose residency claims live in a vendor footnote. The audit trail you needed lives at a layer your team does not own.

This is sovereignty collapse: the gap between what your contracts promise about data location and what your runtime can actually prove after the fact. The compliance claim is only as strong as the weakest log line in the chain.

The Query Rewriting Layer Your RAG Pipeline Skipped

· 10 min read
Tian Pan
Software Engineer

When a RAG system answers wrong, the first instinct on most teams is to blame the encoder. Swap to a bigger embedding model. Try a domain-tuned one. Bump the dimension count. Three sprints later the recall curve has nudged a few points and the user complaints look the same.

The diagnosis was wrong. Most retrieval failures aren't embedding failures. They're query-shape failures — and no amount of vector tuning fixes a vocabulary mismatch that exists before the encoder ever runs.

A user types "how do I cancel." The relevant document is titled "Subscription Lifecycle Management" and uses words like "termination," "billing cycle close," and "service deactivation." There is no encoder in the world that pulls those two strings into the same neighborhood by lexical luck. The cosine similarity gap is real, and it lives in the input, not the model. The query rewriting layer that goes ahead of retrieval is the thing most pipelines skip and then spend a quarter trying to compensate for downstream.

Trace Sampling for Agents: Which of 10 Million Daily Spans Are Worth Keeping

· 11 min read
Tian Pan
Software Engineer

A web service request produces five spans on a busy day. A modern agent session produces fifty, sometimes a thousand if the planner decides to recurse. The uniform 1% sampler your platform team copy-pasted from the microservices era will, by definition, drop the rare failure you actually care about — because the failure is rare, and uniform sampling has no opinion about rarity.

The honest version of "we have full observability on our agents" sounds different than the marketing version. It sounds like: we keep the traces that matter, drop the ones that don't, and we know in advance which is which. Every word in that sentence is load-bearing, and the platform teams that ignored sampling design until the bill arrived are now learning the discipline backwards — under cost pressure, after a quarter of incidents that were "in the data" but evicted before anyone looked.

The Deadlock Your Agent Can't See: Circular Tool Dependencies in Generated Plans

· 11 min read
Tian Pan
Software Engineer

A planner agent emits seven steps. Each looks reasonable. The orchestrator dispatches them, the first three return values, the fourth waits on the fifth, the fifth waits on the seventh, and the seventh — buried three lines deep in the planner's prose — quietly waits on the fourth. Nothing is locked. No EDEADLK ever fires. The agent burns 40,000 tokens reasoning about why the fourth step "is taking longer than expected" and ultimately gives up with a soft, plausible apology to the user.

This is the deadlock your agent can't see. It is not the textbook deadlock from operating systems class — there are no mutexes, no resource graphs the kernel can introspect, no holders or waiters anyone in your stack would recognize. The dependencies live inside English sentences that the planner produced, the cycles form in latent semantics rather than in any data structure, and the failure mode looks indistinguishable from "the model is thinking hard." Classic deadlock detection is useless here, but the cost is identical: the workflow halts, tokens evaporate, and your trace tells you nothing.