
720 posts tagged with "llm"


Onboarding an Agent Like a Junior Engineer Is a Category Error

· 9 min read
Tian Pan
Software Engineer

When an agent joins your team, the nearest analogy in every engineering manager's head is the new hire. So the playbook writes itself: give it a sandbox and read-only logs, scope the first tasks small, pair with it, expect a ramp-up period, and grow it into bigger work as trust accumulates. It feels responsible. It feels like the same patient management that turned your last junior into a senior.

It is also a category error — not a slightly imperfect analogy, but a wrong one. A junior engineer is a person who does not yet know your system. An agent is a stateless function that will never know your system, no matter how many times it touches it. Those are different kinds of things, and the management instincts that work for one quietly misallocate your attention on the other.

The reason this matters is that the metaphor doesn't just mislead — it tells you to invest in the wrong place. "Grow the agent" is not a strategy. The agent is fixed. Everything you can actually change lives outside of it.

Prompt Caching's Hidden Tax: When a Cache Hit Serves the Wrong User's Context

· 11 min read
Tian Pan
Software Engineer

Prompt caching is sold as a free win. Cache the long shared prefix — your system prompt, your tool definitions, your retrieved context — pay full price only for the short tail that changes, and watch the bill drop. The numbers are real: a cache read costs roughly a tenth of a fresh input token, so a workload with a heavy stable prefix can see its input cost fall by 80% or more. Teams adopt it for that reason, tune it for that reason, and report on it with a single metric: cache hit rate, trending up.

What that framing hides is that the boundary you just drew — the line between the cached prefix and the uncached tail — is not a billing knob. It is a correctness boundary. Everything above the cache breakpoint is content the system has decided is interchangeable across requests. If you draw that line to maximize hit rate, you are letting a finance metric decide which facts in your prompt are allowed to be shared between users, between tenants, and across time. That is an isolation decision, and it deserves to be made on purpose.

The failure mode is quiet because it never throws. A cache hit that serves one user a context shaped by another user's profile returns a perfectly well-formed response. A cache hit that serves personalization that was true when the prefix was warmed and false by the time it is reused returns a confident, coherent, wrong answer. Nothing in your latency graph or your error rate moves. The only signal is a hit rate that looks great, and it looks great precisely because the key is too coarse.
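One way to make the isolation decision on purpose is to label every prompt part with the scope it is safe to share across, and let the labels, not the hit rate, decide where the breakpoint goes. The sketch below is a minimal illustration under assumptions: `PromptPart`, the scope names, and `cacheable_prefix_length` are hypothetical and not part of any provider's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptPart:
    text: str
    scope: str  # hypothetical labels: "global", "tenant", or "user"


def cacheable_prefix_length(parts, allowed_scopes=frozenset({"global"})):
    """Count the leading parts that may sit above the cache breakpoint.

    The prefix ends at the first part whose scope is not allowed to be
    shared, even if later parts would be shareable on their own.
    """
    count = 0
    for part in parts:
        if part.scope not in allowed_scopes:
            break
        count += 1
    return count


parts = [
    PromptPart("You are a support assistant.", "global"),
    PromptPart("Tool definitions ...", "global"),
    PromptPart("Customer profile: prefers email.", "user"),
    PromptPart("Formatting rules ...", "global"),
]
breakpoint_index = cacheable_prefix_length(parts)
```

Here the breakpoint lands after the second part. The formatting rules are shareable, but they sit below user-specific content, so moving them above the profile is the deliberate change that improves hit rate without widening what is shared.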

Prompt Injection Is a Confused Deputy, Not a Content-Filtering Problem

· 10 min read
Tian Pan
Software Engineer

The most common post-incident finding for a prompt injection breach is some variation of "the model got tricked." A retrieved document contained hidden instructions, the agent followed them, customer data left the building. The fix that follows is almost always a content filter: scan the input, classify the malicious instruction, strip it out before it reaches the model. Ship the filter, close the ticket.

That finding is wrong, and the filter is a treadmill. "The model got tricked" describes the symptom, not the vulnerability. The vulnerability is that an agent holding real privileges — a database token, a send-email capability, filesystem write — accepted instructions from a source that should never have been allowed to command those privileges. That is not a new class of bug. It is a confused deputy, and operating systems named and largely solved it almost forty years ago.

If you treat prompt injection as a detection problem, you are signing up for an arms race against every attacker who can phrase a sentence. If you treat it as an authority problem, you get to reuse decades of security engineering that already works.
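As a rough illustration of the authority framing, the check below asks where an instruction came from before it asks what the instruction says. It is a sketch under assumptions: the provenance labels, the action names, and the `Instruction` type are hypothetical, and it presumes your agent runtime can track which source requested an action, which is the hard part in practice.

```python
from dataclasses import dataclass

# Hypothetical provenance labels and action names, for illustration only.
TRUSTED_SOURCES = frozenset({"user", "system"})
PRIVILEGED_ACTIONS = frozenset({"send_email", "write_file", "query_database"})


@dataclass(frozen=True)
class Instruction:
    action: str
    source: str  # where the text that requested the action came from


def is_authorized(instruction: Instruction) -> bool:
    """Allow privileged actions only when a trusted source requested them."""
    if instruction.action not in PRIVILEGED_ACTIONS:
        return True
    return instruction.source in TRUSTED_SOURCES


from_document = Instruction("send_email", "retrieved_document")
from_user = Instruction("send_email", "user")
```

The retrieved document can say anything it likes. It cannot command the email capability, because the decision never depends on classifying its wording.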

The Quadratic Cost of a Conversation: Why AI Chat Spend Grows Superlinearly

· 8 min read
Tian Pan
Software Engineer

A ten-turn conversation does not cost ten times a single turn. It costs closer to fifty-five times. This is the first thing most teams get wrong when they model the unit economics of an AI feature, and it is the reason a product that looks profitable in a spreadsheet bleeds money in production.

The mistake is treating a conversation as a sequence of independent requests. It is not. Because LLM APIs are stateless, every turn re-sends the entire accumulated history. Turn one sends one unit of context. Turn two sends two. Turn ten sends ten. The total billed across the session is the sum 1 + 2 + ... + N = N(N+1)/2, which for long conversations grows as roughly N²/2. That is quadratic growth, while your product almost certainly charges a flat, linear price.
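The arithmetic is short enough to check directly. This sketch assumes the simplified model from the paragraph above, where every turn adds one equal unit of context and no caching or truncation applies; the function names are illustrative.

```python
def session_units(turns: int) -> int:
    """Total context units billed when turn k re-sends k units of history."""
    return turns * (turns + 1) // 2


def cost_vs_linear(turns: int) -> float:
    """Ratio of session cost to the cost of `turns` independent one-unit requests."""
    return session_units(turns) / turns


ten_turns = session_units(10)      # 55 units, not 10
twenty_turns = session_units(20)   # 210 units
```

Doubling the conversation from 10 to 20 turns takes the bill from 55 units to 210, almost four times as much, which is the quadratic term showing through.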

The users who love your product most are the ones holding the longest conversations. They are also the ones quietly destroying your margins.

The Rate Limit That Became a Product Decision

· 10 min read
Tian Pan
Software Engineer

A rate limit used to be an infrastructure detail. You hit a 429, you retried with backoff, you queued the overflow, and nobody outside the on-call channel ever knew it happened. The user saw a response that was a few hundred milliseconds slower than usual. That was the whole story.

That story no longer holds for agentic features. When an agent hits a provider's tokens-per-minute ceiling halfway through a multi-step plan, the failure does not stay inside the infrastructure. It surfaces as a half-finished answer, a tool loop that stalls before the last call, or a user watching a spinner that will never resolve. The quota stopped being a backend capacity number and became a constraint that product has to design around — the same way product designs around a checkout flow or an empty state.

The Semantic Cache That Confidently Returns the Wrong Answer

· 9 min read
Tian Pan
Software Engineer

Two support users ask your agent almost the same question within a minute of each other. The first asks, "What's our refund window for EU orders?" The second asks, "What's our refund window for US orders?" The embeddings of those two sentences sit a hair's breadth apart — same length, same structure, one two-letter token of difference. Your semantic cache, tuned to a similarity threshold that looked perfectly reasonable in the demo, scores them as a match. The second user gets the first user's answer. The EU's 14-day cooling-off period is presented to a US customer as fact, in fluent prose, with no asterisk.

Nobody gets paged for this. The cache returned a 200. Latency was great. The cost dashboard shows a hit, which is the outcome everyone wanted. The only signal that anything went wrong is a customer acting on policy that does not apply to them — and that signal arrives days later, through a refund dispute, not through your monitoring.

This is the failure mode that makes semantic caching different from every cache you have built before. An exact-match cache can be stale, but it is never wrong — the key either matches or it doesn't. A semantic cache trades that guarantee away on purpose. It is designed to return answers for keys it has never seen, and the price of that latency win is a correctness risk that most teams never put a number on.
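One possible guard, sketched here under assumptions, is to require that the discriminating terms in the two queries agree before a similarity hit is accepted. The region list, the 0.9 threshold, and the function names are hypothetical; the similarity score is taken as an input from whatever embedding model you already use.

```python
import re

# Hypothetical list of terms that change the answer even when the
# surrounding sentence is identical.
REGION_PATTERN = re.compile(r"\b(EU|US|UK)\b")


def extract_slots(query: str) -> frozenset:
    """Pull out the terms that must match exactly for answers to be reusable."""
    return frozenset(REGION_PATTERN.findall(query))


def accept_cache_hit(new_query: str, cached_query: str,
                     similarity: float, threshold: float = 0.9) -> bool:
    """Accept a semantic hit only if similarity is high AND the slots agree."""
    if similarity < threshold:
        return False
    return extract_slots(new_query) == extract_slots(cached_query)


eu_question = "What's our refund window for EU orders?"
us_question = "What's our refund window for US orders?"
hit = accept_cache_hit(us_question, eu_question, similarity=0.97)
```

With this guard the US question misses the cache despite the 0.97 score and falls through to the model. A keyword list will not catch every discriminating detail, so treat it as a floor for correctness and not as a replacement for measuring the false-hit rate.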

Shadow AI: The Agents Your Team Already Shipped

· 10 min read
Tian Pan
Software Engineer

Shadow IT used to mean a marketing team expensing a SaaS subscription, or an engineer spinning up an unsanctioned S3 bucket. It was annoying, it was a procurement headache, and it was mostly survivable. Shadow AI is the same instinct — route around the slow official path — except the blast radius is larger and the entry cost has collapsed to almost nothing.

An engineer can wire an LLM API call into a production workflow in an afternoon. A support lead can stand up a no-code triage agent before lunch. A data analyst can paste a quarter's worth of customer records into a chat window to "just summarize this real quick." None of it passes through review, none of it shows up in an architecture diagram, and your governance program cannot protect a system it does not know exists.

The uncomfortable part is the scale. A 2025 UpGuard survey found that more than 80% of workers — and nearly 90% of security professionals — use unapproved AI tools at work. Your security team is doing it. Your executives are doing it. The question is not whether you have shadow AI. It is whether you can see any of it.

The Streaming Rollback Problem: You Can't Un-Say a Token

· 10 min read
Tian Pan
Software Engineer

Watch someone use a chat product for the first time and you'll notice they start reading before the model finishes. That reading-as-it-appears behavior is the entire reason streaming exists: it turns a multi-second wait into something that feels like a conversation. It is also the reason your output guardrails are quietly broken.

Here is the uncomfortable sequence. The model generates token 1, token 2, token 150. Each one is rendered the instant it arrives. At token 200, the model produces a hallucinated dosage, a leaked email address, or a sentence that violates your content policy. Your output-side guardrail fires correctly and immediately. But "immediately" is too late — the user has already read 200 tokens. You cannot un-render them. The guardrail did its job, and the violation still reached a human being.
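A common mitigation is to hold back a short window of tokens so the guardrail sees them before the user does. The sketch below is one minimal way to do that, assuming a hypothetical `is_violation` callable that checks the buffered text; the window size of 20 is an arbitrary placeholder.

```python
from collections import deque
from typing import Callable, Iterable, Iterator


def stream_with_holdback(tokens: Iterable[str],
                         is_violation: Callable[[str], bool],
                         holdback: int = 20) -> Iterator[str]:
    """Release a token only after `holdback` newer tokens have passed the check."""
    pending = deque()
    for token in tokens:
        pending.append(token)
        # Check the buffered, not-yet-shown text as a whole.
        if is_violation("".join(pending)):
            return  # buffered tokens are dropped; already-released ones are not recalled
        if len(pending) > holdback:
            yield pending.popleft()
    # The stream ended cleanly, so flush whatever is still buffered.
    yield from pending


shown = list(stream_with_holdback(["a", "b", "BAD", "c"],
                                  lambda text: "BAD" in text,
                                  holdback=1))
```

With a one-token window the user sees only "a" before the stream stops. The cost is latency: every token is delayed by the size of the window, so the window is a product trade-off between responsiveness and how much text can escape before a check fires.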

The Streaming Token the User Acted On Too Soon

· 9 min read
Tian Pan
Software Engineer

A user asked your assistant whether a config change was safe to ship. The model streamed back: "Yes, you can deploy this safely." Three hundred milliseconds later it continued: "— except in the us-east region, where the old connection pool is still draining." But the user had already read the first half, felt the relief of a green light, and clicked deploy. The qualification arrived to an empty room.

Nobody made a mistake here. The model was correct. The user read what was on screen. The renderer faithfully displayed every token the moment it arrived. And yet the outcome was a bad deploy, because streaming turned the model's intermediate state into something the user treated as final.

Structured Output Is Not Validated Output

· 9 min read
Tian Pan
Software Engineer

The day your team turns on schema-constrained decoding feels like a milestone. The parsing errors stop. The JSONDecodeError alerts go quiet. The flaky regex that scraped fields out of prose gets deleted. Someone says "the model returns valid JSON now" in standup, and the structured-output ticket gets closed.

That sentence is where the trouble starts. "The model returns valid JSON now" is the beginning of correctness work, not the end of it. JSON mode and constrained decoding guarantee the shape of a response — that quantity is an integer, that status is one of three enum values, that the object has the keys you asked for. They guarantee nothing about whether quantity is the right number, whether status reflects what actually happened, or whether the sku field points at a product that exists in your catalog.
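To make the gap concrete, here is a sketch of the checks that start where the schema stops. The catalog contents, the quantity limit, and `validate_order` are hypothetical; the point is only that these checks need your data, which the decoder never sees.

```python
# Hypothetical catalog; in practice this would be a lookup against your own data.
CATALOG = {"SKU-1001", "SKU-1002"}


def validate_order(order: dict, catalog: set, max_quantity: int = 100) -> list:
    """Return semantic errors for an order that already passed schema validation."""
    errors = []
    if order["sku"] not in catalog:
        errors.append(f"unknown sku: {order['sku']}")
    if not 1 <= order["quantity"] <= max_quantity:
        errors.append(f"implausible quantity: {order['quantity']}")
    return errors


# Well-formed JSON with the right types and a valid enum value, and still wrong.
model_output = {"sku": "SKU-9999", "quantity": 2, "status": "shipped"}
problems = validate_order(model_output, CATALOG)
```

The output above satisfies the schema and fails validation, because `SKU-9999` does not exist in the catalog.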

Your System Prompt Grows After Every Incident — and Nobody Deletes a Line

· 8 min read
Tian Pan
Software Engineer

Open the system prompt of any agent that has been in production for a year. Scroll to the bottom. You will find a sediment layer of sentences that read like apologies: "Never invent order numbers." "Do not promise refunds you cannot confirm." "If the user is in Germany, do not mention the legacy plan." Each one is a fossil. Each one marks the exact moment something went wrong in production, someone got paged, and the fastest available fix was to add a sentence.

Nobody deletes those sentences. Not because they are still earning their place, but because deleting one means proving a negative — proving the model will not regress on a bug that may have been fixed three model versions ago. No one can prove that, so the line stays. The system prompt becomes an append-only log of past incidents, and it costs you tokens on every single call, forever.

This is the quietest form of technical debt in an AI system, because it does not look like debt. It looks like diligence.

The Agent That Remembers What You Took Back: Deletion as a First-Class Memory Operation

· 10 min read
Tian Pan
Software Engineer

In March, a user told your agent to stop recommending restaurants with outdoor seating — they had moved to an apartment with a baby and late nights were over. In September, the agent suggests a rooftop bar for their anniversary. The user is annoyed, and you are confused, because you watched the March correction land. It got written to memory. It is still there. The problem is that it is sitting next to the original preference, which is also still there, and retrieval surfaced the older one because it had a slightly better embedding match for "anniversary dinner."

This is the failure mode nobody designs for. Teams spend weeks on memory writes — extraction, summarization, embedding, namespacing — and treat deletes as a someday problem. Long-term memory makes adding a fact almost free, so facts accumulate. But a memory store is not a diary. A diary is allowed to contain things that used to be true. A memory store that an agent reads from to make decisions is not, because the agent cannot tell the difference between a fact and a fossil.
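As a minimal sketch of what treating deletion as a real operation could look like, the store below retracts older facts on a topic whenever a newer one is written, and retrieval never returns a retracted fact. The `Memory` and `MemoryStore` names are hypothetical, and exact topic matching stands in for the conflict detection a real system would need.

```python
from dataclasses import dataclass


@dataclass
class Memory:
    topic: str
    text: str
    retracted: bool = False


class MemoryStore:
    def __init__(self):
        self._items = []

    def write(self, topic: str, text: str) -> Memory:
        """Store a fact and retract every older fact on the same topic."""
        for item in self._items:
            if item.topic == topic:
                item.retracted = True
        memory = Memory(topic, text)
        self._items.append(memory)
        return memory

    def retrieve(self, topic: str) -> list:
        """Return only facts that have not been retracted."""
        return [m.text for m in self._items
                if m.topic == topic and not m.retracted]


store = MemoryStore()
store.write("dining", "Likes restaurants with outdoor seating.")
store.write("dining", "No outdoor seating; late nights are over.")
current = store.retrieve("dining")
```

Retrieval now returns only the correction. The original preference is still stored for auditing, but it is no longer eligible to win on embedding similarity.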