639 posts tagged with "llm"

Streaming Structured Output: Why Your Parser Hangs on Token 47

May 9, 2026 · 11 min read

Software Engineer

The first time a team builds a streaming AI feature with structured output, the bug is always the same. The model is generating fine. The chunks are arriving fine. But somewhere around token 47, the parser hangs, the UI freezes, or, worse, a half-formed enum value gets routed to a downstream tool that quietly does the wrong thing. The team adds a try/catch around JSON.parse, considers itself done, and ships. Two weeks later, a sibling team complains that the streaming UI feels janky once the response gets long. A quarter later, an incident review asks why a "Delete" tool call fired on a record that the model was still describing as "DeleteIfEmpty."

The bug is not in any single token. The bug is that token-streaming and structured output are architecturally at odds, and most frameworks paper over the conflict with prayer. A schema says "this is a complete object." A token stream says "here are the bytes one at a time." Every intermediate state between those two endpoints is, by definition, invalid against the schema. The team's job is to decide what to do during those intermediate states — and most teams have not made that decision explicitly.
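One way to make that decision explicit is to refuse to act on anything until the object has closed. The sketch below is a minimal illustration, not a framework's API: `ObjectBoundaryScanner`, `actions_to_dispatch`, the `action` field, and the two-value enum are all hypothetical names, and it assumes the stream carries top-level JSON objects with nothing but whitespace between them. It tracks brace depth in a single pass, so it avoids re-parsing the whole buffer on every chunk.

```python
import json
from typing import Iterable, Iterator, List

# Hypothetical enum for illustration; "Delete" is a prefix of "DeleteIfEmpty".
ALLOWED_ACTIONS = {"Delete", "DeleteIfEmpty"}


class ObjectBoundaryScanner:
    """Buffers streamed text and parses only once a top-level object closes."""

    def __init__(self) -> None:
        self.buffer: List[str] = []
        self.depth = 0          # current nesting depth of { }
        self.in_string = False  # inside a JSON string literal
        self.escaped = False    # previous character was a backslash

    def feed(self, chunk: str) -> List[dict]:
        """Consume one chunk; return any objects that completed within it."""
        completed = []
        for ch in chunk:
            self.buffer.append(ch)
            if self.in_string:
                if self.escaped:
                    self.escaped = False
                elif ch == "\\":
                    self.escaped = True
                elif ch == '"':
                    self.in_string = False
                continue
            if ch == '"':
                self.in_string = True
            elif ch == "{":
                self.depth += 1
            elif ch == "}" and self.depth > 0:
                self.depth -= 1
                if self.depth == 0:
                    # Only now is the text valid against the schema.
                    completed.append(json.loads("".join(self.buffer)))
                    self.buffer.clear()
        return completed


def actions_to_dispatch(chunks: Iterable[str]) -> Iterator[str]:
    """Yield an action only from a complete object with a known enum value."""
    scanner = ObjectBoundaryScanner()
    for chunk in chunks:
        for obj in scanner.feed(chunk):
            action = obj.get("action")
            if action not in ALLOWED_ACTIONS:
                raise ValueError(f"unexpected action: {action!r}")
            yield action


# The second chunk ends on "Delete", which looks like a valid enum value
# but is still an intermediate state of "DeleteIfEmpty".
stream = ['{"act', 'ion": "Delete', 'IfEmpty", "id": 7}']
dispatched = list(actions_to_dispatch(stream))
```

Here nothing is dispatched after the second chunk, even though the text seen so far ends in a string that matches an allowed value. A UI that wants to render partial text can still read the buffer; the point of the sketch is only that side effects wait for the closing brace.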

The Summary Tax: When Compaction Eats More Tokens Than It Saves

May 9, 2026 · 10 min read

Tian Pan

Software Engineer

A long-running agent crosses its compaction threshold every twelve turns. Each pass costs an LLM call sized to the running window — first 8K tokens, then 14K, then 22K — because the span being summarized grows with every trigger. By turn sixty, the user has spent more tokens watching the agent re-summarize itself than they spent on the actual reasoning that mattered. The cost dashboard reads "user inference cost" as a single number, blissfully unaware that half of it paid for compression of context the user will never look at again.

This is the summary tax: a class of overhead that scales with conversation length, fires invisibly between user turns, and shows up as a single line item that conflates the work the user paid for with the bookkeeping the system did to manage itself. It is the closest thing modern agent architectures have to garbage-collection pause time — and most teams are running production with -verbose:gc turned off.
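To see the tax, the two kinds of spend have to be recorded separately. The following is a rough sketch of such a ledger. The `TokenLedger` class and its method names are hypothetical, and the 1,000 tokens of reasoning per turn is an assumed figure chosen only to make the arithmetic concrete. The three compaction costs are the ones quoted above.

```python
from dataclasses import dataclass


@dataclass
class TokenLedger:
    """Keeps user-facing work and compaction overhead as separate line items."""

    reasoning_tokens: int = 0
    compaction_tokens: int = 0

    def record_turn(self, tokens: int) -> None:
        self.reasoning_tokens += tokens

    def record_compaction(self, tokens: int) -> None:
        self.compaction_tokens += tokens

    @property
    def summary_tax(self) -> float:
        """Fraction of total spend that went to compaction."""
        total = self.reasoning_tokens + self.compaction_tokens
        return self.compaction_tokens / total if total else 0.0


ledger = TokenLedger()
compaction_costs = [8_000, 14_000, 22_000]  # the first three passes from the text

for turn in range(1, 37):           # 36 turns, compaction every twelfth turn
    ledger.record_turn(1_000)       # assumed reasoning cost per turn
    if turn % 12 == 0:
        ledger.record_compaction(compaction_costs[turn // 12 - 1])

print(f"reasoning={ledger.reasoning_tokens} "
      f"compaction={ledger.compaction_tokens} "
      f"tax={ledger.summary_tax:.0%}")
```

Under those assumptions, compaction has already overtaken reasoning by turn 36: 44,000 tokens against 36,000. The real ratio depends entirely on how heavy each turn is, which is why it needs its own line on the dashboard.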

The Attack Vector You Ship With Every Open RAG System

May 8, 2026 · 9 min read

Tian Pan

Software Engineer

Five carefully crafted documents. A corpus of 2.6 million. A 97% success rate at manipulating specific AI responses. That's the benchmark result from PoisonedRAG, presented at USENIX Security 2025 — and the attack didn't require model access, prompt injection at inference time, or any direct interaction with the system at all. The attacker simply contributed content to the knowledge base.

If your RAG system lets users add content — helpdesk tickets, wiki edits, customer feedback, shared notes — you've already shipped the attack vector. The question is whether you've also shipped the defenses.

Fine-Tune Orphan: Recovering Domain Expertise When the Base Model Is Deprecated

May 8, 2026 · 9 min read

Tian Pan

Software Engineer

On January 4, 2024, OpenAI retired the /fine-tunes endpoint. Every fine-tuned Ada, Babbage, Curie, and Davinci model stopped responding. Teams that had spent months building production systems on these models — careful prompt design, annotated datasets, labeling pipelines — woke up to HTTP 404s. The fine-tunes didn't migrate. The learned behaviors didn't transfer. The domain expertise was gone.

This wasn't a fringe edge case. Google followed in August 2024 by completely decommissioning the PaLM API, with zero backwards-compatible grace period. Unlike OpenAI, which at least let existing GPT-3.5 fine-tunes keep running while blocking new training runs, Google's shutdown meant production inference stopped the same day. If your fine-tuned PaLM model was in the critical path, you had a service outage.

Statistical Watermarking for LLM Output: How Token Logit Bias Creates Detectable Signatures

May 8, 2026 · 9 min read

Tian Pan

Software Engineer

Google has been watermarking Gemini output for every user since October 2024 — 20 million users, no perceptible quality degradation, algorithmically detectable. OpenAI has a working prototype that requires only a few hundred tokens to produce a reliable signal. Anthropic says it's on the roadmap. The EU AI Act's Article 50 mandates machine-readable marking of AI-generated content for covered providers. And yet: a $0.88-per-million-token attack achieves ~100% evasion success against seven recent watermarking schemes simultaneously.

This is the actual state of LLM text watermarking. The gap between what's deployed, what the papers claim, and what adversaries can do is wider than most teams realize — and the engineering decisions you make about watermarking depend heavily on which side of that gap you're standing on.
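For readers who have not seen how a logit-bias watermark is detected, here is a toy sketch of the generic "green list" idea from the research literature. It is not any vendor's deployed scheme. The key, the vocabulary size, the SHA-256 partition, and the function names are all assumptions made for illustration, and the generator picks only green tokens, which is the limiting case of an infinitely large logit bias.

```python
import hashlib
import math
from typing import List

GAMMA = 0.5        # assumed fraction of the vocabulary that is "green" at each step
VOCAB_SIZE = 50    # toy vocabulary of integer token ids
KEY = "demo-key"   # hypothetical secret shared by generator and detector


def is_green(prev_token: int, token: int) -> bool:
    """Pseudo-randomly assign `token` to the green list, seeded by the previous token."""
    digest = hashlib.sha256(f"{KEY}:{prev_token}:{token}".encode()).digest()
    return digest[0] < 256 * GAMMA


def z_score(tokens: List[int]) -> float:
    """Standard deviations by which the green count exceeds what chance predicts."""
    pairs = list(zip(tokens, tokens[1:]))
    n = len(pairs)
    green = sum(is_green(prev, tok) for prev, tok in pairs)
    return (green - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))


def generate_watermarked(length: int, start: int = 0) -> List[int]:
    """Toy generator: always emit the first green token (a 'hard' watermark)."""
    tokens = [start]
    while len(tokens) < length:
        prev = tokens[-1]
        tokens.append(next(t for t in range(VOCAB_SIZE) if is_green(prev, t)))
    return tokens


watermarked = generate_watermarked(101)   # 100 scored token pairs
score = z_score(watermarked)
```

With every scored token on the green list, the statistic works out to the square root of the number of pairs, so 100 pairs score 10. Unwatermarked text should hover near zero. A real scheme nudges logits rather than forbidding red tokens, which preserves quality but weakens the signal and is part of why paraphrasing attacks can work.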

The AI A/B Test That Lied: Novelty, Carryover, and Anchoring Bias in LLM Experiments

May 7, 2026 · 10 min read

Tian Pan

Software Engineer

Your AI feature shipped with confidence. The A/B test showed a statistically significant 12% lift in user engagement. The confidence intervals didn't overlap. The sample size was right. The p-value was comfortably under 0.05. Six weeks later, the metric has flat-lined back to baseline. Three months in, it's actually below baseline. The experiment told you the feature worked. The experiment lied.

This isn't a bug in your statistical tooling. It's a fundamental mismatch between what standard A/B testing measures and what happens when humans interact with probabilistic AI systems over time. Three specific biases — novelty inflation, anchoring, and carryover — conspire to inflate every AI feature experiment, and the standard remedy of adding a holdout group doesn't fix any of them.

The AI Efficiency Paradox: When Your Best Feature Kills Your Revenue

May 7, 2026 · 9 min read

Tian Pan

Software Engineer

In early 2026, Atlassian reported something that hadn't happened in the company's history: a decline in enterprise seat counts. For a company whose entire growth model rests on expansion revenue — selling more seats as customer organizations grow — this was a structural alarm, not a blip. The proximate cause wasn't churn or product failure. It was that Atlassian's own AI features had made teams so much more productive that fewer seats were needed to do the same amount of work.

This is the AI efficiency paradox: build a feature that genuinely saves users time, and you may be training them to need less of your product. The more useful your AI, the faster your pricing model breaks.

Story Points Don't Survive First Contact With an LLM

May 7, 2026 · 8 min read

Tian Pan

Software Engineer

Here is a failure mode that happens quietly, at every company with a mature Agile practice that decides to add an LLM feature: the team estimates the work in story points, assigns it to a two-week sprint, and then spends three sprints in a row reporting "70% done" while the engineering manager stares at a burndown chart that refuses to burn down. Nobody lied. The feature is genuinely hard to finish — because the conditions that make story points a useful planning tool don't exist for AI features, and nobody noticed until they were already committed.

The problem is not that engineers are bad at estimating. The problem is that story points encode assumptions about the nature of software work — assumptions that LLM features violate structurally, not accidentally.

AI Feature Dependency Graphs: Resilience Engineering When Your Services Share a Model

May 7, 2026 · 10 min read

Tian Pan

Software Engineer

Your embedding model goes down at 3 PM on a Tuesday. Within thirty seconds, your support chat stops answering questions, your personalized recommendation engine starts returning empty results, your document search returns nothing, and your onboarding assistant stops working. Your on-call engineer opens the incident channel and sees fifteen simultaneous alerts from features that have no visible relationship to each other. There is no stack trace pointing to the root cause. It looks like a distributed systems outage — but it isn't. It's a single shared dependency failing, and you didn't know fifteen features shared it.

This is the AI feature dependency problem: the infrastructure layer underneath your product features is deeply interconnected, but your architecture diagrams show each feature as an isolated box. When the coupling is invisible, failure propagation is invisible too — until it isn't.
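Making the coupling visible can start with something as small as inverting a feature-to-dependency map. The sketch below assumes you can already list what each feature calls. The feature and dependency names are hypothetical, loosely based on the outage described above, and `shared_dependencies` is an illustrative helper rather than part of any tool.

```python
from collections import defaultdict
from typing import Dict, List, Set


def shared_dependencies(
    feature_deps: Dict[str, Set[str]], min_features: int = 2
) -> Dict[str, List[str]]:
    """Invert feature -> dependencies into dependency -> features that share it."""
    by_dependency: Dict[str, List[str]] = defaultdict(list)
    for feature, deps in feature_deps.items():
        for dep in deps:
            by_dependency[dep].append(feature)
    return {
        dep: sorted(features)
        for dep, features in by_dependency.items()
        if len(features) >= min_features
    }


# Hypothetical inventory: four features, one embedding model behind all of them.
feature_deps = {
    "support-chat": {"embedding-model", "chat-llm"},
    "recommendations": {"embedding-model"},
    "document-search": {"embedding-model", "vector-store"},
    "onboarding-assistant": {"embedding-model", "chat-llm"},
}

blast_radius = shared_dependencies(feature_deps)
for dep, features in sorted(blast_radius.items()):
    print(f"{dep}: {len(features)} features -> {', '.join(features)}")
```

The output is the view the architecture diagram was missing: one row per shared dependency, listing every feature that goes dark when it fails.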

AI Output Volatility Is a Business Risk You're Probably Underpricing

May 7, 2026 · 9 min read

Tian Pan

Software Engineer

When companies talk about AI risk, the conversation usually gravitates toward the obvious failures: hallucinated facts, biased outputs, legal liability from generated content. What gets far less attention is a quieter structural problem: you've made commercial commitments — pricing tiers, SLAs, customer-facing accuracy claims — on top of a system whose outputs are inherently probabilistic. Every time the model generates a response, it's sampling from a distribution. The contract doesn't mention distributions.

This is a business risk that most teams discover late, when a customer complains that the same document review workflow gave completely different results on Monday and Friday. Or when a regulator asks for reproducibility guarantees that the system architecturally cannot provide.

Your System Prompts Are Still in English: The Silent Cost of Incomplete AI Localization

May 7, 2026 · 8 min read

Tian Pan

Software Engineer

Your team ships an AI feature. You celebrate the localization work: every button label, tooltip, and error message has been translated into twelve languages. The product manager signs off. The feature goes live globally.

Then, six weeks later, a user in Germany posts a screenshot. The AI's response has the right words but wrong register — awkward formality for a casual support context. A Japanese user reports that structured outputs contain dates formatted as MM/DD/YYYY, confusing their downstream tooling. A Brazilian support engineer notices the AI occasionally slips into English mid-sentence when reasoning through complex queries. These aren't infrastructure failures. Your dashboards show green. But for non-English users, the product is quietly worse.

The root cause is almost always the same: teams translate UI strings but leave system prompts in English. It feels like localization. It isn't.

The Context Format Decision Most Teams Make Accidentally: JSON vs Markdown vs Plain Text

May 7, 2026 · 9 min read

Tian Pan

Software Engineer

Most teams pick a context format once, early in development, and never revisit it. A developer reaches for JSON because it looks structured and machine-readable. Another grabs markdown because it's what they use in README files. Plain text gets chosen when nothing else seems necessary. These are not engineering decisions — they're habits. And they silently shape how your model reasons.

The format you pass to an LLM is not inert packaging. It is an instruction. Structured JSON context primes the model toward schema-following behavior. Markdown encourages hierarchical synthesis. Plain text opens up more flexible inference. Picking the wrong format for the task can degrade accuracy by 40% or more, and you won't see the error in your logs.
