
567 posts tagged with "llm"


SSE vs WebSockets vs gRPC Streaming for LLM Apps: The Protocol Decision That Bites You Later

· 11 min read
Tian Pan
Software Engineer

Most teams building LLM features pick a streaming protocol the same way they pick a font: quickly, without much thought, and they live with the consequences for years. The first time the choice bites you is usually in production — a Cloudflare 524 timeout that corrupts your SSE stream, a WebSocket server that leaks memory under sustained load, or a gRPC-Web integration that works fine in unit tests and silently fails when a client needs to stream messages upstream. The protocol shapes your failure modes. Picking based on benchmark throughput is the wrong frame.

Every major LLM provider — OpenAI, Anthropic, Cohere, Hugging Face — streams tokens over Server-Sent Events. That fact is a strong prior, but not because SSE is fast. It's because SSE is stateless, trivially compatible with HTTP infrastructure, and its failure modes are predictable. The question is whether your application has requirements that force you off that path.
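To make the default concrete, here is a minimal sketch of consuming that SSE token stream, assuming an OpenAI-compatible /chat/completions endpoint that emits `data:` lines and a `[DONE]` sentinel; the URL, model name, and payload shape are illustrative rather than prescriptive.

```python
# Minimal SSE consumer sketch for an OpenAI-compatible streaming endpoint.
# Assumes the provider emits "data: {json}" lines and a "[DONE]" sentinel.
import json
import os
import httpx

def stream_tokens(prompt: str):
    payload = {
        "model": "gpt-4o-mini",  # illustrative model name
        "stream": True,
        "messages": [{"role": "user", "content": prompt}],
    }
    headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}
    with httpx.stream("POST", "https://api.openai.com/v1/chat/completions",
                      json=payload, headers=headers, timeout=None) as resp:
        for line in resp.iter_lines():
            if not line.startswith("data: "):
                continue                      # skip SSE comments / keep-alives
            data = line[len("data: "):]
            if data == "[DONE]":              # provider's end-of-stream marker
                break
            chunk = json.loads(data)
            delta = chunk["choices"][0]["delta"].get("content")
            if delta:
                yield delta
```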

Structured Output Is Not Structured Thinking: The Semantic Validation Layer Most Teams Skip

· 11 min read
Tian Pan
Software Engineer

A medical scheduling system receives a valid JSON object from its LLM extraction layer. The schema passes. The types check out. The required fields are present. Then a downstream job tries to book an appointment and finds that the end_time is three hours before the start_time. Both fields are correctly formatted ISO timestamps. Neither violates the schema. The booking silently fails, and the patient gets no appointment — no error surfaced, no alert fired.

This is what it looks like when schema validation is mistaken for correctness validation. The model followed the format. It did not follow the logic.
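A semantic validation layer sits between schema checks and the downstream job. As a minimal sketch, assuming a Pydantic v2 model for the appointment payload described above (field names are illustrative), the logic check lives next to the schema rather than in the booking job:

```python
# Semantic check layered on top of schema validation (Pydantic v2 sketch).
from datetime import datetime
from pydantic import BaseModel, model_validator

class Appointment(BaseModel):
    patient_id: str
    start_time: datetime
    end_time: datetime

    @model_validator(mode="after")
    def end_after_start(self) -> "Appointment":
        # The schema accepts any two valid ISO timestamps; this is the logic check
        # that catches an end_time three hours before the start_time.
        if self.end_time <= self.start_time:
            raise ValueError("end_time must be after start_time")
        return self
```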

What Structured Outputs Actually Cost You: The JSON Mode Quality Tax

· 9 min read
Tian Pan
Software Engineer

Most teams adopt structured outputs because they're tired of writing brittle regex to extract data from model responses. That's a reasonable motivation. What they don't anticipate is discovering months later, when they finally measure task accuracy, that their "reliability improvement" also degraded the quality of the underlying content by 10 to 15 percent on reasoning-heavy tasks. The syntactic problem was solved. A semantic one was introduced.

This post is about understanding that tradeoff precisely — what constrained decoding actually costs, when the tax is worth paying, and how to build the evals that tell you whether it's hurting your system before you ship.
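The core of that eval is a paired comparison: run the same tasks with and without constrained output and grade both. A minimal sketch follows, where `call_model` and `grade` are hypothetical stand-ins for your provider call and your task-specific grader returning 1 for a correct answer and 0 otherwise.

```python
# Paired eval sketch: measure task accuracy with and without JSON mode.
# `call_model` and `grade` are hypothetical stand-ins, not a real library API.
def compare_modes(tasks, call_model, grade):
    results = {"free_text": 0, "json_mode": 0}
    for task in tasks:
        free = call_model(task.prompt, response_format=None)
        constrained = call_model(task.prompt, response_format={"type": "json_object"})
        results["free_text"] += grade(task, free)
        results["json_mode"] += grade(task, constrained)
    n = len(tasks)
    # Accuracy per mode; a persistent gap is the "quality tax" made visible.
    return {mode: correct / n for mode, correct in results.items()}
```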

Synthetic Seed Data: Bootstrapping Fine-Tuning Before Your First Thousand Users

· 9 min read
Tian Pan
Software Engineer

Fine-tuning a model is easy when you have data. The brutal part is the moment before your product exists: you need personalization to attract users, but you need users to have personalization data. Most teams either skip fine-tuning entirely ("we'll add it later") or spend weeks collecting labeled examples by hand. Neither works well. The first produces a generic model users immediately recognize as generic. The second is slow enough that by the time you have data, the task has evolved.

Synthetic seed data solves this — but only when you understand exactly where it breaks.
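As a rough sketch of what bootstrapping looks like in practice, assuming a hypothetical `complete()` helper that wraps your provider's chat API and a set of personas you invent for your product, each seed example is generated, not collected:

```python
# Sketch of generating synthetic seed examples for fine-tuning.
# PERSONAS, the prompt template, and the "---" delimiter are all illustrative.
import random

PERSONAS = [
    "new parent tracking expenses",
    "freelancer invoicing clients",
    "student splitting rent with roommates",
]

def make_seed_example(task_description: str, complete) -> dict:
    persona = random.choice(PERSONAS)
    prompt = (
        f"You are a {persona}. Write one realistic user request for: "
        f"{task_description}. Then, after a line containing only '---', "
        f"write the ideal assistant response."
    )
    raw = complete(prompt)
    user_msg, _, assistant_msg = raw.partition("\n---\n")
    # Emit the chat-format record most fine-tuning APIs expect.
    return {"messages": [
        {"role": "user", "content": user_msg.strip()},
        {"role": "assistant", "content": assistant_msg.strip()},
    ]}
```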

The Quality Tax of Over-Specified System Prompts

· 9 min read
Tian Pan
Software Engineer

Most engineering teams discover the same thing on their first billing spike: their system prompt has quietly grown to 4,000 tokens of carefully reasoned instructions, and the model has quietly started ignoring half of them. The fix is rarely to add more instructions. It's almost always to delete them.

The instinct to be exhaustive is understandable. More constraints feel like more control. But there's a measurable quality degradation that kicks in as system prompts bloat — and it compounds with cost in ways that aren't visible until they hurt. Research consistently finds accuracy drops at around 3,000 tokens of input, well before hitting any nominal context limit. The model doesn't refuse to comply; it just starts underperforming in ways that are hard to pin down.

This post is about making that degradation visible, understanding why it happens, and building a trimming discipline that doesn't require hoping nothing breaks.
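One piece of that discipline can be mechanical: a token budget check that fails the build when the system prompt creeps past the point where degradation starts. A minimal sketch, assuming tiktoken for counting and a prompt file path that is purely illustrative; the 3,000-token threshold mirrors the figure above and should be tuned against your own evals.

```python
# CI-style budget check for system prompt size (tiktoken sketch).
import tiktoken

TOKEN_BUDGET = 3000  # assumption: tune this against your own eval results

def check_system_prompt(path: str = "prompts/system.md") -> int:
    enc = tiktoken.get_encoding("cl100k_base")
    with open(path) as f:
        text = f.read()
    n_tokens = len(enc.encode(text))
    if n_tokens > TOKEN_BUDGET:
        raise SystemExit(
            f"system prompt is {n_tokens} tokens; budget is {TOKEN_BUDGET}"
        )
    return n_tokens
```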

Your RAG Knows the Docs. It Doesn't Know What Your Engineers Know.

· 10 min read
Tian Pan
Software Engineer

Your enterprise just deployed a RAG system. You indexed every Confluence page, every runbook, every architecture doc. Six months later, a senior engineer leaves — the one who knows why the payment service has that unusual retry pattern, why you never scale the cache past 80%, and which vendor never to call on Fridays. That knowledge was never written down. Your RAG system has no idea it existed.

This is the tacit knowledge problem, and it's why most enterprise AI systems underperform: not because of retrieval quality or hallucination, but because the knowledge they need was never captured in the first place. Sixty percent of employees report that it's difficult or nearly impossible to get crucial information from colleagues. Ninety percent of organizations say departing employees cause serious knowledge loss. The documents your RAG can index are only the tip of the iceberg.

Temperature Is a Product Decision, Not a Model Knob

· 9 min read
Tian Pan
Software Engineer

When a new LLM feature ships, someone eventually asks: "what temperature should we use?" The answer is almost always the same: "I don't know, let's leave it at 0.7." Then the conversation moves on and nobody touches it again.

That's a product decision made by default. Temperature doesn't just control how "random" the model sounds — it shapes whether users trust outputs, whether they re-run queries, whether they feel helped or overwhelmed. Getting it right matters more than most teams realize, and getting it wrong in the wrong direction is hard to diagnose because the failure mode looks like bad model behavior rather than bad configuration.
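One way to force the decision into the open is to make temperature explicit, per-feature configuration instead of an inherited default. The values below are illustrative starting points for a sketch, not recommendations:

```python
# Temperature as per-feature product configuration (illustrative values).
TEMPERATURE_BY_FEATURE = {
    "sql_generation": 0.0,   # reproducibility and determinism matter most
    "support_answers": 0.2,  # grounded, low-variance answers build trust
    "email_drafting": 0.7,   # some variety is the point
    "brainstorming": 1.0,    # users explicitly want divergent options
}

def temperature_for(feature: str, default: float = 0.3) -> float:
    # An unknown feature falls back to a conservative default rather than 0.7 by habit.
    return TEMPERATURE_BY_FEATURE.get(feature, default)
```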

Text-to-SQL at Scale: What Nobody Tells You Before Production

· 11 min read
Tian Pan
Software Engineer

Text-to-SQL demos are deceptively easy to build. You paste a schema into a prompt, ask GPT-4 a question, get back a clean SELECT statement, and suddenly your Slack is full of "what if we built this into our data platform?" messages. Then you try to actually ship it. The benchmark says 85% accuracy. Your internal data team reports that about half the answers are wrong. Your security team asks who reviewed the generated queries before they hit production. Nobody has a good answer.

This is the gap between text-to-SQL as a research problem and text-to-SQL as an engineering problem. The research problem is about getting models to produce syntactically valid SQL. The engineering problem is about schema ambiguity, access control, query validation, and the fact that your enterprise database looks nothing like Spider or BIRD.
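Query validation is the part of the engineering problem that's easiest to show. A minimal sketch, using sqlglot to reject anything that isn't a SELECT over an allowlisted set of tables; the allowlist itself is illustrative, and a production validator would also handle CTEs, joins across schemas, and row-level access control:

```python
# Validation gate for model-generated SQL (sqlglot sketch).
import sqlglot
from sqlglot import exp

ALLOWED_TABLES = {"orders", "customers", "products"}  # illustrative allowlist

def validate_generated_sql(sql: str) -> str:
    parsed = sqlglot.parse_one(sql)
    if not isinstance(parsed, exp.Select):
        raise ValueError("only SELECT statements may reach the database")
    tables = {t.name for t in parsed.find_all(exp.Table)}
    if not tables <= ALLOWED_TABLES:
        raise ValueError(f"query touches disallowed tables: {tables - ALLOWED_TABLES}")
    return sql
```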

The Transcript Layer Lie: Why Your Multimodal Pipeline Hallucinates Downstream

· 9 min read
Tian Pan
Software Engineer

Your ASR system returned "the patient takes metaformin twice daily." The correct word was metformin. The transcript looked clean — no [INAUDIBLE] markers, no error flags. Confidence was 0.73 on that word. Your pipeline discarded that number and handed clean text to the LLM. The LLM, treating it as ground truth, reasoned about a medication that doesn't exist.

This is the transcript layer lie: the implicit assumption that intermediate text representations — whether produced by speech recognition, OCR, or vision models parsing a document — are reliable enough to pass downstream without qualification. They aren't. But almost every production pipeline treats them as if they are.
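The alternative is to carry the confidence forward instead of discarding it. A minimal sketch, assuming the ASR system returns word-level scores and that 0.85 is a threshold you'd tune for your domain; the inline annotation format is an assumption, not a standard:

```python
# Propagate word-level ASR confidence into the text the LLM sees.
LOW_CONFIDENCE = 0.85  # assumption: tune per domain and ASR model

def annotate_transcript(words: list[dict]) -> str:
    # Each word is expected as {"text": str, "confidence": float}.
    out = []
    for w in words:
        if w["confidence"] < LOW_CONFIDENCE:
            out.append(f'{w["text"]} [low-confidence: {w["confidence"]:.2f}]')
        else:
            out.append(w["text"])
    return " ".join(out)

# annotate_transcript([{"text": "metaformin", "confidence": 0.73}, ...])
# -> "metaformin [low-confidence: 0.73] ..."
# giving the LLM a reason to hedge instead of reasoning about a drug that doesn't exist.
```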

The Vanishing Blame Problem in AI Incident Post-Mortems

· 9 min read
Tian Pan
Software Engineer

When a deterministic system breaks, you find the bug. The stack trace points to a line. The diff shows the change. The fix is obvious in retrospect. An AI system does not work that way.

When an LLM-powered feature starts returning worse outputs, you are not looking for a bug. You are looking at a probability distribution that shifted, somewhere, across a stack of components that each introduce their own variance. Was it the model? A silent provider update on a Tuesday? The retrieval index that wasn't refreshed after the schema change? The system prompt someone edited to fix a different problem? The eval that stopped catching regressions three sprints ago?

The post-mortem becomes a blame auction. Everyone bids "the model changed" because it is an unfalsifiable claim that costs nothing to make.

AI-Native API Design: Why REST Breaks When Your Backend Thinks Probabilistically

· 11 min read
Tian Pan
Software Engineer

Most backend engineers can recite the REST contract from memory: client sends a request, server processes it, server returns a status code and body. A 200 means success. A 4xx means the client did something wrong. A 5xx means the server broke. The response is deterministic, the timeout is predictable, and idempotency keys guarantee safe retries.

LLM backends violate every one of those assumptions. A 200 OK can mean your model hallucinated the entire response. A successful request can take twelve minutes instead of twelve milliseconds. Two identical requests with identical parameters will return different results. And if your server times out mid-inference, you have no idea whether the model finished or not.

Teams that bolt LLMs onto conventional REST APIs end up with a graveyard of hacks: timeouts that kill live agent tasks, clients that treat hallucinated 200s as success, retry logic that charges a user's credit card three times because idempotency keys weren't designed for probabilistic operations. This post walks through where the mismatch bites hardest and which interface patterns actually hold up in production.
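One of those patterns is making the probabilistic contract explicit in the response itself rather than hiding it behind a bare 200. A minimal sketch of such an envelope follows; the field names are illustrative, not a standard:

```python
# Response envelope sketch: surface the probabilistic contract to clients.
from dataclasses import dataclass, field
from typing import Literal, Optional

@dataclass
class LLMResponse:
    request_id: str        # ties retries to one logical operation (idempotency)
    status: Literal["complete", "partial", "refused", "timed_out"]
    content: Optional[str]       # may be None on timeout or refusal
    finish_reason: Optional[str] # provider's stop reason, passed through
    model_version: str           # pin what actually produced the output
    warnings: list[str] = field(default_factory=list)  # e.g. "truncated at max_tokens"
```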

The AI On-Call Playbook: Incident Response When the Bug Is a Bad Prediction

· 12 min read
Tian Pan
Software Engineer

Your pager fires at 2 AM. The dashboard shows no 5xx errors, no timeout spikes, no unusual latency. Yet customer support is flooded: "the AI is giving weird answers." You open the runbook—and immediately realize it was written for a different kind of system entirely.

This is the defining failure mode of AI incident response in 2026. The system is technically healthy. The bug is behavioral. Traditional runbooks assume discrete failure signals: a stack trace, an error code, a service that won't respond. LLM-based systems break this assumption completely. The output is grammatically correct, delivered at normal latency, and thoroughly wrong. No alarm catches it. The only signal is that something "feels off."

This post is the playbook I wish existed when I first had to respond to a production AI incident.
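One early step in that playbook is turning "feels off" into a pageable signal. As a deliberately toy sketch, a behavioral canary runs a fixed prompt with a known acceptable answer set on a schedule; `call_model` is a hypothetical provider wrapper, and real canaries would cover prompts drawn from your own traffic:

```python
# Behavioral canary sketch: a fixed prompt with a known-good answer set,
# run on a schedule so output drift can page someone like a 5xx spike does.
CANARY_PROMPT = "What is the capital of France? Answer with one word."
ACCEPTABLE = {"paris", "paris."}

def canary_check(call_model) -> bool:
    answer = call_model(CANARY_PROMPT, temperature=0.0).strip().lower()
    healthy = answer in ACCEPTABLE
    if not healthy:
        # Route to the same alerting path as an error-rate alarm.
        print(f"canary failed: got {answer!r}")
    return healthy
```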