720 posts tagged with "llm"

Hot-Path vs. Cold-Path AI: The Architectural Decision That Decides Your p99

April 16, 2026 · 10 min read

Software Engineer

Every AI feature you ship makes an architectural choice before it makes a product one: does this model call live inside the user's request, or does it run somewhere the user isn't waiting for it? The choice is usually made by whoever writes the first prototype, never revisited, and silently determines your p99 latency for the rest of the feature's life. When the post-mortem asks why a shipping dashboard became unusable at 10 a.m. every Monday, the answer is almost always that something which should have been cold-path got welded into the hot path — and a model that is fine at p50 becomes catastrophic at p99 when traffic fans out.

The hot-path / cold-path distinction is older than LLMs. CQRS, streaming architectures, lambda architectures — they all draw the same line between "must respond now" and "can arrive eventually." What's different about AI workloads is that the cost of crossing the line in the wrong direction is an order of magnitude higher than it used to be. A synchronous database query that takes 50 ms turning into 200 ms is a regression. A synchronous LLM call that takes 1.2 s at p50 turning into 11 s at p99 is a business decision you didn't know you made.
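As a rough illustration of moving a model call off the hot path, the sketch below has the request handler enqueue the work and return immediately, while a background worker makes the slow call. Everything in it is an assumption for illustration: `fake_model_call` stands in for a real LLM client, and an in-process `asyncio.Queue` stands in for whatever durable job queue a production system would use.

```python
import asyncio


async def fake_model_call(text: str) -> str:
    """Hypothetical stand-in for a slow LLM call."""
    await asyncio.sleep(0.01)
    return text.upper()


async def worker(queue: asyncio.Queue, results: dict) -> None:
    """Cold path: drains the queue while no user is waiting on the result."""
    while True:
        job_id, text = await queue.get()
        results[job_id] = await fake_model_call(text)
        queue.task_done()


async def handle_request(queue: asyncio.Queue, job_id: str, text: str) -> dict:
    """Hot path: enqueue and acknowledge; never await the model here."""
    queue.put_nowait((job_id, text))
    return {"job_id": job_id, "status": "accepted"}


async def demo():
    queue: asyncio.Queue = asyncio.Queue()
    results: dict = {}
    worker_task = asyncio.create_task(worker(queue, results))

    response = await handle_request(queue, "j1", "summarize this ticket")

    await queue.join()  # wait for the cold path to finish (demo only)
    worker_task.cancel()
    await asyncio.gather(worker_task, return_exceptions=True)
    return response, results
```

The handler's latency is now bounded by the enqueue, not by the model's p99; the cost is that the caller has to fetch or be notified of the result later.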

The Implicit API Contract: What Your LLM Provider Doesn't Document

April 16, 2026 · 10 min read

Tian Pan

Software Engineer

Your LLM provider's SLA covers HTTP uptime and Time to First Token. It says nothing about whether the model will still follow your formatting instructions next month, refuse requests it accepted last week, or return valid JSON under edge-case conditions you haven't tested. Most engineering teams discover this the hard way — via a production incident, not a changelog.

This is the implicit API contract problem. Traditional APIs promise stable, documented behavior. LLM providers promise a connection. Everything between the request and what your application does with the response is on you.

The Intent Classification Layer Most Agent Routers Skip

April 16, 2026 · 11 min read

Tian Pan

Software Engineer

When you hand your agent a list of 50 tools and let the LLM decide which one to call, accuracy hovers around 94%. Reasonable. Ship it. But when that list grows to 200 tools, which happens faster than anyone expects, accuracy drops to 64%. At 417 tools it hits 20%. At 741 tools it falls to 13.6%: still better than picking one of 741 tools uniformly at random (about 0.13%), but far too low to build a product on.

The fix is a pattern that most teams skip: an intent classification layer that runs before tool dispatch. Not instead of the LLM—before it. The classifier narrows the tool namespace so that the LLM only sees the tools relevant to the user's actual intent. The LLM's reasoning stays intact; it just operates on a curated, relevant subset rather than an ever-expanding haystack.

This post explains why teams skip it, what the cost looks like when they do, and how to build the layer properly—including the feedback loop that makes it compound over time.
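A minimal sketch of the idea, under loud assumptions: the keyword table below is a hypothetical stand-in for a real intent classifier, and the `Tool` type and intent labels are invented for illustration. The point is only the shape of the layer: classify first, then hand the LLM the matching subset.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Tool:
    name: str
    intent: str  # hypothetical label assigned when the tool is registered


# Hypothetical keyword rules; a real layer might use a small trained classifier.
INTENT_KEYWORDS = {
    "billing": ("invoice", "refund", "charge"),
    "calendar": ("meeting", "schedule"),
}


def classify_intent(query: str) -> Optional[str]:
    """Return the first intent whose keywords appear in the query, else None."""
    text = query.lower()
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return intent
    return None


def tools_for_query(query: str, tools: List[Tool]) -> List[Tool]:
    """Narrow the tool list before the LLM sees it."""
    intent = classify_intent(query)
    if intent is None:
        return tools  # unknown intent: fall back to the full list
    return [tool for tool in tools if tool.intent == intent]
```

Falling back to the full list on an unknown intent keeps the layer from making things worse than the baseline when the classifier has nothing to say.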

Keeping Synthetic Eval Data Honest

April 16, 2026 · 9 min read

Tian Pan

Software Engineer

A safety model scored 85.3% accuracy on its public benchmark test set. When researchers tested it on novel adversarial prompts not derived from public datasets, that number dropped to 33.8%. The model hadn't learned to reason about safety. It had learned to recognize the evaluation distribution.

This is the problem at the center of synthetic eval data: when the same model family generates both your training data and your test cases, passing the eval means conforming to a shared statistical prior—not demonstrating actual capability. It's a feedback loop that looks like quality assurance until production traffic arrives and the numbers don't hold.

The failure is structural, not incidental. And fixing it requires more than adding more synthetic examples.

Knowledge Graphs as a RAG Alternative: When Structured Retrieval Beats Embeddings

April 16, 2026 · 9 min read

Tian Pan

Software Engineer

Most RAG implementations fail in exactly the same way: the vector search retrieves something plausible but not what the user actually needed, the LLM wraps it in confident prose, and the user gets an answer that's approximately right but specifically wrong. The frustrating part is that the failure mode is invisible — cosine similarity scores look fine, the retrieved passages mention the right topics, but the answer is still wrong because the question required reasoning across relationships, not just semantic proximity.

Vector embeddings are excellent at one thing: finding text that sounds like your query. That's a powerful capability, and it covers an enormous range of production use cases. But it breaks predictably when the question depends on how entities connect to each other rather than how closely their descriptions match. For those queries, a knowledge graph — a property graph you traverse with Cypher or SPARQL — is not an optimization. It's a fundamentally different kind of retrieval that solves a different class of problem.
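To make the contrast concrete, here is a toy sketch of relationship-based retrieval. It uses a plain in-memory list of triples and a breadth-first search rather than a real graph database, Cypher, or SPARQL; the entities and relation names are invented for illustration.

```python
from collections import deque

# Hypothetical (subject, relation, object) triples.
TRIPLES = [
    ("alice", "manages", "bob"),
    ("bob", "owns", "billing-service"),
    ("billing-service", "depends_on", "postgres"),
    ("carol", "owns", "search-service"),
]


def find_path(triples, start, goal):
    """Breadth-first search for a chain of relations from start to goal.

    Returns the list of triples along the shortest path, or None.
    """
    graph = {}
    for subject, relation, obj in triples:
        graph.setdefault(subject, []).append((relation, obj))

    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for relation, neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, path + [(node, relation, neighbor)]))
    return None
```

A question like "how is alice connected to postgres?" is answered by the chain of edges itself. None of the three facts on that path needs to sound like the query, which is exactly the case where similarity search tends to miss.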

LLM Confidence Calibration in Production: Measuring and Fixing the Overconfidence Problem

April 16, 2026 · 10 min read

Tian Pan

Software Engineer

Your model says "I'm highly confident" and is wrong 40% of the time. That's not a hallucination — that's a calibration failure, and it's a harder problem to detect, measure, and fix in production.

Hallucination gets all the press. But overconfident wrong answers are often more dangerous: the model produces a plausible, fluent response with high expressed confidence, and there is no signal to the downstream consumer that anything is wrong. Hallucination detectors, RAG grounding checks, and fact-verification pipelines all help with fabricated content. They do almost nothing for the scenario where the model knows a fact but has systematically miscalibrated beliefs about how certain it is.

Most teams shipping LLM-powered features treat confidence as an afterthought. This post covers why calibration fails, how to measure it, and the production patterns that actually move the metric.
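One common way to put a number on miscalibration is expected calibration error (ECE): bucket predictions by stated confidence, then take the weighted average gap between mean confidence and observed accuracy in each bucket. The sketch below assumes you already have a numeric confidence in [0, 1] for each answer and a correctness label; how you obtain those is outside its scope.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins over [0, 1].

    confidences: floats in [0, 1]; correct: booleans of the same length.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0

    bins = [[] for _ in range(n_bins)]
    for confidence, is_correct in zip(confidences, correct):
        # A confidence of exactly 1.0 belongs in the last bin.
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, is_correct))

    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_confidence = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(mean_confidence - accuracy)
    return ece
```

For the opening scenario, ten answers each stated at 0.9 confidence with four of them wrong land in a single bucket with mean confidence 0.9 and accuracy 0.6, so the ECE is about 0.3.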

The Provider Abstraction Tax: Building LLM Applications That Can Swap Models Without Rewrites

April 16, 2026 · 10 min read

Tian Pan

Software Engineer

A healthcare startup migrated from one major frontier model to a newer version of the same provider's offering. The result: 400+ engineering hours to restore feature parity. The new model emitted five times as many tokens per response, eliminating projected cost savings. It started offering unsolicited diagnostic opinions—a liability problem. And it broke every JSON parser downstream because it wrapped responses in markdown code fences. Same provider, different model, total rewrite.

This is the provider abstraction tax: not the cost of switching providers, but the cumulative cost of not planning for it. It is not a single migration event. It is an ongoing drain—the behavioral regressions you discover three weeks after an upgrade, the prompt engineering work that does not transfer across models, the retry logic that silently fails because one provider measures rate limits by input tokens separately from output tokens. Teams that build directly on a single provider accumulate this debt invisibly, until a deprecation notice or a pricing change makes the bill come due all at once.
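One small piece of that planning can be sketched directly: a parsing step that tolerates the markdown-fence wrapping described above instead of assuming bare JSON. This is an illustrative helper, not any provider's API; the name `parse_model_json` is hypothetical.

```python
import json
import re

FENCE = "`" * 3  # a markdown code fence, built this way to keep the example readable

# Matches a response that is entirely one fenced block, with an optional language tag.
_FENCED = re.compile(
    r"^" + FENCE + r"[\w-]*[ \t]*\n(.*?)\n?" + FENCE + r"$",
    re.DOTALL,
)


def parse_model_json(raw: str):
    """Parse JSON from a model response, unwrapping a markdown fence if present.

    Raises json.JSONDecodeError if the payload is still not valid JSON.
    """
    text = raw.strip()
    match = _FENCED.match(text)
    if match:
        text = match.group(1)
    return json.loads(text)
```

Keeping this behind one function means a model upgrade that changes output wrapping touches a single call site rather than every parser downstream.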

LLMs in the Security Operations Center: Acceleration Without Liability

April 16, 2026 · 11 min read

Tian Pan

Software Engineer

A senior analyst I respect described her team's first six months with an LLM-powered triage agent like this: "It made the easy alerts disappear, and made the hard ones harder to trust." The phrase has stayed with me because it captures the actual shape of the trade. AI in the security operations center is not a productivity story. It is a confidence calibration story, and most teams are getting the calibration wrong in the same direction.

The seductive version goes: drop a model in front of the alert queue, let it cluster duplicates, summarize raw events, and auto-close obvious noise. The MTTR graph drops. The pager quiets. The Tier-1 backlog evaporates. The version that actually gets you breached goes: the model confidently mis-attributes a real intrusion as a benign backup job, and a tired analyst — told that "the AI already triaged this, it's clean" — never opens the case. The first version is real. So is the second. They are the same system viewed at different confidence levels.

The max_tokens Knob Nobody Tunes: Output Truncation as a Cost Lever

April 16, 2026 · 11 min read

Tian Pan

Software Engineer

Look at the max_tokens parameter on every LLM call in your codebase. If you're like most teams, it's either unset, set to the model's maximum, or set to some round number like 4096 that someone picked six months ago and nobody has touched since. It's the one budget knob in your API request that's staring you in the face, and it's silently paying for slack you never use.

Output tokens cost roughly four times what input tokens cost on the median commercial model, and as much as eight times on the expensive end. The economics of the generation step are completely lopsided: every unused token of headroom you leave in max_tokens is a token you might pay for, and every token you generate extends your p50 latency linearly because decoding is sequential. Yet most production systems treat this parameter as a safety valve — set it high, forget it, move on.
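A back-of-the-envelope sketch of tuning the knob, assuming you log output token counts per call: pick a cap from a high percentile of observed lengths plus some headroom, then compare worst-case output spend before and after. The function names, the headroom value, and the price used below are illustrative assumptions, not any provider's real numbers.

```python
def suggest_max_tokens(observed_output_tokens, percentile_pct=99, headroom=32):
    """Nearest-rank percentile of observed output lengths, plus fixed headroom."""
    if not observed_output_tokens:
        raise ValueError("need at least one observation")
    ordered = sorted(observed_output_tokens)
    # Integer ceiling of percentile_pct * n / 100, avoiding float rounding.
    rank = max(1, -(-percentile_pct * len(ordered) // 100))
    return ordered[rank - 1] + headroom


def worst_case_output_cost(max_tokens, price_per_million_output_tokens):
    """Upper bound on output spend for one call that runs all the way to the cap."""
    return max_tokens * price_per_million_output_tokens / 1_000_000
```

With observed lengths of 1 to 100 tokens, the suggested cap is 131. At a hypothetical price of 10.00 per million output tokens, the worst case per call falls from about 0.041 at a cap of 4096 to about 0.0013. A cap set this way will truncate the rare longer response, so it should be paired with a check on the stop reason.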

Your AI Feature Should Lose to a Regex First

April 16, 2026 · 9 min read

Tian Pan

Software Engineer

A team spends three weeks integrating a foundation model to classify incoming support tickets into routing categories. The model reaches 87% accuracy in testing. They ship it. Six months later, an engineer notices that 70% of tickets contain a product name in the subject line and that a simple lookup table would have handled those with 99% accuracy. The LLM is only needed for the hard 30%; on the other 70% it is doing work a lookup table would handle.

This is not an unusual story. It happens because teams treat "use an LLM" as the first implementation choice rather than the last. The fix is a required gate: your AI feature must lose to a dumb rule before you are allowed to build the AI version.
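The gate can be as small as the sketch below: a lookup table keyed on product names, plus a function that reports how much labeled traffic the rule covers and how accurate it is there. The product names and routing categories are made up for illustration.

```python
import re

# Hypothetical product-name -> routing-category table.
PRODUCT_ROUTES = {
    "widgetpro": "hardware",
    "cloudsync": "cloud",
}

_PRODUCT = re.compile(
    r"\b(" + "|".join(re.escape(name) for name in PRODUCT_ROUTES) + r")\b",
    re.IGNORECASE,
)


def route_by_rule(subject: str):
    """Return a category if the subject names a known product, else None."""
    match = _PRODUCT.search(subject)
    return PRODUCT_ROUTES[match.group(1).lower()] if match else None


def rule_baseline(labeled_tickets):
    """Return (coverage, accuracy on covered tickets) for (subject, label) pairs."""
    covered = correct = 0
    for subject, label in labeled_tickets:
        predicted = route_by_rule(subject)
        if predicted is None:
            continue
        covered += 1
        correct += predicted == label
    coverage = covered / len(labeled_tickets) if labeled_tickets else 0.0
    accuracy = correct / covered if covered else 0.0
    return coverage, accuracy
```

Tickets where `route_by_rule` returns `None` are the ones left for a model to handle, and the two numbers from `rule_baseline` are the bar the AI version has to clear.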

The Model EOL Clock: Treating Provider LLMs as External Dependencies

April 16, 2026 · 11 min read

Tian Pan

Software Engineer

In January 2026, OpenAI retired several GPT models from ChatGPT with two weeks' notice — weeks after its CEO had publicly promised "plenty of notice" following an earlier backlash. For teams that had built workflows around those models, the announcement arrived like a pager alert on a Friday afternoon. The API remained unaffected that time. But it won't always.

Every model you're currently calling has a deprecation date. Some of those dates are already listed on your provider's documentation page. Others haven't been announced yet. The operational question isn't whether your production model will be retired — it's whether you'll find out in time to handle it gracefully, or scramble to migrate after users start seeing failures.
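One way to treat this like any other dependency is a scheduled check that compares the models you call against known retirement dates. The sketch below assumes you maintain that table yourself from your provider's deprecation page; the model names and dates are placeholders, not real announcements.

```python
from datetime import date

# Hypothetical retirement dates, maintained by hand from provider documentation.
RETIREMENT_DATES = {
    "example-model-v1": date(2026, 9, 1),
    "example-model-v2": date(2027, 6, 1),
}


def models_needing_attention(models_in_use, today, warn_days=90):
    """Split in-use models into those retiring soon and those with no known date.

    Returns (retiring_soon, unknown), where retiring_soon is a list of
    (model, days_left) pairs sorted by urgency.
    """
    retiring_soon = []
    unknown = []
    for model in models_in_use:
        retirement = RETIREMENT_DATES.get(model)
        if retirement is None:
            unknown.append(model)
            continue
        days_left = (retirement - today).days
        if days_left <= warn_days:
            retiring_soon.append((model, days_left))
    retiring_soon.sort(key=lambda pair: pair[1])
    return retiring_soon, unknown
```

Surfacing the `unknown` list matters as much as the warnings: a model with no listed date has not been promised a long life, only an unannounced one.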

Model Routing Is a System Design Problem, Not a Config Option

April 16, 2026 · 11 min read

Tian Pan

Software Engineer

Most teams choose their LLM the way they choose a database engine: once, during architecture review, and never again. You pick GPT-4o or Claude 3.5 Sonnet, bake it into your config, and ship. The choice feels irreversible because changing it requires a redeployment, coordination across services, and regression testing against whatever your evals look like this week.

That framing is a mistake. Your traffic is not homogeneous. A "summarize this document" request and a "debug this cryptic stack trace" request hitting the same endpoint at the same time have radically different capability requirements — but with static model selection, they're indistinguishable from your infrastructure's perspective. You're either over-provisioning one or under-serving the other, and you're doing it on every single request.

Model routing treats LLM selection as a runtime dispatch decision. Every incoming query gets evaluated on signals that predict the right model for that specific request, and the call is dispatched accordingly. The routing layer doesn't exist in your config file — it runs in your request path.
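In its simplest form, a router in the request path might look like the sketch below. The signals (stack-trace markers, input length), the threshold, and the model names are all hypothetical placeholders; a real router would presumably be driven by signals validated against your own evals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str
    reason: str


SMALL_MODEL = "small-model"  # hypothetical cheaper, faster model
LARGE_MODEL = "large-model"  # hypothetical more capable model

# Hypothetical markers suggesting the request is a debugging task.
STACK_TRACE_MARKERS = (
    "Traceback (most recent call last)",
    "Exception in thread",
    "panic:",
)


def route(query: str, long_input_chars: int = 8000) -> Route:
    """Choose a model per request from cheap, inspectable signals."""
    if any(marker in query for marker in STACK_TRACE_MARKERS):
        return Route(LARGE_MODEL, "stack trace detected")
    if len(query) > long_input_chars:
        return Route(LARGE_MODEL, "long input")
    return Route(SMALL_MODEL, "default")
```

Returning the reason alongside the model makes each routing decision loggable, which is what lets you audit the router later against actual outcomes.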
