
720 posts tagged with "llm"


The Jagged Frontier: Why AI Fails at Easy Things and What It Means for Your Product

· 10 min read
Tian Pan
Software Engineer

A common assumption in AI product development goes something like this: if a model can handle a hard task, it can definitely handle an easier one nearby. This assumption is wrong, and it's responsible for a category of production failures that no amount of benchmark reading prepares you for.

The research term for the underlying phenomenon is the "jagged frontier": AI's capability boundary is not a smooth line with easy tasks inside it and hard tasks outside it. It's a ragged, unpredictable shape. AI systems can write production-grade database query optimizers and still miscalculate whether two line segments on a diagram intersect. They can pass PhD-level science exams and fail children's riddles that involve spatial relationships. They can synthesize 50-page documents and then confidently hallucinate a summary of a paragraph they just read.

The Knowledge Contamination Problem: When Your RAG System Ignores Its Own Retrieval

· 8 min read
Tian Pan
Software Engineer

A team ships a RAG pipeline for internal documentation. Retrieval looks solid — the right passages come back. But in production, users keep getting stale answers. They dig into the logs and find the model is returning facts from its training data, not from the documents it was handed. The retrieval worked. The model just didn't use it.

This is the knowledge contamination problem: the model's parametric memory — the knowledge baked into its weights during training — overrides the retrieved context. It's quiet, it's confident, and it's one of the most common failure modes in production RAG systems.
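One way to make this failure measurable is a counterfactual probe: hand the model a passage containing a fact that differs from what it likely memorized, then check which version shows up in the answer. The sketch below is a minimal illustration, not a full eval harness. The probe format, the `ask` callable, and the stand-in `stale_model` are all assumptions made for this example, and substring matching is a deliberately crude scoring rule.

```python
from typing import Callable

# Hypothetical probe format: each probe plants a fact in the passage that
# differs from what a model is likely to remember from training.
Probe = dict[str, str]


def context_adherence_rate(
    ask: Callable[[str, str], str], probes: list[Probe]
) -> float:
    """Fraction of probes where the answer contains the planted fact."""
    if not probes:
        raise ValueError("need at least one probe")
    followed = sum(
        1
        for p in probes
        if p["planted_answer"].lower() in ask(p["question"], p["passage"]).lower()
    )
    return followed / len(probes)


probes = [
    {
        "question": "What is the default session timeout?",
        "passage": "The default session timeout is 45 minutes.",
        "planted_answer": "45 minutes",
    },
]


# Stand-in for a model that answers from memory and ignores the passage.
def stale_model(question: str, passage: str) -> str:
    return "The default session timeout is 30 minutes."


print(context_adherence_rate(stale_model, probes))  # 0.0
```

In practice `ask` would wrap your real RAG call, and a rate well below 1.0 on planted facts suggests parametric memory is winning over the retrieved context.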

Knowledge Cutoff Is a Silent Production Bug

· 11 min read
Tian Pan
Software Engineer

Most production AI failures are loud. The model returns a 5xx. The schema validation throws. The eval suite catches the regression before it ships. But there is a category of failure that is completely silent: no error is raised, no exception is thrown, and no alert fires, because the system is working exactly as designed. It is just working with a snapshot of reality from 18 months ago.

Your LLM has a knowledge cutoff. That cutoff is not a documentation footnote. It is a slowly widening gap between what your model believes to be true and what is actually true, and it compounds every day you keep the same model in production. Teams celebrate launch, then watch user trust quietly erode over the next six months as the world moves and the model stays still.

Live Web Grounding in Production: Why Calling a Search API Is Only the Beginning

· 10 min read
Tian Pan
Software Engineer

Most engineers discover the limits of live web grounding the same way: they wire up a search API in an afternoon, ship it to production, and spend the next three weeks explaining why the latency is six seconds, the answers are wrong about recent events, and users are occasionally getting directed to fake phone numbers.

The underlying assumption — that search-augmented LLMs are just "regular RAG but with fresh data" — is the source of most of the pain. Live web grounding shares almost nothing with static retrieval beyond the word "retrieval." It is a distributed systems problem wearing an NLP hat.

LLM-as-Annotator Quality Control: When the Labeler and Student Share Training Data

· 10 min read
Tian Pan
Software Engineer

The pipeline looks sensible on paper: you have a target task, no human-labeled examples, and a capable large model available. So you use that model to generate labels, then fine-tune a smaller model on those labels. Ship it, repeat.

The problem nobody talks about enough is what happens when your annotator model and your target model were trained on the same internet. Which, increasingly, they all were.

When LLMs Beat Rule-Based Systems for Data Normalization (And When They Don't)

· 11 min read
Tian Pan
Software Engineer

A team I know spent three months building a rule-based address normalizer. It handled the top twenty formats, used a USPS API for verification, and worked great on the data they'd seen. Then they got a new enterprise customer. The first week of data had addresses embedded in freeform notes fields, postal codes missing country prefixes, and cross-border formats their rules had never seen. The normalizer failed silently on 31% of records. They threw an LLM at it as a quick fix, expecting 80% accuracy. They got 94%. The surprise wasn't that the LLM worked — it was that nothing in their evaluation framework had predicted this.


This is the shape of the problem. Rule-based normalization is predictable, fast, and cheap. It works well when the data distribution stays in-bounds. LLMs handle the long tail — the weird formats, the implicit domain knowledge, the edge cases that rules never enumerate. But LLMs are also expensive, slow, and inconsistent in ways that break production pipelines if you're not careful. The right answer, for almost every team, is a hybrid that uses each approach on the inputs it's actually good at.
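A hybrid of this kind can be sketched as a router: try the cheap deterministic rule first, and hand only the inputs it cannot parse to the LLM. The example below is a minimal illustration using US ZIP codes rather than full addresses. The `llm_fallback` callable is a hypothetical stand-in for whatever model call you use, and the regex is an assumption for this example, not a complete postal-code rule.

```python
import re
from typing import Callable, Optional

# Rule: a bare 5-digit ZIP, optionally followed by a 4-digit extension.
US_ZIP = re.compile(r"^\s*(\d{5})(?:-(\d{4}))?\s*$")


def normalize_with_rules(raw: str) -> Optional[str]:
    """Return the normalized ZIP, or None when the rule does not apply."""
    m = US_ZIP.match(raw)
    if m is None:
        return None  # explicit miss instead of a silent wrong answer
    return m.group(1) if m.group(2) is None else f"{m.group(1)}-{m.group(2)}"


def normalize(raw: str, llm_fallback: Callable[[str], str]) -> tuple[str, str]:
    """Return (value, source) so the share handled by each path is measurable."""
    ruled = normalize_with_rules(raw)
    if ruled is not None:
        return ruled, "rules"
    return llm_fallback(raw), "llm"


# Stand-in for an LLM call; a real one would prompt a model with the raw text.
def fake_llm(raw: str) -> str:
    found = re.search(r"\d{5}", raw)
    return found.group(0) if found else ""


print(normalize(" 94107 ", fake_llm))               # ('94107', 'rules')
print(normalize("zip is 94107 I think", fake_llm))  # ('94107', 'llm')
```

Returning the source alongside the value is the design choice that matters here: it turns "the rules failed silently" into a number you can watch, and it tells you how much traffic is paying LLM prices.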

Why LLMs Make Confident Mistakes When Analyzing Your Product Data

· 11 min read
Tian Pan
Software Engineer

Product teams have started routing analytical questions directly to LLMs: "What's causing the churn spike?" "Why did conversion drop after the redesign?" "Which cohort should we focus retention spend on?" The outputs land in executive decks, drive roadmap decisions, and get presented to investors. The models answer confidently, in polished prose, with specific numbers. And a significant fraction of those answers are wrong in ways that don't announce themselves.

This isn't a general criticism of LLMs for data work. There are tasks where they genuinely help. The problem is that the failure modes are invisible — the model doesn't hedge, doesn't caveat, and doesn't distinguish between "I computed this from your data" and "I generated something that sounds like what this number should be." Practitioners who understand where the breakdowns happen can capture the genuine value and route around the landmines.

The LLM Provider Incident Runbook: Staying Up When Your AI Stack Goes Down

· 11 min read
Tian Pan
Software Engineer

In December 2024, OpenAI's entire platform went dark for over four hours. A new telemetry service had been deployed with a configuration that caused every node in a massive fleet to simultaneously hammer the Kubernetes API. DNS broke. The control plane buckled. Every service went with it. Recovery took so long partly because the team lacked what they later called "break-glass tooling" — pre-built emergency mechanisms they could reach for when normal procedures stopped working.

If you were running an AI-powered product that day, you were making decisions fast under pressure. Multi-provider routing? Graceful degradation? Cached responses? Or just a status page and a prayer?

This is the runbook you should have written before that call came in.
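The options listed above (multi-provider routing, cached responses, graceful degradation) can be chained into one fallback path. The sketch below is a minimal illustration under stated assumptions: `ProviderError`, the provider callables, and the in-memory `cache` dict are hypothetical names for this example, and a real system would add timeouts, retries, and cache expiry.

```python
from typing import Callable


class ProviderError(Exception):
    """Raised by a provider wrapper when the upstream call fails."""


def complete_with_fallback(
    prompt: str,
    providers: list[tuple[str, Callable[[str], str]]],
    cache: dict[str, str],
) -> tuple[str, str]:
    """Try providers in order, then the cache, then a degraded message."""
    for name, call in providers:
        try:
            text = call(prompt)
        except ProviderError:
            continue  # this provider is down; try the next one
        cache[prompt] = text  # remember the last good answer
        return text, name
    if prompt in cache:
        return cache[prompt], "cache"
    return "This feature is temporarily unavailable.", "degraded"


# Stand-ins for real provider clients.
def primary(prompt: str) -> str:
    raise ProviderError("simulated outage")


def secondary(prompt: str) -> str:
    return f"answer to: {prompt}"


cache: dict[str, str] = {}
print(complete_with_fallback("hello", [("primary", primary), ("secondary", secondary)], cache))
# ('answer to: hello', 'secondary')
```

The second element of the return value records which path served the request, which is the signal you want on a dashboard during an incident.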

LLM Rate Limits Are a Distributed Systems Problem

· 11 min read
Tian Pan
Software Engineer

Your AI product has two surfaces: a user-facing chat feature and a background report generation job. Both call the same LLM API under the same key. One afternoon, a support ticket arrives: "Chat responses are getting cut off halfway." No alerts fired. No 429s in the logs. The API was returning HTTP 200 the entire time.

What happened: the report generation job gradually consumed most of your shared token quota. Chat requests still completed, but only up to your max_tokens limit: semantically truncated, syntactically valid, silently wrong. Your standard monitoring never noticed because there was nothing to notice at the HTTP layer.

This is not an edge case. It is what happens when engineers treat LLM rate limits as a simple throttle problem instead of recognizing the class of distributed systems failure they actually are.
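Since the HTTP layer reports success, the signal has to come from the response body. The sketch below counts truncated completions per product surface. It assumes responses shaped like OpenAI's Chat Completions JSON, where `finish_reason` is `"length"` when generation stopped at the token limit. Other providers use different field names, and the surface labels and helper names here are hypothetical.

```python
from collections import Counter


def record_completion(stats: Counter, surface: str, response: dict) -> bool:
    """Update per-surface counters; return True if the reply was truncated."""
    finish = response["choices"][0].get("finish_reason")
    truncated = finish == "length"  # stopped at the token limit, not naturally
    stats[(surface, "total")] += 1
    if truncated:
        stats[(surface, "truncated")] += 1
    return truncated


def truncation_rate(stats: Counter, surface: str) -> float:
    """Share of completions on this surface that hit the token limit."""
    total = stats[(surface, "total")]
    return stats[(surface, "truncated")] / total if total else 0.0


stats: Counter = Counter()
record_completion(stats, "chat", {"choices": [{"finish_reason": "stop"}]})
record_completion(stats, "chat", {"choices": [{"finish_reason": "length"}]})
print(truncation_rate(stats, "chat"))  # 0.5
```

Alerting on a rising truncation rate for the chat surface would have caught the scenario above even though every request returned HTTP 200.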

The Hidden Switching Costs of LLM Vendor Lock-In

· 11 min read
Tian Pan
Software Engineer

Most engineering teams believe they've insulated themselves from LLM vendor lock-in. They use LiteLLM to unify API calls. They avoid fine-tuning on hosted platforms. They keep raw data in their own storage. They feel safe. Then a provider announces a deprecation — or a competitor's pricing drops 40% — and the team discovers that the abstraction layer they built handles roughly 20% of the actual switching cost.

The other 80% is buried in places no one looked: system prompts written around a model's formatting quirks, eval suites calibrated to one model's refusal thresholds, embedding indexes that become incompatible the moment you change models, and user expectations shaped by behavioral patterns that simply don't transfer.

The Compression Decision: Quantization, Distillation, and On-Device Inference for Latency-Critical AI Features

· 10 min read
Tian Pan
Software Engineer

Model routing is the first optimization most teams reach for. Route simple queries to a small cheap model, complex ones to a large capable model. It works well for managing cost and throughput. What it cannot fix is the wall you hit when the physics of cloud inference collide with a latency requirement of 100ms or less. A network round-trip from a mid-tier data center already consumes 30–80ms before a single token is generated. At that point, routing is irrelevant — you need to either run the model closer to the user or run a substantially smaller model. Both paths require compression decisions that most teams approach without a framework.

This is a guide for making those decisions. The three techniques — quantization, knowledge distillation, and on-device deployment — solve overlapping problems but have very different cost structures, quality profiles, and operational consequences.
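The latency wall is easiest to see as arithmetic. The sketch below uses only the figures quoted above (a 100ms target and a 30–80ms round-trip) and computes what is left for inference. The function name is an assumption for this example.

```python
def inference_budget_ms(target_ms: float, round_trip_ms: float) -> float:
    """Time left for the model after the network round-trip is paid."""
    return max(target_ms - round_trip_ms, 0.0)


for rtt in (30.0, 80.0):
    print(f"round-trip {rtt:.0f}ms leaves {inference_budget_ms(100.0, rtt):.0f}ms")
# round-trip 30ms leaves 70ms
# round-trip 80ms leaves 20ms
```

With 20–70ms left for the entire forward pass and decoding, choosing a different cloud model changes little. Only removing the round-trip or shrinking the model moves the number.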

Multi-Region LLM Serving: The Cache Locality Problem Nobody Warns You About

· 10 min read
Tian Pan
Software Engineer

When you run a stateless HTTP API across multiple regions, the routing problem is essentially solved. Put a global load balancer in front, distribute requests by geography, and the worst thing that happens is a slightly stale cache entry. Any replica can serve any request with identical results.

LLM inference breaks every one of these assumptions. The moment you add prompt caching — which you will, because the cost difference between a cache hit and a cache miss is roughly 10x — your service becomes stateful in ways that most infrastructure teams don't anticipate until they're staring at degraded latency numbers in their second region.
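One common response to this kind of statefulness is affinity routing: send requests that share a prompt prefix to the same region so they land where the cache is already warm. The sketch below uses rendezvous (highest-random-weight) hashing, which is a technique chosen for this illustration rather than something the post prescribes. The region names and the idea of keying on the system prompt are assumptions.

```python
import hashlib


def pick_region(prompt_prefix: str, regions: list[str]) -> str:
    """Choose a stable region for a prompt prefix via rendezvous hashing."""
    if not regions:
        raise ValueError("need at least one region")

    def score(region: str) -> int:
        digest = hashlib.sha256(f"{region}|{prompt_prefix}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    # Each region gets a pseudo-random score for this prefix; highest wins.
    return max(regions, key=score)


regions = ["us-east", "eu-west", "ap-south"]  # hypothetical region names
system_prompt = "You are a support assistant for Acme."
print(pick_region(system_prompt, regions))
```

The choice is independent of list order, and removing a region only remaps the prefixes that were assigned to it. The trade-off is that a cache-friendly region is not always the geographically nearest one, so a real router would weigh the expected cache saving against the extra network latency.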