Skip to main content

702 posts tagged with "llm"

View all tags

Foundation Model Vendor Strategy: What Enterprise SLAs Actually Guarantee

· 12 min read
Tian Pan
Software Engineer

Enterprise teams pick LLM vendors based on benchmarks and demos. Then they hit production and discover what the SLA actually says — which is usually much less than they assumed. The 99.9% uptime guarantee you negotiated doesn't cover latency. The data processing agreement your legal team signed doesn't prohibit training on your inputs unless you explicitly added that clause. And the vendor concentration risk that nobody quantified becomes painfully obvious when your core product is down for four hours because a telemetry deployment cascaded through a Kubernetes control plane.

This is not a procurement problem. It's an engineering problem that procurement can't solve alone. The people who build AI systems need to understand what these contracts actually say — and what they don't.

The Evaluation Paradox: How Goodhart's Law Breaks AI Benchmarks

· 10 min read
Tian Pan
Software Engineer

In late 2024, OpenAI's o3 system scored 75.7% on the ARC-AGI benchmark — a test specifically designed to resist optimization. The AI research community celebrated. Then practitioners looked closer: o3 had been trained on 75% of the benchmark's public training set, and the highest-compute configuration used 172 times more resources than the baseline. It wasn't a capability breakthrough dressed up as a score. It was a score dressed up as a capability breakthrough.

This is the evaluation paradox. The moment a benchmark becomes the thing teams optimize for, it stops measuring what it was designed to measure. Goodhart's Law — "when a measure becomes a target, it ceases to be a good measure" — was articulated in 1970s economic policy, but it describes AI benchmarking with eerie precision.

Hallucination Is Not a Root Cause: A Debugging Methodology for AI in Production

· 10 min read
Tian Pan
Software Engineer

When a lawyer cited non-existent court cases in a federal filing, the incident was widely reported as "ChatGPT hallucinated." When a consulting firm's government report contained phantom footnotes, the postmortem read "AI fabricated citations." When a healthcare transcription tool inserted violent language into medical notes, the explanation was simply "the model hallucinated." In each case, an expensive failure got a three-word root cause that made remediation impossible.

"The model hallucinated" is the AI equivalent of writing "unknown error" in a stack trace. It describes what happened without telling you why it happened or how to fix it. Every hallucination has a diagnosable cause — usually one of four categories — and each category demands a different engineering response. Teams that understand this distinction ship AI systems that degrade gracefully. Teams that don't keep playing whack-a-mole with prompts.

The Inference Optimization Trap: Why Making One Model Faster Can Slow Down Your System

· 9 min read
Tian Pan
Software Engineer

You swap your expensive LLM for a faster, cheaper distilled model. Latency goes up. Costs increase. Quality degrades. You roll back, confused, having just spent three weeks on optimization work that made everything worse.

This isn't a hypothetical. It's one of the most common failure modes in production AI systems, and it stems from a seductive but wrong mental model: that optimizing a component optimizes the system.

What Your Inference Provider Is Hiding From You: KV Cache, Batching, and the Latency Floor

· 11 min read
Tian Pan
Software Engineer

You're running an LLM-powered application and your p99 latency is 4 seconds. You've tuned your prompts, reduced output length, and switched to streaming. The number barely moves. The problem is not your code — it's physics and queuing theory operating inside a black box you don't own.

Every inference provider makes dozens of architectural decisions that determine your application's performance ceiling before your first API call. KV cache eviction policy, continuous batching schedules, chunked prefill chunk size — none of this is in the docs, none of it is configurable by you, and all of it shapes the latency and cost curve you're stuck with.

This post explains what's actually happening inside inference infrastructure, why it creates an unavoidable latency floor, and the handful of things you can actually do about it.

Invisible Model Drift: How Silent Provider Updates Break Production AI

· 10 min read
Tian Pan
Software Engineer

Your prompts worked on Monday. On Wednesday, users start complaining that responses feel off — answers are shorter, the JSON parsing downstream is breaking intermittently, the classifier that had been 94% accurate is now hovering around 79%. You haven't deployed anything. The model you're calling still has the same name in your config. But something changed.

This is invisible model drift: the silent, undocumented behavior changes that LLM providers push without announcement. It is one of the least-discussed operational hazards in AI engineering, and it hits teams that have done everything "right" — with evals, with monitoring, with stable prompt engineering. The model just changed underneath them.

Knowledge Distillation Without Fine-Tuning: Extracting Frontier Model Capabilities Into Cheaper Inference Paths

· 10 min read
Tian Pan
Software Engineer

A 770-million-parameter model beating a 540-billion-parameter model at its own task sounds impossible. But that is exactly what distilled T5 models achieved against few-shot PaLM—using only 80% of the training examples, a 700x size reduction, and inference that costs a fraction of a cent per call instead of dollars. The trick wasn't a better architecture or a cleverer training recipe. It was generating labeled data from the big model and training the small one on it.

This is knowledge distillation. And you do not need to fine-tune the teacher to make it work.

The Latent Capability Ceiling: When a Bigger Model Won't Fix Your Problem

· 10 min read
Tian Pan
Software Engineer

There is a pattern that plays out on almost every AI project that runs long enough. The team builds a prototype, the demo looks good, but in production the outputs aren't consistent enough. Someone suggests switching to the latest frontier model — GPT-4o instead of GPT-3.5, Claude Opus instead of Sonnet, Gemini Ultra instead of Pro. Sometimes it helps. Eventually it stops helping. The team finds themselves paying 5–10x more per inference, latency has doubled, and the task accuracy is still 78% instead of the 90% they need.

This is the latent capability ceiling: the point at which the raw scale of the language model you're using is no longer the limiting factor. It's a real phenomenon backed by empirical data, and most teams hit it without recognizing it — because the reflex to "use a bigger model" is cheap, fast, and often works early in a project.

The Idempotency Crisis: LLM Agents as Event Stream Consumers

· 11 min read
Tian Pan
Software Engineer

Every event streaming system eventually delivers the same message twice. Network hiccups, broker restarts, offset commit failures — at-least-once delivery is not a bug; it's the contract. Traditional consumers handle this gracefully because they're deterministic: process the same event twice, get the same result, write the same record. The second write is a no-op.

LLMs are not deterministic processors. The same prompt with the same input produces different outputs on each run. Even with temperature=0, floating-point arithmetic, batch composition effects, and hardware scheduling variations introduce variance. Research measuring "deterministic" LLM settings found accuracy differences up to 15% across naturally occurring runs, with best-to-worst performance gaps reaching 70%. At-least-once delivery plus a non-deterministic processor does not give you at-most-once behavior. It gives you unpredictable behavior — and that's a crisis waiting to happen in production.

LLM-Powered Data Pipelines: The ETL Tier Nobody Benchmarks

· 10 min read
Tian Pan
Software Engineer

Most conversations about LLMs in production orbit around chat interfaces, copilots, and autonomous agents. But if you audit where enterprise LLM tokens are actually being consumed, a different picture emerges: a quiet majority of usage is happening inside batch data pipelines — extracting fields from documents, classifying support tickets, normalizing messy vendor records, enriching raw events with semantic labels. Nobody is writing conference talks about this tier. Nobody is benchmarking it seriously either. And that silence is costing teams real money and real accuracy.

This is the ETL tier that practitioners build first, justify last, and monitor least. It is also, for most organizations, the layer where LLM spend has the highest leverage — and the highest potential for invisible failure.

LLM Vendor Lock-In Is a Spectrum, Not a Binary

· 10 min read
Tian Pan
Software Engineer

A team builds a production feature on GPT-4. Months later, they decide to evaluate Claude for cost reasons. They spend two weeks "migrating"—but the core API swap takes an afternoon. The remaining ten days go toward fixing broken system prompts, re-testing refusal edge cases, debugging JSON parsers that choke on unexpected prose, and re-tuning tool-calling schemas that behave differently across providers. Migration estimates that assumed a simple connector swap balloon into a multi-layer rebuild.

This is the LLM vendor lock-in problem in practice. And the teams that get burned aren't the ones who chose the wrong provider—they're the ones who didn't recognize that lock-in exists on multiple axes, each with a different risk profile.

Long-Session Context Degradation: How Multi-Turn Conversations Go Stale

· 8 min read
Tian Pan
Software Engineer

The first time a user's 80-turn support conversation suddenly started contradicting advice given 60 turns ago, the team blamed a bug. There was no bug. The model was simply lost. Across all major frontier models, multi-turn conversations show an average 39% performance drop compared to single-turn interactions on the same tasks. Most teams never measure this. They assume context windows are roughly as powerful as their token limit suggests, and they build products accordingly.

That assumption is quietly wrong. Long sessions don't just get slower or more expensive — they get unreliable in ways that are nearly impossible to notice until users are already frustrated.