
567 posts tagged with "llm"


The Alignment Tax: Measuring the Real Cost of Shipping Safe AI

· 9 min read
Tian Pan
Software Engineer

Teams building production AI systems tend to discover the alignment tax the same way: someone files a latency complaint, someone else traces it to the moderation pipeline, and suddenly a previously invisible cost line becomes very visible. By that point, the safety layers have been stacked — refusal classifier, output filter, toxicity scorer, human-in-the-loop queue — and nobody measured any of them individually. Unpicking them is painful, expensive, and politically fraught because now it looks like you're arguing against safety.

The better path is to treat safety overhead as a first-class engineering metric from day one. The alignment tax is real, it's measurable, and it compounds. A 150ms guardrail check sounds fine until you chain three of them together in an agentic workflow and wonder why your 95th-percentile latency is at four seconds.
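To make that compounding visible, it helps to time each safety stage separately so its share of the latency budget shows up on a dashboard instead of inside a black box. A minimal sketch, assuming a simple in-process pipeline; the stage names and checks are hypothetical placeholders, not a real moderation library:

```python
import time
from contextlib import contextmanager

# Hypothetical per-stage latency accounting for a safety pipeline.
stage_timings: dict[str, list[float]] = {}

@contextmanager
def timed_stage(name: str):
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_timings.setdefault(name, []).append(time.perf_counter() - start)

def moderate(request_text: str) -> bool:
    with timed_stage("refusal_classifier"):
        refused = len(request_text) == 0        # placeholder check
    with timed_stage("toxicity_scorer"):
        toxic = "badword" in request_text       # placeholder check
    return not (refused or toxic)

moderate("summarize this contract")
for stage, samples in stage_timings.items():
    p95 = sorted(samples)[int(0.95 * (len(samples) - 1))]
    print(f"{stage}: p95 = {p95 * 1000:.1f} ms over {len(samples)} calls")
```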

API Contracts for Non-Deterministic Services: Versioning When Output Shape Is Stochastic

· 9 min read
Tian Pan
Software Engineer

Your content moderation service returns {"severity": "MEDIUM", "confidence": 0.85}. The downstream billing system parses severity as an enum with values ["LOW", "MEDIUM", "HIGH"]. A model update causes the service to occasionally return "Medium" in mixed case. No deployment happened. No schema changed. The integration breaks in production, and nobody catches it for six days because the HTTP status codes are all 200.

This is the foundational problem with API contracts for LLM-backed services: the surface looks like a REST API, but the behavior underneath is probabilistic. Standard contract tooling assumes determinism. When that assumption breaks, it breaks silently.
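One narrow mitigation is to validate and normalize at the boundary and fail loudly on anything outside the contract, rather than letting a differently cased value flow downstream as a 200. A sketch in plain Python, using the field names from the example above; the validator is illustrative, not a specific library's API:

```python
ALLOWED_SEVERITIES = {"low", "medium", "high"}

def parse_moderation_response(payload: dict) -> dict:
    """Normalize a stochastic upstream response into the strict downstream contract."""
    raw = payload.get("severity")
    if not isinstance(raw, str) or raw.lower() not in ALLOWED_SEVERITIES:
        # Fail loudly: an out-of-contract value should page someone,
        # not flow into billing as a 200.
        raise ValueError(f"severity outside contract: {raw!r}")
    confidence = payload.get("confidence")
    if not isinstance(confidence, (int, float)) or not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence outside contract: {confidence!r}")
    return {"severity": raw.lower(), "confidence": float(confidence)}

# "Medium" and "MEDIUM" both normalize; "urgent" raises instead of breaking billing silently.
print(parse_moderation_response({"severity": "Medium", "confidence": 0.85}))
```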

API Design for AI-Powered Endpoints: Versioning the Unpredictable

· 8 min read
Tian Pan
Software Engineer

Your /v1/summarize endpoint worked perfectly for eighteen months. Then you upgraded the underlying model. The output format didn't change. The JSON schema was identical. But your downstream consumers started filing bugs: the summaries were "too casual," the bullet points were "weirdly specific," the refusals on edge cases were "different." Nothing broke in the traditional sense. Everything broke in the AI sense.

This is the versioning problem that REST and GraphQL were never designed to solve. Traditional API contracts assume determinism: the same input always produces the same output. An AI endpoint's contract is probabilistic — it includes tone, reasoning style, output length distribution, and refusal thresholds, all of which can drift when you swap or update the underlying model. The techniques that work for database-backed APIs are necessary but not sufficient for AI-backed ones.
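One pragmatic response is to version the behavior explicitly: return the model identifier, sampling parameters, and prompt revision alongside the payload so consumers can tell when the probabilistic part of the contract changed even though the schema did not. A sketch with hypothetical field names and an illustrative model identifier:

```python
# Hypothetical response envelope for an AI-backed endpoint: the schema of
# "data" stays stable, while "behavior_version" captures everything that can
# shift tone, length, or refusal behavior without a schema change.
def summarize_response(summary: str) -> dict:
    return {
        "data": {"summary": summary},
        "behavior_version": {
            "model": "gpt-4o-2024-08-06",        # illustrative identifier
            "temperature": 0.2,
            "prompt_revision": "summarize-v7",   # hypothetical internal prompt tag
        },
    }
```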

Behavioral SLAs for AI-Powered APIs: Writing Contracts for Non-Deterministic Outputs

· 10 min read
Tian Pan
Software Engineer

Your payment service has a 99.9% uptime SLA. Requests either succeed or fail with a documented error code. When something breaks, you know exactly what broke.

Now imagine you've shipped a smart invoice-parsing API that wraps an LLM. One Monday morning, your largest customer calls: "Your API returned a valid JSON object, but the total_amount field is off by a factor of ten on invoices with foreign currencies." Your service returned HTTP 200. Your uptime dashboard is green. By every traditional SLA metric, you didn't break anything. But you absolutely broke something — and you have no contractual language to even describe what went wrong.

This is the gap at the center of most AI API deployments today. The contract that governs what your API promises was written for deterministic systems, and LLMs are not deterministic systems.
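A behavioral SLA needs a metric that HTTP status codes cannot give you: a pass rate over sampled outputs checked against invariants the business actually cares about. A sketch built around the invoice-parsing example above; the tolerance and field names are illustrative:

```python
from dataclasses import dataclass

@dataclass
class ParsedInvoice:
    total_amount: float
    line_item_amounts: list[float]
    currency: str

def passes_behavioral_check(invoice: ParsedInvoice, tolerance: float = 0.01) -> bool:
    """Cheap invariant: the extracted total should equal the sum of line items.

    Catches magnitude errors (like a 10x slip on foreign-currency invoices)
    that an HTTP 200 and a valid JSON schema never will.
    """
    expected = sum(invoice.line_item_amounts)
    return abs(invoice.total_amount - expected) <= tolerance * max(expected, 1.0)

# Track the pass rate over sampled traffic and alert on it like an SLO.
sample = [
    ParsedInvoice(total_amount=1043.50, line_item_amounts=[1000.0, 43.5], currency="EUR"),
    ParsedInvoice(total_amount=104.35, line_item_amounts=[1000.0, 43.5], currency="JPY"),  # 10x slip
]
pass_rate = sum(passes_behavioral_check(inv) for inv in sample) / len(sample)
print(f"behavioral SLO pass rate: {pass_rate:.0%}")
```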

Browser-Native LLM Inference: The WebGPU Engineering You Didn't Know You Needed

· 10 min read
Tian Pan
Software Engineer

Most AI features are architected the same way: user input travels to an API, a cloud GPU processes it, and a response travels back. That round trip is so normalized that engineers rarely question it. But it carries a hidden tax: 200–800ms of network latency on every interaction, an API key that must live somewhere accessible (and therefore vulnerable), and a hard dependency on uptime you don't control.

Browser-native LLM inference via WebGPU breaks all three of those assumptions. The model runs on the user's GPU, inside a browser sandbox, with no network round-trip. This isn't a future capability — as of late 2025, WebGPU ships by default across Chrome, Firefox, Edge, and Safari, covering roughly 82.7% of global browser traffic. The engineering question has shifted from "can we do this?" to "when does it beat the cloud, and how do we route intelligently between the two?"

The Confidence-Accuracy Inversion: Why LLMs Are Most Wrong Where They Sound Most Sure

· 9 min read
Tian Pan
Software Engineer

There is a pattern that keeps appearing in production AI deployments, and it runs directly counter to user intuition. When a model says "I'm not sure," users tend to double-check. When a model answers confidently, they tend to trust it. The problem is that frontier LLMs are systematically most confident in exactly the domains where they are most likely to be wrong.

This isn't a fringe failure mode. Models asked to generate 99% confidence intervals on estimation tasks only cover the truth approximately 65% of the time. Expected Calibration Error (ECE) values across major production models range from 0.108 to 0.726 — substantial miscalibration, and measurably worse in high-stakes vertical domains like medicine, law, and finance. The dangerous part isn't the inaccuracy itself; it's the inversion: the same models that show reasonable calibration on general knowledge tasks become confidently, systematically wrong on the tasks where being wrong has real consequences.
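Expected Calibration Error is cheap to compute once you log a confidence score next to whether the answer turned out to be right. A minimal sketch of the standard binned formulation:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """ECE: weighted average gap between stated confidence and observed accuracy.

    confidences: per-answer confidence in [0, 1]
    correct: 1 if the answer was right, else 0
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        gap = abs(correct[mask].mean() - confidences[mask].mean())
        ece += (mask.sum() / len(confidences)) * gap
    return ece

# A toy overconfident model: says 0.95 on everything, right only 60% of the time.
print(expected_calibration_error([0.95] * 10, [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]))  # 0.35
```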

The Copyright Exposure in AI-Generated Content: A Risk Framework for Engineering Teams

· 10 min read
Tian Pan
Software Engineer

GPT-4 reproduced exact passages from books in 43% of test prompts when asked to continue a given excerpt. In one 2025 study, researchers extracted nearly an entire book near-verbatim from a production LLM — no jailbreaking required, just a persistent prefix-feeding loop. If your product generates content using a language model, the copyright exposure is not a future risk. It is happening in your users' sessions today, and you probably have no instrumentation to catch it.

This is not primarily a legal article. It's an engineering article about a legal problem that engineering decisions either create or contain. Lawyers will tell you what constitutes infringement. This framework tells you where your system leaks, how to measure it, and what actually reduces risk versus what only looks like it does.
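Instrumentation for "where your system leaks" can start very simply: compare long n-grams of generated output against a corpus of protected text and flag high overlap for review. A sketch; the reference corpus, the n-gram length, and the threshold are all assumptions to tune, not legal standards:

```python
def ngram_set(text: str, n: int = 8) -> set[tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def verbatim_overlap(generated: str, reference: str, n: int = 8) -> float:
    """Fraction of the generation's n-grams that appear verbatim in the reference.

    A crude but cheap signal: long shared n-grams are what near-verbatim
    reproduction looks like, and they rarely occur by chance at n >= 8.
    """
    gen = ngram_set(generated, n)
    if not gen:
        return 0.0
    return len(gen & ngram_set(reference, n)) / len(gen)

generated = "it was the best of times it was the worst of times indeed"
protected = "it was the best of times it was the worst of times"
if verbatim_overlap(generated, protected) > 0.2:   # 0.2 cutoff is illustrative
    print("flag for review: possible near-verbatim reproduction")
```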

Cultural Calibration for Global AI Products: Why Translation Is 10% of the Problem

· 9 min read
Tian Pan
Software Engineer

There is a quiet failure mode baked into almost every globally deployed AI product. An engineer localizes the UI strings, runs the model outputs through a translation API, has a native speaker spot-check a handful of responses, and ships. The product is technically multilingual. It is not culturally competent. Users in Tokyo, Riyadh, and Chengdu receive outputs that are grammatically correct and culturally wrong — responses that signal disrespect, confusion, or distrust in ways the team will never see in aggregate metrics.

The research is unambiguous: every major LLM tested reflects the worldview of English-speaking, Protestant European societies. Studies testing models against representative data from 107 countries found not a single model that aligned with how people in Africa, Latin America, or the Middle East build trust, show respect, or resolve conflict. Translation patches the surface. The underlying calibration remains Western.

Deadline Propagation in Agent Chains: What Happens to Your p95 SLO at Hop Three

· 10 min read
Tian Pan
Software Engineer

Most engineers building multi-step agent pipelines discover the same problem about two weeks after their first production incident: they set a 5-second timeout on their API gateway, their agent pipeline has four hops, and the system behaves as though there is no timeout at all. The agent at hop three doesn't know the upstream caller gave up three seconds ago. It keeps running, keeps calling tools, keeps generating tokens—and the user is already gone.

This isn't a configuration mistake. It's a structural problem. Latency constraints don't propagate across agent boundaries by default, and none of the major orchestration frameworks make deadline propagation easy. The result is a class of failures that look like latency problems but are actually context-propagation problems.
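The usual fix is to attach an absolute deadline to the request and have every hop check its remaining budget before doing work, passing the smaller of that budget and a per-hop cap down as its own timeout. A sketch with hypothetical hop names:

```python
import time

class DeadlineExceeded(Exception):
    pass

def remaining_budget(deadline: float) -> float:
    """Seconds left before the upstream caller gives up (absolute monotonic deadline)."""
    return deadline - time.monotonic()

def run_hop(name: str, deadline: float, work_seconds: float) -> None:
    budget = remaining_budget(deadline)
    if budget <= 0:
        # The caller is already gone; stop instead of burning tokens at hop three.
        raise DeadlineExceeded(f"{name}: no budget left")
    # In a real pipeline, pass min(budget, per-hop cap) down as the tool/LLM timeout.
    time.sleep(min(work_seconds, budget))

# Gateway sets one absolute deadline; every hop receives and re-checks it.
deadline = time.monotonic() + 5.0
try:
    for hop, cost in [("plan", 1.5), ("retrieve", 2.0), ("generate", 2.5), ("verify", 1.0)]:
        run_hop(hop, deadline, cost)
except DeadlineExceeded as exc:
    print(exc)   # prints "verify: no budget left"; the hop stops instead of running blind
```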

The Deprecated API Trap: Why AI Coding Agents Break on Library Updates

· 10 min read
Tian Pan
Software Engineer

Your AI coding agent just generated a pull request. The code looks right. It compiles. Tests pass. You merge it. Two days later, your CI pipeline in staging starts throwing AttributeError: module 'openai' has no attribute 'ChatCompletion'. The agent used an API pattern that was deprecated a year ago and removed in the latest major version.

This is the deprecated API trap, and it bites teams far more often than the conference talks about AI code quality suggest. An empirical study evaluating seven frontier LLMs across 145 API mappings found that most models exhibit API Usage Plausibility (AUP) below 30% across popular Python libraries. When explicitly given deprecated context, all tested models demonstrated 70–90% deprecated usage rates. The problem is structural, not a quirk of a particular model or library.
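The error above is a concrete instance: the agent emitted the pre-1.0 openai-python pattern, which no longer exists in current releases. Assuming the v1+ client style, the same call looks like this:

```python
# Pattern the agent generated (removed in openai-python >= 1.0):
#   response = openai.ChatCompletion.create(
#       model="gpt-4", messages=[{"role": "user", "content": "hi"}]
#   )

# Current pattern for the same call (openai-python v1+ client style):
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "hi"}],
)
print(response.choices[0].message.content)
```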

Document AI in Production: Why PDF Demos Lie and Production Pipelines Don't

· 11 min read
Tian Pan
Software Engineer

A clean PDF, a capable LLM, and thirty lines of code. The demo works. You extract the invoice total, the contract dates, the patient diagnosis. Stakeholders are impressed. Then you push to production, and within a week the pipeline is silently returning wrong data on 15% of documents — and nobody knows.

This is the document AI trap. The failure mode isn't a crash or an exception; it's a pipeline that reports success while producing garbage. Building production document extraction is a fundamentally different problem from building a demo, and most teams don't realize this until they've already shipped.
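The countermeasure is to make silent failures loud: check extracted fields against invariants and route anything suspicious to human review instead of trusting that valid JSON means valid data. A sketch with hypothetical field names and illustrative invariants:

```python
from datetime import date

def validate_extraction(fields: dict) -> list[str]:
    """Return a list of human-readable problems; empty means the extraction passes.

    These invariants are illustrative: the point is that "the model returned
    JSON" is not the same signal as "the data is right".
    """
    problems = []
    total = fields.get("invoice_total")
    items = fields.get("line_item_totals", [])
    if not isinstance(total, (int, float)) or total < 0:
        problems.append("invoice_total missing or negative")
    elif items and abs(total - sum(items)) > 0.01 * max(sum(items), 1.0):
        problems.append("invoice_total does not match line items")
    issued, due = fields.get("issue_date"), fields.get("due_date")
    if isinstance(issued, date) and isinstance(due, date) and due < issued:
        problems.append("due_date precedes issue_date")
    return problems

extracted = {"invoice_total": 1435.0, "line_item_totals": [1000.0, 43.5],
             "issue_date": date(2025, 3, 1), "due_date": date(2025, 2, 1)}
for problem in validate_extraction(extracted):
    print("route to human review:", problem)
```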

The Edge Inference Decision Framework: When to Run AI Models Locally Instead of in the Cloud

· 12 min read
Tian Pan
Software Engineer

Most teams make the cloud-vs-edge decision by gut instinct: cloud is easier, so they default to cloud. Then a HIPAA audit hits, or the latency SLO slips by 400ms, or the monthly invoice arrives. Only then do they ask whether some of that inference should have been local all along.

The answer is almost never "all cloud" or "all edge." The teams running production AI at scale have settled on a tiered architecture: an on-device or on-premise model handles the majority of requests, and a cloud frontier model catches what the smaller model can't. Getting that routing right is an engineering decision, not an intuition.

This is the decision framework for making it rigorously.
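A sketch of what that tiered routing can look like, with placeholder models and an illustrative confidence threshold; the escalation triggers here (data residency, confidence floor, prompt size) are assumptions to replace with your own:

```python
from dataclasses import dataclass

@dataclass
class Answer:
    text: str
    confidence: float   # hypothetical self-reported or verifier-derived score

def local_model(prompt: str) -> Answer:
    # Placeholder for an on-device / on-premise model call.
    return Answer(text="local draft", confidence=0.62)

def cloud_model(prompt: str) -> Answer:
    # Placeholder for a frontier-model API call.
    return Answer(text="cloud answer", confidence=0.95)

def route(prompt: str, contains_phi: bool, confidence_floor: float = 0.7) -> Answer:
    """Tiered routing: prefer the edge tier, escalate only when it isn't enough."""
    answer = local_model(prompt)
    if contains_phi:
        return answer           # data-residency constraint: never leave the device
    if answer.confidence >= confidence_floor and len(prompt) < 4000:
        return answer           # edge tier is good enough and cheap
    return cloud_model(prompt)  # escalate to the frontier model

print(route("summarize this visit note", contains_phi=True).text)
```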