
484 posts tagged with "llm"


The Latency Perception Gap: Why a 3-Second Stream Feels Faster Than a 1-Second Batch

· 11 min read
Tian Pan
Software Engineer

Your users don't have a stopwatch. They have feelings. And those feelings diverge from wall-clock reality in ways that matter enormously for how you build AI interfaces. A response that appears character-by-character over three seconds will consistently feel faster to users than a response that materializes all at once after one second — even though the batch system is objectively faster. This isn't irrational or a bug in human cognition. It's a well-documented perceptual phenomenon, and if you're building AI products without accounting for it, you're optimizing for the wrong metric.

This post breaks down the psychology behind latency perception, the metrics that actually predict user satisfaction, the frontend patterns that exploit these perceptual quirks, and when streaming adds more complexity than it's worth.
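To ground the metric choice, here is a minimal sketch of measuring time-to-first-token against total latency. The `token_iter` argument is a hypothetical placeholder for whatever streaming client you use; the point is that the two numbers diverge in exactly the way user perception does.

```python
import time

def measure_stream(token_iter):
    """Measure time-to-first-token (TTFT) and total latency for a streamed response.

    `token_iter` is a hypothetical placeholder: any iterable that yields text
    chunks as the model produces them (swap in your provider's streaming client).
    """
    start = time.monotonic()
    ttft = None
    chunks = []
    for chunk in token_iter:
        if ttft is None:
            ttft = time.monotonic() - start  # when the user first sees something
        chunks.append(chunk)
    return {"ttft_s": ttft, "total_s": time.monotonic() - start, "text": "".join(chunks)}
```

In these terms, the post's argument is that `ttft_s` tracks perceived speed far better than `total_s` does: a one-second batch response has a TTFT of one second, while a three-second stream can start showing output in a fraction of that.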

Your Model Is Most Wrong When It Sounds Most Sure: LLM Calibration in Production

· 9 min read
Tian Pan
Software Engineer

There's a failure mode that bites teams repeatedly after they've solved the easier problems: hallucination filtering, output parsing, retry logic. The model gives confident-sounding wrong answers, the confidence-based routing logic trusts them, and the system silently misbehaves in production while the eval dashboard looks fine.

This isn't a prompting problem. It's a calibration problem, and it's baked into how modern LLMs are trained.
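As a concrete illustration of the failure surface (not the post's implementation), here is a minimal sketch of confidence-based routing that gates on mean token log-probability; the threshold and routing targets are hypothetical. Calibration is the question of whether this score means what the code assumes it means.

```python
import math

def route(answer: str, token_logprobs: list[float], threshold: float = 0.85):
    """Route on the model's self-reported confidence (geometric-mean token probability).

    Hypothetical sketch: if the model is miscalibrated, this confidence can be
    highest on some of the wrong answers, and the threshold silently stops
    meaning what it was tuned to mean.
    """
    confidence = math.exp(sum(token_logprobs) / len(token_logprobs))
    return ("auto", answer) if confidence >= threshold else ("human_review", answer)
```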

LLM Cost Forecasting Before You Ship: The Estimation Problem Most Teams Skip

· 9 min read
Tian Pan
Software Engineer

A team ships a support chatbot. In testing, the monthly bill looks manageable—a few hundred dollars across the engineering team's demo sessions. Three weeks into production, the invoice arrives: $47,000. Nobody had lied about the token counts. Nobody had made an arithmetic error. The production workload was simply a different animal than anything they'd simulated.

This pattern repeats constantly. Teams estimate LLM costs the way they estimate database query costs—by measuring a representative request and multiplying by expected volume. That mental model breaks badly for LLMs, because the two biggest cost drivers (output token length and tool-call overhead) are determined at inference time by behavior you cannot fully predict at design time.

This post is about how to forecast better before you ship, not how to optimize after the bill arrives.
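As a sketch of the alternative, assume you have logged (input_tokens, output_tokens) pairs from realistic traffic and some per-token prices (the prices below are placeholders, not any provider's real rates). Forecasting from the distribution rather than from one representative request makes the tail, where the surprise invoice lives, visible before launch.

```python
PRICE_IN_PER_1K = 0.003    # placeholder input price, $ per 1K tokens
PRICE_OUT_PER_1K = 0.015   # placeholder output price, $ per 1K tokens

def forecast_monthly(samples, requests_per_month):
    """samples: (input_tokens, output_tokens) pairs measured from realistic traffic."""
    costs = sorted(
        inp / 1000 * PRICE_IN_PER_1K + out / 1000 * PRICE_OUT_PER_1K
        for inp, out in samples
    )
    mean = sum(costs) / len(costs)
    p95 = costs[int(0.95 * (len(costs) - 1))]          # the tail that drives the bill
    return {
        "monthly_at_mean": mean * requests_per_month,
        "monthly_if_p95_typical": p95 * requests_per_month,
    }
```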

Model Migration as Database Migration: Safely Switching LLM Providers Without Breaking Production

· 10 min read
Tian Pan
Software Engineer

When your team decides to upgrade from Claude 3.5 Sonnet to Claude 3.7, or migrate from OpenAI to a self-hosted Llama deployment, the instinct is to treat it like a library upgrade: change the API key, update the model name string, run a quick sanity check, and ship. This instinct is wrong, and the teams that follow it discover why at 2 AM in week two when a customer support agent starts producing responses in a completely different format — technically valid, semantically disastrous.

Switching LLM providers or model versions is structurally identical to a database schema migration. Both involve changing the behavior of a system that the rest of your application has implicit contracts with. Both can look fine on day one and fail catastrophically on day ten. Both require dual-running, canary deployment, rollback criteria, and a migration playbook — not a config change followed by a Slack message.
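To make "dual-running" concrete, here is a minimal shadow-traffic sketch; `call_model`, `log_divergence`, and `log_shadow_failure` are hypothetical helpers standing in for your client and observability stack.

```python
# call_model / log_divergence / log_shadow_failure are hypothetical helpers.
def handle_request(prompt, current_model, candidate_model):
    primary = call_model(current_model, prompt)        # what the user actually sees
    try:
        shadow = call_model(candidate_model, prompt)   # never returned to the user
        log_divergence(prompt, primary, shadow)        # feeds the go/no-go dashboard
    except Exception as exc:
        log_shadow_failure(prompt, exc)                # candidate failures are data, not incidents
    return primary
```

Rollout then becomes a decision made against measured divergence and pre-agreed rollback criteria, not a config change followed by a Slack message.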

LLM-Powered Data Migrations: What Actually Works at Scale

· 10 min read
Tian Pan
Software Engineer

The pitch is compelling: feed your legacy records into an LLM, describe the target schema, and let the model figure out the mapping. No hand-written parsers, no months of transformation logic, no domain expert bottlenecks. Teams have run this and gotten to 70–97% accuracy in a fraction of the time it would take traditional ETL. The problem is that the remaining 3–30% of failures don't look like failures. They look like correct data.

That asymmetry—where wrong outputs are structurally valid and plausible—is what makes LLM-powered data migrations genuinely dangerous without the right validation architecture. This post covers what the teams that have done this successfully actually built: when LLMs earn their place in the pipeline, where they silently break, and the validation layer that catches errors traditional tools cannot.
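A minimal sketch of that validation layer, assuming Pydantic v2 and an invented `CustomerRecord` target schema: structural validation catches the easy failures, and source-tied invariants catch the dangerous ones, the records that are well-formed but wrong.

```python
from pydantic import BaseModel, ValidationError

class CustomerRecord(BaseModel):            # hypothetical target schema
    customer_id: str
    email: str
    lifetime_value_cents: int

def validate(llm_output_json: str, source_row: dict) -> tuple[bool, str]:
    # Layer 1: structural validity -- necessary, nowhere near sufficient.
    try:
        rec = CustomerRecord.model_validate_json(llm_output_json)
    except ValidationError as exc:
        return False, f"schema: {exc}"
    # Layer 2: invariants that tie the output back to the source record.
    if rec.customer_id != source_row["legacy_id"]:
        return False, "identity drift: customer_id does not match source"
    if rec.lifetime_value_cents < 0:
        return False, "negative lifetime value"
    return True, "ok"
```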

What Model Cards Don't Tell You: The Production Gap Between Published Benchmarks and Real Workloads

· 9 min read
Tian Pan
Software Engineer

A model card says 89% accuracy on code generation. Your team gets 28% on the actual codebase. A model card says 100K token context window. Performance craters at 32K under your document workload. A model card passes red-team safety evaluation. A prompt injection exploit ships to your users within 72 hours of launch.

This gap isn't rare. It's the norm. In a 2025 analysis of 1,200 production deployments, 42% of companies abandoned their AI initiatives at the production integration stage — up from 17% the previous year. Most of them had read the model cards carefully.

The problem isn't that model cards lie. It's that they measure something different from what you need to know. Understanding that gap precisely — and building the internal benchmark suite to close it — is what separates teams that ship reliable AI from teams that ship regrets.
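A minimal sketch of what an internal benchmark suite means in practice, with hypothetical `generate` and `grade` functions: score each release against your own recorded workloads and trend the number, instead of trusting the headline on the model card.

```python
import json

def run_suite(cases_path: str, model: str) -> float:
    """cases_path: JSONL of {"input": ..., "expected": ...} drawn from real workloads."""
    with open(cases_path) as f:
        cases = [json.loads(line) for line in f]
    # generate(...) and grade(...) are hypothetical: call your model, score the output 0 or 1.
    passed = sum(grade(generate(model, c["input"]), c["expected"]) for c in cases)
    return passed / len(cases)   # track per model version, per release
```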

The Model Portability Tax: How to Architect AI Systems You Can Actually Migrate

· 9 min read
Tian Pan
Software Engineer

You inherited an AI feature built on GPT-4-turbo. The model is being deprecated. Your manager wants to cut costs by switching to a newer, cheaper model. You run a quick test, metrics look passable, you ship it — and a week later, accuracy on your core use case drops 22%. Support tickets climb. You're now in a crisis migration rather than a planned one.

This is the model portability tax: the hidden engineering cost that accumulates every time you couple your application logic tightly to a specific foundation model. Every team pays it. Most don't realize how large the bill has gotten until the invoice arrives.
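One common way to keep the tax bounded (a sketch, not a prescription) is to route every call through a thin internal interface, so a model swap touches one adapter rather than every call site; the names below are invented.

```python
from typing import Protocol

class CompletionProvider(Protocol):          # illustrative internal interface
    def complete(self, prompt: str, max_tokens: int) -> str: ...

class VendorAdapter:
    """Wraps one vendor's SDK; prompt templates and model quirks live here,
    not in application code."""
    def complete(self, prompt: str, max_tokens: int) -> str:
        raise NotImplementedError            # vendor-specific call goes here

def summarize_ticket(ticket: str, provider: CompletionProvider) -> str:
    return provider.complete(f"Summarize this support ticket:\n{ticket}", max_tokens=300)
```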

The Multilingual Quality Cliff: Why Your LLM Works Great in English and Quietly Fails Everyone Else

· 10 min read
Tian Pan
Software Engineer

Your LLM passes every eval you throw at it. Latency is solid, accuracy looks fine, and the team ships with confidence. Then a user in Cairo files a bug: the structured extraction returns malformed JSON. A developer in Seoul notices the assistant ignores complex instructions after a few turns. A product manager in Mumbai realizes the chatbot's summarization is just wrong—subtly, consistently, wrong.

None of this showed up in your benchmarks because your benchmarks are in English.

This is the multilingual quality cliff: a performance drop that is steep, systematic, and almost universally invisible to teams that ship AI products. The gap isn't marginal. In long multi-turn conversations, Arabic and Korean users see accuracy around 40.8% on tasks where English users are at 54.8%, a 14-point gap that compounds with every additional turn. For structured editing tasks the gap widens further: accuracy falls to 32–37% while English performance stays acceptable. The users feel this. Your dashboards don't.
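The fix starts with measurement. A minimal sketch, assuming each eval case carries a `lang` tag and a hypothetical `run_case` returns 0 or 1: slice accuracy by language so the cliff shows up on a dashboard before it shows up in bug reports from Cairo, Seoul, and Mumbai.

```python
from collections import defaultdict

def accuracy_by_language(cases):
    totals, correct = defaultdict(int), defaultdict(int)
    for case in cases:
        totals[case["lang"]] += 1
        correct[case["lang"]] += run_case(case)   # hypothetical: 1 if passed, else 0
    return {lang: correct[lang] / totals[lang] for lang in totals}
```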

The ORM Impedance Mismatch for AI Agents: Why Your Data Layer Is the Real Bottleneck

· 9 min read
Tian Pan
Software Engineer

Most teams building AI agents spend weeks tuning prompts and evals, benchmarking model choices, and tweaking temperature — while their actual bottleneck sits one layer below: the data access layer that was designed for human developers, not agents.

The mismatch isn't subtle. ORMs like Hibernate, SQLAlchemy, and Prisma, combined with REST APIs that return paginated, single-entity responses, produce data access patterns that are exactly wrong for autonomous AI agents. The result is token waste, rate-limit failures, cascading N+1 database queries, and agents that hallucinate simply because they can't afford to load the context they need.

This post is about the structural problem — and what an agent-optimized data layer actually looks like.
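To illustrate the shape of the problem (all data-access functions below are hypothetical), compare the per-entity pattern an ORM-backed REST API pushes an agent into with a single agent-facing query that joins server-side and respects a token budget.

```python
# get_order / get_item / fetch_orders_with_items are hypothetical data-access functions.

def agent_unfriendly(order_ids):
    # N+1 pattern: one round trip per order, then one per line item.
    orders = []
    for oid in order_ids:
        order = get_order(oid)
        order["items"] = [get_item(item_id) for item_id in order["item_ids"]]
        orders.append(order)
    return orders

def agent_friendly(order_ids, token_budget=2_000):
    # One server-side join, only the fields the task needs, trimmed to budget.
    return fetch_orders_with_items(
        order_ids,
        fields=["id", "status", "items.sku", "items.qty"],
        token_budget=token_budget,
    )
```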

The Precision-Recall Tradeoff Hiding Inside Your AI Safety Filter

· 10 min read
Tian Pan
Software Engineer

When teams deploy an AI safety filter, the conversation almost always centers on what it catches. Did it block the jailbreak? Does it flag hate speech? Can it detect prompt injection? These are the right questions for recall. They are almost never paired with the equally important question: what does it block that it shouldn't?

The answer is usually: a lot. And because most teams ship with the vendor's default threshold and never instrument false positives in production, they don't find out until users start complaining—or until they stop complaining, because they stopped using the product.
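Instrumenting the other side of the tradeoff is not exotic. A minimal sketch, assuming you have filter scores and human labels for a sample of real traffic: sweep the threshold and look at precision and recall together before accepting the vendor default.

```python
def precision_recall(scores, labels, threshold):
    """scores: filter scores per request; labels: 1 if the request was genuinely unsafe."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 1.0   # of everything blocked, how much was truly unsafe
    recall = tp / (tp + fn) if (tp + fn) else 1.0      # of everything unsafe, how much was blocked
    return precision, recall
```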

Privacy-Preserving Inference in Practice: The Spectrum Between Cloud APIs and On-Prem

· 9 min read
Tian Pan
Software Engineer

Most teams treat LLM privacy as a binary: either you send data to the cloud and accept the risk, or you run everything on-prem and accept the cost. Both framings are wrong. In practice, there is a spectrum of approaches with very different risk profiles and engineering budgets — and most teams are operating at the wrong point on that spectrum without realizing it.

Researchers recently demonstrated they could extract authentic PII from 3,912 individuals at a cost of $0.012 per record with a 48.9% success rate. That statistic tends to get dismissed as academic threat modeling until a security audit or compliance review lands on your desk. The question isn't whether to care about LLM privacy; it's which controls actually move the needle and how much each one costs to implement.
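One midpoint on that spectrum, sketched below with deliberately naive placeholder patterns, is client-side redaction: strip obvious identifiers before the prompt leaves your infrastructure and keep the mapping local. It is not a complete control, but it changes what a cloud provider, or anyone extracting data from one, can ever see.

```python
import re

PATTERNS = {                       # placeholder patterns, deliberately simplistic
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text: str) -> tuple[str, dict[str, str]]:
    """Replace matches with tokens; keep the token-to-value map on your side only."""
    mapping = {}
    for label, pattern in PATTERNS.items():
        for i, value in enumerate(pattern.findall(text)):
            token = f"<{label}_{i}>"
            mapping[token] = value
            text = text.replace(value, token)
    return text, mapping
```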

The Production Distribution Gap: Why Your Internal Testers Can't Find the Bugs Users Do

· 11 min read
Tian Pan
Software Engineer

Your AI feature passed internal testing with flying colors. Engineers loved it, product managers gave the thumbs up, and the eval suite showed 94% accuracy on the benchmarks. Then you shipped it, and within two weeks users were hitting failure modes you'd never seen: wrong answers, confused outputs, edge cases that made the model look embarrassingly bad.

This is the production distribution gap. It's not a new problem, but it's dramatically worse for AI systems than for deterministic software. Understanding why — and having a concrete plan to address it — is the difference between an AI feature that quietly erodes user trust and one that improves with use.