Skip to main content

44 posts tagged with "engineering"

View all tags

Sampling Parameters in Production: The Tuning Decisions Nobody Explains

· 11 min read
Tian Pan
Software Engineer

Most engineers treat LLM quality regressions as a prompt engineering problem or a model capability problem. They rewrite system prompts, try a newer model, or add few-shot examples. They rarely check the three numbers sitting silently at the top of every API call: temperature, top-p, and top-k. But those defaults are shape-shifting every response your model produces, and tuning them wrong causes output variance that teams blame on the model for months before realizing the culprit was a configuration value they never touched.

This isn't an introductory explainer. If you're running LLMs in production—for extraction pipelines, code generation, summarization, or any output that feeds into real systems—these are the mechanics and tradeoffs you need to understand before you can tune intelligently.

The Accessibility Gap in AI Interfaces Nobody Is Shipping Around

· 8 min read
Tian Pan
Software Engineer

Most AI teams run accessibility audits on their landing pages. Almost none run them on the chat interface itself. The gap isn't laziness — it's that the tools don't exist. WCAG 2.2 has no success criterion for streaming content, no standard for non-deterministic outputs, and no guidance for token-by-token delivery. Which means every AI product streaming responses into a <div> right now is operating in a compliance grey zone while breaking the experience for a significant portion of its users.

This isn't a minor edge case. Blind and low-vision users report information-seeking as their top AI use case. Users with dyslexia, ADHD, and cognitive disabilities are actively trying to use AI tools to reduce reading load — and the default implementation pattern actively makes things worse for them.

AI Code Review at Scale: When Your Bot Creates More Work Than It Saves

· 10 min read
Tian Pan
Software Engineer

Most teams that adopt an AI code reviewer go through the same arc: initial excitement, a burst of flagged issues that feel useful, then a slow drift toward ignoring the bot entirely. Within a few months, engineers have developed a muscle memory for dismissing AI comments without reading them. The tool still runs. The comments still appear. Nobody acts on them anymore.

This is not a tooling problem. It is a measurement problem. Teams deploy AI code review without ever defining what "net positive" looks like — and without that baseline, alert fatigue wins.

API Contracts for Non-Deterministic Services: Versioning When Output Shape Is Stochastic

· 9 min read
Tian Pan
Software Engineer

Your content moderation service returns {"severity": "MEDIUM", "confidence": 0.85}. The downstream billing system parses severity as an enum with values ["low", "medium", "high"]. A model update causes the service to occasionally return "Medium" with a capital M. No deployment happened. No schema changed. The integration breaks in production, and nobody catches it for six days because the HTTP status codes are all 200.

This is the foundational problem with API contracts for LLM-backed services: the surface looks like a REST API, but the behavior underneath is probabilistic. Standard contract tooling assumes determinism. When that assumption breaks, it breaks silently.

Pricing AI Features: The Unit Economics Framework Engineering Teams Always Skip

· 11 min read
Tian Pan
Software Engineer

Cursor hit 1billioninrevenuein2025andlost1 billion in revenue in 2025 and lost 150 million doing it. Every dollar customers paid went straight to LLM API providers, with nothing left for engineering, support, or infrastructure overhead. This wasn't a scaling problem—it was a unit economics problem that was invisible until it was catastrophic.

Most engineering teams building AI features make the same mistake: they treat inference cost as a minor line item, ship a flat-rate subscription, and assume the economics will work out later. They don't. Variable inference costs don't behave like any other COGS in software, and the pricing architectures that work for traditional SaaS will bleed you dry the moment your heaviest users find your most expensive feature.

The Provider Abstraction Tax: Building LLM Applications That Can Swap Models Without Rewrites

· 10 min read
Tian Pan
Software Engineer

A healthcare startup migrated from one major frontier model to a newer version of the same provider's offering. The result: 400+ engineering hours to restore feature parity. The new model emitted five times as many tokens per response, eliminating projected cost savings. It started offering unsolicited diagnostic opinions—a liability problem. And it broke every JSON parser downstream because it wrapped responses in markdown code fences. Same provider, different model, total rewrite.

This is the provider abstraction tax: not the cost of switching providers, but the cumulative cost of not planning for it. It is not a single migration event. It is an ongoing drain—the behavioral regressions you discover three weeks after an upgrade, the prompt engineering work that does not transfer across models, the retry logic that silently fails because one provider measures rate limits by input tokens separately from output tokens. Teams that build directly on a single provider accumulate this debt invisibly, until a deprecation notice or a pricing change makes the bill come due all at once.

Your AI Feature Should Lose to a Regex First

· 9 min read
Tian Pan
Software Engineer

A team spends three weeks integrating a foundation model to classify incoming support tickets into routing categories. The model reaches 87% accuracy in testing. They ship it. Six months later, an engineer notices that 70% of tickets contain a product name in the subject line and that a simple lookup table would have handled those with 99% accuracy. The LLM is running on the hard 30% and making it up the rest of the time.

This is not an unusual story. It happens because teams treat "use an LLM" as the first implementation choice rather than the last. The fix is a required gate: your AI feature must lose to a dumb rule before you are allowed to build the AI version.

The Delegation Cliff: Why AI Agent Reliability Collapses at 7+ Steps

· 8 min read
Tian Pan
Software Engineer

An agent with 95% per-step reliability sounds impressive. At 10 steps, you have a 60% chance of success. At 20 steps, it's down to 36%. At 50 steps, you're looking at a coin flip—and that's with a generous 95% estimate. Field data suggests real-world agents fail closer to 20% per action, which means a 100-step task succeeds roughly 0.00002% of the time. This isn't a model quality problem or a prompt engineering problem. It's a compounding math problem, and most teams building agents haven't internalized it yet.

This is the delegation cliff: the point at which adding one more step to an agent's task doesn't linearly increase the chance of failure—it multiplies it.

Latency Budgets for AI Features: How to Set and Hit p95 SLOs When Your Core Component Is Stochastic

· 11 min read
Tian Pan
Software Engineer

Your system averages 400ms end-to-end. Your p95 is 4.2 seconds. Your p99 is 11 seconds. You committed to a "sub-second" experience in the product spec. Every metric in your dashboard looks fine until someone asks what happened to 5% of users — and suddenly the average you've been celebrating is the thing burying you.

This is the latency budget problem for AI features, and it's categorically different from what you've solved before. When your core component is a database query or a microservice call, p95 latency is roughly predictable and amenable to standard SRE techniques. When your core component is an LLM, the distribution of response times is heavy-tailed, input-dependent, and partially driven by conditions you don't control. You need a different mental model before you can set an honest SLO — let alone hit it.

The AI Adoption Paradox: Why the Highest-Value Domains Get AI Last

· 8 min read
Tian Pan
Software Engineer

The teams that stand to gain the most from AI are often the last ones deploying it. A healthcare organization that could use AI to catch medication errors in real time sits at 39% AI adoption, while a software company running AI-powered code review ships at 92%. The ROI differential is not even close — yet the adoption rates are inverted. This is the AI adoption paradox, and it's not an accident.

The instinct is to explain this gap as risk aversion, regulatory fear, or bureaucratic inertia. Those factors exist. But the deeper cause is structural: the accuracy threshold required to unlock value in high-stakes domains is fundamentally higher than what justifies autonomous deployment, and most teams haven't built the architecture to bridge that gap.

The AI Code Review Trap: Why Faster Reviews Are Making Your Codebase Worse

· 10 min read
Tian Pan
Software Engineer

Your team ships more code than ever. PR velocity is up, cycle time is down, and the backlog is shrinking. On every dashboard that a manager looks at, things look great. Meanwhile, your incident count per PR is quietly climbing 23.5% year over year.

This is the AI code review paradox. AI tools make engineers faster at writing code and faster at approving it — but the defects that matter most are slipping through at a higher rate than before. The two sides of this paradox compound each other, and most teams are not measuring the right things to notice it.

The Five Gates Your AI Demo Skipped: A Launch Readiness Checklist for LLM Features

· 12 min read
Tian Pan
Software Engineer

There's a pattern that repeats across AI feature launches: the demo wows the room, the feature ships, and within two weeks something catastrophic happens. Not a crash — those are easy to catch. Something subtler: the model confidently generates wrong information, costs spiral three times over projection, or latency spikes under real load make the feature unusable. The team scrambles, the feature gets quietly disabled, and everyone agrees to "do it better next time."

The problem isn't that the demo was bad. The problem is that the demo was the only test that mattered.