
567 posts tagged with "llm"


Your AI Feature Should Lose to a Regex First

· 9 min read
Tian Pan
Software Engineer

A team spends three weeks integrating a foundation model to classify incoming support tickets into routing categories. The model reaches 87% accuracy in testing. They ship it. Six months later, an engineer notices that 70% of tickets contain a product name in the subject line and that a simple lookup table would have handled those with 99% accuracy. The LLM is only earning its keep on the hard 30%, and even there it's making answers up much of the time.

This is not an unusual story. It happens because teams treat "use an LLM" as the first implementation choice rather than the last. The fix is a required gate: your AI feature must lose to a dumb rule before you are allowed to build the AI version.
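
To make the gate concrete, here's a minimal sketch of what "lose to a dumb rule first" can look like: a product-name lookup handles whatever it can, and both paths are scored against the same labeled sample before the model version is allowed to ship. The route names, patterns, and the `llm_classify` callable are illustrative assumptions, not something prescribed by the post.

```python
import re

# Hypothetical product-name -> routing-category table (illustrative).
PRODUCT_ROUTES = {
    "invoicer": "billing",
    "syncbox": "integrations",
    "dashboard": "analytics",
}
PRODUCT_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, PRODUCT_ROUTES)) + r")\b", re.IGNORECASE
)

def baseline_route(subject: str) -> str | None:
    """Lookup-table baseline: route on a product name in the subject, else punt."""
    match = PRODUCT_PATTERN.search(subject)
    return PRODUCT_ROUTES[match.group(1).lower()] if match else None

def evaluate(labeled_tickets, llm_classify):
    """Score the dumb rule and the model against the same labeled sample."""
    baseline_hits = model_hits = covered = 0
    for subject, true_route in labeled_tickets:
        guess = baseline_route(subject)
        if guess is not None:
            covered += 1
            baseline_hits += guess == true_route
        model_hits += llm_classify(subject) == true_route
    return {
        "baseline_coverage": covered / len(labeled_tickets),
        "baseline_accuracy_on_covered": baseline_hits / max(covered, 1),
        "model_accuracy_overall": model_hits / len(labeled_tickets),
    }
```

If the baseline wins on the slice it covers, the model only has to justify itself on the remainder.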

The Model EOL Clock: Treating Provider LLMs as External Dependencies

· 11 min read
Tian Pan
Software Engineer

In January 2026, OpenAI retired several GPT models from ChatGPT with two weeks' notice — weeks after its CEO had publicly promised "plenty of notice" following an earlier backlash. For teams that had built workflows around those models, the announcement arrived like a pager alert on a Friday afternoon. The API remained unaffected that time. But it won't always.

Every model you're currently calling has a deprecation date. Some of those dates are already listed on your provider's documentation page. Others haven't been announced yet. The operational question isn't whether your production model will be retired — it's whether you'll find out in time to handle it gracefully, or scramble to migrate after users start seeing failures.
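
One hedged way to operationalize this is to treat every provider model like a pinned dependency with an end-of-life date and alert well ahead of it. The model names, dates, and fallback choices below are placeholders, not real deprecation schedules.

```python
from datetime import date, timedelta

# Placeholder registry: model names, retirement dates, and fallbacks are illustrative.
MODEL_REGISTRY = {
    "gpt-4o-2024-08-06": {"retires": date(2026, 6, 1), "fallback": "gpt-4.1"},
    "claude-3-5-sonnet": {"retires": None, "fallback": "claude-3-7-sonnet"},
}

WARN_WINDOW = timedelta(days=90)

def eol_report(today: date | None = None) -> list[str]:
    """Return warnings for models that are retired or inside the warning window."""
    today = today or date.today()
    warnings = []
    for model, meta in MODEL_REGISTRY.items():
        retires = meta["retires"]
        if retires is None:
            continue
        remaining = retires - today
        if remaining <= timedelta(0):
            warnings.append(f"{model} is retired; migrate to {meta['fallback']} now")
        elif remaining <= WARN_WINDOW:
            warnings.append(
                f"{model} retires in {remaining.days} days; plan migration to {meta['fallback']}"
            )
    return warnings
```

Run it in CI or a nightly job and the retirement stops being a surprise pager alert.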

Model Routing Is a System Design Problem, Not a Config Option

· 11 min read
Tian Pan
Software Engineer

Most teams choose their LLM the way they choose a database engine: once, during architecture review, and never again. You pick GPT-4o or Claude 3.5 Sonnet, bake it into your config, and ship. The choice feels irreversible because changing it requires a redeployment, coordination across services, and regression testing against whatever your evals look like this week.

That framing is a mistake. Your traffic is not homogeneous. A "summarize this document" request and a "debug this cryptic stack trace" request hitting the same endpoint at the same time have radically different capability requirements — but with static model selection, they're indistinguishable from your infrastructure's perspective. You're either over-provisioning one or under-serving the other, and you're doing it on every single request.

Model routing treats LLM selection as a runtime dispatch decision. Every incoming query gets evaluated on signals that predict the right model for that specific request, and the call is dispatched accordingly. The routing layer doesn't exist in your config file — it runs in your request path.
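
As a rough illustration of what "in your request path" means, here's a minimal routing sketch that dispatches on cheap per-request signals. The signals, thresholds, and model names are assumptions for the sake of the example.

```python
from dataclasses import dataclass

@dataclass
class Request:
    prompt: str
    task_hint: str | None = None  # optionally set by the calling feature

def route(request: Request) -> str:
    """Runtime dispatch: inspect the request, return the model to call."""
    prompt = request.prompt
    looks_like_code = any(tok in prompt for tok in ("Traceback", "stack trace", "```"))
    long_context = len(prompt) > 20_000  # rough character proxy for context size

    if request.task_hint == "summarize" and not long_context:
        return "small-cheap-model"
    if looks_like_code or request.task_hint == "debug":
        return "frontier-model"
    if long_context:
        return "long-context-model"
    return "default-model"

def handle(request: Request, clients: dict):
    # clients maps model name -> provider SDK client (stand-in interface).
    model = route(request)
    return clients[model].complete(request.prompt)
```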

Multi-Model Consistency: When Your Pipeline's Sequential LLM Calls Contradict Each Other

· 9 min read
Tian Pan
Software Engineer

Your summarization step decides a customer complaint is about billing. Your extraction step pulls "subscription tier: Pro." Your generation step writes a follow-up email referencing their "Enterprise plan." Three LLM calls, one pipeline, one completely broken output — and no error was raised anywhere along the way.

This is multi-model consistency failure: the silent killer of compound AI systems. It doesn't look like an exception. It doesn't trigger your error rate SLO. It just ships confidently wrong content to users.
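
A lightweight defense is a consistency check between stages before anything reaches the user. The sketch below assumes a summarize → extract → generate pipeline with field names matching the example above; the exact checks are illustrative.

```python
def check_consistency(summary: dict, extraction: dict, draft_email: str) -> list[str]:
    """Compare facts asserted by adjacent pipeline steps; return any conflicts."""
    conflicts = []

    # The summarizer and extractor should agree on the complaint category.
    if summary.get("category") != extraction.get("category"):
        conflicts.append(
            f"category mismatch: summary={summary.get('category')!r} "
            f"extraction={extraction.get('category')!r}"
        )

    # Any plan tier referenced in the generated email must match extraction.
    tier = extraction.get("subscription_tier")
    if tier and tier.lower() not in draft_email.lower():
        for other in ("enterprise", "pro", "free"):
            if other != tier.lower() and other in draft_email.lower():
                conflicts.append(f"email references {other!r} but extraction says {tier!r}")

    return conflicts  # a non-empty list blocks the send and routes to review
```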

Multimodal Pipelines in Production: What Breaks When You Go Beyond Text

· 11 min read
Tian Pan
Software Engineer

Most LLM engineering wisdom — caching prompts, tuning temperature, budgeting tokens — assumes text goes in and text comes out. Add an image, a PDF, or an audio clip and almost none of that wisdom transfers. The preprocessing is different. The failure modes are different. The cost model is different. And the eval suite you built for your text pipeline won't catch the new things that break.

About 50% of enterprise knowledge lives in non-text formats: PDFs, slides, scanned forms, product images. Teams that reach for that data discover that going multimodal isn't just adding a modality — it's adding an entirely new engineering surface.

The Noisy Neighbor Problem in Shared LLM Infrastructure: Tenancy Models for AI Features

· 12 min read
Tian Pan
Software Engineer

The pager goes off at 2:47 AM. The customer-facing chat assistant is returning 429s for half of paying users. Engineers scramble through dashboards, looking for the bug they shipped that afternoon. They find nothing — the code is fine. The actual culprit is a batch summarization job a different team launched that evening, sharing the same provider API key, which has eaten the account's per-minute token budget for the next four hours. Nobody owns the shared key. Nobody owns the limit.

This is the noisy-neighbor problem, and it has a particular cruelty in LLM systems that classic API quota incidents do not. A REST endpoint that hits its rate ceiling fails fast and gets retried; an LLM token-per-minute bucket is consumed asymmetrically by request content, so a single feature emitting 8K-token completions can starve a feature making cheap 200-token classification calls without ever appearing in request-count graphs. The traffic isn't noisy in the dimension you're measuring.

Most teams discover this the way the team above did: an unrelated team's job collides with a paying user's session, and the only thing both have in common is a string in an environment variable.
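
One common mitigation, sketched here under assumed numbers, is to give each feature its own slice of the shared tokens-per-minute limit so a batch job can only exhaust its own budget. The limit, allocations, and feature names are made up for illustration.

```python
import time

ACCOUNT_TPM = 2_000_000  # provider tokens-per-minute limit (illustrative)

BUDGETS = {  # fractions of the shared limit; assumed allocations
    "chat-assistant": 0.6,
    "batch-summarizer": 0.3,
    "internal-tools": 0.1,
}

class TokenBudget:
    """Per-feature token bucket carved out of one shared provider key."""

    def __init__(self, feature: str):
        self.capacity = ACCOUNT_TPM * BUDGETS[feature]
        self.window_start = time.monotonic()
        self.used = 0.0

    def try_consume(self, estimated_tokens: int) -> bool:
        now = time.monotonic()
        if now - self.window_start >= 60:
            self.window_start, self.used = now, 0.0  # new one-minute window
        if self.used + estimated_tokens > self.capacity:
            return False  # this feature backs off; other tenants are unaffected
        self.used += estimated_tokens
        return True
```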

PII in the Prompt Layer: The Privacy Engineering Gap Most Teams Ignore

· 12 min read
Tian Pan
Software Engineer

Your organization has a privacy policy. It says something reasonable about user data being handled carefully, retention limits, and compliance with GDPR and HIPAA. What it almost certainly does not say is whether a user's name, email address, or medical history was transmitted verbatim to a hosted LLM API before any policy control was applied.

That gap — between the privacy policy you can point to and the privacy guarantee you can actually prove — is where most production LLM systems are silently failing. Research shows roughly 8.5% of prompts submitted to tools like ChatGPT and Copilot contain sensitive information, including PII, credentials, and internal file references. In enterprise environments where users paste emails, customer data, and support tickets into AI-assisted workflows, that number almost certainly runs higher.

The problem is not that developers are careless. It is that the LLM prompt layer was never designed as a data processing boundary. It inherits content from upstream systems — user input, RAG retrievals, agent context — without enforcing the data classification rules that govern every other part of the stack.
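
A minimal version of that boundary is a redaction pass that runs immediately before the provider call, regardless of where the content came from. The regexes below are illustrative stand-ins (real deployments typically use a dedicated PII detector), and `client.complete` is a placeholder for your provider SDK.

```python
import re

# Illustrative patterns only; not a substitute for a real PII detection service.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\+?\d[\d\s().-]{8,}\d\b"),
}

def redact(text: str) -> tuple[str, dict]:
    """Replace detected PII with typed placeholders; keep a map for rehydration."""
    mapping = {}
    for label, pattern in PII_PATTERNS.items():
        def mask(match, label=label):
            placeholder = f"<{label}_{len(mapping)}>"
            mapping[placeholder] = match.group()
            return placeholder
        text = pattern.sub(mask, text)
    return text, mapping

def call_llm(prompt: str, client):
    safe_prompt, mapping = redact(prompt)          # enforce the boundary here,
    response = client.complete(safe_prompt)        # not in each upstream feature
    for placeholder, original in mapping.items():  # rehydrate inside the boundary
        response = response.replace(placeholder, original)
    return response
```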

Pricing Your AI Product: Escaping the Compute Cost Trap

· 10 min read
Tian Pan
Software Engineer

There is a company charging £50 per month per user. Their AI feature consumes £30 in API fees. That leaves £20 to cover hosting, support, and profit — before accounting for a single refund or churned seat. They built a product users love, grew to thousands of subscribers, and unknowingly constructed a business where more customers means more losses.

This is not a cautionary tale about a bad idea. It is a cautionary tale about a pricing architecture imported from a world where the marginal cost of serving the next user was effectively zero. That world no longer fully applies when your product calls a language model.

Traditional SaaS gross margins run 70–90%. AI-forward companies are reporting 50–60% — and the gap is mostly explained by one line item: inference. When tokens are 20–40% of your cost of goods sold, the standard SaaS playbook inverts.
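
The arithmetic is worth running explicitly. A back-of-envelope version of the example above, with an assumed £8 of non-inference COGS, lands far below the traditional SaaS range:

```python
# Back-of-envelope unit economics for the example above (illustrative numbers).
price_per_seat = 50.00   # £ per user per month
inference_cost = 30.00   # £ per user per month in API fees
other_cogs = 8.00        # hosting, support, etc. (assumed)

gross_margin = (price_per_seat - inference_cost - other_cogs) / price_per_seat
print(f"gross margin: {gross_margin:.0%}")  # 24% -- far below the 70-90% SaaS norm
```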

Prompt Diff Review as a Discipline: What Reviewers Actually Need to Ask

· 11 min read
Tian Pan
Software Engineer

A one-line change to a system prompt landed in production last quarter at a mid-sized AI startup. The diff looked harmless: an engineer tightened the instructions around response length. The reviewer approved it in two minutes, as they would a variable rename. Within 48 hours, support tickets spiked. The model had started truncating answers mid-sentence on complex queries, and the edge cases the old phrasing had been silently handling for months were now failing. The original instruction hadn't just controlled length — it had implicitly anchored the model's judgment about when a topic was complete. Nobody had captured that. Nobody had looked for it.

This is the core problem with prompt review today: we're applying code review instincts to a medium where those instincts are mostly wrong. Code review works because the artifact being reviewed is deterministic and the semantics are recoverable from syntax. A prompt is neither. Its meaning is distributed across the model's weights, its training data, and the stochastic sampling that runs at inference time. The diff you see on screen is a fraction of the change you're approving.

The Prompt Entropy Budget: Measuring Output Variance as a First-Class Production Metric

· 11 min read
Tian Pan
Software Engineer

When your LLM feature ships, your monitoring dashboard probably tracks accuracy, latency, and error rate. What it almost certainly does not track is variance — how wildly different the output is each time a user sends the same prompt. That gap is where production AI features quietly collapse.

Variance determines whether your product feels trustworthy or capricious. A feature that scores 88% on your eval suite but delivers a two-sentence answer 40% of the time and a ten-paragraph essay the other 60% will erode user trust faster than one that scores 80% but behaves consistently. Teams optimizing exclusively for accuracy are solving the wrong half of the reliability problem.

The prompt entropy budget is the concept that fills this gap: a structured approach to measuring, budgeting, and controlling the distribution of outputs your model produces over identical inputs — treated the same way you treat p99 latency or error budget in your SLO framework.
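
In practice the probe can be embarrassingly simple: replay the same prompt N times and reduce the spread of outputs to a number you can alert on. The sketch below uses length dispersion and distinct-answer rate as cheap stand-ins for a real semantic variance metric; `client.complete` is a placeholder for your provider SDK.

```python
import statistics

def output_variance(prompt: str, client, n: int = 20) -> dict:
    """Sample the same prompt n times and summarize how much the outputs spread."""
    completions = [client.complete(prompt) for _ in range(n)]
    lengths = [len(c.split()) for c in completions]
    mean_len = statistics.mean(lengths)
    return {
        "mean_length": mean_len,
        "length_cv": statistics.pstdev(lengths) / mean_len if mean_len else 0.0,
        "distinct_answers": len(set(completions)) / n,
    }

# Gate against a budget the same way you gate p99 latency (threshold is illustrative):
# assert output_variance(prompt, client)["length_cv"] <= 0.35, "prompt exceeds its entropy budget"
```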

Prompting Reasoning Models Differently: Why Your Existing Patterns Break on o1, o3, and Claude Extended Thinking

· 10 min read
Tian Pan
Software Engineer

Most teams adopting reasoning models do the same thing: they copy their existing system prompt, point it at o1 or Claude Sonnet with extended thinking, and assume the model upgrade will do the rest. Benchmarks improve. Production accuracy stays flat — or drops. The issue isn't the model. It's that the mental model for prompting never changed.

Reasoning models don't work like instruction-following models. The strategies that squeeze performance out of GPT-4o — elaborate system prompts, carefully curated few-shot examples, explicit "think step by step" instructions — were designed for a different inference architecture. Applied to reasoning models, they constrain the exact thing that makes these models valuable.

This post is a practical guide to the differences that matter and the adjustments that actually work.

The Public Hallucination Playbook: What to Do When Your AI Says Something Stupid in Public

· 10 min read
Tian Pan
Software Engineer

You'll find out through a screenshot. A customer will post it, a journalist will quote it, or someone on your team will Slack you a link at 11pm. Your AI system said something confidently wrong — wrong enough that it's funny, or wrong enough that it could hurt someone — and now it's public.

Most engineering teams spend months hardening their AI pipelines against this moment, then discover they never planned for what happens after it arrives. They know how to iterate on evals and tune prompts. They don't know who should post the response tweet, what that response should say, or how to tell the difference between a one-off unlucky sample and a latent failure mode that's been running in production for weeks.

This is the playbook for that moment.