Skip to main content

720 posts tagged with "llm"

View all tags

The Build-vs-Buy LLM Infrastructure Decision Most Teams Get Wrong

· 10 min read
Tian Pan
Software Engineer

A FinTech team built their AI chatbot on GPT-4o. Month one: $15K. Month two: $35K. Month three: $60K. Projecting $700K annually, they panicked and decided to self-host. Six months and one burned-out engineer later, they were spending $85K/month on infrastructure, a part-time DevOps engineer, and three CUDA incidents that took down production. They eventually landed at $8K/month — but not by self-hosting everything. By routing intelligently.

Both decisions were wrong. The real failure was that they never ran the actual math.

Closing the Feedback Loop: How Production AI Systems Actually Improve

· 12 min read
Tian Pan
Software Engineer

Your AI product shipped three months ago. You have dashboards showing latency, error rates, and token costs. You've seen users interact with the system thousands of times. And yet your model is exactly as good — and bad — as the day it deployed.

This is not a data problem. You have more data than you know what to do with. It is an architecture problem. The signals that tell you where your model fails are sitting in application logs, user sessions, and downstream outcome data. They are disconnected from anything that could change the model's behavior.

Most teams treat their LLM as a static artifact and wrap monitoring and evaluation around the outside. The best teams treat production as a training pipeline that never stops.

Debugging LLM Failures Systematically: A Field Guide for Engineers Who Can't Read Logs

· 12 min read
Tian Pan
Software Engineer

A fintech startup added a single comma to their system prompt. The next day, their invoice generation bot was outputting gibberish and they'd lost $8,500 before anyone traced the cause. No error was thrown. No alert fired. The application kept running, confident and wrong.

This is what debugging LLMs in production actually looks like. There are no stack traces pointing to line numbers. There's no core dump you can inspect. The system doesn't crash — it continues to operate while silently producing degraded output. Traditional debugging instincts don't transfer. Most engineers respond by randomly tweaking prompts until something looks better, deploying based on three examples, and calling it fixed. Then the problem resurfaces two weeks later in a different shape.

There's a better way. LLM failures follow systematic patterns, and those patterns respond to structured investigation. This is the methodology.

Document Injection: The Prompt Injection Vector Inside Every RAG Pipeline

· 10 min read
Tian Pan
Software Engineer

Most RAG security discussions focus on the generation layer — jailbreaks, system prompt leakage, output filtering. Practitioners spend weeks tuning guardrails on the model side while overlooking the ingestion pipeline that feeds it. The uncomfortable reality: every document your pipeline ingests is a potential instruction surface. A single PDF can override your system prompt, exfiltrate user data, or manipulate decisions without your logging infrastructure seeing anything unusual.

This isn't theoretical. Microsoft 365 Copilot, Slack AI, and commercial HR screening tools have all been exploited through this vector in the past two years. The same attack pattern appeared in 18 academic papers on arXiv, where researchers embedded hidden prompts to bias AI peer review systems in their favor.

The Hybrid Automation Stack: A Decision Framework for Mixing Rules and LLMs

· 9 min read
Tian Pan
Software Engineer

Teams that replace all their Zapier flows and RPA scripts with LLM agents tend to discover the same thing six months later: they've traded brittle-but-auditable for flexible-but-unmaintainable. The Zapier flows broke in predictable ways—step 14 failed because the API changed. The LLM workflows break invisibly—the model quietly routes support tickets to the wrong queue, and nobody finds out until a customer escalates. The audit log says "AI decision," which is lawyer-speak for "no one knows."

The answer isn't to avoid LLMs in automation. It's to be deliberate about which tasks go to which system, and to architect the seam between them so failures don't cross over.

LLMs as ETL Primitives: AI in the Data Pipeline, Not Just the Product

· 9 min read
Tian Pan
Software Engineer

The typical AI narrative goes like this: you build a product, you add an AI feature, and users get smarter outputs. That framing is correct, but incomplete. The more durable advantage isn't in the product layer at all — it's in the data pipeline running underneath it.

A growing number of engineering teams have quietly swapped out regex rules, custom classifiers, and hand-coded parsers in their ETL pipelines and replaced them with LLM calls. The result: pipelines that handle unstructured input, adapt to schema drift, and classify records across thousands of categories — without retraining a model for every new edge case. Teams running this pattern at scale are building data assets that compound. Teams still treating LLMs purely as product features are not.

The Multi-Tenant Prompt Problem: When One System Prompt Serves Many Masters

· 9 min read
Tian Pan
Software Engineer

You ship a new platform-level guardrail — a rule that prevents the AI from discussing competitor pricing. It goes live Monday morning. By Wednesday, your largest enterprise customer files a support ticket: their sales assistant, which they'd carefully tuned to compare vendor options for their procurement team, stopped working. They didn't change anything. You changed something, and the blast radius hit them invisibly.

This is the multi-tenant prompt problem. B2B AI products that allow customer customization are actually running a layered instruction system, and most teams don't treat it like one. They treat it like string concatenation: take the platform prompt, append the customer's instructions, maybe append user preferences, and call the LLM. The model figures out the rest.

The model doesn't figure it out. It silently picks a winner, and you don't find out which one until someone complains.

The Operational Model Card: Deployment Documentation Labs Don't Publish

· 11 min read
Tian Pan
Software Engineer

A model card tells you whether a model was red-teamed for CBRN misuse and which demographic groups it underserves. What it doesn't tell you: the p95 TTFT at 10,000 concurrent requests, the accuracy cliff at 80% of the advertised context window, the percentage of complex JSON schemas it malforms, or how much the model's behavior has drifted since the card was published.

The gap is structural, not accidental. Model cards were designed in 2019 for fairness and safety documentation, with civil society organizations and regulators as the intended audience. Engineering teams shipping production systems were not the use case. Seven years of adoption later, that framing is unchanged — while the cost of treating a model card as a deployment specification has never been higher.

The 2025 Foundation Model Transparency Index (Stanford CRFM + Berkeley) confirmed the scope of the omission: OpenAI scored 24/100, Anthropic 32/100, Google 27/100 across 100 transparency indicators. Average scores dropped from 58 to 40 year-over-year, meaning AI transparency is getting worse, not better, as models get more capable. None of the four major labs disclose training data composition, energy usage, or deployment-relevant performance characteristics.

Prompt Linting: The Pre-Deployment Gate Your AI System Is Missing

· 8 min read
Tian Pan
Software Engineer

Every serious engineering team runs a linter before merging code. ESLint catches undefined variables. Prettier enforces formatting. Semgrep flags security anti-patterns. Nobody ships JavaScript to production without running at least one static check first.

Now consider what your team does before shipping a prompt change. If you're like most teams, the answer is: review it in a PR, eyeball it, maybe test it manually against a few inputs. Then merge. The system prompt for your production AI feature — the instruction set that controls how the model behaves for every single user — gets less pre-deployment scrutiny than a CSS change.

This gap is not a minor process oversight. A study analyzing over 2,000 developer prompts found that more than 10% contained vulnerabilities to prompt injection attacks, and roughly 4% had measurable bias issues — all without anyone noticing before deployment. The tooling to catch these automatically exists. Most teams just haven't wired it in yet.

The Selective Abstention Problem: Why AI Systems That Always Answer Are Broken

· 10 min read
Tian Pan
Software Engineer

Here is a pattern that appears in almost every production AI deployment: the team ships a feature that handles 90% of queries well. Then they start getting complaints. A user asked something outside the training distribution; the model confidently produced a wrong answer. A RAG pipeline retrieved a stale document; the model answered as though it were current. A legal query hit an edge case the prompt didn't cover; the model speculated its way through it. The fix, in each case, wasn't a better model. It was teaching the system to say "I don't know."

Abstention — the principled decision to not answer — is one of the hardest and most undervalued capabilities in AI system design. Virtually all product effort goes toward making answers better. Almost none goes toward making the system reliably know when to withhold one. That asymmetry is a design debt that compounds in production.

The Semantic Validation Layer: Why JSON Schema Isn't Enough for Production LLM Outputs

· 10 min read
Tian Pan
Software Engineer

By 2025, every major LLM provider had shipped constrained decoding for structured outputs. OpenAI, Anthropic, Gemini, Mistral — they all let you hand the model a JSON schema and guarantee it comes back structurally intact. Teams adopted this and breathed a collective sigh of relief. Parsing errors disappeared. Retry loops shrank. Dashboards turned green.

Then the subtle failures started.

A sentiment classifier locked in at 0.99 confidence on every input — gibberish included — for two weeks before anyone noticed. A credit risk agent returned valid JSON approving a loan application that should have been declined, with a risk score fifty points too high. A financial pipeline coerced "$500,000" (a string, technically schema-valid) down to zero in an integer field, corrupting six weeks of risk calculations. Every one of these failures passed schema validation cleanly.

The lesson: structural validity is necessary, not sufficient. You need a semantic validation layer, and most teams don't have one.

Your LLM Eval Is Lying to You: The Statistical Power Problem

· 9 min read
Tian Pan
Software Engineer

Your team spent three days iterating on a system prompt. The eval score went from 82% to 85%. You ship it. Three weeks later, production metrics are flat. What happened?

The short answer: your eval lied to you. Not through malice, but through insufficient sample size and ignored variance. A 3-point accuracy lift on a 100-example test set is well within the noise floor of most LLM systems. You cannot tell signal from randomness at that scale — but almost no one does the math to verify this before acting on results.

This is the statistical power problem in LLM evaluation, and it is quietly corrupting the iteration loops of most teams building AI products.