
678 posts tagged with "ai-engineering"


LLM Cost Forecasting Before You Ship: The Estimation Problem Most Teams Skip

· 9 min read
Tian Pan
Software Engineer

A team ships a support chatbot. In testing, the monthly bill looks manageable—a few hundred dollars across the engineering team's demo sessions. Three weeks into production, the invoice arrives: $47,000. Nobody had lied about the token counts. Nobody had made an arithmetic error. The production workload was simply a different animal than anything they'd simulated.

This pattern repeats constantly. Teams estimate LLM costs the way they estimate database query costs—by measuring a representative request and multiplying by expected volume. That mental model breaks badly for LLMs, because the two biggest cost drivers (output token length and tool-call overhead) are determined at inference time by behavior you cannot fully predict at design time.

This post is about how to forecast better before you ship, not how to optimize after the bill arrives.
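To make the gap concrete, here is a minimal sketch of the two estimates side by side; the prices, token counts, and volumes are illustrative assumptions, not numbers from any real deployment:

```python
# Illustrative sketch: compare a "one representative request x volume" estimate
# with one that samples real output lengths and includes tool-call overhead.
# All prices, token counts, and volumes below are assumptions.

INPUT_PRICE_PER_1K = 0.003    # USD per 1K input tokens (assumed)
OUTPUT_PRICE_PER_1K = 0.015   # USD per 1K output tokens (assumed)


def naive_monthly_cost(input_tokens: int, output_tokens: int, requests: int) -> float:
    """The demo-session estimate: one measured request multiplied by volume."""
    per_request = (input_tokens / 1000) * INPUT_PRICE_PER_1K + \
                  (output_tokens / 1000) * OUTPUT_PRICE_PER_1K
    return per_request * requests


def distribution_aware_cost(output_samples: list[int], input_tokens: int,
                            avg_tool_calls: float, tool_overhead_tokens: int,
                            requests: int) -> float:
    """Estimate from sampled output lengths plus per-request tool-call overhead."""
    mean_output = sum(output_samples) / len(output_samples)
    effective_input = input_tokens + avg_tool_calls * tool_overhead_tokens
    per_request = (effective_input / 1000) * INPUT_PRICE_PER_1K + \
                  (mean_output / 1000) * OUTPUT_PRICE_PER_1K
    return per_request * requests


print(naive_monthly_cost(input_tokens=800, output_tokens=150, requests=200_000))
print(distribution_aware_cost(output_samples=[150, 400, 900, 1200, 2500],
                              input_tokens=800, avg_tool_calls=2.5,
                              tool_overhead_tokens=600, requests=200_000))
```

The second number is usually several times the first, and the difference comes entirely from behavior you only observe under real traffic.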

LLMs as Data Engineers: The Silent Failures in AI-Driven ETL

· 11 min read
Tian Pan
Software Engineer

Your hand-coded ETL pipeline handles 95% of records correctly. The edge cases — the currency strings with commas, the inconsistently formatted dates, the nonstandard country codes — flow through to your data warehouse and quietly corrupt your dashboards. Nobody notices until a quarterly report looks wrong. You add another special case to the pipeline. The cycle continues.

LLMs can solve this. They infer schemas from raw samples, handle messy edge cases that no engineer anticipated, and transform unstructured documents into structured records at a fraction of the development time. Several teams have shipped this. Some of them have also had LLMs silently transform "$1,200,000" into 1200 instead of 1200000, flip severity scores from "high" to "low" with complete structural validity, and join on the wrong foreign key in ways that passed every schema check.

The problem isn't that LLMs are bad at data engineering. It's that their failure mode is exactly wrong for ETL: high confidence, no error thrown, structurally valid output.
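One way to make that failure mode visible is to pair the LLM transform with a cheap deterministic cross-check. A minimal sketch, with hypothetical function names and an assumed tolerance:

```python
import re


def parse_currency(raw: str) -> float:
    """Deterministic reference parse: strip symbols and thousands separators."""
    return float(re.sub(r"[^\d.]", "", raw))


def currency_transform_ok(raw: str, llm_value: float, tolerance: float = 0.01) -> bool:
    """Catch structurally valid but numerically wrong transforms,
    e.g. "$1,200,000" silently becoming 1200."""
    reference = parse_currency(raw)
    if reference == 0:
        return llm_value == 0
    return abs(llm_value - reference) / reference <= tolerance


assert currency_transform_ok("$1,200,000", 1_200_000)   # correct transform passes
assert not currency_transform_ok("$1,200,000", 1200)    # silent magnitude error caught
```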

Model Deprecation Is a Systems Migration: How to Survive Provider Model Retirements

· 11 min read
Tian Pan
Software Engineer

A healthcare company running a production AI triage assistant gets the email every team dreads: their inference provider is retiring the model they're using in 90 days. They update the model string, run a quick manual smoke test, and ship the replacement. Three weeks later, the new model starts offering unsolicited diagnostic opinions. Token usage explodes 5×. Entire prompt templates break because the new model interprets instruction phrasing differently. JSON parsing fails because the output schema shifted.

This is not an edge case. It is the normal experience of surviving a model retirement when you treat it as a configuration change rather than a systems migration.

Model Upgrade as a Breaking Change: What Your Deployment Pipeline Is Missing

· 11 min read
Tian Pan
Software Engineer

When OpenAI deprecated max_tokens in favor of max_completion_tokens for its newer models, applications that had run untouched for months began failing with 400 errors the moment they pointed at those models. No announcement triggered an alert. There was no error in your code. The model changed; your assumptions did not. This is the canonical story of a model upgrade as a breaking change — except most such changes are quieter and therefore harder to catch.

Foundation model updates don't follow the same social contract as library releases. There's no BREAKING CHANGE: prefix in a git commit. There's no semver bump that tells your CI to fail. The output format narrows, the tone drifts, the JSON structure reorganizes, the reasoning path shortens — and downstream consumers discover it gradually, through degraded user experience and confused analytics, not a thrown exception.
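As a concrete illustration, here is a minimal compatibility shim, assuming the official openai Python SDK; which model families require which parameter is itself an assumption you should verify against current provider documentation:

```python
from openai import OpenAI  # assumes the official openai Python SDK, v1+

client = OpenAI()


def output_limit_kwargs(model: str, limit: int) -> dict:
    """Pick the output-limit parameter name by model family.

    Which models reject max_tokens changes over time; treat this mapping as an
    assumption to verify against current provider documentation.
    """
    if model.startswith(("o1", "o3")):
        return {"max_completion_tokens": limit}
    return {"max_tokens": limit}


response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize this support ticket."}],
    **output_limit_kwargs("gpt-4o-mini", 256),
)
print(response.choices[0].message.content)
```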

Multi-User AI Sessions: The Context Ownership Problem Nobody Designs For

· 9 min read
Tian Pan
Software Engineer

In August 2024, security researchers discovered that Slack AI would pull both public and private channel content into the same context window when answering a query. An attacker in a public channel could craft a message that, when ingested by Slack AI, would inject instructions into a victim's session — and since Slack AI doesn't cite its sources, the resulting data exfiltration was nearly untraceable. The attack could leak API keys embedded in private DMs. Slack patched it after responsible disclosure.

This wasn't a bug in the traditional sense. It was a consequence of treating context as a shared mutable resource with no per-user access control. And it's a mistake that most teams building shared AI assistants are making right now, just more quietly.
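The fix the incident points toward is scoping context to the requesting user before prompt assembly. A minimal sketch, with a hypothetical data model:

```python
from dataclasses import dataclass


@dataclass
class Snippet:
    text: str
    source_channel: str
    visible_to: frozenset[str]  # user ids allowed to read the source directly


def build_context(snippets: list[Snippet], requesting_user: str) -> str:
    """Scope context per requesting user at assembly time.

    Anything the user could not read on their own never enters the prompt,
    rather than treating retrieved content as one shared mutable pool.
    """
    allowed = [s for s in snippets if requesting_user in s.visible_to]
    return "\n\n".join(s.text for s in allowed)
```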

The Multilingual Token Tax: What Building AI for Non-English Users Actually Costs

· 11 min read
Tian Pan
Software Engineer

Your product roadmap says "expand to Japan and Brazil." Your finance model says the LLM API line item is $X per month. Both rest on a cost assumption that is wrong, and you won't discover it until the international rollout is weeks away.

Tokenization — the step that turns user text into integers your model can process — is profoundly biased toward English. A sentence in Japanese might require 2–8× as many tokens as the same sentence in English. That multiplier feeds directly into API costs, context window headroom, and response latency. Teams that model their AI budget on English benchmarks and then flip on a language flag are routinely surprised by bills 3–5× higher than expected.
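You can measure the multiplier for your own content before the rollout. A quick sketch using the tiktoken library (exact ratios depend on the tokenizer and the text, so treat the output as a measurement, not a constant):

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "en": "Please reset my password and send a confirmation email.",
    "ja": "パスワードをリセットして、確認メールを送ってください。",
    "pt": "Por favor, redefina minha senha e envie um e-mail de confirmação.",
}

baseline = len(enc.encode(samples["en"]))
for lang, text in samples.items():
    n = len(enc.encode(text))
    print(f"{lang}: {n:3d} tokens ({n / baseline:.1f}x English)")
```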

Pipeline Attribution in Compound AI Systems: Finding the Weakest Link Before It Finds You

· 10 min read
Tian Pan
Software Engineer

Your retrieval precision went up. Your reranker scores improved. Your generator faithfulness metrics look better than last quarter. And yet your users are complaining that the system is getting worse.

This is one of the more disorienting failure modes in production AI engineering, and it happens more often than teams expect. When you build a compound AI system — one where retrieval feeds a reranker, which feeds a generator, which feeds a validator — you inherit a fundamental attribution problem. End-to-end quality is the only metric that actually matters, but it's the hardest one to act on. You can't fix "the system is worse." You need to fix a specific component. And in a four-stage pipeline, that turns out to be genuinely hard.
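A prerequisite for any attribution work is recording what each stage saw and produced under a single trace id, so a bad end-to-end answer can be replayed stage by stage. A minimal sketch, with the stage functions left as placeholders:

```python
import time
import uuid


def traced_pipeline(query, stages, sink):
    """Run retrieve -> rerank -> generate -> validate, recording each stage's
    output and latency under one trace id so an end-to-end failure can be
    replayed and attributed to a specific component."""
    record = {"trace_id": str(uuid.uuid4()), "query": query, "stages": []}
    state = query
    for name, stage in stages:          # e.g. [("retrieve", retrieve), ("rerank", rerank), ...]
        start = time.monotonic()
        state = stage(state)
        record["stages"].append({
            "stage": name,
            "output": state,
            "latency_s": round(time.monotonic() - start, 4),
        })
    sink(record)                        # append to whatever log store you query later
    return state
```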

The Production Distribution Gap: Why Your Internal Testers Can't Find the Bugs Users Do

· 11 min read
Tian Pan
Software Engineer

Your AI feature passed internal testing with flying colors. Engineers loved it, product managers gave the thumbs up, and the eval suite showed 94% accuracy on the benchmark. Then you shipped it, and within two weeks users were hitting failure modes you'd never seen — wrong answers, confused outputs, edge cases that made the model look embarrassingly bad.

This is the production distribution gap. It's not a new problem, but it's dramatically worse for AI systems than for deterministic software. Understanding why — and having a concrete plan to address it — is the difference between an AI feature that quietly erodes user trust and one that improves with use.

Zero-Shot, Few-Shot, or Chain-of-Thought: A Production Decision Framework

· 10 min read
Tian Pan
Software Engineer

Ask most engineers why they're using few-shot prompting in production, and you'll hear something like: "It seemed to work better." Ask why they added chain-of-thought, and the answer is usually: "I read it helps with reasoning." These aren't wrong answers, exactly. But they're convention masquerading as engineering. The evidence on when each prompting technique actually outperforms is specific enough that you can make this decision systematically—and the right choice can cut token costs by 60–80% or prevent a degradation you didn't know you were causing.

Here's what the research says, and how to apply it to your stack.
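To see where the 60–80% figure can come from, here is a back-of-the-envelope comparison of zero-shot versus few-shot input cost; every number below is an illustrative assumption, not a benchmark:

```python
# Back-of-the-envelope only: every number here is an assumption, not a benchmark.
INPUT_PRICE_PER_1K = 0.003       # USD per 1K input tokens (assumed)
REQUESTS_PER_MONTH = 1_000_000

instruction_tokens = 200         # shared task instructions
examples = 5                     # few-shot demonstrations sent with every request
tokens_per_example = 350

variants = {
    "zero-shot": instruction_tokens,
    "few-shot": instruction_tokens + examples * tokens_per_example,
}

for name, prompt_tokens in variants.items():
    monthly = (prompt_tokens / 1000) * INPUT_PRICE_PER_1K * REQUESTS_PER_MONTH
    print(f"{name:9s}: {prompt_tokens:4d} prompt tokens/request -> ${monthly:,.0f}/month")
```

If the few-shot examples aren't measurably improving quality for your task, that overhead is pure waste.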

Testing the Retrieval-Generation Seam: The Integration Test Gap in RAG Systems

· 11 min read
Tian Pan
Software Engineer

Your retriever returns the right documents 94% of the time. Your LLM correctly answers questions given good context 96% of the time. Ship it. What could go wrong?

Multiply those numbers: 0.94 × 0.96 ≈ 0.90. You've lost 10% of your queries before accounting for any edge cases, prompt formatting issues, token truncation, or the distractor documents your retriever surfaces alongside the correct ones. But the deeper problem isn't the arithmetic — it's that your unit tests will never catch this. The retriever passes its tests in isolation. The generator passes its tests in isolation. The thing that fails is the composition, and most teams have no tests for that.

This is the retrieval-generation seam: the interface between what your retriever hands off and what your generator can actually use. It's the most under-tested boundary in production RAG systems, and it's where most failures originate.
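What a test for that seam can look like, sketched as a pytest integration test; the retrieve and generate functions and the golden queries are hypothetical stand-ins for your own system:

```python
# Integration test over the retrieval-generation seam. The module, functions,
# and golden queries below are hypothetical stand-ins for your own system.
import pytest

from myapp.rag import retrieve, generate  # hypothetical module under test

GOLDEN_QUERIES = [
    # (user query, substring a grounded answer must contain)
    ("What is our refund window?", "30 days"),
    ("Which regions support SSO?", "EU and US"),
]


@pytest.mark.parametrize("query,expected", GOLDEN_QUERIES)
def test_retrieval_generation_seam(query, expected):
    docs = retrieve(query, k=5)
    assert docs, f"retriever returned nothing for {query!r}"
    answer = generate(query, context=docs)
    # The assertion targets the composition: right documents AND a grounded answer.
    assert expected.lower() in answer.lower()
```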

Reasoning Model Economics: When Chain-of-Thought Earns Its Cost

· 9 min read
Tian Pan
Software Engineer

A team at a mid-size SaaS company added "let's think step by step" to every prompt after reading a few benchmarks. Their response quality went up measurably — and their LLM bill tripled. When they dug into the logs, they found that most of the extra tokens were being spent on tasks like classifying support tickets and summarizing meeting notes, where the additional reasoning added nothing detectable to output quality.

Extended thinking models are a genuine capability leap for hard problems. They're also a reliable cost trap when applied indiscriminately. The difference between a well-tuned reasoning deployment and an expensive one often comes down to one thing: understanding which tasks actually benefit from chain-of-thought, and which tasks are just paying for elaborate narration of obvious steps.
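One common mitigation is routing: pay for extended reasoning only on task types where intermediate steps plausibly matter. A minimal sketch in which the task taxonomy and model names are assumptions, not recommendations:

```python
# Route by task type so chain-of-thought spend lands only where intermediate
# steps plausibly matter. Task taxonomy and model names are assumptions.
REASONING_TASKS = {"multi_step_math", "code_debugging", "contract_analysis"}


def pick_model_and_prompt(task_type: str, base_prompt: str) -> tuple[str, str]:
    if task_type in REASONING_TASKS:
        return "reasoning-model", base_prompt + "\n\nThink step by step before answering."
    # Classification, summarization, etc. get the cheap model and no reasoning preamble.
    return "fast-model", base_prompt


model, prompt = pick_model_and_prompt("ticket_classification", "Classify this ticket: ...")
print(model)  # fast-model: no tripled bill for a routine task
```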

Shadow to Autopilot: A Readiness Framework for AI Feature Autonomy

· 11 min read
Tian Pan
Software Engineer

When a fintech company first deployed an AI transaction approval agent, the product team was convinced the model was ready for autonomy after a week of positive offline evals. They pushed it to co-pilot mode — where the agent suggested approvals and humans could override — and the approval rates looked great. Three weeks later, a pattern surfaced: the model was systematically under-approving transactions from non-English-speaking users in ways that correlated with name patterns, not risk signals. No one had checked segment-level performance before the rollout. This wasn't a fraud-detection failure. It was a stage-gate failure.

Most teams understand, in principle, that AI features should be rolled out gradually. What they don't have is a concrete engineering framework for what "gradual" actually means: which metrics unlock each stage, what monitoring is required before escalation, and what triggers an automatic rollback. Without these, autonomy escalation becomes an act of organizational optimism rather than a repeatable engineering decision.
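What "concrete" can look like is stage gates encoded as data rather than tribal knowledge. A minimal sketch; the stages, metrics, and thresholds are illustrative assumptions:

```python
from dataclasses import dataclass


@dataclass
class StageGate:
    """Promotion criteria for one autonomy stage.

    Metric names and thresholds are illustrative assumptions, not the post's values.
    """
    name: str
    min_agreement_with_human: float   # model vs. human decision agreement
    max_segment_gap: float            # worst user segment vs. overall metric
    min_days_in_stage: int
    rollback_trigger: str             # condition that demotes automatically


STAGES = [
    StageGate("shadow",    0.95, 0.03, 14, "n/a (read-only)"),
    StageGate("co-pilot",  0.97, 0.02, 30, "human override rate > 5%"),
    StageGate("autopilot", 0.99, 0.01, 60, "segment gap > 2% for 24h"),
]
```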