
35 posts tagged with "engineering"


Prompt Archaeology: Recovering Intent from Legacy Prompts Nobody Documented

· 10 min read
Tian Pan
Software Engineer

You join a team that's been running an LLM feature in production for eighteen months. The feature is working — users like it, the business cares about it — but nobody can explain exactly what the prompt does or why it was written the way it was. The engineer who wrote it left. The Slack thread where they discussed it is buried somewhere in a channel that no longer exists. The prompt lives in a database record, 900 tokens long, with no comments and no commit message beyond "update prompt."

Now you've been asked to change it.

This situation is more common than the industry admits. Prompts are treated like configuration values: quick to write, invisible in code review, and forgotten the moment they start working. The difference is that a misconfigured feature flag announces itself immediately. A misconfigured prompt will silently degrade behavior across a subset of edge cases for weeks before anyone notices.

The Cold Start Problem in AI Personalization: Being Useful Before You Have Data

· 11 min read
Tian Pan
Software Engineer

Most personalization systems are built around a flywheel: users interact, you learn their preferences, you show better recommendations, they interact more. The flywheel spins faster as data accumulates. The problem is that the flywheel needs momentum before it does anything useful — and a new user has none.

This is the cold start problem. And it's more dangerous than most teams recognize when they first ship personalization. A new user arrives with no history, no signal, and often a skeptical prior: "AI doesn't know me." You have roughly 5–15 minutes to prove otherwise before they form an opinion that determines whether they'll stay long enough to generate the data that would let you actually help them. Up to 75% of new users abandon products in the first week if that window goes badly.

The cold start problem isn't a data problem. It's an initialization problem. The engineering question is: what do you inject in place of history?
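
One common way to frame that injection is to score items against a population-level prior and fade the prior out as real interactions accumulate. The sketch below is a minimal illustration of that idea, not a description of any particular system; the profile fields, the pseudo-count, and the scoring inputs are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class UserProfile:
    interactions: list[str] = field(default_factory=list)  # item IDs the user engaged with

def score_item(item_id: str, user: UserProfile,
               popularity_prior: dict[str, float],
               personal_score: dict[str, float]) -> float:
    """Blend a population-level prior with personal signal.

    With zero history the score is pure prior; as interactions
    accumulate, the personal component takes over.
    """
    n = len(user.interactions)
    k = 10  # pseudo-count: roughly how many interactions it takes to outweigh the prior
    weight_personal = n / (n + k)
    prior = popularity_prior.get(item_id, 0.0)
    personal = personal_score.get(item_id, 0.0)
    return (1 - weight_personal) * prior + weight_personal * personal
```

The design choice is the pseudo-count `k`: too small and one click overwhelms the prior, too large and the product never feels personalized even for engaged users.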

The LLM Pipeline Monolith vs. Chain Trade-off: When Task Decomposition Helps and When It Hurts

· 8 min read
Tian Pan
Software Engineer

Most teams building LLM pipelines reach for chaining almost immediately. A complex task gets split into steps — extract, then classify, then summarize, then format — and each step gets its own prompt. It feels right: smaller prompts are easier to write, easier to debug, and easier to iterate on. But here's what rarely gets asked: is a chain actually more accurate than doing the whole thing in one call? In most codebases I've seen, nobody measured.

The monolith vs. chain trade-off is one of the most consequential architectural decisions in AI engineering, and it's almost always made by instinct. This post breaks down what the empirical evidence says, when decomposition genuinely helps, when it quietly makes things worse, and what signals to watch for in production.
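
The measurement itself can be small. The sketch below assumes a hypothetical `call_llm` client, a labeled held-out set, and a support-ticket routing task; none of it comes from the post, it only shows the shape of running both architectures over the same examples before committing to one.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for your model client; not a real API."""
    raise NotImplementedError

def classify_monolith(ticket: str) -> str:
    # One call: the prompt asks for the routing category directly.
    return call_llm(
        "Read the support ticket, identify the product and the issue, "
        f"and answer with one routing category:\n{ticket}"
    )

def classify_chain(ticket: str) -> str:
    # Same task decomposed into steps; each hop can also introduce error.
    product = call_llm(f"Extract the product name from this ticket:\n{ticket}")
    issue = call_llm(f"Summarize the core issue in one sentence:\n{ticket}")
    return call_llm(
        f"Product: {product}\nIssue: {issue}\nAnswer with one routing category."
    )

def accuracy(pipeline, labeled: list[tuple[str, str]]) -> float:
    hits = sum(pipeline(text).strip().lower() == label.lower()
               for text, label in labeled)
    return hits / len(labeled)

# Measure both on the same held-out set before committing to either shape:
#   accuracy(classify_monolith, examples) vs. accuracy(classify_chain, examples)
```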

Sampling Parameters in Production: The Tuning Decisions Nobody Explains

· 11 min read
Tian Pan
Software Engineer

Most engineers treat LLM quality regressions as a prompt engineering problem or a model capability problem. They rewrite system prompts, try a newer model, or add few-shot examples. They rarely check the three numbers sitting silently at the top of every API call: temperature, top-p, and top-k. But those defaults quietly shape every response your model produces, and the wrong values cause output variance that teams blame on the model for months before realizing the culprit was a configuration value they never touched.

This isn't an introductory explainer. If you're running LLMs in production—for extraction pipelines, code generation, summarization, or any output that feeds into real systems—these are the mechanics and tradeoffs you need to understand before you can tune intelligently.
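
As a concrete starting point, here is a hypothetical per-task configuration table. The parameter names follow common provider conventions, but exact support and defaults vary by API, so treat the values as illustrative baselines to measure against rather than recommendations from the post.

```python
# Illustrative per-task sampling configurations. Values are starting points
# to evaluate against your own outputs, not tuned recommendations.
SAMPLING_PROFILES = {
    # Structured extraction feeding parsers: minimize variance.
    "extraction": {"temperature": 0.0, "top_p": 1.0},
    # Code generation: low temperature, mild nucleus cutoff.
    "codegen": {"temperature": 0.2, "top_p": 0.95},
    # Summarization: some variation is acceptable.
    "summarization": {"temperature": 0.5, "top_p": 0.9},
    # Brainstorming / creative drafts: deliberately diverse.
    "ideation": {"temperature": 0.9, "top_p": 0.95},
}

def sampling_params(task: str) -> dict:
    """Fail loudly instead of silently inheriting provider defaults."""
    if task not in SAMPLING_PROFILES:
        raise KeyError(f"No sampling profile defined for task {task!r}")
    return SAMPLING_PROFILES[task]
```

Making the profile an explicit, named value is most of the point: a setting that exists in code gets reviewed and versioned instead of drifting with provider defaults.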

The Accessibility Gap in AI Interfaces Nobody Is Shipping Around

· 8 min read
Tian Pan
Software Engineer

Most AI teams run accessibility audits on their landing pages. Almost none run them on the chat interface itself. The gap isn't laziness — it's that the tools don't exist. WCAG 2.2 has no success criterion for streaming content, no standard for non-deterministic outputs, and no guidance for token-by-token delivery. Which means every AI product streaming responses into a <div> right now is operating in a compliance grey zone while breaking the experience for a significant portion of its users.

This isn't a minor edge case. Blind and low-vision users report information-seeking as their top AI use case. Users with dyslexia, ADHD, and cognitive disabilities are actively trying to use AI tools to reduce reading load — and the default implementation pattern actively makes things worse for them.

AI Code Review at Scale: When Your Bot Creates More Work Than It Saves

· 10 min read
Tian Pan
Software Engineer

Most teams that adopt an AI code reviewer go through the same arc: initial excitement, a burst of flagged issues that feel useful, then a slow drift toward ignoring the bot entirely. Within a few months, engineers have developed a muscle memory for dismissing AI comments without reading them. The tool still runs. The comments still appear. Nobody acts on them anymore.

This is not a tooling problem. It is a measurement problem. Teams deploy AI code review without ever defining what "net positive" looks like — and without that baseline, alert fatigue wins.
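
Defining "net positive" can start with tracking what happens to each bot comment. The data shape below is assumed for illustration; in practice you would derive these fields from your code host's review API and subsequent diffs.

```python
from dataclasses import dataclass

@dataclass
class BotComment:
    pr_id: str
    resolved_by_code_change: bool   # a human changed the flagged code
    dismissed: bool                 # explicitly resolved without a change

def review_bot_report(comments: list[BotComment]) -> dict:
    """Rough baseline metrics for 'is the bot net positive?'."""
    total = len(comments)
    actioned = sum(c.resolved_by_code_change for c in comments)
    dismissed = sum(c.dismissed for c in comments)
    ignored = total - actioned - dismissed
    return {
        "total_comments": total,
        "actioned_rate": actioned / total if total else 0.0,
        "dismissed_rate": dismissed / total if total else 0.0,
        "ignored_rate": ignored / total if total else 0.0,
    }
```

A rising ignored rate is the early signal of the arc described above, visible well before engineers start skipping the comments entirely.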

API Contracts for Non-Deterministic Services: Versioning When Output Shape Is Stochastic

· 9 min read
Tian Pan
Software Engineer

Your content moderation service returns {"severity": "medium", "confidence": 0.85}. The downstream billing system parses severity as an enum with values ["low", "medium", "high"]. A model update causes the service to occasionally return "Medium" with a capital M. No deployment happened. No schema changed. The integration breaks in production, and nobody catches it for six days because the HTTP status codes are all 200.

This is the foundational problem with API contracts for LLM-backed services: the surface looks like a REST API, but the behavior underneath is probabilistic. Standard contract tooling assumes determinism. When that assumption breaks, it breaks silently.
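
One mitigation, sketched under the assumptions of the moderation example above: validate and normalize at the service boundary so out-of-contract values surface as errors instead of quiet 200s. The field names mirror the example; everything else is illustrative.

```python
from typing import Any

ALLOWED_SEVERITIES = {"low", "medium", "high"}

def validate_moderation_response(raw: dict[str, Any]) -> dict[str, Any]:
    """Enforce the contract at the boundary, not in downstream consumers.

    Normalizes casing drift ("Medium" -> "medium") and raises on anything
    the contract does not allow, so violations become visible errors.
    """
    severity = str(raw.get("severity", "")).strip().lower()
    if severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"severity {raw.get('severity')!r} violates contract")
    confidence = float(raw.get("confidence", -1))
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence {raw.get('confidence')!r} out of range")
    return {"severity": severity, "confidence": confidence}
```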

Pricing AI Features: The Unit Economics Framework Engineering Teams Always Skip

· 11 min read
Tian Pan
Software Engineer

Cursor hit $1 billion in revenue in 2025 and lost $150 million doing it. Every dollar customers paid went straight to LLM API providers, with nothing left for engineering, support, or infrastructure overhead. This wasn't a scaling problem—it was a unit economics problem that was invisible until it was catastrophic.

Most engineering teams building AI features make the same mistake: they treat inference cost as a minor line item, ship a flat-rate subscription, and assume the economics will work out later. They don't. Variable inference costs don't behave like any other COGS in software, and the pricing architectures that work for traditional SaaS will bleed you dry the moment your heaviest users find your most expensive feature.
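
The check is small enough to do before shipping. The sketch below uses placeholder prices and usage numbers purely to show how a flat-rate plan behaves once a power user finds an expensive feature; none of the figures come from the post.

```python
def monthly_gross_margin(price_per_month: float,
                         requests_per_month: int,
                         input_tokens: int,
                         output_tokens: int,
                         input_price_per_mtok: float,
                         output_price_per_mtok: float) -> float:
    """Back-of-envelope margin for a flat-rate AI subscription.

    All numbers are placeholders; the point is that inference cost scales
    with usage while flat-rate revenue does not.
    """
    cost_per_request = (input_tokens * input_price_per_mtok +
                        output_tokens * output_price_per_mtok) / 1_000_000
    inference_cost = cost_per_request * requests_per_month
    return price_per_month - inference_cost

# A light user vs. a power user on the same hypothetical $20/month plan:
#   monthly_gross_margin(20, 100, 2_000, 500, 3.0, 15.0)      ->  ~$18.65
#   monthly_gross_margin(20, 5_000, 8_000, 2_000, 3.0, 15.0)  ->  ~-$250
```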

The Provider Abstraction Tax: Building LLM Applications That Can Swap Models Without Rewrites

· 10 min read
Tian Pan
Software Engineer

A healthcare startup migrated from one major frontier model to a newer version of the same provider's offering. The result: 400+ engineering hours to restore feature parity. The new model emitted five times as many tokens per response, eliminating projected cost savings. It started offering unsolicited diagnostic opinions—a liability problem. And it broke every JSON parser downstream because it wrapped responses in markdown code fences. Same provider, different model, total rewrite.

This is the provider abstraction tax: not the cost of switching providers, but the cumulative cost of not planning for it. It is not a single migration event. It is an ongoing drain—the behavioral regressions you discover three weeks after an upgrade, the prompt engineering work that does not transfer across models, the retry logic that silently fails because one provider measures rate limits by input tokens separately from output tokens. Teams that build directly on a single provider accumulate this debt invisibly, until a deprecation notice or a pricing change makes the bill come due all at once.
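
A minimal sketch of the boundary such teams end up building anyway, written against no particular SDK: one adapter interface that normalizes the things that actually drift across models, such as response wrapping and token accounting. All names here are hypothetical.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class Completion:
    text: str            # content with any markdown code fences stripped
    input_tokens: int    # token accounting normalized across providers
    output_tokens: int

class ModelAdapter(ABC):
    """Boundary the application codes against; one subclass per provider/model."""

    @abstractmethod
    def complete(self, prompt: str, *, max_output_tokens: int) -> Completion: ...

def strip_code_fences(text: str) -> str:
    """Defensive unwrapping (a rough sketch) for models that fence JSON in ```...```."""
    stripped = text.strip()
    if stripped.startswith("```"):
        stripped = stripped.split("\n", 1)[-1]
        stripped = stripped.rsplit("```", 1)[0]
    return stripped.strip()
```

The adapter does not make migrations free; it concentrates the behavioral differences in one place where they can be tested, instead of scattering them across every call site.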

Your AI Feature Should Lose to a Regex First

· 9 min read
Tian Pan
Software Engineer

A team spends three weeks integrating a foundation model to classify incoming support tickets into routing categories. The model reaches 87% accuracy in testing. They ship it. Six months later, an engineer notices that 70% of tickets contain a product name in the subject line and that a simple lookup table would have handled those with 99% accuracy. The LLM is running on the hard 30% and making it up the rest of the time.

This is not an unusual story. It happens because teams treat "use an LLM" as the first implementation choice rather than the last. The fix is a required gate: your AI feature must lose to a dumb rule before you are allowed to build the AI version.
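
The gate can be literal code. Below is a hypothetical version of the ticket-routing example: a lookup-table baseline handles whatever it can, and only the remainder falls through to a model. The product names and queues are invented for illustration.

```python
import re

# Hypothetical product -> routing-queue table covering the easy majority.
PRODUCT_ROUTES = {
    "invoicing": "billing",
    "mobile app": "mobile",
    "sso": "auth",
}
PRODUCT_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, PRODUCT_ROUTES)) + r")\b", re.IGNORECASE
)

def route_ticket(subject: str, llm_classify=None) -> str:
    """Baseline first: cheap, deterministic, and easy to audit.

    Only tickets the rule cannot handle fall through to the model.
    """
    match = PRODUCT_PATTERN.search(subject)
    if match:
        return PRODUCT_ROUTES[match.group(1).lower()]
    if llm_classify is not None:
        return llm_classify(subject)   # the hard remainder
    return "triage"                    # default queue when no model is wired in
```

Measuring this baseline first gives you the number the AI version has to beat, and it keeps the model off the traffic a dictionary already handles.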

The Delegation Cliff: Why AI Agent Reliability Collapses at 7+ Steps

· 8 min read
Tian Pan
Software Engineer

An agent with 95% per-step reliability sounds impressive. At 10 steps, you have a 60% chance of success. By 14 steps, you're at a coin flip. At 20 steps, it's down to 36%, and at 50 steps under 8%—and that's with a generous 95% estimate. Field data suggests real-world agents fail closer to 20% per action, which means a 100-step task succeeds roughly twice in ten billion attempts. This isn't a model quality problem or a prompt engineering problem. It's a compounding math problem, and most teams building agents haven't internalized it yet.

This is the delegation cliff: the point at which adding one more step to an agent's task doesn't add risk linearly—it compounds, multiplying the task's overall chance of success by yet another per-step factor.
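
The arithmetic is short enough to keep next to your agent code. A minimal sketch, reproducing the numbers from the paragraph above:

```python
def task_success_probability(per_step_reliability: float, steps: int) -> float:
    """Success requires every step to succeed, so reliability compounds."""
    return per_step_reliability ** steps

# The numbers from the paragraph above:
#   task_success_probability(0.95, 10)   -> ~0.60
#   task_success_probability(0.95, 20)   -> ~0.36
#   task_success_probability(0.95, 50)   -> ~0.08
#   task_success_probability(0.80, 100)  -> ~2e-10 (roughly twice in ten billion)
```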

Latency Budgets for AI Features: How to Set and Hit p95 SLOs When Your Core Component Is Stochastic

· 11 min read
Tian Pan
Software Engineer

Your system averages 400ms end-to-end. Your p95 is 4.2 seconds. Your p99 is 11 seconds. You committed to a "sub-second" experience in the product spec. Every metric in your dashboard looks fine until someone asks what happened to 5% of users — and suddenly the average you've been celebrating is the thing burying you.

This is the latency budget problem for AI features, and it's categorically different from what you've solved before. When your core component is a database query or a microservice call, p95 latency is roughly predictable and amenable to standard SRE techniques. When your core component is an LLM, the distribution of response times is heavy-tailed, input-dependent, and partially driven by conditions you don't control. You need a different mental model before you can set an honest SLO — let alone hit it.
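
A small sketch of the reporting habit that follows from this: judge the SLO against the tail of real latency samples, not the mean. The percentile implementation here is a rough nearest-rank version, fine for a sanity check but not a substitute for your metrics backend.

```python
def percentile(samples_ms: list[float], pct: float) -> float:
    """Nearest-rank percentile; adequate for an SLO sanity check."""
    ordered = sorted(samples_ms)
    rank = max(0, int(round(pct / 100 * len(ordered))) - 1)
    return ordered[rank]

def slo_report(samples_ms: list[float], p95_budget_ms: float) -> dict:
    """Report the tail, not the mean; the mean is what buries you."""
    return {
        "mean_ms": sum(samples_ms) / len(samples_ms),
        "p95_ms": percentile(samples_ms, 95),
        "p99_ms": percentile(samples_ms, 99),
        "p95_within_budget": percentile(samples_ms, 95) <= p95_budget_ms,
    }
```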