678 posts tagged with "ai-engineering"

The AI Hiring Rubric Problem: Why Your Interview Loop Selects the Wrong Engineer

· 8 min read
Tian Pan
Software Engineer

Most teams hiring AI engineers today are running an interview process optimized for a job that doesn't exist. They're screening for LeetCode fluency, quizzing candidates on transformer internals, and rewarding anyone who can confidently sketch a distributed system on a whiteboard. Then those same candidates join the team, struggle to debug a hallucinating retrieval pipeline, and ship a model integration that works beautifully in staging and silently degrades in production.

This isn't a talent problem. It's a measurement problem. The skills that predict success in AI engineering are largely invisible to traditional interview loops—and the skills interviews do measure correlate poorly with what the job actually requires.

Ambient AI Design: When the Chat Interface Is the Wrong Abstraction

· 8 min read
Tian Pan
Software Engineer

Most engineering teams default to building AI features as chat interfaces. A user types something; the model responds. The pattern feels natural because it maps to human conversation, and the tooling makes it easy. But when you watch those chat-based AI features in production, you often see the same dysfunction: the UI sits idle, waiting for a user who is too busy, too distracted, or simply unaware that they should be asking something.

Chat is a pull model. The user initiates. The AI reacts. For a meaningful subset of the valuable AI work in any product—monitoring, anomaly detection, workflow automation, proactive notification—pull is the wrong shape. The work needs to happen whether or not the user remembered to open the chat window.

Backpressure Patterns for LLM Pipelines: Why Exponential Backoff Isn't Enough

· 10 min read
Tian Pan
Software Engineer

During peak usage, some LLM providers experience failure rates exceeding 20%. When your system hits that wall and responds by doubling its wait time and retrying, you are solving the wrong problem. Exponential backoff handles a single call's resilience. It does nothing for the system as a whole — nothing for wasted tokens, nothing for connection pool exhaustion, nothing for the 50 other requests queued behind the one that just got a 429.

The traffic patterns hitting LLM APIs have also changed fundamentally. Simple sub-100-token queries dropped from 80% to roughly 20% of traffic between 2023 and 2025, while requests over 500 tokens became the consistent majority. Agentic workflows chain 10–20 sequential calls in rapid bursts, generating traffic patterns that look indistinguishable from a DDoS attack under traditional request-per-minute rate limits. The infrastructure built for REST APIs with predictable payloads is not the infrastructure you need for LLM pipelines.
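One way to picture the gap between per-call backoff and system-level backpressure is an admission check that budgets tokens rather than requests. The sketch below is illustrative only: the `TokenBudget` class, its per-minute limit, and the injectable clock are assumptions, not any provider's actual rate-limit scheme.

```python
import time


class TokenBudget:
    """Token bucket measured in LLM tokens instead of request counts (hypothetical helper)."""

    def __init__(self, tokens_per_minute, clock=time.monotonic):
        self.capacity = float(tokens_per_minute)
        self.available = float(tokens_per_minute)
        self.refill_per_sec = tokens_per_minute / 60.0
        self.clock = clock  # injectable so the behavior can be tested without sleeping
        self.last = clock()

    def _refill(self):
        # Add back tokens for the time elapsed since the last check, capped at capacity.
        now = self.clock()
        elapsed = now - self.last
        self.available = min(self.capacity, self.available + elapsed * self.refill_per_sec)
        self.last = now

    def try_acquire(self, estimated_tokens):
        """Admit the request only if the budget covers it; the caller queues or sheds otherwise."""
        self._refill()
        if estimated_tokens <= self.available:
            self.available -= estimated_tokens
            return True
        return False
```

A caller would check `try_acquire` before sending a request and queue or reject the work when it returns `False`, so a burst of large agentic calls is slowed at the edge instead of being discovered through a wall of 429s.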

Behavioral Contracts: Writing AI Requirements That Engineers Can Actually Test

· 11 min read
Tian Pan
Software Engineer

Most AI projects that die in the QA phase don't fail because the model is bad. They fail because nobody agreed on what "good" meant before the model was built. The acceptance criteria in the ticket said something like "the summarization feature should produce accurate, relevant summaries" — and when the engineer asked what "accurate" meant, the answer was "you know it when you see it." That is not a behavioral requirement. That is a hope.

The problem compounds because teams imported their existing requirements process from deterministic software and applied it unchanged to systems that are fundamentally stochastic. When you write assertTrue(output.equals("Paris")) for a database query, the test either passes or fails with complete certainty. When you write the same shape of assertion for an LLM, you get a test that fails on every valid paraphrase and passes on every confident hallucination. The unit test is lying to you, and the spec it was derived from was never designed for a system that generates distributions of outputs rather than single values.
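As a rough illustration of what a testable behavioral requirement could look like, the sketch below checks a property across many sampled outputs and requires a minimum pass rate instead of exact equality. The function names, the sample outputs, and the thresholds are all assumptions made for the example.

```python
def meets_contract(outputs, check, min_pass_rate):
    """Return True if at least min_pass_rate of the sampled outputs satisfy check."""
    if not outputs:
        raise ValueError("need at least one sampled output")
    passed = sum(1 for output in outputs if check(output))
    return passed / len(outputs) >= min_pass_rate


def mentions_paris(output):
    # Property check: tolerant of paraphrase, unlike output == "Paris".
    return "paris" in output.lower()


# Hypothetical samples from repeated runs of the same prompt.
samples = ["Paris", "The capital is Paris.", "It's paris.", "Lyon"]
```

With these samples, `meets_contract(samples, mentions_paris, 0.75)` holds while a 0.9 threshold does not. A substring check is a deliberately crude property; the point is only that the requirement names a property and a rate, both of which can be agreed on before the model is built.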

The Cold Start Problem in AI Features: Why Week One Always Fails

· 11 min read
Tian Pan
Software Engineer

You build a personalization feature, wire it into your app, and ship it. Week one arrives. The system dutifully serves every new user the same handful of globally popular items — your AI, supposedly intelligent, is no smarter than an alphabetically sorted list. Your engagement metrics barely move. Your team concludes the model needs more tuning. It doesn't. The model is working exactly as designed. The problem is you asked it to learn before it had anything to learn from.

This is the cold start problem, and it kills more AI features than bad models ever will.

The core dynamic is circular: a behavioral ML system needs user interactions to produce useful predictions, but it needs to produce useful predictions to earn user interactions. One large e-commerce platform documented that cold start affected more than 60% of their new users — and those users were receiving misfired recommendations that measurably hurt conversion rates. In aggregate metrics, this signal was nearly invisible because warm users masked the damage.
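To see how warm users can mask cold-start damage, it may help to split a metric by how much interaction history each user has. This is a minimal sketch: the record fields (`interactions`, `converted`) and the threshold of five interactions are assumptions chosen for illustration.

```python
def conversion_by_segment(users, warm_threshold=5):
    """Compute conversion rate separately for cold and warm users."""
    totals = {"cold": [0, 0], "warm": [0, 0]}  # segment -> [conversions, user count]
    for user in users:
        segment = "warm" if user["interactions"] >= warm_threshold else "cold"
        totals[segment][0] += 1 if user["converted"] else 0
        totals[segment][1] += 1
    return {
        segment: (conversions / count if count else None)
        for segment, (conversions, count) in totals.items()
    }
```

Reported side by side, the two rates make a cold-segment drop visible even when the blended number barely moves.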

Debugging LLM Failures Systematically: A Field Guide for Engineers Who Can't Read Logs

· 12 min read
Tian Pan
Software Engineer

A fintech startup added a single comma to their system prompt. The next day, their invoice generation bot was outputting gibberish and they'd lost $8,500 before anyone traced the cause. No error was thrown. No alert fired. The application kept running, confident and wrong.

This is what debugging LLMs in production actually looks like. There are no stack traces pointing to line numbers. There's no core dump you can inspect. The system doesn't crash — it continues to operate while silently producing degraded output. Traditional debugging instincts don't transfer. Most engineers respond by randomly tweaking prompts until something looks better, deploying based on three examples, and calling it fixed. Then the problem resurfaces two weeks later in a different shape.

There's a better way. LLM failures follow systematic patterns, and those patterns respond to structured investigation. This is the methodology.

Why Gradual Rollouts Don't Work for AI Features (And What to Do Instead)

· 9 min read
Tian Pan
Software Engineer

Canary deployments work because bugs are binary. Code either crashes or it doesn't. You route 1% of traffic to the new version, watch error rates and latency for 30 minutes, and either roll back or proceed. The system grades itself. A bad deploy announces itself loudly.

AI features don't do that. A language model that starts generating subtly wrong advice, outdated recommendations, or plausible-sounding nonsense will produce zero 5xx errors. Latency stays within SLOs. The canary looks green while the product is silently failing its users.

This isn't a tooling problem. It's a conceptual mismatch. The entire mental model behind gradual rollouts — deterministic code, self-grading systems, binary pass/fail — breaks down the moment you introduce a component whose correctness cannot be measured by observing the request itself.

The Hybrid Automation Stack: A Decision Framework for Mixing Rules and LLMs

· 9 min read
Tian Pan
Software Engineer

Teams that replace all their Zapier flows and RPA scripts with LLM agents tend to discover the same thing six months later: they've traded brittle-but-auditable for flexible-but-unmaintainable. The Zapier flows broke in predictable ways—step 14 failed because the API changed. The LLM workflows break invisibly—the model quietly routes support tickets to the wrong queue, and nobody finds out until a customer escalates. The audit log says "AI decision," which is lawyer-speak for "no one knows."

The answer isn't to avoid LLMs in automation. It's to be deliberate about which tasks go to which system, and to architect the seam between them so failures don't cross over.

The Multi-Tenant Prompt Problem: When One System Prompt Serves Many Masters

· 9 min read
Tian Pan
Software Engineer

You ship a new platform-level guardrail — a rule that prevents the AI from discussing competitor pricing. It goes live Monday morning. By Wednesday, your largest enterprise customer files a support ticket: their sales assistant, which they'd carefully tuned to compare vendor options for their procurement team, stopped working. They didn't change anything. You changed something, and the blast radius hit them invisibly.

This is the multi-tenant prompt problem. B2B AI products that allow customer customization are actually running a layered instruction system, and most teams don't treat it like one. They treat it like string concatenation: take the platform prompt, append the customer's instructions, maybe append user preferences, and call the LLM. The model figures out the rest.

The model doesn't figure it out. It silently picks a winner, and you don't find out which one until someone complains.
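One alternative to string concatenation is to resolve the layers explicitly before any prompt is built, so conflicts surface at configuration time. The sketch below assumes each layer can be expressed as a dictionary of policy keys and that earlier layers win; the layer names, the policy keys, and the precedence order are all hypothetical.

```python
PRECEDENCE = ["platform", "tenant", "user"]  # assumed order: earlier layers win


def resolve_policies(layers):
    """Merge layered policies and report every key where a lower layer was overridden."""
    resolved = {}   # key -> (winning layer, value)
    conflicts = []  # (key, winning layer, overridden layer)
    for name in PRECEDENCE:
        for key, value in layers.get(name, {}).items():
            if key not in resolved:
                resolved[key] = (name, value)
            elif resolved[key][1] != value:
                conflicts.append((key, resolved[key][0], name))
    merged = {key: value for key, (_, value) in resolved.items()}
    return merged, conflicts
```

Running this when a platform rule changes would list which tenants now have an overridden setting, which is the information the Monday-to-Wednesday scenario above was missing.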

The Multi-Variable Regression Problem: Isolating AI Failures When Everything Changed at Once

· 11 min read
Tian Pan
Software Engineer

The ticket comes in on a Monday morning: user satisfaction for your AI-powered feature dropped 18% over the weekend. You open the deployment log and your stomach drops. Friday's release included a model version bump from your provider, a prompt refinement by the product team, a retrieval corpus refresh after a content audit, and a tool schema update for a renamed API field. Four changes. One regression. Zero idea which variable to blame.

This is the multi-variable regression problem, and it's the hardest class of failure in production AI systems. Not because the failure is exotic — behavioral regressions happen constantly — but because the conditions that produce it are nearly guaranteed when teams move fast. The changes that individually look safe pile up, release together, and then leave you debugging in the dark.
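A quick sketch of the search space for the four-change release described above: testing each change alone against the baseline takes five evaluation runs, while covering every combination (to catch interactions) takes sixteen. The change names below are labels invented for the example.

```python
from itertools import combinations

CHANGES = ["model_version", "prompt", "retrieval_corpus", "tool_schema"]


def single_change_configs(changes):
    """Baseline (no changes) plus one configuration per change applied alone."""
    return [frozenset()] + [frozenset([change]) for change in changes]


def all_configs(changes):
    """Every subset of changes, from the baseline to the full release."""
    return [
        frozenset(subset)
        for size in range(len(changes) + 1)
        for subset in combinations(changes, size)
    ]
```

Each configuration would be run against the same evaluation set. The one-at-a-time pass is cheaper but can miss a regression that only appears when two changes combine.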

Prompt Linting: The Pre-Deployment Gate Your AI System Is Missing

· 8 min read
Tian Pan
Software Engineer

Every serious engineering team runs a linter before merging code. ESLint catches undefined variables. Prettier enforces formatting. Semgrep flags security anti-patterns. Nobody ships JavaScript to production without running at least one static check first.

Now consider what your team does before shipping a prompt change. If you're like most teams, the answer is: review it in a PR, eyeball it, maybe test it manually against a few inputs. Then merge. The system prompt for your production AI feature — the instruction set that controls how the model behaves for every single user — gets less pre-deployment scrutiny than a CSS change.

This gap is not a minor process oversight. A study analyzing over 2,000 developer prompts found that more than 10% contained vulnerabilities to prompt injection attacks, and roughly 4% had measurable bias issues — all without anyone noticing before deployment. The tooling to catch these automatically exists. Most teams just haven't wired it in yet.
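To make the idea of a pre-merge prompt check concrete, here is a minimal sketch of a rule-based prompt linter. The rules are illustrative assumptions (an unrendered template placeholder, a leftover TODO, a length cap) and are not the checks used in the study mentioned above; a real gate would add injection and bias checks on top.

```python
import re

# Hypothetical rules: (rule id, pattern that signals a problem).
RULES = [
    ("unrendered-placeholder", re.compile(r"\{\{\s*\w+\s*\}\}")),
    ("leftover-todo", re.compile(r"\b(TODO|FIXME)\b")),
]


def lint_prompt(prompt, max_chars=8000):
    """Return a list of (line number, rule id) findings; line 0 means whole-prompt."""
    findings = []
    for lineno, line in enumerate(prompt.splitlines(), start=1):
        for rule_id, pattern in RULES:
            if pattern.search(line):
                findings.append((lineno, rule_id))
    if len(prompt) > max_chars:
        findings.append((0, "prompt-too-long"))
    return findings
```

Wired into CI, a non-empty findings list would fail the build the same way an ESLint error does.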

Schema Entropy: Why Your Tool Definitions Are Rotting in Production

· 10 min read
Tian Pan
Software Engineer

Your agent was working fine in January. By March, it started failing on 15% of tool calls. By May, it was silently producing wrong outputs on another 20%. Nothing in your deployment logs changed. No one touched the agent code. The tool definitions look exactly like they did six months ago — and that's the problem.

Tool schemas don't have to be edited to become wrong. The services they describe change underneath them. Enum values get added. Required fields become optional in a backend refactor. A parameter that used to accept strings now expects an ISO 8601 timestamp. The schema document stays frozen while the underlying API keeps moving, and your agent keeps calling it confidently, with no idea the contract has shifted.

This is schema entropy: the gradual divergence between the tool definitions your agent was given and the behavior your production services actually exhibit. It is one of the most underappreciated reliability problems in production AI systems, and research suggests tool versioning issues account for roughly 60% of production agent failures.
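One way to catch this kind of drift is to periodically diff the frozen tool definition against the schema the live service currently publishes. The sketch below assumes both are available as JSON-Schema-style dictionaries with `properties` and `required` keys; how the live schema is fetched is left out, and the function name is hypothetical.

```python
def schema_drift(tool_schema, live_schema):
    """List differences between a frozen tool definition and the live API schema."""
    findings = []
    tool_props = tool_schema.get("properties", {})
    live_props = live_schema.get("properties", {})
    for name in sorted(set(tool_props) | set(live_props)):
        if name not in live_props:
            findings.append(f"{name}: removed from live API")
        elif name not in tool_props:
            findings.append(f"{name}: missing from tool definition")
        else:
            old, new = tool_props[name], live_props[name]
            # Compare the fields most likely to shift silently.
            for field in ("type", "format", "enum"):
                if old.get(field) != new.get(field):
                    findings.append(
                        f"{name}: {field} changed from {old.get(field)!r} to {new.get(field)!r}"
                    )
    old_required = set(tool_schema.get("required", []))
    new_required = set(live_schema.get("required", []))
    for name in sorted(old_required ^ new_required):
        findings.append(f"{name}: required changed to {name in new_required}")
    return findings
```

Run on a schedule, an empty result means the contract still holds, and any finding is a prompt to update the tool definition before the agent starts failing on it.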