
553 posts tagged with "ai-engineering"


SRE for AI Agents: What Actually Breaks at 3am

· 10 min read
Tian Pan
Software Engineer

A market research pipeline ran uninterrupted for eleven days. Two LangChain agents — an Analyzer and a Verifier — passed requests back and forth, made no progress on the original task, and accumulated $47,000 in API charges before anyone noticed. The system never returned an error. No alert fired. The billing dashboard finally caught it, days after the damage was done.

This is not an edge case. It is the canonical AI agent incident. And if you are running agents in production today, your existing SRE runbooks almost certainly do not cover it.
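What a guard against this failure mode can look like: a watchdog around the agent loop that halts on a spend ceiling, a turn cap, or a detected ping-pong cycle. This is a minimal sketch, not LangChain's API; the agent_step callable, the thresholds, and the duplicate-detection heuristic are all assumptions.

```python
import hashlib

MAX_TURNS = 50
MAX_SPEND_USD = 25.0
REPEAT_WINDOW = 6  # halt if the last N responses are near-duplicates

def run_with_watchdog(agent_step, initial_request: str) -> None:
    """agent_step(request) -> (response_text, cost_usd); hypothetical signature."""
    spend, seen = 0.0, []
    request = initial_request
    for turn in range(MAX_TURNS):
        response, cost = agent_step(request)
        spend += cost
        if spend > MAX_SPEND_USD:
            raise RuntimeError(f"budget exceeded at turn {turn}: ${spend:.2f}")
        seen.append(hashlib.sha256(response.encode()).hexdigest())
        # Two agents ping-ponging tend to produce a short repeating cycle,
        # which shows up as very few distinct response hashes in a window.
        if len(seen) >= REPEAT_WINDOW and len(set(seen[-REPEAT_WINDOW:])) <= 2:
            raise RuntimeError(f"no-progress loop detected at turn {turn}")
        request = response
    raise RuntimeError("turn limit reached without completion")
```

Either exception is the alert that never fired in the incident above: the run stops itself instead of burning eleven days of API spend.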

Stateful Multi-Turn Conversation Infrastructure: Beyond Passing the Full History

· 11 min read
Tian Pan
Software Engineer

Every demo of a conversational AI feature does the same thing: pass a list of messages to the model and print the response. The happy path works, looks great in a Jupyter notebook, and gets you a green light to ship. Then you get to production, and your p99 latency starts creeping up during peak hours. A month later, a customer complains that the assistant "forgot" everything from earlier in the session. Six weeks after that, your session store hits its memory ceiling during a product launch.

The fundamental problem is that "pass the full conversation history" is not a session management strategy. It is the absence of one.
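One alternative to passing everything can be sketched in a few lines: a session that keeps recent turns verbatim and folds older ones into a running summary once a token budget is exceeded. Everything here is an assumption for illustration: the budget, the four-characters-per-token estimate, and the injected summarize() hook.

```python
from dataclasses import dataclass, field

TOKEN_BUDGET = 2000
RECENT_TURNS = 8  # always keep at least this many turns verbatim

def estimate_tokens(text: str) -> int:
    return len(text) // 4  # rough heuristic; use a real tokenizer in practice

@dataclass
class Session:
    summary: str = ""
    turns: list = field(default_factory=list)  # list of (role, text)

    def append(self, role: str, text: str, summarize) -> None:
        """summarize(existing_summary, evicted_text) -> str; injected hook."""
        self.turns.append((role, text))
        while (estimate_tokens(self.summary)
               + sum(estimate_tokens(t) for _, t in self.turns) > TOKEN_BUDGET
               and len(self.turns) > RECENT_TURNS):
            old_role, old_text = self.turns.pop(0)
            # Fold the evicted turn into the running summary; an LLM call
            # or plain truncation, whichever the pipeline can afford.
            self.summary = summarize(self.summary, f"{old_role}: {old_text}")

    def to_messages(self) -> list:
        msgs = [{"role": "system", "content": f"Summary so far: {self.summary}"}]
        msgs += [{"role": r, "content": t} for r, t in self.turns]
        return msgs
```

The point is not this particular policy; it is that eviction, summarization, and budgets are explicit decisions instead of an unbounded list.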

Synthetic Seed Data: Bootstrapping Fine-Tuning Before Your First Thousand Users

· 9 min read
Tian Pan
Software Engineer

Fine-tuning a model is easy when you have data. The brutal part is the moment before your product exists: you need personalization to attract users, but you need users to have personalization data. Most teams either skip fine-tuning entirely ("we'll add it later") or spend weeks collecting labeled examples by hand. Neither works well. The first produces a generic model users immediately recognize as generic. The second is slow enough that by the time you have data, the task has evolved.

Synthetic seed data solves this — but only when you understand exactly where it breaks.
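To make the bootstrap concrete, here is a hedged sketch of a seed-generation loop: ask a model for labeled examples, parse, and dedupe. The call_llm() callable, the prompt template, and the exact-match dedup are placeholder assumptions; real pipelines need semantic dedup and quality filters on top.

```python
import json

SEED_PROMPT = (
    "Generate one realistic example of a user request for {task}, plus the "
    'ideal assistant response. Return JSON with keys "input" and "output".'
)

def generate_seeds(call_llm, task: str, n: int = 200,
                   max_attempts: int = 2000) -> list:
    seeds, seen_inputs = [], set()
    for _ in range(max_attempts):
        if len(seeds) >= n:
            break
        raw = call_llm(SEED_PROMPT.format(task=task))
        try:
            example = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed generations are expected; skip and retry
        if not isinstance(example, dict):
            continue
        key = example.get("input", "").strip().lower()
        # Near-duplicate inputs are the main failure mode of synthetic data;
        # exact-match dedup is the bare minimum, not the whole fix.
        if key and key not in seen_inputs:
            seen_inputs.add(key)
            seeds.append(example)
    return seeds
```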

The Quality Tax of Over-Specified System Prompts

· 9 min read
Tian Pan
Software Engineer

Most engineering teams discover the same thing on their first billing spike: their system prompt has quietly grown to 4,000 tokens of carefully reasoned instructions, and the model has quietly started ignoring half of them. The fix is rarely to add more instructions. It's almost always to delete them.

The instinct to be exhaustive is understandable. More constraints feel like more control. But there's a measurable quality degradation that kicks in as system prompts bloat — and it compounds with cost in ways that aren't visible until they hurt. Research consistently finds accuracy beginning to drop around 3,000 tokens of input, well before hitting any nominal context limit. The model doesn't refuse to comply; it just starts underperforming in ways that are hard to pin down.

This post is about making that degradation visible, understanding why it happens, and building a trimming discipline that doesn't require hoping nothing breaks.
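One way to build that trimming discipline is instruction ablation: re-run your eval suite with each section of the system prompt removed and rank sections by how little their removal costs. A minimal sketch, assuming a run_eval() harness that returns a single accuracy score:

```python
def ablate_system_prompt(sections: list[str], run_eval) -> list[tuple[str, float]]:
    """run_eval(prompt_text) -> accuracy in [0, 1]; hypothetical harness."""
    baseline = run_eval("\n\n".join(sections))
    results = []
    for i, section in enumerate(sections):
        trimmed = sections[:i] + sections[i + 1:]
        delta = run_eval("\n\n".join(trimmed)) - baseline
        # A section whose removal doesn't hurt (or helps) is a deletion
        # candidate: it was costing tokens and attention for nothing.
        results.append((section[:60], delta))
    return sorted(results, key=lambda r: r[1], reverse=True)
```

Sections at the top of the returned list are the ones you can delete without hoping nothing breaks, because you measured it.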

Text-to-SQL at Scale: What Nobody Tells You Before Production

· 11 min read
Tian Pan
Software Engineer

Text-to-SQL demos are deceptively easy to build. You paste a schema into a prompt, ask GPT-4 a question, get back a clean SELECT statement, and suddenly your Slack is full of "what if we built this into our data platform?" messages. Then you try to actually ship it. The benchmark says 85% accuracy. Your internal data team reports that about half the answers are wrong. Your security team asks who reviewed the generated queries before they hit production. Nobody has a good answer.

This is the gap between text-to-SQL as a research problem and text-to-SQL as an engineering problem. The research problem is about getting models to produce syntactically valid SQL. The engineering problem is about schema ambiguity, access control, query validation, and the fact that your enterprise database looks nothing like Spider or BIRD.
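A first line of defense on the engineering side is a pre-execution gate on generated queries. The sketch below is illustrative only: the regex checks stand in for a real SQL parser, and the table allowlist and row cap are made-up examples.

```python
import re

ALLOWED_TABLES = {"orders", "customers", "products"}  # hypothetical allowlist
FORBIDDEN = re.compile(r"\b(insert|update|delete|drop|alter|grant|truncate)\b",
                       re.IGNORECASE)

def validate_generated_sql(sql: str) -> str:
    if FORBIDDEN.search(sql):
        raise ValueError("generated query is not read-only")
    referenced = set(re.findall(r"\b(?:from|join)\s+([a-zA-Z_]\w*)", sql,
                                re.IGNORECASE))
    illegal = referenced - ALLOWED_TABLES
    if illegal:
        raise ValueError(f"query touches non-allowlisted tables: {illegal}")
    # Enforce a row cap so one bad query can't scan the whole warehouse.
    if not re.search(r"\blimit\s+\d+\b", sql, re.IGNORECASE):
        sql = sql.rstrip().rstrip(";") + " LIMIT 1000"
    return sql
```

This is the kind of check the security team is really asking about: not whether the SQL is syntactically valid, but what it is allowed to touch.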

Adding AI to Systems You Don't Own: The Third-Party Model Integration Playbook

· 12 min read
Tian Pan
Software Engineer

Most engineering problems are self-inflicted. The code you deploy, the schemas you define, the dependencies you choose — when things break, you can trace it back to something in your control. AI API integrations violate this assumption. When you build on a third-party model API, a silent model update can degrade your feature at 3am without a deploy happening on your end. A provider outage can take your product offline. A price change can turn a profitable workflow into a money-losing one. The breaking change will never show up in your changelog.

This isn't a reason to avoid external AI APIs. It's a reason to build as if you don't trust them.
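In practice, "build as if you don't trust them" starts with three mechanical habits: pin explicit model versions, enforce timeouts, and keep a fallback chain. A minimal sketch, where the provider callables and model IDs are placeholders rather than any specific vendor's SDK:

```python
import time

def call_with_fallback(prompt: str, providers, timeout_s: float = 10.0) -> str:
    """providers: list of (name, call_fn, pinned_model_id); call_fn is a
    hypothetical client function with signature (model, prompt, timeout)."""
    errors = []
    for name, call_fn, model in providers:
        start = time.monotonic()
        try:
            # Pinning an explicit version string means a silent upstream
            # alias change can't swap the model underneath you.
            result = call_fn(model=model, prompt=prompt, timeout=timeout_s)
            # Log per-provider latency so degradation is visible before
            # users report it.
            print(f"{name} answered in {time.monotonic() - start:.2f}s")
            return result
        except Exception as exc:  # outage, rate limit, timeout, 5xx
            errors.append((name, repr(exc)))
    raise RuntimeError(f"all providers failed: {errors}")
```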

The Transcript Layer Lie: Why Your Multimodal Pipeline Hallucinates Downstream

· 9 min read
Tian Pan
Software Engineer

Your ASR system returned "the patient takes metaformin twice daily." The correct word was metformin. The transcript looked clean — no [INAUDIBLE] markers, no error flags. Confidence was 0.73 on that word. Your pipeline discarded that number and handed clean text to the LLM. The LLM, treating it as ground truth, reasoned about a medication that doesn't exist.

This is the transcript layer lie: the implicit assumption that intermediate text representations — whether produced by speech recognition, OCR, or vision models parsing a document — are reliable enough to pass downstream without qualification. They aren't. But almost every production pipeline treats them as if they are.
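One mitigation is to stop laundering away the confidence scores: carry per-word confidence through and mark low-confidence spans in the text the LLM sees. A sketch, assuming a typical (token, confidence) word list from the ASR layer; the inline [?] notation and the 0.85 threshold are invented for illustration.

```python
LOW_CONFIDENCE = 0.85  # illustrative threshold; tune per ASR system

def annotate_transcript(words: list[tuple[str, float]]) -> str:
    """words: [(token, confidence), ...] as many ASR APIs can provide."""
    out = []
    for token, conf in words:
        if conf < LOW_CONFIDENCE:
            # Surface the uncertainty instead of discarding it, so the
            # downstream prompt can say "treat [?] spans as unreliable".
            out.append(f"[?{conf:.2f}]{token}")
        else:
            out.append(token)
    return " ".join(out)

# annotate_transcript([("takes", 0.98), ("metaformin", 0.73)])
# -> "takes [?0.73]metaformin"
```

Now the LLM has a chance to ask for clarification or flag the medication name, instead of reasoning confidently about a drug that doesn't exist.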

The User Adaptation Trap: Why Rolling Back an AI Model Can Break Things Twice

· 9 min read
Tian Pan
Software Engineer

You shipped a model update. It looked fine in offline evals. Then, two weeks later, you notice your power users are writing longer, more qualified prompts — hedging in ways they never used to. Your support queue fills with vague complaints like "the AI feels off." You dig in and realize the update introduced a subtle behavior shift: the model has been over-confirming user ideas, validating bad plans, and softening its pushback. You decide to roll back.

Here is where it gets worse. When you roll back, a new wave of complaints arrives. Users say the model feels cold, terse, unhelpful — the opposite of the first wave's complaints. What happened? The users who interacted with the broken version long enough built new workflows around it. They learned to drive harder, push back more, frame questions more aggressively. The rollback removed the behavior they had adapted to, leaving them stranded.

This is the user adaptation trap. A subtly wrong behavior, left in production long enough, gets baked into user habits. Rolling it back doesn't restore the status quo — it creates a second disruption on top of the first.

The Vanishing Blame Problem in AI Incident Post-Mortems

· 9 min read
Tian Pan
Software Engineer

When a deterministic system breaks, you find the bug. The stack trace points to a line. The diff shows the change. The fix is obvious in retrospect. An AI system does not work that way.

When an LLM-powered feature starts returning worse outputs, you are not looking for a bug. You are looking at a probability distribution that shifted, somewhere, across a stack of components that each introduce their own variance. Was it the model? A silent provider update on a Tuesday? The retrieval index that wasn't refreshed after the schema change? The system prompt someone edited to fix a different problem? The eval that stopped catching regressions three sprints ago?

The post-mortem becomes a blame auction. Everyone bids "the model changed" because it is an unfalsifiable claim that costs nothing to make.
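The cheapest way to make that claim falsifiable is to log a provenance fingerprint with every request: the pinned model ID, a hash of the system prompt, and the versions of the retrieval index and eval suite. A sketch with illustrative field names, not a schema from the post:

```python
import hashlib
import time

def provenance_record(model_id: str, system_prompt: str,
                      index_version: str, eval_version: str) -> dict:
    return {
        "ts": time.time(),
        "model_id": model_id,  # the pinned version string, never an alias
        "prompt_sha": hashlib.sha256(system_prompt.encode()).hexdigest()[:12],
        "index_version": index_version,   # retrieval index build tag
        "eval_version": eval_version,     # which regression suite was live
    }

# Attach this record to every request log. When quality shifts, diff the
# records from before and after: if nothing in them changed, "the model
# changed" finally becomes a testable claim about the provider.
```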

Who Owns AI Quality? The Cross-Functional Vacuum That Breaks Production Systems

· 10 min read
Tian Pan
Software Engineer

When Air Canada's support chatbot promised customers a discount fare for recently bereaved travelers, the policy it described didn't exist. A tribunal later ordered Air Canada to honor the hallucinated refund anyway. When a Chevrolet dealership chatbot negotiated away a 2024 Tahoe for $1, no mechanism stopped it. In both cases, the immediate question was about model quality. The real question — the one that matters operationally — was simpler: who was supposed to catch that?

The answer, in most organizations, is nobody specific. AI quality sits at the intersection of ML engineering, product management, data teams, and operations. Each function has a partial view. None claims full ownership. The result is a vacuum where things that should be caught aren't, and when something breaks, the postmortem produces a list of teams that each assumed someone else was responsible.

Agent Identity and Delegated Authorization: OAuth Patterns for Agentic Actions

· 10 min read
Tian Pan
Software Engineer

When an AI agent books a calendar event, sends an email, or submits a form, it isn't acting on its own identity — it's acting under delegated authority from a human who said "go do this." That distinction sounds philosophical until an agent leaks sensitive data, takes an irreversible action the user didn't intend, or gets compromised. At that point, the question isn't what happened but who authorized it, when, and can we revoke it.

The blast radius of poorly scoped agent credentials is larger than most teams realize. An agent authenticated with broad API access isn't one point of failure — it's a standing invitation. In 2025, agentic AI CVE counts jumped 255% year-over-year, and most incidents traced back to credentials that were too broad, too long-lived, or impossible to revoke cleanly. Building agents right means designing the authorization layer before you hit production.
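What "designing the authorization layer" tends to mean mechanically: exchange the user's grant for a token that carries one narrow scope, a short lifetime, and an agent identity for the audit trail. The sketch below is loosely shaped like OAuth 2.0 Token Exchange (RFC 8693), but the endpoint and several parameter names are placeholders, not any provider's actual API.

```python
import json
import urllib.parse
import urllib.request

TOKEN_URL = "https://auth.example.com/oauth/token"  # hypothetical endpoint

def agent_token(user_token: str, action_scope: str, ttl_s: int = 300) -> dict:
    # Exchange the user's grant for a credential that (1) carries only the
    # one scope this action needs, (2) expires in minutes, and (3) names
    # the acting agent so the audit trail can answer "who authorized it".
    body = urllib.parse.urlencode({
        "grant_type": "urn:ietf:params:oauth:grant-type:token-exchange",
        "subject_token": user_token,
        "scope": action_scope,              # e.g. "calendar.events.create"
        "requested_token_lifetime": ttl_s,  # placeholder parameter name
        "actor": "agent:booking-assistant", # placeholder parameter name
    }).encode()
    req = urllib.request.Request(TOKEN_URL, data=body, method="POST")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

The inverse design, one long-lived broad-scope key shared by every agent action, is exactly the standing invitation described above.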

Agentic Data Pipelines: Offline Enrichment and Classification at Scale

· 9 min read
Tian Pan
Software Engineer

You have a batch job that classifies 10 million customer support tickets overnight. You swap the regex classifier for an LLM and the accuracy jumps from 61% to 89%. Then you ship it and discover: the job now costs 40x more, runs 12x slower, silently skips 3% of records when the model returns unparseable output, and your downstream analytics team is filing bugs because the label schema drifted without anyone noticing.

Agentic data pipelines break in ways that ETL engineers haven't seen before, and the fixes require a different mental model than either traditional batch processing or real-time LLM serving.
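Two of those failure modes, silent skips and schema drift, share one fix: validate every output against a frozen label set and dead-letter anything off-schema instead of dropping it. A minimal sketch, where classify() is a placeholder for the LLM call and the labels and alert threshold are invented:

```python
ALLOWED_LABELS = frozenset({"billing", "bug", "feature_request", "other"})

def classify_batch(tickets: list[str], classify) -> tuple[list, list]:
    """classify(ticket_text) -> raw label string; hypothetical LLM call."""
    labeled, dead_letter = [], []
    for ticket in tickets:
        raw = classify(ticket)
        label = raw.strip().lower()
        if label in ALLOWED_LABELS:
            labeled.append((ticket, label))
        else:
            # Unparseable or off-schema output goes to a reviewable queue;
            # a record that vanishes is worse than one that fails loudly.
            dead_letter.append((ticket, raw))
    # Alert on failure-rate drift: a few percent silently skipped is
    # exactly the incident described above.
    if tickets and len(dead_letter) / len(tickets) > 0.01:
        print(f"WARN: {len(dead_letter)}/{len(tickets)} off-schema outputs")
    return labeled, dead_letter
```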