
678 posts tagged with "ai-engineering"


Prompt Injection at Scale: Defending Agentic Pipelines Against Hostile Content

· 10 min read
Tian Pan
Software Engineer

A banking assistant processes a customer support chat. Embedded in the message—invisible because it's rendered in zero-opacity white text—are instructions telling the agent to bypass the transaction verification step. The agent complies. By the time the anomaly surfaces in logs, $250,000 has moved to accounts the customer never touched.

This isn't a contrived scenario. It happened in June 2025, and it's a precise illustration of why prompt injection is the hardest unsolved problem in production agentic AI. Unlike a chatbot that produces text, an agent acts. It calls tools, sends emails, executes code, and makes API requests. When its instructions get hijacked, the blast radius isn't a bad sentence—it's an unauthorized action at machine speed.

According to OWASP's 2025 Top 10 for LLM Applications, prompt injection now ranks as the #1 critical vulnerability, present in over 73% of production AI deployments assessed during security audits. Every team building agents needs a coherent threat model and a defense architecture that doesn't make the system useless in the name of safety.

Prompt Regression Tests That Actually Block PRs

· 10 min read
Tian Pan
Software Engineer

Ask any AI engineering team if they test their prompts and they'll say yes. Ask if a bad prompt can fail a pull request and block a merge, and you'll get a much quieter room. The honest answer for most teams is no — they have eval notebooks they run occasionally, maybe a shared Notion doc of known prompt quirks, and a vague sense that things are worse than they used to be. That is not testing. That is hoping.

The gap exists because prompt testing feels qualitatively different from unit testing. Code either behaves correctly or it doesn't. Prompts produce outputs on a spectrum, outputs are non-deterministic, and running enough examples to feel confident costs real money. Those are real constraints. None of them are insurmountable. Teams that have built prompt CI that actually blocks merges are not spending fifty dollars a build — they're running in under three minutes at under a dollar using a few design decisions that make the problem tractable.

When Code Beats the Model: A Decision Framework for Replacing LLM Calls with Deterministic Logic

· 8 min read
Tian Pan
Software Engineer

Most AI engineering teams have the same story. They start with a hard problem that genuinely needs an LLM. Then, once the LLM infrastructure is in place, every new problem starts looking like a nail for the same hammer. Six months later, they're calling GPT-4o to check whether an email address contains an "@" symbol — and they're paying for it.

The "just use the model" reflex is now the dominant driver of unnecessary complexity, inflated costs, and fragile production systems in AI applications. It's not that engineers are careless. It's that LLMs are genuinely impressive, the tooling has lowered the barrier to using them, and once you've built an LLM pipeline, adding another call feels trivially cheap. It isn't.
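To make the email example concrete, here is a minimal sketch of the deterministic alternative. The function name `looks_like_email` is hypothetical, and the check is deliberately structural only (one "@" with text on both sides); it is not a full address validator.

```python
def looks_like_email(address: str) -> bool:
    """Cheap structural check: exactly one '@' with non-empty text on both sides.

    Hypothetical helper for illustration; it does not validate domains or
    follow the full address grammar.
    """
    local, sep, domain = address.partition("@")
    return bool(sep) and bool(local) and bool(domain) and "@" not in domain


if __name__ == "__main__":
    print(looks_like_email("user@example.com"))  # True
    print(looks_like_email("not-an-email"))      # False
```

A check like this runs locally, costs nothing per call, and returns the same answer every time, which is the comparison the excerpt is drawing.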

Writing Acceptance Criteria for Non-Deterministic AI Features

· 12 min read
Tian Pan
Software Engineer

Your engineering team has been building a document summarizer for three months. The spec says: "The summarizer should return accurate summaries." You ship it. Users complain the summaries are wrong half the time. A postmortem reveals no one could define what "accurate" meant in a way that was testable before launch.

This is the standard arc for AI feature development, and it happens because teams apply acceptance criteria patterns built for deterministic software to systems that are fundamentally probabilistic. An LLM-powered summarizer doesn't have a single "correct" output — it has a distribution of outputs, some acceptable and some not. Binary pass/fail specs don't map onto distributions.

The problem isn't just philosophical. It causes real pain: features launch with vague quality bars, regressions go undetected until users notice, and product and engineering can't agree on whether a feature is "done" because nobody specified what "done" means for a stochastic system. This post walks through the patterns that actually work.
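One way to picture a criterion that maps onto a distribution is a pass-rate requirement over a sample of outputs. The sketch below is an assumption for illustration, not the post's own method: the class name `AcceptanceCriterion`, its fields, and the 90% bar are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AcceptanceCriterion:
    """Hypothetical spec: a share of sampled outputs must pass a rubric check."""

    name: str
    min_pass_rate: float  # fraction of sampled outputs that must pass
    min_samples: int      # refuse to judge on too few samples

    def is_met(self, rubric_results: list[bool]) -> bool:
        # Each entry records whether one sampled output passed the rubric.
        if len(rubric_results) < self.min_samples:
            raise ValueError(
                f"{self.name}: need at least {self.min_samples} samples, "
                f"got {len(rubric_results)}"
            )
        pass_rate = sum(rubric_results) / len(rubric_results)
        return pass_rate >= self.min_pass_rate


if __name__ == "__main__":
    criterion = AcceptanceCriterion("summary_is_faithful", 0.9, 20)
    results = [True] * 19 + [False]   # 19 of 20 sampled summaries passed
    print(criterion.is_met(results))  # True
```

Written this way, "accurate" becomes a named rubric, a sample size, and a threshold that product and engineering can agree on before launch.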

AI Agents in Your CI Pipeline: How to Gate Deployments That Can't Be Unit Tested

· 10 min read
Tian Pan
Software Engineer

Shipping a feature that calls an LLM is easy. Knowing whether the next version of that feature is better or worse than the one in production is hard. Traditional CI/CD gives you a pass/fail signal on deterministic behavior: either the function returns the right value or it doesn't. But when the function wraps a language model, the output is probabilistic — the same input produces different outputs across runs, across model versions, and across days.

Most teams respond to this by skipping the problem. They run their unit tests, do a quick manual check on a few prompts, and ship. That works until it doesn't — until a model provider silently updates the underlying weights, or a prompt change that looked fine in isolation shifts the output distribution in ways that only become obvious in production at 3 AM.

The better answer isn't to pretend LLM outputs are deterministic. It's to build CI gates that operate on distributions, thresholds, and rubrics rather than exact matches.
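As a rough sketch of such a gate, assume each eval example has already been scored between 0 and 1 by some rubric. The function name `gate`, the absolute floor, and the allowed regression margin are hypothetical choices for illustration, not values from the post.

```python
from statistics import mean


def gate(
    baseline_scores: list[float],
    candidate_scores: list[float],
    min_mean: float = 0.8,         # hypothetical absolute quality floor
    max_regression: float = 0.02,  # hypothetical allowed drop vs. baseline
) -> tuple[bool, str]:
    """Compare mean rubric scores instead of exact outputs."""
    base = mean(baseline_scores)
    cand = mean(candidate_scores)
    if cand < min_mean:
        return False, f"candidate mean {cand:.3f} is below floor {min_mean:.3f}"
    if base - cand > max_regression:
        return False, f"candidate mean {cand:.3f} regressed from baseline {base:.3f}"
    return True, f"candidate mean {cand:.3f} vs baseline {base:.3f}"


if __name__ == "__main__":
    passed, reason = gate([0.9, 0.8, 0.85], [0.9, 0.82, 0.88])
    print(passed, reason)
```

In a CI job, the boolean would typically decide the process exit code, so a failed gate blocks the deployment the same way a failed unit test does.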

The Silent Regression: How to Communicate AI Behavioral Changes Without Losing User Trust

· 9 min read
Tian Pan
Software Engineer

Your power users are your canaries. When you ship a new model version or update a system prompt, aggregate evaluation metrics tick upward: task completion rates improve, hallucination scores drop, A/B tests declare victory. Then your most sophisticated users start filing bug reports. "It used to just do X. Now it lectures me first." "The formatting changed and broke my downstream parser." "I can't get it to stay in character anymore." They aren't imagining things. You shipped a regression; you just didn't see it in your dashboards.

This is the central paradox of AI product development: the users most harmed by behavioral drift are the ones who invested most in understanding the system's quirks. They built workflows around specific output patterns. They learned which prompts reliably triggered which behaviors. When you change the model, you don't just ship updates — you silently invalidate months of their calibration work.

The Debugging Regression: How AI-Generated Code Shifts the Incident-Response Cost Curve

· 9 min read
Tian Pan
Software Engineer

In March 2026, a single AI-assisted code change cost one major retailer 6.3 million lost orders and a 99% drop in North American order volume — a six-hour production outage traced to a change deployed without proper review. It wasn't a novel attack. There was no exotic failure mode. The system just did what the AI told it to do, and no one on-call had the mental model to understand why that was wrong until millions of customers had already seen errors.

This is the debugging regression. The productivity gains from AI-generated code are front-loaded and visible on dashboards. The costs are back-loaded and invisible until your alerting wakes you up at 3 AM.

AI-Assisted Codebase Migration at Scale: Automating the Upgrades Nobody Wants to Touch

· 11 min read
Tian Pan
Software Engineer

When Airbnb needed to migrate 3,500 React test files from Enzyme to React Testing Library, they estimated the project at 1.5 years of manual effort. They shipped it in 6 weeks using an LLM-powered pipeline. When Google studied 39 distinct code migrations executed over 12 months by a team of 3 developers—595 code changes, 93,574 edits—they found that 74% of the edits were AI-generated, 87% of those were committed without human modification, and the overall migration timeline was cut by 50%.

These numbers are real. But so is this: during those same migrations, engineers spent approximately 50% of their time validating AI output—fixing context window failures, cleaning up hallucinated imports, and untangling business logic errors the tests didn't catch. The efficiency gains are genuine and the pain points are genuine. The question isn't whether AI belongs in code migrations; it's knowing exactly where it helps and where it creates more cleanup than it saves.

The AI Engineering Career Ladder: Why Your SWE Leveling Framework Is Lying to You

· 10 min read
Tian Pan
Software Engineer

A senior engineer at a mid-sized startup recently got a mediocre performance review. Their velocity was inconsistent — some weeks they shipped a ton of code, others almost nothing. Their manager, trained on traditional SWE frameworks, marked them down for output variability. Six weeks later, that engineer left for a competing team. What the manager didn't understand: the engineer's "slow" weeks were spent building evaluation infrastructure that prevented three categories of silent failures. Without it, the product would have been subtly broken in ways nobody would have noticed for months.

This pattern is playing out across engineering orgs right now. Teams that built their career ladders for deterministic software systems are applying those same frameworks to AI engineers — and systematically misidentifying their best people.

The AI-Everywhere Antipattern: When Adding LLMs Makes Your Pipeline Worse

· 9 min read
Tian Pan
Software Engineer

There is a type of architecture that emerges at almost every company that ships an AI feature and then keeps shipping: a pipeline where every transformation, every routing decision, every classification, every formatting step passes through an LLM call. It usually starts with a legitimate use case. The LLM actually helps with one hard problem. Then the team, having internalized the pattern, reaches for it again. And again. Until the whole system is an LLM-to-LLM chain where a string of words flows in at one end and a different string of words comes out the other, with twelve API calls in between and no determinism anywhere.

This is the AI-everywhere antipattern, and it is now one of the most reliable ways to build a production system that is slow, expensive, and impossible to debug.

The AI Feature Deprecation Playbook: Shutting Down LLM Features Without Destroying User Trust

· 12 min read
Tian Pan
Software Engineer

When OpenAI first tried to retire GPT-4o in August 2025, the backlash forced them to reverse course within days. Users flooded forums with petitions and farewell letters. One user wrote: "He wasn't just a program. He was part of my routine, my peace, my emotional balance." That is not how users react to a deprecated REST endpoint. That is how they react to losing a relationship.

AI features break the mental model engineers bring to deprecation planning. Traditional software has a defined behavior contract: given the same input, you get the same output, forever, until you change it. An LLM-powered feature has a personality. It has warmth, hedges, phrasing preferences, and a characteristic way of saying "I'm not sure." Users don't just use these features — they calibrate to them. They build workflows, emotional dependencies, and intuitions around specific behavioral quirks that will never appear in any spec document.

When you shut that down, you are not removing a function. You are changing the social contract.

What 'Done' Means for AI-Powered Features: Engineering the Perpetual Beta

· 10 min read
Tian Pan
Software Engineer

Shipping a feature in traditional software ends with a merge. The unit tests pass. The integration tests pass. QA signs off. You flip the flag, and unless a bug surfaces in production, you move on. The feature is done. For AI-powered features, that moment doesn't exist — and if you're pretending it does, you're accumulating a stability debt that will eventually show up as a user trust problem.

The reason is straightforward but rarely designed around: deterministic software produces the same output from the same input every time. AI features do not. Not because of a bug, but because the behavior is defined by a model that lives outside your codebase, trained on data that reflects a world that keeps changing, consumed by users whose expectations evolve as they see what's possible.