Skip to main content

702 posts tagged with "llm"

View all tags

Prompt Injection at Scale: Defending Agentic Pipelines Against Hostile Content

· 10 min read
Tian Pan
Software Engineer

A banking assistant processes a customer support chat. Embedded in the message—invisible because it's rendered in zero-opacity white text—are instructions telling the agent to bypass the transaction verification step. The agent complies. By the time the anomaly surfaces in logs, $250,000 has moved to accounts the customer never touched.

This isn't a contrived scenario. It happened in June 2025, and it's a precise illustration of why prompt injection is the hardest unsolved problem in production agentic AI. Unlike a chatbot that produces text, an agent acts. It calls tools, sends emails, executes code, and makes API requests. When its instructions get hijacked, the blast radius isn't a bad sentence—it's an unauthorized action at machine speed.

According to OWASP's 2025 Top 10 for LLM Applications, prompt injection now ranks as the #1 critical vulnerability, present in over 73% of production AI deployments assessed during security audits. Every team building agents needs a coherent threat model and a defense architecture that doesn't make the system useless in the name of safety.

Prompt Regression Tests That Actually Block PRs

· 10 min read
Tian Pan
Software Engineer

Ask any AI engineering team if they test their prompts and they'll say yes. Ask if a bad prompt can fail a pull request and block a merge, and you'll get a much quieter room. The honest answer for most teams is no — they have eval notebooks they run occasionally, maybe a shared Notion doc of known prompt quirks, and a vague sense that things are worse than they used to be. That is not testing. That is hoping.

The gap exists because prompt testing feels qualitatively different from unit testing. Code either behaves correctly or it doesn't. Prompts produce outputs on a spectrum, outputs are non-deterministic, and running enough examples to feel confident costs real money. Those are real constraints. None of them are insurmountable. Teams that have built prompt CI that actually blocks merges are not spending fifty dollars a build — they're running in under three minutes at under a dollar using a few design decisions that make the problem tractable.

Sampling Parameters in Production: The Tuning Decisions Nobody Explains

· 11 min read
Tian Pan
Software Engineer

Most engineers treat LLM quality regressions as a prompt engineering problem or a model capability problem. They rewrite system prompts, try a newer model, or add few-shot examples. They rarely check the three numbers sitting silently at the top of every API call: temperature, top-p, and top-k. But those defaults are shape-shifting every response your model produces, and tuning them wrong causes output variance that teams blame on the model for months before realizing the culprit was a configuration value they never touched.

This isn't an introductory explainer. If you're running LLMs in production—for extraction pipelines, code generation, summarization, or any output that feeds into real systems—these are the mechanics and tradeoffs you need to understand before you can tune intelligently.

Structured Outputs Are Not a Solved Problem: JSON Mode Failure Modes in Production

· 12 min read
Tian Pan
Software Engineer

You flip on JSON mode, your LLM starts returning valid JSON, and you ship it. Three weeks later, production is quietly broken. The JSON is syntactically valid. The schema is technically satisfied. But a field contains a hallucinated entity, a finish_reason of "length" silently truncated the payload at 95%, or the model classified "positive" sentiment for text that any human would read as scathing — and your downstream pipeline consumed it without complaint.

JSON mode is a solved problem in the same way that "use a mutex" is a solved problem for concurrency. The primitive exists. The failure modes are not where you put the lock.

When Code Beats the Model: A Decision Framework for Replacing LLM Calls with Deterministic Logic

· 8 min read
Tian Pan
Software Engineer

Most AI engineering teams have the same story. They start with a hard problem that genuinely needs an LLM. Then, once the LLM infrastructure is in place, every new problem starts looking like a nail for the same hammer. Six months later, they're calling GPT-4o to check whether an email address contains an "@" symbol — and they're paying for it.

The "just use the model" reflex is now the dominant driver of unnecessary complexity, inflated costs, and fragile production systems in AI applications. It's not that engineers are careless. It's that LLMs are genuinely impressive, the tooling has lowered the barrier to using them, and once you've built an LLM pipeline, adding another call feels trivially cheap. It isn't.

Writing Acceptance Criteria for Non-Deterministic AI Features

· 12 min read
Tian Pan
Software Engineer

Your engineering team has been building a document summarizer for three months. The spec says: "The summarizer should return accurate summaries." You ship it. Users complain the summaries are wrong half the time. A postmortem reveals no one could define what "accurate" meant in a way that was testable before launch.

This is the standard arc for AI feature development, and it happens because teams apply acceptance criteria patterns built for deterministic software to systems that are fundamentally probabilistic. An LLM-powered summarizer doesn't have a single "correct" output — it has a distribution of outputs, some acceptable and some not. Binary pass/fail specs don't map onto distributions.

The problem isn't just philosophical. It causes real pain: features launch with vague quality bars, regressions go undetected until users notice, and product and engineering can't agree on whether a feature is "done" because nobody specified what "done" means for a stochastic system. This post walks through the patterns that actually work.

Tracing the Planning Layer: Why Your Agent Traces Are Missing Half the Story

· 11 min read
Tian Pan
Software Engineer

Your agent called the wrong tool three times before finally succeeding, and your trace dashboard shows you exactly which tools were called, in what order, with full latency breakdowns. What the trace doesn't show you is the part that matters: why the agent thought those tool calls were the right move, what goal it was trying to satisfy, and what assumption it was operating under when it made each wrong decision.

This is the gap at the center of agent observability in 2026. Practitioners have invested heavily in tool-call tracing. The tooling is mature, the OpenTelemetry semantic conventions are established, and the dashboards are beautiful. But agent debugging keeps running into the same wall: you have complete visibility into what the agent did, and zero visibility into why.

AI Agents in Your CI Pipeline: How to Gate Deployments That Can't Be Unit Tested

· 10 min read
Tian Pan
Software Engineer

Shipping a feature that calls an LLM is easy. Knowing whether the next version of that feature is better or worse than the one in production is hard. Traditional CI/CD gives you a pass/fail signal on deterministic behavior: either the function returns the right value or it doesn't. But when the function wraps a language model, the output is probabilistic — the same input produces different outputs across runs, across model versions, and across days.

Most teams respond to this by skipping the problem. They run their unit tests, do a quick manual check on a few prompts, and ship. That works until it doesn't — until a model provider silently updates the underlying weights, or a prompt change that looked fine in isolation shifts the output distribution in ways that only become obvious in production at 3 AM.

The better answer isn't to pretend LLM outputs are deterministic. It's to build CI gates that operate on distributions, thresholds, and rubrics rather than exact matches.

The AI Engineering Career Ladder: Why Your SWE Leveling Framework Is Lying to You

· 10 min read
Tian Pan
Software Engineer

A senior engineer at a mid-sized startup recently got a mediocre performance review. Their velocity was inconsistent — some weeks they shipped a ton of code, others almost nothing. Their manager, trained on traditional SWE frameworks, marked them down for output variability. Six weeks later, that engineer left for a competing team. What the manager didn't understand: the engineer's "slow" weeks were spent building evaluation infrastructure that prevented three categories of silent failures. Without it, the product would have been subtly broken in ways nobody would have noticed for months.

This pattern is playing out across engineering orgs right now. Teams that built their career ladders for deterministic software systems are applying those same frameworks to AI engineers — and systematically misidentifying their best people.

The AI-Everywhere Antipattern: When Adding LLMs Makes Your Pipeline Worse

· 9 min read
Tian Pan
Software Engineer

There is a type of architecture that emerges at almost every company that ships an AI feature and then keeps shipping: a pipeline where every transformation, every routing decision, every classification, every formatting step passes through an LLM call. It usually starts with a legitimate use case. The LLM actually helps with one hard problem. Then the team, having internalized the pattern, reaches for it again. And again. Until the whole system is an LLM-to-LLM chain where a string of words flows in at one end and a different string of words comes out the other, with twelve API calls in between and no determinism anywhere.

This is the AI-everywhere antipattern, and it is now one of the most reliable ways to build a production system that is slow, expensive, and impossible to debug.

1% Error Rate, 10 Million Users: The Math of AI Failures at Scale

· 11 min read
Tian Pan
Software Engineer

A large language model deployed to a medical transcription service achieves 99% accuracy. The team ships it with confidence. Six months later, a study finds that 1% of its transcribed samples contain fabricated phrases not present in the original audio — invented drug names, nonexistent procedures, occasional violent or disturbing content inserted mid-sentence. With 30,000 medical professionals using the system, that 1% translates to tens of thousands of contaminated records per month, some carrying patient safety consequences.

The accuracy number never changed. The problem was always there. The team just hadn't done the scale math.

The AI Feature Deprecation Playbook: Shutting Down LLM Features Without Destroying User Trust

· 12 min read
Tian Pan
Software Engineer

When OpenAI first tried to retire GPT-4o in August 2025, the backlash forced them to reverse course within days. Users flooded forums with petitions and farewell letters. One user wrote: "He wasn't just a program. He was part of my routine, my peace, my emotional balance." That is not how users react to a deprecated REST endpoint. That is how they react to losing a relationship.

AI features break the mental model engineers bring to deprecation planning. Traditional software has a defined behavior contract: given the same input, you get the same output, forever, until you change it. An LLM-powered feature has a personality. It has warmth, hedges, phrasing preferences, and a characteristic way of saying "I'm not sure." Users don't just use these features — they calibrate to them. They build workflows, emotional dependencies, and intuitions around specific behavioral quirks that will never appear in any spec document.

When you shut that down, you are not removing a function. You are changing the social contract.