567 posts tagged with "llm"

Structured Outputs Are Not a Solved Problem: JSON Mode Failure Modes in Production

· 12 min read
Tian Pan
Software Engineer

You flip on JSON mode, your LLM starts returning valid JSON, and you ship it. Three weeks later, production is quietly broken. The JSON is syntactically valid. The schema is technically satisfied. But a field contains a hallucinated entity, a finish_reason of "length" silently truncated the payload at 95%, or the model classified "positive" sentiment for text that any human would read as scathing — and your downstream pipeline consumed it without complaint.

JSON mode is a solved problem in the same way that "use a mutex" is a solved problem for concurrency. The primitive exists. The failure modes are not where you put the lock.
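A minimal sketch of the post-parse checks this failure mode calls for, assuming an OpenAI-style chat completion response and a hypothetical `entities` extraction schema; the source-grounding check is illustrative, not a complete hallucination detector:

```python
import json

def validate_extraction(response: dict, source_text: str) -> list[str]:
    """Checks that go beyond 'the JSON parsed'. Assumes an OpenAI-style
    chat completion dict; the `entities` schema and grounding check are
    illustrative."""
    problems = []
    choice = response["choices"][0]

    # Truncation: a payload clipped at the token limit can still parse as
    # valid JSON. finish_reason is the only place this shows up.
    if choice.get("finish_reason") == "length":
        problems.append("truncated: finish_reason == 'length'")

    try:
        payload = json.loads(choice["message"]["content"])
    except json.JSONDecodeError as exc:
        return problems + [f"invalid JSON: {exc}"]

    # Schema: required fields present with the expected types.
    if not isinstance(payload.get("entities"), list):
        problems.append("schema: 'entities' missing or not a list")

    # Semantics: every extracted entity should appear in the source text,
    # otherwise the model hallucinated it into a perfectly valid payload.
    for entity in payload.get("entities", []):
        if isinstance(entity, str) and entity.lower() not in source_text.lower():
            problems.append(f"hallucinated entity: {entity!r}")

    return problems
```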

When Code Beats the Model: A Decision Framework for Replacing LLM Calls with Deterministic Logic

· 8 min read
Tian Pan
Software Engineer

Most AI engineering teams have the same story. They start with a hard problem that genuinely needs an LLM. Then, once the LLM infrastructure is in place, every new problem starts looking like a nail for the same hammer. Six months later, they're calling GPT-4o to check whether an email address contains an "@" symbol — and they're paying for it.

The "just use the model" reflex is now the dominant driver of unnecessary complexity, inflated costs, and fragile production systems in AI applications. It's not that engineers are careless. It's that LLMs are genuinely impressive, the tooling has lowered the barrier to using them, and once you've built an LLM pipeline, adding another call feels trivially cheap. It isn't.

Writing Acceptance Criteria for Non-Deterministic AI Features

· 12 min read
Tian Pan
Software Engineer

Your engineering team has been building a document summarizer for three months. The spec says: "The summarizer should return accurate summaries." You ship it. Users complain the summaries are wrong half the time. A postmortem reveals no one could define what "accurate" meant in a way that was testable before launch.

This is the standard arc for AI feature development, and it happens because teams apply acceptance criteria patterns built for deterministic software to systems that are fundamentally probabilistic. An LLM-powered summarizer doesn't have a single "correct" output — it has a distribution of outputs, some acceptable and some not. Binary pass/fail specs don't map onto distributions.

The problem isn't just philosophical. It causes real pain: features launch with vague quality bars, regressions go undetected until users notice, and product and engineering can't agree on whether a feature is "done" because nobody specified what "done" means for a stochastic system. This post walks through the patterns that actually work.
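One of those patterns can be sketched in a few lines: define "done" as a pass rate over sampled outputs rather than a single pass/fail run. The `generate` and `grade` callables below are placeholders for your own summarizer and rubric grader:

```python
from typing import Callable

def acceptance_pass_rate(
    generate: Callable[[str], str],     # e.g. your summarizer; placeholder
    grade: Callable[[str, str], bool],  # rubric grader: (output, input) -> pass/fail
    document: str,
    n_samples: int = 20,
) -> float:
    """Sample the model n times and return the fraction of outputs that pass
    the rubric. The acceptance criterion is a threshold on this number
    (e.g. >= 0.9), not a single pass/fail assertion."""
    passes = sum(grade(generate(document), document) for _ in range(n_samples))
    return passes / n_samples

# Example criterion: ship only if at least 90% of sampled summaries pass.
# assert acceptance_pass_rate(summarize, meets_rubric, doc) >= 0.90
```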

Tracing the Planning Layer: Why Your Agent Traces Are Missing Half the Story

· 11 min read
Tian Pan
Software Engineer

Your agent called the wrong tool three times before finally succeeding, and your trace dashboard shows you exactly which tools were called, in what order, with full latency breakdowns. What the trace doesn't show you is the part that matters: why the agent thought those tool calls were the right move, what goal it was trying to satisfy, and what assumption it was operating under when it made each wrong decision.

This is the gap at the center of agent observability in 2026. Practitioners have invested heavily in tool-call tracing. The tooling is mature, the OpenTelemetry semantic conventions are established, and the dashboards are beautiful. But agent debugging keeps running into the same wall: you have complete visibility into what the agent did, and zero visibility into why.
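A hedged sketch of what closing that gap might look like with plain OpenTelemetry: give the planning step its own span and attach the goal, assumption, and rationale as attributes. The attribute names are illustrative, not part of any published semantic convention:

```python
from opentelemetry import trace

tracer = trace.get_tracer("agent.planner")

def plan_next_step(goal: str, observations: list[str], llm_plan) -> dict:
    """Record *why* alongside *what*: wrap the planning call in a span and
    attach the goal, the operating assumption, and the chosen action.
    `llm_plan` is a placeholder for your own planning call."""
    with tracer.start_as_current_span("agent.plan") as span:
        decision = llm_plan(goal, observations)
        span.set_attribute("agent.plan.goal", goal)
        span.set_attribute("agent.plan.assumption", decision.get("assumption", ""))
        span.set_attribute("agent.plan.chosen_tool", decision.get("tool", ""))
        span.set_attribute("agent.plan.rationale", decision.get("rationale", ""))
        return decision
```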

AI Agents in Your CI Pipeline: How to Gate Deployments That Can't Be Unit Tested

· 10 min read
Tian Pan
Software Engineer

Shipping a feature that calls an LLM is easy. Knowing whether the next version of that feature is better or worse than the one in production is hard. Traditional CI/CD gives you a pass/fail signal on deterministic behavior: either the function returns the right value or it doesn't. But when the function wraps a language model, the output is probabilistic — the same input produces different outputs across runs, across model versions, and across days.

Most teams respond to this by skipping the problem. They run their unit tests, do a quick manual check on a few prompts, and ship. That works until it doesn't — until a model provider silently updates the underlying weights, or a prompt change that looked fine in isolation shifts the output distribution in ways that only become obvious in production at 3 AM.

The better answer isn't to pretend LLM outputs are deterministic. It's to build CI gates that operate on distributions, thresholds, and rubrics rather than exact matches.
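A minimal sketch of such a gate, assuming an eval harness that writes per-example scores to JSON; the file names, score scale, and regression budget are placeholders:

```python
import json
import sys

REGRESSION_BUDGET = 0.02   # tolerate at most a 2-point (per 1.0) drop vs. baseline
MIN_ABSOLUTE_SCORE = 0.80  # and never ship below this floor

def mean(scores):
    return sum(scores) / len(scores)

def main() -> int:
    baseline = json.load(open("eval_baseline.json"))["scores"]
    candidate = json.load(open("eval_candidate.json"))["scores"]

    b, c = mean(baseline), mean(candidate)
    print(f"baseline={b:.3f} candidate={c:.3f}")

    # Gate on the distribution, not on exact output matches.
    if c < MIN_ABSOLUTE_SCORE or c < b - REGRESSION_BUDGET:
        print("FAIL: candidate regresses beyond budget")
        return 1
    print("PASS")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```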

The AI Engineering Career Ladder: Why Your SWE Leveling Framework Is Lying to You

· 10 min read
Tian Pan
Software Engineer

A senior engineer at a mid-sized startup recently got a mediocre performance review. Their velocity was inconsistent — some weeks they shipped a ton of code, others almost nothing. Their manager, trained on traditional SWE frameworks, marked them down for output variability. Six weeks later, that engineer left for a competing team. What the manager didn't understand: the engineer's "slow" weeks were spent building evaluation infrastructure that prevented three categories of silent failures. Without it, the product would have been subtly broken in ways nobody would have noticed for months.

This pattern is playing out across engineering orgs right now. Teams that built their career ladders for deterministic software systems are applying those same frameworks to AI engineers — and systematically misidentifying their best people.

The AI-Everywhere Antipattern: When Adding LLMs Makes Your Pipeline Worse

· 9 min read
Tian Pan
Software Engineer

There is a type of architecture that emerges at almost every company that ships an AI feature and then keeps shipping: a pipeline where every transformation, every routing decision, every classification, every formatting step passes through an LLM call. It usually starts with a legitimate use case. The LLM actually helps with one hard problem. Then the team, having internalized the pattern, reaches for it again. And again. Until the whole system is an LLM-to-LLM chain where a string of words flows in at one end and a different string of words comes out the other, with twelve API calls in between and no determinism anywhere.

This is the AI-everywhere antipattern, and it is now one of the most reliable ways to build a production system that is slow, expensive, and impossible to debug.

1% Error Rate, 10 Million Users: The Math of AI Failures at Scale

· 11 min read
Tian Pan
Software Engineer

A large language model deployed to a medical transcription service achieves 99% accuracy. The team ships it with confidence. Six months later, a study finds that 1% of its transcribed samples contain fabricated phrases not present in the original audio — invented drug names, nonexistent procedures, occasional violent or disturbing content inserted mid-sentence. With 30,000 medical professionals using the system, that 1% translates to tens of thousands of contaminated records per month, some carrying patient safety consequences.

The accuracy number never changed. The problem was always there. The team just hadn't done the scale math.
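The scale math itself is a few lines. The 1% rate and 30,000 users come from the example above; the per-user volume is an assumed figure, there only to make the order of magnitude concrete:

```python
error_rate = 0.01             # 1% of outputs contain a fabrication
users = 30_000                # medical professionals on the system
records_per_user_month = 200  # assumed workload; plug in your own numbers

records_per_month = users * records_per_user_month
contaminated = records_per_month * error_rate
print(f"{records_per_month:,} records/month -> {contaminated:,.0f} contaminated")
# 6,000,000 records/month -> 60,000 contaminated, at 99% accuracy
```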

The AI Feature Deprecation Playbook: Shutting Down LLM Features Without Destroying User Trust

· 12 min read
Tian Pan
Software Engineer

When OpenAI first tried to retire GPT-4o in August 2025, the backlash forced them to reverse course within days. Users flooded forums with petitions and farewell letters. One user wrote: "He wasn't just a program. He was part of my routine, my peace, my emotional balance." That is not how users react to a deprecated REST endpoint. That is how they react to losing a relationship.

AI features break the mental model engineers bring to deprecation planning. Traditional software has a defined behavior contract: given the same input, you get the same output, forever, until you change it. An LLM-powered feature has a personality. It has warmth, hedges, phrasing preferences, and a characteristic way of saying "I'm not sure." Users don't just use these features — they calibrate to them. They build workflows, emotional dependencies, and intuitions around specific behavioral quirks that will never appear in any spec document.

When you shut that down, you are not removing a function. You are changing the social contract.

AI Oncall: What to Page On When Your System Thinks

· 11 min read
Tian Pan
Software Engineer

A team running a multi-agent market research pipeline spent eleven days watching their system run normally — green dashboards, zero errors, normal latency — while four LangChain agents looped against each other in an infinite cycle. By the time someone glanced at the billing dashboard, the week's projected cost of $127 had become $47,000. The agents had never crashed. The API never returned an error. Every infrastructure alert stayed silent.

This is the defining problem of AI oncall: your system can be operationally green while failing catastrophically at the thing it's supposed to do. Traditional monitoring was built to detect crashes, latency spikes, and error rates. AI systems can hit all their infrastructure SLOs while silently producing wrong outputs, looping on a task indefinitely, or spending thousands of dollars on computation that produces nothing useful. The absence of errors is not evidence of correctness.
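A rough sketch of the semantic signals worth paging on here: spend rate, step budgets, and repeated identical tool calls. The thresholds and field names are assumptions, not recommendations:

```python
from dataclasses import dataclass

@dataclass
class AgentRunStats:
    projected_weekly_spend_usd: float
    weekly_budget_usd: float
    steps_in_current_task: int
    repeated_tool_call_signatures: int  # identical (tool, args) pairs in one task

def should_page(stats: AgentRunStats) -> list[str]:
    """Semantic alerts that fire while every infrastructure dashboard stays green."""
    pages = []
    if stats.projected_weekly_spend_usd > 3 * stats.weekly_budget_usd:
        pages.append("spend: projected weekly cost >3x budget")
    if stats.steps_in_current_task > 50:
        pages.append("progress: task exceeded step budget without finishing")
    if stats.repeated_tool_call_signatures > 5:
        pages.append("loop: same tool called with identical args >5 times")
    return pages
```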

The AI Product Metrics Trap: When Engagement Looks Like Value but Isn't

· 11 min read
Tian Pan
Software Engineer

A METR study published in 2025 asked 16 experienced open-source developers to predict how much faster AI tools would make them. They guessed 24% faster. The study then measured what actually happened across 246 real tasks — bug fixes, features, refactors — randomly assigned to AI-allowed and AI-disallowed conditions. The result: developers with AI access were 19% slower. After the study concluded, participants were surveyed again. They still believed AI had made them 20% faster.

That gap — between perceived productivity and measured productivity — is not a quirk of one study. It is the central problem with how most teams currently measure AI features. The signals that feel like success are, in many cases, measuring the novelty of the tool rather than its usefulness. And the first 30 days are the worst time to look.

AI for SRE Log Analysis: The Tiered Architecture That Actually Works

· 9 min read
Tian Pan
Software Engineer

When teams first wire an LLM into their log pipeline, the demo is impressive. You paste a stack trace, and GPT-4 explains the root cause in plain English. So the natural next step is obvious: automate it. Send all your logs through the model and let it find the problems.

This is how you burn $125,000 a day and page your on-call engineers with hallucinations.

The math is simple and brutal. A mid-size production system generates around one billion log lines per day. At roughly 50 tokens per log entry, that's 50 billion tokens daily. Even at GPT-4o's discounted rate of $2.50 per million input tokens, you're looking at $125,000 per day before accounting for output costs, retries, or inference overhead. Real-time frontier model analysis of streaming logs is not an optimization problem — it's the wrong architecture.
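The arithmetic above in runnable form, plus the lever a tiered design pulls: only an escalated fraction of logs ever reaches the frontier model. The 0.1% escalation rate is an illustrative assumption, not a benchmark:

```python
log_lines_per_day = 1_000_000_000
tokens_per_line = 50
price_per_million_input_tokens = 2.50  # GPT-4o input pricing cited above

tokens_per_day = log_lines_per_day * tokens_per_line
naive_cost = tokens_per_day / 1_000_000 * price_per_million_input_tokens
print(f"send everything: ${naive_cost:,.0f}/day")           # $125,000/day

escalation_rate = 0.001  # assumed: cheap filters pass 0.1% of lines upward
tiered_cost = naive_cost * escalation_rate
print(f"tiered (0.1% escalated): ${tiered_cost:,.0f}/day")   # $125/day
```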