
200 posts tagged with "ai-engineering"


LLM Routing: How to Stop Paying Frontier Model Prices for Simple Queries

11 min read
Tian Pan
Software Engineer

Most teams reach the same inflection point: LLM API costs are scaling faster than usage, and every query — whether "summarize this sentence" or "audit this 2,000-line codebase for security vulnerabilities" — hits the same expensive model. The fix isn't squeezing prompts. It's routing.

LLM routing means directing each request to the most appropriate model for that specific task. Not the most capable model. The right model — balancing cost, latency, and quality for what the query actually demands. Done well, routing cuts LLM costs by 50–85% with minimal quality degradation. Done poorly, it creates silent quality regressions you won't detect until users churn.
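As a rough illustration (not taken from the post), a router can start as nothing more than a complexity score gating two model tiers. The model names, keywords, and threshold below are placeholders; a production router would replace the heuristic with a trained classifier or a small LLM:

```python
# Minimal routing sketch: pick a model tier from cheap heuristics.
# Tier names and thresholds are illustrative, not recommendations.

CHEAP_MODEL = "small-fast-model"        # hypothetical tier names
FRONTIER_MODEL = "large-frontier-model"

def estimate_complexity(query: str) -> float:
    """Crude proxy: long queries and audit/code keywords score higher."""
    score = min(len(query) / 2000, 1.0)
    for kw in ("audit", "refactor", "prove", "security", "architecture"):
        if kw in query.lower():
            score += 0.3
    return min(score, 1.0)

def route(query: str, threshold: float = 0.5) -> str:
    """Send simple queries to the cheap tier, hard ones to the frontier tier."""
    return FRONTIER_MODEL if estimate_complexity(query) >= threshold else CHEAP_MODEL
```

The point is the shape of the decision, not the heuristic: the routing boundary is where the cost savings and the silent quality regressions both live.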

This post covers the mechanics, the tradeoffs, and what actually breaks in production.

Token Budget Strategies for Production LLM Applications

10 min read
Tian Pan
Software Engineer

Most teams discover their context management problem the same way: a production agent that worked fine in demos starts hallucinating after 15 conversation turns. The logs show valid JSON, the API returned 200, and nobody changed the code. What changed was the accumulation — tool results, retrieved documents, and conversation history quietly filled the context window until the model was reasoning over 80,000 tokens of mixed-relevance content.

Context overflow is the obvious failure mode, but "context rot" is the insidious one. Research shows that LLM performance degrades before you hit the limit. As context grows, models exhibit a lost-in-the-middle effect: attention concentrates at the beginning and end of the input while content in the middle becomes unreliable. Instructions buried at turn 12 of a 30-turn conversation may effectively disappear. The model doesn't error out — it just quietly ignores them.
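One simple mitigation the lost-in-the-middle effect suggests (sketched here as an illustration, with a word count standing in for a real tokenizer) is a budget trimmer that protects the head of the conversation and the most recent turns, dropping the unreliable middle first:

```python
# Sketch of a token-budget trimmer: keep the task framing at the head
# and the most recent turns at the tail, dropping the middle first.
# Word count is a crude stand-in for the model's actual tokenizer.

def count_tokens(text: str) -> int:
    return len(text.split())  # replace with a real tokenizer in practice

def trim_history(turns: list[str], budget: int, keep_head: int = 1) -> list[str]:
    """Always keep the first `keep_head` turns, then add turns from the
    end of the conversation until the token budget is exhausted."""
    head = turns[:keep_head]
    used = sum(count_tokens(t) for t in head)
    tail: list[str] = []
    for turn in reversed(turns[keep_head:]):
        cost = count_tokens(turn)
        if used + cost > budget:
            break
        tail.insert(0, turn)
        used += cost
    return head + tail
```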

Load Testing LLM Applications: Why k6 and Locust Lie to You

11 min read
Tian Pan
Software Engineer

You ran your load test. k6 reported 200ms average latency, 99th percentile under 800ms, zero errors at 50 concurrent users. You shipped to production. Within a week, users were reporting 8-second hangs, dropped connections, and token budget exhaustion mid-stream. What happened?

The test passed because you measured the wrong things. Conventional load testing tools were designed for stateless HTTP endpoints that return a complete response in milliseconds. LLM APIs behave like nothing those tools were built to model: they stream tokens over seconds, charge by the token rather than the request, saturate GPU memory rather than CPU threads, and respond completely differently depending on whether a cache is warm. A k6 script that hammer-tests your /chat/completions endpoint will produce numbers that look like performance data but contain almost no signal about what production actually looks like.
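To make the difference concrete, here is a minimal sketch (not from the post) of the two metrics a request-level timer misses for streaming endpoints: time-to-first-token and tokens per second. `fake_stream` is a stand-in for a real streaming client:

```python
# Per-stream metrics a k6-style request timer conflates into one number.
import time

def fake_stream(n_tokens: int, first_delay: float, per_token: float):
    """Simulates an LLM stream: queueing + prefill, then per-token decode."""
    time.sleep(first_delay)
    for _ in range(n_tokens):
        time.sleep(per_token)
        yield "tok"

def measure(stream) -> dict:
    """Consume a token stream and report TTFT and throughput separately."""
    start = time.monotonic()
    ttft = None
    n = 0
    for _ in stream:
        if ttft is None:
            ttft = time.monotonic() - start  # time to first token
        n += 1
    total = time.monotonic() - start
    return {"ttft_s": ttft, "tokens": n,
            "tokens_per_s": n / total if total else 0.0}
```

A single "request latency" number averages these together, which is exactly how an 8-second hang hides inside a passing test.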

The Eval-to-Production Gap: Why 92% on Your Test Suite Means 40% User Satisfaction

10 min read
Tian Pan
Software Engineer

Your team spent three weeks building a rigorous eval suite. It covers edge cases. It includes adversarial examples. The LLM-as-judge scores 92% across all dimensions. You ship.

Then the support tickets start. Users say the AI "doesn't understand what they're asking." Session abandonment is up 30%. Satisfaction scores come back at 41%.

This gap — between eval performance and real-world outcomes — is the most common failure mode in production AI systems today. It's not a model problem. It's a measurement problem.

Temporal Reasoning Failures in Production AI Systems

10 min read
Tian Pan
Software Engineer

An agent that confidently recommends products that have been out of stock for six months. A customer service bot that tells a user there's no record of the order they placed 20 minutes ago. A coding assistant that generates working code against a library API deprecated two years ago. These aren't hallucinations in the traditional sense — the model is recalling something that was once accurate. That's a different failure mode entirely, and most teams aren't equipped to detect or defend against it.

The distinction matters because the mitigations are fundamentally different. You can't prompt-engineer your way out of staleness. You can't fine-tune your way out of it either — fine-tuning on stale knowledge makes the problem worse, not better, because the model expresses outdated information with greater authority. And as models become more fluent and confident in their delivery, their confidently-wrong stale answers become harder, not easier, for users to catch.

Test-Driven Development for LLM Applications: Where the Analogy Holds and Where It Breaks

10 min read
Tian Pan
Software Engineer

A team built an AI research assistant using Claude. They iterated on the prompt for three weeks, demo'd it to stakeholders, and launched it feeling confident. Two months later they discovered that the assistant had been silently hallucinating citations across roughly 30% of outputs — a failure mode no one had tested for because the eval suite was built after the prompt had already "felt right" in demos.

This pattern is the rule, not the exception. The LLM development industry has largely adopted test-driven development vocabulary — evals, regression suites, golden datasets, LLM-as-judge — while ignoring the most important rule TDD establishes: write the test before the implementation, not after.

Here is how to do that correctly, and the three places where the TDD analogy breaks down so badly that following it literally will make your system worse.
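For the citation-hallucination case above, "test first" can be as small as this sketch (my illustration, assuming an invented `[doc:ID]` citation convention): a check written before any prompt iteration, so hallucinated citations fail from day one rather than two months in:

```python
# Eval written before the prompt: fail any answer that cites a document
# not in the corpus, or that cites nothing at all.
import re

def citations_in(text: str) -> list[str]:
    # Illustrative convention: citations look like [doc:ID]
    return re.findall(r"\[doc:(\w+)\]", text)

def eval_citations(answer: str, corpus_ids: set[str]) -> bool:
    """Pass only if the answer cites at least one real corpus document
    and every citation resolves."""
    cited = citations_in(answer)
    return bool(cited) and all(c in corpus_ids for c in cited)
```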

The Anatomy of an Agent Harness

8 min read
Tian Pan
Software Engineer

There's a 100-line Python agent that scores 74–76% on SWE-bench Verified — only 4–6 percentage points behind state-of-the-art systems built by well-funded teams. The execution loop itself isn't where the complexity lives. World-class teams invest six to twelve months building the infrastructure around that loop. That infrastructure has a name: the harness.

The formula is simple: Agent = Model + Harness. The model handles reasoning. The harness handles everything else — tool execution, context management, safety enforcement, error recovery, state persistence, and human-in-the-loop workflows. If you've been spending months optimizing prompts and model selection while shipping brittle agents, you've been optimizing the wrong thing.
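A stripped-down version of that loop (a sketch, with a stubbed model in place of an LLM client) shows where the harness's responsibilities attach — tool dispatch, error capture fed back as context, and a step budget:

```python
# Minimal harness skeleton: the model proposes actions, the harness
# executes tools, surfaces errors back into context, and caps steps.

def run_agent(model, tools: dict, task: str, max_steps: int = 10):
    history = [("task", task)]
    for _ in range(max_steps):
        action = model(history)              # model decides the next action
        if action["tool"] == "finish":
            return action["args"]
        fn = tools.get(action["tool"])
        if fn is None:
            history.append(("error", f"unknown tool {action['tool']}"))
            continue
        try:
            history.append(("result", fn(**action["args"])))
        except Exception as e:               # recover by informing, not crashing
            history.append(("error", str(e)))
    return None  # step budget exhausted
```

Everything a real harness adds — context compaction, sandboxing, persistence, human approval gates — hangs off the seams visible in these 20 lines.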

LLM Evals: What Actually Works and What Wastes Your Time

10 min read
Tian Pan
Software Engineer

Most teams building LLM applications fall into one of two failure modes. The first is building no evals at all and shipping features on vibes. The second is building elaborate evaluation infrastructure before they understand what they're actually trying to measure. Both are expensive mistakes.

The teams that do evals well share a common approach: they start by looking at data, not by building systems. Error analysis comes before evaluation automation. Human judgment grounds the metrics before any automated judge is trusted. And they treat evaluation not as a milestone to cross but as a continuous discipline that evolves alongside the product.

This is what evals actually look like in practice — the decisions that matter, the patterns that waste effort, and the tradeoffs that aren't obvious until you've been burned.

Why Your LLM Evaluators Are Miscalibrated — and the Data-First Fix

9 min read
Tian Pan
Software Engineer

Most teams build their LLM evaluators in the wrong order. They write criteria, then look at data. That inversion is the root cause of miscalibrated evals, and it's almost universal in teams shipping their first AI product. The criteria sound reasonable on paper — "the response should be accurate, helpful, and concise" — but when you apply them to real model outputs, you discover the rubric doesn't match what you actually care about. You end up with an evaluator that grades things you're not measuring and misses failures that matter.

The fix isn't a better rubric. It's a different workflow: look at the data first, define criteria second, and then validate your evaluator against human judgment before trusting it to run unsupervised.
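The validation step can be sketched in a few lines (an illustration, assuming simple "pass"/"fail" labels): score the judge's agreement with human labels, and separately measure the rate at which it passes outputs humans marked as failures — the miscalibration that matters most:

```python
# Validate an LLM judge against human labels before trusting it.

def agreement(judge: list[str], human: list[str]) -> float:
    """Fraction of examples where the judge matches the human label."""
    assert len(judge) == len(human)
    return sum(j == h for j, h in zip(judge, human)) / len(human)

def missed_failures(judge: list[str], human: list[str]) -> float:
    """Fraction of human-labeled failures the judge passed -- the
    failure mode a raw agreement number can hide."""
    fails = [(j, h) for j, h in zip(judge, human) if h == "fail"]
    if not fails:
        return 0.0
    return sum(j == "pass" for j, _ in fails) / len(fails)
```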

Eval Engineering for Production LLM Systems

11 min read
Tian Pan
Software Engineer

Most teams building LLM systems start with the wrong question. They ask "how do I evaluate this?" before understanding what actually breaks. Then they spend weeks building eval infrastructure that measures the wrong things, achieve 90%+ pass rates immediately, and ship products that users hate. The evaluations weren't wrong — they just weren't measuring failure.

Effective eval engineering isn't primarily about infrastructure. It's about developing a precise, shared understanding of what "good" means for your specific system. The infrastructure is almost incidental. In mature LLM teams, 60–80% of development time goes toward error analysis and evaluation — not feature work. That ratio surprises most engineers until they've shipped a broken model to production and spent a week debugging what went wrong.

Designing an Agent Runtime from First Principles

10 min read
Tian Pan
Software Engineer

Most agent frameworks make a critical mistake early: they treat the agent as a function. You call it, it loops, it returns. That mental model works for demos. It falls apart the moment a real-world task runs for 45 minutes, hits a rate limit at step 23, and you have nothing to resume from.

A production agent runtime is not a function runner. It is an execution substrate — something closer to a process scheduler or a distributed workflow engine than a Python function. Getting this distinction right from the beginning determines whether your agent system handles failures gracefully or requires a human to hit retry.
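The smallest version of that distinction (a sketch of the idea, using an in-memory dict where production would use durable storage) is a step runner that commits a checkpoint after every step, so a crash at step 23 resumes at step 23 rather than step 0:

```python
# Resumable step runner: persist state after each step so failures
# resume from the last committed checkpoint, not from scratch.

def run_with_checkpoints(steps, store: dict, run_id: str):
    """`steps` is a list of (name, fn) pairs; each fn(state) -> new state.
    `store` maps run_id -> {"next": step index, "state": dict}."""
    ckpt = store.get(run_id, {"next": 0, "state": {}})
    state = ckpt["state"]
    for i in range(ckpt["next"], len(steps)):
        name, fn = steps[i]
        state = fn(state)                                 # may raise mid-run
        store[run_id] = {"next": i + 1, "state": state}   # durable commit point
    return state
```

Calling it again with the same `run_id` after a failure re-enters at the first uncommitted step — the "function runner" model has no equivalent of that.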

Why Your Agent Should Write Code, Not JSON

10 min read
Tian Pan
Software Engineer

Most agent frameworks default to the same action model: the LLM emits a JSON blob, the host system parses it, calls a tool, returns the result. Repeat. It's clean, auditable, and almost universally used — which is exactly the problem. For anything beyond a single tool call, this architecture forces you to write scaffolding code that solves problems the agent could solve itself, if only it were allowed to write code.

There's a different approach: give the agent a Python interpreter and let it emit executable code as its action. One published benchmark shows a 20% higher task success rate than JSON tool-calling. An internal benchmark shows 30% fewer LLM round-trips on average. A framework built around this idea hit #1 on the GAIA leaderboard (44.2% on validation) shortly after release. The tradeoff is a more complex execution environment — but the engineering required is tractable, and the behavioral gains are real.
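In miniature, the pattern looks like this sketch (mine, not the post's): expose the tools as callables in a namespace and execute the model's emitted script. Note that a bare `exec` is not a sandbox — production systems run the code in an isolated interpreter or container:

```python
# Code-as-action sketch: exec the model's emitted Python in a namespace
# that exposes the tools. NOT sandboxed -- isolate this in production.

def run_code_action(code: str, tools: dict):
    namespace = dict(tools)            # tools become in-scope callables
    exec(code, namespace)              # the agent's "action" is a script
    return namespace.get("result")     # convention: agent assigns `result`

# One emitted script chains calls that JSON tool-calling would need
# several LLM round-trips for:
action = """
prices = [lookup_price(x) for x in ["a", "b", "c"]]
result = sum(prices)
"""
```

The loop, the list comprehension, and the aggregation all happen inside one action — which is exactly where the round-trip savings come from.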