702 posts tagged with "llm"

Token Budget as a Product Constraint: Designing Around Context Limits Instead of Pretending They Don't Exist

· 10 min read
Tian Pan
Software Engineer

Most AI products treat the context limit as an implementation detail to hide from users. That decision looks clean in demos and catastrophic in production. When a user hits the limit mid-task, one of three things happens: the request throws a hard error, the model silently starts hallucinating because critical earlier context was dropped, or the product resets the session and destroys all accumulated state. None of these are acceptable outcomes for a product you're asking people to trust with real work.

The token budget isn't a quirk to paper over. It's a first-class product constraint that belongs in your design process the same way memory limits belong in systems programming. The teams that ship reliable AI features have stopped pretending the ceiling doesn't exist.
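One way to make the ceiling explicit is to budget the context window before each request, much as you would size a buffer. The following is a minimal sketch, not the article's implementation: `estimate_tokens`, `fit_history`, and `BudgetReport` are hypothetical names, and the four-characters-per-token estimate is a crude assumption that a real tokenizer should replace.

```python
from dataclasses import dataclass


def estimate_tokens(text: str) -> int:
    # Assumption: roughly 4 characters per token. Replace with a real tokenizer.
    return max(1, len(text) // 4)


@dataclass
class BudgetReport:
    kept: list[str]   # history turns that fit, oldest first
    dropped: int      # number of older turns left out
    used: int         # tokens committed, including the output reserve
    limit: int        # total context limit

    @property
    def fraction_used(self) -> float:
        return self.used / self.limit


def fit_history(system_prompt: str, history: list[str],
                context_limit: int, reserve_for_output: int) -> BudgetReport:
    """Keep the newest turns that fit after reserving room for the reply."""
    fixed = estimate_tokens(system_prompt) + reserve_for_output
    available = context_limit - fixed
    if available < 0:
        raise ValueError("system prompt and output reserve exceed the context limit")

    kept: list[str] = []
    used = 0
    # Walk newest-first so the most recent turns survive trimming.
    for turn in reversed(history):
        cost = estimate_tokens(turn)
        if used + cost > available:
            break
        kept.append(turn)
        used += cost
    kept.reverse()

    return BudgetReport(kept=kept, dropped=len(history) - len(kept),
                        used=used + fixed, limit=context_limit)
```

Because the report exposes `dropped` and `fraction_used`, the product can surface the limit to the user (for example, warning before older turns fall out) instead of silently truncating.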

When AI Features Create Moats (and When They Don't)

· 9 min read
Tian Pan
Software Engineer

A leaked internal Google memo put it plainly: "We aren't positioned to win this arms race and neither is OpenAI." The author's argument was that fine-tuning a model with LoRA costs roughly $100, that open-source communities could replicate closed-model capabilities within months, and that "we have no moat." This was a Google researcher writing about Google. If that's true inside the world's best-resourced AI lab, what does it mean for your product team betting on a data advantage?

The honest answer is that most AI features are not moats. They are rented capabilities with a UI. But some genuinely compound — and the difference is not about how much data you have. It's about the specific mechanical conditions under which data actually creates defensibility.

The Agent Test Pyramid: Why the 70/20/10 Split Breaks Down for Agentic AI

· 12 min read
Tian Pan
Software Engineer

Every engineering organization that graduates from "we have a chatbot" to "we have an agent" hits the same wall: their test suite stops making sense.

The classical test pyramid (70% unit tests, 20% integration tests, 10% end-to-end) is built on three foundational assumptions: units are cheap to run, isolated from external systems, and deterministic. Agentic AI systems violate all three at once. A "unit" is a model call that costs tokens and returns different answers each time. An end-to-end run can take several minutes and burn more API budget than a junior engineer's entire sprint of tests could justify. And isolation is nearly impossible when the agent's intelligence emerges precisely from interacting with external tools and state.

Why Your AI Demo Always Outperforms Your Launch

· 8 min read
Tian Pan
Software Engineer

The demo was spectacular. The model answered every question fluently, summarized documents without hallucination, and handled every edge case you threw at it. Stakeholders were impressed. The launch date was set.

Three weeks after shipping, accuracy was somewhere around 60%. Users were confused. Tickets were piling up. The model that aced your showcase was stumbling through production traffic.

This is not a story about a bad model. It is a story about a mismatch that almost every team building LLM features encounters: the inputs you tested on are not the inputs your users send.

Backpressure Patterns for LLM Pipelines: Why Exponential Backoff Isn't Enough

· 10 min read
Tian Pan
Software Engineer

During peak usage, some LLM providers experience failure rates exceeding 20%. When your system hits that wall and responds by doubling its wait time and retrying, you are solving the wrong problem. Exponential backoff handles a single call's resilience. It does nothing for the system as a whole — nothing for wasted tokens, nothing for connection pool exhaustion, nothing for the 50 other requests queued behind the one that just got a 429.

The traffic patterns hitting LLM APIs have also changed fundamentally. Simple sub-100-token queries dropped from 80% to roughly 20% of traffic between 2023 and 2025, while requests over 500 tokens became the consistent majority. Agentic workflows chain 10–20 sequential calls in rapid bursts, generating traffic that traditional request-per-minute rate limits cannot distinguish from a DDoS attack. The infrastructure built for REST APIs with predictable payloads is not the infrastructure you need for LLM pipelines.
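To illustrate the difference between per-call retries and system-level backpressure, here is a minimal sketch of an admission gate that caps in-flight calls and sheds load once the wait queue is full. It is one possible pattern, not necessarily the one the article describes; `AdmissionGate`, `Overloaded`, and both limits are hypothetical names and parameters.

```python
import asyncio


class Overloaded(Exception):
    """Raised when the gate refuses a request instead of queueing it."""


class AdmissionGate:
    def __init__(self, max_in_flight: int, max_waiting: int):
        # Create the gate inside a running event loop on older Python versions.
        self._slots = asyncio.Semaphore(max_in_flight)
        self._max_waiting = max_waiting
        self._waiting = 0

    async def run(self, make_call):
        # Shed immediately if every slot is busy and the wait queue is full,
        # rather than letting requests pile up behind a throttled provider.
        if self._slots.locked() and self._waiting >= self._max_waiting:
            raise Overloaded("queue full; shedding request")

        self._waiting += 1
        try:
            await self._slots.acquire()
        finally:
            self._waiting -= 1

        try:
            return await make_call()
        finally:
            self._slots.release()
```

A caller that receives `Overloaded` can fail fast, degrade to a cached answer, or push the work to a slower queue. Backoff on an individual 429 still belongs inside `make_call`; the gate only bounds how much work is allowed to wait.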

Behavioral Contracts: Writing AI Requirements That Engineers Can Actually Test

· 11 min read
Tian Pan
Software Engineer

Most AI projects that die in the QA phase don't fail because the model is bad. They fail because nobody agreed on what "good" meant before the model was built. The acceptance criteria in the ticket said something like "the summarization feature should produce accurate, relevant summaries" — and when the engineer asked what "accurate" meant, the answer was "you know it when you see it." That is not a behavioral requirement. That is a hope.

The problem compounds because teams imported their existing requirements process from deterministic software and applied it unchanged to systems that are fundamentally stochastic. When you write assertTrue(output.equals("Paris")) for a database query, the test either passes or fails with complete certainty. When you write the same shape of assertion for an LLM, you get a test that fails on every valid paraphrase and passes on every confident hallucination. The unit test is lying to you, and the spec it was derived from was never designed for a system that generates distributions of outputs rather than single values.
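A testable alternative to the exact-match assertion is to state a property and a required pass rate over many samples. The sketch below assumes a `generate` callable standing in for the model; `pass_rate`, `mentions_paris`, and the fake model are hypothetical, and a real check would likely be a semantic or rubric-based grader rather than a substring test.

```python
from itertools import cycle
from typing import Callable


def pass_rate(generate: Callable[[str], str], prompt: str,
              check: Callable[[str], bool], samples: int) -> float:
    """Fraction of sampled outputs that satisfy the behavioral check."""
    passed = sum(1 for _ in range(samples) if check(generate(prompt)))
    return passed / samples


def mentions_paris(output: str) -> bool:
    # Property under test: the answer names Paris, in any phrasing.
    return "paris" in output.lower()


# Stand-in for a stochastic model: cycles through canned answers.
_answers = cycle(["Paris", "The capital is Paris.", "It's Paris, France.", "Lyon"])


def fake_model(prompt: str) -> str:
    return next(_answers)


behavioral = pass_rate(fake_model, "Capital of France?", mentions_paris, samples=4)
exact = pass_rate(fake_model, "Capital of France?", lambda out: out == "Paris", samples=4)
```

With these canned answers, the property check accepts the three valid phrasings while the exact-match check accepts only one of four. The requirement then becomes a threshold, such as "at least 95% of sampled outputs satisfy the check," which an engineer can run and a product owner can negotiate.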

The Build-vs-Buy LLM Infrastructure Decision Most Teams Get Wrong

· 10 min read
Tian Pan
Software Engineer

A FinTech team built their AI chatbot on GPT-4o. Month one: $15K. Month two: $35K. Month three: $60K. Projecting $700K annually, they panicked and decided to self-host. Six months and one burned-out engineer later, they were spending $85K/month on infrastructure and a part-time DevOps engineer, and had been through three CUDA incidents that took down production. They eventually landed at $8K/month, but not by self-hosting everything. By routing intelligently.

Both decisions were wrong. The real failure was that they never ran the actual math.
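As a starting point for that math, the run-rate arithmetic on the figures quoted above can be sketched in a few lines. This uses only the numbers in the anecdote; `annualized` is a hypothetical helper, and a real comparison would also need engineering time, utilization, and per-request cost by traffic type, none of which are given here.

```python
def annualized(monthly_cost: float) -> float:
    """Naive run-rate: assumes the monthly figure holds flat for a year."""
    return monthly_cost * 12


api_bills = [15_000, 35_000, 60_000]  # months one to three on the hosted API

# Month-over-month growth factors; a flat run-rate ignores this trend.
growth = [later / earlier for earlier, later in zip(api_bills, api_bills[1:])]

hosted_api = annualized(api_bills[-1])  # 720_000, close to the $700K projection
self_hosted = annualized(85_000)        # 1_020_000
routed = annualized(8_000)              # 96_000
```

Even this crude version shows that the self-hosted setup annualized to more than the API bill that triggered the panic.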

Closing the Feedback Loop: How Production AI Systems Actually Improve

· 12 min read
Tian Pan
Software Engineer

Your AI product shipped three months ago. You have dashboards showing latency, error rates, and token costs. You've seen users interact with the system thousands of times. And yet your model is exactly as good, and as bad, as the day it was deployed.

This is not a data problem. You have more data than you know what to do with. It is an architecture problem. The signals that tell you where your model fails are sitting in application logs, user sessions, and downstream outcome data. They are disconnected from anything that could change the model's behavior.

Most teams treat their LLM as a static artifact and wrap monitoring and evaluation around the outside. The best teams treat production as a training pipeline that never stops.

Debugging LLM Failures Systematically: A Field Guide for Engineers Who Can't Read Logs

· 12 min read
Tian Pan
Software Engineer

A fintech startup added a single comma to their system prompt. The next day, their invoice generation bot was outputting gibberish and they'd lost $8,500 before anyone traced the cause. No error was thrown. No alert fired. The application kept running, confident and wrong.

This is what debugging LLMs in production actually looks like. There are no stack traces pointing to line numbers. There's no core dump you can inspect. The system doesn't crash — it continues to operate while silently producing degraded output. Traditional debugging instincts don't transfer. Most engineers respond by randomly tweaking prompts until something looks better, deploying based on three examples, and calling it fixed. Then the problem resurfaces two weeks later in a different shape.

There's a better way. LLM failures follow systematic patterns, and those patterns respond to structured investigation. This is the methodology.

Document Injection: The Prompt Injection Vector Inside Every RAG Pipeline

· 10 min read
Tian Pan
Software Engineer

Most RAG security discussions focus on the generation layer — jailbreaks, system prompt leakage, output filtering. Practitioners spend weeks tuning guardrails on the model side while overlooking the ingestion pipeline that feeds it. The uncomfortable reality: every document your pipeline ingests is a potential instruction surface. A single PDF can override your system prompt, exfiltrate user data, or manipulate decisions without your logging infrastructure seeing anything unusual.

This isn't theoretical. Microsoft 365 Copilot, Slack AI, and commercial HR screening tools have all been exploited through this vector in the past two years. The same attack pattern appeared in 18 academic papers on arXiv, where researchers embedded hidden prompts to bias AI peer review systems in their favor.
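An ingestion-side check can be as simple as flagging chunks that contain instruction-like phrasing before they reach the index. The sketch below is illustrative only: the pattern list and `flag_chunk` are hypothetical, a deny-list like this is easy to evade, and it would be one layer among several rather than a defense on its own.

```python
import re

# Assumption: a small, illustrative deny-list. Real attacks vary widely
# (hidden text, encodings, other languages), so treat matches as a signal.
SUSPICIOUS_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior|above) instructions", re.I),
    re.compile(r"you are now\b", re.I),
    re.compile(r"system prompt", re.I),
]


def flag_chunk(text: str) -> list[str]:
    """Return the patterns a document chunk matched; empty means no flag."""
    return [p.pattern for p in SUSPICIOUS_PATTERNS if p.search(text)]


def partition(chunks: list[str]) -> tuple[list[str], list[str]]:
    """Split chunks into (clean, quarantined) before indexing."""
    clean = [c for c in chunks if not flag_chunk(c)]
    quarantined = [c for c in chunks if flag_chunk(c)]
    return clean, quarantined
```

Quarantined chunks can be logged and reviewed, which also gives the logging infrastructure something to see at the point where the attack enters.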

The Hybrid Automation Stack: A Decision Framework for Mixing Rules and LLMs

· 9 min read
Tian Pan
Software Engineer

Teams that replace all their Zapier flows and RPA scripts with LLM agents tend to discover the same thing six months later: they've traded brittle-but-auditable for flexible-but-unmaintainable. The Zapier flows broke in predictable ways: step 14 failed because the API changed. The LLM workflows break invisibly: the model quietly routes support tickets to the wrong queue, and nobody finds out until a customer escalates. The audit log says "AI decision," which is lawyer-speak for "no one knows."

The answer isn't to avoid LLMs in automation. It's to be deliberate about which tasks go to which system, and to architect the seam between them so failures don't cross over.
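One way to build that seam is to try deterministic rules first, fall back to the model only when no rule matches, and send low-confidence model output to a human. The following is a minimal sketch under those assumptions; `route_ticket`, `Decision`, the keyword rules, and the `llm_classify` callable are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Decision:
    queue: str
    decided_by: str  # "rule", "llm", or "human_review"
    reason: str      # recorded so the audit log says more than "AI decision"


def route_ticket(text: str, rules: dict[str, str],
                 llm_classify: Callable[[str], tuple[str, float]],
                 min_confidence: float = 0.8) -> Decision:
    # 1. Deterministic rules: auditable, and they fail loudly when they break.
    lowered = text.lower()
    for keyword, queue in rules.items():
        if keyword in lowered:
            return Decision(queue, "rule", f"matched keyword {keyword!r}")

    # 2. Model fallback, accepted only above a confidence threshold.
    queue, confidence = llm_classify(text)
    if confidence >= min_confidence:
        return Decision(queue, "llm", f"confidence {confidence:.2f}")

    # 3. Otherwise stop the failure at the seam instead of guessing.
    return Decision("manual_triage", "human_review",
                    f"low confidence {confidence:.2f}")
```

Recording `decided_by` and `reason` on every decision keeps the rule path and the model path separately auditable, so a misrouted ticket can be traced to the system that made the call.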

LLMs as ETL Primitives: AI in the Data Pipeline, Not Just the Product

· 9 min read
Tian Pan
Software Engineer

The typical AI narrative goes like this: you build a product, you add an AI feature, and users get smarter outputs. That framing is correct, but incomplete. The more durable advantage isn't in the product layer at all — it's in the data pipeline running underneath it.

A growing number of engineering teams have quietly swapped out regex rules, custom classifiers, and hand-coded parsers in their ETL pipelines and replaced them with LLM calls. The result: pipelines that handle unstructured input, adapt to schema drift, and classify records across thousands of categories — without retraining a model for every new edge case. Teams running this pattern at scale are building data assets that compound. Teams still treating LLMs purely as product features are not.