Skip to main content

678 posts tagged with "ai-engineering"

View all tags

Multimodal Pipelines in Production: What Breaks When You Go Beyond Text

· 11 min read
Tian Pan
Software Engineer

Most LLM engineering wisdom — caching prompts, tuning temperature, budgeting tokens — assumes text goes in and text comes out. Add an image, a PDF, or an audio clip and almost none of that wisdom transfers. The preprocessing is different. The failure modes are different. The cost model is different. And the eval suite you built for your text pipeline won't catch the new things that break.

About 50% of enterprise knowledge lives in non-text formats: PDFs, slides, scanned forms, product images. Teams that reach that data discover that going multimodal isn't just adding a modality — it's adding an entirely new engineering surface.

The Noisy Neighbor Problem in Shared LLM Infrastructure: Tenancy Models for AI Features

· 12 min read
Tian Pan
Software Engineer

The pager goes off at 2:47 AM. The customer-facing chat assistant is returning 429s for half of paying users. Engineers scramble through dashboards, looking for the bug they shipped that afternoon. They find nothing — the code is fine. The actual culprit is a batch summarization job a different team launched that evening, sharing the same provider API key, which has eaten the account's per-minute token budget for the next four hours. Nobody owns the shared key. Nobody owns the limit.

This is the noisy-neighbor problem, and it has a particular cruelty in LLM systems that classic API quota incidents do not. A REST endpoint that hits its rate ceiling fails fast and gets retried; an LLM token-per-minute bucket is consumed asymmetrically by request content, so a single feature emitting 8K-token completions can starve a feature making cheap 200-token classification calls without ever appearing in request-count graphs. The traffic isn't noisy in the dimension you're measuring.

Most teams discover this the way the team above did: an unrelated team's job collides with a paying user's session, and the only thing both have in common is a string in an environment variable.

PII in the Prompt Layer: The Privacy Engineering Gap Most Teams Ignore

· 12 min read
Tian Pan
Software Engineer

Your organization has a privacy policy. It says something reasonable about user data being handled carefully, retention limits, and compliance with GDPR and HIPAA. What it almost certainly does not say is whether the text of that user's name, email address, or medical history was transmitted verbatim to a hosted LLM API before any policy control was applied.

That gap — between the privacy policy you can point to and the privacy guarantee you can actually prove — is where most production LLM systems are silently failing. Research shows roughly 8.5% of prompts submitted to tools like ChatGPT and Copilot contain sensitive information, including PII, credentials, and internal file references. In enterprise environments where users paste emails, customer data, and support tickets into AI-assisted workflows, that number almost certainly runs higher.

The problem is not that developers are careless. It is that the LLM prompt layer was never designed as a data processing boundary. It inherits content from upstream systems — user input, RAG retrievals, agent context — without enforcing the data classification rules that govern every other part of the stack.

Proactive Agents: Event-Driven and Scheduled Automation for Background AI

· 11 min read
Tian Pan
Software Engineer

Almost every tutorial on building AI agents starts the same way: user types a message, agent reasons, agent responds. That model works fine for chatbots and copilots. It fails to describe the majority of production AI work that organizations are now deploying.

The agents that quietly matter most in enterprise environments don't wait for a message. They wake up when a database row changes, when a queue crosses a depth threshold, when a scheduled cron fires at 3 AM, or when monitoring detects that a metric drifted outside bounds. They act without a user present. When they fail, nobody notices until the damage has compounded.

Building these proactive agents requires a substantially different design vocabulary than building reactive assistants. The session-scoped mental model that works for conversational AI breaks down when your agent runs in a loop, retries in the background, and has no human to catch its mistakes.

Prompt Diff Review as a Discipline: What Reviewers Actually Need to Ask

· 11 min read
Tian Pan
Software Engineer

A one-line change to a system prompt landed in production last quarter at a mid-sized AI startup. The diff looked harmless: an engineer tightened the instructions around response length. The reviewer approved it in two minutes, as they would a variable rename. Within 48 hours, support tickets spiked. The model had started truncating answers mid-sentence on complex queries, and the edge cases the old phrasing had been silently handling for months were now failing. The original instruction hadn't just controlled length — it had implicitly anchored the model's judgment about when a topic was complete. Nobody had captured that. Nobody had looked for it.

This is the core problem with prompt review today: we're applying code review instincts to a medium where those instincts are mostly wrong. Code review works because the artifact being reviewed is deterministic and the semantics are recoverable from syntax. A prompt is neither. Its meaning is distributed across the model's weights, its training data, and the stochastic sampling that runs at inference time. The diff you see on screen is a fraction of the change you're approving.

Prompting Reasoning Models Differently: Why Your Existing Patterns Break on o1, o3, and Claude Extended Thinking

· 10 min read
Tian Pan
Software Engineer

Most teams adopting reasoning models do the same thing: they copy their existing system prompt, point it at o1 or Claude Sonnet with extended thinking, and assume the model upgrade will do the rest. Benchmarks improve. Production accuracy stays flat — or drops. The issue isn't the model. It's that the mental model for prompting never changed.

Reasoning models don't work like instruction-following models. The strategies that squeeze performance out of GPT-4o — elaborate system prompts, carefully curated few-shot examples, explicit "think step by step" instructions — were designed for a different inference architecture. Applied to reasoning models, they constrain the exact thing that makes these models valuable.

This post is a practical guide to the differences that matter and the adjustments that actually work.

RAG-Specific Prompt Injection: How Adversarial Documents Hijack Your Retrieval Pipeline

· 9 min read
Tian Pan
Software Engineer

Most teams securing RAG applications focus their effort in the wrong place. They validate user inputs, sanitize queries, implement rate limiting, and add output filters. All of that is necessary — and none of it stops the attack that matters most in RAG systems.

The defining vulnerability in retrieval-augmented generation isn't at the user input layer. It's at the retrieval layer — inside the documents your system pulls from its own knowledge base and injects directly into the context window. An attacker who never sends a single request to your API can still compromise your system by planting a document in your corpus. Your input validation never fires. Your injection filters never trigger. The malicious instruction arrives in your LLM's context dressed as legitimate retrieved content, and the model executes it.

The Query Rewrite Layer Your RAG System Is Missing

· 10 min read
Tian Pan
Software Engineer

Most teams tuning a RAG system focus on two levers: chunking strategy and embedding model selection. When retrieval quality degrades, they re-chunk. When recall numbers look bad, they upgrade the embedding model. Both are reasonable moves — but they're optimizing the middle of the pipeline while leaving the highest-leverage point untouched.

The user's query is almost never in the ideal form for vector retrieval. It's terse, colloquial, ambiguous, or assumes context that the index doesn't have. No matter how good your embeddings are, if you're searching with a poorly formed query, you're going to retrieve poorly. The fix isn't downstream — it's transforming the query before it reaches the vector index.

Designing AI Safety Layers That Don't Kill Your Latency

· 9 min read
Tian Pan
Software Engineer

Most teams reach for guardrails the same way they reach for logging: bolt it on, assume it's cheap, move on. It isn't cheap. A content moderation check takes 10–50ms. Add PII detection, another 20–80ms. Throw in output schema validation and a toxicity classifier and you're looking at 200–400ms of overhead stacked serially before a single token reaches the user. Combine that with a 500ms model response and your "fast" AI feature now feels sluggish.

The instinct to blame the LLM is wrong. The guardrails are the bottleneck. And the fix isn't to remove safety — it's to stop treating safety checks as an undifferentiated pile and start treating them as an architecture problem.

Why SQL Agents Fail in Production: Grounding LLMs Against Live Relational Databases

· 11 min read
Tian Pan
Software Engineer

The Spider benchmark looks great. GPT-4 scores above 85% on text-to-SQL translation across hundreds of test queries. Teams read those numbers, wire up a LangChain SQLDatabaseChain, and ship an "ask your data" feature. Two weeks later, an analyst's innocent question about revenue by region triggers a full table scan that takes down reporting for thirty minutes.

The benchmark number was real. The problem is that benchmarks don't use your schema.

Spider 1.0 tests models on databases with 5–30 tables and 50–100 columns. Your production data warehouse has 200 tables, 700+ columns, three dialects of SQL depending on which system you're querying, and column names that made sense to the engineer who wrote them four years ago but are meaningless to anyone else. When researchers introduced Spider 2.0—a benchmark with enterprise-scale schemas and real-world complexity—GPT-4o dropped from 86.6% to 10.1% success rate. That collapse is what production actually looks like.

Stateful Conversations at Database Scale: The Session Store Architecture Every Production Chat Feature Needs

· 10 min read
Tian Pan
Software Engineer

Most engineers shipping chat features discover their session architecture is wrong in production, not in design review. The demo ran fine: you tested with five messages, the conversation history fit in memory, and the LLM responded coherently. Then you launched, and somewhere between the first thousand concurrent sessions and the first deployment rollout, users started experiencing forgotten context, partial responses, or conversations that reset without warning. The in-memory pattern that makes chat features trivial to prototype is precisely what makes them fragile to operate.

This is not a subtle architectural mistake. Conversation state is fundamentally different from request state. Request state lives for milliseconds; conversation state must survive pod restarts, horizontal scaling, deployment cycles, and mobile network interruptions — for minutes, hours, or days. Building on the wrong abstraction creates reliability debt that compounds as conversation length grows and user load increases.

Token Budget as a Product Constraint: Designing Around Context Limits Instead of Pretending They Don't Exist

· 10 min read
Tian Pan
Software Engineer

Most AI products treat the context limit as an implementation detail to hide from users. That decision looks clean in demos and catastrophic in production. When a user hits the limit mid-task, one of three things happens: the request throws a hard error, the model silently starts hallucinating because critical earlier context was dropped, or the product resets the session and destroys all accumulated state. None of these are acceptable outcomes for a product you're asking people to trust with real work.

The token budget isn't a quirk to paper over. It's a first-class product constraint that belongs in your design process the same way memory limits belong in systems programming. The teams that ship reliable AI features have stopped pretending the ceiling doesn't exist.