Skip to main content

780 posts tagged with "ai-engineering"

View all tags

RAG-Specific Prompt Injection: How Adversarial Documents Hijack Your Retrieval Pipeline

· 9 min read
Tian Pan
Software Engineer

Most teams securing RAG applications focus their effort in the wrong place. They validate user inputs, sanitize queries, implement rate limiting, and add output filters. All of that is necessary — and none of it stops the attack that matters most in RAG systems.

The defining vulnerability in retrieval-augmented generation isn't at the user input layer. It's at the retrieval layer — inside the documents your system pulls from its own knowledge base and injects directly into the context window. An attacker who never sends a single request to your API can still compromise your system by planting a document in your corpus. Your input validation never fires. Your injection filters never trigger. The malicious instruction arrives in your LLM's context dressed as legitimate retrieved content, and the model executes it.

The Query Rewrite Layer Your RAG System Is Missing

· 10 min read
Tian Pan
Software Engineer

Most teams tuning a RAG system focus on two levers: chunking strategy and embedding model selection. When retrieval quality degrades, they re-chunk. When recall numbers look bad, they upgrade the embedding model. Both are reasonable moves — but they're optimizing the middle of the pipeline while leaving the highest-leverage point untouched.

The user's query is almost never in the ideal form for vector retrieval. It's terse, colloquial, ambiguous, or assumes context that the index doesn't have. No matter how good your embeddings are, if you're searching with a poorly formed query, you're going to retrieve poorly. The fix isn't downstream — it's transforming the query before it reaches the vector index.

Designing AI Safety Layers That Don't Kill Your Latency

· 9 min read
Tian Pan
Software Engineer

Most teams reach for guardrails the same way they reach for logging: bolt it on, assume it's cheap, move on. It isn't cheap. A content moderation check takes 10–50ms. Add PII detection, another 20–80ms. Throw in output schema validation and a toxicity classifier and you're looking at 200–400ms of overhead stacked serially before a single token reaches the user. Combine that with a 500ms model response and your "fast" AI feature now feels sluggish.

The instinct to blame the LLM is wrong. The guardrails are the bottleneck. And the fix isn't to remove safety — it's to stop treating safety checks as an undifferentiated pile and start treating them as an architecture problem.

Why SQL Agents Fail in Production: Grounding LLMs Against Live Relational Databases

· 11 min read
Tian Pan
Software Engineer

The Spider benchmark looks great. GPT-4 scores above 85% on text-to-SQL translation across hundreds of test queries. Teams read those numbers, wire up a LangChain SQLDatabaseChain, and ship an "ask your data" feature. Two weeks later, an analyst's innocent question about revenue by region triggers a full table scan that takes down reporting for thirty minutes.

The benchmark number was real. The problem is that benchmarks don't use your schema.

Spider 1.0 tests models on databases with 5–30 tables and 50–100 columns. Your production data warehouse has 200 tables, 700+ columns, three dialects of SQL depending on which system you're querying, and column names that made sense to the engineer who wrote them four years ago but are meaningless to anyone else. When researchers introduced Spider 2.0—a benchmark with enterprise-scale schemas and real-world complexity—GPT-4o dropped from 86.6% to 10.1% success rate. That collapse is what production actually looks like.

Stateful Conversations at Database Scale: The Session Store Architecture Every Production Chat Feature Needs

· 10 min read
Tian Pan
Software Engineer

Most engineers shipping chat features discover their session architecture is wrong in production, not in design review. The demo ran fine: you tested with five messages, the conversation history fit in memory, and the LLM responded coherently. Then you launched, and somewhere between the first thousand concurrent sessions and the first deployment rollout, users started experiencing forgotten context, partial responses, or conversations that reset without warning. The in-memory pattern that makes chat features trivial to prototype is precisely what makes them fragile to operate.

This is not a subtle architectural mistake. Conversation state is fundamentally different from request state. Request state lives for milliseconds; conversation state must survive pod restarts, horizontal scaling, deployment cycles, and mobile network interruptions — for minutes, hours, or days. Building on the wrong abstraction creates reliability debt that compounds as conversation length grows and user load increases.

Token Budget as a Product Constraint: Designing Around Context Limits Instead of Pretending They Don't Exist

· 10 min read
Tian Pan
Software Engineer

Most AI products treat the context limit as an implementation detail to hide from users. That decision looks clean in demos and catastrophic in production. When a user hits the limit mid-task, one of three things happens: the request throws a hard error, the model silently starts hallucinating because critical earlier context was dropped, or the product resets the session and destroys all accumulated state. None of these are acceptable outcomes for a product you're asking people to trust with real work.

The token budget isn't a quirk to paper over. It's a first-class product constraint that belongs in your design process the same way memory limits belong in systems programming. The teams that ship reliable AI features have stopped pretending the ceiling doesn't exist.

The AI Hiring Rubric Problem: Why Your Interview Loop Selects the Wrong Engineer

· 8 min read
Tian Pan
Software Engineer

Most teams hiring AI engineers today are running an interview process optimized for a job that doesn't exist. They're screening for LeetCode fluency, quizzing candidates on transformer internals, and rewarding anyone who can confidently sketch a distributed system on a whiteboard. Then those same candidates join the team, struggle to debug a hallucinating retrieval pipeline, and ship a model integration that works beautifully in staging and silently degrades in production.

This isn't a talent problem. It's a measurement problem. The skills that predict success in AI engineering are largely invisible to traditional interview loops—and the skills interviews do measure correlate poorly with what the job actually requires.

Ambient AI Design: When the Chat Interface Is the Wrong Abstraction

· 8 min read
Tian Pan
Software Engineer

Most engineering teams default to building AI features as chat interfaces. A user types something; the model responds. The pattern feels natural because it maps to human conversation, and the tooling makes it easy. But when you watch those chat-based AI features in production, you often see the same dysfunction: the UI sits idle, waiting for a user who is too busy, too distracted, or simply unaware that they should be asking something.

Chat is a pull model. The user initiates. The AI reacts. For a meaningful subset of the valuable AI work in any product—monitoring, anomaly detection, workflow automation, proactive notification—pull is the wrong shape. The work needs to happen whether or not the user remembered to open the chat window.

Backpressure Patterns for LLM Pipelines: Why Exponential Backoff Isn't Enough

· 10 min read
Tian Pan
Software Engineer

During peak usage, some LLM providers experience failure rates exceeding 20%. When your system hits that wall and responds by doubling its wait time and retrying, you are solving the wrong problem. Exponential backoff handles a single call's resilience. It does nothing for the system as a whole — nothing for wasted tokens, nothing for connection pool exhaustion, nothing for the 50 other requests queued behind the one that just got a 429.

The traffic patterns hitting LLM APIs have also changed fundamentally. Simple sub-100-token queries dropped from 80% to roughly 20% of traffic between 2023 and 2025, while requests over 500 tokens became the consistent majority. Agentic workflows chain 10–20 sequential calls in rapid bursts, generating traffic patterns that look indistinguishable from a DDoS attack under traditional request-per-minute rate limits. The infrastructure built for REST APIs with predictable payloads is not the infrastructure you need for LLM pipelines.

Behavioral Contracts: Writing AI Requirements That Engineers Can Actually Test

· 11 min read
Tian Pan
Software Engineer

Most AI projects that die in the QA phase don't fail because the model is bad. They fail because nobody agreed on what "good" meant before the model was built. The acceptance criteria in the ticket said something like "the summarization feature should produce accurate, relevant summaries" — and when the engineer asked what "accurate" meant, the answer was "you know it when you see it." That is not a behavioral requirement. That is a hope.

The problem compounds because teams imported their existing requirements process from deterministic software and applied it unchanged to systems that are fundamentally stochastic. When you write assertTrue(output.equals("Paris")) for a database query, the test either passes or fails with complete certainty. When you write the same shape of assertion for an LLM, you get a test that fails on every valid paraphrase and passes on every confident hallucination. The unit test is lying to you, and the spec it was derived from was never designed for a system that generates distributions of outputs rather than single values.

The Cold Start Problem in AI Features: Why Week One Always Fails

· 11 min read
Tian Pan
Software Engineer

You build a personalization feature, wire it into your app, and ship it. Week one arrives. The system dutifully serves every new user the same handful of globally popular items — your AI, supposedly intelligent, is no smarter than an alphabetically sorted list. Your engagement metrics barely move. Your team concludes the model needs more tuning. It doesn't. The model is working exactly as designed. The problem is you asked it to learn before it had anything to learn from.

This is the cold start problem, and it kills more AI features than bad models ever will.

The core dynamic is circular: a behavioral ML system needs user interactions to produce useful predictions, but it needs to produce useful predictions to earn user interactions. One large e-commerce platform documented that cold start affected more than 60% of their new users — and those users were receiving misfired recommendations that measurably hurt conversion rates. In aggregate metrics, this signal was nearly invisible because warm users masked the damage.

Debugging LLM Failures Systematically: A Field Guide for Engineers Who Can't Read Logs

· 12 min read
Tian Pan
Software Engineer

A fintech startup added a single comma to their system prompt. The next day, their invoice generation bot was outputting gibberish and they'd lost $8,500 before anyone traced the cause. No error was thrown. No alert fired. The application kept running, confident and wrong.

This is what debugging LLMs in production actually looks like. There are no stack traces pointing to line numbers. There's no core dump you can inspect. The system doesn't crash — it continues to operate while silently producing degraded output. Traditional debugging instincts don't transfer. Most engineers respond by randomly tweaking prompts until something looks better, deploying based on three examples, and calling it fixed. Then the problem resurfaces two weeks later in a different shape.

There's a better way. LLM failures follow systematic patterns, and those patterns respond to structured investigation. This is the methodology.