9 posts tagged with "caching"

The First Token Lies: Why Context Loading—Not Inference—Controls Your AI Feature's Latency

May 7, 2026 · 9 min read

Software Engineer

Most AI latency conversations focus on the wrong thing. Teams obsess over GPU utilization, model quantization, and batch sizes. Meanwhile, the latency that actually annoys users—the pause before the AI says anything at all—is determined almost entirely by what happens before inference starts. The bottleneck is context, not compute.

Time-to-first-token (TTFT) is the metric that determines whether your AI feature feels responsive or sluggish. And TTFT is dominated by the prefill phase: the time it takes to process the full input context before a single output token is generated. On a 128K-token context, prefill can take seconds. The GPU is working hard, but the user sees nothing.

The solution isn't a better GPU. It's pre-loading the context before the user asks anything.
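To make the prefill claim concrete, here is a back-of-the-envelope sketch. It assumes TTFT is roughly the number of not-yet-processed input tokens divided by prefill throughput. The 20,000 tokens-per-second figure and the 120,000-token pre-loaded prefix are illustrative assumptions, not measurements from any particular model or GPU, and the function name is hypothetical.

```python
def estimated_ttft_seconds(
    context_tokens: int,
    preloaded_tokens: int,
    prefill_tokens_per_second: float,
) -> float:
    """Rough TTFT estimate: only tokens that still need prefill cost time.

    Ignores network, queueing, and decode time for the first token.
    """
    if not 0 <= preloaded_tokens <= context_tokens:
        raise ValueError("preloaded_tokens must be between 0 and context_tokens")
    remaining = context_tokens - preloaded_tokens
    return remaining / prefill_tokens_per_second


# Assumed numbers, for illustration only.
cold = estimated_ttft_seconds(128_000, 0, 20_000.0)        # nothing pre-loaded
warm = estimated_ttft_seconds(128_000, 120_000, 20_000.0)  # most context pre-loaded
print(cold, warm)  # 6.4 0.4
```

Under these assumptions the same hardware goes from several seconds of silence to a fraction of a second, purely by moving prefill work ahead of the user's question.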

Your Tool-Result Cache Is a Stale-Data Contract You Never Wrote

April 28, 2026 · 11 min read

Tian Pan

Software Engineer

The trace looks clean. The agent called get_inventory_status, the tool returned {"available": 142, "warehouse": "SEA-3"}, and the model wove that into a confident answer. The customer placed an order. The warehouse said the item had been out of stock since 9 a.m. The cached row was four hours old. Nobody on the team had decided four hours was acceptable — that was just whatever the cache framework defaulted to when the platform team wired up the wrapper.

This is the failure mode that gets misfiled as a hallucination. The model isn't confabulating; it is faithfully reasoning over a stale tool result that nobody bothered to label as stale. The trace logs a clean call and a clean response, the eval set never saw a stale-cache case, and the regression compounds quietly across every customer who hits the same TTL window.
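One way to turn the implicit TTL into an explicit contract is to make every tool declare a staleness budget and to attach the result's age to what the model sees. The sketch below is a minimal illustration under that assumption; `LabeledToolCache`, its fields, and the 60-second budget are hypothetical names and values, not part of any specific cache framework.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple


@dataclass
class CachedResult:
    value: Any
    fetched_at: float  # clock reading when the tool was actually called


class LabeledToolCache:
    """Hypothetical wrapper: no default TTL, and every result carries its age."""

    def __init__(self, max_age_s: Dict[str, float],
                 clock: Callable[[], float] = time.time):
        self.max_age_s = max_age_s
        self.clock = clock
        self._store: Dict[Tuple[str, tuple], CachedResult] = {}

    def call(self, tool_name: str, args: tuple,
             fetch: Callable[[], Any]) -> Dict[str, Any]:
        # A tool without a declared budget is a configuration error,
        # not something to paper over with a framework default.
        if tool_name not in self.max_age_s:
            raise KeyError(f"no staleness budget declared for {tool_name}")
        now = self.clock()
        key = (tool_name, args)
        entry = self._store.get(key)
        from_cache = entry is not None and (
            now - entry.fetched_at <= self.max_age_s[tool_name]
        )
        if not from_cache:
            entry = CachedResult(value=fetch(), fetched_at=now)
            self._store[key] = entry
        # The age travels with the result so the model and the trace can see it.
        return {
            "result": entry.value,
            "age_seconds": now - entry.fetched_at,
            "from_cache": from_cache,
        }
```

With this shape, a four-hour-old inventory row can only be served if someone wrote down that four hours is acceptable, and the trace shows the age alongside the value.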

Semantic Cache Is a Safety Problem, Not a Perf Win

April 23, 2026 · 12 min read

Tian Pan

Software Engineer

A semantic cache hit is the only LLM optimization that can serve the wrong answer to the wrong user in under a millisecond. SQL caches return your row or someone else's because somebody wrote a bad join — the failure mode is a query bug. Semantic caches return another tenant's response because two embeddings landed within 0.03 cosine of each other, which is the system working exactly as designed. The cache is doing its job. The job is the problem.

Most teams ship semantic caching as a cost initiative — there's a "70% bill reduction" deck floating around every AI engineering Slack — and review the cache key the way they'd review a Redis TTL: not at all. That review goes to the perf team. The safety team never sees the design doc because nobody filed a security review for "we added a faster path." Six months later somebody's compliance audit finds that "I can't log into my account, my email is [email protected]" and "I can't log into my account, my email is [email protected]" both vectorized within threshold of "I can't log into my account" and the cache cheerfully served Bob the response originally generated for Jane, including the password reset link her account had asked for.

This post is about why semantic caches deserve the same review rigor as SQL predicates, the cache-key design that prevents cross-user leak by construction, and the audit trail you need to distinguish "cache hit served the right answer" from "cache hit served someone else's answer at sub-millisecond latency."
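As a rough illustration of what "prevents cross-user leak by construction" can mean, the sketch below partitions the cache by a scope string (for example a user or tenant ID) before any similarity comparison happens. The class name, the scope format, and the 0.97 similarity threshold (matching the 0.03 cosine distance mentioned above) are assumptions made for this example, and a real system would use a vector index rather than a linear scan.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class ScopedSemanticCache:
    """Hypothetical cache: similarity is only ever computed within one scope."""

    def __init__(self, threshold: float = 0.97):
        self.threshold = threshold
        self._by_scope: Dict[str, List[Tuple[List[float], str]]] = {}

    def put(self, scope: str, embedding: Sequence[float], response: str) -> None:
        self._by_scope.setdefault(scope, []).append((list(embedding), response))

    def get(self, scope: str, embedding: Sequence[float]) -> Optional[str]:
        best_response: Optional[str] = None
        best_similarity = self.threshold
        # Entries from other scopes are never candidates, however close they are.
        for cached_embedding, response in self._by_scope.get(scope, []):
            similarity = cosine_similarity(embedding, cached_embedding)
            if similarity >= best_similarity:
                best_response, best_similarity = response, similarity
        return best_response
```

The design choice is that the scope is part of the lookup path rather than a filter applied after matching, so an identical embedding from another user is a miss by definition.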

Cache Invalidation for AI: Why Every Cache Layer Gets Harder When the Answer Can Change

April 20, 2026 · 10 min read

Tian Pan

Software Engineer

Phil Karlton's famous quip — "There are only two hard things in Computer Science: cache invalidation and naming things" — was coined before language models entered production. Add AI to the stack and cache invalidation doesn't just get harder; it gets harder at every layer simultaneously, for fundamentally different reasons at each one.

Traditional caches store deterministic outputs: the database row, the rendered HTML, the computed price. When the source changes, you invalidate the key, and the next request fetches fresh data. The contract is simple because the answer is a fact.

AI caches store something different: responses to queries where the "correct" answer depends on context, recency, model behavior, and the source documents the model was given. Stale here doesn't just mean outdated; it can mean semantically wrong in ways your monitoring won't catch until a user notices.

The Caching Hierarchy for Agentic Workloads: Five Layers Most Teams Stop at Two

April 12, 2026 · 11 min read

Tian Pan

Software Engineer

Most teams deploying AI agents implement prompt caching, maybe add a semantic cache, and call it done. They're leaving 40-60% of their potential savings on the table. The reason isn't laziness — it's that agentic workloads create caching problems that don't exist in simple request-response LLM calls, and the solutions require thinking in layers that traditional web caching never needed.

A single agent task might involve a 4,000-token system prompt, three tool calls that each return different-shaped data, a multi-step plan that's structurally identical to yesterday's plan, and session context that needs to persist across a conversation but never across users. Each of these represents a different caching opportunity with different TTL requirements, different invalidation triggers, and different failure modes when the cache goes stale.

Coalesce Before You Call: The LLM Request Batching Pattern That Cuts Costs Without Slowing Users Down

April 12, 2026 · 11 min read

Tian Pan

Software Engineer

Most teams discover request coalescing the same way: through a surprisingly large invoice. They ship an LLM-backed feature, usage grows, and then the billing dashboard shows they're paying for fifty thousand requests a day. A closer look reveals that roughly thirty thousand of them asked the same thing in slightly different words. Each paraphrase of "summarize this document" hit the model separately. Each near-duplicate triggered a full inference cycle. The cost scaled with traffic volume, not with the semantic diversity of what users actually wanted.

Request coalescing is the pattern that fixes this. It is not one technique but a layered architecture: in-flight deduplication to prevent concurrent duplicates, exact caching for repeated identical prompts, and semantic batching to catch the paraphrased variations in between. The order matters, the thresholds matter, and understanding where the pattern breaks down — particularly around streaming — is what separates a working implementation from one that saves money on a staging server but causes subtle bugs in production.
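The first two layers can be sketched in a few lines with `asyncio`. This is a minimal illustration, not a production implementation: `Coalescer` and `call_model` are hypothetical names, the exact cache is an unbounded in-memory dict, and the semantic batching layer and streaming are deliberately left out.

```python
import asyncio
import hashlib
from typing import Awaitable, Callable, Dict


class Coalescer:
    """Hypothetical sketch: exact cache first, then in-flight deduplication."""

    def __init__(self, call_model: Callable[[str], Awaitable[str]]):
        self._call_model = call_model               # assumed async LLM client call
        self._cache: Dict[str, str] = {}            # exact-match cache, no eviction
        self._in_flight: Dict[str, asyncio.Task] = {}

    @staticmethod
    def _key(prompt: str) -> str:
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    async def complete(self, prompt: str) -> str:
        key = self._key(prompt)
        # Layer 1: an identical prompt that already finished.
        if key in self._cache:
            return self._cache[key]
        # Layer 2: an identical prompt that is still running; share its task.
        task = self._in_flight.get(key)
        if task is None:
            task = asyncio.ensure_future(self._run(key, prompt))
            self._in_flight[key] = task
        return await task

    async def _run(self, key: str, prompt: str) -> str:
        try:
            result = await self._call_model(prompt)
            self._cache[key] = result
            return result
        finally:
            # Clear the slot on success or failure so errors are not pinned.
            self._in_flight.pop(key, None)
```

Checking the exact cache before the in-flight map keeps the common case a plain dictionary lookup. Note that in this sketch a failure in the shared call propagates to every waiter, which is one of the places a real implementation needs an explicit policy.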

Semantic Caching for LLMs: The Cost Tier Most Teams Skip

April 10, 2026 · 11 min read

Tian Pan

Software Engineer

Most teams building LLM applications know about prompt caching — the prefix-reuse mechanism that API providers offer to discount repeated input tokens. Far fewer have deployed the layer above it: semantic caching, which eliminates LLM calls entirely for queries that mean the same thing but are phrased differently. The gap isn't laziness; it's a widespread misunderstanding of what "95% accuracy" means in semantic caching vendor documentation.

That 95% figure refers to match correctness on cache hits, not to how often the cache actually gets hit. Real production hit rates range from 10% for open-ended chat to 70% for structured FAQ systems — and the math that determines which side of that range you're on should happen before you write any cache code.
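The distinction is easy to see with a small calculation. The sketch below treats hit rate and match accuracy as independent inputs; the 10,000 requests per day is an assumed volume for illustration, and the function name is hypothetical.

```python
from typing import Dict


def cache_outcomes(requests_per_day: int, hit_rate: float,
                   match_accuracy: float) -> Dict[str, int]:
    """Split daily traffic into avoided calls, remaining calls, and bad hits."""
    hits = requests_per_day * hit_rate
    return {
        "llm_calls_avoided": round(hits),
        "llm_calls_remaining": round(requests_per_day - hits),
        # Hits that matched the wrong cached answer.
        "wrong_answers_served": round(hits * (1.0 - match_accuracy)),
    }


# Same 95% match accuracy, opposite ends of the hit-rate range.
open_ended_chat = cache_outcomes(10_000, hit_rate=0.10, match_accuracy=0.95)
structured_faq = cache_outcomes(10_000, hit_rate=0.70, match_accuracy=0.95)
print(open_ended_chat)  # 1,000 calls avoided, 50 wrong answers
print(structured_faq)   # 7,000 calls avoided, 350 wrong answers
```

At a 10% hit rate the cache removes a tenth of the LLM calls; at 70% it removes most of them but also serves seven times as many incorrect matches, which is why both numbers belong in the estimate.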

Semantic Caching for LLM Applications: What the Benchmarks Don't Tell You

April 9, 2026 · 8 min read

Tian Pan

Software Engineer

Every vendor selling an LLM gateway will show you a slide with "95% cache hit rate." What that slide won't show you is the fine print: that number refers to match accuracy when a hit is found, not how often a hit is found in the first place. Real production systems see 20–45% hit rates — and that gap between marketing and reality is where most teams get burned.

Semantic caching is a genuinely useful technique. But deploying it without understanding its failure modes is how you end up returning wrong answers to users with high confidence, wondering why your support queue doubled.

Prompt Caching: The Optimization That Cuts LLM Costs by 90%

April 7, 2026 · 7 min read

Tian Pan

Software Engineer

Most teams building on LLMs are overpaying by 60–90%. Not because they're using the wrong model or prompting inefficiently — but because they're reprocessing the same tokens on every single request. Prompt caching fixes this, and it takes about ten minutes to implement. Yet it remains one of the most underutilized optimizations in production LLM systems.

Here's what's happening: every time you send a request to an LLM API, the model runs attention over every token in your prompt. If your system prompt is 10,000 tokens and you're handling 1,000 requests per day, you're paying to process 10 million tokens daily just for the static part of your prompt — context that never changes. Prompt caching stores the intermediate computation (the key-value attention states) so subsequent requests can skip that work entirely.
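The arithmetic above can be written out as a short sketch. It assumes, purely for illustration, that a cached token is billed at 10% of a normal input token, that the cache stays warm all day, and that there is no extra charge for writing to the cache; actual pricing and expiry rules vary by provider, and the function name is hypothetical.

```python
from typing import Tuple


def daily_static_prompt_cost(static_prompt_tokens: int, requests_per_day: int,
                             cached_read_cost_pct: int) -> Tuple[int, int]:
    """Return (uncached, cached) daily cost in full-price token equivalents.

    cached_read_cost_pct is an assumed price for a cached token, expressed
    as a percentage of a normal input token.
    """
    uncached = static_prompt_tokens * requests_per_day
    # First request processes the prefix at full price and populates the cache.
    first_request = static_prompt_tokens
    later_requests = (
        (requests_per_day - 1) * static_prompt_tokens * cached_read_cost_pct // 100
    )
    return uncached, first_request + later_requests


uncached, cached = daily_static_prompt_cost(10_000, 1_000, cached_read_cost_pct=10)
print(uncached)  # 10000000
print(cached)    # 1009000
print(f"{1 - cached / uncached:.1%} saved on the static prefix")
```

Under these assumptions the static 10,000-token prefix drops from 10 million token-equivalents per day to about 1 million. The saving applies only to the repeated prefix, not to the per-request user input or the output tokens.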
