The Feature Store Pattern for LLM Applications: Stop Retrieving What You Could Precompute

· 10 min read
Tian Pan
Software Engineer

Most teams building LLM applications eventually converge on the same ad-hoc architecture: a scatter of cron jobs computing user summaries, a vector database queried fresh on every request, a Redis cache added when latency got embarrassing, and three different codebases that all define "user preference" slightly differently. Only later, usually after a production incident, do they recognize what they built: a feature store — a bad one, assembled accidentally.

The feature store is one of the most battle-tested patterns in traditional ML infrastructure. Applied deliberately to LLM context assembly, it eliminates the latency, cost, and consistency problems that plague most retrieval pipelines. This post explains how.

What Feature Stores Actually Solve (and Why It Maps Directly to LLM Problems)

A feature store is infrastructure that separates feature computation from feature serving. It has two halves:

  • Offline store: Historical feature values in columnar format (Parquet, BigQuery). Used to build training datasets with point-in-time correctness — i.e., you can reconstruct exactly which feature values were available at a given past timestamp, without contaminating training data with future information.
  • Online store: Current feature values in a low-latency key-value store (Redis, DynamoDB, RonDB). Optimized for sub-millisecond lookups at inference time.

The core insight is that features are computed once and served many times. In traditional ML, this might be a user's 30-day purchase history, precomputed nightly and materialized to both stores. At prediction time, you fetch the row from the online store rather than running the aggregation on the fly.

LLM applications have an identical problem with different terminology. Instead of "features for a classification model," you're assembling "context for a generation model." The constraint is the same: context window space is finite and expensive, retrieval has latency, and running the same aggregations per request is wasteful. The solution is the same too: precompute the stable parts, retrieve only what changes.
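To make the compute-once/serve-many split concrete, here is a minimal sketch of the two paths. The function names and the in-memory dict standing in for Redis are illustrative assumptions, not anything prescribed by a particular feature store platform:

```python
from datetime import datetime, timezone

# Stand-in for the online store (in production: Redis, DynamoDB, RonDB).
online_store: dict[str, dict] = {}

def materialize_user_profile(user_id: str, interactions: list[str]) -> None:
    """Batch path: run the expensive aggregation once, write the result."""
    profile = {
        "summary": f"{len(interactions)} interactions; most recent: {interactions[-1]}",
        "computed_at": datetime.now(timezone.utc).isoformat(),
    }
    online_store[f"user_profile:{user_id}"] = profile

def get_user_profile(user_id: str) -> dict | None:
    """Serving path: a key-value lookup instead of re-running the aggregation."""
    return online_store.get(f"user_profile:{user_id}")

materialize_user_profile("u42", ["viewed pricing page", "opened support ticket"])
print(get_user_profile("u42"))
```

The aggregation runs once per refresh; every request afterward pays only for the lookup.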

The Three-Tier Freshness Model

Not all context has the same staleness tolerance. Applying feature store thinking to LLM context reveals a natural three-tier architecture:

Batch tier (daily to weekly)

This is context that changes slowly and benefits from expensive computation that you don't want to redo per request:

  • User preference summaries distilled from months of activity
  • Entity profiles (company information, product attributes, article summaries)
  • Knowledge base embeddings indexed into a vector store
  • Fine-tuning datasets built from historical interaction patterns

These get computed by batch jobs (Spark, dbt, scheduled SQL) and materialized to the offline store, then synced to the online store. At request time, you fetch a pre-built user profile — 500 tokens — instead of retrieving 50 raw interactions and summarizing them live.
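A minimal sketch of that batch job, assuming an append-only list as the offline store and a dict as the online store (illustrative stand-ins for Parquet/BigQuery and Redis):

```python
import time

offline_store: list[dict] = []      # append-only history (in production: Parquet, BigQuery)
online_store: dict[str, str] = {}   # latest value per key (in production: Redis)

def nightly_profile_job(user_id: str, raw_interactions: list[dict]) -> None:
    """Scheduled batch job: distill months of activity into a short profile summary."""
    summary = (f"User prefers {raw_interactions[-1]['category']}; "
               f"{len(raw_interactions)} events in window.")
    row = {"user_id": user_id, "feature": "profile_summary",
           "value": summary, "event_ts": time.time()}
    offline_store.append(row)                              # full history, for training later
    online_store[f"profile_summary:{user_id}"] = summary   # latest value, for serving
```

Writing to both stores in the same job is what keeps training data and serving data derived from the same computation.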

Streaming tier (seconds to minutes)

Context that changes fast enough that daily computation is stale, but not so fast that you need real-time queries:

  • Recent session activity (last 10 interactions)
  • Active document edit history
  • Real-time behavioral signals (currently browsing category X)

This flows through a streaming pipeline (Kafka + Flink, or Spark Structured Streaming) and writes continuously to the online store. The feature is always fresh enough without being queried from source systems on every request.
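A sketch of the streaming consumer, with a plain Python loop standing in for the Kafka/Flink runtime and a bounded deque standing in for a Redis LPUSH/LTRIM pattern (all names assumed for illustration):

```python
from collections import deque

RECENT_WINDOW = 10
recent_activity: dict[str, deque] = {}  # online-store entry per user

def handle_event(event: dict) -> None:
    """Streaming consumer: keep the 'last 10 interactions' feature continuously fresh."""
    window = recent_activity.setdefault(event["user_id"], deque(maxlen=RECENT_WINDOW))
    window.append(event["action"])

# In production this loop is a Kafka/Flink or Spark Structured Streaming job.
for event in [{"user_id": "u42", "action": "viewed item 1887"},
              {"user_id": "u42", "action": "added item 1887 to cart"}]:
    handle_event(event)

print(list(recent_activity["u42"]))
```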

Real-time tier (on-demand)

Context that must reflect the exact current state:

  • Inventory availability for a purchase recommendation
  • Compliance-sensitive data that cannot be cached
  • The current user message itself

This tier is small by design. The failure mode is putting everything here: each additional source-system call adds 50–300ms latency. Conversational applications with a 200ms response budget cannot afford more than one or two real-time lookups per request.

The right distribution for most production systems is approximately 80% batch, 15% streaming, 5% real-time. Teams that invert this ratio — querying everything fresh on every request — pay 10× in latency and cost without proportional quality gains.

Point-in-Time Correctness: The Subtle Training Trap

The most underappreciated feature store concept for LLM applications is point-in-time correctness, and its absence is responsible for a category of model failures that are hard to diagnose.

When you fine-tune an LLM on historical examples, each training example typically includes some retrieved context: the user's profile at the time, their recent activity, the relevant documents. If you assemble this training data carelessly — fetching current user profiles for historical interactions — you've contaminated your training data with future information. The model learns from examples where the "context" includes signals that wouldn't have existed when the original interaction happened.

The result is a model that performs well on your eval set (which has the same contamination) and degrades mysteriously in production (which doesn't).

Feature stores solve this by versioning feature values with event timestamps. When building a training dataset, you reconstruct each example using the feature values that were present at the time of that interaction, not the feature values present today. The offline store maintains the complete history to make this possible.

In practice: if you're building training data for a customer support summarization model, you want to pair each support ticket with the customer profile and interaction history as it existed when that ticket was submitted, not as it exists now. That requires point-in-time retrieval, which requires an offline store with historical snapshots.
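A minimal sketch of that point-in-time lookup over the offline store's history rows (field names are the illustrative ones used above, not a specific platform's API):

```python
def feature_as_of(history: list[dict], user_id: str, feature: str, as_of_ts: float):
    """Return the newest feature value whose event_ts is <= the interaction's timestamp."""
    candidates = [r for r in history
                  if r["user_id"] == user_id
                  and r["feature"] == feature
                  and r["event_ts"] <= as_of_ts]
    if not candidates:
        return None  # the feature genuinely did not exist yet at that point in time
    return max(candidates, key=lambda r: r["event_ts"])["value"]

# Building a training example for a ticket submitted at t = 1_700_000_000:
history = [
    {"user_id": "u42", "feature": "profile_summary",
     "value": "new customer", "event_ts": 1_690_000_000},
    {"user_id": "u42", "feature": "profile_summary",
     "value": "power user, 40 tickets", "event_ts": 1_710_000_000},
]
print(feature_as_of(history, "u42", "profile_summary", as_of_ts=1_700_000_000))
# -> "new customer": the later, contaminating value is excluded.
```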

The Unintentional Feature Store (and Why Formalizing It Matters)

Here's the failure mode that motivates this post: teams build feature stores without calling them that, and then pay the cost without getting the benefit.

The pattern is recognizable. A team starts with real-time retrieval. Latency is a problem, so they add Redis caching. The cache gets stale, so they add a cron job to refresh it. The cron job and the training pipeline define the user preference aggregation differently, so their fine-tuned model sees different inputs at inference than at training time. A model update breaks because nobody versioned the feature schema. A silent API failure means 2% of requests get "user has no history" context because the upstream timeout is interpreted as an empty result — the phantom zero problem.

What they built is a feature store: offline computation, online serving, and some synchronization between them. What they're missing is the formalization: a single feature definition used by both training and serving, schema versioning, freshness monitoring, and data quality alerts.

Formalizing it means:

  • One transformation function that computes a feature, used in both the batch pipeline and the serving path
  • Schema versioning so model artifacts know which feature schema version they were trained on
  • Freshness monitoring that alerts when the streaming tier falls more than 5 minutes behind
  • Distribution monitoring that flags when online feature values diverge statistically from the offline store

This isn't about adopting a specific platform (Feast, Tecton, Hopsworks all work). It's about recognizing that you're doing feature engineering and applying the operational discipline the practice requires.
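As a sketch of the first item in that list, a single transformation function imported by both the batch pipeline and the serving path, tagged with a schema version (names and version string are illustrative assumptions):

```python
FEATURE_SCHEMA_VERSION = "user_profile_v3"  # recorded alongside every trained model artifact

def compute_preference_summary(interactions: list[dict]) -> str:
    """The single feature definition. The batch pipeline and the serving path both
    import this function, so training and inference see identical logic."""
    categories = [i["category"] for i in interactions]
    top = max(set(categories), key=categories.count) if categories else "unknown"
    return f"Top category: {top}; {len(interactions)} interactions observed."

# Batch pipeline: compute_preference_summary(historical_rows) -> offline + online stores
# Serving path:   compute_preference_summary(fresh_rows)      -> fallback on a cache miss
```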

LLM-Specific Patterns That Don't Exist in Traditional Feature Stores

The traditional feature store pattern needs extension for LLM use cases.

Context assembly as a first-class operation

Traditional feature stores serve scalar features (user_age: 34, spend_30d: 1200.50). LLM context assembly produces structured text blocks: a 500-token user profile summary, a list of relevant document chunks, a system prompt template. The "online store" for LLM applications needs to serve not just feature values but precomputed context fragments that get composed at request time.

This means the online store entry for a user might be their pre-formatted profile summary, ready to slot directly into the prompt without any additional processing. The assembly layer just concatenates: system prompt + user profile + recent interactions + retrieved docs + current query.
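A minimal sketch of that assembly layer, assuming the online-store keys used in the earlier examples:

```python
def assemble_prompt(system_prompt: str, user_id: str, retrieved_docs: list[str],
                    current_query: str, store: dict) -> str:
    """Assembly layer: concatenate precomputed fragments; no per-request summarization."""
    profile = store.get(f"profile_summary:{user_id}", "")
    recent = store.get(f"recent_activity:{user_id}", "")
    parts = [system_prompt, profile, recent, *retrieved_docs, current_query]
    return "\n\n".join(p for p in parts if p)
```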

Token budget as a constraint dimension

Traditional features have size (bytes in Redis), but this doesn't affect model behavior. LLM context fragments have token counts that directly constrain what else can fit in the context window. A feature store for LLM applications needs to track token counts per feature and implement token-budget-aware assembly: if the user profile takes 600 tokens and the context window is 8K, the assembly layer knows how many tokens remain for document retrieval.
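A sketch of token-budget-aware assembly, using a crude whitespace-based token estimate as a stand-in for a real tokenizer and assumed budget constants:

```python
CONTEXT_WINDOW = 8_000
RESERVED_FOR_OUTPUT = 1_000

def estimate_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer (e.g. tiktoken); good enough to illustrate budgeting.
    return int(len(text.split()) * 1.3)

def fit_docs_to_budget(fixed_fragments: list[str], candidate_docs: list[str]) -> list[str]:
    """Spend whatever budget remains after the precomputed fragments on retrieved docs."""
    remaining = (CONTEXT_WINDOW - RESERVED_FOR_OUTPUT
                 - sum(estimate_tokens(f) for f in fixed_fragments))
    chosen = []
    for doc in candidate_docs:  # assumed already ranked by relevance
        cost = estimate_tokens(doc)
        if cost > remaining:
            break
        chosen.append(doc)
        remaining -= cost
    return chosen
```

Storing the token count next to each precomputed fragment makes the `estimate_tokens` calls at request time unnecessary for the fixed parts.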

User embeddings as compressed context

Google's USER-LLM work demonstrates an important pattern: rather than appending thousands of tokens of raw user history to every prompt, encode the user's behavioral sequence into a compact embedding (32 tokens) that the model consumes via cross-attention. The user embedding becomes a feature in the traditional sense — a fixed-size vector representation, precomputed and stored in the online store — but it carries semantic information that would otherwise require enormous context windows.

This is the LLM equivalent of feature engineering: transforming high-dimensional raw data into a compact representation that a model can actually use.
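As a highly simplified sketch of storing such an embedding as a feature — mean-pooling stands in here for the learned sequence encoder that USER-LLM actually trains, and all names are illustrative:

```python
import numpy as np

EMBEDDING_DIM = 64

def encode_user(event_embeddings: np.ndarray) -> np.ndarray:
    """Crude stand-in for a learned encoder: compress an arbitrarily long behavioral
    sequence into one fixed-size vector."""
    return event_embeddings.mean(axis=0)

# Batch job: precompute per user and store like any other feature.
events = np.random.rand(500, EMBEDDING_DIM)                  # 500 raw behavioral events
online_store = {"user_embedding:u42": encode_user(events)}   # fixed size regardless of history length
```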

Latency and Cost: The Numbers

The practical stakes are concrete.

A typical RAG pipeline without precomputation:

  • Query embedding: 100–500ms
  • Vector search: 50–300ms
  • Source document retrieval: variable
  • Generation: 500–2000ms
  • Total: 650–2800ms, highly variable

With precomputed context (batch profile + streaming recent activity + cached embeddings):

  • Cache lookup: 0.35ms on hit
  • Vector search (cache miss): 100ms
  • Generation: 500–2000ms
  • Total: 500–2100ms, predictable

A real-time voice assistant implementation reported 316× speedup on cache hits — 0.35ms versus 110ms — with 75% cache hit rate across production traffic. For applications with a 200ms total latency budget, the difference between precomputed and real-time context assembly is the difference between working and broken.
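Using those reported figures, the expected context-lookup latency works out as follows:

```python
hit_rate, hit_ms, miss_ms = 0.75, 0.35, 110  # figures reported above
expected_lookup_ms = hit_rate * hit_ms + (1 - hit_rate) * miss_ms
print(f"{expected_lookup_ms:.1f} ms")        # ~27.8 ms average context-lookup latency
```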

Cost follows a similar pattern. Pre-embedding a 1M product catalog costs roughly $0.10 (one-time). Embedding at 100 requests per second against an API costs thousands of dollars per month. Precomputed user profile summaries at 500 tokens cost 10× less per inference request than assembling the same information from 5K tokens of raw history. At production scale, the batch computation costs are amortized across every request that consumes the feature.

When Not to Precompute

The batch tier is not for everything. Real-time retrieval stays appropriate for:

  • Highly dynamic data: Inventory counts, market prices, live sensor readings — anything where seconds-old data produces wrong outputs
  • Security-sensitive context: Authorization states or compliance data that must reflect current system state
  • Single-use features: Context that will only ever be needed for one model, where the overhead of a materialization pipeline exceeds the savings
  • Low-traffic applications: If a feature is consumed by fewer than 10K requests before it changes, the amortization math doesn't work

The goal is not to precompute everything. It's to make a deliberate architectural choice for each context source rather than defaulting to real-time retrieval because it's the path of least resistance.
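One way to make that choice deliberate is a break-even check. The dollar figures below are assumptions chosen purely for illustration (an expensive LLM-generated summary versus cheap on-demand work), not numbers from this post:

```python
# Illustrative break-even check with assumed costs.
batch_cost_per_refresh = 5.00        # $ to recompute the feature in a batch job
realtime_cost_per_request = 0.0005   # $ to compute the same thing on demand, per request

break_even_requests = batch_cost_per_refresh / realtime_cost_per_request
print(break_even_requests)  # 10000.0: below this many reads per refresh, precomputation loses
```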

The Checklist

If your LLM application retrieves context at inference time, apply this audit:

  1. Is this context the same for multiple requests? If yes, it's a batch feature candidate.
  2. Does this context change faster than hourly? If no, streaming materialization is more than sufficient.
  3. Are the same aggregations computed in your training pipeline and your serving pipeline? If no, you have a training-serving skew risk.
  4. When did you last verify that your cached context actually reflects source system state? If you don't have a monitoring answer, you have a phantom failure risk.
  5. Do you know how many tokens each context source contributes? If no, you're not managing your context window budget.

Teams that work through this list typically find that 70–80% of their real-time retrieval can be moved to a batch tier without quality loss — and the 20–30% that remains is finally getting the latency budget it needs.

The feature store pattern wasn't invented for LLMs, but it maps onto the LLM context assembly problem almost perfectly. The teams that recognize the pattern early spend their engineering time on model quality. The teams that discover it post-incident spend it untangling ad-hoc pipelines.
