The Feature Store Pattern for LLM Applications: Stop Retrieving What You Could Precompute
Most teams building LLM applications eventually converge on the same ad-hoc architecture: a scatter of cron jobs computing user summaries, a vector database queried fresh on every request, a Redis cache added when latency got embarrassing, and three different codebases that all define "user preference" slightly differently. Only later, usually after a production incident, do they recognize what they built: a feature store — a bad one, assembled accidentally.
The feature store is one of the most battle-tested patterns in traditional ML infrastructure. Applied deliberately to LLM context assembly, it addresses the latency, cost, and consistency problems that plague most retrieval pipelines. This post explains how.
What Feature Stores Actually Solve (and Why It Maps Directly to LLM Problems)
A feature store is infrastructure that separates feature computation from feature serving. It has two halves:
- Offline store: Historical feature values in columnar format (Parquet, BigQuery). Used to build training datasets with point-in-time correctness — i.e., you can reconstruct exactly which feature values were available at a given past timestamp, without contaminating training data with future information.
- Online store: Current feature values in a low-latency key-value store (Redis, DynamoDB, RonDB). Optimized for key lookups at inference time, typically in single-digit milliseconds or less depending on the backend.
The core insight is that features are computed once and served many times. In traditional ML, this might be a user's 30-day purchase history, precomputed nightly and materialized to both stores. At prediction time, you fetch the row from the online store rather than running the aggregation on the fly.
LLM applications have an identical problem with different terminology. Instead of "features for a classification model," you're assembling "context for a generation model." The constraint is the same: context window space is finite and expensive, retrieval has latency, and running the same aggregations per request is wasteful. The solution is the same too: precompute the stable parts, retrieve only what changes.
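The two-store split can be sketched in a few lines. This is a minimal illustration, not a real feature store: both stores are in-memory stand-ins, and the names `materialize`, `serve`, and `purchases_30d` are hypothetical.

```python
from collections import defaultdict

# Both stores are in-memory stand-ins (assumptions for illustration):
# the offline store would be Parquet/BigQuery, the online store Redis/DynamoDB.
offline_store = defaultdict(list)  # (entity_id, feature) -> [(event_ts, value), ...]
online_store = {}                  # (entity_id, feature) -> latest value


def materialize(entity_id, feature, value, event_ts):
    """Write one computed feature value to both stores.

    Assumes calls arrive in event_ts order, so the last write is the latest.
    """
    key = (entity_id, feature)
    offline_store[key].append((event_ts, value))  # keep the full history
    online_store[key] = value                     # keep only the current value


def serve(entity_id, feature):
    """Request path: a single key lookup, no aggregation."""
    return online_store.get((entity_id, feature))


# A nightly job computes the aggregate once per run...
materialize("user_1", "purchases_30d", 4, event_ts=1_700_000_000)
materialize("user_1", "purchases_30d", 6, event_ts=1_700_086_400)

# ...and every request afterwards just reads the current value.
latest = serve("user_1", "purchases_30d")
```

The offline list keeps every value with its timestamp, while the online dict holds only what a request needs right now.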
The Three-Tier Freshness Model
Not all context has the same staleness tolerance. Applying feature store thinking to LLM context reveals a natural three-tier architecture:
Batch tier (daily to weekly)
This is context that changes slowly and benefits from expensive computation that you don't want to redo per request:
- User preference summaries distilled from months of activity
- Entity profiles (company information, product attributes, article summaries)
- Knowledge base embeddings indexed into a vector store
- Fine-tuning datasets built from historical interaction patterns
These get computed by batch jobs (Spark, dbt, scheduled SQL) and materialized to the offline store, then synced to the online store. At request time, you fetch a pre-built user profile — 500 tokens — instead of retrieving 50 raw interactions and summarizing them live.
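As a rough sketch of the batch tier, the job below collapses raw interactions into a short profile string per user. The event shape and the function names are assumptions, and a real pipeline would likely call an LLM summarizer where this sketch counts categories; the counting is only a deterministic stand-in so the example stays runnable.

```python
from collections import Counter


def build_profile(interactions, max_categories=3):
    """Collapse raw interactions into a short, prompt-ready profile string.

    Counting categories is a stand-in for a real (possibly LLM-based) summarizer.
    """
    counts = Counter(item["category"] for item in interactions)
    top = [name for name, _ in counts.most_common(max_categories)]
    return f"Interactions: {len(interactions)}. Top interests: {', '.join(top)}."


def run_batch_job(interactions_by_user):
    """Scheduled job: one profile per user, ready to sync to the online store."""
    return {
        user_id: build_profile(rows)
        for user_id, rows in interactions_by_user.items()
    }


# Hypothetical raw data; in practice this would come from a warehouse table.
raw = {
    "user_1": [
        {"category": "hiking"},
        {"category": "hiking"},
        {"category": "cameras"},
    ],
}
profiles = run_batch_job(raw)
```

At request time the application reads `profiles["user_1"]` from the online store instead of loading and summarizing the raw rows.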
Streaming tier (seconds to minutes)
Context that changes fast enough that daily computation is stale, but not so fast that you need real-time queries:
- Recent session activity (last 10 interactions)
- Active document edit history
- Real-time behavioral signals (currently browsing category X)
This flows through a streaming pipeline (Kafka + Flink, or Spark Structured Streaming) and writes continuously to the online store. The feature is always fresh enough without being queried from source systems on every request.
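The streaming tier can be approximated with a bounded window per user. In this sketch a plain loop stands in for the Kafka or Flink consumer, a dict stands in for the online store, and the event shape is hypothetical.

```python
from collections import defaultdict, deque

WINDOW = 10  # "last 10 interactions", as in the list above

# Stand-in for the online store; deque(maxlen=...) drops the oldest entry automatically.
recent_activity = defaultdict(lambda: deque(maxlen=WINDOW))


def on_event(event):
    """Called once per event by the stream consumer (hypothetical event shape)."""
    recent_activity[event["user_id"]].append(event["action"])


def get_recent(user_id):
    """Request path: read the window that the stream already maintains."""
    return list(recent_activity.get(user_id, ()))


# Simulate 12 events arriving on the stream for one user.
for i in range(12):
    on_event({"user_id": "user_1", "action": f"view_item_{i}"})

window = get_recent("user_1")
```

The request never touches the source system: it reads whatever the stream last wrote, which here is the ten most recent actions.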
Real-time tier (on-demand)
Context that must reflect the exact current state:
- Inventory availability for a purchase recommendation
- Compliance-sensitive data that cannot be cached
- The current user message itself
This tier is small by design. The failure mode is putting everything here: each additional sequential source-system call can add on the order of 50–300ms of latency. A conversational application with a 200ms budget for context assembly can afford one or two such lookups at most, and only if they sit at the fast end of that range or run in parallel.
A reasonable starting distribution for many production systems is roughly 80% batch, 15% streaming, 5% real-time. Teams that invert this ratio by querying everything fresh on every request can pay an order of magnitude more in latency and cost without proportional quality gains.
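One way to keep the real-time tier inside a budget is to run its lookups concurrently and drop whatever misses the deadline. The sketch below uses only the standard library (Python 3.9+ for `cancel_futures`); `fetch_realtime`, `inventory`, and `slow_pricing` are hypothetical names, and the sleeps simulate source-system latency.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def fetch_realtime(lookups, budget_s):
    """Run real-time lookups concurrently; keep only those that finish in budget.

    `lookups` maps a context name to a zero-argument callable (hypothetical
    wrappers around source-system calls).
    """
    pool = ThreadPoolExecutor(max_workers=max(1, len(lookups)))
    futures = {pool.submit(fn): name for name, fn in lookups.items()}
    done, _ = wait(futures, timeout=budget_s)
    # Do not block the request on stragglers; they finish in the background.
    pool.shutdown(wait=False, cancel_futures=True)
    results = {futures[f]: f.result() for f in done if f.exception() is None}
    missing = sorted(name for name in futures.values() if name not in results)
    return results, missing


def inventory():
    time.sleep(0.01)  # simulated fast source system
    return {"sku_42": 3}


def slow_pricing():
    time.sleep(0.5)  # simulated slow source system, exceeds the budget
    return {"sku_42": 19.99}


results, missing = fetch_realtime(
    {"inventory": inventory, "pricing": slow_pricing}, budget_s=0.2
)
```

Lookups that miss the deadline are reported in `missing` rather than silently returned as empty values, so the caller can decide how to word the prompt.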
Point-in-Time Correctness: The Subtle Training Trap
The most underappreciated feature store concept in LLM applications is point-in-time correctness, and its absence is responsible for a category of model failures that are hard to diagnose.
When you fine-tune an LLM on historical examples, each training example typically includes some retrieved context: the user's profile at the time, their recent activity, the relevant documents. If you assemble this training data carelessly — fetching current user profiles for historical interactions — you've contaminated your training data with future information. The model learns from examples where the "context" includes signals that wouldn't have existed when the original interaction happened.
The result is a model that performs well on your eval set (which has the same contamination) and degrades mysteriously in production (which doesn't).
Feature stores solve this by versioning feature values with event timestamps. When building a training dataset, you reconstruct each example using the feature values that were present at the time of that interaction, not the feature values present today. The offline store maintains the complete history to make this possible.
In practice: if you're building training data for a customer support summarization model, you want to pair each support ticket with the customer profile and interaction history as it existed when that ticket was submitted, not as it exists now. That requires point-in-time retrieval, which requires an offline store with historical snapshots.
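A point-in-time lookup can be sketched with a sorted history and a binary search. The data, timestamps, and the `as_of` helper below are made up for illustration; production feature stores expose the same idea as a point-in-time join over the offline store.

```python
from bisect import bisect_right

# Offline-store history for one feature: (event_ts, value), sorted by event_ts.
# Timestamps and values are invented for illustration.
profile_history = {
    "cust_1": [(100, "free tier"), (200, "pro tier"), (300, "enterprise tier")],
}


def as_of(history, entity_id, ts):
    """Return the latest value with event_ts <= ts, or None if none existed yet."""
    rows = history.get(entity_id, [])
    timestamps = [event_ts for event_ts, _ in rows]
    idx = bisect_right(timestamps, ts)
    return rows[idx - 1][1] if idx else None


# Hypothetical support tickets: (ticket_id, customer_id, submitted_at).
tickets = [("t1", "cust_1", 150), ("t2", "cust_1", 250), ("t3", "cust_1", 50)]

# Pair each ticket with the profile as it existed when the ticket was submitted.
training_rows = [
    (ticket_id, as_of(profile_history, customer_id, submitted_at))
    for ticket_id, customer_id, submitted_at in tickets
]
```

Fetching the current profile instead would label all three tickets "enterprise tier", which is exactly the leakage described above.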
The Unintentional Feature Store (and Why Formalizing It Matters)
Here's the failure mode from the introduction: teams build feature stores without calling them that, and then pay the cost without getting the benefit.
The pattern is recognizable. A team starts with real-time retrieval. Latency is a problem, so they add Redis caching. The cache gets stale, so they add a cron job to refresh it. The cron job and the training pipeline define the user preference aggregation differently, so their fine-tuned model sees different inputs at inference than at training time. A model update breaks because nobody versioned the feature schema. A silent API failure means 2% of requests get "user has no history" context because the upstream timeout is interpreted as an empty result — the phantom zero problem.
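The phantom zero problem comes from collapsing "the upstream said there is nothing" and "the upstream did not answer" into the same empty value. A minimal sketch of keeping them apart follows; `Status`, `FeatureResult`, and the `client` callable are hypothetical names, and the prompt strings are placeholders.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    OK = "ok"
    EMPTY = "empty"              # upstream answered: the user really has no history
    UNAVAILABLE = "unavailable"  # upstream failed or timed out: we do not know


@dataclass
class FeatureResult:
    status: Status
    value: list


def fetch_history(user_id, client):
    """`client` is a hypothetical callable that returns a list or raises TimeoutError."""
    try:
        rows = client(user_id)
    except TimeoutError:
        return FeatureResult(Status.UNAVAILABLE, [])
    return FeatureResult(Status.OK if rows else Status.EMPTY, rows)


def render_context(result):
    """Turn the result into prompt text without conflating failure and emptiness."""
    if result.status is Status.UNAVAILABLE:
        return "User history could not be loaded."
    if result.status is Status.EMPTY:
        return "User has no history."
    return "Recent history: " + "; ".join(result.value)


def timing_out_client(user_id):
    raise TimeoutError("simulated upstream timeout")


context_line = render_context(fetch_history("user_1", timing_out_client))
```

With the status carried explicitly, the 2% of requests that hit a timeout can be counted and alerted on instead of quietly telling the model the user is new.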
What they built is a feature store: offline computation, online serving, and some synchronization between them. What they're missing is the formalization: a single feature definition used by both training and serving, schema versioning, freshness monitoring, and data quality alerts.
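The formalization can start small. The sketch below shows one possible shape, assuming a single definition object that carries the transformation, a schema version, and a freshness limit; every name in it (`FeatureDefinition`, `write_online`, `read_online`, `top_category`) is hypothetical and not taken from any particular feature store product.

```python
import time
from collections import Counter
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class FeatureDefinition:
    name: str
    version: int
    max_age_s: float                   # freshness limit checked on every read
    compute: Callable[[list], object]  # the one transformation both paths share


def compute_top_category(events):
    """Most frequent category, or None for an empty list."""
    return Counter(events).most_common(1)[0][0] if events else None


top_category = FeatureDefinition(
    name="top_category", version=2, max_age_s=86_400, compute=compute_top_category
)


def training_value(defn, events):
    """Training path: calls the same compute function as serving."""
    return defn.compute(events)


def write_online(store, defn, entity_id, events, now=None):
    """Serving path: materialize the value with its version and write time."""
    store[(entity_id, defn.name)] = {
        "value": defn.compute(events),
        "version": defn.version,
        "written_at": time.time() if now is None else now,
    }


def read_online(store, defn, entity_id, now=None):
    """Refuse values written under another schema version or past the freshness limit."""
    now = time.time() if now is None else now
    row = store[(entity_id, defn.name)]
    if row["version"] != defn.version:
        raise ValueError(f"{defn.name}: schema version mismatch")
    if now - row["written_at"] > defn.max_age_s:
        raise ValueError(f"{defn.name}: value is stale")
    return row["value"]


store = {}
events = ["hiking", "cameras", "hiking"]
write_online(store, top_category, "user_1", events, now=1_000.0)
served = read_online(store, top_category, "user_1", now=1_010.0)
```

Because training and serving both go through `defn.compute`, the two paths cannot drift apart in how they define the feature, and a stale or mis-versioned row fails loudly instead of reaching the prompt.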
