The Feature Store Your Agent Reinvented Badly

· 10 min read
Tian Pan
Software Engineer

Watch a support agent handle one conversation, and count how many times it computes "churn risk." First when it triages the ticket. Again when it decides whether to offer a discount. A third time when it drafts the escalation summary. Each time, it re-reads the raw orders table, re-runs an inline aggregation, and produces a number. The three numbers don't match. Nobody notices, because they were never written down next to each other.

This is feature engineering. The agent is doing it on every turn, in prose, and doing it worse than a pipeline you would have laughed out of code review a decade ago.

The machine learning world already solved this. The solution is called a feature store, and the discipline it enforces — compute a feature once, name it, version it, serve it consistently — is exactly the discipline an agent throws away the moment you hand it a database tool. Your agent didn't avoid building a feature pipeline. It built one. It just built the worst one in the building.

The agent is a feature pipeline, and you didn't review it

A feature is a derived fact: account age, plan tier, lifetime value, days since last login, churn risk. Most of these do not exist in your database as a column you can SELECT. They are the output of logic (a join, a window function, a threshold, a model score) applied to raw rows. The job of turning rows into features is the oldest unglamorous chore in applied ML, and the feature store exists because teams kept doing it inconsistently.

An agent that answers questions about customers does this same job. It has to. When a user asks "should this customer get a refund," the model cannot reason from orders and events tables directly; it needs the derived facts. So somewhere in the agent's loop — inside a tool call, inside the model's chain of thought, inside a SQL string the model wrote at runtime — the derivation happens.

The difference is that your data team's feature pipeline went through review. Someone argued about whether "active" means thirty days or seven. Someone wrote the join. Someone checked it into version control. The agent's feature pipeline went through none of that. It was authored at inference time, by a model, in a language with no type system, and it is re-authored every single turn.

That's the frame to hold onto: the agent is not querying features, it is manufacturing them, live, with no factory.

Four guarantees you silently gave up

A feature store is not a database. It is a set of guarantees about derived data. Agents that compute features inline break all four.

Caching. A feature store splits storage into an offline store (full history, for training and analysis) and an online store (a low-latency key-value store holding the latest values, for live serving). The point of the online store is that you compute "churn risk" once per refresh and read it cheaply until the next one. An agent computing churn risk inline has no such cache. It pays the full cost (tokens, latency, database load) every time the question comes up, and it comes up several times per conversation. You are running the most expensive possible feature pipeline, one priced in LLM tokens, and running it redundantly.
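To make the caching point concrete, here is a minimal sketch of a per-conversation cache. It is not how any particular feature store implements its online store; `ConversationFeatureCache` and `expensive_churn_risk` are hypothetical names, and the constant score stands in for a real aggregation.

```python
from typing import Callable, Dict, Tuple


class ConversationFeatureCache:
    """Compute each (customer_id, feature_name) pair at most once per conversation."""

    def __init__(self, compute: Callable[[str, str], float]) -> None:
        self._compute = compute
        self._values: Dict[Tuple[str, str], float] = {}
        self.compute_calls = 0  # counts how often the expensive path actually ran

    def get(self, customer_id: str, feature_name: str) -> float:
        key = (customer_id, feature_name)
        if key not in self._values:
            self.compute_calls += 1
            self._values[key] = self._compute(customer_id, feature_name)
        return self._values[key]


def expensive_churn_risk(customer_id: str, feature_name: str) -> float:
    # Hypothetical stand-in for the real aggregation over the orders table.
    return 0.42


cache = ConversationFeatureCache(expensive_churn_risk)
triage = cache.get("cust_1", "churn_risk")    # computed here
discount = cache.get("cust_1", "churn_risk")  # served from the cache
summary = cache.get("cust_1", "churn_risk")   # served from the cache
```

All three reads return the same value, and the expensive computation runs once instead of three times.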

Point-in-time correctness. This is the guarantee feature stores work hardest for. If you ask "what was this customer's plan tier when they filed the ticket three weeks ago," the honest answer requires the value as of that timestamp, not today's value. Feature stores implement this with point-in-time joins — sometimes called time travel — specifically so training data doesn't leak future information into the past. An agent reading the current customers row has no concept of "as of." It will cheerfully tell you the customer was on the Enterprise plan when the ticket was filed, because they are on Enterprise now. They were on Free then. The agent's answer is not a small error; it is a category error about which point in time it is even talking about.
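An "as of" lookup can be sketched in a few lines. This assumes a hypothetical plan-change history stored as sorted `(effective_from, plan_tier)` pairs; the dates and the `plan_tier_as_of` helper are illustrative, and a real feature store would do the equivalent join inside its own engine.

```python
from bisect import bisect_right
from datetime import datetime
from typing import List, Optional, Tuple

# Hypothetical plan-change history: (effective_from, plan_tier), sorted by time.
PlanHistory = List[Tuple[datetime, str]]


def plan_tier_as_of(history: PlanHistory, as_of: datetime) -> Optional[str]:
    """Return the tier in effect at `as_of`, or None if no record exists yet."""
    timestamps = [ts for ts, _ in history]
    # Index of the first change strictly after `as_of`.
    idx = bisect_right(timestamps, as_of)
    if idx == 0:
        return None
    return history[idx - 1][1]


history: PlanHistory = [
    (datetime(2023, 1, 10), "Free"),
    (datetime(2024, 6, 1), "Enterprise"),
]
ticket_filed_at = datetime(2024, 5, 25)
now = datetime(2024, 6, 15)

tier_at_ticket = plan_tier_as_of(history, ticket_filed_at)
tier_today = plan_tier_as_of(history, now)
```

Reading only the current row is the same as always passing `now`, which is exactly the mistake described above.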

Shared definitions. In a feature store, "active user" is defined once, in one place, with an owner. Every model and every dashboard that reads user_active_30d gets the same logic. An agent with a database tool has no shared definition. Tool A's prompt says active means a login in the last thirty days. Tool B's prompt, written by a different engineer three sprints later, says active means any event in the last seven. The agent calls both, gets two answers, and resolves the contradiction in prose — usually by silently picking one, occasionally by averaging them, which is meaningless. The feature has no canonical value because it has no canonical definition.
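A shared definition can be as small as one function that every tool imports. In this sketch the thirty-day window is an assumed choice, and `triage_tool` and `discount_tool` are hypothetical consumers. The point is only that the window is declared once.

```python
from datetime import datetime, timedelta
from typing import Iterable

ACTIVE_WINDOW_DAYS = 30  # assumed value; declared once, owned by one team


def user_active_30d(login_times: Iterable[datetime], as_of: datetime) -> bool:
    """The single definition of 'active': at least one login in the window."""
    window_start = as_of - timedelta(days=ACTIVE_WINDOW_DAYS)
    return any(window_start <= t <= as_of for t in login_times)


def triage_tool(login_times: Iterable[datetime], as_of: datetime) -> str:
    # Hypothetical tool A: labels the customer for triage.
    return "active" if user_active_30d(login_times, as_of) else "lapsed"


def discount_tool(login_times: Iterable[datetime], as_of: datetime) -> bool:
    # Hypothetical tool B: offers a win-back discount only to inactive users.
    return not user_active_30d(login_times, as_of)
```

Because both tools call the same function, they cannot disagree about who is active.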

Determinism. A feature pipeline is code. Run it twice on the same input, get the same output. An agent's inline derivation lives in the model's reasoning, which means it is sensitive to phrasing, temperature, conversation history, and whatever else is competing for attention in the context window. Ask "is this a high-value customer" two ways and you can get two thresholds. The derivation logic is not pinned down anywhere, so it is not stable anywhere.

"Just give it a database tool" makes it worse, not better

The standard reaction to all of this is: fine, give the agent a run_sql tool and let it query the warehouse directly. This feels like progress. It is the opposite.

A run_sql tool relocates the feature logic into a string the model writes at runtime. That string is your feature definition now. And it is the worst possible place to keep one. It is not in version control. It was not reviewed. It is not tested. It is regenerated, slightly differently, on every call — different column aliases, a different WHERE clause, a JOIN that quietly drops the customers with null regions this time but not last time. The boundary of the feature — what counts as churn, what window defines "recent" — is being redrawn per invocation by a process you cannot inspect.

You have not given the agent access to your features. You have asked it to reinvent them, in SQL, over and over, and you have made the reinvention invisible by burying it in a tool-call argument that nobody reads.

This is why agents quietly become the worst feature pipeline in the company. They combine the highest cost per derivation (you are paying a frontier model to write aggregation logic), the lowest reproducibility (the logic changes every run), and the widest blast radius (the same untested derivation feeds triage, pricing, and the message the customer actually receives). A bad feature pipeline that at least failed loudly would be better. This one fails quietly, with a confident paragraph.

Serve the agent features, the same way you serve a model

The fix is not exotic. It is to treat the agent like any other feature consumer.

A model in production does not write SQL to figure out a customer's lifetime value. It calls the online store and gets customer_ltv — a named feature, with a definition, an owner, a version, and a guaranteed-consistent value. The agent should do exactly the same thing.

Concretely, that means the agent's tool is not run_sql(query). It is get_customer_features(customer_id), and it returns a typed struct of named features: account_age_days, plan_tier, ltv, churn_risk_score, tickets_last_90d, days_since_last_login. Each of those names resolves to a single definition that lives in code, was reviewed, and is computed by one pipeline. The agent consumes features. It does not author them.
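A minimal sketch of that tool surface follows. The field names come from the list above, but the field types, the in-memory `_ONLINE_STORE` dictionary, and the sample values are assumptions for illustration. A real deployment would read from whatever online store you run.

```python
from dataclasses import asdict, dataclass
from typing import Dict


@dataclass(frozen=True)
class CustomerFeatures:
    """Typed feature vector handed to the agent; it cannot be mutated."""
    account_age_days: int
    plan_tier: str
    ltv: float
    churn_risk_score: float
    tickets_last_90d: int
    days_since_last_login: int


# Hypothetical stand-in for the online store, keyed by customer_id.
_ONLINE_STORE: Dict[str, CustomerFeatures] = {
    "cust_1": CustomerFeatures(
        account_age_days=512,
        plan_tier="Free",
        ltv=1830.0,
        churn_risk_score=0.42,
        tickets_last_90d=3,
        days_since_last_login=11,
    ),
}


def get_customer_features(customer_id: str) -> CustomerFeatures:
    """The agent's only entry point: look up features, never derive them."""
    try:
        return _ONLINE_STORE[customer_id]
    except KeyError:
        raise LookupError(f"no features materialized for {customer_id!r}") from None


features = get_customer_features("cust_1")
tool_result = asdict(features)  # plain dict, ready to serialize as a tool result
```

A missing customer raises an error rather than returning a guess, so the failure is loud instead of a confident paragraph.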

This buys back all four guarantees at once:

  • The online store caches the values, so repeated reads within a conversation are cheap and identical.
  • The store does the point-in-time join, so "as of the ticket date" is a parameter, not an accident.
  • The definition is shared, so every tool and every turn sees the same churn_risk_score.
  • The computation is deterministic code, so it can be unit-tested, and a change to it is a reviewable diff instead of a buried prompt edit.

There is a useful side effect for compliance, too. When you serve features through a store, you can snapshot the exact feature vector the agent saw at decision time and attach it to the audit log. "Why did the agent deny the refund" becomes answerable: here are the named features it was given, with values and timestamps. Compare that to reconstructing a decision from a SQL string the model improvised.
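One way to capture that snapshot is a single serialized record per decision. The record layout, the `feature_version` field, and the `build_audit_record` helper are all assumptions in this sketch, not a standard audit format.

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict


def build_audit_record(
    decision: str,
    customer_id: str,
    features: Dict[str, Any],
    as_of: datetime,
    feature_version: str,
) -> str:
    """Serialize the decision together with the exact features behind it."""
    record = {
        "decision": decision,
        "customer_id": customer_id,
        "features_as_of": as_of.isoformat(),
        "feature_version": feature_version,  # hypothetical definition version tag
        "features": dict(features),          # copy, so later changes don't leak in
    }
    # sort_keys keeps the output stable, which makes records easy to diff.
    return json.dumps(record, sort_keys=True)


audit_line = build_audit_record(
    decision="deny_refund",
    customer_id="cust_1",
    features={"plan_tier": "Free", "churn_risk_score": 0.42},
    as_of=datetime(2024, 5, 25, tzinfo=timezone.utc),
    feature_version="v3",
)
```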

Where to draw the line

None of this means the agent should never touch raw data. The whole reason you built an agent was to handle open-ended questions, and a feature store only knows the features you defined in advance. An analyst asking a genuinely novel, one-off question still wants the agent to explore the warehouse directly.

The line is about recurrence and consequence. A fact deserves to be a named feature when either of two things is true: it gets asked for on most turns, or it feeds a decision with real consequences — eligibility, pricing, a refund, an escalation. Recurrence means caching pays off. Consequence means you cannot afford the derivation to be non-deterministic and untested.

A practical way to find these: take a week of agent traces and look for repeated aggregations. The same GROUP BY over the orders table, the same ninety-day window, the same churn heuristic, showing up in hundreds of tool calls with slightly different SQL each time — that is a feature begging to be promoted. Give it a name, give it an owner, give it one definition, and put it behind the typed tool. Leave the genuine long tail of novel questions to direct queries.
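A rough version of that trace audit can be sketched as follows. It assumes you can extract the SQL strings from your tool-call logs. The normalization is deliberately crude: it folds case, whitespace, and literals, but it will not catch differing aliases or reordered clauses. `normalize_sql` and `repeated_aggregations` are hypothetical helpers.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple


def normalize_sql(sql: str) -> str:
    """Reduce a query to a rough 'shape' so near-duplicates collide."""
    text = sql.strip().rstrip(";").lower()
    text = re.sub(r"'[^']*'", "?", text)          # string literals -> ?
    text = re.sub(r"\b\d+(\.\d+)?\b", "?", text)  # numeric literals -> ?
    text = re.sub(r"\s+", " ", text)              # collapse whitespace
    return text


def repeated_aggregations(
    queries: Iterable[str], min_count: int = 2
) -> List[Tuple[str, int]]:
    """Return aggregation shapes seen at least `min_count` times, most common first."""
    counts = Counter(normalize_sql(q) for q in queries)
    return [
        (shape, n)
        for shape, n in counts.most_common()
        if n >= min_count and "group by" in shape
    ]


traces = [
    "SELECT customer_id, COUNT(*) FROM orders "
    "WHERE created_at > now() - interval '90 days' GROUP BY customer_id",
    "select customer_id, count(*)\nfrom orders\n"
    "where created_at > now() - interval '90 days'\ngroup by customer_id;",
    "SELECT * FROM tickets WHERE id = 17",
]
candidates = repeated_aggregations(traces)
```

Each shape that survives the filter is a candidate feature: something the agent keeps re-deriving that could have one name and one definition.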

The migration is incremental. You do not need a feature store product on day one; you need a module of named, tested feature functions and one tool that calls them. The discipline matters more than the vendor.
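As a sketch of what that day-one module might look like, the registry below is plain Python with no feature store product behind it. The `feature` decorator, the `FEATURES` registry, and the shape of the `customer` record are assumptions for illustration.

```python
from datetime import date
from typing import Any, Callable, Dict

FeatureFn = Callable[[Dict[str, Any], date], Any]

# One registry: each feature name maps to exactly one reviewed function.
FEATURES: Dict[str, FeatureFn] = {}


def feature(name: str) -> Callable[[FeatureFn], FeatureFn]:
    """Register a feature function under a unique name."""
    def register(fn: FeatureFn) -> FeatureFn:
        if name in FEATURES:
            raise ValueError(f"feature {name!r} is already defined")
        FEATURES[name] = fn
        return fn
    return register


@feature("account_age_days")
def account_age_days(customer: Dict[str, Any], as_of: date) -> int:
    return (as_of - customer["signup_date"]).days


@feature("days_since_last_login")
def days_since_last_login(customer: Dict[str, Any], as_of: date) -> int:
    return (as_of - customer["last_login_date"]).days


def get_customer_features(customer: Dict[str, Any], as_of: date) -> Dict[str, Any]:
    """The one tool the agent calls: every registered feature, as of one date."""
    return {name: fn(customer, as_of) for name, fn in FEATURES.items()}


customer = {"signup_date": date(2024, 1, 1), "last_login_date": date(2024, 3, 1)}
result = get_customer_features(customer, as_of=date(2024, 3, 31))
```

Each feature is an ordinary function, so it can be unit-tested, and registering a duplicate name fails at import time rather than silently creating a second definition.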

The cheap data lesson the agent era keeps re-learning

Every wave of applied ML rediscovers the same thing the hard way: the model is rarely the problem, and the data plumbing always is. Feature stores exist because a decade of teams shipped models that disagreed with their own dashboards, leaked the future into their training sets, and burned engineering months maintaining four pipelines for one number.

Agents are walking straight back into that swamp, and the prose interface is hiding the splash. An inline aggregation inside a tool call does not look like an unreviewed feature pipeline. It looks like the agent being helpful. But a derived fact is a derived fact, whether it is computed by a Spark job or improvised by a language model mid-sentence — and only one of those two can be named, versioned, tested, and trusted.

Your agent already built a feature store. The only open question is whether you are going to keep letting it build a bad one on every turn, or hand it a good one and let it get back to the job you actually hired it for.
