The Prompt Log Is the Product Roadmap You Threw Away

9 min read
Tian Pan
Software Engineer

Somewhere in your observability stack is a table that holds every prompt a user typed into your AI feature last quarter. If your team is like most, that table is used for three things: cost attribution, abuse detection, and the occasional debugging session when a customer reports a bad answer. Nobody on the product team has ever opened it. Nobody on the research team has clustered it. The PM running the AI roadmap has never read a single row.

This is the most expensive oversight in your product organization. The prompts your users typed — especially the ones your feature handled badly — are the highest-resolution form of "what users wish this product did" you will ever collect. You are paying inference costs to generate this signal in real time, and you are throwing it away because nobody decided whose job it was to read it.

What the prompt log actually contains

User research surveys ask people what they want. Support tickets capture what broke. Analytics dashboards report what users clicked. Each of these signals is filtered through a layer of abstraction — the user's memory, their willingness to file a ticket, the click options the product offered them in the first place.

Prompts skip all of that. A prompt is a user, in their own words, in the moment of need, telling your product exactly what they want it to do. They did not choose from a dropdown. They were not prompted by a survey question. They typed a sentence describing a job they wanted done, and your feature either did it, did it badly, or refused.

The prompts where your feature did it badly are the gold. They are recorded evidence that someone reached for your product to solve a specific problem and your product was the wrong shape for that problem. Aggregated across thousands of users over weeks, those prompts cluster into intents. Each cluster above a recurrence threshold is a feature request that has already been validated by real demand. You did not need to run a discovery sprint; the demand walked in and typed itself into your log.

Why most teams discard the signal

If this signal is so obviously valuable, the natural question is why almost no team mines it systematically. Three reasons recur.

The first is ownership. Prompt logs sit in a data warehouse owned by the platform team, are surfaced in an observability tool owned by the ML team, and are governed by privacy policies owned by legal. The product manager who would benefit from reading them has neither query access nor a job description that includes the activity. There is no quarterly OKR for "spend two days a month reading what users typed." So nobody does.

The second is shape. A raw prompt log is messy. The same intent shows up as a hundred different phrasings. Half the entries are people testing the system or asking trivial questions. Aggregating prompts into actionable clusters requires embedding them, clustering them, summarizing each cluster, and filtering by frequency and recency. None of this is hard, but it is a small data project that needs to be staffed, and the team that would staff it does not see it as their work.

The third is interpretation. When teams do look at prompt failures, they file them under "model quality" — a thing for the ML or eval team to fix by adjusting the prompt or fine-tuning the model. But many of these failures are not model failures. They are scope failures. The user asked the AI to do something the AI was not built to do, and no amount of prompt engineering or model upgrade will make the feature do it. The right response is not a retraining run; it is a product decision about whether to expand scope. That decision belongs to the PM, who is not looking at the log.

Distinguishing scope failures from quality failures

If you are going to mine prompts for product signal, the first analytical move is to separate failures into two buckets. Quality failures are prompts where the user asked for something the product was supposed to do, and the AI did it wrong — wrong answer, wrong format, missing a step, hallucinated a fact. These belong to the eval and prompt team.

Scope failures are prompts where the user asked for something the product was not supposed to do. The AI may have refused, given a generic apology, redirected to documentation, or done the requested thing badly because it lacked the tools or data. These belong to the PM. They are feature requests in disguise.

The tell for scope failures is usually a request shape that does not match any of your product's documented capabilities. A coding assistant built for refactoring suddenly getting asked to scaffold new projects. A summarization tool getting asked to translate. A support bot getting asked for invoice copies. Each of these clusters is a customer telling you, in aggregate, that they assumed your product covered an adjacent job. Sometimes the right call is to clarify scope. More often it is to expand it.

A practical clustering pipeline produces a weekly report with two columns: top quality-failure clusters and top scope-failure clusters. The first goes to the team improving the model. The second goes to the PM, who reads it like a feature backlog written by customers.
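The two-column report can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it assumes every failed turn already carries an intent label from an upstream clustering step, and that the product keeps an explicit list of documented capabilities. `DOCUMENTED_CAPABILITIES`, `FailedTurn`, `triage`, and `weekly_report` are hypothetical names.

```python
from collections import Counter
from typing import NamedTuple

# Hypothetical list of jobs the product is documented to do.
DOCUMENTED_CAPABILITIES = {"refactor_code", "explain_code"}


class FailedTurn(NamedTuple):
    prompt: str
    intent: str  # assumed to come from an upstream cluster-labeling step


def triage(turn):
    """Return 'quality' when the intent is in scope, otherwise 'scope'."""
    return "quality" if turn.intent in DOCUMENTED_CAPABILITIES else "scope"


def weekly_report(turns, top_n=10):
    """Build the two-column report: top intents per bucket, with counts."""
    counts = {"quality": Counter(), "scope": Counter()}
    for turn in turns:
        counts[triage(turn)][turn.intent] += 1
    return {bucket: c.most_common(top_n) for bucket, c in counts.items()}


turns = [
    FailedTurn("rename this variable everywhere", "refactor_code"),
    FailedTurn("set up a new Flask project", "scaffold_project"),
    FailedTurn("create a starter repo for me", "scaffold_project"),
]
report = weekly_report(turns)
print(report)
```

In practice the capability list is the contentious part. Writing it down forces the team to say what the feature is supposed to do, which is the decision the scope bucket exists to inform.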

The mining methodology

The actual technical work to build this pipeline is well-trodden ground. Embed each user turn with a general-purpose embedding model. Run a clustering algorithm such as HDBSCAN, which handles variable density and labels outliers as noise instead of forcing them into a cluster. For each cluster, sample representative prompts and ask an LLM to write a one-sentence intent summary. Filter clusters by size, by recurrence over time, and by the outcome signal of the original turn: did the user thumb-down the response, abandon the session, or rephrase the question?
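To make the embed-then-cluster shape concrete, here is a dependency-free sketch. It deliberately swaps in toy stand-ins: a bag-of-words count vector in place of a real embedding model, and a greedy similarity-threshold grouping in place of HDBSCAN. The LLM summarization step is left out. The function names, the `threshold`, and the `min_size` values are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(prompts, threshold=0.5, min_size=2):
    """Greedy stand-in for HDBSCAN: each prompt joins the first cluster
    whose seed vector is similar enough, else it starts a new cluster.
    Clusters smaller than min_size are returned separately as noise."""
    clusters = []  # list of (seed_vector, member_prompts)
    for prompt in prompts:
        vec = embed(prompt)
        for seed, members in clusters:
            if cosine(seed, vec) >= threshold:
                members.append(prompt)
                break
        else:
            clusters.append((vec, [prompt]))
    kept = [m for _, m in clusters if len(m) >= min_size]
    noise = [p for _, m in clusters if len(m) < min_size for p in m]
    return kept, noise


prompts = [
    "translate this summary to spanish",
    "translate this summary to french",
    "translate the summary to german",
    "what is the weather",
]
kept, noise = cluster(prompts)
print(kept)
print(noise)
```

A real pipeline would replace `embed` with calls to an embedding model and `cluster` with a density-based algorithm. The surrounding structure (vectors in, groups plus a noise pile out) stays the same.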

The clusters with high frequency, recurrence over multiple weeks, and negative outcome signals are your candidate feature requests. The shortlist for a single quarter rarely exceeds twenty clusters. Each gets a name, a representative quote, a count, and an estimate of whether it is solvable inside the current product scope.
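The shortlist filter described above might look like the following sketch. It assumes each logged turn has already been assigned a cluster, a week label, and a boolean negative-outcome flag. The `Turn` type and the threshold defaults are hypothetical and would need tuning against real traffic.

```python
from collections import defaultdict
from typing import NamedTuple


class Turn(NamedTuple):
    cluster_id: str
    week: str       # a week label such as "W18"
    negative: bool  # thumbs-down, abandoned session, or rephrase


def shortlist(turns, min_size=3, min_weeks=2, min_negative_rate=0.5):
    """Keep clusters that are frequent, recur across weeks, and end badly."""
    by_cluster = defaultdict(list)
    for turn in turns:
        by_cluster[turn.cluster_id].append(turn)

    rows = []
    for cluster_id, members in by_cluster.items():
        size = len(members)
        weeks = len({t.week for t in members})
        negative_rate = sum(t.negative for t in members) / size
        if size >= min_size and weeks >= min_weeks and negative_rate >= min_negative_rate:
            rows.append({
                "cluster": cluster_id,
                "count": size,
                "weeks": weeks,
                "negative_rate": negative_rate,
            })
    return sorted(rows, key=lambda r: r["count"], reverse=True)


turns = [
    Turn("translate", "W18", True),
    Turn("translate", "W18", True),
    Turn("translate", "W19", True),
    Turn("translate", "W19", False),
    Turn("greeting", "W18", False),
    Turn("greeting", "W18", False),
    Turn("greeting", "W18", False),
    Turn("invoice_copy", "W19", True),
]
candidates = shortlist(turns)
print(candidates)
```

Here only `translate` survives: `greeting` is frequent but neither recurring nor negative, and `invoice_copy` is negative but too small to count yet.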

Run this pipeline weekly. Compare the weekly clusters to last month's clusters. New clusters are usually triggered by external events — a competitor shipping something, a viral use case, a season. Persistent clusters are unmet demand. Persistent clusters that grew between months are unmet demand under acceleration. Each tells you something a roadmap planning meeting would otherwise have to guess at.
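The comparison reduces to a small amount of arithmetic. The sketch below assumes cluster identities are stable between the two windows, which in practice requires matching clusters across runs, and that the counts cover comparable time spans. The counts are made up for illustration and `classify_trends` is a hypothetical helper.

```python
def classify_trends(previous, current):
    """Label each current cluster as new, persistent, or accelerating.

    previous and current map cluster name -> prompt count for two
    comparable time windows. Growth is (current - previous) / previous.
    """
    trends = {}
    for name, count in current.items():
        before = previous.get(name, 0)
        if before == 0:
            trends[name] = ("new", None)
        else:
            growth = (count - before) / before
            label = "accelerating" if growth > 0 else "persistent"
            trends[name] = (label, growth)
    return trends


# Illustrative counts only.
last_month = {"translate": 1500, "invoice_copy": 200}
this_month = {"translate": 1845, "invoice_copy": 200, "scaffold_project": 90}

trends = classify_trends(last_month, this_month)
for name, (label, growth) in trends.items():
    detail = "" if growth is None else f" ({growth:+.0%})"
    print(f"{name}: {label}{detail}")
```

A growth threshold above zero (say, only calling a cluster accelerating past 10%) would cut down on noise from week-to-week variance. The zero cutoff here is the simplest possible choice.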

What changes when the prompt log becomes the roadmap input

A team that treats the prompt log as a discovery channel develops a few habits that look unusual from the outside.

Feature proposals start citing prompt counts. A new feature pitch reads less like "we should build X because it is strategically aligned" and more like "1,847 users tried to do X with our current feature in the last 30 days, here are the ten most representative phrasings, and the cluster has grown 23% month-over-month." The conversation in the planning meeting shifts from opinion to evidence.

Scope decisions get made faster. The recurring question of "should the AI feature also do Y" stops being a philosophical debate about product identity and becomes a question about whether Y is a top-10 cluster. If it is, the demand is real; if it is not, the question can wait.

Eval sets get richer. Prompts from the scope-failure pile become the spec for new capabilities. Once a feature ships, prompts from the original cluster become the regression test that the feature actually solved what users were asking for. The eval set stops being something the ML team made up and starts being something the customer wrote.

The relationship between the product team and the eval team gets less adversarial. The eval team is no longer the gatekeeper saying "the model is failing X% of the time, fix it." Both sides now share a common artifact — the clustered prompt log — and disagreements move from "is the model good enough" to "is this cluster a quality issue or a scope issue."

The privacy guardrails are tractable

The most common objection to systematic prompt mining is privacy. User prompts contain personal information. Mining them across a product population sounds like a compliance incident waiting to happen. This concern is real, but it is also tractable. It is not why most teams fail to do this; it is the excuse teams use to justify not having tried.

The mitigations are well understood. PII detection and redaction run before the prompt enters the analysis pipeline. Aggregation only happens above a minimum cluster size, which makes it much harder to re-identify individual users. Access to the clustering tool is logged and audited. Customers in regulated tiers can be excluded from the pipeline by contract. Embedding-based clustering can operate on redacted prompts, with identifiers replaced by placeholders or hashed tokens, without losing most of the signal.
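Two of these guardrails, redaction before analysis and a minimum-size threshold before release, can be sketched as follows. The regular expressions are deliberately simple illustrations and are nowhere near adequate PII detection on their own; a real pipeline would use a dedicated detection service. Counting distinct users rather than raw prompts is an assumption that makes the threshold slightly stricter. `redact`, `releasable`, and `min_users` are hypothetical names.

```python
import re

# Illustrative patterns only; real PII detection needs far more coverage.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(prompt):
    """Replace email addresses and phone-like digit runs with placeholders."""
    prompt = EMAIL.sub("[EMAIL]", prompt)
    return PHONE.sub("[PHONE]", prompt)


def releasable(clusters, min_users=5):
    """Keep only clusters backed by at least min_users distinct users.

    clusters maps cluster name -> list of (user_id, redacted_prompt).
    """
    return {
        name: rows
        for name, rows in clusters.items()
        if len({user_id for user_id, _ in rows}) >= min_users
    }


clean = redact("email jane.doe@example.com or call +1 (415) 555-0100 about my invoice")
print(clean)

clusters = {
    "translate": [(f"user{i}", "translate this summary") for i in range(6)],
    "rare_request": [("user1", "first phrasing"), ("user1", "second phrasing")],
}
visible = releasable(clusters)
print(sorted(visible))
```

Placeholders such as `[EMAIL]` keep the sentence shape intact, so the intent ("contact me about my invoice") survives redaction even though the identifier does not.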

None of this is novel. Every team running analytics on user content already does some version of it. The work to extend the same controls to prompt logs is roughly one quarter of engineering time, not a multi-year compliance program. If your team is using "privacy" as the reason not to mine prompts but is happily training on the same data, the privacy argument is not what is actually blocking the work.

The pipeline you can build this quarter

A reasonable starting version of this system is small. A scheduled job pulls last week's prompts, embeds them, clusters them, summarizes each cluster, joins outcome signals like thumbs-down and session abandonment, and writes a Notion or wiki page with the top clusters ranked by size and outcome severity. The PM and the eval lead get a weekly notification with the link.
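The final step of that job, ranking clusters and writing the page, could be as small as this sketch. It assumes the earlier steps have produced one record per cluster with a summary, a count, and a negative-outcome rate. The `severity` score (count multiplied by negative rate) is one arbitrary way to combine size and outcome severity, not a recommendation from the text, and publishing the result to a wiki is left out.

```python
def severity(cluster):
    """Hypothetical ranking score: expected number of unhappy turns."""
    return cluster["count"] * cluster["negative_rate"]


def render_report(clusters, top_n=10):
    """Return a plain-text report of the top clusters, worst first."""
    ranked = sorted(clusters, key=severity, reverse=True)[:top_n]
    lines = ["Weekly prompt clusters"]
    for rank, c in enumerate(ranked, start=1):
        lines.append(
            f"{rank}. {c['summary']} | {c['count']} prompts | "
            f"{c['negative_rate']:.0%} negative"
        )
    return "\n".join(lines)


# Illustrative records only.
clusters = [
    {"summary": "Translate a summary", "count": 120, "negative_rate": 0.6},
    {"summary": "Request invoice copy", "count": 200, "negative_rate": 0.2},
    {"summary": "Scaffold a new project", "count": 90, "negative_rate": 0.9},
]
report = render_report(clusters)
print(report)
```

Note that the largest cluster ranks last here. Sorting by size alone would put a mostly satisfied cluster at the top of the PM's reading list.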

That is the entire intervention. No new tools. No headcount. No platform team dependency. The pipeline runs unattended; the only human input is the meeting where the PM walks through the top clusters and decides what becomes a roadmap item, what becomes a quality ticket, and what gets ignored.

The teams that build this version of the pipeline almost universally report that the first month surfaces at least one feature cluster nobody on the team had considered, and at least one quality issue nobody on the eval team had flagged. The second month surfaces persistence patterns. By the third month, the PM stops treating the planning meeting as a place to defend opinions and starts treating it as a place to allocate effort against evidence.

The prompt log was never noise. It was always your roadmap. The work is just to read it.