The Cold Start Problem in AI Features: Why Week One Always Fails

· 11 min read
Tian Pan
Software Engineer

You build a personalization feature, wire it into your app, and ship it. Week one arrives. The system dutifully serves every new user the same handful of globally popular items — your AI, supposedly intelligent, is no smarter than an alphabetically sorted list. Your engagement metrics barely move. Your team concludes the model needs more tuning. It doesn't. The model is working exactly as designed. The problem is you asked it to learn before it had anything to learn from.

This is the cold start problem, and it kills more AI features than bad models ever will.

The core dynamic is circular: a behavioral ML system needs user interactions to produce useful predictions, but it needs to produce useful predictions to earn user interactions. One large e-commerce platform documented that cold start affected more than 60% of their new users — and those users were receiving misfired recommendations that measurably hurt conversion rates. In aggregate metrics, this signal was nearly invisible because warm users masked the damage.

What Actually Breaks (and Why)

Cold start is not a single failure mode — it's a cluster of related problems that manifest differently depending on which part of your system is "cold."

Collaborative filtering is paralyzed from the start. CF works by finding users similar to the current user and surfacing what those users engaged with. With no interaction history, there is no neighborhood to compute. The user-item matrix is empty; factorization produces noise rather than signal.
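A toy sketch makes the failure concrete. The data below is hypothetical; the point is that a cold user's all-zero interaction row has zero cosine similarity to everyone, so there is no neighborhood to recommend from:

```python
import math

# Toy user-item interaction matrix (rows = users, columns = items).
# A 1 means the user engaged with that item; the new user has no history.
interactions = {
    "alice":    [1, 0, 1, 1, 0],
    "bob":      [1, 1, 0, 1, 0],
    "carol":    [0, 1, 0, 0, 1],
    "new_user": [0, 0, 0, 0, 0],  # cold start: an all-zero row
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 0.0  # an empty history is similar to no one
    return dot / (nu * nv)

target = interactions["new_user"]
neighbors = {
    name: cosine(target, vec)
    for name, vec in interactions.items()
    if name != "new_user"
}
print(neighbors)  # every similarity is 0.0 — no neighborhood exists
```

Every downstream step of CF — neighbor weighting, score aggregation, ranking — inherits this zero signal, which is why the system falls back to something global.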

Systems default to popularity ranking, which is almost never what users want. Globally popular items achieve maximum similarity to all users in the absence of personalization signal, so they flood every cold user's feed. This "popularity trap" is particularly insidious because popularity ranking is not obviously wrong — popular items are genuinely good choices on average — but on average is exactly the wrong optimization target for a personalization system. The user came for relevance, not consensus.

New items are invisible. A freshly listed product, article, or track has no interaction history, so collaborative systems have no basis to surface it. This creates a compounding problem: items that never receive early promotion never accumulate the interactions they'd need to earn recommendation. Long-tail items can remain stuck in this cold state indefinitely.

Early signals lock in feedback loops. The few interactions that do occur during cold start carry outsized weight because there is so little other signal to dilute them. If a new user's first three clicks are accidental or context-dependent, the system can build a preference profile around statistical noise. That profile then shapes future recommendations, which shapes future interactions, which reinforces the noisy profile. DeepMind research on recommendation systems found that algorithms biased toward exploitation during cold start produce measurably more "system degeneracy" — filter bubbles and narrow profiles — than those that lean toward exploration.

LLM routing faces a domain shift variant of the same problem. A router that decides which model handles which query needs labeled examples of how your specific users phrase your specific domain's queries. Routers trained on generic benchmarks generalize poorly when deployed in new domains. You have a cold-start routing system the moment you launch in a new product area with no query history.

The Three-Layer Bootstrapping Architecture

The most robust approach to cold start is not a single solution — it's a layered system where each layer operates on the data it actually has available.

Layer 1: Rules and Editorial Curation

Start without machine learning. Build the heuristic version first.

This is not defeatism — it's the correct sequencing. Rule-based systems work with no data. They provide immediate value to users. And critically, they generate the training data that your eventual ML model will need. A system that serves the most-purchased items in a user's inferred geography is not impressive, but it converts, it collects interaction signals, and it gives your ML team a baseline to beat.
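The geography heuristic above fits in a few lines. This is a minimal sketch with made-up purchase data; the function names and weights are illustrative, not from any particular system:

```python
from collections import Counter

# Hypothetical purchase log: (user_geo, item_id) pairs.
purchases = [
    ("US", "widget"), ("US", "widget"), ("US", "gizmo"),
    ("DE", "gadget"), ("DE", "gadget"), ("DE", "widget"),
]

def top_items_for_geo(log, geo, k=2):
    """Rule-based fallback: most-purchased items in the user's geography."""
    counts = Counter(item for g, item in log if g == geo)
    return [item for item, _ in counts.most_common(k)]

print(top_items_for_geo(purchases, "DE"))  # ['gadget', 'widget']
```

No model, no training loop — and every recommendation it serves produces a labeled interaction your future ML model can train on.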

Google's machine learning rules documentation is explicit on this point: "Don't be afraid to launch a product without machine learning." Hamel Husain, formerly at GitHub, adds the practical reason beyond philosophy: solving the problem manually first forces engineers to understand their data at a level that model-only approaches never achieve.

Editorial curation adds a second rule-based layer. Manually select "starter pack" collections segmented by the coarse signals you do have: device type, referrer URL, geolocation, time of day. A user arriving from a cycling forum and a user arriving from a cooking newsletter should not see the same cold-start content, and you can make that distinction without any behavioral data.

Layer 2: Content-Based and Transfer Signal

While rule-based systems carry the load, seed your ML models with non-behavioral signal.

Content-based similarity doesn't need interaction data. Item metadata — descriptions, categories, images, audio features, tags — can be embedded into a shared vector space where semantic proximity substitutes for behavioral co-occurrence. A new product with no purchase history still has attributes; a new track with no play history still has audio features. Spotify analyzes every new track on ingest, building a 42-dimensional audio feature vector and running LLM-based semantic analysis on lyrics before a single user plays it. The track enters the recommendation system with a position in embedding space, not a blank slate.
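A minimal sketch of the idea, with tiny hand-made vectors standing in for real encoder outputs (in practice these would come from text, image, or audio embedding models):

```python
import math

# Hypothetical item metadata embedded as small feature vectors.
item_vectors = {
    "new_track":   [0.9, 0.1, 0.4],  # zero plays, but it has features
    "indie_hit":   [0.8, 0.2, 0.5],
    "metal_track": [0.1, 0.9, 0.2],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# A brand-new item already has neighbors in embedding space.
query = item_vectors["new_track"]
ranked = sorted(
    ((name, cosine(query, vec)) for name, vec in item_vectors.items()
     if name != "new_track"),
    key=lambda pair: -pair[1],
)
print(ranked[0][0])  # 'indie_hit' — semantic proximity, no play history needed
```

Behavioral co-occurrence would need thousands of plays to discover the same relationship; the metadata gives it to you on ingest.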

Transfer from adjacent signals exploits the fact that users often have behavioral history in related domains. DoorDash's expansion into grocery delivery illustrates the pattern cleanly: users had no grocery history, but they had extensive restaurant order history. DoorDash built an LLM-based system that inferred grocery preferences from restaurant cuisine choices — someone who consistently orders Thai food and Indian food likely has preferences around spice tolerance and fresh herbs. The transfer worked: they measured double-digit CTR improvements over the no-transfer baseline.

The generalizable insight is to audit what behavioral signals you have in adjacent systems before concluding you have no data. Search queries, abandoned carts, session depth on categories, referral patterns — these are all weak signals individually, but combined they can substantially shorten the cold-start window.

Synthetic seed data is an emerging third option. LLMs can generate plausible training examples — relevance judgments, simulated user profiles, query-answer pairs — to pre-warm models before real user data exists. Algolia found that LLM-generated relevance judgments match human expert annotations with roughly 97% accuracy, making synthetic signals a viable bootstrap mechanism. For routing models specifically, recent research shows that generating synthetic Q&A pairs from a structured task taxonomy can produce routers that outperform systems trained on cross-domain real data.

One critical caveat: use synthetic data as a one-time seed, not as a continuous generation loop. Training a model on AI-generated data, then using that model to generate data for the next version, causes progressive performance degradation — Oxford researchers demonstrated this "model collapse" phenomenon in Nature in 2024. The diversity that makes training data useful erodes with each generative cycle.

Layer 3: Exploration Infrastructure

Once your system is generating recommendations, it needs to collect signal efficiently. This is where exploration strategy matters.

Pure exploitation of current knowledge accelerates feedback loops. Pure exploration wastes user experience. The standard framework for managing this tradeoff is multi-armed bandits — algorithms that systematically balance serving known-good recommendations against probing uncertain ones.

Thompson Sampling and UCB (Upper Confidence Bound) are the dominant approaches. UCB intrinsically favors under-explored items: it selects items with the highest upper confidence bound, which means uncertain items get boosted until their uncertainty resolves. This naturally investigates new content and new users without requiring explicit exploration logic. Thompson Sampling achieves a similar effect by sampling from posterior distributions — adding controlled randomness that decays as evidence accumulates.
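A compact UCB1 implementation shows the mechanism. The click-through rates below are invented for the simulation; what matters is that arms with few observations get a large confidence bonus and are probed automatically:

```python
import math
import random

class UCB1:
    """UCB1 bandit: each arm's score is its mean reward plus an
    exploration bonus that shrinks as observations accumulate."""

    def __init__(self, n_arms):
        self.counts = [0] * n_arms
        self.values = [0.0] * n_arms  # running mean reward per arm

    def select(self):
        # Play each arm once before applying the confidence bound.
        for arm, c in enumerate(self.counts):
            if c == 0:
                return arm
        total = sum(self.counts)
        ucb = [
            v + math.sqrt(2 * math.log(total) / c)
            for v, c in zip(self.values, self.counts)
        ]
        return max(range(len(ucb)), key=lambda a: ucb[a])

    def update(self, arm, reward):
        self.counts[arm] += 1
        # Incremental mean update.
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

random.seed(0)
true_ctr = [0.05, 0.20, 0.10]  # hypothetical per-item click-through rates
bandit = UCB1(len(true_ctr))
for _ in range(5000):
    arm = bandit.select()
    bandit.update(arm, 1.0 if random.random() < true_ctr[arm] else 0.0)
print(bandit.counts)  # traffic concentrates on the best arm over time
```

Note that no explicit "explore new items" rule exists anywhere in the code; the confidence term does that work on its own.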

TikTok's content distribution architecture is a well-documented example of bandit-style exploration at scale. Every new video enters a small test pool of 500 to 1,000 initial views. Engagement rate in those initial hours determines whether the video advances to 10,000 views, then 100,000, then viral distribution. Each new creator starts in the same cold state; the algorithm itself resolves the cold start through tiered exploration rather than requiring upfront training data.
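The tiered pattern can be sketched in a few lines. This is not TikTok's actual implementation — the tier sizes and the engagement threshold are placeholder values — but it captures the graduation logic:

```python
# Audience tiers: each grants a larger pool only if engagement clears a bar.
TIERS = [1_000, 10_000, 100_000, 1_000_000]
ENGAGEMENT_THRESHOLD = 0.10  # hypothetical minimum like/view rate to advance

def next_audience(tier_index, views, likes):
    """Return the next audience size, or None if the item stops advancing."""
    rate = likes / views if views else 0.0
    if rate >= ENGAGEMENT_THRESHOLD and tier_index + 1 < len(TIERS):
        return TIERS[tier_index + 1]
    return None

print(next_audience(0, 1_000, 150))  # 10000 — clears the bar, advances
print(next_audience(0, 1_000, 40))   # None — stalls in the test pool
```

The cold-start cost is bounded by design: a bad item wastes at most one small test pool's worth of impressions.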

The user-side equivalent: onboarding is your highest-leverage data collection opportunity. Most teams optimize onboarding for minimal friction, which is correct for conversion — but minimal onboarding maximizes cold-start duration. The optimal design asks for the preference signals that provide the most disambiguation per question, presented as value exchange ("tell us what you like and we'll personalize immediately") rather than data extraction. Active learning research shows that presenting users with items selected to differentiate between taste clusters yields far more information per interaction than random or popularity-based probing.

How Long Does Cold Start Last?

There is no universal threshold, but the practical answer for most systems is: longer than teams plan for, and shorter than teams fear.

User cold start typically resolves somewhere in the range of 5 to 20 meaningful interactions. "Meaningful" is key — passive exposures count for less than explicit engagement, and engagement on items served via recommendation counts for less than unprompted search behavior, because recommendation-driven engagement is influenced by what the system chose to show.

Item cold start depends heavily on platform volume. For Spotify or TikTok, a new piece of content in a popular category can graduate from cold in days. For a B2B SaaS recommendation feature with lower traffic, the same graduation might take months.

System cold start — the case where both users and items are new — is the most severe form and routinely takes three to six months before collaborative filtering becomes the primary recommendation mechanism. During this window, rules and content-based systems should be carrying the load, not treated as temporary embarrassments.

The signal that you've graduated: recommendation diversity increases (you're no longer serving the same popularity hits to everyone), per-user CTR diverges meaningfully across users (personalization is actually differentiating), and A/B tests show the ML model outperforming the popularity fallback for users with sufficient interaction history.

The Mistakes That Extend Cold Start Unnecessarily

Shipping ML before you have data. Teams build collaborative filtering models and launch into a data vacuum. The model outputs noise, the team concludes it needs more tuning, and they delay launching the rule-based fallback that would have generated the training data they need. The correct sequencing is rules first, then ML.

Single-signal reliance. Explicit ratings have response rates of 1 to 10%. Clicks are noisy and position-biased. Either one alone is insufficient during cold start. Combining multiple weak signals — dwell time, scroll depth, cart behavior, abandonment patterns, search queries — produces substantially richer training data than any single source.
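Combining weak signals can be as simple as a weighted sum. The weights below are illustrative placeholders, not tuned values — in production they would be learned or calibrated against downstream conversion:

```python
# Weighted combination of weak implicit signals into one engagement score.
SIGNAL_WEIGHTS = {
    "click": 1.0,
    "dwell_seconds": 0.05,  # per second of dwell time
    "scroll_depth": 0.5,    # fraction of page scrolled, 0..1
    "add_to_cart": 3.0,
    "search_match": 2.0,    # item matched a recent search query
}

def engagement_score(events: dict) -> float:
    """Weighted sum of whatever weak signals this session produced."""
    return sum(SIGNAL_WEIGHTS[k] * v for k, v in events.items())

session = {"click": 1, "dwell_seconds": 30, "scroll_depth": 0.8}
print(round(engagement_score(session), 2))  # 2.9
```

Even before any single signal is trustworthy, the combined score separates engaged sessions from bounces, which is exactly the label quality a cold-start model needs.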

Batch pipeline ML during cold start. A system that retrains daily is especially problematic when it needs to learn quickly from sparse early interactions. Cold start is exactly when real-time or near-real-time feedback loops matter most. If your infrastructure forces a 24-hour feedback cycle, early interactions take a full day to influence the next recommendation — long enough for the user to have churned.

Not tracking cold-start users separately. Averaging performance metrics across warm and cold users makes the problem invisible. A product with excellent retention among users who've made five or more interactions can simultaneously be hemorrhaging new users who never reach that threshold. Track cold-user cohorts explicitly, with their own dashboards and their own success criteria.

Treating it as a purely technical problem. Cold start sits at the intersection of ML engineering, product design, and data collection strategy. The most cost-effective interventions are often product-side: a better onboarding flow, a preference survey with meaningful completion incentives, an explicit "not interested" signal that accelerates profile building. Delegating cold start entirely to ML engineers misses the surface area where product teams can have the most impact.

The Layered Cold-Start Playbook

The mental model that ties this together: design your AI feature as three concurrent systems running in parallel, not one system that gets better over time.

The rules layer serves everyone immediately and generates training data. The transfer and content layer uses non-behavioral signal to provide better-than-random personalization from day one. The behavioral ML layer starts thin, learns from every interaction, and gradually takes over from the other two as individual users and items accumulate sufficient history.

Each user and each item moves through this stack on their own schedule. A user with 20 meaningful interactions is already in behavioral ML territory. A user who signed up yesterday is still in the rules layer. A new item ingested this morning is content-based only. Your system needs to handle all of these states simultaneously, not sequentially.
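The per-user, per-item routing above reduces to a small dispatch function. This sketch assumes a single interaction-count threshold for graduation; a real system would use richer criteria (signal quality, recency, item coverage):

```python
RULES, CONTENT, BEHAVIORAL = "rules", "content", "behavioral"
WARM_THRESHOLD = 5  # hypothetical count of meaningful interactions to graduate

def pick_layer(user_interactions: int, item_has_history: bool) -> str:
    """Route each (user, item) state to the layer that has data for it."""
    if user_interactions >= WARM_THRESHOLD and item_has_history:
        return BEHAVIORAL  # enough signal on both sides
    if item_has_history or user_interactions > 0:
        return CONTENT     # partial signal: lean on metadata and transfer
    return RULES           # nothing yet: heuristics and curation

print(pick_layer(20, True))   # behavioral
print(pick_layer(0, False))   # rules
print(pick_layer(2, True))    # content
```

The dispatch runs per request, so the same feed can mix all three layers at once — which is the point: the stack is concurrent, not sequential.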

Cold start is not a bug to be fixed before launch. It's a designed state with its own toolbox — and teams that treat it as a first-class engineering concern, rather than a temporary limitation waiting for more data, are the ones that retain the users they spend so much to acquire.
