Toutiao Recommendation System: P2 Content Analysis

In Toutiao Recommendation System: P1 Overview, we learned that content analysis and data mining of user tags are the cornerstones of the recommendation system.

What is content analysis?

Content analysis = deriving intermediate data from raw articles and user behaviors.

Take articles as an example. To model user interests, we need to tag articles and other content. To associate a user with an interest in the “Internet” tag, we need to know whether that user reads articles carrying the “Internet” tag.

Why are we analyzing this raw data?

We do it for the following purposes:

Tagging users (user profile)
- Tag users who liked articles with the “Internet” tag. Tag users who liked articles with the “xiaomi” tag.
Recommending content to users by tags
- Push “meizu” content to users with the “meizu” tag. Push “dota” content to users with the “dota” tag.
Preparing content by topics
- Put “Bundesliga” articles into the “Bundesliga” topic. Put “diet” articles into the “diet” topic.
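The first two purposes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the article ids, the `ARTICLE_TAGS` table, and the helper names `build_user_profiles` and `candidates_for` are hypothetical, and a real system would weight and decay tags rather than simply count them.

```python
from collections import Counter, defaultdict

# Hypothetical sample data: article id -> tags produced by content analysis.
ARTICLE_TAGS = {
    "a1": {"Internet", "xiaomi"},
    "a2": {"meizu"},
    "a3": {"dota"},
    "a4": {"Internet"},
}

def build_user_profiles(read_events, article_tags):
    """Count, per user, how often each tag appears in the articles they read."""
    profiles = defaultdict(Counter)
    for user_id, article_id in read_events:
        for tag in article_tags.get(article_id, ()):
            profiles[user_id][tag] += 1
    return profiles

def candidates_for(user_id, profiles, article_tags, already_read):
    """Return unread articles that share at least one tag with the user's profile."""
    user_tags = set(profiles.get(user_id, ()))
    return sorted(
        article_id
        for article_id, tags in article_tags.items()
        if article_id not in already_read and tags & user_tags
    )

# (user id, article id) pairs standing in for raw reading behavior.
events = [("u1", "a1"), ("u2", "a3")]
profiles = build_user_profiles(events, ARTICLE_TAGS)
print(dict(profiles["u1"]))
print(candidates_for("u1", profiles, ARTICLE_TAGS, already_read={"a1"}))
```

Here user `u1` read an article tagged “Internet” and “xiaomi”, so the only unread candidate is the other “Internet” article.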

Case Study: Analysis Result of an Article

Here is an example of an “article features” page. It shows article features such as categories, keywords, topics, and entities.

Analysis Result of an Article

Analysis Result of an Article: Details

What are the article features?

Semantic tags: humans predefine these tags, and each has an explicit meaning.
Implicit semantics, including topics and keywords. Topic features describe word statistics. Keywords are generated by certain rules.
Similarity. Duplicate recommendations used to be the most severe complaint we received from our customers.
Time and location.
Quality. Is the content abusive, pornographic, an ad, or “chicken soup for the soul”?
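To make the similarity feature concrete, here is a minimal sketch of near-duplicate detection. The post does not say how similarity is computed; Jaccard similarity over word shingles is a common technique swapped in for illustration, and the function names and the `0.8` threshold are assumptions.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles of a lowercased text."""
    words = text.lower().split()
    if len(words) < k:
        # Short texts become a single shingle (or nothing if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def is_near_duplicate(text_a, text_b, threshold=0.8):
    """Flag two articles as duplicates when their shingle sets overlap heavily."""
    return jaccard(shingles(text_a), shingles(text_b)) >= threshold

original = "Xiaomi releases a new phone today"
repost = "Xiaomi releases a new phone today"
unrelated = "Bundesliga match results for this weekend"
print(is_near_duplicate(original, repost))
print(is_near_duplicate(original, unrelated))
```

A recommender could run such a check before serving an article, so that a user who already saw the original is not shown the repost.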

Article features are important

It is not true that a recommendation system cannot work at all without article features. Amazon, Walmart, and Netflix can recommend by collaborative filtering.
However, in a news product, users mostly consume content published the same day. Bootstrapping new articles without article features is hard, and collaborative filtering cannot help because a fresh article has no interaction history yet.
- The finer the granularity of the article features, the better the ability to bootstrap.
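The bootstrapping argument can be illustrated with a small content-based scorer: a brand-new article has no clicks, but its features can still be matched against a user's tag profile. This is a sketch under assumptions, not Toutiao's method; the weights, the article ids `n1` and `n2`, and the use of cosine similarity are all chosen for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical user profile: tag -> interest weight.
user = {"Internet": 3.0, "xiaomi": 1.0}

# Hypothetical fresh articles with features but no interaction history.
fresh_articles = {
    "n1": {"xiaomi": 1.0, "Internet": 0.5},
    "n2": {"Bundesliga": 1.0},
}

ranked = sorted(
    fresh_articles,
    key=lambda article_id: cosine(user, fresh_articles[article_id]),
    reverse=True,
)
print(ranked)
```

Collaborative filtering would have nothing to say about `n1` or `n2`, while the feature match already separates them.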

Document classification

Classification hierarchy

Root
- Science, sports, finance, entertainment
  - Football, tennis, table tennis, track and field, swimming
    - International, domestic
      - Team A, team B
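One way to use such a hierarchy is top-down classification: a classifier at each node picks a child, and the walk continues until no child fits. The sketch below assumes that approach; the `HINTS` keyword table and `pick_child` are hypothetical stand-ins for trained per-node classifiers, and only a slice of the hierarchy is encoded.

```python
# Slice of the hierarchy above: node -> children.
HIERARCHY = {
    "root": ["science", "sports", "finance", "entertainment"],
    "sports": ["football", "tennis", "table tennis", "track and field", "swimming"],
    "football": ["international", "domestic"],
}

# Hypothetical keyword hints standing in for a trained model at each node.
HINTS = {
    "sports": {"match", "league", "goal"},
    "finance": {"stock", "market"},
    "football": {"goal", "striker", "league"},
    "tennis": {"serve", "racket"},
    "international": {"bundesliga", "international"},
    "domestic": {"domestic"},
}

def pick_child(words, children):
    """Choose the child with the most hint overlap, or None if nothing matches."""
    best, best_score = None, 0
    for child in children:
        score = len(words & HINTS.get(child, set()))
        if score > best_score:
            best, best_score = child, score
    return best

def classify(text):
    """Walk the hierarchy from the root and return the path of chosen labels."""
    words = set(text.lower().split())
    path, node = [], "root"
    while node in HIERARCHY:
        child = pick_child(words, HIERARCHY[node])
        if child is None:
            break  # Stop at the deepest level we are confident about.
        path.append(child)
        node = child
    return path

print(classify("Late goal decides Bundesliga league match"))
```

Stopping early when no child matches lets an article keep a coarse label such as “sports” even when the finer levels are uncertain.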

Classifiers:

SVM
SVM + CNN
SVM + CNN + RNN

Calculating relevance

Lexical analysis for articles
Filtering keywords
Disambiguation
Calculating relevance
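The four steps above can be sketched as a small pipeline. The post names the steps but not their implementation, so everything here is an assumption made for illustration: the regex tokenizer, the `STOPWORDS` set, the hypothetical `SENSES` inventory, and normalized term frequency as the relevance score.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "to", "in", "is", "its"}

# Hypothetical sense inventory: ambiguous term -> {sense: context words}.
SENSES = {
    "apple": {
        "apple (company)": {"iphone", "mac"},
        "apple (fruit)": {"juice", "orchard"},
    },
}

def tokenize(text):
    """Step 1, lexical analysis: split lowercased text into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def filter_keywords(tokens):
    """Step 2, filtering: drop stopwords and single-character tokens."""
    return [t for t in tokens if t not in STOPWORDS and len(t) > 1]

def disambiguate(keywords):
    """Step 3: map ambiguous terms to the sense sharing the most context words.

    Ties fall back to the first sense listed, which is an arbitrary choice.
    """
    context = set(keywords)
    resolved = []
    for keyword in keywords:
        senses = SENSES.get(keyword)
        if senses:
            keyword = max(senses, key=lambda s: len(senses[s] & context))
        resolved.append(keyword)
    return resolved

def relevance(text):
    """Step 4: score each term by its share of the article's keywords."""
    terms = disambiguate(filter_keywords(tokenize(text)))
    counts = Counter(terms)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()} if total else {}

scores = relevance("Apple unveils the new iPhone and Apple Mac")
print(scores)
```

In this example “apple” resolves to the company sense because “iphone” and “mac” appear in the same article, and it receives the highest relevance since it occurs twice.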

About Tian Pan