Recommendation Systems: Content-Based Recommendation

1. User portrait

1.1 What is a user portrait

First, the mission of a recommendation system is to establish connections between users and items. The usual approach is to score the match between a user and an item, that is, to predict the user's rating or preference. Before the system can score a match, it must represent both the user and the item as vectors so they can be computed on. Different recommendation algorithms vectorize differently, and consequently use the resulting match scores differently.

The result of user vectorization is the user portrait. Vectorization is therefore a must: without it, the computer cannot compute. The user portrait is not the goal of the recommendation system but a by-product of one key step in its pipeline. Registration information, tags, and embedding vectors learned by various deep models are all good portrait content.

A user portrait is a vector representation of the user, so there are only two types: sparse vectors and dense vectors. Tags, registration information, and social relationships all yield sparse vectors. Embedding vectors trained with neural networks, hidden factors obtained by matrix factorization, and topic distributions from latent semantic analysis or topic models are all dense vectors. Sparse vectors capture the more obvious aspects of a user and are therefore more interpretable; they can be used to give reasons for a recommendation. Dense vectors can capture more hidden interests, but their black-box nature makes them less interpretable than the former.
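As a minimal illustration of the contrast (the tags, weights, and embedding values below are all invented), the two portrait types might look like:

```python
# Sparse portrait: explicit tag weights, directly interpretable.
sparse_portrait = {"sci-fi": 0.8, "python": 0.5, "hiking": 0.2}

# Dense portrait: an embedding learned by some model (values made up here);
# individual dimensions have no human-readable meaning.
dense_portrait = [0.12, -0.73, 0.41, 0.05]

# The sparse form can justify a recommendation on its own:
top_tag = max(sparse_portrait, key=sparse_portrait.get)
reason = f"Recommended because you like {top_tag}"
```

This is why sparse portraits are the usual source of "recommended because..." explanations, while dense portraits feed scoring models.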

1.2 Key factors

  • Dimension
  • Quantification

In a production system, the quantification of each portrait dimension should be left to the machine. Only target-oriented portraits, reverse-optimized against recommendation performance, are meaningful; subjective, hand-tuned quantification is a taboo. Portrait quantification is tightly coupled to the effectiveness of the recommendation system. A user portrait should not be built for its own sake; it is a by-product of the recommendation system, and its quantification should be guided by its downstream effect (ranking quality, recall coverage).

1.3 Construction methods

  • Direct lookup
    Use the raw data directly as portrait content, such as demographic registration information, purchase history, or reading history. Apart from data cleaning, no abstraction is applied. Simple to implement and usually useful for cold start.

  • Piling up data
    Piling up and aggregating historical data is the most common way to obtain portrait content. The most common form is interest tags: first tag the items, then collect the user's historical behavior on those tags, aggregate statistics along the tag dimension, use the statistics as the quantified weights, and finally truncate to the strongest interests.

  • Black box
    Use machine learning to learn dense vectors that humans cannot intuitively interpret: for example, latent semantic models for reading interest, matrix factorization for hidden factors, or deep models that learn user embedding vectors. Portrait content of this kind cannot be explained directly.
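A minimal sketch of the "piling up data" approach, assuming a hypothetical history of (item tags, behavior weight) pairs; real systems would weight behaviors by type and recency:

```python
from collections import Counter

def build_tag_profile(behaviors, top_k=3):
    """behaviors: list of (item_tags, weight) pairs from the user's history.
    Aggregates behavior weight per tag, normalizes, then truncates."""
    counts = Counter()
    for tags, weight in behaviors:
        for tag in tags:
            counts[tag] += weight
    total = sum(counts.values())
    profile = {t: c / total for t, c in counts.items()}
    # Truncation: keep only the top_k strongest interests.
    return dict(sorted(profile.items(), key=lambda kv: -kv[1])[:top_k])

# Hypothetical history: tags of consumed items, with behavior weights.
history = [(["sports", "news"], 1.0), (["sports"], 2.0), (["tech"], 0.5)]
profile = build_tag_profile(history, top_k=2)
```

The final truncation step matters in practice: it keeps the portrait compact and drops noisy long-tail interests.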

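For the black-box style, here is a toy sketch using truncated SVD on a made-up rating matrix to obtain dense hidden-factor vectors. It is illustrative only: a real factorizer would fit only the observed entries rather than treating zeros as ratings.

```python
import numpy as np

# Toy user-item rating matrix (3 users x 4 items); 0 marks "unobserved".
R = np.array([[5.0, 3.0, 0.0, 1.0],
              [4.0, 0.0, 0.0, 1.0],
              [1.0, 1.0, 0.0, 5.0]])

# Truncated SVD yields dense "hidden factor" vectors for users and items.
U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2
user_factors = U[:, :k] * s[:k]   # each row: one user's dense portrait
item_factors = Vt[:k, :].T        # each row: one item's dense vector

# Predicted preference is the dot product of the two dense vectors.
pred = user_factors @ item_factors.T
```

The rows of `user_factors` are exactly the kind of uninterpretable dense portrait content described above.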
2. Tag mining technology

Early recommendation systems generally started from content-based recommendation, which depends on mining user interest tags; these tags are an important part of the user portrait.

2.1 Materials for tag mining

Mining interest tags mainly deals with text data, the most common form of information in Internet products.

  • User-side text data:
    • Text in registration information, such as name, personal signature, etc.;
    • User-generated content, such as comments, posts, etc.;
    • Text the user has interacted with, such as content they have read.
  • Item-side text data:
    • Title and description of the item
    • The content of the item itself
    • Text in the item's basic attributes

2.2 What the tag library should look like

The quality of a tag library is measured along three dimensions:

  • Tag coverage:
    the more items and users covered by the tags, the better, so that no traffic is wasted;
  • Tag health:
    quantifies how evenly tags cover items on average. The number of items covered by a single tag typically follows Zipf's law. A good tag library has higher entropy in its coverage distribution; the higher the entropy, the more uniform the coverage;
  • Tag economy:
    the smaller the semantic similarity between tags, the better.
    A good tag library is like a cube in a high-dimensional space: it spans a complete space, and the tags are mutually independent.
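The tag-health criterion can be checked numerically. A small sketch (the item counts are made up) computing the entropy of the tag-coverage distribution:

```python
import math

def coverage_entropy(items_per_tag):
    """Entropy of the tag -> item-count distribution; higher = more uniform."""
    total = sum(items_per_tag)
    probs = (n / total for n in items_per_tag)
    return -sum(p * math.log(p) for p in probs if p > 0)

skewed = [1000, 10, 5, 1]        # Zipf-like: a few tags dominate
uniform = [254, 254, 254, 254]   # same total, spread evenly
```

Under this measure the uniform library scores strictly higher than the skewed one, matching the "higher entropy, more uniform coverage" criterion above.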

There are two schools of tag-library construction:

  • Centralized
    The tag library is built centrally by professionals, also called a professional taxonomy.
  • Decentralized
    Relies on "collective wisdom": users contribute their own tags.
| Contrast dimension | Centralized | Decentralized |
| ------------------ | ----------- | ------------- |
| Tag coverage       | Small       | Large         |
| Coverage health    | Good (even) | Poor (skewed) |
| Tag economy        | Good (relatively independent) | Poor (many synonyms and near-synonyms) |

Therefore, the two should be combined to build a high-quality tag library.

2.3 Tag mining method

2.3.1 Keyword extraction

  • TF-IDF
  • TextRank
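A minimal TF-IDF keyword extractor over a toy corpus (the token lists are invented; a production system would tokenize real text and smooth the IDF):

```python
import math
from collections import Counter

def tfidf_keywords(doc, corpus, top_k=2):
    """doc: list of tokens; corpus: list of token lists (doc included)."""
    tf = Counter(doc)
    scores = {}
    for term, count in tf.items():
        df = sum(1 for d in corpus if term in d)   # document frequency
        idf = math.log(len(corpus) / df)           # rare terms score higher
        scores[term] = (count / len(doc)) * idf
    return [t for t, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]]

corpus = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "slept"]]
keywords = tfidf_keywords(corpus[0], corpus)
```

Note how "the", which appears in every document, gets an IDF of zero and is never selected as a keyword.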

2.3.2 Embedding vector

2.3.3 Text classification

2.3.4 Named entity recognition

2.3.5 Text clustering

2.3.6 Label selection

3. Content-based recommendations

To do content-based recommendation well, you need to crawl, clean, mine, and compute:

  • Crawl: capturing external data is indispensable for supplementing content sources and adding analysis dimensions.
  • Clean: redundant content, spam, and politically or sexually sensitive content must be cleaned out.
  • Mine: the content must be mined in depth to improve the effectiveness of the recommendation system.
  • Compute: match user interests against item attributes to compute a reasonable relevance score. This is the mission of any recommendation system, not just content-based ones.

[Figure: content-based recommendation framework]
The framework analyzes content on the source side to produce a structured content library and content-analysis models.

3.1 Content source

Crawl data to supplement the recommendation system's daily content consumption: only with growing content diversity does a recommendation system have legitimacy.

3.2 Content analysis

For content-based recommendation, the most important thing is not the recommendation algorithm but content mining and analysis. If the content is mined deeply enough, even hard-coded rules can work. The deeper the content analysis, the finer the user segments that can be captured, the higher the recommendation conversion rate, and the better users feel about the product.
Content analysis has two outputs:

  • Structured content library
  • Content analysis model

The most important use of structured content library is to learn user portraits in combination with user feedback behavior.
Models obtained during content analysis include, for example:

  • Classification model
  • Topic model
  • Entity recognition model
  • Embedding model
    The main application scenario of these models: when a new item arrives it must be recommended in real time, so its content is analyzed on the fly, structured attributes are extracted, and then matched against user portraits.

3.3 Content recommendation algorithm

  • Simplest: similarity computation
    Express the user portrait as a sparse vector; the content side has a corresponding sparse vector. Compute the cosine similarity between the two and rank candidate items by similarity.
  • Utilize the structured information in the content
    Different fields carry different importance; for example, a match on the title usually matters more than a match in the body text.
  • Learning-based algorithms
    The typical scenario is improving the conversion rate of certain behaviors, such as clicks, favorites, and shares. The standard approach is to collect log data for the behavior, convert it into training samples, and train a prediction model. Each sample has two parts. The first is the features, including the user-side portrait content, the item-side structured content, and optionally some context at logging time, such as time, location, and device. The second is the label: the user behavior, of which there are two kinds, "with feedback" and "without feedback". With such samples a binary classifier is trained; commonly used models are LR, GBDT, or a combination of the two. At recommendation time, estimate the probability of the target behavior and rank by that probability.

Reference: "Recommendation System" Chen Kaijiang

Origin blog.csdn.net/weixin_44127327/article/details/109768237