
Recommendation Algorithm[1]

Traditional Machine Learning Recommendation Algorithms

Popularity-Based Recommendation Algorithm

Simple and crude: like the hot lists on major news sites or Weibo, items are ranked by popularity metrics such as PV (page views), UV (unique visitors), daily average VV (visit counts), or share rate, and the top items are recommended to every user in that order.

Advantages: works for newly registered users, who have no behavior history yet.

Disadvantages: cannot provide personalized recommendations.

Improvement: this algorithm can be refined by ranking popularity within user groups, e.g. showing trending sports content to sports fans first and pushing hot political posts to users who like discussing politics.

Recommendation Algorithm Based on Collaborative Filtering

Inverted index

An inverted index finds, among a large number of documents, the set of documents containing given words, in O(1) or O(log n) time per lookup. In recommendation, the ad library or item collection is indexed so that relevant ads or items can be retrieved quickly given some targeting conditions.

  1. Gather the documents (items) that need to be indexed.
  2. Word segmentation for each document.
  3. Preprocess the document.
  4. Create an inverted index over all documents; the index consists of a term dictionary and the list of all posting records.
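A minimal sketch of these four steps, where a plain whitespace split stands in for real word segmentation and preprocessing:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: {doc_id: text}. Returns {term: sorted list of doc_ids}."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():   # steps 2-3: segment + normalize
            index[term].add(doc_id)         # step 4: posting list per term
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "red shoes sale", 2: "red hat", 3: "running shoes"}
index = build_inverted_index(docs)
print(index["red"])     # documents containing "red" -> [1, 2]
print(index["shoes"])   # -> [1, 3]
```

Each lookup is then a single dictionary access, which is what gives the O(1) retrieval mentioned above.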



TF-IDF (term frequency-inverse document frequency):

Each document is described as a bag-of-words (BOW) vector, which makes similarity measurement convenient: build a vocabulary, and represent each article as a vector whose length is the vocabulary size, where each element is the count of the corresponding word in the document. The inverse document frequency handles words that appear in nearly every article: such words contribute nothing to distinguishing one article from another, so their contribution to the vector should be down-weighted.

IDF(m) = log(N / DF(m)), where N is the total number of articles and DF(m) is the number of articles in which word m appears.

For each word t in document d, compute tf-idf_{t,d} = tf_{t,d} × idf_t: the word's frequency in the current document multiplied by its inverse document frequency.

  • If a word appears in only a few documents but frequently in the current one, it is more representative of the current document and gets a larger weight.
  • If a word appears rarely in the current document, or appears in many documents, its weight is small.

What is this used for? Describing an article as a vector, and computing the similarity between articles.
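The weighting just described can be sketched as follows (the toy corpus is purely illustrative):

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per doc.

    Uses idf(m) = log(N / df(m)) and tf-idf = tf * idf as defined above.
    """
    N = len(docs)
    # document frequency: in how many docs each term appears
    df = Counter(term for doc in docs for term in set(doc))
    out = []
    for doc in docs:
        tf = Counter(doc)                   # raw term counts in this doc
        out.append({t: tf[t] * math.log(N / df[t]) for t in tf})
    return out

docs = [["cat", "sat", "mat"], ["cat", "cat", "dog"], ["dog", "runs"]]
weights = tfidf(docs)
# "cat" appears in 2 of 3 docs and twice in doc 1: weight = 2 * log(3/2)
print(round(weights[1]["cat"], 4))  # -> 0.8109
```

A term like "cat" that occurs in most documents ends up with a low weight, while a term unique to one document keeps a high one, exactly as the two bullet points above describe.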

User-based Collaborative Filtering Algorithm Description

How to find people with similar hobbies? Calculate data similarity!

  1. Jaccard similarity coefficient

    J(A, B) = |A ∩ B| / |A ∪ B|

  2. Cosine similarity

    The cosine of the angle between vectors a and b:

    cos(θ) = (a · b) / (|a| |b|)

  3. Other methods: Euclidean distance, Manhattan distance
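The measures listed above can be sketched directly:

```python
import math

def jaccard(a, b):
    """Jaccard coefficient of two sets: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)

def cosine(a, b):
    """Cosine of the angle between two vectors: (a·b) / (|a||b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(jaccard({1, 2, 3}, {2, 3, 4}))  # 2 shared / 4 total -> 0.5
print(cosine([1, 0], [0, 1]))         # orthogonal vectors -> 0.0
print(euclidean([0, 0], [3, 4]))      # -> 5.0
```

Jaccard works on sets (e.g. sets of liked items), while cosine and the distances work on rating vectors.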


How to find the most similar people first? Calculate similarity!

sim(i, j) = Σ_{x ∈ I(i,j)} (R(i,x) − R̄(i)) (R(j,x) − R̄(j)) / ( √(Σ_{x ∈ I(i,j)} (R(i,x) − R̄(i))²) · √(Σ_{x ∈ I(i,j)} (R(j,x) − R̄(j))²) )

This formula computes the similarity between user i and user j. I(i, j) is the set of items that both users have rated; R(i, x) is user i's rating of item x; R̄(i) is the average of all of user i's ratings. The average is subtracted because some users rate strictly while others rate leniently: normalizing the ratings this way keeps those habits from distorting the similarity.

After finding the most similar users, compute a similarity-weighted score for each item the target user has not interacted with yet, and recommend the items with the highest scores.

  1. Construct the posting (inverted) list.
  2. Build the co-occurrence matrix.
  3. Calculate the similarity between users.
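The user-based pipeline can be sketched end to end; the ratings below are made-up toy data, and the similarity is the mean-centered (Pearson-style) form described above:

```python
import math

ratings = {  # user -> {item: rating}; toy data, purely illustrative
    "u1": {"a": 5, "b": 3, "c": 4},
    "u2": {"a": 4, "b": 2, "c": 5, "d": 4},
    "u3": {"a": 1, "b": 5, "d": 2},
}

def mean(u):
    r = ratings[u]
    return sum(r.values()) / len(r)

def sim(u, v):
    """Mean-centered similarity over the items both users rated."""
    common = ratings[u].keys() & ratings[v].keys()   # I(u, v)
    if not common:
        return 0.0
    mu, mv = mean(u), mean(v)
    num = sum((ratings[u][i] - mu) * (ratings[v][i] - mv) for i in common)
    du = math.sqrt(sum((ratings[u][i] - mu) ** 2 for i in common))
    dv = math.sqrt(sum((ratings[v][i] - mv) ** 2 for i in common))
    return num / (du * dv) if du and dv else 0.0

def predict(u, item):
    """User mean plus similarity-weighted neighbour deviations."""
    neigh = [v for v in ratings if v != u and item in ratings[v]]
    num = sum(sim(u, v) * (ratings[v][item] - mean(v)) for v in neigh)
    den = sum(abs(sim(u, v)) for v in neigh)
    return mean(u) + num / den if den else mean(u)

print(round(predict("u1", "d"), 2))  # score for an item u1 has not rated
```

u2, who rates like u1, pulls the prediction up; u3, whose tastes are opposite, has negative similarity and pulls it the other way.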

Item-based Collaborative Filtering Algorithm Description


Calculation formula:

w(i, j) = |N(i) ∩ N(j)| / √(|N(i)| · |N(j)|)

|N(i)| is the number of users who like item i, |N(j)| the number who like item j, and |N(i) ∩ N(j)| the number of users who like both. The formula says that items i and j are similar when many users like both of them: the more users who like the two items simultaneously, the higher the similarity.

Existing problems:

  • Sparse matrix problem

When there are thousands of users and items, the rating matrix becomes very large and mostly empty. Processing such a matrix directly wastes both memory and time.

  1. Construct the posting (inverted) list.
  2. Build the co-occurrence matrix.
  3. Calculate the similarity between items.
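A sketch of these steps, assuming the common cosine-normalized form w(i, j) = |N(i) ∩ N(j)| / √(|N(i)|·|N(j)|), on made-up like-lists:

```python
import math
from collections import defaultdict

# step 1: per-user like-lists (the "posting list" view of the data)
user_items = {"u1": ["a", "b"], "u2": ["a", "b", "c"], "u3": ["b", "c"]}

# invert to N(i): the set of users who like item i
likers = defaultdict(set)
for u, items in user_items.items():
    for i in items:
        likers[i].add(u)

def item_sim(i, j):
    """w(i, j) = |N(i) ∩ N(j)| / sqrt(|N(i)| * |N(j)|)."""
    co = len(likers[i] & likers[j])          # step 2: co-occurrence count
    return co / math.sqrt(len(likers[i]) * len(likers[j]))

print(round(item_sim("a", "b"), 4))  # 2 / sqrt(2 * 3)
```

Because only co-occurring pairs need counting, a real implementation iterates over each user's like-list instead of materializing the full sparse user-item matrix, which sidesteps the sparsity problem above.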

Latent Factor Model

Matrix Factorization Model

For m users and n items there is an m × n matrix whose elements are the users' preferences for (or ratings of) the items. Decompose this matrix into two matrices: an m × k matrix and a k × n matrix. Each user and each item is then represented by a k-dimensional vector, and the inner product of a user vector and an item vector gives that user's preference for that item. The same vectors also allow computing user–user and item–item similarities.


Solution:

  1. Eigenvalue decomposition
  2. Singular value decomposition
  3. Gradient descent
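The gradient-descent option can be sketched in plain Python; the ratings, dimensions, and hyperparameters below are illustrative, not tuned:

```python
import random

random.seed(0)

R = {(0, 0): 5, (0, 1): 3, (1, 0): 4, (2, 1): 1}  # (user, item) -> rating
m, n, k, lr, reg = 3, 2, 2, 0.05, 0.01            # sizes, step, L2 penalty

# P is m x k (user factors), Q is k x n (item factors), so P @ Q is m x n
P = [[random.random() for _ in range(k)] for _ in range(m)]
Q = [[random.random() for _ in range(n)] for _ in range(k)]

def pred(u, i):
    """Inner product of user u's and item i's k-dimensional vectors."""
    return sum(P[u][f] * Q[f][i] for f in range(k))

for _ in range(500):
    for (u, i), r in R.items():
        e = r - pred(u, i)                    # prediction error on this cell
        for f in range(k):                    # SGD step with regularization
            pu, qi = P[u][f], Q[f][i]
            P[u][f] += lr * (e * qi - reg * pu)
            Q[f][i] += lr * (e * pu - reg * qi)

print(round(pred(0, 0), 1))  # should land close to the observed rating 5
```

Only the observed cells drive the updates, so unlike eigenvalue or singular value decomposition this approach never needs the blank entries of the sparse matrix.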

In the interest-classification view, items are grouped into interest classes. For a given user, first obtain his interest classes, then select items he may like from those classes.

Formula: p(u, i) = r_{ui} = p_u^T q_i = Σ_{k=1}^{K} p_{u,k} · q_{i,k}

p_{u,k} is user u's preference for class k, and q_{i,k} is the degree to which item i belongs to class k. This is in fact a matrix factorization.

  1. How are items classified?
  2. How do we determine which classes a user is interested in, and to what degree?
  3. Within a given class, which items are selected for recommendation, and how are their weights within the class determined?

Graph-Based Model


Two vertices are highly relevant when:

  • Many paths connect the two vertices.
  • The paths connecting them are short.
  • The paths connecting them do not pass through vertices with large out-degrees.
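These three properties motivate random-walk scores such as PersonalRank on the user-item bipartite graph; a minimal sketch, where the graph and damping factor are illustrative assumptions:

```python
graph = {  # undirected bipartite graph: user <-> liked items
    "u1": ["a", "b"], "u2": ["b", "c"], "u3": ["c"],
    "a": ["u1"], "b": ["u1", "u2"], "c": ["u2", "u3"],
}

def personal_rank(root, alpha=0.8, iters=50):
    """Random walk with restart at `root`; returns a relevance score per vertex."""
    rank = {v: 0.0 for v in graph}
    rank[root] = 1.0
    for _ in range(iters):
        nxt = {v: 0.0 for v in graph}
        for v, out in graph.items():
            for w in out:
                nxt[w] += alpha * rank[v] / len(out)  # split by out-degree
        nxt[root] += 1 - alpha                        # restart at the root
        rank = nxt
    return rank

scores = personal_rank("u1")
print(scores["b"] > scores["c"])  # b is one hop away, c is three hops
```

More paths and shorter paths accumulate more probability mass, and every hop divides the mass by the vertex's out-degree, matching the three bullet points above.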

Content-Based Recommendation Algorithms

  1. Concept

    A recommendation model is built from item-related information, user-related information, and the user's actions on items. Item-related information may be metadata, tags, user comments, or manually labeled textual descriptions of the item. User-related information is demographic information such as age, gender, preferences, location, and income. A user's actions on items include commenting, favoriting, liking, viewing, browsing, clicking, adding to cart, and purchasing. Content-based algorithms generally rely only on the user's own behavior to make recommendations, not on other users' behavior.

  2. Realization principle


  3. Differences from item-based collaborative filtering

    Collaborative filtering recommends purely from the relationships between users and items and ignores item attributes, whereas a content-based algorithm takes the item's own attributes as the basis, using the item's features to find similar items.

  4. There are generally three steps:

    1. Construct a user feature representation from user information and user behavior.

    2. Construct an item feature representation from item information.

    3. Recommend items to the user based on the user and item feature representations.

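The three steps can be sketched with tag-based item features; the items, tags, and viewing history are made up:

```python
import math

item_tags = {                                   # step 2: item information
    "m1": {"action", "scifi"}, "m2": {"action", "crime"},
    "m3": {"romance"},
}
vocab = sorted({t for tags in item_tags.values() for t in tags})

def vec(tags):
    """Binary tag vector over the vocabulary (item feature representation)."""
    return [1.0 if t in tags else 0.0 for t in vocab]

def profile(history):
    """Step 1: user feature representation = mean of consumed item vectors."""
    vs = [vec(item_tags[i]) for i in history]
    return [sum(col) / len(vs) for col in zip(*vs)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

user = profile(["m1"])                          # the user watched m1
ranked = sorted((i for i in item_tags if i != "m1"),
                key=lambda i: cosine(user, vec(item_tags[i])), reverse=True)
print(ranked[0])  # step 3: m2 shares "action" with m1, so it ranks first
```

Note that only this user's history is used, which is the key difference from collaborative filtering noted above.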

Logistic regression

Offline training: various feature combinations are trained with Spark, sklearn, TensorFlow, etc. For example, training on the crossed feature of a user feature (uid) and an item feature (item) yields a separating hyperplane for that uid × item cross, and the learned weights are saved.

Online, a request may produce many feature crosses; the corresponding weights are looked up in the saved weight file and combined directly to obtain the classification result.
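The offline/online split can be sketched as a weight-table lookup plus a sigmoid; the crossed feature names and weights below are invented for illustration, standing in for values an offline trainer would produce:

```python
import math

weights = {                      # "uid x item" crossed features -> weight
    "uid=42&item=shoes": 1.2,    # illustrative, not trained values
    "uid=42&item=hats": -0.7,
    "bias": -0.3,
}

def score(features):
    """Online scoring: look up each cross's weight, sum, apply sigmoid."""
    z = weights.get("bias", 0.0)
    for f in features:
        z += weights.get(f, 0.0)           # unseen crosses contribute 0
    return 1.0 / (1.0 + math.exp(-z))      # sigmoid -> click probability

print(score(["uid=42&item=shoes"]) > 0.5)  # positive total weight
```

The online path never re-runs training: it is a dictionary lookup and a dot product, which is what makes LR cheap enough to serve at request time.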

LS-PLM (Large-Scale Piecewise Linear Model)

Also called MLR (Mixed Logistic Regression).

It can be understood as an ensemble method.

Clustering: first partition the samples into groups (for example, by user attributes such as male/female) and train one LR model per group. For a new sample, clustering yields a probability of belonging to each group; these probabilities are used as weights to combine the outputs of the per-group LR models.
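A sketch of this mixture, with hand-picked (untrained) gate and per-group LR parameters:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mlr(x, U, W):
    """x: feature vector; U: gate weights (m x d); W: per-gate LR weights.

    The gates form a softmax over the m groups; each group's probability
    weights the output of that group's LR model.
    """
    zs = [sum(u_i * x_i for u_i, x_i in zip(u, x)) for u in U]
    mx = max(zs)
    exps = [math.exp(z - mx) for z in zs]      # shifted for stability
    gates = [e / sum(exps) for e in exps]      # softmax over groups
    return sum(g * sigmoid(sum(w_i * x_i for w_i, x_i in zip(w, x)))
               for g, w in zip(gates, W))      # weighted LR mixture

U = [[1.0, -1.0], [-1.0, 1.0]]   # two gates (illustrative)
W = [[2.0, 0.0], [0.0, -2.0]]    # one LR per gate (illustrative)
print(round(mlr([1.0, 0.0], U, W), 3))
```

Each LR is linear in its region, but the softmax gating makes the combined model piecewise linear, hence the name.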

Recommendation Algorithm Based on Feature Intersection

In an ordinary linear model each feature is considered independently, and the linear combination of features passed through the activation function gives only weak interaction between features. In practice, however, large numbers of features are correlated. Crossing features:

  • enhances non-linearity;
  • makes the information more discriminative.
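A minimal sketch of explicit second-order feature crossing (the field names are illustrative):

```python
from itertools import combinations

def cross_features(features):
    """features: list of 'field=value' strings.

    Returns the original features plus every pairwise cross, so a linear
    model can assign a weight to each interaction rather than only to
    independent features.
    """
    crossed = [f"{a}&{b}" for a, b in combinations(sorted(features), 2)]
    return list(features) + crossed

out = cross_features(["gender=f", "category=sports"])
print(out)  # original features plus the "category=sports&gender=f" cross
```

The cross "category=sports&gender=f" carries information neither feature carries alone, which is the discriminative gain the bullets above refer to.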


Source: blog.csdn.net/no1xiaoqianqian/article/details/128713066