Recommendation Algorithm[1]
Traditional Machine Learning Recommendation Algorithms
Popularity-Based Recommendation Algorithm
Simple and direct, like the hot lists on major news sites or Weibo: items are ranked by metrics such as PV (page views), UV (unique visitors), daily average VV (visit/video views), or share rate, and the most popular items are recommended to every user.
Advantages: works for newly registered users who have no behavior history yet.
Disadvantages: cannot provide personalized recommendations.
Improvement: the ranking can be refined with group-level popularity, e.g. show sports stories from the hot list to sports fans first, and push trending political posts to users who like to discuss politics.
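A minimal sketch of popularity ranking with an optional group boost. The toy interaction log, the `group_of_item` tag map, and the doubling factor are all illustrative assumptions, not part of any real system:

```python
from collections import Counter

def popularity_recommend(interactions, user_group=None, group_of_item=None, k=3):
    """Rank items by raw interaction count; optionally boost items
    whose tag matches the user's interest group."""
    counts = Counter(item for _, item in interactions)

    def score(item):
        base = counts[item]
        # hypothetical boost: double the score of items in the user's group
        if user_group and group_of_item and group_of_item.get(item) == user_group:
            base *= 2
        return base

    return sorted(counts, key=score, reverse=True)[:k]

interactions = [("u1", "a"), ("u2", "a"), ("u3", "b"),
                ("u4", "b"), ("u5", "b"), ("u6", "c")]
groups = {"a": "sports", "b": "politics", "c": "sports"}
print(popularity_recommend(interactions, k=2))                    # plain hot list
print(popularity_recommend(interactions, "sports", groups, k=2))  # sports items float up
```

The same sorted hot list is served to everyone; only the optional group boost makes the order differ between user groups.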
Recommendation Algorithm Based on Collaborative Filtering
Inverted index
An inverted index finds, among a large number of documents, the set of documents containing given terms, with lookup in O(1) or O(log n) time. In recommendation, the ad library or item collection is indexed so that relevant ads or items can be retrieved quickly given some targeting conditions.
- Gather the documents (items) to be indexed.
- Segment each document into words.
- Preprocess each document.
- Build an inverted index over all documents; the index consists of a term dictionary and the posting lists of inverted records.
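The steps above can be sketched in a few lines of Python; a naive whitespace-and-lowercase tokenizer stands in for real word segmentation and preprocessing:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: {doc_id: text}. Returns term -> sorted list of doc_ids (posting list)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():  # toy tokenizer in place of real segmentation
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "red shoes sale", 2: "running shoes", 3: "red hat"}
index = build_inverted_index(docs)
print(index["shoes"])  # [1, 2]
print(index["red"])    # [1, 3]
# AND query over targeting conditions: intersect the posting lists
print(sorted(set(index["red"]) & set(index["shoes"])))  # [1]
```

The intersection at the end is exactly the "retrieve items matching several targeting conditions" operation described above.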
TF-IDF (term frequency-inverse document frequency):
Each document is described with a bag of words (BOW) in vector form, which makes similarity measurement convenient. Build a vocabulary; each article then corresponds to a vector whose length equals the vocabulary size, where each element is the number of times the corresponding word appears in the document. The inverse document frequency corrects a problem with raw counts: if a word appears in every article, it contributes nothing to the similarity between articles and cannot distinguish one article from another, so its contribution to the vector should be lowered.
$IDF(m)=\log\frac{N}{DF(m)}$, where $N$ is the total number of articles and $DF(m)$ is the number of articles in which term $m$ appears.
For each term in a document we compute $\text{tf-idf}_{t,d}=tf_{t,d}\times idf_t$: the term's frequency in the current document multiplied by its inverse document frequency.
- If a word appears in only a few documents but frequently in the current one, it is highly representative of the current document and receives a large weight.
- If a word appears rarely in the current document, or appears in many documents, its weight is small.
What is TF-IDF used for? Describing an article, and computing the similarity between articles.
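A small sketch of this weighting on toy token lists, using the raw count as tf and log(N/DF(m)) as idf, exactly as defined above:

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: tf-idf weight} dict per document."""
    n = len(docs)
    # document frequency: in how many documents each term appears
    df = Counter(term for doc in docs for term in set(doc))
    out = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency in this document
        out.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return out

docs = [["apple", "banana", "apple"],
        ["banana", "cherry"],
        ["apple", "cherry", "cherry"]]
weights = tf_idf(docs)
# "apple" appears twice in doc 0 and in 2 of 3 documents overall
print(round(weights[0]["apple"], 3))
```

Each resulting dict is the sparse BOW vector of one document; cosine similarity between two such vectors gives the article similarity mentioned above.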
User-based Collaborative Filtering Algorithm Description
How to find people with similar hobbies? Calculate data similarity!
- Jaccard similarity coefficient
  $J(A,B)=\frac{|A\cap B|}{|A\cup B|}$
- Cosine similarity
  The cosine of the angle between vectors $a$ and $b$:
  $\cos(\theta)=\frac{a\cdot b}{|a||b|}$
- Other measures: Euclidean distance, Manhattan distance
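Both measures are straightforward to compute directly; a sketch on toy inputs:

```python
import math

def jaccard(a, b):
    """Jaccard coefficient between two sets: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

print(jaccard({"a", "b", "c"}, {"b", "c", "d"}))  # 2 shared / 4 total = 0.5
print(cosine([1, 0, 1], [1, 1, 0]))               # 1 / (sqrt(2)*sqrt(2)) = 0.5
```

Jaccard fits implicit feedback (sets of liked items); cosine fits explicit ratings represented as vectors.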
How to find the most similar people first? Calculate similarity!
The similarity between user $i$ and user $j$ is computed over their co-rated items (Pearson correlation):
$sim(i,j)=\frac{\sum_{x\in I(i,j)}\left(R(i,x)-\bar R(i)\right)\left(R(j,x)-\bar R(j)\right)}{\sqrt{\sum_{x\in I(i,j)}\left(R(i,x)-\bar R(i)\right)^2}\sqrt{\sum_{x\in I(i,j)}\left(R(j,x)-\bar R(j)\right)^2}}$
Here $I(i,j)$ is the set of items that user $i$ and user $j$ have both rated, $R(i,x)$ is user $i$'s rating of item $x$, and $\bar R(i)$ is the average of all of user $i$'s ratings. The average is subtracted because some users rate strictly while others rate leniently; mean-centering normalizes the ratings so these biases do not distort the similarity.
After finding the most similar users, compute a weighted score for each item the target user has not interacted with yet, based on those similar users' ratings, and recommend the items with the highest scores.
- Build the posting lists
- Build the co-occurrence matrix
- Compute the similarity between users
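Putting the steps together, a minimal user-based CF sketch on a toy rating dictionary: mean-centered Pearson similarity between users, then a weighted, mean-adjusted prediction for an unseen item. The data and variable names are invented for illustration:

```python
import math

ratings = {  # user -> {item: rating}
    "u1": {"a": 5, "b": 3, "c": 4},
    "u2": {"a": 4, "b": 3, "c": 5, "d": 4},
    "u3": {"a": 1, "b": 5},
}

def mean(u):
    vals = ratings[u].values()
    return sum(vals) / len(vals)

def pearson(i, j):
    """Similarity over co-rated items, mean-centered to cancel rating-scale bias."""
    common = ratings[i].keys() & ratings[j].keys()
    if not common:
        return 0.0
    mi, mj = mean(i), mean(j)
    num = sum((ratings[i][x] - mi) * (ratings[j][x] - mj) for x in common)
    den = math.sqrt(sum((ratings[i][x] - mi) ** 2 for x in common)) * \
          math.sqrt(sum((ratings[j][x] - mj) ** 2 for x in common))
    return num / den if den else 0.0

def predict(u, item):
    """Weighted average of neighbors' mean-centered ratings for an unseen item."""
    neighbors = [(pearson(u, v), v) for v in ratings if v != u and item in ratings[v]]
    den = sum(abs(s) for s, _ in neighbors)
    if not den:
        return mean(u)
    num = sum(s * (ratings[v][item] - mean(v)) for s, v in neighbors)
    return mean(u) + num / den

# u2's rating of "d" equals u2's own mean, so u1 is predicted u1's own mean: 4.0
print(round(predict("u1", "d"), 2))
```

The co-occurrence bookkeeping from the step list is implicit here in the dictionary-key intersections; a real system would precompute it with inverted lists.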
Item-based collaborative filtering algorithm description
Calculation formula:
$w_{ij}=\frac{|N(i)\cap N(j)|}{\sqrt{|N(i)|\,|N(j)|}}$
$|N(i)|$ is the number of users who like item $i$, $|N(j)|$ is the number of users who like item $j$, and $|N(i)\cap N(j)|$ is the number of users who like both. From the formula we can see that items $i$ and $j$ are considered similar because they are liked by the same users: the more users who like both items, the higher the similarity.
Existing problems:
- Sparse matrix problem: when there are thousands of users and items with ratings, the matrix becomes very large and mostly empty. Processing such a matrix directly wastes both memory and time.
- Build the posting lists
- Build the co-occurrence matrix
- Compute the similarity between items
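A sketch of the item-similarity computation on toy posting lists, assuming the common normalization $w_{ij}=|N(i)\cap N(j)|/\sqrt{|N(i)||N(j)|}$, which penalizes universally popular items:

```python
import math
from collections import defaultdict

user_items = {"u1": {"a", "b"}, "u2": {"a", "b", "c"}, "u3": {"b", "c"}}

# co-occurrence counts |N(i) ∩ N(j)| and per-item popularity |N(i)|
co = defaultdict(int)
n = defaultdict(int)
for items in user_items.values():
    for i in items:
        n[i] += 1
        for j in items:
            if i != j:
                co[(i, j)] += 1

def item_sim(i, j):
    """w_ij = |N(i) ∩ N(j)| / sqrt(|N(i)| * |N(j)|)."""
    return co[(i, j)] / math.sqrt(n[i] * n[j])

# "a" and "b" are liked together by u1 and u2
print(round(item_sim("a", "b"), 3))
```

Only non-zero co-occurrence pairs are stored, which sidesteps the sparse-matrix waste mentioned above.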
Latent Factor Model
Matrix Factor Model
For $m$ users and $n$ items there is an $m\times n$ matrix whose elements are the users' preferences or ratings for the items. Decompose this matrix into two matrices, one $m\times k$ and the other $k\times n$. Each user and each item is then represented by a $k$-dimensional vector, and the inner product of a user vector and an item vector gives that user's preference for that item. The same vectors can also be used to compute the similarity between users and the similarity between items.
Solution:
- Eigenvalue decomposition
- Singular value decomposition (SVD)
- Gradient descent
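A gradient-descent sketch of the factorization on a toy sparse rating dictionary. The learning rate, regularization strength, latent dimension, and step count are arbitrary illustrative choices:

```python
import random

def factorize(R, k=2, steps=3000, lr=0.01, reg=0.02):
    """Factor sparse ratings R[(user, item)] = r into user vectors P and
    item vectors Q by SGD on squared error with L2 regularization."""
    random.seed(0)
    users = sorted({u for u, _ in R})
    items = sorted({i for _, i in R})
    P = {u: [random.gauss(0, 0.1) for _ in range(k)] for u in users}
    Q = {i: [random.gauss(0, 0.1) for _ in range(k)] for i in items}
    for _ in range(steps):
        for (u, i), r in R.items():
            err = r - sum(pu * qi for pu, qi in zip(P[u], Q[i]))
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += lr * (err * qi - reg * pu)  # step along the gradient
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q

R = {("u1", "a"): 5, ("u1", "b"): 3, ("u2", "a"): 4,
     ("u2", "c"): 1, ("u3", "b"): 2}
P, Q = factorize(R)
pred = sum(p * q for p, q in zip(P["u1"], Q["a"]))
print(round(pred, 1))  # close to the observed rating of 5
```

Only observed entries are iterated, so the blank cells of the sparse matrix never need to be materialized.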
The interest-classification approach groups items into interest classes. For a given user, first determine his interest classes, then select items he may like from those classes.
Formula: $p(u,i)=r_{ui}=p_u^Tq_i=\sum_{k=1}^{F}p_{u,k}q_{i,k}$
$p_{u,k}$ is user $u$'s preference for class $k$, and $q_{i,k}$ is the probability that item $i$ belongs to class $k$. This is in fact a matrix-factorization process.
- How to classify items?
- How to determine which types of items the user is interested in, and the degree of interest?
- For a given class, which items belonging to that class should be selected for recommendation, and how should each item's weight within the class be determined?
Graph-based Model
Highly relevant features:
- There are many paths connecting two vertices.
- The length of the path connecting two vertices is relatively short.
- The path linking the two vertices does not pass through vertices with a large out-degree.
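These three signals are what random-walk models such as PersonalRank capture: restarting at the user favors short paths, summing over iterations rewards many paths, and dividing a vertex's mass by its out-degree penalizes high-out-degree vertices. A sketch on a toy user–item bipartite graph (data invented for illustration):

```python
def personal_rank(graph, root, alpha=0.8, iters=50):
    """Random walk with restart on a bipartite graph.
    graph: node -> list of neighbors. Returns node -> visit probability."""
    rank = {node: 0.0 for node in graph}
    rank[root] = 1.0
    for _ in range(iters):
        new = {node: 0.0 for node in graph}
        for node, neighbors in graph.items():
            share = alpha * rank[node] / len(neighbors)  # out-degree dilutes each path
            for nb in neighbors:
                new[nb] += share
        new[root] += 1 - alpha  # restart at the target user
        rank = new
    return rank

graph = {
    "u1": ["a", "b"], "u2": ["b", "c"], "u3": ["c"],
    "a": ["u1"], "b": ["u1", "u2"], "c": ["u2", "u3"],
}
rank = personal_rank(graph, "u1")
# item "b" is one hop from u1, item "c" only reachable via longer paths
print(rank["b"] > rank["c"])
```

Items u1 has not yet interacted with are then ranked by their walk probability.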
Content-Based Recommendation Algorithms
- Concept
A recommendation model is built from item information, user information, and the user's actions on items, and is used to serve recommendations. Item information here may be metadata, tags, user comments, or manually labeled text describing the item. User information refers to demographic attributes (such as age, gender, preferences, region, and income). The user's actions on items include commenting, favoriting, liking, watching, browsing, clicking, adding to cart, and purchasing. Content-based recommendation algorithms generally rely only on the user's own behavior to provide recommendations and do not involve the behavior of other users.
- Implementation principle
- Difference from the item-based collaborative filtering algorithm
The collaborative filtering algorithm recommends purely from the relationships between users and items and does not consider the attributes of the items themselves, whereas the content-based algorithm does consider the item's own attributes and uses them as the benchmark for finding similar items.
- There are generally three steps:
  - Build a user feature representation from user information and the user's behavior.
  - Build an item feature representation from item information.
  - Recommend items to the user based on the user and item feature representations.
- Logistic regression
Offline training: tools such as Spark, sklearn, or TensorFlow are used to train on various feature combinations. For example, training on the cross of a user feature (uid) and an item feature (item) yields a hyperplane over the crossed uid-item features, and the learned weights are saved.
Online serving: the incoming feature combinations are looked up in the saved weight file, the corresponding weights are retrieved, and the score is computed directly to obtain the classification result.
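A self-contained sketch of this train-offline / look-up-online pattern, using a tiny pure-Python SGD logistic regression over named cross features in place of Spark/sklearn/TensorFlow. The feature names, labels, learning rate, and epoch count are invented for illustration:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# "Offline": each sample is (set of active cross-feature names, click label);
# training stores one weight per cross feature, like a saved weight file.
samples = [
    ({"u1_x_itemA"}, 1), ({"u1_x_itemB"}, 0),
    ({"u2_x_itemA"}, 0), ({"u2_x_itemB"}, 1),
]
weights = {}
for _ in range(200):
    for feats, y in samples:
        p = sigmoid(sum(weights.get(f, 0.0) for f in feats))
        for f in feats:
            weights[f] = weights.get(f, 0.0) + 0.1 * (y - p)  # SGD on log loss

# "Online": look up the saved weights for the active features and score directly.
def score(feats):
    return sigmoid(sum(weights.get(f, 0.0) for f in feats))

print(score({"u1_x_itemA"}) > 0.5, score({"u2_x_itemA"}) < 0.5)
```

Because the features are sparse and named, serving needs only a dictionary lookup and a dot product, with no model framework in the request path.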
LS-PLM (Large-Scale Piecewise Linear Model)
Also called MLR (Mixed Logistic Regression)
It can be understood as an integrated algorithm.
Clustering: partition the samples, for example by a user attribute such as gender (male/female), and train one LR model per segment. After clustering, each segment has a membership probability for a given sample; these probabilities are used as weights to combine the multiple LR models.
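A sketch of the mixture: each segment m has a gate vector u_m and an LR vector w_m, and the prediction is the sum over segments of softmax(u_m·x) × sigmoid(w_m·x). The two-segment parameters below are hand-set, not trained, purely to show how the combination works:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def mlr_predict(x, gates, lrs):
    """p(y|x) = sum_m softmax(u_m . x) * sigmoid(w_m . x)."""
    zs = [sum(u * xi for u, xi in zip(ug, x)) for ug in gates]
    mx = max(zs)
    exps = [math.exp(z - mx) for z in zs]  # numerically stable softmax
    total = sum(exps)
    return sum((e / total) * sigmoid(sum(w * xi for w, xi in zip(wm, x)))
               for e, wm in zip(exps, lrs))

# illustrative parameters: the gate routes on the first feature (a gender flag),
# and the two segment LRs react oppositely to the second feature
gates = [[5.0, 0.0], [-5.0, 0.0]]
lrs = [[0.0, 2.0], [0.0, -2.0]]
print(round(mlr_predict([1.0, 1.0], gates, lrs), 3))   # segment-1 user: high score
print(round(mlr_predict([-1.0, 1.0], gates, lrs), 3))  # segment-2 user: low score
```

With soft gates, the overall model is piecewise linear in x, which is the nonlinearity a single LR lacks.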
Recommendation Algorithm Based on Feature Intersection
In an ordinary linear model, each feature is considered independently; a linear combination of features followed by an activation function captures feature interactions only weakly. In practice, a large number of features are correlated with one another. Crossing features:
- strengthens the non-linearity of the model;
- makes the information more discriminative.
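A minimal illustration of why crossing helps: an XOR-like click pattern over two raw features is not linearly separable, but a single weight on the cross feature g·c separates it perfectly. The data are invented for illustration:

```python
# Clicks happen only when gender flag g and category flag c "match" (XOR-like).
data = [((1, 1), 1), ((1, -1), 0), ((-1, 1), 0), ((-1, -1), 1)]

# With only the raw features, any score w1*g + w2*c misorders these labels;
# adding the cross feature g*c makes a single weight on it sufficient.
def score_with_cross(g, c, w_cross=1.0):
    return w_cross * (g * c)

predictions = [1 if score_with_cross(g, c) > 0 else 0 for (g, c), _ in data]
print(predictions)  # matches the labels [1, 0, 0, 1]
```

This is the discriminative information a cross feature carries that the independent linear terms cannot.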