[Recommendation algorithms] Learning recommendation systems from architecture to principles (Part 1)

Preface

        Recommendation is an interesting problem with a high degree of freedom in the algorithm field. At its core it answers "which items should be recommended to which user", and modeling follows that idea; the concepts of user and item here can be swapped or generalized. Different companies in the industry use different recommendation-system architectures. This article introduces a fairly basic three-stage architecture: recall + sort + filter. It covers all the concepts at a general level and touches on some mathematics and algorithm principles; the next article will walk through the actual code. Feedback and corrections are welcome, thanks!

1 Definition of recommendation

Recommendation in the narrow sense: recommend items to users, i.e., discover items a user may like and show them to that user.
Recommendation in the broad sense: generalize the subject and object of the recommendation. Either side can be chosen freely, e.g., recommending users to projects, or users to users. Note that users and items can also be redefined by the entities in the scene, such as recommending items to a set of users (a group containing multiple users) or recommending a set of items to a user (an itinerary containing multiple locations), and so on.

2 Recommendation architecture

        As mentioned above, we use a three-stage structure: recall + sort + filter. [Figure: flow chart of the recall + sort + filter pipeline]
        Now suppose we want to recommend 10 products out of 9999 to Xiao Ming. This scenario helps clarify the role of each part above:
        1) Item universe: all candidate objects, i.e., the 9999 products;
        2) Recall algorithm: 9999 products is a fairly large number, so the recall algorithm acts as the first layer of the "funnel", roughly screening out products that fit Xiao Ming, say 200 selected from the 9999;
        3) Sorting model (sorting algorithm): the recall stage shrinks the pool from 9999 to 200, but that is still large, so the sorting algorithm acts as the second layer of the "funnel". It estimates the probability that Xiao Ming buys each of the 200 products. In this article's framework, a model built with feature engineering and time-based labeling answers "with what probability does a certain kind of person buy a certain kind of product", and sorts the candidates from high to low;
        4) Business filtering: in theory we could take the top 10 results from the sorting algorithm and recommend them to Xiao Ming, but those 10 may not fit the actual business scenario. If Xiao Ming has already purchased 5 of them and will not buy them again in the short term, we should not recommend duplicates for a while. Business filtering is the third layer of the "funnel", set up to satisfy such business constraints;
        5) Recommendation result: after the above process we obtain the final recommendations, compute the evaluation metrics, measure the recommendation effect, and use that as the basis for optimizing the algorithms.
        That is a general summary of the recommendation system architecture. With a preliminary understanding of the process, I will now introduce each layer of the "funnel" in detail, including the algorithm principles and how the data changes shape and flows between stages.

3 Step-by-step walkthrough

        This section explains the recall algorithms, the sorting algorithm, and business filtering in turn.

3.1 Recall algorithm

        What people usually call "the recommendation algorithm" in fact refers to the recall algorithm (recall here is a completely different concept from recall in supervised learning). Returning to Xiao Ming's scenario: the recall algorithm solves the problem of finding, among the 9999 products, 200 that Xiao Ming may like. Different recall algorithms produce different sets of 200. You can use a single algorithm at this stage, or combine several to achieve multi-channel recall, i.e., merge the results of multiple recall algorithms into the 200 products. This article introduces several classic recall algorithms.

3.1.1 User-based collaborative filtering (UserCF)

        Some readers will have heard the term collaborative filtering. In layman's terms: Xiao Ming likes a set of products, so we look for other people who like those same products, assume they share Xiao Ming's tastes, and then recommend to Xiao Ming the products those people liked that Xiao Ming has not yet liked or encountered. That is the idea; but how is "like" measured concretely?
We need data like this:
[Figure: user-item rating matrix]
        This is a rating matrix, usually obtained by processing tracking ("buried-point") data or other raw data. Here we use a 0-5 rating scale as an example: 0 means the person has not clicked on the product, and 1-5 can be assigned along the behavior path from click to completed purchase, with 5 meaning a completed purchase (a 0-10 scale works too). A 0/1 scheme is also common: 1 if purchased, 0 otherwise. This matrix reflects each person's preferences. For example, Xiao Ming's preference vector is [5, 3, 0, 5, 5] and Xiaojun's is [5, 4, 0, 4, 5]. These are two vectors, and the similarity of two people's tastes can be measured by the angle between their vectors. This is called cosine similarity, with the formula:
$$\mathrm{sim}(u, v) = \cos\theta = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert} = \frac{\sum_{i} u_i v_i}{\sqrt{\sum_{i} u_i^2}\ \sqrt{\sum_{i} v_i^2}}$$
        Other similarity measures can also be used, such as the Pearson correlation coefficient, Minkowski distance, Jaccard similarity coefficient, Manhattan distance, or Hamming distance. In short, the greater the similarity, the closer the two people's tastes.
        With the rating matrix and a similarity formula, we can compute the similarity between any two people and lay the results out in a table:
[Figure: user-user similarity matrix]
        This is the similarity matrix; I'll be lazy and not compute every entry, so the ellipses stand for the actual values. Each person's similarity with themselves is necessarily the largest, so we set the diagonal to 0 to exclude self-matches. From the similarity matrix we can then find the n people closest to Xiao Ming. Suppose Xiaohong and Xiaojun have the tastes closest to Xiao Ming's and there are enough products; then we can take the products that Xiaohong and Xiaojun rated 4 or above (the threshold is up to you) and that Xiao Ming rated 0, and use them as Xiao Ming's recommendation list. That is what UserCF does; a sketch follows below. Returning to our earlier scenario, it can recall 200 of the 9999 products for Xiao Ming. There are edge cases: if Xiao Ming has already purchased 9900 products and only 99 have a rating of 0, you have to design your own policy, e.g., pad the list up to 200 or recall only 99.
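        A minimal UserCF sketch in Python, assuming a toy 5x5 rating matrix whose first two rows are the Xiao Ming / Xiaojun vectors from the text (the other rows are made up for illustration). Instead of a hard rating threshold, this variant weights neighbor ratings by similarity, which is also a common choice:

```python
import numpy as np

# Toy rating matrix: rows are users, columns are items; 0 = no interaction.
R = np.array([
    [5, 3, 0, 5, 5],   # Xiao Ming
    [5, 4, 0, 4, 5],   # Xiaojun
    [5, 5, 0, 4, 4],   # Xiaohong (hypothetical values)
    [0, 2, 5, 0, 1],
    [1, 0, 5, 0, 0],
], dtype=float)

def user_cf_recall(R, user, n_neighbors=2, top_n=2):
    """Recall items for `user` from the top-rated items of the most similar users."""
    norms = np.linalg.norm(R, axis=1)
    sim = (R @ R.T) / (np.outer(norms, norms) + 1e-9)  # cosine similarity
    np.fill_diagonal(sim, 0.0)                         # exclude self-matches
    neighbors = np.argsort(-sim[user])[:n_neighbors]   # n closest users
    scores = sim[user, neighbors] @ R[neighbors]       # similarity-weighted ratings
    scores[R[user] > 0] = -np.inf                      # only items the user hasn't touched
    ranked = np.argsort(-scores)
    return [i for i in ranked if np.isfinite(scores[i])][:top_n]

print(user_cf_recall(R, user=0))  # candidate item indices for Xiao Ming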

3.1.2 Item-based collaborative filtering (ItemCF)

        If you understood UserCF above, ItemCF is easy. In the rating matrix just now we treated each row as a vector; now, in the same rating matrix, we treat each column as a vector:
[Figure: rating matrix with columns highlighted as item vectors]
        For example, item1's vector is [5, 5, 5, 0, 0]. Each item's vector describes how that item performs across the whole user population. Then comes the similarity step again: compute the similarity matrix between any two items:
[Figure: item-item similarity matrix]
        The physical meaning of this similarity matrix is to use each item's performance across the crowd to measure how close any two items are. In the original rating matrix, Xiao Ming's vector is [5, 3, 0, 5, 5], i.e., he strongly likes item1, item4, and item5. Can we find, for these three items, the n nearest items in the item similarity matrix? Yes: if there are enough items, n can be 200, and those items become Xiao Ming's recommendation list. That is what ItemCF does; a matching sketch follows.
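        An ItemCF sketch under the same assumptions, reusing the toy matrix R from the UserCF example:

```python
import numpy as np  # R is the toy rating matrix defined in the UserCF sketch above

def item_cf_recall(R, user, top_n=2):
    """Recommend items most similar to the items `user` already rates highly."""
    norms = np.linalg.norm(R, axis=0)
    sim = (R.T @ R) / (np.outer(norms, norms) + 1e-9)  # item-item cosine similarity
    np.fill_diagonal(sim, 0.0)
    scores = R[user] @ sim           # neighbor similarity weighted by the user's own ratings
    scores[R[user] > 0] = -np.inf    # drop items already interacted with
    ranked = np.argsort(-scores)
    return [i for i in ranked if np.isfinite(scores[i])][:top_n]

print(item_cf_recall(R, user=0))     # items closest to what Xiao Ming likes
```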
        Those are the two flavors of collaborative filtering, arguably the most classic family of recommendation algorithms. You may have noticed that UserCF leans social, finding the group of people close to you, while ItemCF leans personal, finding the kinds of items you yourself would like. If you are a product manager you can skip the next few algorithms and jump ahead to the sorting algorithm and business filtering to understand how the whole system operates; if you are a data-mining engineer, it is worth reading the remaining recall algorithms carefully.

3.1.3 Traditional Matrix Factorization (MF) & SVD & SVD++

        The idea of matrix factorization is completely different from collaborative filtering. Look at the rating matrix again:
[Figure: user-item rating matrix]
        As mentioned earlier, a 0 in the matrix means the person has not been exposed to the product, not that they dislike it. In other words, Xiao Ming would have some specific rating for item3; we simply do not know it yet. Can we predict that rating? Yes. To make this easier to picture, first replace the zero cells in the matrix with question marks:
[Figure: rating matrix with the 0 cells replaced by question marks]
        Matrix factorization can estimate the values of all those question marks, and it is a purely mathematical problem. We know that multiplying two matrices yields a new matrix; conversely, a matrix can be decomposed into two matrices whose product approximates it:
[Figure: rating matrix decomposed into the product of two matrices]
        After a rating matrix is decomposed into two matrices, they are the user latent-factor matrix P and the item latent-factor matrix Q. What the latent factors mean is complicated to explain rigorously; my intuition is as follows:

        Suppose your rating for "Titanic" is 5. After factorization, the user latent-factor matrix P captures how much each other user resembles you, and the item latent-factor matrix Q captures how much each other movie resembles "Titanic"; weighted combinations of the two then estimate how much every user likes every movie.

        The process of matrix factorization is therefore the process of fitting these two sub-matrices:
[Figure: fitting P and Q so their product approximates the rating matrix]

Algorithm idea:

  1. Initialize the left matrix P and the right matrix Q with random values
  2. Construct the objective function C, e.g., the squared error between the rating matrix and P × Q (a regularization term is usually added)
  3. Solve by gradient descent with a chosen step size; stopping conditions: the maximum number of iterations is reached, or C is sufficiently small, or the change in C is sufficiently small
  4. Update the left and right matrices by the step size at each iteration
  5. After iterating: rating matrix ≈ left matrix P × right matrix Q
  6. Positions whose original rating was 0 now hold non-zero values, which serve as the predicted ratings
  7. Take the top n predicted ratings, from high to low, as the recall result
        You do not need to follow every detail; the conclusion is enough: the rating matrix A can be decomposed into two sub-matrices, and multiplying them back gives an approximation A′ ≈ A, where A′ fills in the 0 entries of A while staying very close to the non-zero entries. It is reasonable to treat the filled-in values as predictions; the rating matrix is thereby completed, and the 200 items with the highest predicted ratings can be recommended to Xiao Ming from high to low.
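        A minimal sketch of matrix factorization by stochastic gradient descent, fitting P and Q on the observed (non-zero) entries only; the dimensions and hyperparameters are illustrative assumptions:

```python
import numpy as np

def matrix_factorization(R, k=2, lr=0.01, reg=0.02, epochs=500, seed=0):
    """Fit R ≈ P @ Q on the non-zero entries; the former 0 cells of the
    returned dense matrix serve as the predicted ratings."""
    n_users, n_items = R.shape
    rng = np.random.default_rng(seed)
    P = rng.normal(scale=0.1, size=(n_users, k))   # user latent factors
    Q = rng.normal(scale=0.1, size=(k, n_items))   # item latent factors
    rows, cols = np.nonzero(R)                     # only observed ratings drive the loss
    for _ in range(epochs):
        for u, i in zip(rows, cols):
            err = R[u, i] - P[u] @ Q[:, i]         # residual on one observed entry
            p_u = P[u].copy()
            P[u]    += lr * (err * Q[:, i] - reg * P[u])
            Q[:, i] += lr * (err * p_u - reg * Q[:, i])
    return P @ Q

R_hat = matrix_factorization(R)       # R: the toy matrix from the earlier sketches
print(R_hat[0, 2])                    # predicted rating for Xiao Ming on item3
```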
        Variants of matrix factorization include SVD (singular value decomposition) and SVD++. SVD decomposes the rating matrix into three sub-matrices, with the middle matrix reflecting how well the original matrix can be reconstructed after decomposition; the formulas differ, but formally it still aims to estimate the 0 values. SVD++ builds on SVD by incorporating the user's implicit feedback on items, modeling score = explicit interest + implicit interest + bias when estimating the 0 values: it adds each user's deviation from the average rating over all items and all users. For example, if it finds that most of your ratings run 0.5 points below the average, it subtracts an extra 0.5 points when estimating how much you like a new item, effectively adding a "bold guess" as the bias term. As a result SVD++ takes longer to compute, and sometimes the guess is too bold and hurts the results, so it is not used much in practice.

3.1.4 Bayesian Personalized Ranking (BPR)

        BPR-MF is also a kind of matrix factorization, but one driven by the BPR idea, which is why it is distinguished from traditional matrix factorization. Suppose the set of items Xiao Ming likes is I and the rest form the set J; clearly the scores should satisfy Ui > Uj. Traditional matrix factorization trains by fitting parameter matrices P and Q to predict Xiao Ming's scores on the set J, whereas BPR trains by fitting parameter matrices W and H to maximize the gap between Xiao Ming's scores on I and J, i.e., the probability P(Ui > Uj). Once W and H are learned, multiplying the two matrices gives each user's ranking score for every item, and recommendations are taken from high to low. The algorithm idea is given below without the full mathematics; for a detailed derivation see Liu Jianping's blog (linked in the original post).
        The objective function L (the standard BPR-Opt criterion, maximized during training; σ is the sigmoid function, D is the set of (u, i, j) triples, and Θ = {W, H}):

$$L = \sum_{(u,i,j) \in D} \ln \sigma\left(\hat{x}_{ui} - \hat{x}_{uj}\right) - \lambda \lVert \Theta \rVert^2$$

Algorithm idea:
Input: training set D of triples, gradient step size α, regularization parameter λ, factor dimension k
Output: model parameter matrices W and H
1. Randomly initialize the matrices W and H;
2. Iteratively update the model parameters by gradient ascent (the objective L increases with each step);
3. If W and H have converged, stop and output them; otherwise return to step 2;
4. With the learned W and H, compute each user u's ranking score for every item, and output the items with the highest scores.
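        A minimal BPR-MF sketch, again assuming the toy matrix R from the earlier examples; it samples (u, i, j) triples and takes gradient ascent steps on ln σ(x̂_ui − x̂_uj):

```python
import numpy as np

def bpr_mf(R, k=2, lr=0.05, reg=0.01, n_steps=20000, seed=0):
    """Learn W (user factors) and H (item factors) with the BPR criterion."""
    n_users, n_items = R.shape
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(n_users, k))
    H = rng.normal(scale=0.1, size=(n_items, k))
    for _ in range(n_steps):
        u = rng.integers(n_users)
        pos = np.nonzero(R[u])[0]            # items u interacted with (set I)
        neg = np.nonzero(R[u] == 0)[0]       # items u has not touched (set J)
        if len(pos) == 0 or len(neg) == 0:
            continue
        i, j = rng.choice(pos), rng.choice(neg)
        x_uij = W[u] @ (H[i] - H[j])
        g = 1.0 / (1.0 + np.exp(x_uij))      # d/dx ln(sigmoid(x)) = sigmoid(-x)
        w_u = W[u].copy()
        W[u] += lr * (g * (H[i] - H[j]) - reg * W[u])
        H[i] += lr * (g * w_u - reg * H[i])
        H[j] += lr * (-g * w_u - reg * H[j])
    return W @ H.T                           # ranking score for every (user, item) pair

scores = bpr_mf(R)                           # R: toy matrix from the earlier sketches
print(scores[0].argsort()[::-1])             # Xiao Ming's items, best first
```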

3.1.5 Recall based on word vectors (word2vec)

        Techniques from NLP can also be used for recommendation recall, and the idea is intuitive. We know that the keywords of an article correlate with its topic: articles on the same topic tend to share close or even identical keywords. Put differently, the keywords of same-topic articles tend to be bundled together. With that idea in mind, let's see how word vectors can be used for recommendation.
        Now generalize: treat articles on the same topic as users of the same type, and the keywords appearing in those articles as the names of the items those users like. That is, items liked by the same type of user tend to be bundled together. For example, Xiao Ming and Xiaojun are similar people. The items Xiao Ming bought, in chronological order, are [jersey, football, football boots, socks, shin guards], and Xiaojun's are [whistle, football, football boots, shin guards]. Clearly these users' favorite items exhibit a bundling relationship. Now a new user, Xiao Wang, arrives, and his behavior data shows he likes football boots. Based on Xiao Ming's and Xiaojun's behavior, if we may recommend only one item, it could be "football" or "socks": footballs are often bought just before football boots, and socks just after, so we can recommend forward or backward. If we may recommend two items, we can push any two of [jersey, football, whistle, socks, shin guards], again forward or backward. The "one or two items" here is a window concept: the larger the window, the wider the recommendation range and the weaker the bundling relationship; the smaller the window, the stronger the bundling.
        Once this special case is clear, generalize it: thousands of users, each with their favorite items sorted chronologically. Train a model on these sequences, and the model describes the bundling relationships within the crowd. As soon as a newcomer shows a liking for some item, the model can find the other items bundled with it and recommend them. That is the popular explanation; the technical summary is: treat each user's behavior sequence as a piece of text in which items are words, use NLP techniques to compute a vector for each word (item) so that item-to-item similarity can be measured, and recommend similar items based on the user's purchase history. This is exactly what the word2vec model does; a sketch follows below.
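        A sketch using gensim's Word2Vec (assuming gensim 4.x; the item names and hyperparameters are illustrative):

```python
from gensim.models import Word2Vec

# Each "sentence" is one user's purchase history in chronological order.
sessions = [
    ["jersey", "football", "football_boots", "socks", "shin_guards"],  # Xiao Ming
    ["whistle", "football", "football_boots", "shin_guards"],          # Xiaojun
    # ... thousands more users in a real dataset
]

model = Word2Vec(
    sentences=sessions,
    vector_size=32,  # embedding dimension
    window=2,        # the "window" from the text: how far the bundling reaches
    min_count=1,     # keep items even if they appear only once in this toy data
    sg=1,            # skip-gram, a common choice for sparse item sequences
)

# New user Xiao Wang shows interest in football boots:
print(model.wv.most_similar("football_boots", topn=2))
```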
        That covers the more classic recall algorithms. They sit side by side as alternatives: pick one or several. Back to the opening scenario, the task of recalling 200 of the 9999 items for Xiao Ming is complete. Of course, if you want the recall output to serve directly as the final recommendation, the sorting algorithm below is unnecessary; just set 200 to 10.

3.2 Sorting algorithm

        Some recommendation evaluation metrics, such as MAP and MRR, measure the ordering of the recommendations. A recall algorithm alone cannot address them, because recall only tells us which items the user may buy, not the probability of buying each. Moreover, the recall algorithms above use only the user's behavior data, ignoring the attributes of the products and the users themselves, so there is plenty of room for improvement. That is what the second layer of the "funnel", the sorting algorithm, is for.
        The sorting algorithm is really a classification task in supervised learning. Here we use LightGBM, a strong performer on classification (XGBoost works too): simple, lightweight, fast, and able to output probability values. I won't repeat its principles here; interested readers can study them on their own, and we will just use it directly.
        After the recall stage, each user to be served has 200 items suited to them. Next, the modeling idea. Let the raw data be behavior data from January to April:
[Figure: raw behavior data, January through April]

        The January-March data is processed by the recall algorithm to obtain recall results, which are joined with the user-profile table and put through feature engineering; the April data provides the labels. That yields the following training set:

[Figure: training set with user features, item features, and the label column]
        Then train the model, masking out the User_name and Item_name ID columns. It is not hard to see that what the model essentially learns is "users with certain attributes buy items with certain characteristics", so it can explain the buying habits of the population in the current dataset. Tune the parameters and output the probability values, i.e., the probability of each user buying each product, then select the top 10 as the recommendations.
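        A sketch of the sorting stage with LightGBM's scikit-learn interface; the feature columns and toy values are invented for illustration:

```python
import numpy as np
import pandas as pd
import lightgbm as lgb

# Hypothetical training set: user features + item features per recalled pair,
# labeled 1 if the user bought the item in April, else 0.
train = pd.DataFrame({
    "user_age":   [23, 35, 41, 19, 28, 52],
    "user_city":  [0, 1, 0, 2, 1, 2],           # label-encoded categorical
    "item_price": [59.0, 120.0, 8.5, 59.0, 300.0, 8.5],
    "item_cat":   [3, 1, 0, 3, 2, 0],            # label-encoded categorical
    "label":      [1, 0, 0, 1, 0, 1],
})
X, y = train.drop(columns="label"), train["label"]

clf = lgb.LGBMClassifier(n_estimators=200, learning_rate=0.05,
                         min_child_samples=1)    # tiny toy data needs small leaves
clf.fit(X, y, categorical_feature=["user_city", "item_cat"])

# At serving time, score each user's ~200 recalled items and keep the top 10.
proba = clf.predict_proba(X)[:, 1]               # P(buy)
top10 = np.argsort(-proba)[:10]
```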

3.3 Business filtering

        So far we have the 10 products Xiao Ming is most likely to buy, but since recommendation runs periodically (you can't run it only once, right?), we can design business rules to fit our needs. For example: after a user buys a product, wait some period before recommending the same product again, even if the model assigns it a very high probability; or, if the same product has been recommended for several consecutive cycles and the user still hasn't bought it, stop pressing it on them; and so on. Business filtering can be implemented in many ways, e.g., use SQL to periodically build a table of products to suppress, then join the sorted recommendations against that table and filter before serving.
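        A small sketch of the filtering join in pandas (the text suggests SQL; the table and column names here are hypothetical):

```python
import pandas as pd

# Sorted model output for one cycle, and a periodically rebuilt table of
# (user, item) pairs purchased within the cool-down window.
ranked = pd.DataFrame({
    "user":  ["xiaoming"] * 4,
    "item":  ["item1", "item2", "item3", "item4"],
    "proba": [0.92, 0.88, 0.75, 0.60],
})
recent_buys = pd.DataFrame({"user": ["xiaoming"], "item": ["item2"]})

# Anti-join: drop recommendations the user bought recently, then keep the top N.
merged = ranked.merge(recent_buys, on=["user", "item"], how="left", indicator=True)
final = (merged[merged["_merge"] == "left_only"]
         .drop(columns="_merge")
         .sort_values("proba", ascending=False)
         .head(10))
print(final)
```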
        There are also schemes for problems like cold start and new users, handled differently in different scenarios. You can fold the handling into the algorithm, or run a separate pipeline, such as a hot-list rule that pushes popular products, or pushing new products to active users. The advantage is that pairing new products or new users with active users or popular products generates interaction data quickly, which in turn helps subsequent recommendations.

4 Evaluation metrics

        Finally, we want to measure the overall recommendation effect. Several metrics are commonly used:

PRE (precision): among the recommended items, the fraction the user actually "liked" out of the number recommended; range [0, 1]
REC (recall): the items the user "liked" among those recommended, as a fraction of everything the user "liked"; range [0, 1]
F1: as in machine learning, a combined view of PRE and REC
HR (hit rate): the number of users for whom at least one recommended item was "liked", as a fraction of the total number of users
MRR (mean reciprocal rank): the reciprocal of the rank of the first "liked" item in each user's recommendation list, averaged over users; larger is better
MAP (mean average precision): for each user, average the precision at each position in the list where a "liked" item appears, then average over users

        The standard calculation formulas (as summarized in the diagrams the original post borrows):

$$\mathrm{Precision@}K = \frac{|\,\mathrm{rec}_u@K \cap \mathrm{like}_u\,|}{K}, \qquad \mathrm{Recall@}K = \frac{|\,\mathrm{rec}_u@K \cap \mathrm{like}_u\,|}{|\,\mathrm{like}_u\,|}$$

$$F1@K = \frac{2 \cdot \mathrm{Precision@}K \cdot \mathrm{Recall@}K}{\mathrm{Precision@}K + \mathrm{Recall@}K}, \qquad \mathrm{HR@}K = \frac{\#\{u : \mathrm{rec}_u@K \cap \mathrm{like}_u \neq \varnothing\}}{\#\mathrm{users}}$$

$$\mathrm{MRR} = \frac{1}{|U|} \sum_{u \in U} \frac{1}{\mathrm{rank}_u}, \qquad \mathrm{MAP} = \frac{1}{|U|} \sum_{u \in U} \mathrm{AP}_u$$
        The first group of metrics asks: of what the user wanted, did I recommend it? They emphasize "accuracy": Precision@K, Recall@K, F1@K, HR@K. The second group cares whether those items are placed in prominent positions for the user, i.e., they emphasize "order": MRR@K and MAP@K. In practice the common ones are PRE, REC, MRR, and MAP. The two order-aware metrics usually take low values and fluctuate a lot. Also, when a user has many historical interactions but we recommend only a few items, REC is inevitably very low, so unless the scenario demands strictness, PRE reflects recommendation quality better.
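        Toy Python implementations of the metrics above for a single user (averaging over users is left out; the names and example data are illustrative):

```python
def precision_recall_at_k(recommended, liked, k):
    """`recommended` is one user's ranked list; `liked` is the ground-truth set."""
    hits = [x for x in recommended[:k] if x in liked]
    return len(hits) / k, len(hits) / max(len(liked), 1)

def reciprocal_rank(recommended, liked):
    for rank, x in enumerate(recommended, start=1):
        if x in liked:
            return 1.0 / rank       # the first hit decides the score
    return 0.0

def average_precision(recommended, liked, k):
    hits, score = 0, 0.0
    for rank, x in enumerate(recommended[:k], start=1):
        if x in liked:
            hits += 1
            score += hits / rank    # Precision@rank at each hit position
    return score / max(min(len(liked), k), 1)

recs, truth = ["item4", "item1", "item3"], {"item3", "item4"}
print(precision_recall_at_k(recs, truth, k=3))  # (0.666..., 1.0)
print(reciprocal_rank(recs, truth))             # 1.0: the first item is a hit
print(average_precision(recs, truth, k=3))      # (1/1 + 2/3) / 2 ≈ 0.833
```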

        All typed by hand; it took a while to finish. The next article in this series will implement this system in code. If you found it helpful, please like, bookmark, and follow!

Origin blog.csdn.net/a7388787/article/details/109184324