Internet marketing recommendation algorithm theory - 30 minutes to understand collaborative filtering

 

1. The concept of collaborative filtering

Whether on Taobao or JD.com, after you browse or purchase a product A, the app's home page keeps showing products similar to A for the next few days. A recommendation system is behind this ability, and its recommendation algorithm may well be collaborative filtering. (Note: this kind of ad recommendation in the app can be turned off, thanks to the people's government.)

Main idea

Birds of a feather flock together

Two angles of thinking:

1) You may also like what people similar to you like: user-based CF

2) You may like items similar to the ones you already like: item-based CF

The following sections focus on user-based CF by default; the last section briefly introduces the item-based CF calculation.

Term definitions

Collaboration: "harmony among the many; a coming together" (Shuowen Jiezi). The process or ability of coordinating two or more different resources or individuals so that they accomplish a goal in concert.

Collaborative filtering: bring together the relevant information of many users and items, find the items that are liked by the same or similar users, and filter them out of this large collection. (This is the author's own explanation.)

2. Recommendation process based on collaborative filtering algorithm

2.1 Data Collection & Integration

This algorithm presupposes that basic data has already been collected and accumulated. Collaborative filtering mainly uses the user's interaction information on products: purchases, ratings, and other quantifiable signals.

Purchase behavior: recorded as 1 if the user has purchased product A, and 0 if not.

Rating behavior: each user's own rating of product A.

Taking rating behavior as the example, with a maximum score of 5, we build a pivot table of users against products, also called a rating matrix (the product granularity in this example is deliberately coarse):

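| User | lipstick | perfume | foundation | eye cream | cream | shampoo | cleansing milk | conditioner |
|---|---|---|---|---|---|---|---|---|
| Xiao Ming |  |  | 4 |  |  | 5 | 4 | 5 |
| Little Bear | 5 | 3 | 4 |  | 2 |  | 4 |  |
| Xiaoxuan |  | 1 | 2 | 3 |  | 3 |  | 4 |
| Kodai |  |  | 4 | 2 | 3 | 4 | 2 |  |
| Mary | 1 | 1 | 2 | 1 | 1 |  |  |  |

Perspective 1

Blanks represent no ratings.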

2.2 Similarity Calculation

The core of collaborative filtering is the similarity calculation. Three similarity measures are commonly used: cosine similarity, the Pearson correlation coefficient, and the Jaccard similarity coefficient. In practice, choose whichever suits the data when training the model.

2.2.1 Cosine similarity

The cosine similarity of two vectors a and b is calculated as follows:

\cos\theta=\frac{\vec{a}\cdot\vec{b}}{|\vec{a}|*|\vec{b}|}=\frac{\Sigma_{i}(a_i*b_i)}{\sqrt{\Sigma_{i}a_i^2}*\sqrt{\Sigma_{i}b_i^2}}

Mathematically, cosine similarity measures the cosine of the angle between two vectors in a multidimensional space.

The smaller the angle, the more similar the two vectors are; the larger the angle, the less similar. Recalling junior high school geometry: \cos 0^\circ=1 means the vectors point the same way (positively correlated), \cos 90^\circ=0 means they are unrelated, and \cos 180^\circ=-1 means they point in opposite directions (negatively correlated).

In user-based CF, each product in Perspective 1 is one dimension of a user's vector, and the user's rating is the value in that dimension. For example, Xiao Ming and Xiaoxuan have the following vectors:

\overrightarrow{Xiao Ming}(None,None,4,None,None,5,4,5)

\overrightarrow{Xiaoxuan}(None,1,2,3,None,3,None,4)

When calculating the similarity between two users, keep only the columns in which both vectors have a value. This gives the vectors

\overrightarrow{Xiao Ming}(4, 5, 5)

\overrightarrow{Xiaoxuan}(2, 3, 4)

According to the cosine formula, the user similarity between Xiao Ming and Xiaoxuan is:

\cos\theta=\frac{4*2+5*3+5*4}{\sqrt{4^2+5^2+5^2}*\sqrt{2^2+3^2+4^2}}=\frac{43}{\sqrt{66}*\sqrt{29}}\approx0.98287

By cosine similarity, then, Xiao Ming and Xiaoxuan are almost perfectly positively correlated. In subsequent recommendations, products that Xiaoxuan rated highly can therefore be recommended to Xiao Ming.
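The calculation above can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `cosine_similarity` and the use of `None` for a missing rating are choices made for this illustration, not part of any particular library.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity over the positions where both vectors have a rating."""
    pairs = [(a, b) for a, b in zip(u, v) if a is not None and b is not None]
    if not pairs:
        return float("nan")  # no co-rated items, similarity is undefined
    dot = sum(a * b for a, b in pairs)
    norm_u = math.sqrt(sum(a * a for a, _ in pairs))
    norm_v = math.sqrt(sum(b * b for _, b in pairs))
    return dot / (norm_u * norm_v)

# Row vectors of Xiao Ming and Xiaoxuan, None = not rated
xiao_ming = [None, None, 4, None, None, 5, 4, 5]
xiaoxuan = [None, 1, 2, 3, None, 3, None, 4]

sim_ming_xuan = cosine_similarity(xiao_ming, xiaoxuan)
print(round(sim_ming_xuan, 5))  # 0.98287
```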

2.2.2 Pearson correlation coefficient

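The formula is as follows, where I is the set of items rated by both users u and v, and \overline{R_u} and \overline{R_v} are their average ratings over I:

r_{u,v}=\frac{\Sigma_{i\in I}((R_{u,i}-\overline{R_u})*(R_{v,i}-\overline{R_v}))}{\sqrt{\Sigma_{i\in I}(R_{u,i}-\overline{R_u})^2}*\sqrt{\Sigma_{i\in I}(R_{v,i}-\overline{R_v})^2}}

The Pearson correlation coefficient is in effect the cosine similarity of the mean-centered vectors. Subtracting each user's average rating removes the effect of users who rate generally high or generally low, so the result better reflects actual preferences.

Let's look at Little Bear and Mary. Selecting the co-rated items in the same way as for cosine similarity gives the vectors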

\overrightarrow{Little Bear} (5, 3, 4, 2)

\overrightarrow{Mary} (1, 1, 2, 1)

The cosine similarity of these two vectors is \cos\theta\approx0.92582, which says the two users are very similar. Now look at the ratings themselves. The rating range is 1 to 5. Mary's ratings are all low, which shows that Mary does not like lipstick, perfume, foundation, or face cream, while Little Bear rates lipstick, perfume, and foundation fairly high, showing a real liking for them. Intuitively, the two users should not be considered similar. The cosine similarity result here clearly contradicts that expectation.

From a business point of view, the Pearson correlation coefficient largely removes individual differences in rating habits. Different people rate the same product differently: every user has a personal benchmark in mind, and a rating above it means the user likes the product, while a rating below it means the user does not. As a result, some users rate generally high and others generally low. The Pearson formula handles this by centering: it takes each user's average rating as that user's benchmark and subtracts it from each of the user's ratings. A negative result indicates dislike, a positive one indicates liking, and the magnitude indicates how strong the feeling is.

Let us calculate the Pearson correlation coefficient together,

cos\theta=\frac{(5-3.5)*(1-1.25)+(3-3.5)*(1-1.25)+(4-3.5)*(2-1.25)+(2-3.5)*(1-1.25)}{\sqrt{(5-3.5)^2+(3-3.5)^2+(4-3.5)^2+(2-3.5)^2} * \sqrt{(1-1.25)^2+(1-1.25)^2+(2-1.25)^2+(1-1.25)^2}}=\frac{0.5}{\sqrt{5}*\sqrt{0.75}}\approx0.25819

It can be seen that the calculation results of the Pearson correlation coefficient are basically consistent with our previous analysis.
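The contrast between the two measures for Little Bear and Mary can be checked with the sketch below. The function names are chosen for this illustration only, and the inputs are the two co-rated vectors quoted above.

```python
import math

def cosine(u, v):
    """Plain cosine similarity of two equal-length rating vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def pearson(u, v):
    """Pearson correlation: cosine similarity after subtracting each vector's mean."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    centered_u = [a - mean_u for a in u]
    centered_v = [b - mean_v for b in v]
    denom = math.sqrt(sum(a * a for a in centered_u)) * math.sqrt(sum(b * b for b in centered_v))
    if denom == 0:
        return float("nan")  # a constant vector has no defined correlation
    return sum(a * b for a, b in zip(centered_u, centered_v)) / denom

little_bear = [5, 3, 4, 2]  # lipstick, perfume, foundation, face cream
mary = [1, 1, 2, 1]

cos_bear_mary = cosine(little_bear, mary)
pearson_bear_mary = pearson(little_bear, mary)
print(round(cos_bear_mary, 5))      # 0.92582
print(round(pearson_bear_mary, 6))  # 0.258199
```

Centering is the only difference between the two functions, which is why the Pearson value drops once Mary's uniformly low ratings are measured against her own average.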

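2.2.3 Jaccard similarity coefficient

J(A,B)=\frac{|A\cap B|}{|A\cup B|}

For two 0/1 vectors, the numerator is their dot product (the number of positions where both are 1), and the denominator is the sum of the vector obtained by OR-ing them element by element (the number of positions where at least one is 1).

It applies only to similarity calculations on 0/1 matrices, that is, to scenarios where there is only a yes-or-no notion, such as purchased or not purchased.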
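As a small illustration of the Jaccard calculation on 0/1 purchase vectors, here is a sketch in Python. The two purchase vectors are hypothetical and do not come from Perspective 1.

```python
def jaccard(a, b):
    """Jaccard similarity of two 0/1 vectors of equal length."""
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)    # dot product
    either = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)   # sum after OR
    return both / either if either else float("nan")

# Hypothetical purchase records over five products: 1 = purchased, 0 = not purchased
user_a = [1, 1, 0, 1, 0]
user_b = [1, 0, 0, 1, 1]

jaccard_ab = jaccard(user_a, user_b)
print(jaccard_ab)  # 0.5
```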

2.3 Recommended product filtering

Users similar to Xiao Ming have now been found. In practice, advertising space is limited, and it is hard to show a user every product those similar users liked. So which products should rank first and be recommended to Xiao Ming? We need to estimate Xiao Ming's ratings for the products he has not rated and then sort by those estimates, which gives the order of recommendation. We first calculate the Pearson correlation for every pair of users in Perspective 1, as shown in the following table:

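| user | Mary | Kodai | Xiao Ming | Little Bear | Xiaoxuan |
|---|---|---|---|---|---|
| Mary | 1 | 0.866025 | NaN | 0.258199 | 0 |
| Kodai | 0.866025 | 1 | 0.5 | 0 | -0.5 |
| Xiao Ming | NaN | 0.5 | 1 | NaN | 0.866025 |
| Little Bear | 0.258199 | 0 | NaN | 1 | 1 |
| Xiaoxuan | 0 | -0.5 | 0.866025 | 1 | 1 |

Perspective 2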
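The pairwise values in Perspective 2 can be recomputed with the sketch below. Two things in it are assumptions rather than something the text states outright: the `ratings` dictionary places each rating from Perspective 1 in the product column that is consistent with the vectors quoted in this article and with the values in Perspective 2, and a pair of users is given NaN when they share fewer than two rated products or when one side's shared ratings are all identical.

```python
import math

# Assumed reading of Perspective 1: user -> {product: rating}
ratings = {
    "Xiao Ming": {"foundation": 4, "shampoo": 5, "cleansing milk": 4, "conditioner": 5},
    "Little Bear": {"lipstick": 5, "perfume": 3, "foundation": 4, "cream": 2, "cleansing milk": 4},
    "Xiaoxuan": {"perfume": 1, "foundation": 2, "eye cream": 3, "shampoo": 3, "conditioner": 4},
    "Kodai": {"foundation": 4, "eye cream": 2, "cream": 3, "shampoo": 4, "cleansing milk": 2},
    "Mary": {"lipstick": 1, "perfume": 1, "foundation": 2, "eye cream": 1, "cream": 1},
}

def pearson(r_u, r_v):
    """Pearson correlation over the products rated by both users."""
    common = [item for item in r_u if item in r_v]
    if len(common) < 2:
        return float("nan")  # too little overlap to correlate
    mean_u = sum(r_u[i] for i in common) / len(common)
    mean_v = sum(r_v[i] for i in common) / len(common)
    du = [r_u[i] - mean_u for i in common]
    dv = [r_v[i] - mean_v for i in common]
    denom = math.sqrt(sum(a * a for a in du)) * math.sqrt(sum(b * b for b in dv))
    if denom == 0:
        return float("nan")  # one side has identical ratings on the shared products
    return sum(a * b for a, b in zip(du, dv)) / denom

users = list(ratings)
similarity = {u: {v: pearson(ratings[u], ratings[v]) for v in users} for u in users}

for u in users:
    print(u, [round(similarity[u][v], 6) for v in users])
```

The NaN entries for Xiao Ming arise from exactly those two cases: he shares only foundation with Mary, and his shared ratings with Little Bear are all 4.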

Xiao Ming and Xiaoxuan are highly similar, and Xiaoxuan has also rated perfume and eye cream, which Xiao Ming has not. How do we estimate Xiao Ming's ratings for these two items? A weighted average is used here, with the following formula:

R_{u,p}=\frac{\Sigma_{s\in S}(w_{u,s}*R_{s,p})}{\Sigma_{s\in S}w_{u,s}}

R_{u,p} is the estimated rating of user u for product p, S is the set of users similar to u, w_{u,s} is the similarity between users u and s, and R_{s,p} is the rating of user s for product p.

If user s has not rated product p, s does not take part in the calculation.

In the actual calculation we also need to fix the size of the set S, that is, how many users similar to Xiao Ming serve as the basis for his recommendations. For ease of calculation we use a single similar user here. Perspective 2 shows that Xiaoxuan is the most similar to Xiao Ming, so Xiao Ming's estimated ratings for perfume and eye cream are:

R_{Xiao Ming, perfume}=\frac{0.866025*1}{0.866025}=1

R_{Xiao Ming, eye cream}=\frac{0.866025*3}{0.866025}=3
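The two estimates above follow from a similarity-weighted average of the neighbors' ratings, which can be sketched as follows. The function name and the `(similarity, ratings)` tuple layout are assumptions made for this illustration.

```python
def predict_weighted(neighbors, product):
    """Similarity-weighted average of the neighbors' ratings for one product.

    neighbors: list of (similarity to the target user, {product: rating}).
    Neighbors who have not rated the product are skipped.
    """
    numerator = 0.0
    denominator = 0.0
    for weight, neighbor_ratings in neighbors:
        if product in neighbor_ratings:
            numerator += weight * neighbor_ratings[product]
            denominator += weight
    if denominator == 0:
        return float("nan")  # nobody in the neighbor set rated this product
    return numerator / denominator

# Xiaoxuan is the only neighbor used for Xiao Ming here
xiaoxuan = {"perfume": 1, "foundation": 2, "eye cream": 3, "shampoo": 3, "conditioner": 4}
neighbors_of_ming = [(0.866025, xiaoxuan)]

perfume_estimate = predict_weighted(neighbors_of_ming, "perfume")
eye_cream_estimate = predict_weighted(neighbors_of_ming, "eye cream")
print(round(perfume_estimate, 3), round(eye_cream_estimate, 3))  # 1.0 3.0
```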

The result looks problematic: the estimates are simply Xiaoxuan's own ratings. One reason is that we used only one similar user; the other is that the individual differences in rating habits mentioned earlier are not taken into account. Applying the idea of mean-centering gives the following formula:

R_{u,p}=\overline{R_u}+\frac{\Sigma_{s\in S}(w_{u,s}*(R_{s,p}-\overline{R_s}))}{\Sigma_{s\in S}w_{u,s}}

Recalculating Xiao Ming's estimated ratings for perfume and eye cream:

\overrightarrow{Xiao Ming}(4, 5, 5) \overline{R_{Xiao Ming}}=\frac{4+5+5}{3}\approx4.667

\overrightarrow{Xiaoxuan}(2, 3, 4) \overline{R_{Xiaoxuan}}=\frac{2+3+4}{3}=3

R_{Xiao Ming, perfume}=4.667+\frac{0.866025*(1-3)}{0.866025}=4.667-2\approx2.667

R_{Xiao Ming, eye cream}=4.667+\frac{0.866025*(3-3)}{0.866025}\approx4.667

Both calculation methods give the same ranking: eye cream is recommended to Xiao Ming ahead of perfume. Either can be used in practice, depending on the situation.

In practice, whether products that Xiao Ming has already rated highly should also enter the recommendation list depends on the business situation.
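The mean-centered estimate can be sketched in the same way. As in the worked example, each user's mean is taken over the products co-rated by Xiao Ming and Xiaoxuan; that choice, along with the function name and tuple layout, is an assumption of this illustration.

```python
def predict_mean_centered(target_mean, neighbors, product):
    """Target user's mean plus the weighted average of the neighbors' deviations.

    neighbors: list of (similarity, neighbor's mean rating, {product: rating}).
    """
    numerator = 0.0
    denominator = 0.0
    for weight, neighbor_mean, neighbor_ratings in neighbors:
        if product in neighbor_ratings:
            numerator += weight * (neighbor_ratings[product] - neighbor_mean)
            denominator += weight
    if denominator == 0:
        return float("nan")  # nobody in the neighbor set rated this product
    return target_mean + numerator / denominator

ming_mean = (4 + 5 + 5) / 3   # Xiao Ming's mean over the co-rated products
xuan_mean = (2 + 3 + 4) / 3   # Xiaoxuan's mean over the co-rated products
xiaoxuan = {"perfume": 1, "foundation": 2, "eye cream": 3, "shampoo": 3, "conditioner": 4}
neighbors_of_ming = [(0.866025, xuan_mean, xiaoxuan)]

perfume_estimate = predict_mean_centered(ming_mean, neighbors_of_ming, "perfume")
eye_cream_estimate = predict_mean_centered(ming_mean, neighbors_of_ming, "eye cream")
print(round(perfume_estimate, 3), round(eye_cream_estimate, 3))  # 2.667 4.667
```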

At this point, the overall process of the collaborative filtering algorithm is over.

3. Evaluation of recommendation system effect

Recall

In a delivery cycle T, Recall=\frac{\text{number of correct predictions}}{\text{number of actual user operations}}

Precision

In a delivery cycle T, Precision=\frac{\text{number of correct predictions}}{\text{number of operations predicted for users}}

Coverage

In a delivery cycle T, Coverage=\frac{\text{number of distinct items in the union of all users' recommendation lists}}{\text{total number of items}}

Coverage only shows how many items were recommended, not how many times each was recommended.

Popularity

To be added later; omitted here.

Glossary

Operation: it can be click, purchase, evaluation, etc.;

Prediction is correct: During the entire test cycle, it is predicted that the user will purchase a product, and the user has indeed purchased the product during verification;

Actual user actions: For example, the user purchased the product during the entire test cycle
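The three metrics defined in this section can be computed as in the sketch below. The users, items, and recommendation lists are hypothetical data invented for the illustration; "operation" is taken to mean a purchase.

```python
# Hypothetical data for one delivery cycle
recommended = {"u1": {"A", "B", "C"}, "u2": {"B", "D"}}   # items predicted for each user
actual = {"u1": {"A", "D"}, "u2": {"B", "D", "E"}}        # items each user actually purchased
all_items = {"A", "B", "C", "D", "E", "F"}

# A prediction is correct when a recommended item was actually purchased by that user
correct = sum(len(recommended[u] & actual.get(u, set())) for u in recommended)

recall = correct / sum(len(items) for items in actual.values())
precision = correct / sum(len(items) for items in recommended.values())
coverage = len(set().union(*recommended.values())) / len(all_items)

print(recall, precision, round(coverage, 3))  # 0.6 0.6 0.667
```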

4. Summary

Problem 1: Data is sparse and generalization ability is weak

In practice the number of products or users is very large, so the resulting matrix is likely to be sparse. Popular products then show a head effect and appear similar to many other products, while tail products are rarely recommended because their vectors are so sparse; very active users have the same kind of impact. In response, the industry applies penalties for popular products and active users in the collaborative filtering similarity calculation.
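One common form of such a penalty, shown here only as a sketch and not necessarily the form any given system uses, divides the co-occurrence count of two items by the square root of the product of their popularities, and down-weights each user's contribution by the logarithm of that user's activity. The purchase data and the function name are hypothetical.

```python
import math
from itertools import combinations

# Hypothetical purchase sets: user -> items purchased
purchases = {
    "u1": {"A", "B"},
    "u2": {"A", "B", "C", "D"},   # a very active user
    "u3": {"A", "C"},
}

def item_similarity(purchases, penalize_active_users=True):
    """Item-item co-occurrence similarity with popularity and activity penalties."""
    popularity = {}     # item -> number of buyers
    co_occurrence = {}  # (item_i, item_j) -> weighted number of users who bought both
    for items in purchases.values():
        if not items:
            continue
        # An active user contributes less to every pair of items they bought
        weight = 1 / math.log(1 + len(items)) if penalize_active_users else 1.0
        for item in items:
            popularity[item] = popularity.get(item, 0) + 1
        for i, j in combinations(sorted(items), 2):
            co_occurrence[(i, j)] = co_occurrence.get((i, j), 0.0) + weight
    # Dividing by sqrt(popularity_i * popularity_j) penalizes popular items
    return {
        (i, j): count / math.sqrt(popularity[i] * popularity[j])
        for (i, j), count in co_occurrence.items()
    }

plain = item_similarity(purchases, penalize_active_users=False)
penalized = item_similarity(purchases)
print(round(plain[("A", "B")], 4), round(penalized[("A", "B")], 4))
```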

Problem 2: Cold start problem

Collaborative filtering suits recommendations for retained (existing) users; new users have to go through a cold-start period during which their data is collected.

Advantage 1: Simple and efficient

The recommendation function can be built with very few features: only the interaction information between users and items is used.

Internet marketing keeps moving toward precision. User behavior information and item information are actually very rich, and collaborative filtering does not make full use of them. In precision-marketing scenarios, many recommendation systems use logistic regression models instead.

Comparison of user-based and item-based CF

5. Brief description of item-based CF calculation

User-based CF uses the row vectors of Perspective 1; item-based CF uses its column vectors.

For lipstick, perfume, and foundation, the column vectors are as follows (users in the order Xiao Ming, Little Bear, Xiaoxuan, Kodai, Mary):

\overrightarrow{lipstick}=(None,5,None,None,1)

\overrightarrow{perfume}=(None,3,1,None,1)

\overrightarrow{foundation}=(4,4,2,4,2)

The similarity calculation formula remains unchanged, and the estimated score calculation formula also remains unchanged.
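As a brief illustration that the same similarity code carries over to items, the sketch below applies cosine similarity to the three column vectors above, keeping only the users who rated both products. The function name and the use of `None` for a missing rating are choices made for this example.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity over the positions where both vectors have a rating."""
    pairs = [(a, b) for a, b in zip(u, v) if a is not None and b is not None]
    if not pairs:
        return float("nan")  # no user rated both products
    dot = sum(a * b for a, b in pairs)
    norm_u = math.sqrt(sum(a * a for a, _ in pairs))
    norm_v = math.sqrt(sum(b * b for _, b in pairs))
    return dot / (norm_u * norm_v)

# Column vectors: ratings by Xiao Ming, Little Bear, Xiaoxuan, Kodai, Mary
lipstick = [None, 5, None, None, 1]
perfume = [None, 3, 1, None, 1]
foundation = [4, 4, 2, 4, 2]

sim_lipstick_perfume = cosine_similarity(lipstick, perfume)
sim_perfume_foundation = cosine_similarity(perfume, foundation)
print(round(sim_lipstick_perfume, 5))  # 0.99228
```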


Origin blog.csdn.net/weixin_43805705/article/details/130868811