Collaborative filtering algorithm (example understanding)

        Collaborative filtering is a recommendation-system algorithm that uses users' ratings of items to predict their preferences for items they have not yet rated. It rests on a simple idea: if two users have rated items similarly in the past, they are likely to rate other items similarly in the future. The algorithm therefore uses similarity (between users, or between items) as the basis for predicting a user's ratings, and thereby for predicting the user's interests. It comes in two variants: user-based collaborative filtering and item-based collaborative filtering.

1. User-based collaborative filtering

        The user-based collaborative filtering algorithm is a recommendation-system algorithm. Its basic idea is to use the target user's historical behavior data to find other users with similar interests, then use those similar users' behavior data to predict the target user's interests and recommend items accordingly.

Specifically, the user-based collaborative filtering algorithm includes the following steps:

  1. Determine the target user, i.e., the user for whom items need to be recommended.

  2. Find other users similar to the target user by computing user-user similarity. Commonly used similarity measures include cosine similarity and the Pearson correlation coefficient.

  3. Select a neighbor set: rank users by similarity and keep the top k most similar users as neighbors.

  4. Predict the target user's ratings for unrated items, for example by a weighted average or weighted sum of the neighbors' ratings of those items, with the user-user similarities serving as the weights.

  5. Recommend unrated items to the target user: sort them by predicted rating and return the top n as the recommendation result.
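The five steps above can be sketched in a few lines of Python. This is an illustrative sketch only, not code from the original post: the function name, the 0-means-unrated convention, and the choice of cosine similarity are assumptions; the ratings matrix is the one from the worked example in section 4-1.

```python
import numpy as np

def user_based_recommend(ratings, target, k=2, n=2):
    """Recommend the top-n unrated items for `target` using k nearest neighbors.

    ratings: 2-D array, rows = users, columns = items, 0 = unrated.
    """
    # Step 2: cosine similarity between the target user and every other user.
    norms = np.linalg.norm(ratings, axis=1)
    sims = ratings @ ratings[target] / (norms * norms[target] + 1e-12)
    sims[target] = -1.0                       # exclude the target itself

    # Step 3: keep the top-k most similar users as the neighbor set.
    neighbors = np.argsort(sims)[::-1][:k]

    # Step 4: predicted score for each item = similarity-weighted sum of
    # the neighbors' ratings of that item.
    scores = sims[neighbors] @ ratings[neighbors]
    scores[ratings[target] > 0] = -1.0        # only consider unrated items

    # Step 5: return the top-n items by predicted score.
    return [int(i) for i in np.argsort(scores)[::-1][:n]]

# Ratings of users A, B, C for products 1-4 (0 = unrated), as in section 4-1.
R = np.array([[5.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 4.5, 3.0],
              [1.0, 4.0, 0.0, 4.0]])
print(user_based_recommend(R, target=0))      # items recommended to user A
```

For user A this yields products 4 and 3 (0-based column indices 3 and 2), matching the predicted preferences computed by hand in section 4-1.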

        The advantage of the user-based collaborative filtering algorithm is that it exploits users' historical behavior data and yields well-personalized recommendations. It also has drawbacks: for a new user it cannot accurately predict interests until enough behavior data has accumulated, and there is likewise a cold-start problem on the item side, since newly added items cannot be recommended effectively at first.

2. Item-based collaborative filtering algorithm

        The item-based collaborative filtering algorithm is a recommendation algorithm that uses users' ratings of items to find similarities between items, and then recommends to each user items similar to those they liked in the past. Its core idea is to predict a user's rating of an item from the similarity between items.

Specifically, the process of item-based collaborative filtering algorithm is as follows:

1. Compute the similarity between each pair of items. Common similarity measures include cosine similarity and the Pearson correlation coefficient.

2. Find the items the target user has rated in the past; this historical set is the basis for prediction.

3. Compute a weighted score for each candidate item: sum, over the items the user has rated, of (item-item similarity × the user's rating).

4. Recommend the highest-scoring candidate items to the user.
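A comparable sketch of the item-based variant (again illustrative only; the names and the 0-means-unrated convention are assumptions), reusing the rating table from the 4-1 example:

```python
import numpy as np

def item_based_scores(ratings, user):
    """Predicted scores for one user via item-item cosine similarity."""
    # Step 1: item-item cosine similarity matrix (items are columns).
    norms = np.linalg.norm(ratings, axis=0)
    sim = ratings.T @ ratings / (np.outer(norms, norms) + 1e-12)
    np.fill_diagonal(sim, 0.0)            # an item is not its own neighbor

    # Steps 2-3: weighted score = sum over the user's rated items of
    # (item similarity * the user's rating); unrated items contribute 0.
    return sim @ ratings[user]

# Rows: users A, B, C; columns: products 1-4 (0 = unrated).
R = np.array([[5.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 4.5, 3.0],
              [1.0, 4.0, 0.0, 4.0]])
scores = item_based_scores(R, user=0)
print(round(float(scores[3]), 6))         # user A's predicted score for product 4
```

This reproduces the hand calculation in section 4-3: user A's predicted preference for product 4 comes out to about 1.560578.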

        Compared with the user-based algorithm, item-based collaborative filtering has the advantage that the item-item similarity matrix can be computed efficiently even when there are many items, and the resulting recommendations tend to be more stable and accurate.

3. Key formula

Pearson correlation coefficient:

$$\mathrm{sim}(u,v)=\frac{\sum_{i\in I_{uv}}\bigl(r_{u,i}-\bar r_u\bigr)\bigl(r_{v,i}-\bar r_v\bigr)}{\sqrt{\sum_{i\in I_{uv}}\bigl(r_{u,i}-\bar r_u\bigr)^2}\;\sqrt{\sum_{i\in I_{uv}}\bigl(r_{v,i}-\bar r_v\bigr)^2}}$$

where $I_{uv}$ is the set of items rated by both users and $\bar r_u$ is user $u$'s mean rating.

Cosine similarity:

$$\mathrm{sim}(u,v)=\cos(\mathbf r_u,\mathbf r_v)=\frac{\sum_{i} r_{u,i}\,r_{v,i}}{\sqrt{\sum_{i} r_{u,i}^2}\;\sqrt{\sum_{i} r_{v,i}^2}}$$

        Pearson's correlation coefficient and cosine similarity are two common methods used to measure the similarity between vectors. They have some differences in calculation methods and application scenarios.

1. Calculation method:

        The Pearson correlation coefficient measures the similarity between two vectors as their covariance divided by the product of their standard deviations. Its range is [-1, 1], where 1 means perfect positive correlation, -1 means perfect negative correlation, and 0 means no linear correlation.

        Cosine similarity measures the similarity between two vectors as the cosine of the angle between them. Its range is [-1, 1], where 1 means the vectors point in the same direction, -1 means they point in opposite directions, and 0 means they are orthogonal (no similarity).
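A small sketch makes the difference concrete (the vectors are hypothetical; the key point is that Pearson centers the data first, cosine does not):

```python
import numpy as np

def pearson(x, y):
    # Covariance divided by the product of the standard deviations.
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float(xc @ yc / (np.linalg.norm(xc) * np.linalg.norm(yc)))

def cosine(x, y):
    # Cosine of the angle between the two vectors.
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

a = [1, 2, 3, 4]
b = [2, 4, 6, 8]          # b = 2a: same direction and perfectly correlated
c = [101, 102, 103, 104]  # a shifted by 100: still perfectly correlated

print(round(pearson(a, b), 4), round(cosine(a, b), 4))  # 1.0 1.0
print(round(pearson(a, c), 4), round(cosine(a, c), 4))  # 1.0 0.9173
```

Shifting a vector by a constant leaves Pearson unchanged (it subtracts the mean) but lowers cosine similarity, which is one reason Pearson suits rating data where users have different baseline scoring habits.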

2. Application scenarios:

        The Pearson correlation coefficient is often used for numerical data such as ratings, and in time-series analysis. It measures the linear correlation between two variables and is suitable for continuous variables.

        Cosine similarity is often used for text and other sparse data, such as item recommendation in recommender systems. It measures the angle (direction) between two vectors and is well suited to large-scale sparse data.

Calculate the predicted score:

$$\hat r_{u,i}=\bar r_u+\frac{\sum_{v\in N_k(u)}\mathrm{sim}(u,v)\,\bigl(r_{v,i}-\bar r_v\bigr)}{\sum_{v\in N_k(u)}\bigl|\mathrm{sim}(u,v)\bigr|}$$

The numerator aggregates the mean-adjusted ratings that the neighbor users gave to the same item i, weighted by their similarity to user u, while the denominator normalizes by the total similarity weight; each user's mean rating $\bar r$ is computed separately over that user's own rated items.

Co-occurrence similarity:

        Here the denominator is the number of users who like item i, and the numerator is the number of users who like both item i and item j. The formula below can therefore be read as the proportion of users who like item i that also like item j (similar in spirit to association rules):

$$w_{ij}=\frac{|N(i)\cap N(j)|}{|N(i)|}$$

The denominator |N(i)| is the number of users who like item i, and the numerator |N(i)∩N(j)| is the number of users who like items i and j at the same time. However, if item j is very popular, w_ij will be close to 1. To avoid always recommending popular items, we instead use the following formula, which penalizes the popularity of item j:

$$w_{ij}=\frac{|N(i)\cap N(j)|}{\sqrt{|N(i)|\,|N(j)|}}$$
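The penalized co-occurrence similarity can be computed directly from a table of each user's liked items. The sketch below is a plain-Python illustration; the data and the function name are made up for the example:

```python
import math

def cooccurrence_similarity(user_items):
    """w_ij = |N(i) ∩ N(j)| / sqrt(|N(i)| * |N(j)|).

    user_items: dict mapping each user to the set of items they like.
    The sqrt(|N(j)|) term in the denominator penalizes very popular items j.
    """
    # Build N(i): the set of users who like item i.
    likers = {}
    for user, items in user_items.items():
        for item in items:
            likers.setdefault(item, set()).add(user)

    sims = {}
    for i in likers:
        for j in likers:
            if i != j:
                common = len(likers[i] & likers[j])
                sims[(i, j)] = common / math.sqrt(len(likers[i]) * len(likers[j]))
    return sims

prefs = {"u1": {"a", "b"}, "u2": {"a", "b", "c"}, "u3": {"b", "c"}}
sims = cooccurrence_similarity(prefs)
print(round(sims[("a", "b")], 4))   # 2 / sqrt(2 * 3) ≈ 0.8165
```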

User u's interest in item j:

$$p_{uj}=\sum_{i\in N(u)\cap S(j,K)} w_{ji}\,r_{ui}$$

where N(u) is the set of items user u has rated, S(j, K) is the set of the K items most similar to item j, w is the item similarity above, and r_{ui} is user u's rating of item i.

4. Example

4-1 User-based collaborative filtering - cosine similarity

|        | Product 1 | Product 2 | Product 3 | Product 4 |
|--------|-----------|-----------|-----------|-----------|
| User A | 5         | 1         | –         | –         |
| User B | –         | –         | 4.5       | 3         |
| User C | 1         | 4         | –         | 4         |

("–" means unrated and is treated as 0 in the calculations.)

Calculation of similarity between users:

        sim(A, B) = (5×0 + 1×0 + 0×4.5 + 0×3) / (√(5² + 1² + 0² + 0²) × √(0² + 0² + 4.5² + 3²)) = 0

        sim(A, C) = (5×1 + 1×4 + 0×0 + 0×4) / (√(5² + 1² + 0² + 0²) × √(1² + 4² + 0² + 4²)) ≈ 0.307254934

        sim(B, C) = (0×1 + 0×4 + 4.5×0 + 3×4) / (√(0² + 0² + 4.5² + 3²) × √(1² + 4² + 0² + 4²)) ≈ 0.38624364

Compute predicted preferences:

User A's preference for product 3 = sim(A, B) × B's rating of product 3 + sim(A, C) × C's rating of product 3 = 0 × 4.5 + 0.307254934 × 0 = 0

User A's preference for product 4 = sim(A, B) × B's rating of product 4 + sim(A, C) × C's rating of product 4 = 0 × 3 + 0.307254934 × 4 ≈ 1.22901974

User B's preference for product 1 = sim(A, B) × A's rating of product 1 + sim(B, C) × C's rating of product 1 = 0 × 5 + 0.38624364 × 1 ≈ 0.38624364
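These hand calculations are easy to check in NumPy (a quick verification sketch; 0 stands in for an unrated item):

```python
import numpy as np

# Ratings of users A, B, C for products 1-4 from the table above (0 = unrated).
A = np.array([5.0, 1.0, 0.0, 0.0])
B = np.array([0.0, 0.0, 4.5, 3.0])
C = np.array([1.0, 4.0, 0.0, 4.0])

def cos(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

print(round(cos(A, B), 9))   # 0.0
print(round(cos(A, C), 9))   # ≈ 0.307254934
print(round(cos(B, C), 9))   # ≈ 0.38624364
# User A's predicted preference for product 4:
print(round(cos(A, B) * 3 + cos(A, C) * 4, 8))   # ≈ 1.22901974
```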

4-2 User-based collaborative filtering - Pearson coefficient

 (Figure omitted: rating table for users A-E in this example.)

 Similarity calculation (taking user C's rating of product 4 as an example):

 Among the other users, only user A and user D have also rated product 4, so there are only two candidate neighbors: user A and user D.

(Figures omitted: Pearson similarity calculations for users A and D.)

 Calculate the predicted score:

(Figure omitted.)

 Other handwritten derivations (for users A, B, D, and E) are omitted here; the figures are not available.

4-3 Item-based collaborative filtering - cosine similarity

(This example uses the same rating table as section 4-1.)

The item-item cosine similarities work out as follows:

sim(product 1, product 2) = (5×1 + 0×0 + 1×4) / (√(5² + 0² + 1²) × √(1² + 0² + 4²)) ≈ 0.428086345

sim(product 1, product 3) = (5×0 + 0×4.5 + 1×0) / (√(5² + 0² + 1²) × √(0² + 4.5² + 0²)) = 0

....

Compute predicted preferences:

User A's preference for product 3 = sim(1, 3) × A's rating of product 1 + sim(2, 3) × A's rating of product 2 + sim(4, 3) × A's rating of product 4 = 0 × 5 + 0 × 1 + 0.6 × 0 = 0

User A's preference for product 4 = sim(1, 4) × A's rating of product 1 + sim(2, 4) × A's rating of product 2 + sim(3, 4) × A's rating of product 3 = 0.156892908 × 5 + 0.776114 × 1 + 0.6 × 0 ≈ 1.560578541

User B's preference for product 1 = sim(1, 2) × B's rating of product 2 + sim(1, 3) × B's rating of product 3 + sim(1, 4) × B's rating of product 4 = 0.428086345 × 0 + 0 × 4.5 + 0.156892908 × 3 ≈ 0.47067872
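The 4-3 calculations can be verified the same way (a quick sketch; items are the columns of the ratings matrix):

```python
import numpy as np

# Rows: users A, B, C; columns: products 1-4 (0 = unrated).
R = np.array([[5.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 4.5, 3.0],
              [1.0, 4.0, 0.0, 4.0]])

norms = np.linalg.norm(R, axis=0)
S = R.T @ R / np.outer(norms, norms)     # item-item cosine similarity matrix

print(round(float(S[0, 1]), 9))          # sim(product 1, product 2) ≈ 0.428086345

# User B's preference for product 1 = Σ_j sim(1, j) × B's rating of product j.
pref_B1 = S[0, 1] * 0 + S[0, 2] * 4.5 + S[0, 3] * 3.0
print(round(float(pref_B1), 8))          # ≈ 0.47067872
```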

Another example:

        The co-occurrence matrix C records, for each pair of items, the number of users who like both items; it is computed from the user-item correspondence table, and the similarity matrix is derived from it. (Co-occurrence matrix and similarity matrix figures omitted.)

Supplement:

1. Co-occurrence matrix (the number of users who like two items at the same time)

2. Improved similarity algorithm

        As discussed above, two items end up similar in collaborative filtering because they appear together in many users' interest lists; in other words, every user's interest list contributes to item similarity. But should every user contribute equally?
        Suppose a user runs a bookstore and has bought 80% of the books on Dangdang.com, planning to resell them, so his shopping cart contains 80% of Dangdang's catalog. If Dangdang stocks 1 million books, he has bought 800,000 of them. From the earlier discussion of ItemCF, this single user creates pairwise similarities among all 800,000 books, that is, an 800,000 × 800,000 dense similarity matrix.

        John S. Breese proposed in his paper a parameter called IUF (Inverse User Frequency), the reciprocal of the logarithm of a user's activity, arguing that active users should contribute less to item similarity than inactive users. Adding the IUF weight corrects the item-similarity formula to:

$$w_{ij}=\frac{\sum_{u\in N(i)\cap N(j)}\dfrac{1}{\log\bigl(1+|N(u)|\bigr)}}{\sqrt{|N(i)|\,|N(j)|}}$$

The formula above imposes only a soft penalty on active users. For users who are far too active, such as the bookseller above who bought 80% of Dangdang's books, we generally just ignore their interest lists and leave them out of the similarity computation altogether, to keep the similarity matrix from becoming too dense.
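Assuming the IUF weight 1 / log(1 + |N(u)|) from the formula above, the corrected similarity can be sketched as follows (illustrative data and names only):

```python
import math

def itemcf_iuf(user_items):
    """ItemCF-IUF: each user contributes 1/log(1 + |N(u)|), rather than 1,
    to every pair of items in their interest list."""
    item_users = {}   # N(i): users who like item i
    cooc = {}         # IUF-weighted co-occurrence counts
    for user, items in user_items.items():
        weight = 1.0 / math.log(1 + len(items))
        for i in items:
            item_users.setdefault(i, set()).add(user)
        for i in items:
            for j in items:
                if i != j:
                    cooc[(i, j)] = cooc.get((i, j), 0.0) + weight

    return {(i, j): c / math.sqrt(len(item_users[i]) * len(item_users[j]))
            for (i, j), c in cooc.items()}

prefs = {"casual": {"a", "b"},                     # weight 1/log(3) ≈ 0.91
         "bookseller": {"a", "b", "c", "d", "e"}}  # weight 1/log(6) ≈ 0.56
print(round(itemcf_iuf(prefs)[("a", "b")], 4))     # ≈ 0.7342; unweighted w_ab = 1.0
```

The very active "bookseller" user pulls the similarity of (a, b) down less than the casual user does, which is exactly the soft penalty the text describes.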

3. Normalization of similarity matrix

        Karypis found in his research that normalizing the ItemCF similarity matrix by its row maximum improves recommendation accuracy. His work shows that, given the item similarity matrix w, the normalized similarity matrix w' is obtained by:

$$w'_{ij}=\frac{w_{ij}}{\max_{j} w_{ij}}$$

Experiments show that normalization not only improves recommendation accuracy but also increases the coverage and diversity of the recommendations.
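Max normalization is a one-liner in NumPy (a sketch; the example similarity matrix is made up):

```python
import numpy as np

def normalize_rows(w):
    """w'_ij = w_ij / max_j w_ij: divide each row by its maximum."""
    row_max = w.max(axis=1, keepdims=True)
    row_max[row_max == 0] = 1.0      # leave all-zero rows untouched
    return w / row_max

w = np.array([[0.0, 0.8, 0.4],
              [0.8, 0.0, 0.2],
              [0.4, 0.2, 0.0]])
print(normalize_rows(w))
```

After normalization the largest similarity in every row is 1, which helps keep items from popular, tightly-clustered categories from dominating every recommendation list.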


Origin blog.csdn.net/weixin_52093896/article/details/130307676