Recommendation System Team Learning: Collaborative Filtering

1. Collaborative filtering algorithm

Collaborative Filtering (CF) is the most classic and most commonly used family of recommendation algorithms.

The basic idea of collaborative filtering is to recommend items to a user based on that user's previous preferences and the choices of other users with similar interests: it mines the user's historical behavior data to discover preference biases and predicts which products the user is likely to enjoy. In general it relies only on behavior data (ratings, purchases, downloads, etc.) and does not depend on any additional item information (item features) or user information (age, gender, etc.). The most widely used collaborative filtering algorithms are neighborhood-based methods, of which there are two main variants:

  • User-based collaborative filtering (UserCF): recommend products liked by other users with similar interests
  • Item-based collaborative filtering (ItemCF): recommend items similar to the items the user liked before

For both UserCF and ItemCF, a crucial step is computing the similarity between users or between items. The commonly used similarity measures are therefore summarized first, and the details of each algorithm are expanded afterwards.

 

2. Similarity measurement method

  1. Jaccard similarity coefficient
    This index measures the similarity of two sets. The ratio of the size of the intersection of the sets of products users u and v interacted with to the size of their union is called the Jaccard similarity coefficient of the two sets, denoted sim_{uv}, where N(u) and N(v) are the sets of products user u and user v interacted with:

    sim_{uv} = \frac{|N(u) \cap N(v)|}{|N(u) \cup N(v)|}

    Since the Jaccard similarity coefficient does not reflect a user's specific rating preferences, it is usually used to judge whether a user will rate a product at all, rather than to estimate how highly the user will rate it.
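
    As a quick illustration, a minimal Python sketch of this formula (the interaction histories here are made-up examples, not from the rating table below):

    def jaccard_sim(N_u, N_v):
        """Jaccard similarity of two users' interaction sets."""
        N_u, N_v = set(N_u), set(N_v)
        if not (N_u | N_v):
            return 0.0
        return len(N_u & N_v) / len(N_u | N_v)

    jaccard_sim(['A', 'B', 'C'], ['B', 'C', 'D'])   # 2 / 4 = 0.5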

  2. Cosine similarity
    Cosine similarity measures the angle between two vectors: the smaller the angle, the more similar they are. From the set perspective, the formula differs from the Jaccard coefficient only in its denominator, which is not the size of the union of the two users' interacted products but the geometric mean of the sizes of their interaction sets:

    sim_{uv} = \frac{|N(u) \cap N(v)|}{\sqrt{|N(u)| \cdot |N(v)|}}

    From the vector perspective, let matrix A be the user-item interaction matrix (TopN recommendation does not require explicit ratings; it only needs to know whether a user interacted with an item). Each row of the matrix represents one user's interactions with all items: the value is 1 for an item with interaction and 0 otherwise, and the columns represent all items. If there are m users and n items, the interaction matrix A has m rows and n columns. The similarity of two users u and v can then be expressed as (where u · v is the vector dot product):

    sim_{uv} = \cos(u, v) = \frac{u \cdot v}{|u| \cdot |v|}

    In reality the user-item interaction matrix is very sparse. To avoid storing such a large sparse matrix, user similarity is usually computed with the set-based formula instead. In principle any vector similarity formula could be used to compute the similarity between users, but the measure is chosen according to the actual situation.

    This can be implemented with sklearn's cosine_similarity:

    from sklearn.metrics.pairwise import cosine_similarity

    i = [1, 0, 0, 0]
    j = [1, 0.5, 0.5, 0]
    cosine_similarity([i, j])   # returns a 2x2 matrix; the off-diagonal entry is sim(i, j)
    
  3. Pearson correlation coefficient

    The formula for the Pearson correlation coefficient is very similar to the cosine similarity. First, write the cosine similarity above in summation form, where r_{ui} and r_{vi} indicate whether user u and user v interacted with item i (or their specific rating values):

    sim_{uv} = \frac{\sum_i r_{ui} \cdot r_{vi}}{\sqrt{\sum_i r_{ui}^2} \sqrt{\sum_i r_{vi}^2}}

    The Pearson correlation coefficient is then computed as follows, where r_{ui} and r_{vi} indicate whether user u and user v interacted with item i (or their specific rating values), and \bar{r}_u, \bar{r}_v are the average ratings of user u and user v over all the items they interacted with:

    sim(u, v) = \frac{\sum_{i \in I}(r_{ui} - \bar{r}_u)(r_{vi} - \bar{r}_v)}{\sqrt{\sum_{i \in I}(r_{ui} - \bar{r}_u)^2} \sqrt{\sum_{i \in I}(r_{vi} - \bar{r}_v)^2}}

    Compared with cosine similarity, the Pearson correlation coefficient corrects each individual rating by the user's average rating, reducing the impact of user rating bias. For a concrete implementation we can simply call an existing library; there are many options, and the following is one of them:

    from scipy.stats import pearsonr

    i = [1, 0, 0, 0]
    j = [1, 0.5, 0.5, 0]
    pearsonr(i, j)   # returns (correlation coefficient, two-sided p-value)
    

The principles of user-based and item-based collaborative filtering are explained below.

 

3. User-based collaborative filtering

The idea of user-based collaborative filtering (UserCF below) is quite simple: when a user A needs personalized recommendations, we first find other users with interests similar to A's, and then recommend to A the items those users like that A has not heard of.


The UserCF algorithm mainly includes two steps:

  1. Find a set of users whose interests are similar to the target user's
  2. Find items liked by the users in this set that the target user has not heard of, and recommend them to the target user.

In the first step, we find users whose interests are similar to the target user's using the similarity measures given earlier. But in the second step, how do we recommend similar users' favorite items to the target user? This depends on the target user's predicted preference for those items, so how is that degree of preference measured? To better understand both steps, let us work through a concrete example.

 

The following table is an example; it will be used for every algorithm in this article:

            Item 1  Item 2  Item 3  Item 4  Item 5
Alice         5       3       4       4       ?
User 1        3       1       2       3       3
User 2        4       3       4       3       5
User 3        3       3       1       5       4
User 4        1       5       5       2       1

Recommending items to a user can be framed as the task of predicting the user's ratings of the products. The table above shows the ratings of 5 users for 5 items, which can be understood as each user's preference for each item.

Applying the UserCF algorithm takes two steps:

  1. First compute the similarity between Alice and users 1, 2, 3, 4 from the existing ratings (i.e., the existing user vectors), and find the n users most similar to Alice
  2. Predict Alice's rating of item 5 from these n users' ratings of item 5 and their similarity to Alice. If the predicted rating is relatively high, recommend item 5 to Alice; otherwise do not.

For the first step, the method for computing the similarity of two users was given above and is not repeated here. The focus here is the second problem: how to produce the final prediction.

 

Prediction of final result

Using the methods above we can compute the similarity between vectors, i.e., between Alice and the other users. We then select the top n users closest to Alice and predict Alice's rating of item 5 from their ratings of it. How exactly is that computed?

One commonly used method is to take the weighted average of similar users' ratings, weighted by user similarity, as the prediction of the user's rating:

R_{u,p} = \frac{\sum_{s \in S} w_{u,s} \cdot R_{s,p}}{\sum_{s \in S} w_{u,s}}

In this formula, the weight w_{u,s} is the similarity between user u and user s, and R_{s,p} is user s's rating of item p.

A second method refines the first. User similarity is still used as the weight, but instead of weighting the raw ratings of other users, it weights the difference between each user's rating of the item and that user's average rating. This accounts for users having different internal rating scales: some tend to rate high and others tend to rate low.

P_{i,j} = \bar{R}_i + \frac{\sum_{k=1}^{n} S_{i,k} (R_{k,j} - \bar{R}_k)}{\sum_{k=1}^{n} S_{i,k}}

This second method is therefore recommended, and the calculations below use it.
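
As a small illustration, here is a sketch of both prediction formulas; the function names and dictionary layout are my own (sims maps each similar user s to the similarity w_{u,s}, ratings to the rating R_{s,p}, and means to the average rating \bar{R}_s):

# Sketch of the two prediction formulas above (names are illustrative).
def predict_weighted_avg(sims, ratings):
    """R_{u,p}: similarity-weighted average of similar users' ratings."""
    return sum(sims[s] * ratings[s] for s in sims) / sum(sims.values())

def predict_mean_centered(user_mean, sims, ratings, means):
    """P: the user's own mean plus the weighted average rating deviation."""
    dev = sum(sims[s] * (ratings[s] - means[s]) for s in sims)
    return user_mean + dev / sum(sims.values())

# Matches the worked example below: Alice's predicted rating of item 5 is ~4.87
predict_mean_centered(4, {'u1': 0.85, 'u2': 0.7},
                      {'u1': 3, 'u2': 5}, {'u1': 2.4, 'u2': 3.8})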

After obtaining user u's predicted ratings for the different items, the final recommendation list is sorted by predicted score. This completes the user-based collaborative filtering recommendation process.

Let us now do the calculation by hand for the example above:

Aim: predict Alice's rating of item 5.

  1. Compute the similarity between Alice and the other users (the Pearson correlation coefficient is used here)

     


     

    Using the Pearson correlation coefficient, the similarity between Alice and user 1 works out to 0.85; the similarities with the other users are computed in the same way. We can use numpy's correlation function to obtain the full user similarity matrix:

     

    [Figure: the computed user similarity matrix]

     

    From this we can see that Alice's similarity with user 2, user 3, and user 4 is 0.7, 0, and -0.79 respectively. So for n = 2, the two users closest to Alice are user 1 (similarity 0.85) and user 2 (similarity 0.7).

  2. Calculate Alice's final score for item 5 from the similarities.
    User 1's rating of item 5 is 3, and user 2's rating of item 5 is 5. According to the calculation formula above, Alice's final score for item 5 is:

P_{Alice,5} = \bar{R}_{Alice} + \frac{\sum_{k=1}^{2} S_{Alice,k}(R_{k,5} - \bar{R}_k)}{\sum_{k=1}^{2} S_{Alice,k}} = 4 + \frac{0.85 \times (3 - 2.4) + 0.7 \times (5 - 3.8)}{0.85 + 0.7} \approx 4.87

  3. Recommend items based on the predicted ratings.
    Alice's predicted score for item 5 is 4.87. Ranking the items by Alice's ratings from high to low gives: item 1 > item 5 > item 3 = item 4 > item 2.
    If we want to recommend 2 products to Alice, we can recommend item 1 and item 5.

This completes the introduction to the principle of the user-based collaborative filtering algorithm.

 

4. UserCF programming implementation

Here is a simple program implementing the above case, as a warm-up for the big assignment. The process consists of three steps: compute the user similarity matrix, select the top n similar users, and compute the final score.

The program below follows these three steps:

  1. Set up the data table.
    A dictionary is used here rather than pandas because the example above is idealized; in reality user ratings of items are far from complete, with many missing values, so the matrix is very sparse and a DataFrame would be full of NaN. Two dictionaries are used: the first is an item-user rating mapping whose keys are items 1-5 (represented by A-E) and whose values are dictionaries of each user's rating of that item; the second is a user-item rating mapping whose keys are the five users (represented by 1-5) and whose values are each user's ratings of the items.
import numpy as np
import pandas as pd

# Define the dataset (the rating table above). A dictionary is used because real
# rating data is very sparse; a complete table like this one is rare in practice.
def loadData():
    items={'A': {1: 5, 2: 3, 3: 4, 4: 3, 5: 1},
           'B': {1: 3, 2: 1, 3: 3, 4: 3, 5: 5},
           'C': {1: 4, 2: 2, 3: 4, 4: 1, 5: 5},
           'D': {1: 4, 2: 3, 3: 3, 4: 5, 5: 2},
           'E': {2: 3, 3: 5, 4: 4, 5: 1}
          }
    users={1: {'A': 5, 'B': 3, 'C': 4, 'D': 4},
           2: {'A': 3, 'B': 1, 'C': 2, 'D': 3, 'E': 3},
           3: {'A': 4, 'B': 3, 'C': 4, 'D': 3, 'E': 5},
           4: {'A': 3, 'B': 3, 'C': 1, 'D': 5, 'E': 4},
           5: {'A': 1, 'B': 5, 'C': 5, 'D': 2, 'E': 1}
          }
    return items,users

items, users = loadData()
item_df = pd.DataFrame(items).T
user_df = pd.DataFrame(users).T

 

  2. Compute the user similarity matrix.
    This is a 5×5 co-occurrence matrix: rows and columns both represent users, and each value is the correlation between the corresponding pair of users. Since every pair of users is needed, we traverse the user-item rating data with a double loop; when the two users differ, we traverse the item-user rating data to find the items both users have rated and append those ratings to the two user vectors. In practice there are many missing values (a user may not have rated an item at all), and such items cannot be part of either user vector or of the similarity computation. The code makes this clearer than the description:
"""计算用户相似性矩阵"""
similarity_matrix = pd.DataFrame(np.zeros((len(users), len(users))), index=[1, 2, 3, 4, 5], columns=[1, 2, 3, 4, 5])

# 遍历每条用户-物品评分数据
for userID in users:
    for otheruserId in users:
        vec_user = []
        vec_otheruser = []
        if userID != otheruserId:
            for itemId in items:   # 遍历物品-用户评分数据
                itemRatings = items[itemId]        # 这也是个字典  每条数据为所有用户对当前物品的评分
                if userID in itemRatings and otheruserId in itemRatings:  # 说明两个用户都对该物品评过分
                    vec_user.append(itemRatings[userID])
                    vec_otheruser.append(itemRatings[otheruserId])
            # 这里可以获得相似性矩阵(共现矩阵)
            similarity_matrix[userID][otheruserId] = np.corrcoef(np.array(vec_user), np.array(vec_otheruser))[0][1]
            #similarity_matrix[userID][otheruserId] = cosine_similarity(np.array(vec_user), np.array(vec_otheruser))[0][1]

The similarity_matrix here is our user similarity matrix; each entry is the Pearson correlation between a pair of users.

With the similarity matrix, we can get the top n users most similar to Alice.

 

  3. Select the top n similar users
"""计算前n个相似的用户"""
n = 2
similarity_users = similarity_matrix[1].sort_values(ascending=False)[:n].index.tolist()    # [2, 3]   也就是用户1和用户2

 

  4. Compute the final score
    This applies the prediction formula given above.
"""计算最终得分"""
base_score = np.mean(np.array([value for value in users[1].values()]))
weighted_scores = 0.
corr_values_sum = 0.
for user in similarity_users:  # [2, 3]
    corr_value = similarity_matrix[1][user]            # 两个用户之间的相似性
    mean_user_score = np.mean(np.array([value for value in users[user].values()]))    # 每个用户的打分平均值
    weighted_scores += corr_value * (users[user]['E']-mean_user_score)      # 加权分数
    corr_values_sum += corr_value
final_scores = base_score + weighted_scores / corr_values_sum
print('用户Alice对物品5的打分: ', final_scores)
user_df.loc[1]['E'] = final_scores
user_df

The printed result is a predicted rating of about 4.87 for item 5, matching the hand calculation above.

At this point we have reproduced the small example in code. With this predicted score we can actually make a recommendation to the user; this is the working process of a miniature UserCF.
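
For completeness, the final recommendation step (ranking the items by Alice's ratings, step 3 of the hand calculation) can be sketched on top of the user_df built above:

# Rank all items by Alice's (partly predicted) ratings and take the top 2
alice_ratings = user_df.loc[1].sort_values(ascending=False)
print(alice_ratings.index[:2].tolist())   # ['A', 'E'], i.e. item 1 and item 5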

Note: For the complete user-based collaborative filtering code, see UserCF.py in the source code files.

 

5. Advantages and disadvantages of UserCF

There are two major problems with User-based algorithms:

  1. Data sparsity.
    A large-scale e-commerce recommendation system generally has a huge number of items, and a user may have bought less than 1% of them. The overlap between the items bought by different users is low, which often prevents the algorithm from finding a user's neighbors, i.e., users with similar preferences. This makes UserCF unsuitable for scenarios where positive feedback is hard to obtain (low-frequency applications such as hotel booking or big-ticket purchases).

  2. Algorithm scalability.
    User-based collaborative filtering needs to maintain a user similarity matrix in order to quickly find the top-n similar users. The storage overhead of this matrix is very large and grows with the square of the number of users, so it is not suitable for platforms with large user bases.

Because of these two technical flaws, many e-commerce platforms did not adopt UserCF, but instead built their original recommendation systems on the ItemCF algorithm.

 

6. Item-based collaborative filtering

The basic idea of item-based collaborative filtering (ItemCF) is to pre-compute the similarity between items from the historical preference data of all users, and then recommend to the user items similar to the ones they liked. For example, if items a and c are very similar because most users who like a also like c, and user A likes a, then c is recommended to user A. ItemCF does not use the content attributes of items to compute similarity; it computes item similarity mainly by analyzing user behavior records. The algorithm considers items a and c highly similar precisely because most users who like item a also like item c.


The item-based collaborative filtering algorithm is mainly divided into two steps:

  • Compute the similarity between items
  • Generate a recommendation list for the user from the item similarities and the user's historical behavior (other products frequently bought by users who bought this product)

The item-based collaborative filtering algorithm is very similar to the user-based one, so we reuse the Alice example above.


To predict Alice's rating of item 5, the item-based collaborative filtering algorithm proceeds as follows:

  1. First compute the similarity between item 5 and items 1, 2, 3, 4 (the items are also represented as vectors: each column of the table is an item's vector representation, because ItemCF holds that items a and c are similar when most users who like a also like c, so an item can be vectorized by every user's rating of it)
  2. Find the n items closest to item 5
  3. Predict the rating of item 5 from Alice's ratings of those n closest items

 

Let us calculate this in detail, starting with step 1.

 


 

Since the hand calculation is tedious, the similarities can be computed directly in Python (the code is given at the end of this section).

 


 

According to the Pearson correlation coefficient, the two items most similar to item 5 are item 1 and item 4 (n = 2). The final score is then computed with the formula above:

P_{Alice,5} = \bar{R}_{5} + \frac{\sum_{k=1}^{2} S_{5,k}(R_{Alice,k} - \bar{R}_k)}{\sum_{k=1}^{2} S_{5,k}} = \frac{13}{4} + \frac{0.97 \times (5 - 3.2) + 0.58 \times (4 - 3.4)}{0.97 + 0.58} \approx 4.6

Item 5 can therefore still be recommended to Alice. A simple implementation, analogous to the one above, follows:

"""计算物品的相似矩阵"""
similarity_matrix = pd.DataFrame(np.ones((len(items), len(items))), index=['A', 'B', 'C', 'D', 'E'], columns=['A', 'B', 'C', 'D', 'E'])

# 遍历每条物品-用户评分数据
for itemId in items:
    for otheritemId in items:
        vec_item = []         # 定义列表, 保存当前两个物品的向量值
        vec_otheritem = []
        #userRagingPairCount = 0     # 两件物品均评过分的用户数
        if itemId != otheritemId:    # 物品不同
            for userId in users:    # 遍历用户-物品评分数据
                userRatings = users[userId]    # 每条数据为该用户对所有物品的评分, 这也是个字典
                
                if itemId in userRatings and otheritemId in userRatings:   # 用户对这两个物品都评过分
                    #userRagingPairCount += 1
                    vec_item.append(userRatings[itemId])
                    vec_otheritem.append(userRatings[otheritemId])
            
            # 这里可以获得相似性矩阵(共现矩阵)
            similarity_matrix[itemId][otheritemId] = np.corrcoef(np.array(vec_item), np.array(vec_otheritem))[0][1]
            #similarity_matrix[itemId][otheritemId] = cosine_similarity(np.array(vec_item), np.array(vec_otheritem))[0][1]

The similarity_matrix here is the item similarity matrix; each entry is the Pearson correlation between a pair of items.

Then the top n items most similar to item 5 are selected and the final score is computed:

"""得到与物品5相似的前n个物品"""
n = 2
similarity_items = similarity_matrix['E'].sort_values(ascending=False)[:n].index.tolist()       # ['A', 'D']

"""计算最终得分"""
base_score = np.mean(np.array([value for value in items['E'].values()]))
weighted_scores = 0.
corr_values_sum = 0.
for item in similarity_items:  # ['A', 'D']
    corr_value = similarity_matrix['E'][item]            # 两个物品之间的相似性
    mean_item_score = np.mean(np.array([value for value in items[item].values()]))    # 每个物品的打分平均值
    weighted_scores += corr_value * (users[1][item]-mean_item_score)      # 加权分数
    corr_values_sum += corr_value
final_scores = base_score + weighted_scores / corr_values_sum
print('用户Alice对物品5的打分: ', final_scores)
user_df.loc[1]['E'] = final_scores
user_df

The printed result is a predicted rating of about 4.6, consistent with the hand calculation.

Note: For the complete item-based collaborative filtering code, see ItemCF.py in the source code files.

 

7. Algorithm evaluation

Since UserCF and ItemCF are evaluated with the same standard metrics, those metrics are summarized together here (a code sketch implementing them follows the list):

  1. Recall

The N items recommended to user u are denoted R(u), and the set of items user u likes in the test set is T(u). Recall is then defined as:

Recall = \frac{\sum_u |R(u) \cap T(u)|}{\sum_u |T(u)|}

This measures how many of the items the user actually bought or watched the model managed to predict; it examines the comprehensiveness of the recommendations.

  2. Precision
    Precision is defined as:

Precision = \frac{\sum_u |R(u) \cap T(u)|}{\sum_u |R(u)|}

This measures how many of the items I recommend the user actually watched, i.e., the precision of the model's recommendations.
To improve precision, the model must be very confident before recommending, which tends to reduce the number of recommendations; that in turn hurts comprehensiveness, so the two metrics must be balanced against each other.

  3. Coverage
    Coverage reflects the recommendation algorithm's ability to discover the long tail: the higher the coverage, the better the algorithm surfaces long-tail items to users.

    Coverage = \frac{\left| \bigcup_{u \in U} R(u) \right|}{|I|}

    Coverage indicates what fraction of the item set appears in the final recommendation lists. If every item is recommended to at least one user, coverage is 100%.

  4. Novelty
    The average popularity of the items in the recommendation list measures the novelty of the results: the more popular the recommended items, the lower the novelty. Since item popularity follows a long-tailed distribution, the logarithm of each item's popularity is taken when computing the average, which makes the average popularity more stable.
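
To make these definitions concrete, here is a minimal Python sketch of the four metrics. The data layout is an assumption for illustration: rec maps each user to the recommended list R(u), test maps each user to the ground-truth set T(u), and item_popularity maps each item to its popularity count.

import math

def recall(rec, test):
    hit = sum(len(set(rec[u]) & set(test[u])) for u in test)
    return hit / sum(len(test[u]) for u in test)

def precision(rec, test):
    hit = sum(len(set(rec[u]) & set(test[u])) for u in test)
    return hit / sum(len(rec[u]) for u in test)

def coverage(rec, all_items):
    recommended = {i for u in rec for i in rec[u]}
    return len(recommended) / len(all_items)

def novelty(rec, item_popularity):
    # average log-popularity of recommended items; lower means more novel
    logs = [math.log(1 + item_popularity[i]) for u in rec for i in rec[u]]
    return sum(logs) / len(logs)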

 

8. Weight improvement of collaborative filtering algorithm

 

[Figure: the four item-similarity weighting formulas discussed below]

 

  • Basic algorithm
    Figure 1 is the simplest formula for computing item relevance: the numerator is the number of users who like item i and item j at the same time, the standard form being w_{ij} = \frac{|N(i) \cap N(j)|}{|N(i)|}.
  • Penalizing popular items
    Figure 1 has a problem: if item j is a very popular product, then many users who like item i also like item j, and w_{ij} becomes very large. Almost every item would then have a very high relevance to item j, which is clearly unreasonable. The denominator in Figure 2 therefore introduces N(j) to penalize item j's popularity, the standard form being w_{ij} = \frac{|N(i) \cap N(j)|}{\sqrt{|N(i)| \cdot |N(j)|}}: the more popular the item, the larger N(j) and the smaller the corresponding weight.
  • Further penalizing popular items
    If item j is extremely popular, the above is still not enough. For example, "Harry Potter" is so popular that whoever buys any book buys it too; even with the Figure 2 penalty, "Harry Potter" still obtains a high similarity to everything. This is the famous Harry Potter problem in the recommendation-system field.
    To penalize popular items further, the formula can be modified as in Figure 3 into the form w_{ij} = \frac{|N(i) \cap N(j)|}{|N(i)|^{1-\alpha} \cdot |N(j)|^{\alpha}}: the larger the parameter α, the greater the penalty, the lower the similarity of popular items, and the lower the average popularity of the overall results.
  • Penalizing active users
    Similarly, item-based CF must also consider the impact of active users (an extremely active user, e.g. a professional reseller, may buy an enormous number of items), and an active user should contribute less to item similarity than an inactive one. Figure 4 down-weights each user's contribution in the co-occurrence count, typically by 1/\log(1 + |N(u)|). A code sketch combining these corrections follows.
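
As an illustration, here is a minimal sketch that combines the Figure 3 popularity penalty (parameter alpha) with the Figure 4 active-user damping. The data layout (user_items, mapping each user to the set of items they interacted with) is an assumption:

import math
from collections import defaultdict

def item_similarity(user_items, alpha=0.5):
    """w_ij with a popularity penalty (alpha) and active-user damping."""
    pop = defaultdict(int)                          # |N(i)|: users who touched item i
    co = defaultdict(lambda: defaultdict(float))    # damped co-occurrence counts
    for u, items_u in user_items.items():
        damp = 1.0 / math.log(1 + len(items_u))     # active users contribute less
        for i in items_u:
            pop[i] += 1
            for j in items_u:
                if i != j:
                    co[i][j] += damp
    # penalize popular items: larger alpha -> stronger penalty on |N(j)|
    return {(i, j): cij / (pop[i] ** (1 - alpha) * pop[j] ** alpha)
            for i in co for j, cij in co[i].items()}

With alpha = 0.5 and the damping removed, this reduces to the Figure 2 formula.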

 

9. Problem analysis of collaborative filtering algorithm

One problem with collaborative filtering is its weak generalization ability: it cannot transfer the similarity information of two items to other items. As a result, popular items show a strong head effect and tend to appear similar to a large number of items, while tail items are rarely recommended because their feature vectors are sparse. Consider the following example:


A, B, C, D are items. Looking at the item co-occurrence matrix on the right, item D has relatively high similarity with A, B, and C, so D is very likely to be recommended to users who have used A, B, or C. However, D is similar to the other items only because D is popular, and the system fails to find the similarity among A, B, and C because their features are too sparse and there is no direct data for the similarity computation. This is the natural defect of collaborative filtering: an obvious head effect and a weak ability to handle sparse vectors.
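
A small made-up illustration of this head effect: in the hypothetical interaction data below, item D is touched by every user, so its co-occurrence counts dominate while A, B, and C never co-occur with each other:

from collections import defaultdict

# Hypothetical interaction histories: D is a hit item that every user touches.
user_items = {
    'u1': {'A', 'D'},
    'u2': {'B', 'D'},
    'u3': {'C', 'D'},
}

co = defaultdict(lambda: defaultdict(int))   # item co-occurrence counts
for u, items_u in user_items.items():
    for i in items_u:
        for j in items_u:
            if i != j:
                co[i][j] += 1

# co['A']['D'] == co['B']['D'] == co['C']['D'] == 1, but co['A']['B'] == 0:
# the popular item D looks related to everything, while the sparse items
# A, B, C share no direct evidence of similarity with each other.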

To address this problem and increase the model's generalization ability, Matrix Factorization (MF) was proposed in 2006. On top of the collaborative filtering co-occurrence matrix, it represents users and items with denser latent vectors, mining their hidden interests and hidden features, which to some extent compensates for collaborative filtering's poor handling of sparse matrices. The details will be covered later; this is just paving the way.

 

10. After-class questions

1. When to use UserCF and when to use ItemCF? Why?

Answer:


  1. Because UserCF recommends based on user similarity, it has strong social characteristics. This suits scenarios with few users, many items, and high timeliness, such as news recommendation: interest in news is scattered, and the timeliness and hotness of a news item usually matter more than fine-grained user preference, so UserCF is good at discovering hot topics and tracking how they spread. It can also surface new information and is more likely to produce pleasant surprises, because it looks at similarity between people and can discover interests the user has not yet realized.

For scenarios with few users where strong timeliness is required, UserCF is worth considering.

  2. ItemCF is more suitable for applications where interests are stable and recommendations need to be more personalized. It fits scenarios with few items, many users, fixed and lasting user interests, and a slowly updated item pool, such as recommending artworks, music, or movies.
    For a detailed comparison of the advantages and disadvantages of UserCF and ItemCF, see Xiang Liang's Recommender System Practice.

2. What are the computational disadvantages of collaborative filtering? Is there a better idea to solve (or alleviate) them?

Answer:

Its sparse-vector handling is poor and its generalization ability is weak: as analyzed in section 9, collaborative filtering cannot transfer the similarity of two items to other items, so popular items show a strong head effect while sparse tail items are rarely recommended. Matrix factorization, which represents users and items with denser latent vectors on top of the co-occurrence matrix, alleviates this problem.

3. What are the advantages and disadvantages of the similarity calculation method introduced above?

Cosine similarity is commonly used and generally works reasonably well, but when rating data is not normalized, i.e., when some users tend to rate high, some tend to rate low, and some rate arbitrarily, the cosine similarity may be inaccurate. For example, consider the following situation:

[Figure: example rating table for users d, e, f]

If you compute cosine similarity here, users d and f come out as relatively similar; but looking at the trend of product preferences, d and e are actually more alike: e simply prefers to rate low while d prefers to rate high. For this kind of user rating bias, cosine similarity performs poorly, and the Pearson correlation coefficient should be considered instead.
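
A small numeric illustration (the ratings are made up to mimic the figure: d rates high, e rates low but follows the same trend as d, and f follows a different trend):

from scipy.stats import pearsonr
from sklearn.metrics.pairwise import cosine_similarity

d = [5, 5, 4, 4]   # high scorer
e = [2, 2, 1, 1]   # low scorer, same preference trend as d
f = [4, 5, 5, 4]   # different preference trend

print(cosine_similarity([d, f])[0][1], cosine_similarity([d, e])[0][1])  # ~0.99 vs ~0.98
print(pearsonr(d, e)[0], pearsonr(d, f)[0])                              # 1.0 vs 0.0

Cosine similarity barely separates the two candidates, while the Pearson correlation cleanly identifies e as sharing d's preference trend and f as unrelated.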

4. What other deficiencies exist in collaborative filtering? Is there a better idea to solve (or alleviate) them?

Answer:

Collaborative filtering does not use item attributes or user attributes at all; it achieves recommendation using only the interaction information between users and items. It is simple and efficient, but this is also its shortcoming: it cannot effectively incorporate user features (age, gender), item features (description, category), or context features (current time, location). Useful information is thus left out, and other feature data cannot be fully exploited.

To solve this problem, more features were introduced into recommendation models: recommendation systems gradually shifted from collaborative filtering at the core to the logistic regression model at the core, with machine learning models that can integrate different types of features.



Source: blog.csdn.net/yichao_ding/article/details/109207992