Neighborhood-based collaborative filtering algorithms: UserCF and ItemCF

Recommender Systems


 

1. User-based collaborative filtering algorithm (UserCF)


1.1 The basic idea

The algorithm computes the similarity between every two users, where similarity here refers to the similarity of the two users' interests.

Suppose u and v are two users, and N(u) and N(v) are the sets of items on which they have given positive feedback. The similarity of u and v can then be computed with the Jaccard formula:

    w(u, v) = |N(u) ∩ N(v)| / |N(u) ∪ N(v)|

or with the cosine similarity:

    w(u, v) = |N(u) ∩ N(v)| / √(|N(u)| · |N(v)|)

 

For example

Suppose user A has acted on the items {a, b, d} and user B on the items {a, c, f}. The cosine similarity of A and B is then:

    w(A, B) = |{a, b, d} ∩ {a, c, f}| / √(|{a, b, d}| · |{a, c, f}|) = 1 / √(3 · 3) = 1/3
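The two measures above can be sketched in a few lines of Python (a minimal illustration; the function names are mine, not from the text):

```python
from math import sqrt

def jaccard_sim(n_u, n_v):
    """Jaccard similarity: |N(u) ∩ N(v)| / |N(u) ∪ N(v)|."""
    return len(n_u & n_v) / len(n_u | n_v)

def cosine_sim(n_u, n_v):
    """Cosine similarity: |N(u) ∩ N(v)| / sqrt(|N(u)| * |N(v)|)."""
    return len(n_u & n_v) / sqrt(len(n_u) * len(n_v))

# The worked example from the text: A acted on {a, b, d}, B on {a, c, f}.
A = {"a", "b", "d"}
B = {"a", "c", "f"}
print(cosine_sim(A, B))   # 1/3
print(jaccard_sim(A, B))  # 1/5
```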



In fact, the user-similarity formulas above are rather crude; the following sections introduce improvements to the user-similarity calculation.

1.2 Improving computational efficiency

This method requires computing the similarity between every two users, which has complexity quadratic in the number of users. When there are many users this is very time-consuming; in particular, many user pairs share no items at all (their similarity is 0), so computing similarities for those pairs is completely unnecessary. It is therefore better to first determine which pairs have non-zero overlap, and compute similarity only for those users.

 

To do this, first build an item-to-user inverted list from the user–item matrix, with one row per item. The first element of each row is the item; every user u who has acted on that item is appended to the row. Within each row's user list, any two users have a non-zero similarity.

 

Then build a sparse matrix C. If users u and v appear together in k rows of the inverted list, then u and v have both acted on those k items, i.e. C[u][v] = k. Initially, all elements of C are zero.

 

Traverse the inverted list row by row; for any two users u and v in a row's user list, increment C[u][v] and C[v][u] by one. After the traversal is complete, the value of C[u][v] equals:

    C[u][v] = |N(u) ∩ N(v)|

As can be seen, C is a symmetric matrix.
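The procedure in this section can be sketched as follows (a minimal sketch assuming the input is a dict mapping each user to the set of items they acted on; names are illustrative):

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

def user_similarity(user_items):
    """Compute cosine user similarities via an item-to-users inverted list."""
    # Step 1: build the inverted list: item -> set of users who acted on it.
    item_users = defaultdict(set)
    for u, items in user_items.items():
        for i in items:
            item_users[i].add(u)

    # Step 2: co-occurrence counts C[u][v] = |N(u) ∩ N(v)|, touching only
    # the user pairs that actually share at least one item.
    C = defaultdict(lambda: defaultdict(int))
    for users in item_users.values():
        for u, v in combinations(users, 2):
            C[u][v] += 1
            C[v][u] += 1

    # Step 3: cosine normalization.
    return {u: {v: cuv / sqrt(len(user_items[u]) * len(user_items[v]))
                for v, cuv in row.items()}
            for u, row in C.items()}

data = {"A": {"a", "b", "d"}, "B": {"a", "c", "f"}, "C": {"b", "e"}}
W = user_similarity(data)
print(W["A"]["B"])  # 1/3, matching the worked example above
```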

After the pairwise similarities between all users have been computed, the UserCF algorithm recommends to user u the favourite items of the K users whose interests are closest to u's. The following formula measures user u's interest in item i:

    p(u, i) = Σ_{v ∈ S(u, K) ∩ N(i)} w(u, v) · r(v, i)

where S(u, K) is the set of the K users whose interests are closest to user u's, N(i) is the set of users who have acted on item i, w(u, v) is the similarity of users u and v, and r(v, i) is the degree to which user v likes item i (since single-action implicit feedback data is used here, all r(v, i) = 1).
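A sketch of this scoring rule (assuming a precomputed user-similarity dict W like the one from section 1.2; with implicit feedback every r(v, i) = 1; the names are mine):

```python
def usercf_recommend(u, user_items, W, K):
    """Score unseen items for user u via the K most similar users."""
    rank = {}
    # S(u, K): the K users most similar to u.
    neighbours = sorted(W.get(u, {}).items(), key=lambda x: -x[1])[:K]
    for v, w_uv in neighbours:
        for i in user_items[v]:
            if i in user_items[u]:
                continue  # do not re-recommend items u already acted on
            rank[i] = rank.get(i, 0.0) + w_uv * 1.0  # r(v, i) = 1
    return sorted(rank.items(), key=lambda x: -x[1])

user_items = {"A": {"a", "b", "d"}, "B": {"a", "c"}, "C": {"b", "e"}}
W = {"A": {"B": 0.5, "C": 0.3}}
print(usercf_recommend("A", user_items, W, K=2))  # [('c', 0.5), ('e', 0.3)]
```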

1.3 Algorithm parameters

The parameter K is an important parameter of the UserCF algorithm and affects a series of its recommendation metrics:

  • Precision and recall: neither precision nor recall is a monotonic function of K; choosing an appropriate K is important for achieving high recommendation accuracy.

  • Popularity: the larger K is, the more popular the items UserCF recommends.

  • Coverage: the larger K is, the greater the popularity, and correspondingly the lower the coverage.

1.4 Improvement: the UserCF-IIF algorithm

The author is skeptical of this part. The idea, analogous to IDF in text retrieval, is that two users both acting on a very popular item says little about shared interests, so popular items are down-weighted when computing user similarity; a commonly cited form is w(u, v) = (Σ_{i ∈ N(u) ∩ N(v)} 1 / log(1 + |N(i)|)) / √(|N(u)| · |N(v)|).

Reference: Empirical Analysis of Predictive Algorithms for Collaborative Filtering (John S. Breese et al.)

1.5 Drawbacks of the UserCF algorithm

In fact, UserCF is used rather little in industry; the more widely used collaborative filtering algorithm is the item-based one (ItemCF).

The main drawbacks of UserCF are:

  • As the number of users increases, the time complexity of computing the pairwise interest similarities between all users grows roughly with the square of the number of users.
  • It is difficult for the UserCF algorithm to give a convincing explanation of its recommendation results.



 

2. Item-based collaborative filtering algorithm (ItemCF)


2.1 The basic idea

The algorithm recommends to users items that are similar to the items they liked before. For example, if you bought "Introduction to Data Mining", it might recommend "Machine Learning" to you.

The ItemCF algorithm analyzes users' historical behavior records to compute the similarity between items: if most users who like item A also like item B, then items A and B are considered to have a certain similarity. This also makes it easy to give a reasonable explanation for the recommendation results.

Suppose |N(A)| and |N(B)| are the numbers of users who like item A and item B respectively, and |N(A) ∩ N(B)| is the number of users who like both. Then the similarity of item A to item B (the fraction of users who like A who also like B) is:

    w(A, B) = |N(A) ∩ N(B)| / |N(A)|

This formula has a problem: if B is a very popular item, w(A, B) will be close to 1 (because most people who like A also like B), which makes every item look very similar to any popular item. We therefore modify the formula with a weighting factor that penalizes the popularity of item B, obtaining:

    w(A, B) = |N(A) ∩ N(B)| / √(|N(A)| · |N(B)|)

2.2 Computation

Suppose the complete set of items is known. First build an all-zero matrix C, in which C[i][j] will count the number of users interested in both item i and item j.

Then traverse the user–item interest lists: whenever two items i and j both appear in some user's interest list, increment C[i][j] and C[j][i]. After traversing all users, C is the final co-occurrence matrix, and the similarity is obtained as w(i, j) = C[i][j] / √(|N(i)| · |N(j)|).
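The computation described above, followed by the √-penalized normalization from section 2.1, might look like this (a sketch; the input layout is the same user → item-set dict assumed earlier):

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

def item_similarity(user_items):
    """w(i, j) = |N(i) ∩ N(j)| / sqrt(|N(i)| * |N(j)|)."""
    C = defaultdict(lambda: defaultdict(int))  # co-occurrence counts
    N = defaultdict(int)                       # item popularity |N(i)|
    for items in user_items.values():
        for i in items:
            N[i] += 1
        for i, j in combinations(items, 2):    # each pair in one interest list
            C[i][j] += 1
            C[j][i] += 1
    return {i: {j: cij / sqrt(N[i] * N[j]) for j, cij in row.items()}
            for i, row in C.items()}

data = {"u1": {"a", "b"}, "u2": {"a", "b", "c"}, "u3": {"b", "c"}}
W = item_similarity(data)
print(W["a"]["b"])  # 2 / sqrt(2 * 3)
```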

After obtaining the pairwise similarities between items, ItemCF computes user u's interest in item i as follows. Suppose N(u) is the set of items user u likes, S(i, K) is the set of the K items most similar to item i, w(j, i) is the similarity of item j to item i, and r(u, j) is user u's interest in item j (for an implicit feedback data set, r(u, j) = 1 if user u has acted on item j). Then user u's interest in item i is:

    p(u, i) = Σ_{j ∈ N(u) ∩ S(i, K)} w(j, i) · r(u, j)
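A sketch of this rule in the form it is usually implemented — iterating over each item the user likes and that item's K nearest neighbours (names are mine; r(u, j) = 1 for implicit feedback):

```python
def itemcf_recommend(u, user_items, W, K):
    """Score unseen items for user u via the K items most similar
    to each item u already likes."""
    rank = {}
    for j in user_items[u]:  # N(u): items the user already acted on
        # The K items most similar to j.
        for i, w_ji in sorted(W.get(j, {}).items(), key=lambda x: -x[1])[:K]:
            if i in user_items[u]:
                continue  # skip items the user already knows
            rank[i] = rank.get(i, 0.0) + w_ji * 1.0  # r(u, j) = 1
    return sorted(rank.items(), key=lambda x: -x[1])

user_items = {"u": {"a"}}
W = {"a": {"b": 0.8, "c": 0.4, "d": 0.1}}
print(itemcf_recommend("u", user_items, W, K=2))  # [('b', 0.8), ('c', 0.4)]
```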

2.3 Algorithm parameters

The parameter K is an important parameter of the ItemCF algorithm and affects a series of its recommendation metrics:

  • Precision and recall: neither precision nor recall is simply positively or negatively correlated with K; choosing an appropriate K is important for achieving high recommendation accuracy.

  • Popularity: as K increases, the popularity of the recommendation results gradually rises, but once K grows beyond a certain point, popularity no longer changes significantly.

  • Coverage: the larger K is, the lower the coverage may correspondingly become.

2.4 The effect of user activity on item similarity (ItemCF-IUF)

In ItemCF, two items are similar because they appear together in the interest lists of many users, so every user contributes to the similarity of each pair of items in his interest list. However, the contributions of different users should not be treated as equal.

For example, suppose a librarian bought 90% of the books on Jingdong (JD.com), but most of them are not his own interest, while a young artist bought five novels, all of which are his interest. Then, for a pair of books they both bought, the librarian's contribution to their pairwise similarity should be much smaller than the young artist's.

Therefore, compared with an inactive user, an active user should contribute less to the similarity between items. John S. Breese proposed the concept of IUF (Inverse User Frequency) in his paper. Suppose N(u) is the list of items user u likes; then user u's IUF weight is:

    1 / log(1 + |N(u)|)

Adding the IUF weight to the item-similarity formula gives:

    w(i, j) = (Σ_{u ∈ N(i) ∩ N(j)} 1 / log(1 + |N(u)|)) / √(|N(i)| · |N(j)|)

This algorithm is denoted ItemCF-IUF.
In practice, for overly active users such as the librarian above, their interest lists are usually simply ignored, i.e. excluded from the similarity computation altogether.
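ItemCF-IUF can be sketched by weighting each user's contribution in the co-occurrence step (a sketch over the same user → item-set dict layout; the natural log is assumed for the IUF weight):

```python
from collections import defaultdict
from itertools import combinations
from math import log, sqrt

def item_similarity_iuf(user_items):
    """Like plain ItemCF, but each user's contribution to C[i][j]
    is down-weighted by 1 / log(1 + |N(u)|)."""
    C = defaultdict(lambda: defaultdict(float))
    N = defaultdict(int)
    for items in user_items.values():
        iuf = 1.0 / log(1 + len(items))  # active users contribute less
        for i in items:
            N[i] += 1
        for i, j in combinations(items, 2):
            C[i][j] += iuf
            C[j][i] += iuf
    return {i: {j: cij / sqrt(N[i] * N[j]) for j, cij in row.items()}
            for i, row in C.items()}

# A single user with a 5-item interest list: each pair gets weight 1/log(6).
data = {"artist": {"n1", "n2", "n3", "n4", "n5"}}
W = item_similarity_iuf(data)
print(W["n1"]["n2"])  # 1 / log(6)
```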

2.5 Item similarity normalization

Karypis mentioned in his paper that, in ItemCF, normalizing each row of the similarity matrix by its maximum value (so that the maximum becomes 1) can improve recommendation accuracy:

    w'(i, j) = w(i, j) / max_k w(i, k)
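The row-wise normalization might look like this (a sketch over the similarity-dict layout used above):

```python
def normalize_rows(W):
    """Divide each row of the item-similarity dict by its maximum,
    so the largest similarity in every row becomes 1."""
    out = {}
    for i, row in W.items():
        m = max(row.values())
        out[i] = {j: w / m for j, w in row.items()}
    return out

W = {"a": {"b": 0.5, "c": 0.25}}
print(normalize_rows(W))  # {'a': {'b': 1.0, 'c': 0.5}}
```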

Besides improving the accuracy of the recommendation results, normalization can also improve their coverage and diversity.

For example
Suppose there are two classes of items, A and B. The similarity between items within class A is 0.5, within class B it is 0.6, and between class-A and class-B items it is 0.2. In this case, if a user likes 5 class-A items and 5 class-B items, ItemCF will recommend only class-B items to him, because the similarity within class B is larger. After normalizing the similarities, the within-class similarity of both class A and class B becomes 1; in that case, if a user likes 5 class-A items and 5 class-B items, the numbers of class-A and class-B items recommended to him will be roughly equal.

Which classes have a higher similarity between their items, and which a lower one?
In general, the more popular a class is, the larger the similarity between its items. Without normalization, the algorithm therefore tends to recommend items from popular classes, resulting in low recommendation coverage.

2.6 Drawbacks of the ItemCF algorithm

As mentioned above, the more popular a class is, the larger the similarity between items within it. In addition, the similarities between the most popular items of different domains are often very high.

For example
Suppose elderly viewers like to watch CCTV's Network News (Xinwen Lianbo) and no other news programs, and after watching it they immediately switch to CCTV-8 to watch domestic dramas, hardly watching other TV shows. Then, on such data, the ItemCF algorithm will easily conclude that Network News is highly similar to prime-time dramas, while its similarity to other news programs (such as Nanjing Zero Distance) is very low. This is clearly unreasonable.

Such problems cannot be solved with user behavior data alone; content data about the items must be introduced, which is beyond the scope of collaborative filtering.



 

3. UserCF vs. ItemCF


Applications of UserCF and ItemCF in practice:

  Company    Algorithm  Use
  Digg       UserCF     personalized web article recommendation
  GroupLens  UserCF     personalized news recommendation
  Netflix    ItemCF     movie recommendation
  Amazon     ItemCF     shopping recommendation



Why do news sites use the UserCF algorithm while shopping sites use ItemCF?
The recommendation results of UserCF reflect the hot spots of the small group of users whose interests are similar to the target user's, while the recommendation results of ItemCF maintain the target user's own historical interests. In other words, UserCF recommendations are more social, and ItemCF recommendations are more personalized.



Comparison of the UserCF and ItemCF algorithms:

  • Performance: UserCF suits scenarios with a relatively small number of users; when users are numerous, computing the user similarity matrix is very expensive. ItemCF suits scenarios where the number of items is significantly smaller than the number of users; when items are numerous, computing the item similarity matrix is very expensive.

  • Domain: UserCF fits domains with strong timeliness where users' personalized interests are less pronounced (e.g. news). ItemCF fits domains rich in long-tail items where users have strong personalization needs.

  • Real-time behavior: in UserCF, a user's new action does not necessarily change his recommendations immediately; in ItemCF, a user's new action changes his recommendations in real time.

  • Cold start: in UserCF, a new user who has acted on only a few items cannot be given personalized recommendations immediately, because the user similarity table is usually computed offline periodically; a new item, however, can be recommended to users similar to those who acted on it as soon as one user does. In ItemCF, as soon as a new user acts on one item, other items similar to it can be recommended to him; but a new item can only be recommended to users after the item similarity table is next updated (offline).

  • Recommendation reason: UserCF finds it hard to provide explanations that convince users; ItemCF uses the user's own historical behavior as the recommendation reason, which convinces users easily.


Source: blog.csdn.net/dl2277130327/article/details/86648174