The item-based collaborative filtering recommendation algorithm

The item-based collaborative filtering algorithm is one of the most widely used recommendation algorithms in industry. Whether at Amazon, Netflix, Hulu, or YouTube, it forms the basis of the recommendation engine. For brevity, it is referred to below by its English abbreviation, ItemCF. This article walks through the basic algorithm and its step-by-step improvements, with code implementations on the MovieLens data set, to give you a taste of the beauty of a classical algorithm.

1. Basic principles

Earlier we briefly explained user-based collaborative filtering (UserCF) and gave an implementation; readers unfamiliar with it can refer to the earlier post on the user-based collaborative filtering recommendation algorithm. We also discussed some shortcomings of that algorithm. First, as a site's user base grows, computing the user-user interest similarity matrix becomes increasingly difficult: its time and space complexity grow roughly with the square of the number of users. Second, the results of user-based collaborative filtering are hard to explain. To address these problems, the well-known e-commerce company Amazon proposed another algorithm: ItemCF.

ItemCF recommends items similar to the ones the user liked before. For example, the algorithm may recommend "Machine Learning" to you because you purchased "Statistical Learning". However, ItemCF does not use the content attributes of items to compute item-item similarity; instead, it computes similarity mainly by analyzing users' behavior records. The algorithm assumes that item A and item B are highly similar because most of the users who like item A also like item B.

Because it relies on users' historical behavior, item-based collaborative filtering can explain its recommendations: for example, "Rosemary" might be recommended to a user with the explanation that the user previously liked "Her Eyelashes".

The item-based collaborative filtering algorithm consists of two steps:

  1. Compute the similarity between items.
  2. Generate a recommendation list for the user based on the item similarities and the user's historical behavior.

Starting from the idea "users who bought this item also bought that item", we can define item similarity with the following formula:
\[ w_{ij} = \frac{|N(i) \bigcap N(j)|}{|N(i)|} \]
Here the denominator \(|N(i)|\) is the number of users who like item i, and the numerator \(|N(i) \bigcap N(j)|\) is the number of users who like both item i and item j. The formula can therefore be read as: of the users who like item i, what fraction also like item j.

Although this formula looks reasonable, it has a problem. If item j is very popular and many people like it, then \(w_{ij}\) will be very large, approaching 1. The formula thus makes every item highly similar to the popular items, which is clearly an undesirable property for a recommendation system that tries to exploit long-tail information. To avoid recommending only popular items, we can use the following formula instead:
\[ w_{ij} = \frac{|N(i) \bigcap N(j)|}{\sqrt{|N(i)||N(j)|}} \]
This formula penalizes the weight of item j, reducing the chance that popular items end up similar to many other items.
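The two similarity formulas above can be sketched in a few lines of Python. The toy `user_items` dictionary below is a made-up stand-in for a real data set such as MovieLens:

```python
import math
from collections import defaultdict
from itertools import combinations

# Hypothetical user -> liked-items data; a real run would load MovieLens.
user_items = {
    "u1": {"A", "B", "C"},
    "u2": {"A", "B"},
    "u3": {"B", "C"},
    "u4": {"A", "C", "D"},
}

# Count co-occurrences |N(i) ∩ N(j)| and per-item popularity |N(i)|.
co_count = defaultdict(lambda: defaultdict(int))
item_users = defaultdict(int)
for items in user_items.values():
    for i in items:
        item_users[i] += 1
    for i, j in combinations(items, 2):
        co_count[i][j] += 1
        co_count[j][i] += 1

# Penalized similarity: w_ij = |N(i) ∩ N(j)| / sqrt(|N(i)| * |N(j)|).
sim = {
    i: {j: c / math.sqrt(item_users[i] * item_users[j]) for j, c in row.items()}
    for i, row in co_count.items()
}
```

The un-penalized variant from the first formula would simply divide each co-occurrence count by `item_users[i]` instead of the square-root term.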

From the definition above we can see that, in collaborative filtering, two items are similar because many users like both of them; in other words, every user contributes to item similarity through their historical interest list. This implies an assumption: each user's interests are limited to a few domains. If two items appear together in one user's interest list, they probably belong to a small set of domains; and if two items appear together in many users' interest lists, they probably belong to the same domain and therefore have high similarity.

As in the UserCF algorithm, we first build a user-item inverted list and then compute the item similarities. ItemCF then computes user u's interest in item j with the following formula:
\[ p_{uj} = \sum_{i \in N(u) \bigcap S(j,K)} w_{ji} r_{ui} \]
where \(N(u)\) is the set of items the user likes, \(S(j,K)\) is the set of the K items most similar to item j, \(w_{ji}\) is the similarity between items j and i, and \(r_{ui}\) is user u's interest in item i. An implementation of the algorithm on the MovieLens data set (an explicit-feedback data set) is available here: ItemCF algorithm
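The scoring formula is usually implemented by looping over the user's liked items and their K nearest neighbours, which is equivalent in effect. A minimal sketch, with implicit feedback (so \(r_{ui} = 1\)) and a made-up similarity matrix:

```python
import heapq
from collections import defaultdict

def recommend(user, user_items, sim, K=10, N=5):
    """Score unseen items: p_uj accumulates w_ij * r_ui over the K most
    similar neighbours of each item i the user already likes."""
    seen = user_items[user]
    scores = defaultdict(float)
    for i in seen:
        # the K items most similar to i
        neighbours = heapq.nlargest(K, sim.get(i, {}).items(), key=lambda x: x[1])
        for j, w_ij in neighbours:
            if j not in seen:            # never recommend what the user already has
                scores[j] += w_ij * 1.0  # r_ui = 1 for implicit feedback
    return heapq.nlargest(N, scores.items(), key=lambda x: x[1])

# Hypothetical similarity matrix and one user's history.
sim = {"A": {"B": 0.6, "D": 0.3}, "B": {"A": 0.6, "C": 0.5}}
user_items = {"u1": {"A", "B"}}
print(recommend("u1", user_items, sim, K=2, N=3))  # -> [('C', 0.5), ('D', 0.3)]
```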

2. Improved algorithms

2.1 Introducing the IUF parameter to softly penalize active users

As shown in the previous section, two items have high similarity in ItemCF because they appear together in many users' interest lists; in other words, every user's interest list contributes to item similarity. In real life, however, not every user's contribution should count equally.

To address this, John S. Breese proposed in a paper the notion of \(IUF\) (Inverse User Frequency), the inverse of a user's activity. He argued that active users should contribute less to item similarity than inactive users, and proposed adding an IUF parameter to correct the item similarity formula:
\[ w_{ij} = \frac{\sum_{u \in N(i) \bigcap N(j)} \frac{1}{\log(1 + |N(u)|)}}{\sqrt{|N(i)||N(j)|}} \]
Of course, this formula only softly penalizes active users. For extremely active users, such as the hypothetical customer who bought 80% of the books on Dangdang, we generally just discard their interest list in practice and exclude it from the similarity computation altogether, to keep the similarity matrix from becoming too dense.
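A minimal sketch of the IUF-corrected similarity. The logarithm base is not specified in the formula, so natural log is assumed here, and the `max_items` cutoff for discarding overly active users is a made-up parameter:

```python
import math
from collections import defaultdict
from itertools import combinations

def iuf_similarity(user_items, max_items=None):
    """w_ij = sum over u in N(i) ∩ N(j) of 1/log(1+|N(u)|), divided by
    sqrt(|N(i)| * |N(j)|). Users with more than max_items interactions
    are skipped entirely, as the article suggests for extreme cases."""
    co = defaultdict(lambda: defaultdict(float))
    n_users = defaultdict(int)
    for items in user_items.values():
        if max_items is not None and len(items) > max_items:
            continue                             # drop overly active users outright
        weight = 1.0 / math.log(1 + len(items))  # IUF: active users count less
        for i in items:
            n_users[i] += 1
        for i, j in combinations(items, 2):
            co[i][j] += weight
            co[j][i] += weight
    return {
        i: {j: c / math.sqrt(n_users[i] * n_users[j]) for j, c in row.items()}
        for i, row in co.items()
    }
```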

2.2 Normalizing the similarity matrix

Karypis found in his research that normalizing the ItemCF similarity matrix by its maximum improves recommendation accuracy. His research shows that, given an item similarity matrix w, the normalized similarity matrix w' can be obtained with the following formula:
\[ w'_{ij} = \frac{w_{ij}}{\max_{j} w_{ij}} \]
The benefit of normalization is not only increased recommendation accuracy; it also improves the coverage and diversity of the recommendations. In general, items belong to many different categories, and items within each category are tightly connected.

As an example, suppose a site has two kinds of movies: documentaries and animations. Then the similarity computed by ItemCF between two animations, or between two documentaries, is typically greater than the similarity between a documentary and an animation. But the within-class similarities of the two classes are not necessarily equal. Suppose the items fall into two classes, A and B, with a similarity of 0.5 between items within class A, 0.6 between items within class B, and 0.2 between a class-A item and a class-B item. In this case, if a user likes 5 class-A items and 5 class-B items, ItemCF will mostly recommend class-B items, because the similarity within class B is larger. But after normalization, the within-class similarity of class A becomes 1 and that of class B also becomes 1; then, for a user who likes 5 class-A items and 5 class-B items, the numbers of class-A and class-B items in the recommendation list should be roughly equal. As this example shows, normalizing the similarities improves the diversity of the recommendations.
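The normalization formula and the two-class example above can be checked numerically. The item names `a1`, `a2`, `b1`, `b2` are hypothetical:

```python
def normalize(sim):
    """w'_ij = w_ij / max_j w_ij (row-wise max normalization)."""
    return {
        i: {j: w / max(row.values()) for j, w in row.items()}
        for i, row in sim.items() if row
    }

# The article's example: within-class-A similarity 0.5, within-class-B
# similarity 0.6, cross-class similarity 0.2.
sim = {
    "a1": {"a2": 0.5, "b1": 0.2, "b2": 0.2},
    "b1": {"b2": 0.6, "a1": 0.2, "a2": 0.2},
}
norm = normalize(sim)
# After normalization both within-class similarities become 1.0, so
# neither class dominates the recommendation list.
```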

So, for two classes with different within-class similarities, which kinds of items tend to have high within-class similarity, and which low? In general, items within popular classes have larger within-class similarity. Without normalization, ItemCF tends to recommend items from the popular classes, and those items are themselves the more popular ones, so the coverage of the recommendations is relatively low. Conversely, normalizing the similarities can increase the coverage of the recommendation system.

3. Summary

ItemCF recommendations focus on maintaining the user's historical interests; that is, the recommendations are more personalized and reflect the continuity of the user's own interests. On book, movie, and e-commerce sites such as Amazon, Douban, and Netflix, ItemCF has a great advantage. First, on these sites a user's interests are relatively fixed and long-lasting, and the mission of personalized recommendation on these sites is to help users find items related to their fields of interest.

The algorithm suits domains where the number of items is significantly smaller than the number of users, the long tail of items is rich, and users have a strong demand for personalization. It can respond in real time: a new user action immediately changes the recommendation results. It also gives good explanations for its recommendations. For cold start, as soon as a new user acts on a single item, we can recommend other items related to it. However, there is no way to recommend a new item to users without an offline update of the item similarity table.

PS: (◠◡◠) If you think the author did a good job, please hit "recommend"! Many thanks!

Origin www.cnblogs.com/rainbowly/p/12128615.html