Recommendation System in 36 Lessons: Study Notes (3)

Needing to start learning recommendation algorithms, I worked through the Geek Time column "Recommendation System in 36 Lessons" by Xing Wudao (刑无刀) and organized these notes while studying.

3 Principles of Neighborhood-Based Recommendation

3.1 Collaborative Filtering

Ask which algorithm in recommendation systems is the most famous the world over, and the answer has to be collaborative filtering. On many occasions people even equate collaborative filtering with recommendation systems, which shows how closely the two are tied together.

Collaborative filtering emphasizes "collaboration": groups helping one another, the embodiment of collective wisdom. That is why collaborative filtering is so simple, direct, and timeless.

Once your recommendation system has moved past the stage where only content-based recommendation is possible, it has accumulated a considerable amount of user behavior. This behavior is usually positive: the user explicitly or implicitly expresses liking. These behaviors can be represented as a user-item relationship matrix, or as a network, or as a graph; they are all the same thing.

The entries of this relationship matrix are users' attitudes toward items, but not every position has a value; the task is precisely to fill in the positions that don't. This relationship matrix is the lifeblood of collaborative filtering; everything revolves around it.

Collaborative filtering is a fairly broad family of algorithms, usually divided into two categories:

1. Memory-based collaborative filtering (Memory-Based);

Memory-based collaborative filtering remembers what everyone has consumed, and then recommends either items similar to those, or items consumed by similar people.

2. Model-based collaborative filtering (Model-Based).

Model-based collaborative filtering learns a model from the user-item relationship matrix, and uses the model to fill in the matrix's empty positions.

3.2 User-based collaborative filtering

3.2.1 Idea: In detail it works like this: based on historical consumption behavior, find a group of users whose tastes are very similar to yours; then collect what these similar users have recently consumed that you have not yet seen, and recommend it to you.

This process is in effect a clustering of users: users are grouped by taste, and a user's recommendations are generated from the averages of their group. The key to making such recommendations is quantifying "similar taste", which looks straightforward but is not. It determines which users end up in the same room; walk into the wrong room and the results suffer.

3.2.2 Principle:

The core is the user-item relationship matrix; this matrix is the rawest material.

The first step is to prepare user vectors: from this matrix we can, in theory, obtain a vector for each user.

Why "in theory"? Because obtaining the vector presupposes that the user has left behavior data in our product; otherwise there is no vector to obtain. This vector has three characteristics:

1. Its dimensionality is the number of items;

2. It is sparse, meaning not every dimension has a value, for the simple reason that the user has not consumed every item;

3. The values can be simple Booleans, 1 meaning liked and 0 meaning not; since the vector is sparse, the 0 values are simply ignored.

The second step is to compute pairwise similarity between users from their vectors, keeping, for each user, only those above a similarity threshold or only a fixed number of most similar users.

The third step is to produce recommendation results for each user.

Take the items liked by the user's "like-minded" neighbors, pool them together, remove the items the user has already consumed, sort the rest, and output them as the recommendation result. Simple, isn't it? The pooling can be expressed with a formula:
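In standard notation (a reconstruction matching the description below), the score of item $i$ for user $u$ is:

$$
\text{score}(u, i) = \frac{\sum_{j=1}^{n} \text{sim}(u, j)\, r_{j i}}{\sum_{j=1}^{n} \text{sim}(u, j)}
$$

where the $n$ users are those similar to $u$, $\text{sim}(u, j)$ is the similarity between users $u$ and $j$, and $r_{j i}$ is user $j$'s attitude toward item $i$.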

The left-hand side is the matching score between an item i and a user u. The right-hand side is how that score is computed: the denominator sums the similarities of the n users similar to user u, and the numerator is the similarity-weighted sum of those n users' attitudes toward item i.

The simplest attitude is 0 or 1, where 1 means liked and 0 means not; with ratings it can take values from 0 to 5. The whole formula is just the similarity-weighted average of the similar users' attitudes.

3.2.3 Practice

Consider the following questions:

1. We only have raw user behavior logs and need to construct the matrix from them. How?

2. If the vectors are very long, computing even a single similarity takes a long time. What then?

3. The number of users is usually large, so computing pairwise similarity between all users is a trap. What then?

4. When computing recommendations, it looks as though we must score every item for every user, another trap. What then?

1. Matrix construction

The matrix used in collaborative filtering computations is sparse; in plain words, many matrix elements simply do not exist, because they are zero. Here are the typical storage formats for sparse matrices:

1. CSR: a somewhat more complex storage that encodes the matrix as a whole. It has three parts: the values, the column indices, and the row offsets.

2. COO: a simpler storage that represents each element as a triple (row index, column index, value); only existing elements are stored, missing values are not.

These storage formats are standard in common computational frameworks, such as Spark and Python's NumPy/SciPy ecosystem. Well-known algorithm competitions also usually provide data in these formats; I won't elaborate further here.

Convert your raw behavior logs into the formats above, and you can feed them straight into these frameworks as standard input.
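A minimal sketch of this conversion in Python, assuming a hypothetical log of (user index, item index, rating) triples whose ids are already mapped to 0-based integers (scipy.sparse provides both formats):

```python
from scipy.sparse import coo_matrix

# hypothetical behavior log: (user index, item index, rating)
logs = [(0, 0, 5.0), (0, 1, 3.0), (0, 2, 2.0),
        (1, 0, 3.0), (1, 1, 4.0),
        (2, 1, 2.0), (2, 2, 5.0)]

rows, cols, vals = zip(*logs)
# COO: one (row, column, value) triple per stored element
user_item = coo_matrix((vals, (rows, cols)), shape=(3, 3))
# CSR: values + column indices + row offsets, efficient for row slicing and math
user_item_csr = user_item.tocsr()
print(user_item_csr.toarray())
```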

2. Similarity computation

The first issue is computing a single similarity. If the vectors are long, every similarity measure must traverse the whole vector, and it is even worse when implemented with a plain loop. There are usually two ways to reduce the computational cost:

1. Sample the vector. The reasoning is simple: suppose two 100-dimensional vectors have a similarity of 0.7. If I can tolerate some loss of accuracy, instead of computing over all 100 dimensions I randomly pick 10 of them and get, say, 0.72. Compared with the 0.7 from all 100 dimensions, which is certainly more trustworthy, the 0.72 costs ten times less to compute and its error is small, so it is the more economical choice. Twitter proposed an algorithm of this kind called DIMSUM, which has been implemented in Spark.

2. Vectorize the computation. This is less a trick than a way of thinking. In machine learning, computation between vectors is routine; should it be implemented with loops? No. Modern linear algebra libraries support vector operations directly, far faster than looping. So wherever possible, convert loops into direct vector computations; the commonly used libraries support this natively, for example Python's NumPy.
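A minimal sketch of the idea (toy data; NumPy computes all pairwise cosine similarities in one matrix product instead of nested loops):

```python
import numpy as np

R = np.array([[5.0, 3.0, 2.0],
              [3.0, 4.0, 0.0],
              [0.0, 2.0, 5.0]])  # toy user-item matrix, one row per user

unit = R / np.linalg.norm(R, axis=1, keepdims=True)  # length-normalize each row
sim = unit @ unit.T  # all pairwise cosine similarities at once
print(np.round(sim, 3))
```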

 

The second issue: with enough users, computing similarity between every pair is enormously expensive. There are two ways out of the excessive computation:

1. Split the similarity computation into MapReduce tasks: Map the original matrix into key-value pairs whose key is a pair of users and whose value is the product of the two users' ratings for the same item; in the Reduce stage, sum these products per user pair; after the MapReduce job finishes, normalize the sums;

2. Simply don't use user-based collaborative filtering.

Also, for this kind of pairwise similarity task, if the dataset is not large, generally no more than a million objects, and the matrix is sparse, many single-machine tools are in fact faster, such as KGraph, GraphChi, and so on.

3. Computing recommendations

After obtaining user similarities comes a hard part: computing the recommendation scores. Obviously, scoring every item for every user would take as many computations as there are elements in the whole matrix, an unacceptable cost. Now look back at the aggregation formula from earlier; it has a couple of properties we can exploit:

1. Only items liked by similar users need to be scored, which is great news, since they are far fewer than the full set of items;

2. The computation can be split into MapReduce tasks.

The MapReduce split works like this:
1. Traverse each user's list of liked items;
2. Fetch that user's list of similar users;
3. Map each liked item into two records and emit them: one keyed by the triple < similar-user ID, item ID, 1 > (the triple can be joined into a string) with value < similarity >; the other keyed by < similar-user ID, item ID, 0 > with value < liking degree × similarity >. The 1 and 0 distinguish the two kinds of record and are used in the last step;
4. In the Reduce stage, sum by key and output;
5. Divide the value of < similar-user ID, item ID, 0 > by the value of < similar-user ID, item ID, 1 >.
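A single-machine sketch of this split (hypothetical toy inputs; a dictionary stands in for the shuffle-and-reduce):

```python
from collections import defaultdict

# hypothetical inputs
likes = {"u1": {"i1": 1.0, "i2": 1.0}}         # user -> liked items with degree
similar = {"u1": [("u2", 0.9), ("u3", 0.5)]}   # user -> (similar user, similarity)

sums = defaultdict(float)  # the Reduce stage: sum emitted values per key

# Map: emit two records per (liked item, similar user) pair
for user, items in likes.items():
    for sim_user, sim in similar[user]:
        for item, degree in items.items():
            sums[(sim_user, item, 1)] += sim           # denominator record
            sums[(sim_user, item, 0)] += degree * sim  # numerator record

# Last step: divide the 0-keyed sum by the 1-keyed sum
scores = {(u, i): sums[(u, i, 0)] / sums[(u, i, 1)]
          for (u, i, flag) in sums if flag == 0}
print(scores)
```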

In general, small and medium-sized companies should stay away from distributed computing unless it is truly necessary; it looks impressive and smells of big data, but in practice it does more harm than good.

Splitting into MapReduce tasks does not require Hadoop or Spark; the process can also be implemented on a single machine.

Because the Map step decouples the originally coupled computations and lays them out flat, we can use multi-threading to achieve the Map effect. For example, the C++ library OpenMP lets us use multiple threads painlessly and squeeze out every core of the machine.

4. Some improvements

There are some common improvements to user-based collaborative filtering, mainly in how a user's degree of liking for an item is weighted:

1. Penalize liking of popular items: popular items hardly reflect a user's true interests, since liking them is more likely herd behavior or an idle click, a common trait of group behavior;

2. Apply time decay to likes, usually with an exponential function whose exponent is negative, with magnitude positively related to the time elapsed since the liking behavior. This is easy to understand: what I liked as a child hardly represents my taste now; people inevitably change, it is human nature. (A concrete form is sketched below.)
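A common concrete form (a sketch, not prescribed by the column; $\lambda$ is a tunable decay rate and $\Delta t$ the time elapsed since the like):

$$
w = e^{-\lambda \cdot \Delta t}
$$

so older likes contribute smaller weights.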

3.2.4 Application scenarios

In which scenarios is user-based collaborative filtering used? It has two outputs:

1. A list of similar users;

2. Recommendation results derived from those users.

So we can recommend not only items but also users! For example, the "similar followers" or "people whose taste is like yours" features we have seen on some social platforms can be computed this way.

As for the recommendation results themselves: because they are computed from taste, this method works better in scenarios that emphasize personal privacy; there, users are less influenced by big-V celebrity accounts, and the results better reflect genuine interest groups rather than an incited mob.

3.3 Item-based (Item-Based): backstory

Item-based collaborative filtering is also commonly called Item-Based; the English name makes relevant articles easier to search for, so it is the one mentioned more often.

Born in 1998, item-based collaborative filtering was first proposed by Amazon, whose inventors published the corresponding paper (Item-Based Collaborative Filtering Recommendation Algorithms) in 2001.

The paper has nearly 7000 citations on Google Scholar and received the Test of Time award at the WWW 2016 conference, with the citation: "This outstanding paper has deeply influenced practical applications." Still shining after more than 15 years, the award is clearly well-deserved.

Although companies everywhere use this algorithm today as if it were a public resource, that is not actually the case: Amazon applied for a patent as early as 1998, three years before the paper was published.

3.4 Item-based (Item-Based): principles

Before item-based collaborative filtering appeared, information filtering systems most commonly used user-based collaborative filtering: first compute similar users, then recommend items based on those similar users' preferences. That algorithm has a few problems:

1. The number of users tends to be large, making the pairwise computation very difficult, a real bottleneck;

2. Users' tastes actually change rather quickly instead of staying static, and such interest drift is hard to reflect;

3. The data are sparse: users actually share rather little common consumption behavior, and what they do share tends to be popular items, which helps little in uncovering a user's interests.

 

Unlike the user-based approach, item-based collaborative filtering first computes similar items, then recommends items similar to what the user has consumed or is currently consuming. How does the item-based algorithm solve the problems above?

First, the number of items, or strictly speaking the number of recommendable items, is often smaller than the number of users, so computing similarity between items generally does not become a bottleneck.

Second, similarity between items is relatively static; it changes far more slowly than users' tastes do, which completely sidesteps the interest-drift problem.

Finally, each item corresponds to a large number of consumers, so when computing similarity between items the sparsity situation is better than between users.

As I said earlier, what collaborative filtering depends on most is the user-item relationship matrix, and item-based collaborative filtering is no exception. Its basic steps are these (a small code sketch follows the list):

1. Build the user-item relationship matrix; the elements can be the user's consumption behavior itself, post-consumption ratings, or some quantification of the behavior such as duration, count, or spending;

2. Taking rows as items and columns as users, compute the similarity between every two row vectors to obtain the item similarity matrix, whose rows and columns are both items;

3. Generate the recommendation results; depending on the scenario there are two forms: one is related-item recommendation for a given item, the other is personalized "You may also like" results on the home page. Don't worry, both are described separately below.
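A minimal sketch of steps 1 and 2 (toy data; scikit-learn's cosine_similarity is one possible choice of measure):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# step 1: user-item matrix, rows = users, columns = items
R = np.array([[5.0, 3.0, 2.0],
              [3.0, 4.0, 0.0],
              [0.0, 2.0, 5.0]])

# step 2: transpose so each row vector is an item, then pairwise cosine
item_sim = cosine_similarity(R.T)
print(np.round(item_sim, 3))
```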

3.5 Computing item similarity

The previous section only described computing similarity between items in general terms; now let's go into detail. What does an item vector taken from the user-item relationship matrix look like? Let me describe it:

1. It is a sparse vector;

2. Its dimensions are users: each dimension represents one user, and the vector's dimensionality is the total number of users;

3. The value of each dimension is the outcome of the user consuming this item: it may be a Boolean value for the behavior itself, a quantification of the behavior such as duration, count, or spending, or a post-consumption rating;

4. Users who never consumed the item are not represented, hence the sparsity.

Next comes computing the pairwise similarity of items; cosine similarity is the usual choice, though other similarity measures work as well. The computation is as follows:
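In standard notation (a reconstruction matching the explanation below), for item vectors $p$ and $q$:

$$
\cos(p, q) = \frac{\sum_{k} p_k\, q_k}{\sqrt{\sum_{k} p_k^2}\,\sqrt{\sum_{k} q_k^2}}
$$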

In words: the denominator computes the lengths of the two item vectors, the square root of the sum of squared element values; the numerator is the dot product of the two vectors, multiplying the elements at matching positions and summing.

The physical meaning of the formula is the cosine of the angle between the two vectors: a similarity of 1 corresponds to an angle of 0 degrees, stuck together like glue; a similarity of 0 corresponds to an angle of 90 degrees, utterly unrelated, strangers passing by.

There are two common refinements:

1. Item centering. From the rating matrix, subtract each item's mean rating: first compute the mean of the ratings an item received, then subtract that mean from the item's rating vector. What is the purpose? To remove the irrational contribution of hardcore fan bases. For example, a movie starring some pop idol may be rated sky-high by devoted fans; centering the items mitigates this to some extent.

2. User centering. From the rating matrix, subtract each user's mean rating: first compute each user's mean, then subtract it from all the ratings that user gave.

What is the purpose of that? Everyone's standards differ: some are strict, some lenient. Subtracting the user's mean removes some of the subjective component while keeping the preferences.

The similarity computation above applies not only to rating matrices but also to behavior matrices. A behavior matrix is one whose elements are Boolean values, 0 or 1, the implicit feedback discussed in an earlier column. Because implicit feedback's values are special, some improved item-based recommendation algorithms cannot be applied to it, for example the famous Slope One algorithm.

3.6 Computing recommendation results

After obtaining item similarities, the next step is to recommend items the user may be interested in. Item-based collaborative filtering has two application scenarios.

The first is TopK recommendation, often taking a form like "Guess you like".

The way it works: when the user visits the home page, aggregate the items similar to those the user has already consumed, and push them in descending order of aggregated score. The aggregation formula is this:
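In standard notation (a reconstruction matching the explanation below):

$$
\hat{r}_{u i} = \frac{\sum_{j=1}^{m} \text{sim}(i, j)\, r_{u j}}{\sum_{j=1}^{m} \text{sim}(i, j)}
$$

where the sum runs over the $m$ items user $u$ has rated.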

The core idea of this formula is the same as in the user-based algorithm: aggregation weighted by similarity.

To predict user u's score for item i, traverse all the items user u has rated, say m in total. Multiply the similarity between each of them and item i by the user's rating of it, sum these weighted values, and divide by the sum of all the similarities; the result is a weighted average rating, which serves as the prediction of user u's score for item i.

Note that in item-based recommendation we do not have to run the computation over all items: we only need to take each item the user has rated and pull out the items similar to it.

This process is done offline. Afterwards, remove the items the user has already consumed, keep the k highest-scoring results, and store them. When the user visits the home page, they can be fetched directly.

The second is related recommendation, the scenario referred to in the title of this column's lesson.

This kind of recommendation needs no advance aggregation: when a user visits an item's detail page, or has just finished consuming an item, directly fetch the items most similar to this one and recommend them; this is the "people who viewed this also viewed" or "people who bought this also bought" style of results.

3.7 Slope One algorithm

In classic item-based recommendation, the similarity matrix cannot be updated in real time; the whole computation is offline. But there is another problem: the similarity computation does not take the confidence of a similarity into account. For example, suppose two items were liked by the same single user, and only by that one user; their cosine similarity then comes out as 1, yet this 1 carries the greatest weight in the final aggregated recommendation score.

The Slope One algorithm improves nicely on these issues. First published in 2005, Slope One is designed specifically for rating matrices and does not apply to behavior matrices. Slope One does not compute similarity between items; instead it computes the gap between items, something like the opposite of similarity. An example makes it clear at a glance; here is a simple rating matrix:
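(Reconstructed from the description that follows:)

| | Item A | Item B | Item C |
| --- | --- | --- | --- |
| User 1 | 5 | 3 | 2 |
| User 2 | 3 | 4 | |
| User 3 | | 2 | 5 |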

↑ This matrix reflects these facts: user 1 rated items A, B, and C with 5, 3, and 2; user 2 rated items A and B with 3 and 4; user 3 rated items B and C with 2 and 5.

Now, first compute the pairwise gaps between items:

Gap between A and B: ((5+3)-(3+4))/2 = 0.5;
Gap between A and C: (5-2)/1 = 3;
Gap between B and A: ((3+4)-(5+3))/2 = -0.5;
Gap between B and C: ((3+2)-(2+5))/2 = -1;
Gap between C and A: (2-5)/1 = -3;
Gap between C and B: ((2+5)-(3+2))/2 = 1;

The divisor is the number of users the two items have in common, which represents the confidence in the gap. For example, the gap between item A and item B is 0.5 with 2 common users; conversely, the gap between item B and item A is -0.5, again with 2 common users. Knowing the gap, you can predict a user's score on one item from their score on another.

If all we know is that user 3 rated item B with 2, then user 3's predicted rating for item A is 2 + 0.5 = 2.5, because the gap from item B to item A is 0.5.

Pushing further on this basis: if we know the user's scores on multiple items, how do we aggregate the resulting predictions?

The method is a weighted average of the individual predictions, weighted by the number of common users. For example, suppose we know not only user 3's rating of 2 for item B but also the rating of 5 for item C. Via item B the prediction for item A is 2.5, as just computed; via item C it is 5 + 3 = 8. The weighted average is then (2.5 × 2 + 8 × 1) / (2 + 1) ≈ 4.33.
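A minimal sketch of this aggregation with the example's numbers:

```python
# Predictions for item A from each of user 3's rated items,
# paired with the number of common users backing each gap.
candidates = [
    (2.0 + 0.5, 2),  # via item B: rating 2 plus gap(A, B) = 0.5, 2 common users
    (5.0 + 3.0, 1),  # via item C: rating 5 plus gap(A, C) = 3, 1 common user
]

prediction = (sum(score * n for score, n in candidates)
              / sum(n for _, n in candidates))
print(round(prediction, 2))  # 4.33
```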

3.8 The nature of similarity

Recommendation algorithms fall into two schools: the machine learning school and the similarity school. The machine learning school is the rising star, while the similarity school is the old grandmaster that props up half of the recommendation-system world.

A neighbor, as the name suggests, is someone who lives nearby. If you have a neighbor, it is intuitively reasonable for a social app to recommend that neighbor to you; of course, if the neighbor happens to be surnamed Wang, better not.

The neighbors meant here are not restricted to proximity in three-dimensional space; neighbors can be found in any high-dimensional space. But when the feature dimensions of users and items are high, finding a user's next-door neighbor is no longer so intuitive, and a suitable way of measuring similarity must be chosen.

The core of neighborhood recommendation is the choice of similarity measure: because neighborhood recommendation does not use optimization-based thinking, its effectiveness usually depends on how the matrix is quantified and which similarity is chosen.

Alongside similarity there is the companion concept of distance; both quantify how close two objects are in a high-dimensional space, two sides of the same coin.

The similarity school of recommendation algorithms rests on an implicit hypothesis: if two objects are very similar, that is, very close, then they readily produce the same actions.

If two news articles are very similar, they are readily clicked and read by the same person; if two users are similar, they readily click on the same news. This hypothesis matches intuition, and most of the time it works.

In fact, many algorithms from the other school, machine learning, can from a certain angle also be seen as measuring similarity.

For example, in linear regression or logistic regression, one side is the feature vector and the other is the model parameter vector; the dot product between them can be seen as a similarity computation, except that the parameter values are not specified by hand but automatically distilled from the data by an optimization algorithm.

In neighborhood recommendation, the most commonly used similarity is cosine similarity. But it is not the only option: there are also Euclidean distance, Pearson correlation, adjusted cosine similarity, locality-sensitive hashing, and more. Their usage scenarios differ; I introduce them one by one below.

3.9 Similarity computation methods

3.9.1 Data classification

The object of similarity computation is a vector, or equivalently a coordinate in a high-dimensional space, which means the same thing. The vector's values come in two kinds:

1. Real values;

2. Boolean values, 0 or 1.

1. Euclidean distance

Treat the two objects as two points in the same space; call them p and q, each with n coordinates. The Euclidean distance measures the distance between these two points, the distance traveled in moving from p to q. Euclidean distance is not suitable for Boolean vectors.

Obviously, the Euclidean distance is a non-negative number whose maximum is positive infinity, while similarity measures are usually expected to fall within [-1, 1] or [0, 1]. So Euclidean distance either cannot be used directly in such settings or needs a secondary conversion into (0, 1].
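For reference, the distance and one common (assumed here, not the only choice) conversion to a (0, 1] similarity:

$$
d(p, q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}, \qquad s = \frac{1}{1 + d(p, q)}
$$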

2. Cosine similarity

The famous cosine similarity measures the angle between two vectors, using the cosine of the angle, hence the name. When the angle between two vectors is 0 degrees the cosine is 1; at 90 degrees the cosine is 0; at 180 degrees it is -1.

Cosine similarity is commonly used for measuring the similarity of texts, of users, and of items. One thing to note about cosine similarity: it is independent of vector length, because the computation normalizes by the vector lengths (the formula shown in section 3.5).

Behind a length-normalized similarity measure lies this idea: as long as two vectors point in the same direction, whatever their magnitudes, they can be regarded as "similar".

For example, if I summarize a 5000-word blog post in a 140-character weibo, the two resulting text vectors can be considered to point in the same direction; the term frequencies differ in magnitude, but cosine similarity still judges them similar.

In collaborative filtering, choosing cosine similarity means relying, to some degree, more on the number of users who rated both items than on the ratings those users gave. This is the consequence of cosine similarity's normalization by vector length.

This insensitivity of cosine similarity to absolute magnitudes still causes problems in some applications.

A small example: user A rates two movies 1 and 2, and user B rates the same two movies 4 and 5. Cosine similarity puts the two users' similarity at 0.98. That contradicts intuition: user A clearly dislikes both movies.
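A quick check of that number (a NumPy sketch):

```python
import numpy as np

a = np.array([1.0, 2.0])  # user A's ratings
b = np.array([4.0, 5.0])  # user B's ratings

cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(round(cos, 2))  # 0.98
```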

To address this, there is an improvement called Adjusted Cosine Similarity. The adjustment is simple: first compute the mean of each dimension, subtract that mean from every vector in each dimension, and then compute cosine similarity.

For the small example above, adjusted cosine similarity yields -0.1, showing the two users' tastes as opposite, in line with intuition.

3. Pearson correlation

Pearson correlation is in fact also a kind of cosine similarity, computed after centering the vectors: vectors p and q each subtract their own mean, and then cosine similarity is computed.
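Written out, with $\bar{p}$ and $\bar{q}$ the means of the two vectors:

$$
\rho(p, q) = \frac{\sum_{k}(p_k - \bar{p})(q_k - \bar{q})}{\sqrt{\sum_{k}(p_k - \bar{p})^2}\,\sqrt{\sum_{k}(q_k - \bar{q})^2}}
$$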

Pearson correlation ranges from -1 to 1: -1 indicates negative correlation, 1 positive correlation. What Pearson correlation really measures is whether two random variables rise and fall together.

If we sample the two variables simultaneously and one tends to be large when the other is large and small when the other is small, that is positive correlation, the two in cahoots, and the computed value approaches 1; the opposite case approaches -1.

Because Pearson correlation measures whether the trends of two variables move together, it is unsuitable for computing correlation between Boolean vectors: two Boolean vectors correspond to two 0-1 distributed random variables, which take only two values, so there is no such thing as a "trend of rising and falling" at all.

4. Jaccard similarity

Jaccard similarity is the proportion of the two sets' intersection within their union. Since sets are a natural fit for Boolean vectors, Jaccard similarity is practically custom-made for them. The computation:

1. The numerator is the dot product of the two Boolean vectors, which gives the number of elements in the intersection;

2. The denominator is the element-wise OR of the two vectors, followed by summing the elements.
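A minimal sketch on two hypothetical Boolean vectors:

```python
import numpy as np

a = np.array([1, 1, 0, 1, 0], dtype=bool)
b = np.array([1, 0, 0, 1, 1], dtype=bool)

# intersection size over union size
jaccard = (a & b).sum() / (a | b).sum()
print(jaccard)  # 0.5
```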

Cosine similarity suits rating data; Jaccard similarity suits implicit-feedback data. For example, when computing similarity between users from their favoriting behavior, Jaccard similarity is well placed to take on the job.


Origin blog.csdn.net/qq_34732729/article/details/103250147