Pearson correlation similarity

Pearson correlation-based similarity

The Pearson correlation coefficient measures the degree of linear correlation between two variables, and its value lies in the interval [-1, 1]. The stronger the linear relationship between the two variables, the closer the coefficient is to 1 or -1. If one variable increases as the other increases, they are positively correlated and the coefficient is greater than 0; if one variable increases while the other decreases, they are negatively correlated and the coefficient is less than 0; if the coefficient equals 0, there is no linear correlation between them.

Covariance: Used in probability theory and statistics to measure how two variables vary together. If the two variables tend to change in the same direction, that is, when one is greater than its expected value the other is also greater than its own expected value, the covariance is positive; if they tend to change in opposite directions, the covariance is negative.

Cov(X, Y) = E[(X - u)(Y - v)]

where u is the expectation E(X) of X and v is the expectation E(Y) of Y.
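The sign behaviour of the covariance can be checked numerically. The following is a minimal sketch, assuming the population form Cov(X, Y) = E[(X - u)(Y - v)] estimated over equal-length samples; the function name `covariance` and the sample values are illustrative only.

```python
def covariance(xs, ys):
    """Population covariance E[(X - u)(Y - v)] of two equal-length samples."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("inputs must be non-empty and of equal length")
    u = sum(xs) / len(xs)  # sample estimate of E(X)
    v = sum(ys) / len(ys)  # sample estimate of E(Y)
    return sum((x - u) * (y - v) for x, y in zip(xs, ys)) / len(xs)


# Illustrative data: the second series rises with the first, the third falls.
same_direction = covariance([1, 2, 3], [2, 4, 6])
opposite_direction = covariance([1, 2, 3], [6, 4, 2])
print(same_direction, opposite_direction)
```

With these made-up values the first covariance is positive and the second is negative, matching the description above.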

Standard deviation: The standard deviation is the square root of the variance.

Variance: In probability theory and statistics, the variance of a random variable measures its degree of dispersion, that is, how far the variable tends to lie from its expected value.

That is, the variance equals the expectation of the squared deviation from the expected value: Var(X) = E[(X - u)^2], where u = E(X).
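Covariance and standard deviation combine into the Pearson coefficient through the standard definition r = Cov(X, Y) / (σ_X · σ_Y). The following is a minimal sketch of that definition, assuming population (divide-by-n) estimates; the function name `pearson` and the choice to return `None` when a variance is zero are assumptions made for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson r = Cov(X, Y) / (std(X) * std(Y)); None if undefined."""
    n = len(xs)
    if n != len(ys) or n == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    u = sum(xs) / n
    v = sum(ys) / n
    cov = sum((x - u) * (y - v) for x, y in zip(xs, ys)) / n
    var_x = sum((x - u) ** 2 for x in xs) / n  # Var(X) = E[(X - u)^2]
    var_y = sum((y - v) ** 2 for y in ys) / n
    if var_x == 0 or var_y == 0:
        return None  # a constant series has no defined correlation
    return cov / math.sqrt(var_x * var_y)


r_positive = pearson([1, 2, 3], [2, 4, 6])
r_negative = pearson([1, 2, 3], [6, 4, 2])
r_undefined = pearson([1, 1, 1], [1, 2, 3])
```

Here `r_positive` is close to 1, `r_negative` is close to -1, and `r_undefined` is `None` because the first series never varies.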

Similarity based on the Pearson correlation coefficient has two disadvantages:
(1) it does not take into account how the number of items rated by both users affects the similarity;
(2) if two users have only one rated item in common, the similarity cannot be calculated.

In the table above, the rows give the ratings of items (101-103) by users (1-5). Intuitively, User1 and User5 have 3 rated items in common and their ratings do not differ much, so their similarity should be higher than that between User1 and User4. Yet User1 and User4 receive the higher similarity, 1.
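Both disadvantages can be reproduced with a small sketch. The ratings below are hypothetical values chosen for illustration, not the data from the table, and `pearson_similarity` is an assumed helper that computes the Pearson coefficient over the items two users have both rated.

```python
import math


def pearson_similarity(ratings_a, ratings_b):
    """Pearson r over co-rated items; None when it cannot be computed."""
    common = sorted(set(ratings_a) & set(ratings_b))
    if not common:
        return None
    xs = [ratings_a[item] for item in common]
    ys = [ratings_b[item] for item in common]
    u = sum(xs) / len(xs)
    v = sum(ys) / len(ys)
    cov = sum((x - u) * (y - v) for x, y in zip(xs, ys))
    ss_x = sum((x - u) ** 2 for x in xs)
    ss_y = sum((y - v) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        return None  # e.g. a single co-rated item: no variation at all
    return cov / math.sqrt(ss_x * ss_y)


# Hypothetical users (item id -> rating), not taken from the original table.
user_a = {101: 5.0, 102: 3.0, 103: 2.5}
user_b = {101: 2.0, 102: 1.0}             # 2 items in common, distant ratings
user_c = {101: 4.0, 102: 3.0, 103: 2.0}   # 3 items in common, close ratings
user_d = {101: 4.0}                       # 1 item in common

sim_ab = pearson_similarity(user_a, user_b)
sim_ac = pearson_similarity(user_a, user_c)
sim_ad = pearson_similarity(user_a, user_d)
print(sim_ab, sim_ac, sim_ad)
```

With only two items in common, any two users whose ratings move in the same direction get a similarity of 1, so `sim_ab` exceeds `sim_ac` even though `user_c` agrees with `user_a` on more items and more closely. With a single item in common the variance is zero and the result is `None`.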

The same situation often arises in practice. Two users who have both watched the same 200 movies should be regarded as more similar, even if their ratings are not identical or very close, than two users who have only 2 movies in common. But this is not what happens: if the two users give the same or very similar ratings to those 2 movies, the similarity computed by Pearson correlation is significantly higher than that between the users who share 200 movies.
