A collaborative filtering algorithm typically searches a large group of people to find a smaller group whose tastes are similar to ours. It then looks at what else those people like and combines their preferences into a ranked list of recommendations.
Methods for calculating similarity: Euclidean distance and Pearson correlation.
Euclidean distance evaluation:
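The Euclidean distance treats each person's ratings as a point and measures the straight-line distance between two points: the square root of the sum of the squared differences. In Python, pow(n,2) squares a number and the sqrt function from the math module takes the square root. For two people whose ratings of the same two items are (4.5, 1) and (4, 2), the distance is: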
from math import sqrt
sqrt(pow(4.5-4,2)+pow(1-2,2))
The expression above gives the distance, which is shorter for people whose preferences are more similar. However, we need a function that returns larger values for more similar preferences. To get one, add 1 to the distance (this avoids a division-by-zero error) and take the reciprocal:
1/(1+sqrt(pow(4.5-4,2)+pow(1-2,2)))
This new expression always returns a value greater than 0 and at most 1, where 1 means that the two people have identical preferences. Putting these pieces together, we can write a function that calculates the similarity:
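from math import sqrt

# Return a distance-based similarity score for person1 and person2
def sim_distance(prefs,person1,person2):
    # Collect the items that both people have rated
    si={}
    for item in prefs[person1]:
        if item in prefs[person2]:
            si[item]=1
    # If they have no items in common, the similarity is 0
    if len(si)==0: return 0
    # Add up the squared differences over the shared items
    sum_of_squares=sum([pow(prefs[person1][item]-prefs[person2][item],2)
                        for item in prefs[person1] if item in prefs[person2]])
    return 1/(1+sqrt(sum_of_squares))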
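As a quick check of the function above, the sketch below runs it on a small, made-up ratings dictionary. The names sample_prefs, Alice, Bob, and Carol and all of the scores are hypothetical and exist only for illustration. The function body is repeated in condensed form so the snippet runs on its own.

```python
from math import sqrt

def sim_distance(prefs, person1, person2):
    # Items rated by both people
    shared = [item for item in prefs[person1] if item in prefs[person2]]
    if len(shared) == 0:
        return 0
    sum_of_squares = sum(pow(prefs[person1][item] - prefs[person2][item], 2)
                         for item in shared)
    return 1 / (1 + sqrt(sum_of_squares))

# Hypothetical ratings, invented for this example
sample_prefs = {
    'Alice': {'Film A': 4.5, 'Film B': 1.0},
    'Bob':   {'Film A': 4.0, 'Film B': 2.0},
    'Carol': {'Film C': 3.0},
}

print(sim_distance(sample_prefs, 'Alice', 'Bob'))    # roughly 0.472
print(sim_distance(sample_prefs, 'Alice', 'Alice'))  # identical ratings give 1.0
print(sim_distance(sample_prefs, 'Alice', 'Carol'))  # no shared items gives 0
```

Alice and Bob reuse the two points from the one-line expression earlier, so the first call should reproduce that value.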
Pearson correlation evaluation:
The Pearson correlation coefficient measures how well two sets of data fit a straight line. Its formula is more complex than the Euclidean distance formula, but it tends to give better results when the data is not well normalized, for example when one person consistently rates everything higher than another.
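For reference, the computation carried out by the function that follows can be written as the formula below. The symbols are chosen here for illustration and do not appear in the original: x_i and y_i are the two people's ratings of the i-th shared item, and n is the number of items both have rated.

```latex
r = \frac{\sum_i x_i y_i - \frac{\left(\sum_i x_i\right)\left(\sum_i y_i\right)}{n}}
         {\sqrt{\left(\sum_i x_i^2 - \frac{\left(\sum_i x_i\right)^2}{n}\right)
                \left(\sum_i y_i^2 - \frac{\left(\sum_i y_i\right)^2}{n}\right)}}
```

The numerator corresponds to num in the code and the denominator to den.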
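# Return the Pearson correlation coefficient for p1 and p2
def sim_pearson(prefs,p1,p2):
    # Collect the items that both people have rated
    si={}
    for item in prefs[p1]:
        if item in prefs[p2]: si[item]=1
    # If they have no items in common, return 0
    n=len(si)
    if n==0: return 0
    # Sums of the ratings
    sum1=sum([prefs[p1][it] for it in si])
    sum2=sum([prefs[p2][it] for it in si])
    # Sums of the squared ratings
    sum1Sq=sum([pow(prefs[p1][it],2) for it in si])
    sum2Sq=sum([pow(prefs[p2][it],2) for it in si])
    # Sum of the products of the ratings
    pSum=sum([prefs[p1][it]*prefs[p2][it] for it in si])
    # Pearson score: numerator over denominator
    num=pSum-(sum1*sum2/n)
    den=sqrt((sum1Sq-pow(sum1,2)/n)*(sum2Sq-pow(sum2,2)/n))
    # A zero denominator means at least one person gave every shared item the same rating
    if den==0: return 0
    r=num/den
    return r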
The function returns a value between -1 and 1. A value of 1 means that the two people's ratings lie exactly on a line with positive slope, which includes the case where they give every item the same rating. Unlike the distance metric, this value already grows with similarity, so it does not need to be transformed.
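To illustrate how the two measures differ on unnormalized ratings, the sketch below compares them on two hypothetical raters, where Strict scores every item exactly one point lower than Generous. The names and scores are invented for this example, and the two functions are condensed restatements of the ones in this section so that the snippet is self-contained.

```python
from math import sqrt

def sim_distance(prefs, p1, p2):
    shared = [it for it in prefs[p1] if it in prefs[p2]]
    if not shared:
        return 0
    return 1 / (1 + sqrt(sum((prefs[p1][it] - prefs[p2][it]) ** 2 for it in shared)))

def sim_pearson(prefs, p1, p2):
    shared = [it for it in prefs[p1] if it in prefs[p2]]
    n = len(shared)
    if n == 0:
        return 0
    sum1 = sum(prefs[p1][it] for it in shared)
    sum2 = sum(prefs[p2][it] for it in shared)
    sum1_sq = sum(prefs[p1][it] ** 2 for it in shared)
    sum2_sq = sum(prefs[p2][it] ** 2 for it in shared)
    p_sum = sum(prefs[p1][it] * prefs[p2][it] for it in shared)
    num = p_sum - sum1 * sum2 / n
    den = sqrt((sum1_sq - sum1 ** 2 / n) * (sum2_sq - sum2 ** 2 / n))
    if den == 0:
        return 0
    return num / den

# Hypothetical ratings: the same ordering of items, shifted by one point
ratings = {
    'Generous': {'A': 5.0, 'B': 4.0, 'C': 3.0},
    'Strict':   {'A': 4.0, 'B': 3.0, 'C': 2.0},
}

print(sim_pearson(ratings, 'Generous', 'Strict'))   # 1.0: the ratings fit a line exactly
print(sim_distance(ratings, 'Generous', 'Strict'))  # roughly 0.366: penalized by the offset
```

The Pearson score ignores the constant offset between the two raters, while the distance-based score treats it as disagreement.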