Collaborative Filtering Algorithm

Collaborative filtering algorithms typically search a large group of people to find a smaller group whose tastes are similar to ours. The algorithm then examines the other items these people like and combines them into a ranked list of recommendations.

Methods for calculating similarity: Euclidean distance and Pearson correlation.

Euclidean distance evaluation:

In Python, you can use pow(n, 2) to square a number and the sqrt function from the math module to take a square root:

from math import sqrt

sqrt(pow(4.5-4,2)+pow(1-2,2))

This formula gives the distance: the more similar the preferences, the shorter the distance. However, we want a function that returns larger values for more similar preferences. To get one, we add 1 to the distance (which avoids a division-by-zero error) and take the reciprocal:

1/(1+sqrt(pow(4.5-4,2)+pow(1-2,2)))

This new function always returns a value greater than 0 and at most 1, where 1 means that both people have identical preferences. Putting these pieces together, we can construct a function that calculates the similarity:

from math import sqrt

def sim_distance(prefs, person1, person2):
    # Collect the items rated by both people
    si = {}
    for item in prefs[person1]:
        if item in prefs[person2]:
            si[item] = 1
    # No items in common: return 0
    if len(si) == 0:
        return 0
    # Sum the squared differences over the shared items
    sum_of_squares = sum([pow(prefs[person1][item] - prefs[person2][item], 2)
                          for item in si])
    return 1 / (1 + sqrt(sum_of_squares))
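
As a quick check, the sketch below applies the same computation to a small, made-up ratings dictionary. The names Alice, Bob, and Carol and all of the ratings are hypothetical; the function is repeated (in a slightly condensed form) only so that the snippet runs on its own.

```python
from math import sqrt

def sim_distance(prefs, person1, person2):
    # Items rated by both people
    shared = [item for item in prefs[person1] if item in prefs[person2]]
    if not shared:
        return 0
    sum_of_squares = sum((prefs[person1][item] - prefs[person2][item]) ** 2
                         for item in shared)
    return 1 / (1 + sqrt(sum_of_squares))

# Hypothetical ratings, invented for illustration only
prefs = {
    'Alice': {'Film A': 4.5, 'Film B': 1.0},
    'Bob':   {'Film A': 4.0, 'Film B': 2.0},
    'Carol': {'Film C': 3.0},
}

same = sim_distance(prefs, 'Alice', 'Alice')   # identical ratings
close = sim_distance(prefs, 'Alice', 'Bob')    # same numbers as the arithmetic example earlier
none = sim_distance(prefs, 'Alice', 'Carol')   # no items in common
print(same, close, none)
```

Identical ratings give the maximum score, partially differing ratings give a value strictly between 0 and 1, and people with no shared items fall back to 0.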

Pearson correlation evaluation:

The correlation coefficient measures how well two sets of data fit a straight line. Its formula is more complex than the Euclidean distance formula, but it tends to give better results when the data is not well normalized, for example when one person consistently rates items higher than another.
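
For reference, one common computational form of the coefficient is written out below. The symbols are notation introduced here rather than taken from the text: x_i and y_i are the two people's ratings of the i-th shared item, and n is the number of shared items.

```latex
r = \frac{\sum_{i=1}^{n} x_i y_i - \frac{1}{n}\sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}
         {\sqrt{\left(\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right)
                \left(\sum_{i=1}^{n} y_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} y_i\right)^2\right)}}
```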

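from math import sqrt

def sim_pearson(prefs, p1, p2):
    # Collect the items rated by both people
    si = {}
    for item in prefs[p1]:
        if item in prefs[p2]:
            si[item] = 1
    n = len(si)
    # No items in common: return 0
    if n == 0:
        return 0
    # Sums of the ratings
    sum1 = sum([prefs[p1][it] for it in si])
    sum2 = sum([prefs[p2][it] for it in si])
    # Sums of the squared ratings
    sum1Sq = sum([pow(prefs[p1][it], 2) for it in si])
    sum2Sq = sum([pow(prefs[p2][it], 2) for it in si])
    # Sum of the products
    pSum = sum([prefs[p1][it] * prefs[p2][it] for it in si])
    # Pearson correlation coefficient
    num = pSum - (sum1 * sum2 / n)
    den = sqrt((sum1Sq - pow(sum1, 2) / n) * (sum2Sq - pow(sum2, 2) / n))
    if den == 0:
        return 0
    r = num / den
    return r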

The function returns a number between -1 and 1. A value of 1 indicates that the two people's ratings are perfectly linearly related, which includes the case where they gave exactly the same rating to every item. Unlike the distance metric, this value needs no further transformation: larger values already mean more similar preferences.
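
The sketch below illustrates the point about data that is not normalized, and shows how such a score could be used to rank other people by similarity. It assumes Python 3 (true division). The names and ratings are hypothetical, and the function is restated in condensed form only so that the snippet is self-contained.

```python
from math import sqrt

def sim_pearson(prefs, p1, p2):
    # Items rated by both people
    shared = [item for item in prefs[p1] if item in prefs[p2]]
    n = len(shared)
    if n == 0:
        return 0
    sum1 = sum(prefs[p1][it] for it in shared)
    sum2 = sum(prefs[p2][it] for it in shared)
    sum1_sq = sum(prefs[p1][it] ** 2 for it in shared)
    sum2_sq = sum(prefs[p2][it] ** 2 for it in shared)
    p_sum = sum(prefs[p1][it] * prefs[p2][it] for it in shared)
    num = p_sum - sum1 * sum2 / n
    den = sqrt((sum1_sq - sum1 ** 2 / n) * (sum2_sq - sum2 ** 2 / n))
    if den == 0:
        return 0
    return num / den

# Hypothetical ratings, invented for illustration only
prefs = {
    'Alice': {'Film A': 1.0, 'Film B': 2.0, 'Film C': 3.0},
    'Bob':   {'Film A': 2.0, 'Film B': 3.0, 'Film C': 4.0},  # always one point higher
    'Carol': {'Film A': 3.0, 'Film B': 2.0, 'Film C': 1.0},  # opposite ordering
}

# Rank the other people by similarity to Alice, most similar first
ranked = sorted(((sim_pearson(prefs, 'Alice', other), other)
                 for other in prefs if other != 'Alice'), reverse=True)
print(ranked)
```

Bob rates every film one point higher than Alice, yet the two are perfectly correlated, so he ranks first. A distance-based score for the same pair would be well below 1 (about 0.37), because every rating differs by one point.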
