Recommender System Evaluation Metrics
2.1 Coverage
Coverage describes the ability of a recommender system to surface the long tail of items. The simplest definition of coverage is the ratio of the items the system is able to recommend to the total item set. Assume the set of system users is U and the system recommends an item list R(u) of length N to each user u. The coverage of the recommender system can then be calculated by the following formula:

Coverage = |∪_{u∈U} R(u)| / |I|

where I is the complete item set.
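In code, coverage is the share of the catalogue that appears in at least one user's recommendation list. A minimal sketch (the dict-of-lists layout and the `all_items` argument are assumptions for illustration):

```python
def coverage(recommendations, all_items):
    """Fraction of the catalogue appearing in at least one recommendation list.

    recommendations: dict mapping user id -> list of recommended item ids (R(u))
    all_items: set of every item id in the catalogue (I)
    """
    recommended = set()
    for items in recommendations.values():
        recommended.update(items)  # union of all R(u)
    return len(recommended) / len(all_items)

# Two users jointly cover 3 of 4 catalogue items:
print(coverage({'u1': ['a', 'b'], 'u2': ['b', 'c']}, {'a', 'b', 'c', 'd'}))  # 0.75
```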
2.2 Diversity
A user's interests are broad. If the recommendation list is diverse and covers most of the user's interest points, the probability that the user finds an item of interest increases. Diversity describes the dissimilarity between the items in a recommendation list. Assuming s(i, j) ∈ [0, 1] denotes the similarity between items i and j, the diversity of user u's recommendation list R(u) is defined as:

Diversity(R(u)) = 1 − Σ_{i,j∈R(u), i≠j} s(i, j) / ( (1/2) |R(u)| (|R(u)| − 1) )
The overall diversity of the recommender system can be defined as the average diversity over all users' recommendation lists:

Diversity = (1/|U|) Σ_{u∈U} Diversity(R(u))
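These two definitions can be sketched directly in code. This is a minimal illustration, not a library API: `sim` stands for the similarity function s(i, j) and is assumed to be supplied by the caller.

```python
from itertools import combinations

def diversity(rec_list, sim):
    """Diversity of one list: 1 minus the mean pairwise similarity.

    rec_list: the recommendation list R(u)
    sim: function sim(i, j) -> similarity in [0, 1] (caller-supplied assumption)
    """
    pairs = list(combinations(rec_list, 2))  # unordered item pairs i != j
    if not pairs:
        return 0.0  # a single-item list has no pairs to compare
    return 1 - sum(sim(i, j) for i, j in pairs) / len(pairs)

def system_diversity(recommendations, sim):
    """Average diversity over all users' recommendation lists."""
    return sum(diversity(r, sim) for r in recommendations.values()) / len(recommendations)

# With every pair maximally dissimilar, diversity is 1:
print(diversity(['a', 'b', 'c'], lambda i, j: 0.0))  # 1.0
```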
2.3 Accuracy
Prediction accuracy measures the ability of a recommender system or recommendation algorithm to predict user behavior. It is the most important offline evaluation metric for a recommender system.
2.3.1 Rating prediction
The prediction accuracy of rating prediction is generally measured by the root mean square error (RMSE) and the mean absolute error (MAE). For a user u and item i in the test set T, let r_ui be the actual rating of item i by user u and p_ui the rating predicted by the recommendation algorithm. The RMSE is then defined as:

RMSE = sqrt( Σ_{(u,i)∈T} (r_ui − p_ui)² / |T| )
The Python code is as follows:
records stores the user rating data, with each entry of the form [u, i, rui, pui], where rui is the actual rating of item i by user u and pui is the rating of item i by user u predicted by the algorithm.
import math

def RMSE(records):
    return math.sqrt(sum([(rui - pui) * (rui - pui) for u, i, rui, pui in records]) / float(len(records)))

MAE uses the absolute value of the prediction error and is defined as:

MAE = Σ_{(u,i)∈T} |r_ui − p_ui| / |T|
The Python code is as follows:
import math

def MAE(records):
    return sum([abs(rui - pui) for u, i, rui, pui in records]) / float(len(records))

2.3.2 TopN recommendation
When the system presents each user with a list of N recommendations, the task is called TopN recommendation. The prediction accuracy of TopN recommendation is generally measured by precision and recall. Let R(u) be the recommendation list generated for user u from the user's behavior in the training set, and T(u) the list of the user's behaviors in the test set. The recall of the recommendation result is:

Recall = Σ_{u∈U} |R(u) ∩ T(u)| / Σ_{u∈U} |T(u)|
The precision of the recommendation result is defined as:

Precision = Σ_{u∈U} |R(u) ∩ T(u)| / Σ_{u∈U} |R(u)|
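As a quick worked check of the two definitions for a single user (the item lists below are hypothetical):

```python
def precision_recall(recommended, actual):
    """Precision and recall of one user's recommendation list.

    recommended: the list R(u); actual: the test-set behaviors T(u).
    """
    hits = len(set(recommended) & set(actual))  # |R(u) ∩ T(u)|
    return hits / len(recommended), hits / len(actual)

# 2 of the 5 recommended items appear among the 4 test-set items:
p, r = precision_recall(['a', 'b', 'c', 'd', 'e'], ['a', 'c', 'x', 'y'])
print(p, r)  # 0.4 0.5
```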
The data set has the form data[i] = [u, i, source], where u is the user, i is the item, and source is optional; train is the training set and test is the test set. The code that splits the data into training and test sets is:
import random

def SplitData(Data, M, k, seed):
    '''
    Split the data into a training set and a test set.
    :param Data: dict keyed by (user, item), whose value is a list of labels
    :param M: controls the test-set share; a record lands in the test set with probability 1/(M+1)
    :param k: an arbitrary number in [0, M] selecting which slice becomes the test set
    :param seed: random seed; with the same seed the same split is reproduced
    :return: train, test -- lists of (user, item, label) tuples
    '''
    test = []
    train = []
    random.seed(seed)  # across the M experiments we need the same seed, so the random sequence is identical
    for user, item in Data.keys():
        if random.randint(0, M) == k:
            # equality holds with probability 1/(M+1), so M determines the test-set share;
            # choosing a different k selects a different training/test split
            for label in Data[(user, item)]:
                test.append((user, item, label))
        else:
            for label in Data[(user, item)]:
                train.append((user, item, label))
    print("SplitData succeeded!")
    return train, test

When calling the function:

(train, test) = SplitData(UI_label, 10, 5, 10)

Passing different parameters produces different splits.
The code to calculate recall and precision is as follows:
def recallAndPrecision(self, train=None, test=None, k=8, nitem=10):
    train = train or self.traindata
    test = test or self.testdata
    hit = 0
    recall = 0
    precision = 0
    for user in train.keys():
        tu = test.get(user, {})
        rank = self.recommend(user, train=train, k=k, nitem=nitem)
        for item, _ in rank.items():
            if item in tu:
                hit += 1
        recall += len(tu)      # total size of the test-set lists T(u)
        precision += nitem     # total size of the recommendation lists R(u)
    return (hit / (recall * 1.0), hit / (precision * 1.0))
Sometimes we do not evaluate the recommendation result with recall and precision separately but consider the two values jointly in a single combined metric. The most common choice is the F-measure (F-score), the weighted harmonic mean of precision and recall:

F = (α² + 1) · P · R / (α² · P + R)
When the parameter α = 1 this reduces to the most common F1 score, F1 = 2·P·R / (P + R); in code: F1 = 2 * precision * recall / (precision + recall)
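A small sketch of the weighted harmonic mean, with a guard for the degenerate all-zero case (the function name and the default α = 1 are choices made here for illustration):

```python
def f_measure(precision, recall, alpha=1.0):
    """Weighted harmonic mean of precision and recall; alpha=1 gives F1."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both metrics are zero
    return (alpha ** 2 + 1) * precision * recall / (alpha ** 2 * precision + recall)

print(f_measure(0.5, 0.5))  # 0.5
```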
The full implementation of user-based collaborative filtering, including the similarity computation, recommendation, and the evaluation metrics, is as follows:
import math
from collections import defaultdict

import pandas as pd

class UserBasedCF:
    def __init__(self, train=None, test=None):
        self.trainfile = train
        self.testfile = test
        self.readData()

    def readData(self, train=None, test=None):
        self.train_df = train or self.trainfile
        self.test_df = test or self.testfile
        # use two separate dicts (aliasing one dict to both names would merge train and test)
        self.traindata = {}
        self.testdata = {}
        box_id = list(self.train_df['box_id'])
        film_id = list(self.train_df['llable'])
        source = list(self.train_df['source'])
        box_id2 = list(self.test_df['box_id'])
        film_id2 = list(self.test_df['llable'])
        source2 = list(self.test_df['source'])
        for i in range(len(box_id)):
            userid, itemid, record = box_id[i], film_id[i], source[i]
            self.traindata.setdefault(userid, {})
            self.traindata[userid][itemid] = record
        for i in range(len(box_id2)):
            userid, itemid, record = box_id2[i], film_id2[i], source2[i]
            self.testdata.setdefault(userid, {})
            self.testdata[userid][itemid] = record

    def userSimilarityBest(self, train=None):
        train = train or self.traindata
        self.userSimBest = defaultdict(dict)
        # build an inverted item -> users table
        item_users = dict()
        for u, items in train.items():
            for i in items.keys():
                item_users.setdefault(i, set())
                item_users[i].add(u)
        # count each user's items and the items co-rated by every user pair
        user_item_count = dict()
        count = dict()
        for item, users in item_users.items():
            for u in users:
                user_item_count.setdefault(u, 0)
                user_item_count[u] += 1
                for v in users:
                    if u == v:
                        continue
                    count.setdefault(u, {})
                    count[u].setdefault(v, 0)
                    count[u][v] += 1
        # cosine similarity between users
        for u, related_users in count.items():
            for v, cuv in related_users.items():
                self.userSimBest[u][v] = cuv / math.sqrt(user_item_count[u] * user_item_count[v] * 1.0)

    def recommend(self, user, train=None, k=8, nitem=40):
        train = train or self.traindata
        rank = dict()
        interacted_items = train.get(user, {})
        for v, wuv in sorted(self.userSimBest[user].items(), key=lambda x: x[1], reverse=True)[0:k]:
            for i, rvi in train[v].items():
                if i in interacted_items:
                    continue  # skip items the user has already interacted with
                rank.setdefault(i, 0)
                rank[i] += wuv
        return dict(sorted(rank.items(), key=lambda x: x[1], reverse=True)[0:nitem])

    # precision and recall
    def recallAndPrecision(self, train=None, test=None, k=8, nitem=10):
        train = train or self.traindata
        test = test or self.testdata
        hit = 0
        recall = 0
        precision = 0
        for user in train.keys():
            tu = test.get(user, {})
            rank = self.recommend(user, train=train, k=k, nitem=nitem)
            for item, _ in rank.items():
                if item in tu:
                    hit += 1
            recall += len(tu)
            precision += nitem
        return (hit / (recall * 1.0), hit / (precision * 1.0))

    # coverage
    def coverage(self, train=None, test=None, k=8, nitem=10):
        train = train or self.traindata
        test = test or self.testdata
        recommend_items = set()
        all_items = set()
        for user in train.keys():
            for item in train[user].keys():
                all_items.add(item)
            rank = self.recommend(user, train, k=k, nitem=nitem)
            for item, _ in rank.items():
                recommend_items.add(item)
        return len(recommend_items) / (len(all_items) * 1.0)

    # popularity
    def popularity(self, train=None, test=None, k=8, nitem=10):
        train = train or self.traindata
        test = test or self.testdata
        item_popularity = dict()
        for user, items in train.items():
            for item in items.keys():
                item_popularity.setdefault(item, 0)
                item_popularity[item] += 1
        ret = 0
        n = 0
        for user in train.keys():
            rank = self.recommend(user, train, k=k, nitem=nitem)
            for item, _ in rank.items():
                ret += math.log(1 + item_popularity[item])
                n += 1
        return ret / (n * 1.0)

def testUserBasedCF():
    path = 'C:\\...'
    train_filmname = '...'
    test_filmname = '...'
    train_data_df = pd.read_csv(path + train_filmname, sep=',')
    test_data_df = pd.read_csv(path + test_filmname, sep=',')
    cf = UserBasedCF(train_data_df, test_data_df)
    cf.userSimilarityBest()
    print("%3s%20s%20s%20s%20s" % ('K', "precision", 'recall', 'coverage', 'popularity'))
    # K: the number of users with the most similar interests
    for k in [5, 10, 20, 40, 80, 160]:
        recall, precision = cf.recallAndPrecision(k=k)
        coverage = cf.coverage(k=k)
        popularity = cf.popularity(k=k)
        print("%3d%19.3f%%%19.3f%%%19.3f%%%20.3f" % (k, precision * 100, recall * 100, coverage * 100, popularity))

if __name__ == "__main__":
    testUserBasedCF()