A Python movie recommendation system that uses similarity to infer user preferences

The two most common types of recommendation system are content-based and collaborative filtering (CF). Collaborative filtering generates recommendations from users' attitudes toward items, while a content-based system recommends items based on the similarity of item attributes. CF can be further divided into memory-based collaborative filtering and model-based collaborative filtering.

We use the MovieLens dataset, one of the most common datasets for implementing and testing recommendation engines. It contains ratings given by 943 users to a selection of 1682 movies.

Import the numpy and pandas libraries


import numpy as np

import pandas as pd

Read the u.data file


header = ['user_id', 'item_id', 'rating', 'timestamp']

df = pd.read_csv('u.data', sep = '\t', names = header)

Check the number of users and movies


n_users = df.user_id.unique().shape[0]

n_items = df.item_id.unique().shape[0]

print('Number of users = ' + str(n_users) + ' | Number of movies = ' + str(n_items))


Number of users = 943 | Number of movies = 1682

Use scikit-learn to split the dataset into a training set and a test set. train_test_split shuffles the data and splits it into two datasets according to the proportion of test samples (test_size). In current scikit-learn releases the function lives in sklearn.model_selection; the older sklearn.cross_validation module has been removed.


from sklearn.model_selection import train_test_split

train_data, test_data = train_test_split(df, test_size = 0.25)
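
To make the effect of test_size concrete, here is a minimal numpy-only sketch of a shuffled split. It is an illustration of the idea, not scikit-learn's implementation; the function name shuffle_split, the seed and the row count of 100000 are assumptions chosen for this example.

```python
import numpy as np

def shuffle_split(n_rows, test_size=0.25, seed=0):
    """Return (train_idx, test_idx) row positions for a shuffled split."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)             # shuffled row positions
    n_test = int(np.ceil(n_rows * test_size))   # rows reserved for testing
    return order[n_test:], order[:n_test]

# Assumed row count for illustration
train_idx, test_idx = shuffle_split(100000, test_size=0.25)
print(len(train_idx), len(test_idx))
```

With a DataFrame, the two index arrays could be passed to df.iloc to obtain the two subsets.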

Memory-based collaborative filtering

Memory-based collaborative filtering can be divided into two parts: user-item collaborative filtering and item-item collaborative filtering. User-item collaborative filtering takes a particular user, finds users who are similar to that user based on the similarity of their ratings, and recommends items that those similar users liked. Item-item collaborative filtering takes an item, finds the users who liked that item, and finds other items that those users, or similar users, also liked.

Item-item collaborative filtering: "People who like this item also like ......"

User-item collaborative filtering: "People similar to you also like ......"

In both cases, a user-item matrix is constructed from the entire dataset.

Example user-item matrix:
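
A user-item matrix has one row per user and one column per item, with each cell holding that user's rating of that item. The sketch below builds a tiny one from made-up ratings; the triples and the name toy_matrix are hypothetical and only serve as an illustration.

```python
import numpy as np

# Hypothetical (user_id, item_id, rating) triples, 1-indexed like u.data
ratings = [(1, 1, 5), (1, 3, 3), (2, 2, 4), (3, 1, 1), (3, 3, 2)]

toy_matrix = np.zeros((3, 3))
for user_id, item_id, rating in ratings:
    # Shift ids by one to get 0-based positions; 0 means "not rated"
    toy_matrix[user_id - 1, item_id - 1] = rating

print(toy_matrix)
# [[5. 0. 3.]
#  [0. 4. 0.]
#  [1. 0. 2.]]
```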

Next, calculate the similarities and create a similarity matrix.

In item-item collaborative filtering, the similarity between two items is measured by observing all the users who have rated both items.

In user-item collaborative filtering, the similarity between two users is measured by observing all the items that have been rated by both users.

A distance metric commonly used in recommendation systems is cosine similarity, where the ratings are seen as vectors in n-dimensional space and the similarity is calculated from the angle between the vectors.
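
As a small illustration of this angle-based measure, the sketch below computes the cosine similarity of two rating vectors as their dot product divided by the product of their lengths. The helper name cosine_similarity and the three vectors are assumptions made for this example.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

u1 = [5, 3, 0, 1]
u2 = [10, 6, 0, 2]   # same direction as u1, so the similarity is close to 1
u3 = [0, 0, 4, 0]    # no rated item in common with u1, so the similarity is 0

print(round(cosine_similarity(u1, u2), 6))
print(round(cosine_similarity(u1, u3), 6))
```

Vectors that point in the same direction score close to 1 regardless of their length, which is why a user who rates everything twice as high still counts as similar.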

Create two user-item matrices, one for the training data and one for the test data:


train_data_matrix = np.zeros((n_users, n_items))
for line in train_data.itertuples():
    train_data_matrix[line[1]-1, line[2]-1] = line[3]

test_data_matrix = np.zeros((n_users, n_items))
for line in test_data.itertuples():
    test_data_matrix[line[1]-1, line[2]-1] = line[3]

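The cosine measure is calculated with sklearn's pairwise_distances function. Note that with metric = "cosine" this function returns the cosine distance, that is, 1 minus the cosine similarity.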


from sklearn.metrics.pairwise import pairwise_distances

user_similarity = pairwise_distances(train_data_matrix, metric = "cosine")

item_similarity = pairwise_distances(train_data_matrix.T, metric = "cosine")
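
For comparison, here is a numpy-only sketch that computes the cosine similarity between all rows of a matrix and derives the corresponding cosine distance as 1 minus the similarity. The function name cosine_similarity_matrix and the toy data are assumptions, and the handling of all-zero rows is a choice made for this sketch.

```python
import numpy as np

def cosine_similarity_matrix(m):
    """Pairwise cosine similarity between the rows of m."""
    m = np.asarray(m, dtype=float)
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    norms[norms == 0] = 1.0      # assumption: all-zero rows get similarity 0
    unit = m / norms             # scale every row to unit length
    return unit.dot(unit.T)      # dot products of unit vectors are cosines

toy = np.array([[5, 0, 3],
                [0, 4, 0],
                [1, 0, 2]])
sim = cosine_similarity_matrix(toy)   # rows = users
dist = 1.0 - sim                      # cosine distance

print(np.round(sim, 3))
print(np.round(dist, 3))
```

Passing toy.T instead of toy gives the item-to-item version, mirroring the use of train_data_matrix.T above.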

With the matrices user_similarity and item_similarity created, predictions can now be made, starting with user-based CF.

The similarity between user k and another user a can be treated as a weight that multiplies the ratings of the similar user a, corrected for that user's average rating. The result has to be normalized so that the ratings stay between 1 and 5, and as a final step the average rating of the user being predicted is added back.

For item-based CF, the prediction is a similarity-weighted average of the user's own ratings, so this time there is no need to correct for users' average ratings.
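
Written as formulas, the two prediction rules could look as follows. This is a reconstruction that follows the predict function below term by term; the symbols are notation chosen here, with x_{k,m} the rating of user k for item m, \bar{x}_k the mean rating of user k, and sim_u and sim_i the user and item measures. The first line is the user-based rule, the second the item-based rule.

```latex
\hat{x}_{k,m} = \bar{x}_k +
  \frac{\sum_{a} \mathrm{sim}_u(u_k, u_a)\,\bigl(x_{a,m} - \bar{x}_a\bigr)}
       {\sum_{a} \bigl\lvert \mathrm{sim}_u(u_k, u_a) \bigr\rvert}

\hat{x}_{k,m} =
  \frac{\sum_{b} \mathrm{sim}_i(i_m, i_b)\, x_{k,b}}
       {\sum_{b} \bigl\lvert \mathrm{sim}_i(i_m, i_b) \bigr\rvert}
```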


def predict(rating, similarity, type = 'user'):
    if type == 'user':
        # Center each user's ratings on that user's mean rating
        mean_user_rating = rating.mean(axis = 1)
        rating_diff = (rating - mean_user_rating[:,np.newaxis])
        # Weighted sum of centered ratings, normalized by the total weight
        pred = mean_user_rating[:,np.newaxis] + similarity.dot(rating_diff) / np.array([np.abs(similarity).sum(axis=1)]).T
    elif type == 'item':
        # Weighted average of the user's own ratings over items
        pred = rating.dot(similarity) / np.array([np.abs(similarity).sum(axis=1)])
    return pred


item_prediction = predict(train_data_matrix, item_similarity, type = 'item')

user_prediction = predict(train_data_matrix, user_similarity, type = 'user')
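
To see the user-based rule at a small scale, the following self-contained sketch applies the same computation to a 3 x 3 toy matrix. The function name predict_user_based, the ratings and the weight matrix are all hypothetical values picked for illustration.

```python
import numpy as np

def predict_user_based(ratings, similarity):
    """User-based prediction: user mean plus weighted, mean-centered ratings."""
    mean_user = ratings.mean(axis=1, keepdims=True)           # one mean per user
    diff = ratings - mean_user                                # centered ratings
    weights = np.abs(similarity).sum(axis=1, keepdims=True)   # normalizer per user
    return mean_user + similarity.dot(diff) / weights

ratings = np.array([[5., 3., 1.],
                    [4., 2., 0.],
                    [1., 5., 4.]])
# Hypothetical user-to-user weights
similarity = np.array([[1.0, 0.9, 0.1],
                       [0.9, 1.0, 0.2],
                       [0.1, 0.2, 1.0]])

pred = predict_user_based(ratings, similarity)
print(np.round(pred, 2))
```

Because the centered ratings of every user sum to zero, each row of the prediction keeps the mean of the corresponding row of ratings. As in the article's function, the mean here also counts unrated cells (zeros), which is a simplification.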

Evaluation

Here the root mean squared error (RMSE) is used to measure the accuracy of the predicted ratings.

You can use sklearn's mean_squared_error (MSE) function; RMSE is simply the square root of MSE.


from sklearn.metrics import mean_squared_error

from math import sqrt

def rmse(prediction, ground_truth):
    prediction = prediction[ground_truth.nonzero()].flatten()
    ground_truth = ground_truth[ground_truth.nonzero()].flatten()
    return sqrt(mean_squared_error(prediction, ground_truth))
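
The rmse function above only compares cells that hold a rating in the test matrix. The same idea can be sketched with numpy alone on a toy example; the name masked_rmse and the two small matrices are assumptions made for illustration.

```python
import numpy as np

def masked_rmse(prediction, ground_truth):
    """RMSE over the cells where ground_truth holds a rating (non-zero)."""
    mask = ground_truth != 0
    errors = prediction[mask] - ground_truth[mask]
    return float(np.sqrt(np.mean(errors ** 2)))

truth = np.array([[4., 0.],
                  [0., 2.]])
guess = np.array([[3., 5.],
                  [1., 4.]])

# Only the cells (0, 0) and (1, 1) are rated: errors are -1 and 2
print(masked_rmse(guess, truth))   # sqrt((1 + 4) / 2) = sqrt(2.5)
```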


print('User based CF RMSE: ' + str(rmse(user_prediction, test_data_matrix)))

print('Item based CF RMSE: ' + str(rmse(item_prediction, test_data_matrix)))


User based CF RMSE: 3.12466203536

Item based CF RMSE: 3.45056350625

As can be seen, memory-based algorithms are easy to implement and produce reasonable prediction quality.

Model-based collaborative filtering

Model-based collaborative filtering is based on matrix factorization (MF). Matrix factorization is widely used in recommendation systems, and it handles scalability and sparsity better than memory-based CF. The goal of MF is to learn the latent preferences of users and the latent attributes of items from the known ratings, and then to predict the unknown ratings through the dot product of the latent features of users and items.

Calculate the sparsity of the MovieLens dataset:
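
The dot-product idea can be illustrated with a tiny sketch in which every user and every item is described by two latent features. The factor values below are made up; in practice they would be learned from the known ratings.

```python
import numpy as np

# Hypothetical latent factors with 2 hidden features
user_factors = np.array([[1.0, 0.5],     # one row per user
                         [0.2, 1.5]])
item_factors = np.array([[4.0, 0.0],     # one row per item
                         [1.0, 2.0],
                         [0.0, 3.0]])

# Predicted rating of user k for item m = dot product of their feature rows
predicted = user_factors.dot(item_factors.T)
print(predicted)
# [[4.  2.  1.5]
#  [0.8 3.2 4.5]]
```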


sparsity = round(1.0 - len(df) / float(n_users * n_items), 3)

print('The sparsity level of MovieLens 100K is {:.1f}%'.format(sparsity * 100))


The sparsity level of MovieLens 100K is 93.7%

SVD

Given an m × n matrix X, singular value decomposition factors it as X = U S V^T, where:

U is an (m × r) orthogonal matrix

S is an (r × r) diagonal matrix with non-negative real numbers on the diagonal

V^T is an (r × n) orthogonal matrix

The diagonal elements of S are called the singular values of X.

The matrix U represents the feature vectors of the users in the hidden feature space, and the matrix V represents the feature vectors of the items in the hidden feature space.

Predictions can now be made by taking the dot product of U, S and V^T:


import scipy.sparse as sp

from scipy.sparse.linalg import svds

u, s, vt = svds(train_data_matrix, k = 20)

s_diag_matrix = np.diag(s)

x_pred = np.dot(np.dot(u,s_diag_matrix),vt)

print('SVD-based CF RMSE: ' + str(rmse(x_pred, test_data_matrix)))


SVD-based CF RMSE: 2.72035726617
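
The effect of keeping only k singular values can be checked on a small dense matrix. The sketch below uses numpy's np.linalg.svd rather than scipy's svds, which is a substitution made so the example stays numpy-only; the helper name low_rank_approx and the toy matrix are assumptions.

```python
import numpy as np

def low_rank_approx(x, k):
    """Rebuild x from its k largest singular values and their vectors."""
    u, s, vt = np.linalg.svd(x, full_matrices=False)   # s is sorted descending
    return u[:, :k].dot(np.diag(s[:k])).dot(vt[:k, :])

x = np.array([[5., 3., 0.],
              [4., 0., 1.],
              [1., 1., 5.]])

full = low_rank_approx(x, 3)    # all singular values: reproduces x
rank1 = low_rank_approx(x, 1)   # only the largest singular value

print(np.allclose(full, x))
print(np.linalg.matrix_rank(rank1))
print(np.round(rank1, 2))
```

Keeping fewer singular values gives a coarser reconstruction, and the smoothed values in the cells that were zero are what serve as predicted ratings.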

Summary:

This article implemented simple collaborative filtering methods, including memory-based CF and model-based CF.

Memory-based models rely on the similarity between users or between items; cosine similarity is used here.

Model-based CF relies on matrix factorization; SVD is used here to decompose the matrix.

Standard collaborative filtering methods perform poorly in the cold-start setting.

Reference material

Implementing your own recommender systems in Python


Origin blog.csdn.net/weixin_44995023/article/details/92074987