Building a Movie Recommender System with Python

This article uses cosine similarity with kNN, together with Seaborn, Scikit-learn and Pandas, to build a movie recommendation system from user rating data.

In day-to-day data mining work, Python is used not only for classification and prediction tasks but sometimes also for recommender-system tasks.

Recommender systems are used in many fields. Common examples include playlist generators for video and music services, product recommenders for online stores, and content recommenders for social media platforms. In this project, we create a movie recommender.

Collaborative filtering automatically predicts (filters) a user's interests by collecting information about the preferences or tastes of many users. Recommender systems have been around for a long time, and their models are based on a variety of techniques such as weighted averages, correlation, machine learning and deep learning.

The MovieLens 20M dataset contains over 20 million movie ratings and tag applications collected since 1995. In this article, we read the data from two files, movies.csv and ratings.csv. The outputs shown below (9,742 movies and 100,836 ratings) come from a much smaller sample of MovieLens data than the full 20M release. Using the Python libraries Pandas, Seaborn, Scikit-learn and SciPy, we train a k-nearest neighbors model that uses cosine similarity as its distance measure.
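To make the similarity measure concrete, here is a minimal sketch of cosine similarity computed by hand with NumPy. The three rating vectors and the helper name cosine_similarity are made up for illustration and do not come from the dataset; a 0 stands for "not rated".

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical ratings given by four users to three movies (0 = not rated)
movie_a = [5, 4, 0, 1]
movie_b = [4, 5, 0, 2]
movie_c = [1, 0, 5, 4]

sim_ab = cosine_similarity(movie_a, movie_b)
sim_ac = cosine_similarity(movie_a, movie_c)

# Cosine distance, the quantity a cosine-metric kNN ranks by, is 1 - similarity
print(f"A vs B: similarity={sim_ab:.3f}, distance={1 - sim_ab:.3f}")
print(f"A vs C: similarity={sim_ac:.3f}, distance={1 - sim_ac:.3f}")
```

Movies A and B are liked by the same users, so their similarity is higher (and their distance lower) than for A and C.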

Here are the core steps of the project:

  1. Import and merge datasets and create Pandas DataFrames

  2. Add the necessary features to analyze the data

  3. Visualize and analyze data with Seaborn

  4. Filter invalid data by setting a threshold

  5. Create a pivot table with movies as the index and users as the columns

  6. Build a kNN model and output the 5 most similar movies for each movie

Import Data

Import and merge datasets and create Pandas DataFrames


import pandas as pd

# usecols selects which columns to load; dtype sets the type of each one
movies_df = pd.read_csv('movies.csv',
                        usecols=['movieId', 'title'],
                        dtype={'movieId': 'int32', 'title': 'str'})
movies_df.head()

ratings_df = pd.read_csv('ratings.csv',
                         usecols=['userId', 'movieId', 'rating', 'timestamp'],
                         dtype={'userId': 'int32', 'movieId': 'int32', 'rating': 'float32'})
ratings_df.head()

Check for null values and the number of entries in both DataFrames.

# Check for missing values
movies_df.isnull().sum()
movieId    0
title      0
dtype: int64
ratings_df.isnull().sum()
userId       0
movieId      0
rating       0
timestamp    0
dtype: int64
print("Movies:",movies_df.shape)
print("Ratings:",ratings_df.shape)
Movies: (9742, 2)
Ratings: (100836, 4)

Merge the two DataFrames on the 'movieId' column.

# movies_df.info()
# ratings_df.info()
movies_merged_df=movies_df.merge(ratings_df, on='movieId')
movies_merged_df.head()

The imported datasets have now been merged successfully.
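One detail worth knowing about this step: merge performs an inner join by default, so movies that nobody has rated disappear from the result. The sketch below illustrates this on made-up toy data (the names movies, ratings, merged and unrated are hypothetical). The validate argument is an optional safety check, not something the original code uses.

```python
import pandas as pd

# Toy data: three movies, but only two of them have ratings
movies = pd.DataFrame({'movieId': [1, 2, 3],
                       'title': ['Movie A', 'Movie B', 'Movie C']})
ratings = pd.DataFrame({'userId': [10, 10, 11],
                        'movieId': [1, 2, 1],
                        'rating': [4.0, 3.5, 5.0]})

# validate='one_to_many' raises an error if a movieId is duplicated in `movies`
merged = movies.merge(ratings, on='movieId', how='inner', validate='one_to_many')

# Movies that never appear in `ratings` are dropped by the inner join
unrated = movies[~movies['movieId'].isin(ratings['movieId'])]

print(merged)
print("Movies with no ratings:", unrated['title'].tolist())
```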

Add derived features

Add the necessary features to analyze the data.

Create the 'Average Rating' and 'Rating Count' columns by grouping user ratings by movie title.

movies_average_rating = (movies_merged_df.groupby('title')['rating']
                         .mean().sort_values(ascending=False)
                         .reset_index()
                         .rename(columns={'rating': 'Average Rating'}))
movies_average_rating.head()

movies_rating_count = (movies_merged_df.groupby('title')['rating']
                       .count().sort_values(ascending=True)  # use ascending=False to list the most-rated first
                       .reset_index()
                       .rename(columns={'rating': 'Rating Count'}))
movies_rating_count_avg = movies_rating_count.merge(movies_average_rating, on='title')
movies_rating_count_avg.head()

Two new derived features have been created so far.
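As an alternative, both features can be computed in a single groupby pass instead of two passes followed by a merge. This is only a sketch on made-up toy data (the names toy and stats are hypothetical), not the code used in this article.

```python
import pandas as pd

# Toy ratings: Movie A has two ratings, Movie B has one
toy = pd.DataFrame({'title': ['Movie A', 'Movie A', 'Movie B'],
                    'rating': [4.0, 5.0, 3.0]})

# Compute count and mean together, then rename to the column names used above
stats = (toy.groupby('title')['rating']
            .agg(['count', 'mean'])
            .rename(columns={'count': 'Rating Count', 'mean': 'Average Rating'})
            .reset_index())
print(stats)
```

Note that Movie B ends up with an average based on a single rating, which is exactly the kind of entry that the later threshold removes.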

Data visualization

Visualize the data with Seaborn:

  • On this dataset of roughly 100,000 user ratings, many movies have a perfect 5-star average rating. Such averages are most likely outliers produced by movies with very few ratings, which we confirm below with visualization.

  • Many movies have only a single rating or very few, so it is advisable to set a rating-count threshold in order to generate useful recommendations.

Use Seaborn and Matplotlib to visualize the data for easier inspection and analysis.

Plot the newly created features as histograms to see their distributions. The number of bins is set to 80; choose this value to suit the data.

# Import the visualization libraries
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(font_scale=1)
plt.rcParams["axes.grid"] = False
plt.style.use('dark_background')
%matplotlib inline
# (the line above is a Jupyter notebook magic; remove it when running as a plain script)

# Plot the histograms; the plotted feature is on the x-axis
plt.figure(figsize=(12, 4))
plt.hist(movies_rating_count_avg['Rating Count'], bins=80, color='tab:purple')
plt.xlabel('Rating Count', fontsize=16)
plt.ylabel('Number of movies', fontsize=16)
plt.savefig('ratingcounthist.jpg')

plt.figure(figsize=(12, 4))
plt.hist(movies_rating_count_avg['Average Rating'], bins=80, color='tab:purple')
plt.xlabel('Average Rating', fontsize=16)
plt.ylabel('Number of movies', fontsize=16)
plt.savefig('avgratinghist.jpg')

Figure 1 Average Rating histogram

Figure 2 Histogram of Rating Count

Now create a jointplot, a 2D chart, to visualize these two features together.

plot=sns.jointplot(x='Average Rating',
                   y='Rating Count',
                   data=movies_rating_count_avg,
                   alpha=0.5, 
                   color='tab:pink')
plot.savefig('joinplot.jpg')

2D plot of Average Rating and Rating Count

Analysis

  • The Rating Count histogram confirms that most movies have received only a small number of ratings. Besides a fixed threshold, a high quantile of the rating counts could also serve as the cutoff.

  • The Average Rating histogram shows how the average ratings are distributed.
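The quantile idea mentioned above could look like the following sketch. The rating counts are invented and the choice of the 75th percentile is an assumption for illustration; the article itself uses a fixed threshold.

```python
import pandas as pd

# Hypothetical rating counts for ten movies, skewed like real rating data
counts = pd.DataFrame({'title': [f'Movie {i}' for i in range(1, 11)],
                       'Rating Count': [1, 1, 2, 3, 5, 8, 20, 50, 120, 300]})

# Use the 75th percentile of the rating counts as the popularity cutoff
threshold = counts['Rating Count'].quantile(0.75)
popular = counts[counts['Rating Count'] >= threshold]

print("threshold:", threshold)
print(popular)
```

A quantile-based cutoff adapts to the size of the dataset, whereas a fixed number has to be re-tuned whenever the data changes.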

Data cleaning

Use the describe() function to get descriptive statistics of the dataset, such as quantiles and the standard deviation.

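pd.set_option('display.float_format', lambda x: '%.3f' % x)
# rating_with_RatingCount is not defined in the code shown earlier; a merge like this
# (an assumption) gives one row per rating, matching the 100,836 rows reported below
rating_with_RatingCount = movies_merged_df.merge(movies_rating_count, on='title')
print(rating_with_RatingCount['Rating Count'].describe())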
count   100836.000
mean        58.759
std         61.965
min          1.000
25%         13.000
50%         39.000
75%         84.000
max        329.000
Name: Rating Count, dtype: float64

Set a threshold and filter out movies whose rating count falls below it.

popularity_threshold = 50
popular_movies= rating_with_RatingCount[
          rating_with_RatingCount['Rating Count']>=popularity_threshold]
popular_movies.head()
# popular_movies.shape

At this point the data has been cleaned by filtering out movies with fewer ratings than the threshold.

Create a pivot table

Create a pivot table with movies as the index and users as the columns

To load the data into the model later, a pivot table is needed. Set 'title' as the index, 'userId' as the columns and 'rating' as the values. Missing ratings are filled with 0.

movie_features_df = popular_movies.pivot_table(
    index='title', columns='userId', values='rating').fillna(0)
movie_features_df.head()
# Optional: save the table for inspection (writing .xlsx requires openpyxl)
movie_features_df.to_excel('output.xlsx')

Next load the created pivot table into the model.
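To show what the pivot step produces, here is a small sketch on made-up data (the names toy and toy_pivot are hypothetical). Each row is a movie, each column is a user, and a 0 marks a movie that the user has not rated.

```python
import pandas as pd

# Toy long-format ratings: one row per (movie, user) rating
toy = pd.DataFrame({'title': ['Movie A', 'Movie A', 'Movie B', 'Movie C'],
                    'userId': [1, 2, 1, 3],
                    'rating': [4.0, 5.0, 3.0, 2.0]})

# Reshape to a movie x user matrix; unrated combinations become 0
toy_pivot = toy.pivot_table(index='title', columns='userId', values='rating').fillna(0)
print(toy_pivot)
```

Filling with 0 treats "not rated" as a rating of zero. That is a simplification, but it keeps the matrix purely numeric, which the model needs.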

Build a kNN model

Build a kNN model and output the 5 movies most similar to each movie

Use csr_matrix from the scipy.sparse module to convert the pivot table into a sparse matrix for fitting the model.

from scipy.sparse import csr_matrix
movie_features_df_matrix = csr_matrix(movie_features_df.values)
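The reason for using a sparse matrix is that most entries of a movie-by-user table are 0, and a CSR matrix stores only the non-zero values. The sketch below, on an invented 3 x 4 array (the names dense, sparse and density are hypothetical), shows how to check how sparse a matrix is.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy movie x user matrix; 0 means "not rated"
dense = np.array([[4.0, 5.0, 0.0, 0.0],
                  [3.0, 0.0, 0.0, 0.0],
                  [0.0, 0.0, 2.0, 0.0]])

sparse = csr_matrix(dense)

# nnz is the number of explicitly stored (non-zero) values
density = sparse.nnz / (sparse.shape[0] * sparse.shape[1])
print(f"stored values: {sparse.nnz}, density: {density:.2%}")
```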

Finally, use the matrix generated above to fit the NearestNeighbors algorithm from sklearn, with the parameters metric = 'cosine' and algorithm = 'brute'.

from sklearn.neighbors import NearestNeighbors
model_knn = NearestNeighbors(metric = 'cosine',
                             algorithm = 'brute')
model_knn.fit(movie_features_df_matrix)

Now pass a movie's index to the model. The kneighbors method expects a 2D array, so the movie's ratings must be reshaped into a single-row array. Set n_neighbors to 6: the nearest neighbor is normally the query movie itself, which leaves 5 recommendations.

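import numpy as np
query_index = np.random.choice(movie_features_df.shape[0])  # pick a random movie (row)
# reshape(1, -1) turns the movie's ratings into a single-row 2D array
distances, indices = model_knn.kneighbors(
    movie_features_df.iloc[query_index, :].values.reshape(1, -1),
    n_neighbors=6)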
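The behaviour of kneighbors is easier to see on a tiny example. The sketch below uses an invented 4 x 4 rating matrix (toy_matrix and knn are hypothetical names) and queries with row 0. The first neighbor returned is row 0 itself, at a cosine distance of about 0, which is why one extra neighbor is requested.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy movie x user matrix: rows 0 and 1 are rated alike, rows 2 and 3 are rated alike
toy_matrix = np.array([[5.0, 4.0, 0.0, 1.0],
                       [4.0, 5.0, 0.0, 2.0],
                       [1.0, 0.0, 5.0, 4.0],
                       [0.0, 1.0, 5.0, 4.0]])

knn = NearestNeighbors(metric='cosine', algorithm='brute').fit(toy_matrix)

# Query with row 0; kneighbors needs a 2D input of shape (1, n_users)
distances, indices = knn.kneighbors(toy_matrix[0].reshape(1, -1), n_neighbors=3)

for rank, (idx, dist) in enumerate(zip(indices.flatten(), distances.flatten())):
    print(f"rank {rank}: row {idx}, cosine distance {dist:.3f}")
```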

Finally, output the movie recommendations for query_index.

for i in range(0, len(distances.flatten())):
    if i == 0:
        print('Recommendations for {0}:\n'
              .format(movie_features_df.index[query_index]))
    else:
        print('{0}: {1}, with distance of {2}:'
              .format(i, movie_features_df.index[indices.flatten()[i]],
                      distances.flatten()[i]))
Recommendations for Harry Potter and the Order of the Phoenix (2007):

1: Harry Potter and the Half-Blood Prince (2009), with distance of 0.2346513867378235:
2: Harry Potter and the Order of the Phoenix (2007), with distance of 0.3396233320236206:
3: Harry Potter and the Goblet of Fire (2005), with distance of 0.4170845150947571:
4: Harry Potter and the Prisoner of Azkaban (2004), with distance of 0.4499547481536865:
5: Harry Potter and the Chamber of Secrets (2002), with distance of 0.4506162405014038:

So far we have been able to successfully build a recommendation engine based only on user ratings.
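For convenience, the query and print steps can be wrapped in a helper that takes a movie title instead of a random index. This is a sketch only: the function name recommend and the toy data are hypothetical, and it filters out the query by title rather than assuming it is always the first neighbor.

```python
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

def recommend(title, features_df, model, n=5):
    """Return (title, distance) pairs for the n nearest movies, excluding the query."""
    row = features_df.loc[[title]].values        # 2D array of shape (1, n_users)
    k = min(n + 1, len(features_df))             # one extra slot for the query itself
    distances, indices = model.kneighbors(row, n_neighbors=k)
    pairs = zip(features_df.index[indices.flatten()], distances.flatten())
    return [(t, float(d)) for t, d in pairs if t != title][:n]

# Toy movie x user matrix standing in for movie_features_df
toy_features = pd.DataFrame(
    [[5.0, 4.0, 0.0, 1.0],
     [4.0, 5.0, 0.0, 2.0],
     [1.0, 0.0, 5.0, 4.0]],
    index=['Movie A', 'Movie B', 'Movie C'])

toy_model = NearestNeighbors(metric='cosine', algorithm='brute')
toy_model.fit(csr_matrix(toy_features.values))

result = recommend('Movie A', toy_features, toy_model, n=2)
print(result)
```

Filtering by title also covers the edge case where another movie has an identical rating vector and is returned ahead of the query.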

Summary

Here is a summary of the steps we took to build a movie recommendation system:

  1. Import and merge datasets and create Pandas DataFrames

  2. Create derived variables for better analysis of data

  3. Visualize data with Seaborn

  4. Clean data by setting thresholds

  5. Create a pivot table with movies as the index and users as the columns

  6. Build a kNN model and output the 5 most similar movies for each movie

Closing notes

Here are some ways you can expand your project:

  • The dataset used here is not very large; the project can be extended by including the other files from the dataset.

  • The timestamps in 'ratings.csv' can be used to analyze how ratings change over time, and ratings can be weighted by timestamp when fitting the model.

  • This model performs considerably better than weighted-average or correlation-based models, but there is still room for improvement, for example by using more advanced machine learning algorithms or deep learning models.
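As a starting point for the timestamp idea, the sketch below down-weights older ratings with an exponential decay. It assumes that the timestamp column holds seconds since the Unix epoch, as in MovieLens. The two ratings and the three-year half-life are invented for illustration and would need tuning on real data.

```python
import numpy as np
import pandas as pd

# Toy ratings with Unix timestamps in seconds: an old 4.0 and a recent 2.0
toy = pd.DataFrame({'rating': [4.0, 2.0],
                    'timestamp': [1_000_000_000, 1_500_000_000]})

toy['rated_at'] = pd.to_datetime(toy['timestamp'], unit='s')

# Age of each rating in days, measured from the most recent rating
age_days = (toy['rated_at'].max() - toy['rated_at']).dt.days

half_life_days = 365 * 3  # hypothetical: a rating loses half its weight every 3 years
toy['weight'] = 0.5 ** (age_days / half_life_days)

weighted_mean = np.average(toy['rating'], weights=toy['weight'])
print(toy[['rating', 'rated_at', 'weight']])
print("plain mean:", toy['rating'].mean(), "weighted mean:", round(weighted_mean, 3))
```

The weighted mean moves toward the more recent rating, so a movie's score reflects current opinion more than old opinion.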


Origin blog.csdn.net/weixin_38037405/article/details/123890052