[Pearson correlation coefficient, corrwith] Use case: a movie recommendation system

Collaborative filtering algorithms make recommendations by discovering similarities among users and among items. There are two main types: user-based and item-based.

User-based:

User 1 bought items A, B, C, and D and rated them well, and user 2 also bought items A, B, and C. The two are therefore considered the same type of user, and item D can be recommended to user 2.

Item-based:

If items A and B have both been bought by users 1, 2, and 3, the two items are considered highly similar. If user 4 has bought item A, then item B can be recommended to user 4.
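As a rough illustration of the item-based idea, the sketch below counts how many users bought each pair of items and uses those counts to rank the items a user does not yet own. The purchase table, the co-occurrence scoring, and all names are assumptions made for illustration; the worked example in this article measures similarity with the Pearson correlation coefficient instead.

```python
import pandas as pd

# Hypothetical purchase matrix: 1 means the user bought the item.
purchases = pd.DataFrame(
    {"A": [1, 1, 1, 1], "B": [1, 1, 1, 0], "C": [0, 0, 1, 0]},
    index=["user1", "user2", "user3", "user4"],
)

# Item-by-item co-occurrence: how many users bought both items of a pair.
co = purchases.T.dot(purchases)

# Score every item by its co-occurrence with the items user4 already owns.
owned = purchases.loc["user4"]
scores = co.loc[owned[owned == 1].index].sum()

# Keep only items user4 has not bought, best candidate first.
candidates = scores[owned == 0].sort_values(ascending=False)
print(candidates)
```

In this toy table, items A and B share three buyers while A and C share only one, so B ranks ahead of C for user 4.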

The example worked through in the rest of this article uses an item-based collaborative filtering algorithm:


Project background: A video platform decided to build an intelligent recommendation system based on users' viewing records, in order to improve the viewing experience and increase user stickiness.

Processing datasets

# Import the data
import pandas as pd
movies = pd.read_excel('/kaggle/input/movie-name-and-category/Movie Name and Category.xlsx')
score = pd.read_excel('/kaggle/input/movie-name-and-category/Film rating.xlsx')
# Join the two tables into one on the movie ID column ('电影编号')
df = pd.merge(movies, score, on='电影编号')
df.head()

Code explained:

The merge() function joins two data sets; the default is an inner join. on: the name of the column to join on, which must exist in both DataFrame objects.

The syntax for a left join is: result2 = pd.merge(df1, df2, on='key', how='left'). how: the join method, such as 'left', 'right', 'outer', or 'inner' (default 'inner').
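To make the effect of the how parameter concrete, here is a minimal sketch on two made-up tables. The column names (movie_id, title, user_id, rating) and the values are hypothetical stand-ins, not the article's data.

```python
import pandas as pd

# Hypothetical movie table and rating table.
movies = pd.DataFrame({"movie_id": [1, 2, 3], "title": ["A", "B", "C"]})
score = pd.DataFrame(
    {"user_id": [10, 11, 12], "movie_id": [1, 1, 4], "rating": [4.0, 5.0, 3.0]}
)

# Inner join (the default): only movie IDs present in both tables survive.
inner = pd.merge(movies, score, on="movie_id")

# Left join: every movie is kept; movies without ratings get NaN.
left = pd.merge(movies, score, on="movie_id", how="left")

# Outer join: rows from both sides are kept, matched where possible.
outer = pd.merge(movies, score, on="movie_id", how="outer")

print(len(inner), len(left), len(outer))
```

With an inner join, movies that nobody has rated drop out of the merged table, which is acceptable here because a movie with no ratings cannot be correlated with anything.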

(The data set is available in the comment section; it contains 9712 movies and 100836 ratings.)



Convert to a pivot table

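# Convert the raw data into a pivot table:
# rows are user IDs ('用户编号'), columns are movie titles ('名称'), cells are ratings ('评分')
user_movie = df.pivot_table(index='用户编号', columns='名称', values='评分')
user_movie.tail()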

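Code explained:

pivot_table() is the pandas pivot table function. index sets the row labels, columns sets the column labels, and values names the column whose data fills the cells. By default, duplicate entries for the same cell are averaged, and a cell is NaN when the user has not rated that movie.

tail() displays the last few rows; the default is 5 rows.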
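The reshaping can be seen on a tiny made-up rating log. This is only a sketch: the column names user_id, title, and rating are hypothetical stand-ins for the article's columns.

```python
import pandas as pd

# Hypothetical long-format rating log: one row per (user, movie) rating.
df = pd.DataFrame(
    {
        "user_id": [1, 1, 2, 2, 3, 3],
        "title": ["A", "B", "A", "C", "B", "B"],
        "rating": [4.0, 5.0, 3.0, 2.0, 1.0, 3.0],
    }
)

# Wide format: one row per user, one column per movie.
user_movie = df.pivot_table(index="user_id", columns="title", values="rating")
print(user_movie)
```

User 3 rated movie B twice (1.0 and 3.0), so that cell holds their mean, and user 1 never rated movie C, so that cell is NaN. On real data most cells are NaN, because each user rates only a small fraction of the catalogue.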


 


Calculating the Pearson Correlation Coefficient

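# Extract every user's rating of "Forrest Gump" ('阿甘正传(1994)') from the pivot table
FG = user_movie['阿甘正传(1994)']
# corrwith() computes the Pearson correlation coefficient between "Forrest Gump" and every other movie
corr_FG = user_movie.corrwith(FG)
similarity = pd.DataFrame(corr_FG, columns=['相关系数'])  # wrap in a DataFrame; the column '相关系数' is the correlation coefficient
# Drop rows with null values using DataFrame.dropna()
similarity.dropna(inplace=True)
# Show the correlation coefficients of 5 movies
similarity.head()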

Code explained:

user_movie is the pivot table generated in the previous step. First select the target movie to be analyzed and save its column to FG.

Then use the corrwith() function to calculate the Pearson correlation coefficient between the target movie and every other movie.

Some of the results are null, so dropna() is used to remove them. A null value appears when the coefficient cannot be calculated for a pair of movies, for example because no user has rated both of them, which leaves no data from which to compute the covariance and produces NaN.
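The behaviour described above can be checked on a small made-up table. The sketch below is an illustration under assumed data: the column names and ratings are hypothetical, and the manual calculation is included only to show that corrwith() matches the Pearson formula computed over the users who rated both movies.

```python
import numpy as np
import pandas as pd

# Hypothetical user x movie rating table; NaN means "not rated".
ratings = pd.DataFrame(
    {
        "target": [5.0, 4.0, 1.0, 2.0, np.nan],
        "similar": [4.5, 4.0, 1.5, 2.0, 3.0],
        "opposite": [1.0, 2.0, 5.0, 4.0, np.nan],
        "no_overlap": [np.nan, np.nan, np.nan, np.nan, 4.0],
    },
    index=["u1", "u2", "u3", "u4", "u5"],
)

# Correlate every column with the target column, pairwise over common raters.
corr = ratings.corrwith(ratings["target"])

# Recompute one coefficient by hand from the Pearson formula.
both = ratings[["target", "similar"]].dropna()
x = both["target"] - both["target"].mean()
y = both["similar"] - both["similar"].mean()
manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

# The movie with no common raters yields NaN and is removed by dropna().
cleaned = corr.dropna()
print(cleaned)
```

Here "opposite" mirrors the target ratings and comes out at -1, "similar" is close to +1, and "no_overlap" shares no raters with the target, so its coefficient is NaN until dropna() removes it.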



Filter the results

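# Note: similarity_new is not defined in the steps shown; it is assumed to be
# similarity with an added rating-count column ('评分次数')
# Set a simple rating-count threshold of 30, then sort by correlation coefficient ('相关系数') in descending order
similarity_new[similarity_new['评分次数'] > 30].sort_values(by='相关系数', ascending=False).head()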

Code explained:

Only movies with more than 30 ratings are kept, to avoid chance values that appear when the sample size is too small.

sort_values() is the sorting function. The by parameter names the field to sort on, and ascending=False means descending order.
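The filtering step uses a table named similarity_new with a '评分次数' (rating count) column, but the step that builds it does not appear in this article. One plausible way to construct it is sketched below. Everything here is an assumption: the toy data, the groupby-and-join approach, and the small threshold chosen so that the effect is visible on six rows.

```python
import pandas as pd

# Toy stand-ins for the article's df and similarity; the values are made up.
df = pd.DataFrame(
    {
        "名称": ["A", "A", "A", "B", "B", "C"],
        "评分": [4.0, 5.0, 3.0, 2.0, 4.0, 5.0],
    }
)
similarity = pd.DataFrame({"相关系数": [0.6, 0.9, 1.0]}, index=["A", "B", "C"])

# Count how many ratings each movie received.
rating_count = df.groupby("名称")["评分"].count().rename("评分次数")

# Attach the counts to the correlation table (both are indexed by title).
similarity_new = similarity.join(rating_count)

# Keep movies above the threshold, then sort by correlation, highest first.
threshold = 1  # assumption for the toy data; the article uses 30
result = similarity_new[similarity_new["评分次数"] > threshold].sort_values(
    by="相关系数", ascending=False
)
print(result)
```

Movie C has a perfect coefficient of 1.0 but only a single rating, so the threshold removes it, leaving B ahead of A. This is the kind of chance value the threshold is meant to screen out.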


Interpretation of the results: "Crazy Twins" and "Fatal Attraction" are highly correlated with "Forrest Gump". Users who like "Forrest Gump" may also like these two movies, so they can be recommended to those users.


Origin blog.csdn.net/Sukey666666/article/details/131333596