Introduction
Personalized recommendation systems play an important role in the modern digital age, assisting users in discovering information, products, or media content that may be of interest to them. Collaborative filtering is one of the most popular and effective algorithms in personalized recommendation systems.
Table of contents
The principle of collaborative filtering algorithm
User-Based Collaborative Filtering
Pearson correlation coefficient
Choice of the user's neighbors
Personalized Similarity Weight
Item-Based Collaborative Filtering
Different variants of collaborative filtering
Creation of User-Item Rating Matrix
User-Based Collaborative Filtering
Item-Based Collaborative Filtering
Performance Optimization and Scaling
The principle of collaborative filtering algorithm
User-Based Collaborative Filtering
User Similarity Calculation
When calculating the similarity between users, measures such as cosine similarity and the Pearson correlation coefficient are commonly used.
Cosine similarity
Cosine similarity is a similarity measure that measures the angle between two non-zero vectors. In collaborative filtering, users can be viewed as vectors, where each dimension represents an item and the value represents the user's rating for that item.
The calculation steps of cosine similarity are as follows:
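- Compute the dot product (inner product) of the two user vectors.
- Compute the norm (magnitude) of each user vector.
- Divide the dot product by the product of the two norms.
The cosine similarity formula is as follows:
$$\text{sim}(u, v) = \cos\theta = \frac{u \cdot v}{\|u\| \, \|v\|} = \frac{\sum_{i} r_{u,i} \, r_{v,i}}{\sqrt{\sum_{i} r_{u,i}^2} \, \sqrt{\sum_{i} r_{v,i}^2}}$$
where $r_{u,i}$ is the rating of user $u$ for item $i$.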
Demo
import numpy as np
# Rating vectors of two users (one entry per item; 0 means no rating)
user1_ratings = np.array([5, 4, 0, 0, 1])
user2_ratings = np.array([0, 0, 5, 4, 2])
# Cosine similarity: dot product divided by the product of the norms
cosine_similarity = np.dot(user1_ratings, user2_ratings) / (np.linalg.norm(user1_ratings) * np.linalg.norm(user2_ratings))
print(f"Cosine similarity: {cosine_similarity}")
Pearson correlation coefficient
The Pearson correlation coefficient is a statistical measure used to measure the strength and direction of a linear relationship between two variables. In collaborative filtering, it is used to measure the correlation between user ratings.
The steps to calculate the Pearson correlation coefficient are as follows:
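- Compute the mean of each user's rating vector.
- Compute the deviation of each rating from that user's mean.
- Divide the sum of the products of the paired deviations by the product of the square roots of each user's sum of squared deviations.
The formula for the Pearson correlation coefficient is as follows:
$$r_{uv} = \frac{\sum_{i} (r_{u,i} - \bar{r}_u)(r_{v,i} - \bar{r}_v)}{\sqrt{\sum_{i} (r_{u,i} - \bar{r}_u)^2} \, \sqrt{\sum_{i} (r_{v,i} - \bar{r}_v)^2}}$$
where $r_{u,i}$ is the rating of user $u$ for item $i$ and $\bar{r}_u$ is the mean rating of user $u$.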
Demo
import numpy as np
# Rating vectors of two users (one entry per item; 0 means no rating)
user1_ratings = np.array([5, 4, 0, 0, 1])
user2_ratings = np.array([0, 0, 5, 4, 2])
# Mean rating of each user
mean_user1 = np.mean(user1_ratings)
mean_user2 = np.mean(user2_ratings)
# Deviation of each rating from the user's mean
diff_user1 = user1_ratings - mean_user1
diff_user2 = user2_ratings - mean_user2
# Pearson correlation coefficient
pearson_correlation = np.sum(diff_user1 * diff_user2) / (np.sqrt(np.sum(diff_user1**2)) * np.sqrt(np.sum(diff_user2**2)))
print(f"Pearson correlation coefficient: {pearson_correlation}")
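As a quick sanity check, the manual computation in the demo should agree with NumPy's built-in `np.corrcoef`. The sketch below repeats the two example vectors. Both approaches treat 0 as an actual rating rather than as "not rated", which is an assumption carried over from the demo data.

```python
import numpy as np

user1_ratings = np.array([5, 4, 0, 0, 1])
user2_ratings = np.array([0, 0, 5, 4, 2])

# Manual Pearson correlation, following the steps listed above
diff_user1 = user1_ratings - user1_ratings.mean()
diff_user2 = user2_ratings - user2_ratings.mean()
manual = np.sum(diff_user1 * diff_user2) / (
    np.sqrt(np.sum(diff_user1**2)) * np.sqrt(np.sum(diff_user2**2))
)

# np.corrcoef returns a 2x2 correlation matrix; entry [0, 1] is the pair's correlation
builtin = np.corrcoef(user1_ratings, user2_ratings)[0, 1]

print(manual, builtin)
```

The two values should match up to floating-point error. For these vectors the correlation is strongly negative, since each user rates highly exactly the items the other left at 0.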
Neighbor user selection
Similarity measure
When selecting similar users, it is first necessary to define a similarity measurement method. Commonly used similarity measurement methods include cosine similarity, Pearson correlation coefficient, Jaccard similarity and so on. Choosing an appropriate similarity measure depends on the nature of the data and the characteristics of the problem. Cosine similarity is usually used for rating data, while Jaccard similarity is usually used for binary data (whether a user liked or clicked on an item).
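For binary data, Jaccard similarity can be sketched in a few lines. The function name `jaccard_similarity` and the click sets below are hypothetical and only illustrate the definition (size of the intersection divided by size of the union).

```python
def jaccard_similarity(items_a, items_b):
    """Jaccard similarity between two collections of item IDs: |A & B| / |A | B|."""
    a, b = set(items_a), set(items_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here: two empty sets count as dissimilar
    return len(a & b) / len(union)


# Hypothetical sets of items each user clicked on
user1_clicked = {"Item1", "Item2", "Item5"}
user2_clicked = {"Item3", "Item4", "Item5"}

# One shared item out of five distinct items
print(jaccard_similarity(user1_clicked, user2_clicked))  # 0.2
```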
Choice of the user's neighbors
Once the similarity measure is chosen, the next step is to decide how many similar users to select. This number is usually controlled by a parameter k, the number of nearest neighbors. Increasing k improves coverage but may reduce accuracy, because a larger neighborhood tends to include less similar users. Choosing k is therefore a trade-off, which can be tuned with techniques such as cross-validation.
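A minimal sketch of k-nearest-neighbor selection, assuming the similarities of all users to the target user are already available as a one-dimensional array. The helper name `top_k_neighbors` and the similarity values are illustrative, not part of any library.

```python
import numpy as np


def top_k_neighbors(similarities, target_index, k):
    """Return the indices of the k users most similar to the target user."""
    sims = np.asarray(similarities, dtype=float).copy()
    sims[target_index] = -np.inf      # never pick the target as its own neighbor
    k = min(k, len(sims) - 1)         # cannot return more neighbors than exist
    order = np.argsort(sims)[::-1]    # indices sorted by descending similarity
    return order[:k]


# Hypothetical similarities of four users to user 0 (the target)
sims_to_target = [1.0, 0.05, 0.96, 0.0]
print(top_k_neighbors(sims_to_target, target_index=0, k=2))  # [2 1]
```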
Threshold filtering
Besides k-based selection, threshold filtering can also be used to select similar users. For example, only select users whose similarity with the target user is greater than a certain threshold. This approach can help filter out less similar users and improve recommendation accuracy. The choice of threshold usually needs to be adjusted based on actual problems and data.
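Threshold filtering can be sketched the same way. The helper name `neighbors_above_threshold` and the threshold of 0.5 are assumptions chosen for illustration; a real threshold would be tuned on the data.

```python
import numpy as np


def neighbors_above_threshold(similarities, target_index, threshold):
    """Return the indices of users whose similarity to the target exceeds the threshold."""
    sims = np.asarray(similarities, dtype=float)
    mask = sims > threshold
    mask[target_index] = False        # exclude the target user itself
    return np.flatnonzero(mask)


# Hypothetical similarities of four users to user 0 (the target)
sims_to_target = [1.0, 0.05, 0.96, 0.0]
print(neighbors_above_threshold(sims_to_target, target_index=0, threshold=0.5))  # [2]
```

Unlike k-based selection, the number of neighbors returned here varies per user and can be zero, so a fallback is needed when nobody passes the threshold.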
Personalized Similarity Weight
In some cases, similarities between different users may have different importance. For example, certain users may be more relevant to the target user's behavior in a specific domain or time period. Therefore, each similar user can be assigned an individualized similarity weight to better reflect their contribution.
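One possible way to realize such a weight is to scale each neighbor's similarity by a recency factor, so that recently active neighbors count more. This is only a sketch: the exponential decay, the 30-day half-life, and the name `personalized_weights` are assumptions made for illustration, not a method prescribed by this article.

```python
import numpy as np


def personalized_weights(similarities, days_since_active, half_life_days=30.0):
    """Scale each neighbor's similarity by an exponential recency factor.

    The recency factor and the default half-life are illustrative assumptions.
    """
    sims = np.asarray(similarities, dtype=float)
    age = np.asarray(days_since_active, dtype=float)
    recency = 0.5 ** (age / half_life_days)   # 1.0 for today, 0.5 after one half-life
    return sims * recency


# Hypothetical neighbors: similarity to the target and days since last activity
sims = [0.9, 0.9, 0.4]
days = [0, 30, 60]
print(personalized_weights(sims, days))  # approximately 0.9, 0.45, 0.1
```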
Score prediction
First, we need to select a set of similar users who behaved similarly to the target user in the past. We can measure the similarity between users using a previously calculated similarity measure such as cosine similarity or the Pearson correlation coefficient.
Once similar users are selected, we need to obtain their historical ratings for the items that the target user has not yet rated. These ratings will be used to predict the target user's ratings.
Next, we use the historical rating data of similar users to calculate the target user's predicted ratings for items that have not yet been rated.
The prediction can use a weighted average of the neighbors' ratings or a method based on weighted regression.
Note: The following parts are not covered in detail here, but they can be explored further once the basics above are understood.
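A minimal sketch of the weighted-average variant: the predicted rating is the sum of each neighbor's rating weighted by its similarity, divided by the sum of the absolute similarities. The function name `predict_weighted_average` is hypothetical, and treating a rating of 0 as "not rated" is an assumption that matches the demo data in this article. The regression-based variant is not shown.

```python
import numpy as np


def predict_weighted_average(similarities, neighbor_ratings):
    """Similarity-weighted average of the neighbors' ratings for one item.

    Neighbors with a rating of 0 are treated as "not rated" and skipped
    (an assumption matching the demo data).
    """
    sims = np.asarray(similarities, dtype=float)
    ratings = np.asarray(neighbor_ratings, dtype=float)
    rated = ratings > 0
    denom = np.abs(sims[rated]).sum()
    if denom == 0:
        return None  # no usable neighbor, so no prediction can be made
    return float(sims[rated] @ ratings[rated] / denom)


# Three hypothetical neighbors; the third has not rated the item
print(predict_weighted_average([0.8, 0.2, 0.5], [5, 3, 0]))  # approximately 4.6
```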
Item-Based Collaborative Filtering
- Item Similarity Calculation : How to calculate the similarity between items, using measures such as cosine similarity.
- Neighbor Item Selection : How to find, for the items the target user has already rated, the most similar items, in order to generate more precise recommendations.
- Rating Prediction : How to generate final recommendations based on the user's historical ratings of these similar items.
Different variants of collaborative filtering
- Collaborative filtering based on implicit feedback : processing implicit feedback data, such as user browsing history and click records.
- Collaborative filtering in deep learning : Using deep learning models to improve the performance of collaborative filtering.
- Temporal Collaborative Filtering : Taking temporal factors into account to predict the evolution of user behavior and interests.
Data preprocessing
- Data preparation : Prepare user-item rating data, usually stored as a pandas DataFrame.
- Data cleaning : Handle missing values, outliers, and duplicate data to ensure data quality.
- Data splitting : Divide a dataset into training, validation, and test sets for model training and evaluation.
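The three steps above can be sketched with pandas alone. The raw rating log, the 1 to 5 rating scale, and the 60/20/20 split ratio below are hypothetical choices made for illustration.

```python
import pandas as pd

# Hypothetical raw rating log: one row per (user, item, rating) event
raw = pd.DataFrame({
    "user":   ["User1", "User1", "User2", "User2", "User3", "User3", "User3", "User4"],
    "item":   ["Item1", "Item1", "Item3", "Item4", "Item1", "Item2", "Item2", "Item3"],
    "rating": [5, 5, 5, 4, 4, 5, None, 9],
})

# Data cleaning: drop missing ratings, exact duplicates, and out-of-range values
clean = raw.dropna(subset=["rating"]).drop_duplicates()
clean = clean[clean["rating"].between(1, 5)]  # assumes a 1-5 rating scale

# Data preparation: item x user rating matrix, with 0 meaning "not rated"
matrix = clean.pivot_table(index="item", columns="user", values="rating", fill_value=0)

# Data splitting: shuffle, then cut into 60% training, 20% validation, 20% test
shuffled = clean.sample(frac=1.0, random_state=0).reset_index(drop=True)
n = len(shuffled)
train = shuffled.iloc[: n * 6 // 10]
valid = shuffled.iloc[n * 6 // 10 : n * 8 // 10]
test = shuffled.iloc[n * 8 // 10 :]

print(len(clean), len(train), len(valid), len(test))
```

With so few rows the split is only a demonstration of the mechanics. On real data, splitting per user or by timestamp is often preferable to a purely random split.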
Python example
Creation of User-Item Rating Matrix
import pandas as pd
# Create the user-item rating matrix (rows are items, columns are users; 0 means "not rated")
ratings = pd.DataFrame({
'User1': [5, 4, 0, 0, 1],
'User2': [0, 0, 5, 4, 2],
'User3': [4, 5, 0, 0, 0],
'User4': [0, 0, 4, 5, 0]
}, index=['Item1', 'Item2', 'Item3', 'Item4', 'Item5'])
User-Based Collaborative Filtering
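from sklearn.metrics.pairwise import cosine_similarity
# Compute user-user cosine similarity; rows of `ratings` are items, so transpose first
user_similarity = pd.DataFrame(cosine_similarity(ratings.T), index=ratings.columns, columns=ratings.columns)
# Choose the target user and the item whose rating is to be predicted
target_user = 'User1'
target_item = 'Item3'
# Similarity of every other user to the target user
similar_users = user_similarity[target_user].drop(target_user)
# Keep only neighbors who rated the target item (0 means "not rated")
target_item_ratings = ratings.loc[target_item, similar_users.index]
rated = target_item_ratings > 0
# Predict the rating as a similarity-weighted average of the neighbors' ratings
predicted_rating = (similar_users[rated] @ target_item_ratings[rated]) / similar_users[rated].sum()
print(f"Predicted rating of {target_user} for {target_item}: {predicted_rating}")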
Item-Based Collaborative Filtering
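# Compute item-item cosine similarity (rows of `ratings` are items)
item_similarity = pd.DataFrame(cosine_similarity(ratings), index=ratings.index, columns=ratings.index)
# Similarity of every other item to the target item
similar_items = item_similarity[target_item].drop(target_item)
# Keep only items the target user has rated (0 means "not rated")
target_user_ratings = ratings.loc[similar_items.index, target_user]
rated = target_user_ratings > 0
# Predict the rating as a similarity-weighted average of the user's own ratings
predicted_rating = (similar_items[rated] @ target_user_ratings[rated]) / similar_items[rated].sum()
print(f"Predicted rating of {target_user} for {target_item}: {predicted_rating}")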
Performance Optimization and Scaling
Building on the examples above, further optimization is possible in the following directions:
- Model improvement : Improve the collaborative filtering model, including using weighted scoring, considering time factors, etc., to improve the quality of recommendations.
- Large-scale data processing : processing large-scale data sets, including the use of distributed computing and distributed storage, to process rating data of massive users and items.
- Real-time recommendation : Apply collaborative filtering algorithms in real-time recommendation systems to meet users' immediate needs.
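As one small step toward larger data sets, the rating matrix can be stored in a sparse format so that only actual ratings take up memory. The sketch below assumes SciPy and scikit-learn are available and reuses the toy ratings from the example, laid out as users by items. It is not a distributed solution, only an illustration of the sparse representation.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

# Same toy data as the rating matrix example, laid out as (users x items)
dense = np.array([
    [5, 4, 0, 0, 1],
    [0, 0, 5, 4, 2],
    [4, 5, 0, 0, 0],
    [0, 0, 4, 5, 0],
])
sparse_ratings = csr_matrix(dense)  # only non-zero entries are stored

# dense_output=False keeps the user-user similarity matrix sparse as well
user_similarity = cosine_similarity(sparse_ratings, dense_output=False)

print(user_similarity.shape)   # (4, 4)
print(user_similarity[0, 2])   # similarity between the first and third user
```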