In-depth understanding of the collaborative filtering algorithm and its implementation

Introduction

        Personalized recommendation systems play an important role in the modern digital age, assisting users in discovering information, products, or media content that may be of interest to them. Collaborative filtering is one of the most popular and effective algorithms in personalized recommendation systems.

Table of contents

The principle of collaborative filtering algorithm

User-Based Collaborative Filtering

User Similarity Calculation

cosine similarity

Demo

Pearson correlation coefficient

Demo

Neighbor user selection

similarity measure

Choice of the user's neighbors

threshold filtering

Personalized Similarity Weight

score prediction

Item-Based Collaborative Filtering

Different variants of collaborative filtering

data preprocessing

Python example

Creation of User-Item Rating Matrix

User-Based Collaborative Filtering

Item-Based Collaborative Filtering

Performance Optimization and Scaling


The principle of collaborative filtering algorithm

User-Based Collaborative Filtering

User Similarity Calculation

When calculating the similarity between users, measures such as cosine similarity and the Pearson correlation coefficient are usually used.

cosine similarity

Cosine similarity is a similarity measure that measures the angle between two non-zero vectors. In collaborative filtering, users can be viewed as vectors, where each dimension represents an item and the value represents the user's rating for that item.

The calculation steps of cosine similarity are as follows:

  1. Compute the dot product (inner product) of the two user vectors.
  2. Compute the norm (length) of each user vector.
  3. Divide the dot product by the product of the two norms.

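The cosine similarity formula is as follows:

$$\text{sim}(u, v) = \cos\theta = \frac{\mathbf{r}_u \cdot \mathbf{r}_v}{\lVert \mathbf{r}_u \rVert \, \lVert \mathbf{r}_v \rVert} = \frac{\sum_i r_{u,i}\, r_{v,i}}{\sqrt{\sum_i r_{u,i}^2}\,\sqrt{\sum_i r_{v,i}^2}}$$

where $r_{u,i}$ is the rating that user $u$ gave to item $i$.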

Demo

import numpy as np

# Rating vectors of two users (one entry per item, 0 = not rated)
user1_ratings = np.array([5, 4, 0, 0, 1])
user2_ratings = np.array([0, 0, 5, 4, 2])

# Cosine similarity: dot product divided by the product of the norms
cosine_similarity = np.dot(user1_ratings, user2_ratings) / (np.linalg.norm(user1_ratings) * np.linalg.norm(user2_ratings))

print(f"Cosine similarity: {cosine_similarity}")
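The demo above divides by zero when either user has no ratings at all. The following is a minimal sketch of a guarded helper; the name `safe_cosine` and the choice to return 0.0 for an all-zero vector are assumptions made for illustration.

```python
import numpy as np

def safe_cosine(a, b):
    """Cosine similarity that returns 0.0 when either vector is all zeros."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0:
        # No ratings on at least one side: treat the users as unrelated (assumption)
        return 0.0
    return float(np.dot(a, b) / denom)

sim = safe_cosine([5, 4, 0, 0, 1], [0, 0, 5, 4, 2])
empty = safe_cosine([0, 0, 0, 0, 0], [0, 0, 5, 4, 2])
print(sim, empty)
```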

Pearson correlation coefficient

The Pearson correlation coefficient is a statistical measure used to measure the strength and direction of a linear relationship between two variables. In collaborative filtering, it is used to measure the correlation between user ratings.

The steps to calculate the Pearson correlation coefficient are as follows:

  1. Compute the mean of each user's rating vector.
  2. Compute the deviation of each rating from its user's mean.
  3. Divide the sum of the products of the paired deviations by the product of the square roots of each user's summed squared deviations.

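The formula for the Pearson correlation coefficient is as follows:

$$\rho(u, v) = \frac{\sum_i (r_{u,i} - \bar{r}_u)(r_{v,i} - \bar{r}_v)}{\sqrt{\sum_i (r_{u,i} - \bar{r}_u)^2}\,\sqrt{\sum_i (r_{v,i} - \bar{r}_v)^2}}$$

where $r_{u,i}$ is the rating that user $u$ gave to item $i$ and $\bar{r}_u$ is the mean of user $u$'s ratings.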

Demo

import numpy as np

# Rating vectors of two users (zeros are treated as ordinary ratings here)
user1_ratings = np.array([5, 4, 0, 0, 1])
user2_ratings = np.array([0, 0, 5, 4, 2])

# Mean rating of each user
mean_user1 = np.mean(user1_ratings)
mean_user2 = np.mean(user2_ratings)

# Deviation of each rating from the user's mean
diff_user1 = user1_ratings - mean_user1
diff_user2 = user2_ratings - mean_user2

# Pearson correlation coefficient
pearson_correlation = np.sum(diff_user1 * diff_user2) / (np.sqrt(np.sum(diff_user1**2)) * np.sqrt(np.sum(diff_user2**2)))

print(f"Pearson correlation coefficient: {pearson_correlation}")
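If a 0 in the rating vector means "not rated" rather than a real score, the correlation is often restricted to items that both users have rated. The sketch below illustrates that variant under this assumption; the function name `pearson_corated`, the fallback value 0.0, and the example vectors are hypothetical.

```python
import numpy as np

def pearson_corated(a, b):
    """Pearson correlation over items rated by both users (0 = not rated)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    mask = (a > 0) & (b > 0)          # items rated by both users
    if mask.sum() < 2:
        return 0.0                    # not enough overlap to correlate (assumption)
    da = a[mask] - a[mask].mean()
    db = b[mask] - b[mask].mean()
    denom = np.sqrt(np.sum(da**2) * np.sum(db**2))
    return float(np.sum(da * db) / denom) if denom > 0 else 0.0

# Three co-rated items: positions 0, 3 and 4
r = pearson_corated([5, 4, 0, 3, 1], [4, 0, 5, 2, 2])
# The demo vectors share only one co-rated item, so the fallback applies
r_demo = pearson_corated([5, 4, 0, 0, 1], [0, 0, 5, 4, 2])
print(r, r_demo)
```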

Neighbor user selection

similarity measure

        When selecting similar users, it is first necessary to define a similarity measurement method. Commonly used similarity measurement methods include cosine similarity, Pearson correlation coefficient, Jaccard similarity and so on. Choosing an appropriate similarity measure depends on the nature of the data and the characteristics of the problem. Cosine similarity is usually used for rating data, while Jaccard similarity is usually used for binary data (whether a user liked or clicked on an item).
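For binary data, Jaccard similarity is the size of the intersection of two users' item sets divided by the size of their union. A minimal sketch, with hypothetical item sets and a hypothetical function name:

```python
def jaccard_similarity(items_a, items_b):
    """Jaccard similarity between two collections of item identifiers."""
    a, b = set(items_a), set(items_b)
    union = a | b
    if not union:
        return 0.0  # both users have no interactions (assumption: treat as unrelated)
    return len(a & b) / len(union)

liked_by_user1 = {"Item1", "Item2", "Item5"}
liked_by_user2 = {"Item3", "Item4", "Item5"}
jac = jaccard_similarity(liked_by_user1, liked_by_user2)
print(jac)  # one shared item out of five distinct items
```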

Choice of the user's neighbors

        Once the similarity measure is selected, the next step is to determine how many similar users to select. This number is usually controlled by a parameter k, the number of nearest neighbors. Increasing k improves coverage but may decrease accuracy, because a larger neighborhood includes users who are less similar to the target user. Choosing an appropriate k is a trade-off that can be tuned with techniques such as cross-validation.
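Selecting the k nearest neighbors can be sketched as sorting one row of the similarity matrix and keeping the top k entries. The function name and the similarity values below are illustrative assumptions.

```python
import numpy as np

def top_k_neighbors(similarities, target, k):
    """Indices of the k users most similar to `target`, excluding the target itself."""
    sims = np.asarray(similarities, dtype=float).copy()
    sims[target] = -np.inf                 # never pick the target as its own neighbor
    k = min(k, len(sims) - 1)
    order = np.argsort(-sims, kind="stable")  # descending similarity
    return order[:k]

# Illustrative similarities of user 0 to users 0..3
sims_to_user0 = [1.0, 0.046, 0.964, 0.0]
neighbors = top_k_neighbors(sims_to_user0, target=0, k=2)
print(neighbors)
```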

threshold filtering

        Besides k-based selection, threshold filtering can also be used to select similar users. For example, only select users whose similarity with the target user is greater than a certain threshold. This approach can help filter out less similar users and improve recommendation accuracy. The choice of threshold usually needs to be adjusted based on actual problems and data.
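Threshold filtering can be sketched in the same style: keep every user whose similarity exceeds a cutoff. The function name, the similarity values, and the threshold of 0.5 are assumptions chosen for illustration.

```python
import numpy as np

def neighbors_above_threshold(similarities, target, threshold):
    """Indices of users whose similarity to `target` is greater than `threshold`."""
    sims = np.asarray(similarities, dtype=float)
    idx = np.flatnonzero(sims > threshold)
    return idx[idx != target]              # drop the target user itself

kept = neighbors_above_threshold([1.0, 0.046, 0.964, 0.0], target=0, threshold=0.5)
print(kept)
```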

Personalized Similarity Weight

        In some cases, similarities between different users may have different importance. For example, certain users may be more relevant to the target user's behavior in a specific domain or time period. Therefore, each similar user can be assigned an individualized similarity weight to better reflect their contribution.
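One possible way to realize such per-neighbor weights is significance weighting, which shrinks the similarity of neighbors that share only a few co-rated items with the target user. This is an illustrative choice rather than something the text above prescribes; the function name and the cutoff `gamma` are assumptions.

```python
import numpy as np

def apply_significance_weights(sims, n_corated, gamma=5):
    """Scale each similarity by min(n_corated, gamma) / gamma."""
    sims = np.asarray(sims, dtype=float)
    n = np.asarray(n_corated, dtype=float)
    return sims * np.minimum(n, gamma) / gamma

# Two neighbors with the same raw similarity but different overlap sizes
weighted = apply_significance_weights([0.9, 0.9, 0.4], [1, 10, 5], gamma=5)
print(weighted)
```

The neighbor with a single co-rated item keeps only a fifth of its raw similarity, while neighbors at or above the cutoff are left unchanged.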

score prediction

        First, we need to select a set of similar users who behaved like the target user in the past. We can measure the similarity between users using one of the measures described earlier, such as cosine similarity or the Pearson correlation coefficient.

        Once similar users are selected, we need to obtain their historical ratings for the items that the target user has not yet rated. These ratings will be used to predict the target user's ratings.

        Next, we use the historical rating data of similar users to calculate the target user's predicted ratings for items that have not yet been rated.

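You can use a weighted average or a method based on weighted regression. For the weighted average, a common form is:

$$\hat{r}_{u,i} = \frac{\sum_{v \in N(u)} \text{sim}(u, v)\, r_{v,i}}{\sum_{v \in N(u)} \lvert \text{sim}(u, v) \rvert}$$

where $\hat{r}_{u,i}$ is the predicted rating of user $u$ for item $i$, $N(u)$ is the set of selected neighbors who have rated item $i$, and $r_{v,i}$ is neighbor $v$'s rating of item $i$.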
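A minimal sketch of the weighted-average option, assuming that 0 marks an unrated item and that neighbors who have not rated the item are skipped. The function name and the input values are hypothetical.

```python
import numpy as np

def predict_weighted_average(similarities, neighbor_ratings):
    """Similarity-weighted average of the neighbors' ratings for one item."""
    sims = np.asarray(similarities, dtype=float)
    ratings = np.asarray(neighbor_ratings, dtype=float)
    rated = ratings > 0                      # ignore neighbors who did not rate the item
    denom = np.abs(sims[rated]).sum()
    if denom == 0:
        return None                          # no usable neighbor, no prediction
    return float(np.dot(sims[rated], ratings[rated]) / denom)

# Three neighbors; the second one has not rated the item
pred = predict_weighted_average([0.8, 0.5, 0.2], [4, 0, 2])
print(pred)
```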

        

Note: The following parts are not covered in detail here, but can be explored further once the basics are in place.

Item-Based Collaborative Filtering

  • Item Similarity Calculation : Calculate the similarity between items, using measures such as cosine similarity.
  • Neighbor Item Selection : For each item the target user has rated, find the most similar items in order to generate more precise recommendations.
  • Rating Prediction : Generate the final recommendations based on the user's historical ratings of these similar items.

Different variants of collaborative filtering

  • Collaborative filtering based on implicit feedback : processing implicit feedback data, such as user browsing history and click records.
  • Collaborative filtering in deep learning : Using deep learning models to improve the performance of collaborative filtering.
  • Temporal Collaborative Filtering : Taking temporal factors into account to predict the evolution of user behavior and interests.

data preprocessing

  • Data preparation : Prepare user-item rating data, usually as a pandas DataFrame.
  • Data cleaning : Handle missing values, outliers, and duplicate data to ensure data quality.
  • Data splitting : Divide a dataset into training, validation, and test sets for model training and evaluation.
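The three preprocessing steps can be sketched with pandas as follows. The column names, the 1 to 5 rating scale, the toy records, and the 60/20/20 split are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Data preparation: long-format rating records (hypothetical data)
raw = pd.DataFrame({
    "user":   ["U1", "U1", "U1", "U2", "U2", "U2", "U3", "U3", "U3", "U4", "U4", "U4", "U4"],
    "item":   ["I1", "I2", "I2", "I3", "I4", "I5", "I1", "I2", "I3", "I3", "I4", "I5", "I1"],
    "rating": [5, 4, 4, 5, 4, 2, 4, 5, np.nan, 4, 5, 9, 3],
})

# Data cleaning: drop missing ratings, duplicate (user, item) pairs, and out-of-range values
clean = raw.dropna(subset=["rating"]).drop_duplicates(subset=["user", "item"])
clean = clean[clean["rating"].between(1, 5)].reset_index(drop=True)

# Data splitting: 60% training, 20% validation, 20% test
train = clean.sample(frac=0.6, random_state=42)
rest = clean.drop(train.index)
validation = rest.sample(frac=0.5, random_state=42)
test = rest.drop(validation.index)

print(len(clean), len(train), len(validation), len(test))
```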

Python example

Creation of User-Item Rating Matrix

import pandas as pd

# Create the user-item rating matrix (rows = items, columns = users, 0 = not rated)
ratings = pd.DataFrame({
    'User1': [5, 4, 0, 0, 1],
    'User2': [0, 0, 5, 4, 2],
    'User3': [4, 5, 0, 0, 0],
    'User4': [0, 0, 4, 5, 0]
}, index=['Item1', 'Item2', 'Item3', 'Item4', 'Item5'])
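In practice, ratings usually arrive as long-format records rather than as a ready-made matrix. The sketch below shows one way to pivot such records into the same items-by-users layout; the column names and the variables `records` and `matrix` are hypothetical.

```python
import pandas as pd

# Long-format records: one row per (user, item, rating)
records = pd.DataFrame({
    "user":   ["User1", "User1", "User1", "User2", "User2", "User2",
               "User3", "User3", "User4", "User4"],
    "item":   ["Item1", "Item2", "Item5", "Item3", "Item4", "Item5",
               "Item1", "Item2", "Item3", "Item4"],
    "rating": [5, 4, 1, 5, 4, 2, 4, 5, 4, 5],
})

# Rows = items, columns = users; missing (user, item) pairs become 0
matrix = records.pivot_table(index="item", columns="user", values="rating", fill_value=0)
print(matrix)
```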

User-Based Collaborative Filtering

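import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# User-user cosine similarity. cosine_similarity compares rows,
# and users are the columns of `ratings`, so transpose first.
user_similarity = pd.DataFrame(
    cosine_similarity(ratings.T.fillna(0)),
    index=ratings.columns, columns=ratings.columns)

# Choose the target user and the item to predict
target_user = 'User1'
target_item = 'Item3'

# Predict the rating as the similarity-weighted average of the
# other users' ratings for the target item (0 = not rated)
similar_users = user_similarity[target_user].drop(target_user)
item_ratings = ratings.loc[target_item].drop(target_user)
rated = item_ratings > 0
predicted_rating = (similar_users[rated] * item_ratings[rated]).sum() / similar_users[rated].sum()

print(f"Predicted rating of user {target_user} for item {target_item}: {predicted_rating}")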

Item-Based Collaborative Filtering

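from sklearn.metrics.pairwise import cosine_similarity

# Item-item cosine similarity (items are the rows of `ratings`)
item_similarity = pd.DataFrame(cosine_similarity(ratings.fillna(0)),
                               index=ratings.index, columns=ratings.index)

# Predict from the target user's own ratings (0 = not rated) of the other items
user_ratings = ratings[target_user].drop(target_item)
weights = item_similarity.loc[target_item, user_ratings.index].where(user_ratings > 0, 0)
predicted_rating = (weights * user_ratings).sum() / weights.sum()

print(f"Predicted rating of user {target_user} for item {target_item}: {predicted_rating}")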

Performance Optimization and Scaling

        Building on the example, further improvements can be made in the following directions:

  • Model improvement : Improve the collaborative filtering model, including using weighted scoring, considering time factors, etc., to improve the quality of recommendations.
  • Large-scale data processing : Process large-scale data sets, using distributed computing and distributed storage to handle rating data for massive numbers of users and items.
  • Real-time recommendation : Apply collaborative filtering algorithms in real-time recommendation systems to meet users' immediate needs.
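A first step toward larger data sets, before any distributed setup, is to store the rating matrix in a sparse format so that unrated entries take no memory. The sketch below assumes SciPy is available and computes user-user cosine similarity on a sparse matrix; the variable names are hypothetical and the ratings repeat the small example matrix.

```python
import numpy as np
from scipy import sparse

# Only the observed ratings are stored: (user index, item index, rating)
rows = np.array([0, 0, 0, 1, 1, 1, 2, 2, 3, 3])
cols = np.array([0, 1, 4, 2, 3, 4, 0, 1, 2, 3])
vals = np.array([5, 4, 1, 5, 4, 2, 4, 5, 4, 5], dtype=float)
R = sparse.csr_matrix((vals, (rows, cols)), shape=(4, 5))  # rows = users

# Normalize each user row to unit length, then multiply: the result is cosine similarity
norms = np.sqrt(np.asarray(R.multiply(R).sum(axis=1)).ravel())
norms[norms == 0] = 1.0                     # avoid division by zero for empty users
R_normalized = sparse.diags(1.0 / norms) @ R
user_sim = (R_normalized @ R_normalized.T).tocsr()  # stays sparse

print(user_sim.toarray().round(3))
```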

Origin blog.csdn.net/qq_52213943/article/details/132627353