Some open-source Python libraries for recommender systems

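I recently came across a new recommender-system library, LightFM. So that I don't forget it, here is a short note covering it together with the Surprise library I have used before.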

The Surprise library

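Surprise is one of the more commonly used and better-known recommender-system libraries. It is a scikit (an add-on package for SciPy) and is distributed on PyPI as scikit-surprise.
Official documentation: https://surprise.readthedocs.io/en/stable/getting_started.html
GitHub: https://github.com/NicolasHug/Surprise
Surprise supports a variety of recommendation algorithms: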

Algorithm | Description
--- | ---
random_pred.NormalPredictor | Algorithm predicting a random rating based on the distribution of the training set, which is assumed to be normal.
baseline_only.BaselineOnly | Algorithm predicting the baseline estimate for given user and item.
knns.KNNBasic | A basic collaborative filtering algorithm.
knns.KNNWithMeans | A basic collaborative filtering algorithm, taking into account the mean ratings of each user.
knns.KNNWithZScore | A basic collaborative filtering algorithm, taking into account the z-score normalization of each user.
knns.KNNBaseline | A basic collaborative filtering algorithm taking into account a baseline rating.
matrix_factorization.SVD | The famous SVD algorithm, as popularized by Simon Funk during the Netflix Prize.
matrix_factorization.SVDpp | The SVD++ algorithm, an extension of SVD taking into account implicit ratings.
matrix_factorization.NMF | A collaborative filtering algorithm based on Non-negative Matrix Factorization.
slope_one.SlopeOne | A simple yet accurate collaborative filtering algorithm.
co_clustering.CoClustering | A collaborative filtering algorithm based on co-clustering.
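
To make the SVD entry above more concrete, here is a minimal pure-Python sketch of the Funk-style prediction rule, rating ≈ global mean + user bias + item bias + dot(user factors, item factors), fitted with plain SGD. It is not Surprise's implementation: the function name `train_funk_svd`, the toy ratings, and the hyperparameter values are assumptions chosen only for illustration.

```python
import random

def train_funk_svd(ratings, n_factors=2, n_epochs=300, lr=0.01, reg=0.02, seed=0):
    """Fit biases and latent factors by SGD on (user, item, rating) triples.

    Hypothetical helper for illustration; returns a predict(user, item) function.
    """
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating
    bu = {u: 0.0 for u in users}                       # user biases
    bi = {i: 0.0 for i in items}                       # item biases
    pu = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    qi = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}

    def predict(u, i):
        dot = sum(p * q for p, q in zip(pu[u], qi[i]))
        return mu + bu[u] + bi[i] + dot

    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - predict(u, i)
            # Gradient step on the regularised squared error
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            for f in range(n_factors):
                p, q = pu[u][f], qi[i][f]
                pu[u][f] += lr * (err * q - reg * p)
                qi[i][f] += lr * (err * p - reg * q)
    return predict

# Toy data (made up for this sketch)
toy_ratings = [
    ("alice", "m1", 5.0), ("alice", "m2", 3.0),
    ("bob", "m1", 4.0), ("bob", "m3", 1.0),
    ("carol", "m2", 2.0), ("carol", "m3", 1.0),
]
predict = train_funk_svd(toy_ratings)
train_rmse = (sum((r - predict(u, i)) ** 2 for u, i, r in toy_ratings)
              / len(toy_ratings)) ** 0.5
print("training RMSE: %.3f" % train_rmse)
```

In practice you would let Surprise's own SVD class do this work; the sketch only shows what kind of model is being fitted.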

For its neighbourhood-based collaborative filtering algorithms, Surprise supports several similarity measures:

Similarity measure | Description
--- | ---
cosine | Compute the cosine similarity between all pairs of users (or items).
msd | Compute the Mean Squared Difference similarity between all pairs of users (or items).
pearson | Compute the Pearson correlation coefficient between all pairs of users (or items).
pearson_baseline | Compute the (shrunk) Pearson correlation coefficient between all pairs of users (or items) using baselines for centering instead of means.
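
As a rough illustration of the first three measures, the sketch below computes them for a single pair of users over the items both have rated. The helper names and the toy ratings are assumptions, and the library's own implementation may handle edge cases (few common items, zero variance) differently.

```python
from math import sqrt

def common_ratings(ratings_u, ratings_v):
    """Return paired ratings for the items that both users have rated."""
    shared = sorted(set(ratings_u) & set(ratings_v))
    return [(ratings_u[i], ratings_v[i]) for i in shared]

def cosine_sim(ratings_u, ratings_v):
    pairs = common_ratings(ratings_u, ratings_v)
    if not pairs:
        return 0.0
    dot = sum(a * b for a, b in pairs)
    norm_u = sqrt(sum(a * a for a, _ in pairs))
    norm_v = sqrt(sum(b * b for _, b in pairs))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def msd_sim(ratings_u, ratings_v):
    pairs = common_ratings(ratings_u, ratings_v)
    if not pairs:
        return 0.0
    msd = sum((a - b) ** 2 for a, b in pairs) / len(pairs)
    return 1.0 / (msd + 1.0)  # turn a distance into a similarity in (0, 1]

def pearson_sim(ratings_u, ratings_v):
    pairs = common_ratings(ratings_u, ratings_v)
    if len(pairs) < 2:
        return 0.0
    mean_u = sum(a for a, _ in pairs) / len(pairs)
    mean_v = sum(b for _, b in pairs) / len(pairs)
    cov = sum((a - mean_u) * (b - mean_v) for a, b in pairs)
    var_u = sum((a - mean_u) ** 2 for a, _ in pairs)
    var_v = sum((b - mean_v) ** 2 for _, b in pairs)
    return cov / sqrt(var_u * var_v) if var_u and var_v else 0.0

# Toy ratings (made up): bob rates every shared item exactly one star lower
alice = {"m1": 5, "m2": 3, "m3": 4}
bob = {"m1": 4, "m2": 2, "m3": 3, "m4": 5}

cos_ab = cosine_sim(alice, bob)
msd_ab = msd_sim(alice, bob)
pearson_ab = pearson_sim(alice, bob)
print(cos_ab, msd_ab, pearson_ab)
```

The toy pair shows why the choice matters: Pearson ignores the constant one-star offset and reports a perfect correlation, while MSD penalises it.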

Surprise supports several evaluation metrics:

Metric | Description
--- | ---
RMSE | Compute RMSE (Root Mean Squared Error).
MAE | Compute MAE (Mean Absolute Error).
FCP | Compute FCP (Fraction of Concordant Pairs).
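
The three metrics can be sketched in a few lines of plain Python. This is a simplified illustration rather than the library's code: the function names, the `(user, true_rating, estimated_rating)` tuple layout, and the toy predictions are assumptions, and the handling of ties in FCP may differ from Surprise's.

```python
from collections import defaultdict
from math import sqrt

def rmse(predictions):
    """Root mean squared error over (user, true, estimate) triples."""
    return sqrt(sum((true - est) ** 2 for _, true, est in predictions) / len(predictions))

def mae(predictions):
    """Mean absolute error over (user, true, estimate) triples."""
    return sum(abs(true - est) for _, true, est in predictions) / len(predictions)

def fcp(predictions):
    """Fraction of same-user rating pairs whose order the estimates preserve."""
    by_user = defaultdict(list)
    for user, true, est in predictions:
        by_user[user].append((true, est))
    concordant = discordant = 0
    for rated in by_user.values():
        for idx, (true_a, est_a) in enumerate(rated):
            for true_b, est_b in rated[idx + 1:]:
                if true_a == true_b:
                    continue  # equal true ratings define no ordering
                if (true_a - true_b) * (est_a - est_b) > 0:
                    concordant += 1
                else:
                    discordant += 1
    total = concordant + discordant
    return concordant / total if total else 0.0

# Toy predictions (made up for this sketch)
toy_predictions = [
    ("alice", 5.0, 4.5), ("alice", 3.0, 3.5), ("alice", 1.0, 2.0),
    ("bob", 4.0, 3.0), ("bob", 2.0, 3.5),
]
toy_rmse = rmse(toy_predictions)
toy_mae = mae(toy_predictions)
toy_fcp = fcp(toy_predictions)
print(toy_rmse, toy_mae, toy_fcp)
```

RMSE and MAE measure how far the predicted ratings are from the true ones, whereas FCP only looks at whether each user's items are ranked in the right order.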

Usage

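import os
from surprise import Reader, Dataset, SVD
from surprise.model_selection import cross_validate

# Path to the local ratings file (raw string, so the backslashes are kept literally)
file_path = os.path.expanduser(r"E:\RecommendSystem\code\my_project\SVD1\ml_data - item\data.txt")
# Describe the file layout: one tab-separated "user item rating" triple per line
reader = Reader(line_format='user item rating', sep='\t')
data = Dataset.load_from_file(file_path, reader=reader)
# Use the SVD algorithm
algo = SVD()
# Measure accuracy with RMSE and MAE over 5-fold cross-validation.
# evaluate() and data.split() were removed in Surprise 1.1; cross_validate() replaces them.
perf = cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)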
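
The Reader above expects one tab-separated "user item rating" triple per line. As a small sketch of that file layout, the snippet below writes such a file with the standard library and parses it back. The ids, ratings, and the temporary path are made up for illustration and are not the author's data.

```python
import csv
import os
import tempfile

# Toy (user, item, rating) rows; the values are hypothetical
rows = [("196", "242", 3.0), ("186", "302", 3.0), ("22", "377", 1.0)]

# Write one tab-separated triple per line
path = os.path.join(tempfile.mkdtemp(), "data.txt")
with open(path, "w", newline="") as f:
    csv.writer(f, delimiter="\t").writerows(rows)

# Read the file back to confirm each line holds user, item, rating
with open(path, newline="") as f:
    parsed = [(user, item, float(rating))
              for user, item, rating in csv.reader(f, delimiter="\t")]
print(parsed)
```

A file written this way has the layout that the `line_format` and `sep` arguments in the usage code describe.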

For more examples, see the blog post "Python Recommender System Library: Surprise" (Python推荐系统库——Surprise).

LightFM

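Official documentation: http://lyst.github.io/lightfm/docs/home.html
GitHub: https://github.com/lyst/lightfm
LightFM is a Python implementation of a number of popular recommendation algorithms for both implicit and explicit feedback, including efficient implementations of the BPR and WARP ranking losses. It is easy to use, fast (thanks to multithreaded model estimation), and produces high-quality results.
It also makes it possible to incorporate both item and user metadata into traditional matrix factorization algorithms. It represents each user and item as the sum of the latent representations of their features, which allows recommendations to generalise to new items (via item features) and new users (via user features).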
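
The "sum of latent feature representations" idea can be sketched without the library. In the toy example below a user and an item are each described by a dictionary of feature weights; their embedding is the weighted sum of the feature embeddings, and the score is the dot product plus the summed biases. All names, feature labels, and numbers are assumptions for illustration, not values learned by LightFM.

```python
def combine(feature_weights, table):
    """Weighted sum of the embedding rows in `table` named by `feature_weights`."""
    dim = len(next(iter(table.values())))
    total = [0.0] * dim
    for feature, weight in feature_weights.items():
        for d, value in enumerate(table[feature]):
            total[d] += weight * value
    return total

def score(user_features, item_features, user_emb, item_emb, user_bias, item_bias):
    """Dot product of the summed embeddings plus the summed feature biases."""
    q_u = combine(user_features, user_emb)
    p_i = combine(item_features, item_emb)
    b_u = sum(w * user_bias[f] for f, w in user_features.items())
    b_i = sum(w * item_bias[f] for f, w in item_features.items())
    return sum(q * p for q, p in zip(q_u, p_i)) + b_u + b_i

# Hypothetical "learned" parameters, hard-coded for the sketch
user_emb = {"user:alice": [2.0, 0.0]}
user_bias = {"user:alice": 0.0}
item_emb = {"genre:comedy": [1.0, 0.0], "genre:action": [0.0, 1.0]}
item_bias = {"genre:comedy": 0.5, "genre:action": 0.5}

alice = {"user:alice": 1.0}
# Two brand-new items with no interaction history, described only by genre
new_comedy = {"genre:comedy": 1.0}
new_action = {"genre:action": 1.0}

comedy_score = score(alice, new_comedy, user_emb, item_emb, user_bias, item_bias)
action_score = score(alice, new_action, user_emb, item_emb, user_bias, item_bias)
print(comedy_score, action_score)
```

Because the new items are scored purely through their genre features, the model can rank them for alice even though nobody has interacted with them yet.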

import numpy as np
from lightfm.datasets import fetch_movielens
from lightfm import LightFM
from lightfm.evaluation import precision_at_k

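# Download the MovieLens dataset and preprocess it into sparse matrices ready for training.
# The result is a sparse user-item matrix with positive entries where a user interacted
# with an item and zeros elsewhere; min_rating=5.0 keeps only 5-star ratings as positives.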
data = fetch_movielens(min_rating=5.0)

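# Use the WARP loss. WARP is an implicit-feedback model: every interaction in the training
# matrix is treated as a positive signal, and items a user has not interacted with are
# treated as items the user implicitly dislikes.
# The model aims to score these implicit positives highly and the implicit negatives low.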
model = LightFM(loss='warp')
model.fit(data['train'], epochs=30, num_threads=2)

print("Train precision: %.2f" % precision_at_k(model, data['train'], k=5).mean())
print("Test precision: %.2f" % precision_at_k(model, data['test'], k=5).mean())
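
For intuition about the number printed above, here is a minimal single-user sketch of precision at k: the share of the k highest-scored items that are known positives. The library computes this per user and the code above averages the results; the helper name and the toy scores below are assumptions.

```python
def precision_at_k_single_user(scores, positives, k=5):
    """Fraction of the k best-scored items that the user really interacted with.

    scores:    dict mapping item id -> predicted score for one user
    positives: set of item ids the user is known to have interacted with
    """
    top_k = sorted(scores, key=scores.get, reverse=True)[:k]
    hits = sum(1 for item in top_k if item in positives)
    return hits / k

# Toy scores for one user (made up for this sketch)
toy_scores = {"m1": 0.9, "m2": 0.8, "m3": 0.4, "m4": 0.3, "m5": 0.1, "m6": 0.05}
toy_positives = {"m1", "m3", "m6"}

p_at_5 = precision_at_k_single_user(toy_scores, toy_positives, k=5)
print(p_at_5)
```

Here two of the five top-ranked items (m1 and m3) are positives; m6 is a positive too, but it is ranked sixth and so does not count.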

LightFM supports four loss functions:

Loss function | Description
--- | ---
logistic | Useful when both positive (1) and negative (-1) interactions are present.
BPR | Bayesian Personalised Ranking pairwise loss. Maximises the prediction difference between a positive example and a randomly chosen negative example. Useful when only positive interactions are present and optimising ROC AUC is desired.
WARP | Weighted Approximate-Rank Pairwise loss. Maximises the rank of positive examples by repeatedly sampling negative examples until a rank-violating one is found. Useful when only positive interactions are present and optimising the top of the recommendation list (precision@k) is desired.
k-OS WARP | k-th order statistic loss. A modification of WARP that uses the k-th positive example for any given user as a basis for pairwise updates.
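
To show how the two pairwise losses differ, the sketch below implements the BPR loss for one (positive, negative) pair and a simplified version of the WARP sampling step: negatives are drawn until one scores within a margin of the positive, and the update weight grows with the estimated rank of the positive. This follows the general idea of the losses rather than LightFM's exact code; the function names, the margin of 1.0, and the toy scores are assumptions.

```python
import math
import random

def bpr_loss(pos_score, neg_score):
    """Negative log-sigmoid of the score difference for one sampled pair."""
    return -math.log(1.0 / (1.0 + math.exp(-(pos_score - neg_score))))

def warp_update_weight(pos_score, neg_scores, max_sampled=10, margin=1.0, rng=None):
    """Sample negatives until one violates the margin.

    Returns the log of the estimated rank as the update weight,
    or None when no violating negative is found within max_sampled draws.
    """
    rng = rng or random.Random(0)
    n_items = len(neg_scores) + 1  # the negatives plus the one positive
    for n_sampled in range(1, max_sampled + 1):
        neg_score = rng.choice(neg_scores)
        if neg_score > pos_score - margin:
            # Few draws needed -> the positive is probably ranked low -> large weight
            rank_estimate = (n_items - 1) // n_sampled
            return math.log(max(1.0, rank_estimate))
    return None

# A well-ranked positive: no negative comes close, so no update is made
well_ranked = warp_update_weight(5.0, [-10.0] * 9)
# A badly ranked positive: the first sampled negative already violates the margin
badly_ranked = warp_update_weight(0.0, [5.0] * 9)

print(bpr_loss(2.0, 0.0), bpr_loss(0.0, 2.0))
print(well_ranked, badly_ranked)
```

The `max_sampled` argument here plays the same role as the LightFM parameter of that name: it caps how long the search for a violating negative may run.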

lightfm.LightFM(no_components=10, k=5, n=10, learning_schedule='adagrad', loss='logistic', learning_rate=0.05, rho=0.95, epsilon=1e-06, item_alpha=0.0, user_alpha=0.0, max_sampled=10, random_state=None)
no_components (int, optional) – the dimensionality of the feature latent embeddings.
k (int, optional) – for k-OS training, the k-th positive example will be selected from the n positive examples sampled for every user.
n (int, optional) – for k-OS training, maximum number of positives sampled for each update.
learning_schedule (string, optional) – one of ('adagrad', 'adadelta').
loss (string, optional) – one of ('logistic', 'bpr', 'warp', 'warp-kos'): the loss function.
learning_rate (float, optional) – initial learning rate for the adagrad learning schedule.
rho (float, optional) – moving average coefficient for the adadelta learning schedule.
epsilon (float, optional) – conditioning parameter for the adadelta learning schedule.
item_alpha (float, optional) – L2 penalty on item features. Tip: setting this number too high can slow down training. One good way to check is if the final weights in the embeddings turned out to be mostly zero. The same idea applies to the user_alpha parameter.
user_alpha (float, optional) – L2 penalty on user features.
max_sampled (int, optional) – maximum number of negative samples used during WARP fitting. It requires a lot of sampling to find negative triplets for users that are already well represented by the model; this can lead to very long training times and overfitting. Setting this to a higher number will generally lead to longer training times, but may in some cases improve accuracy.
random_state (int seed, RandomState instance, or None) – The seed of the pseudo random number generator to use when shuffling the data and initializing the parameters.

The remaining entries are attributes of the fitted model rather than constructor parameters:
item_embeddings (np.float32 array of shape [n_item_features, n_components]) – Contains the estimated latent vectors for item features. The [i, j]-th entry gives the value of the j-th component for the i-th item feature. In the simplest case where the item feature matrix is an identity matrix, the i-th row will represent the i-th item latent vector.
user_embeddings (np.float32 array of shape [n_user_features, n_components]) – Contains the estimated latent vectors for user features. The [i, j]-th entry gives the value of the j-th component for the i-th user feature. In the simplest case where the user feature matrix is an identity matrix, the i-th row will represent the i-th user latent vector.
item_biases (np.float32 array of shape [n_item_features,]) – Contains the biases for item_features.
user_biases (np.float32 array of shape [n_user_features,]) – Contains the biases for user_features.

Model training

fit(interactions, user_features=None, item_features=None, sample_weight=None, epochs=1, num_threads=1, verbose=False)
interactions (np.float32 coo_matrix of shape [n_users, n_items]) – the matrix containing user-item interactions. Will be converted to numpy.float32 dtype if it is not of that type.
user_features (np.float32 csr_matrix of shape [n_users, n_user_features], optional) – Each row contains that user’s weights over features.
item_features (np.float32 csr_matrix of shape [n_items, n_item_features], optional) – Each row contains that item’s weights over features.
sample_weight (np.float32 coo_matrix of shape [n_users, n_items], optional) – matrix with entries expressing weights of individual interactions from the interactions matrix. Its row and col arrays must be the same as those of the interactions matrix. For memory efficiency it's possible to use the same arrays for both weights and interaction matrices. Defaults to weight 1.0 for all interactions. Not implemented for the k-OS loss.
epochs (int, optional) – number of epochs to run
num_threads (int, optional) – Number of parallel computation threads to use. Should not be higher than the number of physical cores.
verbose (bool, optional) – whether to print progress messages.
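
Since fit() expects interactions as a COO matrix of shape [n_users, n_items], raw (user, item) pairs first have to be mapped to contiguous integer indices. The sketch below does that mapping in plain Python and returns COO-style row, column, and data lists. The helper name `build_interaction_triples` and the toy pairs are assumptions; the SciPy call in the final comment shows how the lists would typically be turned into a sparse matrix.

```python
def build_interaction_triples(pairs):
    """Map raw ids to contiguous indices and return COO-style lists.

    pairs: iterable of (user_id, item_id) positive interactions, without duplicates.
    """
    user_index, item_index = {}, {}
    rows, cols, data = [], [], []
    for user, item in pairs:
        # setdefault assigns the next free index the first time an id is seen
        rows.append(user_index.setdefault(user, len(user_index)))
        cols.append(item_index.setdefault(item, len(item_index)))
        data.append(1.0)  # every observed interaction is a positive
    shape = (len(user_index), len(item_index))
    return rows, cols, data, shape, user_index, item_index

# Toy interactions (made up for this sketch)
toy_pairs = [("alice", "m1"), ("bob", "m1"), ("alice", "m2")]
rows, cols, data, shape, user_index, item_index = build_interaction_triples(toy_pairs)
print(rows, cols, data, shape)

# With SciPy and NumPy installed, the lists convert to a sparse matrix like this:
#   scipy.sparse.coo_matrix((data, (rows, cols)), shape=shape, dtype=np.float32)
```

LightFM also ships a `lightfm.data.Dataset` helper that builds these mappings for you, which is usually the more convenient route for real data.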

References:
https://blog.csdn.net/mycafe_/article/details/79146764
https://blog.csdn.net/m0_37586991/article/details/79943400

Reposted from blog.csdn.net/qq_24852439/article/details/89476872