Collaborative filtering algorithms in recommendation systems are generally divided into two categories:
- Behavior-based collaborative filtering (Memory-Based CF) uses user behavior data to compute similarity, including similarity between users and similarity between items.
- Model-based collaborative filtering (Model-Based CF) uses machine learning algorithms to predict user preferences and is better suited to sparse user data.
This article mainly introduces model-based collaborative filtering.
1. Model-Based CF: model-based collaborative filtering
A machine learning model is trained on the user-item rating matrix to predict a user's rating of an item. The approaches fall mainly into the following categories:
- Algorithms based on classification, regression, or clustering
- Algorithms based on matrix factorization
- Algorithms based on neural networks
- Algorithms based on graphical models
2. Collaborative filtering based on a regression model
A regression model predicts a continuous value, so we treat the rating as continuous and adopt the following Baseline (baseline prediction) strategy. The idea is to exploit the fact that individual preferences differ:
Some users are generous and rate higher than other users, while some are harsh and rate lower. Likewise, some items are popular and are rated higher than typical items, while disliked items are rated lower.
The Baseline approach finds the bias $b_u$ of each user relative to other users and the bias $b_i$ of each item relative to other items, so the goal becomes finding the optimal $b_u$ and $b_i$. The steps of the Baseline algorithm are as follows:
- Calculate the average rating $\mu$ of all movies;
- Calculate the bias $b_u$ of each user's ratings relative to the average rating;
- Calculate the bias $b_i$ of each movie's ratings relative to the average rating;
- Predict the user's rating of the movie: $\hat{r}_{ui} = b_{ui} = \mu + b_u + b_i$
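The steps above can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: the toy `ratings` list and the helper names are hypothetical, and each bias is estimated as a plain average deviation from the global mean rather than by optimizing a loss function.

```python
from collections import defaultdict

# Hypothetical toy data: (user, movie, rating)
ratings = [
    ("A", "m1", 5.0), ("A", "m2", 4.0),
    ("B", "m1", 3.0), ("B", "m2", 2.0),
]

# Step 1: global average rating (mu)
mu = sum(r for _, _, r in ratings) / len(ratings)

def mean_deviation(key_index):
    """Average of (rating - mu), grouped by user (0) or movie (1)."""
    groups = defaultdict(list)
    for row in ratings:
        groups[row[key_index]].append(row[2] - mu)
    return {key: sum(devs) / len(devs) for key, devs in groups.items()}

# Steps 2 and 3: user and item biases as simple average deviations
b_u = mean_deviation(0)
b_i = mean_deviation(1)

def predict(user, movie):
    # Step 4: predicted rating = mu + b_u + b_i
    return mu + b_u[user] + b_i[movie]

print(mu, b_u, b_i, predict("A", "m1"))
```

In this toy data user A rates one point above the mean and user B one point below, which is exactly what the two user biases capture.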
Take user A's rating of "Fengshen Part I" as an example:
- First calculate the average rating of all movies, $\mu = 3.5$;
- User A is generous and generally rates 1 point above the average, so the bias is $b_u = 1$;
- "Fengshen Part I" drew many bad reviews at release and scores 0.5 points below the average, so the bias is $b_i = -0.5$;
- User A's predicted rating for "Fengshen Part I" is therefore $3.5 + 1 - 0.5 = 4.0$ points.
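The same arithmetic as a tiny function, as a minimal sketch; the name `baseline_predict` is hypothetical and chosen only for this illustration.

```python
def baseline_predict(mu, b_u, b_i):
    """Baseline prediction: global mean plus user bias plus item bias."""
    return mu + b_u + b_i

# Values from the example: mean 3.5, generous user (+1), poorly reviewed movie (-0.5)
rating = baseline_predict(mu=3.5, b_u=1.0, b_i=-0.5)
print(rating)  # 4.0
```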
As in a regression problem, we use the squared error to construct the loss function (with $\mu$ the average rating of all movies):
$$Cost = \sum_{u,i \in R}(r_{ui}-\hat{r}_{ui})^2 = \sum_{u,i \in R}(r_{ui}-\mu-b_u-b_i)^2$$
To prevent overfitting, we add an L2 regularization term, so the final loss function is:
$$Cost = \sum_{u,i \in R}(r_{ui}-\mu-b_u-b_i)^2 + \lambda\left(\sum_u b_u^2+\sum_i b_i^2\right)$$
We want to minimize this loss function, which is generally done with stochastic gradient descent or alternating least squares.
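To make the regularized loss concrete, here is a minimal sketch in plain Python. The function name `regularized_cost`, the toy ratings, and the dictionary layout of the biases are assumptions made for illustration only.

```python
def regularized_cost(ratings, mu, b_u, b_i, lam):
    """Squared error over observed ratings plus an L2 penalty on all biases."""
    squared_error = sum(
        (r - mu - b_u[u] - b_i[i]) ** 2 for u, i, r in ratings
    )
    penalty = lam * (
        sum(b ** 2 for b in b_u.values()) + sum(b ** 2 for b in b_i.values())
    )
    return squared_error + penalty

# Hypothetical toy data: two users rating the same movie
ratings = [("A", "m1", 5.0), ("B", "m1", 3.0)]
mu = 4.0

# All biases at zero: only the squared error contributes
zero_cost = regularized_cost(ratings, mu, {"A": 0.0, "B": 0.0}, {"m1": 0.0}, lam=0.1)
# Biases that fit the data exactly: only the penalty contributes
fit_cost = regularized_cost(ratings, mu, {"A": 1.0, "B": -1.0}, {"m1": 0.0}, lam=0.1)
print(zero_cost, fit_cost)
```

The second call shows the trade-off the penalty introduces: biases that reproduce the ratings exactly still pay a cost proportional to their squared size.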
2.1 Baseline stochastic gradient descent algorithm
step1: Derivation of the gradient descent method
The loss function is a function of the biases:
$$J(\theta) = f(b_u, b_i)$$
The general gradient descent parameter update is:
$$\theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta)$$
Taking the partial derivative with respect to $b_u$:
$$\frac{\partial}{\partial b_u}J(\theta) = \frac{\partial}{\partial b_u}f(b_u,b_i) = -2\sum_{u,i \in R}(r_{ui}-\mu-b_u-b_i) + 2\lambda b_u$$
Substituting into the gradient descent update formula (the constant factor 2 is absorbed into the learning rate $\alpha$):
$$b_u := b_u+\alpha\left(\sum_{u,i \in R}(r_{ui}-\mu-b_u-b_i) - \lambda b_u\right)$$
Similarly for $b_i$:
$$b_i := b_i+\alpha\left(\sum_{u,i \in R}(r_{ui}-\mu-b_u-b_i) - \lambda b_i\right)$$
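One way to sanity-check the partial derivative is to compare it with a finite-difference estimate. The sketch below does this in plain Python; the function names and toy values are hypothetical, and the sum in the gradient is taken over the ratings made by the chosen user, since only those terms depend on that user's bias.

```python
def cost(ratings, mu, b_u, b_i, lam):
    """Regularized squared-error loss over all observed ratings."""
    err = sum((r - mu - b_u[u] - b_i[i]) ** 2 for u, i, r in ratings)
    reg = sum(v * v for v in b_u.values()) + sum(v * v for v in b_i.values())
    return err + lam * reg

def grad_b_u(ratings, mu, b_u, b_i, lam, user):
    """Analytic partial derivative of the loss with respect to b_u[user]."""
    residual = sum(r - mu - b_u[u] - b_i[i] for u, i, r in ratings if u == user)
    return -2 * residual + 2 * lam * b_u[user]

# Hypothetical toy values
ratings = [("A", "m1", 5.0), ("A", "m2", 4.0), ("B", "m1", 3.0)]
mu, lam = 4.0, 0.1
b_u = {"A": 0.2, "B": -0.1}
b_i = {"m1": 0.3, "m2": -0.3}

analytic = grad_b_u(ratings, mu, b_u, b_i, lam, "A")

# Central finite difference around the current value of b_u["A"]
eps = 1e-6
plus = dict(b_u, A=b_u["A"] + eps)
minus = dict(b_u, A=b_u["A"] - eps)
numeric = (cost(ratings, mu, plus, b_i, lam) - cost(ratings, mu, minus, b_i, lam)) / (2 * eps)
print(analytic, numeric)
```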
step2: Stochastic gradient descent
Stochastic gradient descent updates the parameters using the loss of one sample at a time, instead of computing the loss summed over all samples at every step.
Error for a single sample:
$$error = r_{ui} - \hat{r}_{ui} = r_{ui} - \mu - b_u - b_i$$
The update formulas therefore become:
$$b_u := b_u + \alpha(error - \lambda b_u)$$
$$b_i := b_i + \alpha(error - \lambda b_i)$$
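A single stochastic update can be traced by hand. The following sketch applies the two update formulas to one rating; `sgd_step` and the numeric values are hypothetical, and both biases are updated from the same error computed before either update.

```python
def sgd_step(r_ui, mu, b_u, b_i, alpha, lam):
    """One stochastic update for a single observed rating."""
    error = r_ui - (mu + b_u + b_i)
    new_b_u = b_u + alpha * (error - lam * b_u)
    new_b_i = b_i + alpha * (error - lam * b_i)
    return new_b_u, new_b_i, error

# Rating of 4.5 against a mean of 3.5, starting from zero biases
b_u, b_i, error = sgd_step(r_ui=4.5, mu=3.5, b_u=0.0, b_i=0.0, alpha=0.1, lam=0.1)
print(error, b_u, b_i)
```

With zero initial biases the regularization term vanishes, so each bias simply moves by the learning rate times the error.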
step3: Algorithm implementation
Import modules and data
# Implementation of the stochastic gradient descent algorithm
import pandas as pd
import numpy as np
df = pd.read_csv("ml-latest-small/ratings.csv", usecols=range(3))
df
Implementation of Baseline Gradient Descent Algorithm
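class BaselineCFBySGD(object):
    '''Baseline collaborative filtering trained with stochastic gradient descent.

    max_epochs: number of gradient descent passes over the data
    alpha: learning rate
    reg: regularization coefficient (guards against overfitting)
    columns: names of the user, item and rating columns'''

    def __init__(self, max_epochs, alpha, reg, columns=['uid', 'mid', 'rating']):
        self.max_epochs = max_epochs
        self.alpha = alpha
        self.reg = reg
        self.columns = columns

    def fit(self, data):
        '''
        :param data: DataFrame with the columns uid, mid, rating
        :return: None'''
        self.data = data
        # Ratings grouped by user
        self.users_rating = data.groupby(self.columns[0]).agg([list])[[self.columns[1], self.columns[2]]]
        # Ratings grouped by movie
        self.items_rating = data.groupby(self.columns[1]).agg([list])[[self.columns[0], self.columns[2]]]
        # Global mean rating
        self.global_mean = self.data[self.columns[2]].mean()
        # Train the model parameters with stochastic gradient descent
        self.bu, self.bi = self.sgd()

    def sgd(self):
        '''
        Optimize bu and bi with stochastic gradient descent
        :return: bu, bi'''
        # Initialize every user bias and item bias to 0
        bu = dict(zip(self.users_rating.index, np.zeros(len(self.users_rating))))
        bi = dict(zip(self.items_rating.index, np.zeros(len(self.items_rating))))
        for i in range(self.max_epochs):
            # Read the DataFrame row by row and apply the update formulas
            for uid, mid, real_rating in self.data[self.columns].itertuples(index=False):
                error = real_rating - (self.global_mean + bu[uid] + bi[mid])
                bu[uid] += self.alpha * (error - self.reg * bu[uid])
                bi[mid] += self.alpha * (error - self.reg * bi[mid])
        return bu, bi

    def predict(self, uid, mid):
        '''
        Predict a rating with the baseline formula
        :param uid, mid: user id and movie id
        :return: predicted rating'''
        predict_rating = self.global_mean + self.bu[uid] + self.bi[mid]
        return predict_rating

    def test(self, testset):
        '''
        Predict the ratings in the test set
        :param testset: DataFrame with the columns uid, mid, rating
        :return: generator of (uid, mid, real_rating, pred_rating)'''
        for uid, mid, real_rating in testset.itertuples(index=False):
            try:
                # Predict with the predict method
                pred_rating = self.predict(uid, mid)
            except Exception as e:
                # Users or movies missing from the training set raise KeyError
                print(e)
            else:
                # Yield the result lazily (this method is a generator)
                yield uid, mid, real_rating, pred_rating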
Test set and training set partition function
# Split the data into a training set and a test set
def data_split(data_path, x=0.8, random=False):
    '''x: fraction of each user's ratings kept for training
    random: whether to shuffle each user's ratings before splitting'''
    ratings = pd.read_csv(data_path, usecols=range(3))
    testset_index = []
    # Split each user's ratings separately
    for uid in ratings.groupby('userId').any().index:
        user_rating_data = ratings.where(ratings['userId']==uid).dropna()
        if random:
            index = list(user_rating_data.index)
            np.random.shuffle(index)
            _index = round(len(user_rating_data)*x)
            testset_index += list(index[_index:])
        else:
            index = round(len(user_rating_data)*x)
            testset_index += list(user_rating_data.index.values[index:])
    testset = ratings.loc[testset_index]
    trainset = ratings.drop(testset_index)
    return trainset, testset
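The per-user hold-out idea can also be shown without pandas. The sketch below is a simplified stand-in that works on in-memory tuples rather than a CSV file; `split_per_user`, its parameters, and the toy rows are hypothetical names introduced for this illustration.

```python
import random
from collections import defaultdict

def split_per_user(rows, train_fraction=0.8, shuffle=False, seed=0):
    """Keep the first part of each user's ratings for training, the rest for testing."""
    by_user = defaultdict(list)
    for row in rows:
        by_user[row[0]].append(row)
    rng = random.Random(seed)
    train, test = [], []
    for user_rows in by_user.values():
        user_rows = list(user_rows)
        if shuffle:
            rng.shuffle(user_rows)
        cut = round(len(user_rows) * train_fraction)
        train.extend(user_rows[:cut])
        test.extend(user_rows[cut:])
    return train, test

# Hypothetical toy rows: user A has 10 ratings, user B has 5
rows = [("A", f"m{k}", 3.0) for k in range(10)] + [("B", f"m{k}", 4.0) for k in range(5)]
train, test = split_per_user(rows)
print(len(train), len(test))
```

Splitting inside each user's ratings, rather than over the whole table, keeps every user with enough ratings represented in the training set, so a bias can be learned for them.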
Algorithm evaluation function
def accuray(predict_results, method='all'):
    # Root mean squared error
    def rmse(predict_results):
        length = 0
        _rmse_sum = 0
        for uid, mid, real_rating, pred_rating in predict_results.itertuples(index=False):
            length += 1
            _rmse_sum += (pred_rating - real_rating)**2
        return round(np.sqrt(_rmse_sum/length), 4)

    # Mean absolute error
    def mae(predict_results):
        length = 0
        _mae_sum = 0
        for uid, mid, real_rating, pred_rating in predict_results.itertuples(index=False):
            length += 1
            _mae_sum += abs(pred_rating - real_rating)
        return round(_mae_sum/length, 4)

    # Both metrics in a single pass
    def rmse_mae(predict_results):
        length = 0
        _rmse_sum = 0
        _mae_sum = 0
        for uid, mid, real_rating, pred_rating in predict_results.itertuples(index=False):
            length += 1
            _mae_sum += abs(pred_rating - real_rating)
            _rmse_sum += (pred_rating - real_rating)**2
        return round(np.sqrt(_rmse_sum/length), 4), round(_mae_sum/length, 4)

    # Return the result of the evaluation method selected by `method`
    if method.lower() == 'rmse':
        return rmse(predict_results)
    elif method.lower() == 'mae':
        return mae(predict_results)
    else:
        return rmse_mae(predict_results)
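For reference, the two metrics can be checked by hand on a few values. This is a minimal sketch in plain Python; the function below is a separate, hypothetical helper that takes (real, predicted) pairs rather than a DataFrame.

```python
import math

def rmse_mae(pairs):
    """pairs: iterable of (real_rating, predicted_rating)."""
    pairs = list(pairs)
    n = len(pairs)
    rmse = math.sqrt(sum((p - r) ** 2 for r, p in pairs) / n)
    mae = sum(abs(p - r) for r, p in pairs) / n
    return round(rmse, 4), round(mae, 4)

# Errors of 0.5, 0 and 1: RMSE penalizes the largest error more than MAE does
result = rmse_mae([(4.0, 3.5), (3.0, 3.0), (5.0, 4.0)])
print(result)  # (0.6455, 0.5)
```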
Running the algorithm and the evaluation function on the data
trainset, testset = data_split('ml-latest-small/ratings.csv',random=True)
bcf = BaselineCFBySGD(20,0.1,0.1,['userId','movieId','rating'])
bcf.fit(trainset)
pred_test = bcf.test(testset)
# Materialize the generator as a list, then convert it to a DataFrame
df_pred = pd.DataFrame(list(pred_test), columns=['userId','movieId','rating','pred_rating'])
rmse, mae = accuray(df_pred,'all')
print('rmse:',rmse,';mae:',mae)
rmse: 0.8647 ;mae: 0.6595
2.2 Baseline Alternating Least Squares Algorithm
step1: Derivation of the alternating least squares method
Core idea: take the partial derivatives of the loss function and set them to 0.
The loss function is as follows (with $\mu$ the global average rating):
$$J(\theta) = f(b_u, b_i)$$
Taking the partial derivative with respect to $b_u$:
$$\frac{\partial}{\partial b_u}J(\theta) = \frac{\partial}{\partial b_u}f(b_u,b_i) = -2\sum_{u,i \in R}(r_{ui}-\mu-b_u-b_i) + 2\lambda b_u$$
Setting the partial derivative to 0 gives:
$$\sum_{u,i \in R}(r_{ui}-\mu-b_u-b_i) = \lambda b_u$$
$$\sum_{u,i \in R}(r_{ui}-\mu-b_i) = \sum_{u \in R}b_u+\lambda b_u$$
For convenience of calculation, let $\sum_{u \in R}b_u \approx |R(u)| \cdot b_u$, which gives (writing the user regularization coefficient as $\lambda_1$):
$$b_u := \frac{\sum_{u,i \in R}(r_{ui}-\mu-b_i)}{\lambda_1+|R(u)|}$$
where $|R(u)|$ is the number of ratings made by user $u$.
Similarly, with item regularization coefficient $\lambda_2$ and $|R(i)|$ the number of ratings received by item $i$:
$$b_i := \frac{\sum_{u,i \in R}(r_{ui}-\mu-b_u)}{\lambda_2+|R(i)|}$$
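The closed-form expression for a user bias is easy to evaluate by hand. The sketch below does so for one user in plain Python; `solve_user_bias` and the numbers are hypothetical, with the item biases treated as already known.

```python
def solve_user_bias(user_ratings, mu, b_i, lam):
    """user_ratings: list of (movie, rating) pairs for a single user."""
    residual = sum(r - mu - b_i[m] for m, r in user_ratings)
    return residual / (lam + len(user_ratings))

# Item biases assumed fixed for this step
b_i = {"m1": 0.5, "m2": -0.5}
# Residuals are (5 - 3.5 - 0.5) + (4 - 3.5 + 0.5) = 2.0, divided by (2 + 2)
b_u = solve_user_bias([("m1", 5.0), ("m2", 4.0)], mu=3.5, b_i=b_i, lam=2.0)
print(b_u)  # 0.5
```

Without regularization the result would be the plain average residual of 1.0; the coefficient in the denominator shrinks it toward zero, and more strongly for users with few ratings.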
step2: Alternating least squares (ALS)
We have derived an expression for each bias, but each expression contains the other, so we use the alternating least squares method:
- Fix one of the two values and solve for the other;
- Then fix the second value and solve for the first; repeat these alternating updates until the result is obtained.
That is, to solve for $b_u$, first treat $b_i$ as known; to solve for $b_i$, first treat $b_u$ as known.
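The alternation can be sketched on a toy dataset in plain Python. This is an illustrative sketch, not the implementation used in this article: `als_baseline`, its parameters, and the toy ratings are hypothetical. It also records the regularized loss after every sweep so the effect of the alternating updates can be inspected.

```python
from collections import defaultdict

def als_baseline(ratings, lam_u, lam_i, sweeps=10):
    mu = sum(r for _, _, r in ratings) / len(ratings)
    by_user, by_item = defaultdict(list), defaultdict(list)
    for u, i, r in ratings:
        by_user[u].append((i, r))
        by_item[i].append((u, r))
    b_u = {u: 0.0 for u in by_user}
    b_i = {i: 0.0 for i in by_item}

    def loss():
        err = sum((r - mu - b_u[u] - b_i[i]) ** 2 for u, i, r in ratings)
        reg_u = lam_u * sum(v * v for v in b_u.values())
        reg_i = lam_i * sum(v * v for v in b_i.values())
        return err + reg_u + reg_i

    history = [loss()]
    for _ in range(sweeps):
        # Treat b_u as known and solve every b_i in closed form
        for i, rated in by_item.items():
            b_i[i] = sum(r - mu - b_u[u] for u, r in rated) / (lam_i + len(rated))
        # Treat b_i as known and solve every b_u in closed form
        for u, rated in by_user.items():
            b_u[u] = sum(r - mu - b_i[i] for i, r in rated) / (lam_u + len(rated))
        history.append(loss())
    return mu, b_u, b_i, history

# Hypothetical toy data: user A rates higher than user B on the same movies
ratings = [("A", "m1", 5.0), ("A", "m2", 4.0), ("B", "m1", 3.0),
           ("B", "m2", 2.0), ("C", "m1", 4.0)]
mu, b_u, b_i, history = als_baseline(ratings, lam_u=1.0, lam_i=1.0)
print(history[0], history[-1])
```

Each half-sweep minimizes the loss exactly over one group of biases while the other is held fixed, which is why the recorded loss does not go up from one sweep to the next.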
step3 : Algorithm implementation
The overall code is similar to the stochastic gradient descent
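# Implementation of the alternating least squares algorithm
class BaselineCFByALS(object):
    '''Baseline collaborative filtering trained with alternating least squares.

    max_epochs: number of alternating iterations
    reg_bu: regularization coefficient for the user biases
    reg_bi: regularization coefficient for the item biases
    columns: names of the user, item and rating columns'''

    def __init__(self, max_epochs, reg_bu, reg_bi, columns=['userId', 'movieId', 'rating']):
        self.max_epochs = max_epochs
        self.reg_bu = reg_bu
        self.reg_bi = reg_bi
        self.columns = columns

    def fit(self, data):
        '''
        :param data: DataFrame with the columns uid, mid, rating
        :return: None'''
        self.data = data
        # Ratings grouped by user
        self.users_rating = data.groupby(self.columns[0]).agg([list])[[self.columns[1], self.columns[2]]]
        # Ratings grouped by movie
        self.items_rating = data.groupby(self.columns[1]).agg([list])[[self.columns[0], self.columns[2]]]
        # Global mean rating
        self.global_mean = self.data[self.columns[2]].mean()
        # Train the model parameters with alternating least squares
        self.bu, self.bi = self.als()

    def als(self):
        '''
        Optimize bu and bi with alternating least squares
        :return: bu, bi'''
        # Initialize every user bias and item bias to 0
        bu = dict(zip(self.users_rating.index, np.zeros(len(self.users_rating))))
        bi = dict(zip(self.items_rating.index, np.zeros(len(self.items_rating))))
        for i in range(self.max_epochs):
            # Update bi with bu held fixed
            for mid, uids, real_ratings in self.items_rating.itertuples(index=True):
                _sum = 0
                for uid, rating in zip(uids, real_ratings):
                    _sum += rating - self.global_mean - bu[uid]
                bi[mid] = _sum/(self.reg_bi + len(uids))
            # Update bu with bi held fixed
            for uid, mids, real_ratings in self.users_rating.itertuples(index=True):
                _sum = 0
                for mid, rating in zip(mids, real_ratings):
                    _sum += rating - self.global_mean - bi[mid]
                bu[uid] = _sum/(self.reg_bu + len(mids))
        return bu, bi

    def predict(self, uid, mid):
        '''
        Predict a rating with the baseline formula
        :param uid, mid: user id and movie id
        :return: predicted rating'''
        predict_rating = self.global_mean + self.bu[uid] + self.bi[mid]
        return predict_rating

    def test(self, testset):
        '''
        Predict the ratings in the test set
        :param testset: DataFrame with the columns uid, mid, rating
        :return: generator of (uid, mid, real_rating, pred_rating)'''
        for uid, mid, real_rating in testset.itertuples(index=False):
            try:
                # Predict with the predict method
                pred_rating = self.predict(uid, mid)
            except Exception as e:
                # Users or movies missing from the training set raise KeyError
                print(e)
            else:
                # Yield the result lazily (this method is a generator)
                yield uid, mid, real_rating, pred_rating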
Applying the Least Squares Algorithm
trainset, testset = data_split('ml-latest-small/ratings.csv',random=True)
bcf = BaselineCFByALS(20,25,15,['userId','movieId','rating'])
bcf.fit(trainset)
pred_test = bcf.test(testset)
# Materialize the generator as a list, then convert it to a DataFrame
df_pred_als = pd.DataFrame(list(pred_test), columns=['userId','movieId','rating','pred_rating'])
rmse, mae = accuray(df_pred_als,'all')
print('rmse:',rmse,';mae:',mae)
rmse: 0.8403 ;mae: 0.6462