There are many matrix decomposition methods: traditional SVD, FunkSVD (LFM), BiasSVD, and SVD++.
- Traditional SVD
General SVD matrix decomposition refers to singular value decomposition, which factors a matrix into the product of three matrices, where the middle matrix is the diagonal matrix of singular values. SVD requires a dense matrix, but data in real scenarios is sparse, so SVD cannot be applied directly: missing entries must first be filled by the mean or some other method, which distorts the data to some degree. The specific formula is as follows:
$M_{m\times n} = U_{m\times k}\Sigma_{k\times k}V^T_{k\times n}$
- FunkSVD (LFM)
FunkSVD is the most basic LFM model. It decomposes the rating matrix into two matrices: a user latent-feature matrix and an item latent-feature matrix, in a process similar to linear regression. The formula below is the loss function after matrix decomposition, where p is the user latent-feature matrix and q is the item latent-feature matrix. We solve for p and q by minimizing the loss function; the second term is an L2 regularizer to prevent overfitting. Stochastic gradient descent is used to find the optimal solution.
$\mathop{min}\limits_{q,p}\sum_{(u,i)\in R}(r_{ui}-q^T_i p_u)^2+\lambda(||q_i||^2+||p_u||^2)$
- BiasSVD
BiasSVD is FunkSVD plus the bias terms of the Baseline benchmark prediction (global mean $\mu$, user bias $b_u$, item bias $b_i$). The specific formula is as follows:
$\mathop{min}\limits_{q,p}\sum_{(u,i)\in R}(r_{ui}-\mu-b_u-b_i-q^T_i p_u)^2+\lambda(||q_i||^2+||p_u||^2+||b_u||^2+||b_i||^2)$
- SVD++
SVD++ improves BiasSVD by adding the user's implicit feedback information on top of it. It is not covered further here.
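As a quick sketch of the traditional SVD route described above, the toy example below (the rating matrix values are hypothetical) fills missing entries with the mean of observed ratings and then truncates to the top-k singular values with NumPy:

```python
import numpy as np

# Hypothetical 4x3 rating matrix with missing entries (0 = unrated)
R = np.array([[5., 3., 0.],
              [4., 0., 1.],
              [1., 1., 5.],
              [0., 1., 4.]])

# SVD requires a dense matrix: fill missing entries with the observed mean
mean = R[R > 0].mean()
R_filled = np.where(R > 0, R, mean)

# Full SVD, then keep only the top-k singular values
U, s, Vt = np.linalg.svd(R_filled, full_matrices=False)
k = 2
R_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(R_approx.round(2))
```

This is exactly the weakness noted above: the filled-in mean values influence the decomposition as if they were real ratings.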
1 Analysis of LFM principle
The core idea of the LFM (latent factor model) is to link users and items through latent features.
P is the user to latent-feature matrix;
Q is the latent-feature to item matrix;
R is the User-Item rating matrix, which is approximated by P*Q.
We use matrix decomposition to factor the User-Item rating matrix into the P and Q matrices, and then use P*Q to restore the rating matrix, where $R_{11} = \vec{P_{1,k}}\cdot\vec{Q_{k,1}}$.
So the predicted rating is:
$\hat{r}_{ui} = \vec{p_{u,k}}\cdot\vec{q_{i,k}} = \sum_{k=1}^{K}p_{uk}q_{ik}$
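The scoring formula above is just the dot product of two latent vectors. A minimal sketch with hypothetical vectors for one user and one item (k = 3):

```python
import numpy as np

# hypothetical latent vectors for one user and one item (k = 3)
p_u = np.array([0.8, 0.2, 0.5])
q_i = np.array([4.0, 1.0, 2.0])

# predicted rating: sum over k of p_uk * q_ik
# 0.8*4.0 + 0.2*1.0 + 0.5*2.0 = 4.4
r_hat = np.dot(p_u, q_i)
```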
Loss function:
$Cost = \sum_{u,i\in R}(r_{ui}-\hat r_{ui})^2 = \sum_{u,i\in R}(r_{ui}-\sum_{k=1}^{K}p_{uk}q_{ik})^2$
Add an L2 regularization term to prevent overfitting:
$Cost = \sum_{u,i\in R}(r_{ui}-\sum_{k=1}^{K}p_{uk}q_{ik})^2 +\lambda(\sum_U p^2_{uk}+\sum_I q^2_{ik})$
Take the partial derivatives of the loss function:
$\frac{\partial}{\partial p_{uk}}Cost = 2\sum_{u,i\in R}(r_{ui}-\sum_{k=1}^{K}p_{uk}q_{ik})(-q_{ik}) + 2\lambda p_{uk}$
$\frac{\partial}{\partial q_{ik}}Cost = 2\sum_{u,i\in R}(r_{ui}-\sum_{k=1}^{K}p_{uk}q_{ik})(-p_{uk}) + 2\lambda q_{ik}$
Stochastic Gradient Descent Optimization
Gradient descent updates the parameters according to the partial derivatives:
$p_{uk}:=p_{uk} + \alpha[\sum_{u,i\in R}(r_{ui}-\sum_{k=1}^{K}p_{uk}q_{ik})q_{ik} - \lambda_1 p_{uk}]$
$q_{ik}:=q_{ik} + \alpha[\sum_{u,i\in R}(r_{ui}-\sum_{k=1}^{K}p_{uk}q_{ik})p_{uk} - \lambda_2 q_{ik}]$
Stochastic gradient descent applies the update to each sample vector:
$p_{uk}:=p_{uk} + \alpha[(r_{ui}-\sum_{k=1}^{K}p_{uk}q_{ik})q_{ik} - \lambda_1 p_{uk}]$
$q_{ik}:=q_{ik} + \alpha[(r_{ui}-\sum_{k=1}^{K}p_{uk}q_{ik})p_{uk} - \lambda_2 q_{ik}]$
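A single per-sample update step can be sketched as follows (the vectors, rating, and hyperparameter values are hypothetical):

```python
import numpy as np

alpha, lambda_1, lambda_2 = 0.02, 0.01, 0.01
p_u = np.array([0.5, 0.3])   # user latent vector (hypothetical)
q_i = np.array([0.4, 0.6])   # item latent vector (hypothetical)
r_ui = 4.0                   # observed rating (hypothetical)

# error of this single sample
err = r_ui - np.dot(p_u, q_i)
# update both vectors from their values before this step
p_u_new = p_u + alpha * (err * q_i - lambda_1 * p_u)
q_i_new = q_i + alpha * (err * p_u - lambda_2 * q_i)
```

Note that both updates use the pre-update values of p_u and q_i, matching the formulas above.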
Algorithm implementation
Only the code of the LFM algorithm is shown here. For the data set segmentation code and evaluation code, see: Big Data - Collaborative Filtering Recommendation Algorithm: Linear Regression Algorithm
```python
# FunkSVD
import numpy as np
import pandas as pd

class LFM(object):
    '''
    max_epochs: number of gradient-descent iterations
    alpha: learning rate
    p_reg, q_reg: regularization coefficients
    columns: names of the data fields
    '''
    def __init__(self, max_epochs, p_reg, q_reg, alpha, number_LatentFactors=30,
                 columns=['userId', 'movieId', 'rating']):
        self.max_epochs = max_epochs
        self.p_reg = p_reg
        self.q_reg = q_reg
        self.number_LatentFactors = number_LatentFactors  # number of latent features
        self.alpha = alpha
        self.columns = columns

    def fit(self, data):
        '''
        :param data: uid, mid, rating
        :return: None
        '''
        self.data = data
        # ratings grouped by user
        self.users_rating = data.groupby(self.columns[0]).agg([list])[[self.columns[1], self.columns[2]]]
        # ratings grouped by movie
        self.items_rating = data.groupby(self.columns[1]).agg([list])[[self.columns[0], self.columns[2]]]
        # global mean rating
        self.global_mean = self.data[self.columns[2]].mean()
        # train the model parameters with stochastic gradient descent
        self.p, self.q = self.sgd()

    def _init_matrix(self):
        # build p: one random latent vector per user (num_users x num_latent_factors)
        p = dict(zip(self.users_rating.index,
                     np.random.rand(len(self.users_rating), self.number_LatentFactors).astype(np.float32)))
        # build q: one random latent vector per item (num_items x num_latent_factors)
        q = dict(zip(self.items_rating.index,
                     np.random.rand(len(self.items_rating), self.number_LatentFactors).astype(np.float32)))
        return p, q

    def sgd(self):
        '''
        Optimize p and q by stochastic gradient descent.
        :return: p, q
        '''
        p, q = self._init_matrix()
        for i in range(self.max_epochs):
            error_list = []
            for uid, mid, r_ui in self.data.itertuples(index=False):
                v_pu = p[uid]   # user latent vector
                v_qi = q[mid]   # item latent vector
                err = np.float32(r_ui - np.dot(v_pu, v_qi))
                # update both vectors from their values before this step
                p[uid] = v_pu + self.alpha * (err * v_qi - self.p_reg * v_pu)
                q[mid] = v_qi + self.alpha * (err * v_pu - self.q_reg * v_qi)
                error_list.append(err ** 2)
        return p, q

    def predict(self, uid, mid):
        '''
        Predict a rating with the scoring formula.
        :param uid, mid:
        :return: predicted rating
        '''
        # fall back to the global mean for unseen users or items
        if uid not in self.users_rating.index or mid not in self.items_rating.index:
            return self.global_mean
        p_u = self.p[uid]
        q_i = self.q[mid]
        return np.dot(p_u, q_i)

    def test(self, testset):
        '''
        Predict every record in the test set.
        :param testset:
        :return: generator of (uid, mid, real_rating, pred_rating)
        '''
        for uid, mid, real_rating in testset.itertuples(index=False):
            try:
                pred_rating = self.predict(uid, mid)
            except Exception as e:
                print(e)
            else:
                yield uid, mid, real_rating, pred_rating
```
Test the LFM algorithm
```python
trainset, testset = data_split('ml-latest-small/ratings.csv', random=True)
lfm = LFM(100, 0.01, 0.01, 0.02, 10, ['userId', 'movieId', 'rating'])
lfm.fit(trainset)
pred_test = lfm.test(testset)
# materialize the generator with list(), then convert to a DataFrame
df_pred_LFM = pd.DataFrame(list(pred_test), columns=['userId', 'movieId', 'rating', 'pred_rating'])
rmse, mae = accuray(df_pred_LFM, 'all')
print('rmse:', rmse, ';mae:', mae)
```
rmse: 1.0718 ;mae: 0.8067
2 Principle of BiasSVD
BiasSVD adds bias terms to the FunkSVD matrix decomposition. The scoring formula is as follows:
$\hat r_{ui} = \mu+b_u+b_i+\vec{p_{u,k}}\cdot\vec{q_{k,i}} = \mu+b_u+b_i+\sum^{K}_{k=1}p_{uk}q_{ki}$
Loss function:
$Cost = \sum_{u,i\in R}(r_{ui}-\hat r_{ui})^2 = \sum_{u,i\in R}(r_{ui}-\mu-b_u-b_i-\sum^{K}_{k=1}p_{uk}q_{ki})^2$
Add the L2 regularization term:
$Cost = \sum_{u,i\in R}(r_{ui}-\mu-b_u-b_i-\sum^{K}_{k=1}p_{uk}q_{ki})^2 +\lambda(\sum_U b^2_u+\sum_I b^2_i+\sum_U p^2_{uk}+\sum_I q^2_{ik})$
Stochastic gradient descent optimization
Gradient descent updates the parameters:
$p_{uk}:=p_{uk} + \alpha[(r_{ui}-\mu-b_u-b_i-\sum^{K}_{k=1}p_{uk}q_{ki})q_{ik}-\lambda_1 p_{uk}]$
$q_{ik}:=q_{ik} + \alpha[(r_{ui}-\mu-b_u-b_i-\sum^{K}_{k=1}p_{uk}q_{ki})p_{uk}-\lambda_2 q_{ik}]$
$b_{u}:=b_{u} + \alpha[(r_{ui}-\mu-b_u-b_i-\sum^{K}_{k=1}p_{uk}q_{ki})-\lambda_3 b_{u}]$
$b_{i}:=b_{i} + \alpha[(r_{ui}-\mu-b_u-b_i-\sum^{K}_{k=1}p_{uk}q_{ki})-\lambda_4 b_{i}]$
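A single BiasSVD update step for the bias terms can be sketched with hypothetical values:

```python
import numpy as np

alpha = 0.02
lambda_3, lambda_4 = 0.01, 0.01
mu = 3.5                  # global mean (hypothetical)
b_u, b_i = 0.1, -0.2      # user and item biases (hypothetical)
p_u = np.array([0.5, 0.3])
q_i = np.array([0.4, 0.6])
r_ui = 4.0                # observed rating (hypothetical)

# error now subtracts the global mean and both biases
err = r_ui - mu - b_u - b_i - np.dot(p_u, q_i)
# bias updates follow the gradient formulas above
b_u += alpha * (err - lambda_3 * b_u)
b_i += alpha * (err - lambda_4 * b_i)
```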
Algorithm implementation
```python
# BiasSVD
class BiasSvd(object):
    '''
    max_epochs: number of gradient-descent iterations
    alpha: learning rate
    p_reg, q_reg, bu_reg, bi_reg: regularization coefficients
    columns: names of the data fields
    '''
    def __init__(self, alpha, p_reg, q_reg, bu_reg, bi_reg,
                 number_LatentFactors=10, max_epochs=10,
                 columns=['userId', 'movieId', 'rating']):
        self.max_epochs = max_epochs
        self.p_reg = p_reg
        self.q_reg = q_reg
        self.bu_reg = bu_reg
        self.bi_reg = bi_reg
        self.number_LatentFactors = number_LatentFactors  # number of latent features
        self.alpha = alpha
        self.columns = columns

    def fit(self, data):
        '''
        :param data: uid, mid, rating
        :return: None
        '''
        self.data = data
        # ratings grouped by user
        self.users_rating = data.groupby(self.columns[0]).agg([list])[[self.columns[1], self.columns[2]]]
        # ratings grouped by movie
        self.items_rating = data.groupby(self.columns[1]).agg([list])[[self.columns[0], self.columns[2]]]
        # global mean rating
        self.global_mean = self.data[self.columns[2]].mean()
        # train the model parameters with stochastic gradient descent
        self.p, self.q, self.bu, self.bi = self.sgd()

    def _init_matrix(self):
        # build p: one random latent vector per user
        p = dict(zip(self.users_rating.index,
                     np.random.rand(len(self.users_rating), self.number_LatentFactors).astype(np.float32)))
        # build q: one random latent vector per item
        q = dict(zip(self.items_rating.index,
                     np.random.rand(len(self.items_rating), self.number_LatentFactors).astype(np.float32)))
        return p, q

    def sgd(self):
        '''
        Optimize p, q and the bias terms by stochastic gradient descent.
        :return: p, q, bu, bi
        '''
        p, q = self._init_matrix()
        # bias terms start at zero
        bu = dict(zip(self.users_rating.index, np.zeros(len(self.users_rating))))
        bi = dict(zip(self.items_rating.index, np.zeros(len(self.items_rating))))
        for i in range(self.max_epochs):
            error_list = []
            for uid, mid, r_ui in self.data.itertuples(index=False):
                v_pu = p[uid]   # user latent vector
                v_qi = q[mid]   # item latent vector
                err = np.float32(r_ui - self.global_mean - bu[uid] - bi[mid] - np.dot(v_pu, v_qi))
                # update all parameters from their values before this step
                p[uid] = v_pu + self.alpha * (err * v_qi - self.p_reg * v_pu)
                q[mid] = v_qi + self.alpha * (err * v_pu - self.q_reg * v_qi)
                bu[uid] += self.alpha * (err - self.bu_reg * bu[uid])
                bi[mid] += self.alpha * (err - self.bi_reg * bi[mid])
                error_list.append(err ** 2)
        return p, q, bu, bi

    def predict(self, uid, mid):
        '''
        Predict a rating with the scoring formula.
        :param uid, mid:
        :return: predicted rating
        '''
        # fall back to the global mean for unseen users or items
        if uid not in self.users_rating.index or mid not in self.items_rating.index:
            return self.global_mean
        p_u = self.p[uid]
        q_i = self.q[mid]
        return self.global_mean + self.bu[uid] + self.bi[mid] + np.dot(p_u, q_i)

    def test(self, testset):
        '''
        Predict every record in the test set.
        :param testset:
        :return: generator of (uid, mid, real_rating, pred_rating)
        '''
        for uid, mid, real_rating in testset.itertuples(index=False):
            try:
                pred_rating = self.predict(uid, mid)
            except Exception as e:
                print(e)
            else:
                yield uid, mid, real_rating, pred_rating
```
Test the BiasSVD algorithm
```python
trainset, testset = data_split('ml-latest-small/ratings.csv', random=True)
bsvd = BiasSvd(0.02, 0.01, 0.01, 0.01, 0.01, 10, 20, ['userId', 'movieId', 'rating'])
bsvd.fit(trainset)
pred_test = bsvd.test(testset)
# materialize the generator with list(), then convert to a DataFrame
df_pred_bsvd = pd.DataFrame(list(pred_test), columns=['userId', 'movieId', 'rating', 'pred_rating'])
rmse, mae = accuray(df_pred_bsvd, 'all')
print('rmse:', rmse, ';mae:', mae)
```
rmse: 1.0718 ;mae: 0.8067