Big Data - Collaborative Filtering Recommendation Algorithm: Matrix Decomposition

Common matrix factorization methods include traditional SVD, FunkSVD (LFM), BiasSVD, and SVD++.

  • Traditional SVD
    Traditional SVD refers to singular value decomposition, which factorizes a matrix into the product of three matrices, the middle one being the diagonal matrix of singular values. SVD requires a dense matrix, but rating data in real scenarios is sparse, so SVD cannot be applied directly: the missing entries must first be filled in (e.g. with the mean), which inevitably distorts the data. The formula is as follows:
    $M_{m\times n} = U_{m\times k}\Sigma_{k\times k}V^T_{k\times n}$
  • FunkSVD (LFM)
    FunkSVD is the original LFM model: it factorizes the rating matrix into just two matrices, a user–latent-feature matrix and an item–latent-feature matrix. Training this model is similar to linear regression. The formula below is the loss function after factorization, where $p_u$ is the user latent vector and $q_i$ is the item latent vector. We obtain $p$ and $q$ by minimizing the loss; the second term is an L2 regularization that prevents overfitting. The minimum is found by stochastic gradient descent.
    $\min_{p,q}\sum_{(u,i)\in R}(r_{ui}-q^T_i p_u)^2+\lambda(\|q_i\|^2+\|p_u\|^2)$
  • BiasSVD
    BiasSVD adds the bias terms of the Baseline (benchmark) predictor to FunkSVD, where $\mu$ is the global mean rating, $b_u$ the user bias, and $b_i$ the item bias:
    $\min_{p,q,b}\sum_{(u,i)\in R}(r_{ui}-\mu-b_u-b_i-q^T_i p_u)^2+\lambda(\|q_i\|^2+\|p_u\|^2+b_u^2+b_i^2)$
  • SVD++
    SVD++ improves on BiasSVD by additionally incorporating the user's implicit feedback; it is not covered further here.
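As a concrete illustration of the traditional SVD item above, here is a minimal NumPy sketch (the small matrix is made up for the example): a dense matrix is decomposed with `np.linalg.svd`, and keeping only the top-$k$ singular values gives a rank-$k$ approximation.

```python
import numpy as np

# A small dense "rating" matrix (traditional SVD requires no missing entries).
M = np.array([[5., 3., 1.],
              [4., 2., 1.],
              [1., 1., 5.],
              [1., 0., 4.]])

# Full SVD: M = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Keep only the top-k singular values for a rank-k approximation.
k = 2
M_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The rank-k reconstruction stays close to the original matrix.
print(np.round(M_k, 1))
```

On sparse real data this only works after filling the missing entries, which is exactly the drawback described above.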

1 Analysis of LFM principle

The core idea of the LFM (latent factor model) is to connect users and items through latent features.

P is the user–latent-feature matrix (here with three latent features);
Q is the latent-feature–item matrix;
R is the User-Item rating matrix, approximated by P·Q.

We use matrix factorization to decompose the User-Item rating matrix into P and Q, then use P·Q to restore (approximate) the rating matrix, where $R_{11} = \vec{P_{1,k}}\cdot\vec{Q_{k,1}}$.
So the predicted score is:
$\hat r_{ui} = \vec{p}_{u}\cdot\vec{q}_{i} = \sum_{k=1}^{K}p_{uk}q_{ik}$
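To make the dot-product scoring concrete, here is a tiny sketch with made-up latent vectors (K = 3; the numbers are purely illustrative):

```python
import numpy as np

# Hypothetical latent vectors for one user and one item (K = 3 latent factors).
p_u = np.array([0.8, 0.1, 0.5])   # user's affinity for each latent factor
q_i = np.array([0.9, 0.2, 0.4])   # item's weight on each latent factor

# Predicted rating: the dot product sum_k p_uk * q_ik
r_hat = np.dot(p_u, q_i)
print(round(float(r_hat), 2))  # 0.8*0.9 + 0.1*0.2 + 0.5*0.4 = 0.94
```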

Loss function

$Cost = \sum_{(u,i)\in R}(r_{ui}-\hat r_{ui})^2 = \sum_{(u,i)\in R}(r_{ui}-\sum_{k=1}^{K}p_{uk}q_{ik})^2$
Adding an L2 regularization term to prevent overfitting:
$Cost = \sum_{(u,i)\in R}(r_{ui}-\sum_{k=1}^{K}p_{uk}q_{ik})^2 +\lambda(\sum_U p^2_{uk}+\sum_I q^2_{ik})$
Taking partial derivatives of the loss function:
$\frac{\partial}{\partial p_{uk}}Cost = 2\sum_{(u,i)\in R}(r_{ui}-\sum_{k=1}^{K}p_{uk}q_{ik})(-q_{ik}) + 2\lambda p_{uk}$
$\frac{\partial}{\partial q_{ik}}Cost = 2\sum_{(u,i)\in R}(r_{ui}-\sum_{k=1}^{K}p_{uk}q_{ik})(-p_{uk}) + 2\lambda q_{ik}$
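A quick way to sanity-check derivatives like these is a finite-difference comparison. This sketch (single sample, made-up values) checks the analytic gradient of the regularized squared error with respect to the user vector:

```python
import numpy as np

# Single-sample cost: (r - p.q)^2 + lam * ||p||^2, gradient checked numerically.
rng = np.random.default_rng(0)
K, lam = 4, 0.1
p = rng.random(K)
q = rng.random(K)
r = 3.5

def cost(p_vec):
    return (r - p_vec @ q) ** 2 + lam * (p_vec @ p_vec)

# Analytic gradient from the derivation above: 2*(r - p.q)*(-q) + 2*lam*p
grad = 2 * (r - p @ q) * (-q) + 2 * lam * p

# Central finite differences along each coordinate
eps = 1e-6
num_grad = np.array([
    (cost(p + eps * e) - cost(p - eps * e)) / (2 * eps)
    for e in np.eye(K)
])

print(np.allclose(grad, num_grad, atol=1e-5))  # True
```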

Stochastic Gradient Descent Optimization

Gradient descent updates the parameters using the partial derivatives:
$p_{uk}:=p_{uk} + \alpha[\sum_{(u,i)\in R}(r_{ui}-\sum_{k=1}^{K}p_{uk}q_{ik})q_{ik} - \lambda_1 p_{uk}]$
$q_{ik}:=q_{ik} + \alpha[\sum_{(u,i)\in R}(r_{ui}-\sum_{k=1}^{K}p_{uk}q_{ik})p_{uk} - \lambda_2 q_{ik}]$
Stochastic gradient descent updates each vector using a single sample $(u,i)$:
$p_{uk}:=p_{uk} + \alpha[(r_{ui}-\sum_{k=1}^{K}p_{uk}q_{ik})q_{ik} - \lambda_1 p_{uk}]$
$q_{ik}:=q_{ik} + \alpha[(r_{ui}-\sum_{k=1}^{K}p_{uk}q_{ik})p_{uk} - \lambda_2 q_{ik}]$

Algorithm implementation

Only the LFM algorithm code is shown here. For the dataset-splitting and evaluation code, see: Big Data - Collaborative Filtering Recommendation Algorithm: Linear Regression Algorithm

# FunkSVD
import numpy as np


class LFM(object):
    '''max_epochs: number of gradient-descent iterations
       alpha: learning rate
       p_reg, q_reg: regularization coefficients
       number_LatentFactors: number of latent factors
       columns: column names of the rating data'''
    def __init__(self, max_epochs, p_reg, q_reg, alpha,
                 number_LatentFactors=30, columns=['userId', 'movieId', 'rating']):
        self.max_epochs = max_epochs
        self.p_reg = p_reg
        self.q_reg = q_reg
        self.number_LatentFactors = number_LatentFactors  # number of latent factors
        self.alpha = alpha
        self.columns = columns

    def fit(self, data):
        '''
        :param data: DataFrame with uid, mid, rating columns
        '''
        self.data = data
        # ratings grouped by user
        self.users_rating = data.groupby(self.columns[0]).agg([list])[[self.columns[1], self.columns[2]]]
        # ratings grouped by item
        self.items_rating = data.groupby(self.columns[1]).agg([list])[[self.columns[0], self.columns[2]]]
        # global mean rating
        self.global_mean = self.data[self.columns[2]].mean()
        # train the model parameters with stochastic gradient descent
        self.p, self.q = self.sgd()

    def _init_matrix(self):
        # build P: one random latent vector per user (num_users x num_latent_factors)
        p = dict(zip(self.users_rating.index,
                     np.random.rand(len(self.users_rating), self.number_LatentFactors).astype(np.float32)))
        # build Q: one random latent vector per item (num_items x num_latent_factors)
        q = dict(zip(self.items_rating.index,
                     np.random.rand(len(self.items_rating), self.number_LatentFactors).astype(np.float32)))
        return p, q

    def sgd(self):
        '''
        stochastic gradient descent: optimize the p and q values
        :return: p, q
        '''
        p, q = self._init_matrix()

        for i in range(self.max_epochs):
            error_list = []
            for uid, mid, r_ui in self.data.itertuples(index=False):
                v_pu = p[uid]   # user latent vector
                v_qi = q[mid]   # item latent vector
                err = np.float32(r_ui - np.dot(v_pu, v_qi))

                # update with the pre-step value of v_pu, so the update of
                # v_qi does not see the already-updated user vector
                v_pu_old = v_pu.copy()
                v_pu += self.alpha * (err * v_qi - self.p_reg * v_pu)
                v_qi += self.alpha * (err * v_pu_old - self.q_reg * v_qi)

                p[uid] = v_pu
                q[mid] = v_qi

                error_list.append(err ** 2)

        return p, q

    def predict(self, uid, mid):
        '''
        predict a rating with the scoring formula;
        fall back to the global mean for unseen users/items
        '''
        if uid not in self.users_rating.index or mid not in self.items_rating.index:
            return self.global_mean
        p_u = self.p[uid]
        q_i = self.q[mid]
        return np.dot(p_u, q_i)

    def test(self, testset):
        '''
        predict every rating in the test set;
        yields (uid, mid, real_rating, pred_rating)
        '''
        for uid, mid, real_rating in testset.itertuples(index=False):
            try:
                pred_rating = self.predict(uid, mid)
            except Exception as e:
                print(e)
            else:
                yield uid, mid, real_rating, pred_rating

Test the LFM algorithm

trainset, testset = data_split('ml-latest-small/ratings.csv', random=True)
lfm = LFM(100, 0.01, 0.01, 0.02, 10, ['userId', 'movieId', 'rating'])
lfm.fit(trainset)
pred_test = lfm.test(testset)
# materialize the generator with list(), then convert to a DataFrame
df_pred_LFM = pd.DataFrame(list(pred_test), columns=['userId', 'movieId', 'rating', 'pred_rating'])
rmse, mae = accuray(df_pred_LFM, 'all')
print('rmse:', rmse, ';mae:', mae)

rmse: 1.0718 ;mae: 0.8067

2 Principle of BiasSVD

BiasSVD adds bias terms to the FunkSVD factorization. The scoring formula, where $\mu$ is the global mean rating, $b_u$ the user bias, and $b_i$ the item bias, is:
$\hat r_{ui} = \mu+b_u+b_i+\vec{p}_{u}\cdot\vec{q}_{i} = \mu+b_u+b_i+\sum_{k=1}^{K}p_{uk}q_{ik}$
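As a quick numeric sketch of this scoring formula (all values made up for the example):

```python
import numpy as np

# Hypothetical values: global mean, user bias, item bias, latent vectors.
mu, b_u, b_i = 3.5, 0.3, -0.2          # e.g. a generous user, a weak item
p_u = np.array([0.8, 0.1, 0.5])
q_i = np.array([0.9, 0.2, 0.4])

# BiasSVD score: mu + b_u + b_i + p_u . q_i
r_hat = mu + b_u + b_i + np.dot(p_u, q_i)
print(round(float(r_hat), 2))  # 3.5 + 0.3 - 0.2 + 0.94 = 4.54
```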

Loss function
$Cost = \sum_{(u,i)\in R}(r_{ui}-\hat r_{ui})^2 = \sum_{(u,i)\in R}(r_{ui}-\mu-b_u-b_i-\sum_{k=1}^{K}p_{uk}q_{ik})^2$
Adding the L2 regularization term:
$Cost = \sum_{(u,i)\in R}(r_{ui}-\mu-b_u-b_i-\sum_{k=1}^{K}p_{uk}q_{ik})^2 +\lambda(\sum_U b^2_u+\sum_I b^2_i+\sum_U p^2_{uk}+\sum_I q^2_{ik})$

Stochastic gradient descent optimization
Gradient descent updates the parameters:
$p_{uk}:=p_{uk} + \alpha[(r_{ui}-\mu-b_u-b_i-\sum_{k=1}^{K}p_{uk}q_{ik})q_{ik}-\lambda_1 p_{uk}]$
$q_{ik}:=q_{ik} + \alpha[(r_{ui}-\mu-b_u-b_i-\sum_{k=1}^{K}p_{uk}q_{ik})p_{uk}-\lambda_2 q_{ik}]$
$b_u:=b_u + \alpha[(r_{ui}-\mu-b_u-b_i-\sum_{k=1}^{K}p_{uk}q_{ik})-\lambda_3 b_u]$
$b_i:=b_i + \alpha[(r_{ui}-\mu-b_u-b_i-\sum_{k=1}^{K}p_{uk}q_{ik})-\lambda_4 b_i]$

Algorithm implementation

# BiasSVD
import numpy as np


class BiasSvd(object):
    '''alpha: learning rate
       p_reg, q_reg, bu_reg, bi_reg: regularization coefficients
       number_LatentFactors: number of latent factors
       max_epochs: number of gradient-descent iterations
       columns: column names of the rating data'''
    def __init__(self, alpha, p_reg, q_reg, bu_reg, bi_reg,
                 number_LatentFactors=10, max_epochs=10, columns=['userId', 'movieId', 'rating']):
        self.max_epochs = max_epochs
        self.p_reg = p_reg
        self.q_reg = q_reg
        self.bu_reg = bu_reg
        self.bi_reg = bi_reg
        self.number_LatentFactors = number_LatentFactors  # number of latent factors
        self.alpha = alpha
        self.columns = columns

    def fit(self, data):
        '''
        :param data: DataFrame with uid, mid, rating columns
        '''
        self.data = data
        # ratings grouped by user
        self.users_rating = data.groupby(self.columns[0]).agg([list])[[self.columns[1], self.columns[2]]]
        # ratings grouped by item
        self.items_rating = data.groupby(self.columns[1]).agg([list])[[self.columns[0], self.columns[2]]]
        # global mean rating
        self.global_mean = self.data[self.columns[2]].mean()
        # train the model parameters with stochastic gradient descent
        self.p, self.q, self.bu, self.bi = self.sgd()

    def _init_matrix(self):
        # build P: one random latent vector per user (num_users x num_latent_factors)
        p = dict(zip(self.users_rating.index,
                     np.random.rand(len(self.users_rating), self.number_LatentFactors).astype(np.float32)))
        # build Q: one random latent vector per item (num_items x num_latent_factors)
        q = dict(zip(self.items_rating.index,
                     np.random.rand(len(self.items_rating), self.number_LatentFactors).astype(np.float32)))
        return p, q

    def sgd(self):
        '''
        stochastic gradient descent: optimize p, q and the bias terms
        :return: p, q, bu, bi
        '''
        p, q = self._init_matrix()
        # bias terms start at zero
        bu = dict(zip(self.users_rating.index, np.zeros(len(self.users_rating))))
        bi = dict(zip(self.items_rating.index, np.zeros(len(self.items_rating))))

        for i in range(self.max_epochs):
            error_list = []
            for uid, mid, r_ui in self.data.itertuples(index=False):
                v_pu = p[uid]   # user latent vector
                v_qi = q[mid]   # item latent vector
                err = np.float32(r_ui - self.global_mean - bu[uid] - bi[mid] - np.dot(v_pu, v_qi))

                # update with the pre-step value of v_pu, so the update of
                # v_qi does not see the already-updated user vector
                v_pu_old = v_pu.copy()
                v_pu += self.alpha * (err * v_qi - self.p_reg * v_pu)
                v_qi += self.alpha * (err * v_pu_old - self.q_reg * v_qi)

                p[uid] = v_pu
                q[mid] = v_qi

                bu[uid] += self.alpha * (err - self.bu_reg * bu[uid])
                bi[mid] += self.alpha * (err - self.bi_reg * bi[mid])

                error_list.append(err ** 2)

        return p, q, bu, bi

    def predict(self, uid, mid):
        '''
        predict a rating with the scoring formula;
        fall back to the global mean for unseen users/items
        '''
        if uid not in self.users_rating.index or mid not in self.items_rating.index:
            return self.global_mean
        p_u = self.p[uid]
        q_i = self.q[mid]
        return self.global_mean + self.bu[uid] + self.bi[mid] + np.dot(p_u, q_i)

    def test(self, testset):
        '''
        predict every rating in the test set;
        yields (uid, mid, real_rating, pred_rating)
        '''
        for uid, mid, real_rating in testset.itertuples(index=False):
            try:
                pred_rating = self.predict(uid, mid)
            except Exception as e:
                print(e)
            else:
                yield uid, mid, real_rating, pred_rating

Using the algorithm

trainset, testset = data_split('ml-latest-small/ratings.csv', random=True)
bsvd = BiasSvd(0.02, 0.01, 0.01, 0.01, 0.01, 10, 20, ['userId', 'movieId', 'rating'])
bsvd.fit(trainset)
pred_test = bsvd.test(testset)
# materialize the generator with list(), then convert to a DataFrame
df_pred_bsvd = pd.DataFrame(list(pred_test), columns=['userId', 'movieId', 'rating', 'pred_rating'])
rmse, mae = accuray(df_pred_bsvd, 'all')
print('rmse:', rmse, ';mae:', mae)

rmse: 1.0718 ;mae: 0.8067

Origin: blog.csdn.net/gjinc/article/details/132224891