Big Data - Collaborative Filtering Recommendation Algorithm: Linear Regression Algorithm

Collaborative filtering algorithms in recommendation systems are generally divided into two categories:

  • Memory-Based CF (behavior-based collaborative filtering) uses user behavior data to compute similarity, both between users and between items.
  • Model-Based CF (model-based collaborative filtering) uses machine learning algorithms to predict user preferences and is better suited to sparse user data.

This article mainly introduces the Model-Based collaborative filtering algorithm.

1. Model-Based CF: the model-based collaborative filtering algorithm

A machine learning model is trained on the user-item rating matrix to predict a user's rating of an item. The approaches fall mainly into the following categories:

  • Algorithms based on classification, regression, or clustering
  • Algorithms based on matrix factorization
  • Algorithms based on neural networks
  • Algorithms based on graphical models

2. Collaborative filtering based on a regression model

A regression model requires a continuous target value. We treat the rating as a continuous value and adopt the following Baseline (baseline prediction) strategy. The idea is to exploit the fact that every user and every item behaves differently:

Some users are generous and rate higher than other users; some are harsh and rate lower. Likewise, some items are popular and receive higher ratings than typical items, while disliked items receive lower ratings than typical items.

The Baseline finds the bias $b_u$ of each user relative to other users and the bias $b_i$ of each item relative to other items, so the goal becomes finding the optimal $b_u$ and $b_i$. The steps of the Baseline algorithm are as follows:

  1. Calculate the average rating $\mu$ of all movies;
  2. Calculate the bias $b_u$ of each user's ratings relative to the average rating;
  3. Calculate the bias $b_i$ of each movie's ratings relative to the average rating;
  4. Predict the user's rating of the movie:
    $\hat{r}_{ui} = b_{ui} = \mu + b_u + b_i$

Take user A's rating of "Fengshen Part I" as an example:

  1. First calculate the average rating of all movies: $\mu = 3.5$.
  2. User A is generous and generally rates 1 point above the average, so the bias is $b_u = 1$.
  3. "Fengshen Part I" received many bad reviews at first and scores 0.5 points below the average, so the bias is $b_i = -0.5$.
  4. User A's predicted rating for "Fengshen Part I" is therefore 3.5 + 1 - 0.5 = 4.0 points.
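
The arithmetic of this example can be checked with a minimal sketch. The function name `baseline_predict` is hypothetical and not part of the article's code; the numbers are the ones from the example.

```python
def baseline_predict(global_mean, user_bias, item_bias):
    """Baseline prediction: global mean plus user bias plus item bias."""
    return global_mean + user_bias + item_bias

# Values from the example: mean 3.5, user A bias +1, movie bias -0.5
rating = baseline_predict(3.5, 1.0, -0.5)
print(rating)  # 4.0
```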

In a linear regression problem, the loss function is built from the squared error, where $R$ is the set of observed ratings and $\mu$ is the global average rating:
$Cost = \sum_{u,i \in R}(r_{ui}-\hat{r}_{ui})^2 = \sum_{u,i \in R}(r_{ui}-\mu-b_u-b_i)^2$
To prevent overfitting, an L2 regularization term is added, giving the final loss:
$Cost = \sum_{u,i \in R}(r_{ui}-\mu-b_u-b_i)^2 + \lambda(\sum_u{b_u}^2+\sum_i{b_i}^2)$
We want to minimize this loss function, which is usually done with stochastic gradient descent or least squares.
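
As a rough illustration, the regularized loss can be evaluated directly on a handful of ratings. This is only a sketch: the function name `baseline_cost` and the toy ratings are made up for illustration, and the biases are stored in plain dictionaries.

```python
def baseline_cost(ratings, global_mean, bu, bi, reg):
    """Squared error over observed ratings plus an L2 penalty on the biases."""
    squared_error = sum(
        (r - global_mean - bu[u] - bi[i]) ** 2 for u, i, r in ratings
    )
    penalty = reg * (
        sum(b ** 2 for b in bu.values()) + sum(b ** 2 for b in bi.values())
    )
    return squared_error + penalty

# Toy data (made up): (user, item, rating) triples
ratings = [("A", "m1", 4.0), ("A", "m2", 5.0), ("B", "m1", 2.0)]
global_mean = sum(r for _, _, r in ratings) / len(ratings)

# With all biases at zero the penalty vanishes and only the squared error remains
zero_cost = baseline_cost(
    ratings, global_mean, {"A": 0.0, "B": 0.0}, {"m1": 0.0, "m2": 0.0}, reg=0.1
)
# Giving user A a positive bias and user B a negative one lowers the cost here
biased_cost = baseline_cost(
    ratings, global_mean, {"A": 1.0, "B": -1.0}, {"m1": 0.0, "m2": 0.0}, reg=0.1
)
print(zero_cost, biased_cost)
```

On this toy data the biased parameters give a lower cost than the all-zero ones, which is the effect the optimizer is looking for.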

2.1 Baseline stochastic gradient descent algorithm

step1: Derivation of the gradient descent method
Write the loss as a function of the parameters:
$J(\theta) = f(b_u, b_i)$
The general gradient descent parameter update is:
$\theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta)$
Taking the partial derivative with respect to the parameter $b_u$:
$\frac{\partial}{\partial b_u}J(\theta) = \frac{\partial}{\partial b_u}f(b_u,b_i) = -2\sum_{u,i \in R}(r_{ui}-\mu-b_u-b_i) + 2\lambda b_u$
Substituting into the gradient descent update, with the constant factor 2 absorbed into the learning rate $\alpha$:
$b_u := b_u + \alpha(\sum_{u,i \in R}(r_{ui}-\mu-b_u-b_i) - \lambda b_u)$
Similarly for $b_i$:
$b_i := b_i + \alpha(\sum_{u,i \in R}(r_{ui}-\mu-b_u-b_i) - \lambda b_i)$
step2: Stochastic gradient descent
Stochastic gradient descent updates the parameters using the loss of a single sample at a time, instead of computing the total loss over all samples at each step.
The error for a single sample is:
$error = r_{ui} - \hat{r}_{ui} = r_{ui} - \mu - b_u - b_i$
So the update formulas become:
$b_u := b_u + \alpha(error - \lambda b_u)$
$b_i := b_i + \alpha(error - \lambda b_i)$
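
Before the full class, a single stochastic update can be sketched in isolation. The helper name `sgd_step` and the sample values are hypothetical; the update rule itself is the per-sample formula given here, with the same error used for both biases.

```python
def sgd_step(bu, bi, uid, mid, rating, global_mean, alpha, reg):
    """Apply one stochastic update for a single (user, item, rating) sample."""
    error = rating - (global_mean + bu[uid] + bi[mid])
    bu[uid] += alpha * (error - reg * bu[uid])
    bi[mid] += alpha * (error - reg * bi[mid])
    return error

# Made-up sample: user A rates movie m1 with 4.5 while the global mean is 3.5
bu = {"A": 0.0}
bi = {"m1": 0.0}
error = sgd_step(bu, bi, "A", "m1", rating=4.5, global_mean=3.5, alpha=0.1, reg=0.1)
print(error, bu, bi)
```

Both biases start at zero, so the regularization term contributes nothing on this first step and each bias moves by the learning rate times the error.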
step3: Algorithm implementation
Import modules and data

# Implementation of the stochastic gradient descent algorithm
import pandas as pd
import numpy as np
df = pd.read_csv("ml-latest-small/ratings.csv", usecols=range(3))
df

Implementation of Baseline Gradient Descent Algorithm

class BaselineCFBySGD(object):
    '''max_epochs: number of gradient descent iterations
        alpha: learning rate
        reg: regularization parameter (guards against overfitting)
        columns: names of the data columns'''
    def __init__(self,max_epochs, alpha,reg,columns=['uid','mid','rating']):
        self.max_epochs = max_epochs
        self.alpha = alpha
        self.reg = reg
        self.columns = columns

    def fit(self,data):
        '''
        :param data: DataFrame with the columns uid, mid, rating
        :return:'''
        self.data = data
        # Ratings grouped by user
        self.users_rating = data.groupby(self.columns[0]).agg([list])[[self.columns[1], self.columns[2]]]
        # Ratings grouped by movie
        self.items_rating = data.groupby(self.columns[1]).agg([list])[[self.columns[0], self.columns[2]]]
        # Global average rating
        self.global_mean = self.data[self.columns[2]].mean()
        # Train the model parameters with stochastic gradient descent
        self.bu,self.bi = self.sgd()
        
    def sgd(self):
        '''
        Stochastic gradient descent: optimize the bu and bi values
        :return: bu bi'''
        bu = dict(zip(self.users_rating.index, np.zeros(len(self.users_rating))))
        bi = dict(zip(self.items_rating.index, np.zeros(len(self.items_rating))))

        for i in range(self.max_epochs):
            # Read each row of the training data and apply the update formulas
            for uid, mid, real_rating in self.data.itertuples(index=False):
                error = real_rating - (self.global_mean+bu[uid]+bi[mid])
                bu[uid] += self.alpha*(error - self.reg*bu[uid])
                bi[mid] += self.alpha*(error - self.reg*bi[mid])
        return bu,bi
    
    def predict(self,uid,mid):
        '''
        Predict a rating with the baseline formula
        param uid,mid;
        return predict_rating;'''
        predict_rating = self.global_mean+self.bu[uid]+self.bi[mid]
        return predict_rating

    def test(self,testset):
        '''
        Predict the ratings of the test set with the predict function
        param testset;
        return yield;'''
        for uid,mid,real_rating in testset.itertuples(index=False):
            try:
                # Predict with the predict function
                pred_rating = self.predict(uid,mid)
            except Exception as e:
                # A user or movie missing from the training set raises KeyError
                print(e)
            else:
                # Yield the result, so this method returns a generator
                yield uid,mid,real_rating,pred_rating

Test set and training set partition function

# Split the data into a training set and a test set
def data_split(data_path, x=0.8, random=False):
    ratings = pd.read_csv(data_path, usecols=range(3))
    testset_index = []
    
    for uid in ratings.groupby('userId').any().index:
        user_rating_data = ratings.where(ratings['userId']==uid).dropna()
        if random:
            index = list(user_rating_data.index)
            np.random.shuffle(index)
            _index = round(len(user_rating_data)*x)
            testset_index += list(index[_index:])
        else:
            index = round(len(user_rating_data)*x)
            testset_index += list(user_rating_data.index.values[index:])
            
    testset = ratings.loc[testset_index]
    trainset = ratings.drop(testset_index)
    return trainset,testset

Algorithm evaluation function

def accuray(predict_reselts, method='all'):
    # Root mean squared error
    def rmse(predict_reselts):
        length = 0
        _rmse_sum = 0
        for uid,mid, real_rating, pred_rating in predict_reselts.itertuples(index=False):
            length+=1
            _rmse_sum += (pred_rating - real_rating)**2
        return round(np.sqrt(_rmse_sum/length),4)
    
    # Mean absolute error
    def mae(predict_reselts):
        length=0
        _mae_sum=0
        for uid,mid,real_rating,pred_rating in predict_reselts.itertuples(index=False):
            length +=1
            _mae_sum += abs(pred_rating-real_rating)
        return round(_mae_sum/length,4)
    
    # Compute both metrics
    def rmse_mae(predict_reselts):
        length = 0
        _rmse_sum=0
        _mae_sum=0
        for uid,mid,real_rating,pred_rating in predict_reselts.itertuples(index=False):
            length +=1
            _mae_sum += abs(pred_rating-real_rating)
            _rmse_sum += (pred_rating - real_rating)**2
        return round(np.sqrt(_rmse_sum/length),4),round(_mae_sum/length,4)
    
    # Return the evaluation result that matches the method argument
    if method.lower() =='rmse':
        return rmse(predict_reselts)
    elif method.lower() == 'mae':
        return mae(predict_reselts)
    else:
        return rmse_mae(predict_reselts)
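
The two metrics can also be computed without an explicit loop. The following is a sketch using NumPy array operations; the function name `rmse_mae_vectorized` and the toy values are made up for illustration, and it takes two plain sequences rather than the DataFrame used in the article.

```python
import numpy as np

def rmse_mae_vectorized(real, pred):
    """Return (RMSE, MAE) for two equal-length sequences of ratings."""
    diff = np.asarray(pred, dtype=float) - np.asarray(real, dtype=float)
    rmse = float(np.sqrt(np.mean(diff ** 2)))
    mae = float(np.mean(np.abs(diff)))
    return round(rmse, 4), round(mae, 4)

# Toy values, made up for illustration
real = [4.0, 3.0, 5.0]
pred = [3.0, 3.0, 3.0]
result = rmse_mae_vectorized(real, pred)
print(result)  # (1.291, 1.0)
```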

Running the algorithm and the evaluation function on the data

trainset, testset = data_split('ml-latest-small/ratings.csv',random=True)
bcf = BaselineCFBySGD(20,0.1,0.1,['userId','movieId','rating'])
bcf.fit(trainset)
pred_test = bcf.test(testset)
# Convert the generator to a list, then to a DataFrame
df_pred = pd.DataFrame(list(pred_test), columns=['userId','movieId','rating','pred_rating'])

rmse, mae = accuray(df_pred,'all')
print('rmse:',rmse,';mae:',mae)

rmse: 0.8647 ;mae: 0.6595

2.2 Baseline Alternating Least Squares Algorithm

step1: Derivation of the alternating least squares method
Core idea: take the partial derivative of the loss function and set it to 0.
The loss function is as follows:
$J(\theta) = f(b_u, b_i)$
Taking the partial derivative with respect to $b_u$:
$\frac{\partial}{\partial b_u}J(\theta) = \frac{\partial}{\partial b_u}f(b_u,b_i) = -2\sum_{u,i \in R}(r_{ui}-\mu-b_u-b_i) + 2\lambda b_u$
Setting the partial derivative to 0 gives:
$\sum_{u,i \in R}(r_{ui}-\mu-b_u-b_i) = \lambda b_u$
Moving the $b_u$ terms to the right-hand side:
$\sum_{u,i \in R}(r_{ui}-\mu-b_i) = \sum_{u \in R}b_u+\lambda b_u$
To simplify the calculation, let $\sum_{u \in R}b_u \approx |R(u)| * b_u$, which gives:
$b_u := \frac{\sum_{u,i \in R}(r_{ui}-\mu-b_i)}{\lambda_1+|R(u)|}$

Here $|R(u)|$ is the number of ratings made by user $u$.

Similarly:
$b_i := \frac{\sum_{u,i \in R}(r_{ui}-\mu-b_u)}{\lambda_2+|R(i)|}$
Here $|R(i)|$ is the number of ratings received by item $i$, and $\lambda_1$, $\lambda_2$ are the regularization parameters for the user and item biases.
step2: Alternating least squares (ALS)
We have derived expressions for $b_u$ and $b_i$, but each expression contains the other, so we compute them with the alternating least squares method:

  1. Fix one of the two values and solve for the other;
  2. Then fix the other value and solve for the first; repeat these alternating updates until the result is obtained.

That is, to solve for $b_u$, treat $b_i$ as known; to solve for $b_i$, treat $b_u$ as known.
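
One half of an alternating step can be checked numerically: with the item biases held fixed, the closed-form user bias should make the partial derivative of the regularized loss vanish. The sketch below uses hypothetical helper names (`update_user_bias`, `user_gradient`) and made-up toy ratings for a single user.

```python
def update_user_bias(user_ratings, global_mean, bi, reg):
    """Closed-form b_u for one user, with the item biases b_i held fixed."""
    residual = sum(r - global_mean - bi[i] for i, r in user_ratings)
    return residual / (reg + len(user_ratings))

def user_gradient(user_ratings, global_mean, b_u, bi, reg):
    """Partial derivative of the regularized loss with respect to b_u."""
    residual = sum(r - global_mean - b_u - bi[i] for i, r in user_ratings)
    return -2 * residual + 2 * reg * b_u

# Toy data, made up for illustration: one user's (item, rating) pairs
user_ratings = [("m1", 4.0), ("m2", 5.0)]
bi = {"m1": 0.5, "m2": -0.5}

b_u = update_user_bias(user_ratings, global_mean=3.5, bi=bi, reg=2.0)
grad = user_gradient(user_ratings, 3.5, b_u, bi, reg=2.0)
print(b_u, grad)  # 0.5 0.0
```

The item-bias update is symmetric: swap the roles of users and items and hold the user biases fixed.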

step3 : Algorithm implementation
The overall code is similar to the stochastic gradient descent implementation.

# Implementation of the alternating least squares algorithm
class BaselineCFByALS(object):
    '''max_epochs: number of ALS iterations
        reg_bu: regularization parameter for the user biases
        reg_bi: regularization parameter for the item biases
        columns: names of the data columns'''
    def __init__(self,max_epochs,reg_bu,reg_bi,columns=['userId','movieId','rating']):
        self.max_epochs = max_epochs
        self.reg_bu = reg_bu
        self.reg_bi = reg_bi
        self.columns = columns

    def fit(self,data):
        '''
        :param data: DataFrame with the columns uid, mid, rating
        :return:'''
        self.data = data
        # Ratings grouped by user
        self.users_rating = data.groupby(self.columns[0]).agg([list])[[self.columns[1], self.columns[2]]]
        # Ratings grouped by movie
        self.items_rating = data.groupby(self.columns[1]).agg([list])[[self.columns[0], self.columns[2]]]
        # Global average rating
        self.global_mean = self.data[self.columns[2]].mean()
        # Train the model parameters with alternating least squares
        self.bu,self.bi = self.als()
        
    def als(self):
        '''
        Alternating least squares: optimize the bu and bi values
        :return: bu bi'''
        bu = dict(zip(self.users_rating.index, np.zeros(len(self.users_rating))))
        bi = dict(zip(self.items_rating.index, np.zeros(len(self.items_rating))))

        for i in range(self.max_epochs):
            # Update bi with bu held fixed
            for mid, uids, real_ratings in self.items_rating.itertuples(index=True):
                _sum=0
                for uid,rating in zip(uids,real_ratings):
                    _sum += rating - self.global_mean-bu[uid]
                bi[mid] = _sum/(self.reg_bi+len(uids))

            # Update bu with bi held fixed
            for uid,mids,real_ratings in self.users_rating.itertuples(index=True):
                _sum=0
                for mid,rating in zip(mids,real_ratings):
                    _sum+= rating -self.global_mean-bi[mid]
                bu[uid] = _sum/(self.reg_bu+len(mids))
        return bu,bi
    
    def predict(self,uid,mid):
        '''
        Predict a rating with the baseline formula
        param uid,mid;
        return predict_rating;'''
        predict_rating = self.global_mean+self.bu[uid]+self.bi[mid]
        return predict_rating

    def test(self,testset):
        '''
        Predict the ratings of the test set with the predict function
        param testset;
        return yield;'''
        for uid,mid,real_rating in testset.itertuples(index=False):
            try:
                # Predict with the predict function
                pred_rating = self.predict(uid,mid)
            except Exception as e:
                # A user or movie missing from the training set raises KeyError
                print(e)
            else:
                # Yield the result, so this method returns a generator
                yield uid,mid,real_rating,pred_rating

Applying the Least Squares Algorithm

trainset, testset = data_split('ml-latest-small/ratings.csv',random=True)
bcf = BaselineCFByALS(20,25,15,['userId','movieId','rating'])
bcf.fit(trainset)
pred_test = bcf.test(testset)
# Convert the generator to a list, then to a DataFrame
df_pred_als = pd.DataFrame(list(pred_test), columns=['userId','movieId','rating','pred_rating'])
rmse, mae = accuray(df_pred_als,'all')
print('rmse:',rmse,';mae:',mae)

rmse: 0.8403 ;mae: 0.6462

Origin blog.csdn.net/gjinc/article/details/132201822