[Basics of Machine Learning] Linear regression model predicts the relationship between likes and collections of Bilibili videos (Hua Nong Brothers)


Preface

Linear regression models can be used to predict the trend of data. By training on an existing data set, a linear function y = w*x + b is obtained, and new values can then be predicted with this function.


1. Linear regression model

Linear regression is based on the assumption of a linear correlation between the feature value x and the target value y. The linear model

y = w*x + b

is solved from a known data set. The concrete approach is to construct a loss function and make its value smaller and smaller until the accuracy requirement or the maximum number of iterations is reached. The loss function can be understood as a measure of the gap between the predicted values and the true values: the smaller the gap, the closer the model is to the true data. The loss function is defined as:

Loss(w, b) = \frac{1}{2n} \sum_{i=1}^{n} (w x_i + b - y_i)^2
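As a minimal sketch (assuming x and y are 1-D NumPy arrays holding the normalized likes and collections, and using the 1/2n scaling above), the loss can be computed like this:

import numpy as np

def loss(w, b, x, y):
    # Mean squared residual, halved so the gradients below carry no extra factor of 2
    return 0.5 * np.mean((w * x + b - y) ** 2)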
To fit the model we minimize Loss(w, b). For this we introduce the gradient descent algorithm: the loss decreases fastest along the direction opposite to the gradient, so w and b are updated in every iteration until the stopping condition is met:

w \leftarrow w - \alpha \frac{\partial Loss}{\partial w}, \qquad b \leftarrow b - \alpha \frac{\partial Loss}{\partial b}

where \alpha is the learning rate. The partial derivatives of Loss(w, b) with respect to w and b (obtained by substituting y = w*x + b into the loss function) are:

\frac{\partial Loss}{\partial w} = \frac{1}{n} \sum_{i=1}^{n} (w x_i + b - y_i)\, x_i

\frac{\partial Loss}{\partial b} = \frac{1}{n} \sum_{i=1}^{n} (w x_i + b - y_i)
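A minimal sketch of one gradient-descent update, written directly from the two partial derivatives above (the arrays x, y and the learning rate alpha are assumptions here, matching the training code further down):

import numpy as np

def gradient_step(w, b, x, y, alpha=0.01):
    dw = np.mean((w * x + b - y) * x)   # dLoss/dw
    db = np.mean(w * x + b - y)         # dLoss/db
    # Move w and b against the gradient
    return w - alpha * dw, b - alpha * db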

2. Getting the data

By crawling video information from Bilibili, this article obtained the videos of the uploader (UP主) "Hua Nong Brothers"; for the crawling details, you can refer to the blog post on crawling all videos of a Bilibili UP主. Take the number of likes and the number of collections of each video and build a linear regression model to predict the relationship between them. The scatter plot of video likes (x-axis) against collections (y-axis) is as follows:
[Figure: scatter plot of likes (x-axis) vs. collections (y-axis)]
Because the data is poorly concentrated (the values span a wide range), it needs to be normalized. This article uses min-max normalization.
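A minimal sketch of min-max normalization with NumPy (the same idea as the train_set_normalize method in the code below):

import numpy as np

def min_max_normalize(data):
    # Scale all values into the range [0, 1]
    return (data - np.min(data)) / (np.max(data) - np.min(data))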

3. Model training

After training, w = 0.7229486928307687 and b = 0.20322045504258518. The trained model is shown below:
[Figure: the trained model (fitted regression line over the normalized data)]
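With these parameters, predicting the normalized number of collections from a normalized number of likes is a single multiply-add; the input 0.5 below is only a hypothetical example value:

w, b = 0.7229486928307687, 0.20322045504258518
liked_norm = 0.5                      # hypothetical normalized like count
collected_norm = w * liked_norm + b   # predicted normalized collection count, roughly 0.565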

4. Code

# Linear regression model: predict the relationship between likes and collections of Bilibili videos (Hua Nong Brothers)
import json
import numpy as np
import time
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

class LR(object):
    def __init__(self, max_iterator = 1000, learn_rate = 0.01):
        self.max_iterator = max_iterator
        self.learn_rate = learn_rate
        self.w = np.random.normal(1, 0.1)
        self.b = np.random.normal(1, 0.1)

    def cal_day(self, release_date, now_date):
        # Compute the number of days between the release date and now
        start_time = time.mktime(time.strptime(release_date.split(' ')[0], '%Y-%m-%d'))
        end_time = time.mktime(time.strptime(now_date.split(' ')[0], '%Y-%m-%d'))
        return int((end_time - start_time)/(24*60*60))


    def load_data(self, url):
        with open(url, 'r', encoding='utf-8') as f:
            data_dect = json.load(f)
        # print(data_dect)
        
        # View counts and the number of days since each video was published
        watched_number_list = []
        time_list = []
        dm_number_list = []
        liked_list = []
        collected_list = []
        for sample in data_dect:
            # Skip bad samples (empty view count)
            if sample['watched'] != '':
                watched_number_list.append([float(sample['watched'])])     # view count
                liked_list.append([float(sample['liked'])])                # like count
                collected_list.append([float(sample['collected'])])        # collection (favorite) count
                dm_number_list.append([float(sample['bullet_comments'])])  # bullet-comment (danmaku) count
                time_list.append([float(self.cal_day(sample['date'], sample['now_date']))])  # days since the video was published

        return np.array(time_list), np.array(watched_number_list), np.array(liked_list), np.array(collected_list), np.array(dm_number_list)

    def train_set_normalize(self, train_set):
        # Min-max normalization: scale the data into [0, 1]
        data_range = np.max(train_set) - np.min(train_set)
        return (train_set - np.min(train_set)) / data_range



    def cal_gradient(self, x, y):
        # Compute the gradients of the loss with respect to w and b
        # print(x, y)
        dw = np.mean((x * self.w + self.b - y) * x)
        db = np.mean(self.b + x * self.w - y)
        return dw, db
    
    
    def train(self, x, y):
        # Train the model with gradient descent, recording w and b at each step
        train_w = []
        train_b = []
        for i in range(self.max_iterator):
            print(self.w, self.b)
            train_w.append(self.w)
            train_b.append(self.b)
            # Compute the gradients and step in the direction of steepest descent
            dw, db = self.cal_gradient(x, y)
            self.w -= self.learn_rate*dw
            self.b -= self.learn_rate*db
        return train_w, train_b

    def predict(self, x):
        # Predict the (normalized) collection count from the like count
        return x * self.w + self.b
    
    def myplot(self, x, y, train_w, train_b):
        
        plt.pause(2)
        plt.ion()
        # Animated plot: redraw the fitted line every 30 iterations
        for i in range(0, self.max_iterator, 30):
            
            plt.clf()
            # Scatter plot of the original data
            plt.scatter(x, y, marker = 'o',color = 'yellow', s = 40)
            plt.xlabel('liked')
            plt.ylabel('collected')
            plt.plot(x, train_w[i] * x  + train_b[i], c='red')
            plt.title('step: %d learning-rate: %.2f function: y=%.2f * x + %.2f' %(i, self.learn_rate, train_w[i], train_b[i]))
            plt.pause(0.5) 
            
        plt.show()
        plt.ioff()
        plt.pause(200)

        
  


lr = LR()
time_list, watched_number_list, liked_list, collected_list, dm_number_list = lr.load_data(r'2020\Crawl\Bilibili\Item1\data\video_detial.json')
# The data needs min-max normalization before training

tw, tb = lr.train(lr.train_set_normalize(liked_list), lr.train_set_normalize(collected_list))
lr.myplot(lr.train_set_normalize(liked_list), lr.train_set_normalize(collected_list), tw, tb)
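To predict for a new video, the raw like count has to be normalized with the same minimum and maximum as the training data before calling lr.predict; the range is recomputed below because the script above does not store it, and the like count 50000 is a hypothetical value:

liked_min, liked_max = np.min(liked_list), np.max(liked_list)
new_liked = 50000  # hypothetical raw like count
new_liked_norm = (new_liked - liked_min) / (liked_max - liked_min)
print(lr.predict(new_liked_norm))  # predicted collection count on the normalized scale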

References

  1. https://www.cnblogs.com/geo-will/p/10468253.html
