Predicting the relationship between video likes and collections on Bilibili with a linear regression model (Hua Nong Brothers)
Preface
A linear regression model can be used to predict trends in data. Training on an existing data set yields a linear function Y = w * X + b, which can then be used to predict subsequent values.
1. Linear regression model
Linear regression assumes a linear correlation between the feature X and the target value Y. The linear model is solved from a known data set. The specific method is to construct a loss function and make its value smaller and smaller until the accuracy requirement is met or the maximum number of iterations is reached. The loss function can be understood as a measure of the gap between the values predicted by the model and the true values: the smaller the gap, the closer the model is to the true values. Definition of the loss function:
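One form consistent with the gradient computed in cal_gradient in the code section is the mean squared error with a factor of 1/2. This is a reconstruction under that assumption; n denotes the number of samples and (x_i, y_i) the i-th pair of likes and collections:

```latex
\mathrm{Loss}(w, b) = \frac{1}{2n} \sum_{i=1}^{n} \left( w x_i + b - y_i \right)^2
```

The factor 1/2 only cancels the 2 that appears when differentiating the square; it does not change where the minimum lies.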
To fit the model, minimize Loss(w, b). This is done with the gradient descent algorithm: the loss decreases fastest in the direction opposite to the gradient, so w and b are updated along that direction in each iteration until the requirements are met.
Calculate the partial derivatives of Loss(w, b) with respect to w and b (substitute y = w * x + b into the loss function):
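As a worked sketch, assume the loss is the mean squared error with a factor of 1/2, Loss(w, b) = 1/(2n) * sum((w * x_i + b - y_i)^2), which is the form that matches cal_gradient in the code section. The symbol \eta for the learning rate (learn_rate in the code) is notation introduced here for illustration:

```latex
\begin{aligned}
\frac{\partial \mathrm{Loss}}{\partial w} &= \frac{1}{n} \sum_{i=1}^{n} \left( w x_i + b - y_i \right) x_i \\
\frac{\partial \mathrm{Loss}}{\partial b} &= \frac{1}{n} \sum_{i=1}^{n} \left( w x_i + b - y_i \right) \\
w &\leftarrow w - \eta \, \frac{\partial \mathrm{Loss}}{\partial w}, \qquad
b \leftarrow b - \eta \, \frac{\partial \mathrm{Loss}}{\partial b}
\end{aligned}
```

The first two lines correspond to dw and db in cal_gradient, and the last line to the updates of self.w and self.b in train.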
2. Getting the data
The video information was crawled from Bilibili; this article uses the videos of "Hua Nong Brothers". For details on the crawl, see the blog post on crawling all videos of a Bilibili uploader (UP). The number of likes and the number of collections of each video are taken, and a linear regression model is built to predict their relationship. The scatter plot of video likes (x-axis) against collections (y-axis) is as follows:
Because the data are widely spread, they need to be normalized. This article uses min-max normalization, which rescales each variable using its minimum and maximum values.
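A minimal sketch of min-max normalization, which maps the smallest value to 0 and the largest to 1. The function name min_max_normalize, the guard for a constant column, and the sample like counts are illustrative assumptions, not taken from the crawled data:

```python
import numpy as np


def min_max_normalize(values):
    """Rescale values linearly so that min -> 0 and max -> 1."""
    values = np.asarray(values, dtype=float)
    v_min, v_max = values.min(), values.max()
    if v_max == v_min:
        # Assumed guard: a constant column carries no spread, so return zeros
        # instead of dividing by zero.
        return np.zeros_like(values)
    return (values - v_min) / (v_max - v_min)


# Made-up like counts, for illustration only.
likes = [120.0, 480.0, 300.0, 1020.0]
scaled = min_max_normalize(likes)
print(scaled)
```

After this step both variables lie in [0, 1], so a single learning rate suits both w and b.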
3. Model training
After training, w = 0.7229486928307687 and b = 0.20322045504258518. The trained model is as follows:
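To illustrate how such values of w and b come out of training, the following minimal sketch runs the same gradient descent update on synthetic data. The function name fit_line, the synthetic line y = 0.7 * x + 0.2, the noise level, the learning rate and the iteration count are all assumptions chosen for the example; they are not the article's data or settings.

```python
import numpy as np


def fit_line(x, y, learn_rate=0.5, iterations=2000):
    """Fit y ~ w * x + b by batch gradient descent on the mean squared error."""
    w, b = 1.0, 1.0  # fixed start values so the run is reproducible
    for _ in range(iterations):
        residual = w * x + b - y
        dw = np.mean(residual * x)  # partial derivative with respect to w
        db = np.mean(residual)      # partial derivative with respect to b
        w -= learn_rate * dw
        b -= learn_rate * db
    return w, b


# Synthetic, already normalized data scattered around y = 0.7 * x + 0.2.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=200)
y = 0.7 * x + 0.2 + rng.normal(0.0, 0.01, size=200)

w, b = fit_line(x, y)
# Closed-form least-squares fit, used only as a reference for comparison.
w_ref, b_ref = np.polyfit(x, y, 1)
print(w, b, w_ref, b_ref)
```

Comparing the result against a closed-form least-squares fit such as np.polyfit is one way to check whether the chosen learning rate and number of iterations were enough for gradient descent to converge.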
4. The code
# Linear regression model for the relationship between likes and collections
# of Bilibili videos (Hua Nong Brothers)
import json
import time

import numpy as np
import matplotlib.pyplot as plt


class LR(object):
    def __init__(self, max_iterator=1000, learn_rate=0.01):
        self.max_iterator = max_iterator  # number of gradient descent iterations
        self.learn_rate = learn_rate      # step size
        # Random initial values close to 1
        self.w = np.random.normal(1, 0.1)
        self.b = np.random.normal(1, 0.1)

    def cal_day(self, release_date, now_date):
        # Number of days between the release date and the crawl date
        start_time = time.mktime(time.strptime(release_date.split(' ')[0], '%Y-%m-%d'))
        end_time = time.mktime(time.strptime(now_date.split(' ')[0], '%Y-%m-%d'))
        return int((end_time - start_time) / (24 * 60 * 60))

    def load_data(self, url):
        with open(url, 'r', encoding='utf-8') as f:
            data_dect = json.load(f)
        # One single-element row per video, so each array has shape (n, 1)
        watched_number_list = []
        time_list = []
        dm_number_list = []
        liked_list = []
        collected_list = []
        for sample in data_dect:
            # Skip bad samples that have no view count
            if sample['watched'] != '':
                watched_number_list.append([float(sample['watched'])])       # views
                liked_list.append([float(sample['liked'])])                  # likes
                collected_list.append([float(sample['collected'])])          # collections
                dm_number_list.append([float(sample['bullet_comments'])])    # bullet comments
                time_list.append([float(self.cal_day(sample['date'], sample['now_date']))])  # days since release
        return (np.array(time_list), np.array(watched_number_list), np.array(liked_list),
                np.array(collected_list), np.array(dm_number_list))

    def train_set_normalize(self, train_set):
        # Min-max normalization to the range [0, 1]
        data_range = np.max(train_set) - np.min(train_set)
        return (train_set - np.min(train_set)) / data_range

    def cal_gradient(self, x, y):
        # Partial derivatives of the loss with respect to w and b
        dw = np.mean((x * self.w + self.b - y) * x)
        db = np.mean(self.b + x * self.w - y)
        return dw, db

    def train(self, x, y):
        # Train with gradient descent, recording w and b at every step
        train_w = []
        train_b = []
        for _ in range(self.max_iterator):
            print(self.w, self.b)
            train_w.append(self.w)
            train_b.append(self.b)
            # Compute the gradient and step in the direction that lowers the loss
            dw, db = self.cal_gradient(x, y)
            self.w -= self.learn_rate * dw
            self.b -= self.learn_rate * db
        return train_w, train_b

    def predict(self, x):
        # Predict (x must be normalized the same way as the training data)
        return x * self.w + self.b

    def myplot(self, x, y, train_w, train_b):
        plt.ion()
        plt.pause(2)
        # Animated plot: redraw the fitted line every 30 iterations
        for i in range(0, self.max_iterator, 30):
            plt.clf()
            # Scatter plot of the original data
            plt.scatter(x, y, marker='o', color='yellow', s=40)
            plt.xlabel('liked')
            plt.ylabel('collected')
            plt.plot(x, train_w[i] * x + train_b[i], c='red')
            plt.title('step: %d learning-rate: %.2f function: y=%.2f * x + %.2f'
                      % (i, self.learn_rate, train_w[i], train_b[i]))
            plt.pause(0.5)
        # Leave interactive mode and keep the last frame open until the window is closed
        plt.ioff()
        plt.show()


if __name__ == '__main__':
    lr = LR()
    time_list, watched_number_list, liked_list, collected_list, dm_number_list = lr.load_data(
        r'2020\Crawl\Bilibili\Item1\data\video_detial.json')
    # The data must be normalized before training
    x = lr.train_set_normalize(liked_list)
    y = lr.train_set_normalize(collected_list)
    tw, tb = lr.train(x, y)
    lr.myplot(x, y, tw, tb)
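Because the model is trained on normalized values, predicting the collections of a video from its raw like count means normalizing the input and mapping the output back to the original scale. The sketch below shows one way this could be done; the helper predict_collections and the minimum and maximum values passed to it are hypothetical placeholders, not statistics from the crawled data. Only w and b are the values reported in the model training section.

```python
def predict_collections(likes, w, b, likes_min, likes_max, coll_min, coll_max):
    """Predict raw collections from raw likes with a model trained on min-max scaled data."""
    # Scale the input the same way the training data was scaled.
    x_norm = (likes - likes_min) / (likes_max - likes_min)
    # Apply the trained linear model in normalized space.
    y_norm = w * x_norm + b
    # Map the normalized prediction back to the original scale.
    return y_norm * (coll_max - coll_min) + coll_min


# w and b as reported after training in this article.
w, b = 0.7229486928307687, 0.20322045504258518

# Placeholder ranges, for illustration only.
estimate = predict_collections(
    50_000, w, b,
    likes_min=0, likes_max=100_000,
    coll_min=0, coll_max=20_000,
)
print(round(estimate))
```

In practice the minimum and maximum would be the ones computed from liked_list and collected_list when the training data was normalized.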
References
- https://www.cnblogs.com/geo-will/p/10468253.html