Machine Learning && Deep Learning - Linear Regression

The basic elements of linear regression, the stochastic gradient descent method, and optimization were introduced earlier; now we walk through linear regression itself:

Vectorization acceleration

When training models, we usually want to process a whole minibatch of samples at once, so the computation needs to be vectorized using linear algebra libraries instead of written as inefficient Python for loops. The following code gives an intuitive sense of how much vectorization helps:

import math
import time
import numpy as np
import torch
from d2l import torch as d2l

n = 10000
a = torch.ones([n])
b = torch.ones([n])

# Define a timer
# The #@save comment is a special marker that saves the corresponding function, class, or statement into the d2l package
class Timer:  #@save
    """Record multiple running times."""
    def __init__(self):
        self.times = []
        self.start()

    def start(self):
        """Start the timer."""
        self.tik = time.time()

    def stop(self):
        """Stop the timer and record the elapsed time in the list."""
        self.times.append(time.time() - self.tik)
        return self.times[-1]

    def avg(self):
        """Return the average time."""
        return sum(self.times) / len(self.times)

    def sum(self):
        """Return the total time."""
        return sum(self.times)

    def cumsum(self):
        """Return the cumulative times."""
        return np.array(self.times).cumsum().tolist()

# Time the for loop
c = torch.zeros(n)
timer = Timer()
for i in range(n):
    c[i] = a[i] + b[i]
print(f'for-loop time: {timer.stop():.9f} sec')

# Time the overloaded + operator, which computes the elementwise sum
timer.start()
d = a + b
print(f'vectorized time: {timer.stop():.9f} sec')

Result:

for-loop time: 0.133013725 sec
vectorized time: 0.001002550 sec
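The Timer class above also defines avg, sum, and cumsum, which the benchmark does not exercise. Below is a minimal usage sketch (not from the original post) that times several repeated vectorized additions; the number of repetitions is arbitrary:

timer = Timer()
for _ in range(5):
    timer.start()
    _ = a + b                     # vectorized elementwise sum
    timer.stop()                  # append this run's duration to timer.times

print(f'average time per run: {timer.avg():.9f} sec')
print(f'total time: {timer.sum():.9f} sec')
print(f'cumulative times: {timer.cumsum()}')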

Normal distribution and squared loss

Here, the squared loss objective function is interpreted through an assumption on the noise distribution.
The normal distribution (Gaussian distribution) is closely related to linear regression.
Its probability density function is as follows:
p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(x-\mu)^2\right)
Next, we visualize the normal distribution with the following code:

import math
import numpy as np
import torch
from d2l import torch as d2l
import matplotlib.pyplot as plt

# Define the normal distribution (probability density) function
def normal(x, mu, sigma):
    p = 1 / math.sqrt(2 * math.pi * sigma**2)
    return p * np.exp(-0.5 / sigma**2 * (x - mu)**2)

# Visualize the normal distribution

# Use numpy for the visualization
x = np.arange(-7, 7, 0.01)
# Pairs of mean and standard deviation
params = [(0, 1), (0, 2), (3, 1)]
d2l.plot(x, [normal(x, mu, sigma) for mu, sigma in params], xlabel='x',
         ylabel='p(x)', figsize=(4.5, 2.5),
         legend=[f'mean {mu}, std {sigma}' for mu, sigma in params])
d2l.plt.show()

(Figure: probability density curves of the normal distribution for the (mean, std) pairs above.)
It can be seen that changing the mean produces a shift along the x-axis, and increasing the variance spreads the distribution and reduces the peak value.
Before reading the following content, review maximum likelihood estimation; it comes up in many places.
The reason the mean squared error loss function (squared loss for short) can be used for linear regression is that we assume the observations contain noise, and that the noise follows a normal distribution. The noise model is as follows:
y = w^T x + b + \delta, \quad \text{where } \delta \sim N(0, \sigma^2)
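To make this noise model concrete, the following sketch (an illustration only, not part of the original derivation) generates synthetic data according to it; the values of true_w, true_b, and sigma are arbitrary choices for the example:

import torch

# Hypothetical parameters, chosen only for illustration
true_w = torch.tensor([2.0, -3.4])
true_b = 4.2
sigma = 0.01                       # standard deviation of the Gaussian noise delta

num_examples = 1000
X = torch.normal(0.0, 1.0, (num_examples, len(true_w)))     # features
delta = torch.normal(0.0, sigma, (num_examples,))           # noise ~ N(0, sigma^2)
y = X @ true_w + true_b + delta                             # labels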
We can thus write the likelihood of observing a particular y given x:
P(y \mid x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(y - w^T x - b)^2\right)
Now, according to maximum likelihood estimation, the optimal values of the parameters w and b are those that maximize the likelihood of the entire dataset:
P(y \mid X) = \prod_{i=1}^{n} p\left(y^{(i)} \mid x^{(i)}\right)
The estimator selected according to the maximum likelihood method is called the maximum likelihood estimator. Although maximizing the product of many exponential functions looks difficult, the problem can be simplified, without changing the objective, by maximizing the logarithm of the likelihood instead, which finally yields:
-\log P(y \mid X) = \sum_{i=1}^{n} \frac{1}{2}\log\left(2\pi\sigma^2\right) + \frac{1}{2\sigma^2}\left(y^{(i)} - w^T x^{(i)} - b\right)^2
Now we only need to assume that σ is some fixed constant, so the first term can be ignored, because it does not depend on w and b. Up to a constant, the second term is identical to the mean squared error introduced earlier.
The solution of the above formula does not depend on σ; therefore, under the assumption of Gaussian noise, minimizing the mean squared error is equivalent to maximum likelihood estimation of the linear model.
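As a quick numerical sanity check (again an illustrative sketch, reusing X, y, and sigma from the synthetic-data example above), the negative log-likelihood computed directly from the Gaussian density coincides with the decomposition into the constant term plus the sum of squared errors scaled by 1/(2σ²):

import math
import torch

def gaussian_log_pdf(r, sigma):
    """Log of the N(0, sigma^2) density evaluated at residual r."""
    return -0.5 * math.log(2 * math.pi * sigma**2) - r**2 / (2 * sigma**2)

w_cand = torch.tensor([1.0, -1.0])          # arbitrary candidate parameters
b_cand = 0.0
# Use float64 so the two ways of computing the same quantity agree tightly
residuals = (y - X @ w_cand - b_cand).double()

# Negative log-likelihood summed directly from the density ...
nll = -gaussian_log_pdf(residuals, sigma).sum()
# ... and via the decomposition: n/2 * log(2*pi*sigma^2) + SSE / (2*sigma^2)
n = len(y)
decomposed = n / 2 * math.log(2 * math.pi * sigma**2) + (residuals**2).sum() / (2 * sigma**2)
print(torch.isclose(nll, decomposed))       # tensor(True)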

From Linear Regression to Deep Networks

Neural Network Diagram

(Figure: linear regression depicted as a single-layer neural network with d inputs and one output.)
As shown in the figure above, the linear regression model can be described as a neural network; it is easy to see that it is a single-layer neural network. The figure only shows the connection pattern, omitting the values of the weights and biases. The feature dimension (the number of inputs in the input layer) in this figure is d. For linear regression, every input is connected to the output, so this transformation is a fully connected layer.
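In PyTorch, such a single fully connected layer corresponds to torch.nn.Linear. The following sketch (an illustration with d = 2 chosen arbitrarily, not the original post's training code) builds this one-layer network:

import torch
from torch import nn

d = 2                          # feature dimension, matching the d in the figure (value chosen for illustration)
net = nn.Linear(d, 1)          # one fully connected layer: every input connects to the single output

X = torch.randn(5, d)          # a small batch of 5 examples
print(net(X).shape)            # torch.Size([5, 1]) -- one prediction per example
print(net.weight.shape, net.bias.shape)    # the weights (1 x d) and bias (1,) omitted from the figure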

Biology

Dendrites receive information x_i from other neurons, and each input is weighted by a synaptic weight w_i that determines its impact (activation or inhibition through the product x_i w_i). The weighted inputs from multiple sources are aggregated in the nucleus in the form of a weighted sum
y = \sum_i x_i w_i + b,
and this information y is sent along the axon for further processing, usually after some nonlinear transformation σ(y). From there it either reaches its destination (such as a muscle) or enters another neuron.
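As a rough numerical analogy (an illustrative sketch only; the sigmoid is an arbitrary choice for the nonlinearity σ):

import torch

x = torch.tensor([0.5, -1.0, 2.0])     # "dendrite" inputs x_i from other neurons
w = torch.tensor([0.8, 0.3, -0.5])     # synaptic weights w_i
b = 0.1

y = torch.dot(x, w) + b                # weighted aggregation y = sum_i x_i w_i + b
output = torch.sigmoid(y)              # nonlinear processing sigma(y) before passing the signal on
print(y.item(), output.item())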


Origin blog.csdn.net/m0_52380556/article/details/131868795