Hands-on Deep Learning Notes (2) - Linear Neural Network

Before studying deep neural networks, you need to understand the basics of neural network training: defining a simple network architecture, processing data, specifying a loss function, and training the model. To make this easier to learn, we start with a classic algorithm, the linear neural network, and use it to cover the fundamentals of neural networks.

1.1 Basic elements in linear regression

Linear regression is based on several simple assumptions:
First, we assume that the relationship between the independent variable x and the dependent variable y is linear, that is, y can be expressed as a weighted sum of the elements of x, usually allowing for some noise in the observations; second, we assume that any noise is well behaved, for example that it follows a normal distribution.
Let's take the classic example of estimating house prices based on the size (square meters) and age (years) of a house.
First, we need to prepare a training dataset (or training set), which includes the real house prices, sizes, and ages.
We call the quantity we are trying to predict (here, the price of a house) the label or target. The independent variables on which the prediction is based (area and age) are called features or covariates.

1.1.1 Linear Model

According to our initial linearity assumption, the prediction target (house price) can be expressed as a weighted sum of area and house age, written mathematically as follows:
$$\mathrm{price} = w_{\mathrm{area}} \cdot \mathrm{area} + w_{\mathrm{age}} \cdot \mathrm{age} + b$$

This expression can be seen as an affine transformation of the input features. An affine transformation is characterized by a linear transformation (the weighted sum of the features) combined with a translation (the bias term).

w_area and w_age are called weights; the weights determine the influence of each feature on our predicted value.
b is called the bias, offset, or intercept (intercept may be the more familiar term). The bias is simply the predicted value when all features take the value 0. Even though in reality no house has an area of 0 or an age of exactly 0 years, we still need the bias term: without it, the expressive power of our model would be limited.

In machine learning, we usually work with high-dimensional datasets whose inputs have many features, $\mathbf{x} \in \mathbb{R}^d$, so it is more convenient to use linear algebra and represent the inputs as vectors. The dot product of vectors then expresses the weighted sum, and the model simplifies to the following form:
$$\hat{y} = \mathbf{w}^\top \mathbf{x} + b$$
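A minimal sketch of this vector form in PyTorch (the helper name linreg and the example numbers are illustrative, not part of the original notes):

import torch

def linreg(X, w, b):
    """Linear model: returns Xw + b for a batch of examples."""
    # X: (num_examples, num_features), w: (num_features, 1), b: scalar
    return torch.matmul(X, w) + b

# Example: three houses described by (area, age)
X = torch.tensor([[120.0, 5.0], [80.0, 20.0], [95.0, 10.0]])
w = torch.tensor([[2.0], [-0.5]])   # hypothetical weights for area and age
b = torch.tensor(4.0)               # hypothetical bias
print(linreg(X, w, b))              # one predicted price per house, shape (3, 1)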
After writing down this mathematical model, we can see that the expression contains two model parameters, w and b. The next step is to determine these two parameters, for which we introduce two further concepts:

  • Loss function: a measure of model quality.
  • Stochastic gradient descent: a method for updating the model to improve the quality of its predictions.

1.1.2 Loss function

The loss function quantifies the difference between the actual value of the target and the predicted value.
Usually we choose a non-negative number as the loss; the smaller the value, the smaller the loss, and the loss for a perfect prediction is 0.
The most commonly used loss function in regression problems is the squared error function. When the predicted value for sample i is $\hat{y}^{(i)}$ and its corresponding true label is $y^{(i)}$, the squared error is defined by the following formula:
$$l^{(i)}(\mathbf{w}, b) = \frac{1}{2}\left(\hat{y}^{(i)} - y^{(i)}\right)^2$$

The constant 1/2 makes no essential difference, but it is slightly simpler in form (since the constant coefficient is 1 when we take the derivative of the loss function).

[Figure: fitting the data with a linear model.]
Because of the quadratic term in the squared error function, a larger difference between the estimated value $\hat{y}^{(i)}$ and the observed value $y^{(i)}$ results in a larger loss. To measure the quality of the model on the entire dataset, we compute the mean (equivalently, up to scaling, the sum) of the losses over the n samples of the training set:
$$L(\mathbf{w}, b) = \frac{1}{n}\sum_{i=1}^{n} l^{(i)}(\mathbf{w}, b) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{2}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)^2$$
When training the model, we want to find a set of parameters $(\mathbf{w}^*, b^*)$ that minimizes the total loss over all training samples:
$$\mathbf{w}^*, b^* = \operatorname*{argmin}_{\mathbf{w},\, b} \; L(\mathbf{w}, b)$$
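As a small sketch (assuming PyTorch tensors; the helper name squared_loss and the toy values are illustrative), the per-sample squared loss and its mean over the dataset can be computed as:

import torch

def squared_loss(y_hat, y):
    """Squared error 1/2 * (y_hat - y)^2, computed element-wise over a batch."""
    return (y_hat - y.reshape(y_hat.shape)) ** 2 / 2

# Mean loss over n samples, corresponding to L(w, b)
y_hat = torch.tensor([2.5, 0.8, 3.1])   # predictions (hypothetical values)
y = torch.tensor([3.0, 1.0, 3.0])       # true labels (hypothetical values)
print(squared_loss(y_hat, y).mean())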

1.1.3 Stochastic Gradient Descent

Gradient descent can optimize almost all deep learning models. It reduces the error by continually updating the parameters in the direction that decreases the loss function.
The simplest use of gradient descent is to compute the derivative of the loss function with respect to the model parameters (which can also be called the gradient here) over the whole dataset.
But in practice this can be very slow: we would have to traverse the entire dataset before every parameter update. Therefore, we usually draw a small random batch of samples each time an update needs to be computed. This variant is called minibatch stochastic gradient descent.
The algorithm proceeds as follows: (1) initialize the values of the model parameters, e.g. by random initialization; (2) repeatedly sample a random minibatch from the data and compute the derivative (also called the gradient) of the average minibatch loss with respect to the model parameters; (3) multiply the gradient by a predetermined positive number η and subtract it from the current parameter values, and keep iterating this step:
$$(\mathbf{w}, b) \leftarrow (\mathbf{w}, b) - \frac{\eta}{|\mathcal{B}|}\sum_{i \in \mathcal{B}} \partial_{(\mathbf{w}, b)}\, l^{(i)}(\mathbf{w}, b)$$

$|\mathcal{B}|$ represents the number of samples in each minibatch, also called the batch size. η represents the learning rate. The values of the batch size and learning rate are usually specified manually in advance rather than learned through model training. These parameters, which can be adjusted but are not updated during training, are called hyperparameters.
Hyperparameter tuning is the process of choosing hyperparameters. Hyperparameters are usually tuned based on the results of training iterations, which are evaluated on an independent validation dataset.
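A possible sketch of how random minibatches can be drawn, with batch_size as the manually chosen hyperparameter (the helper name data_iter and the toy data are my own illustration):

import random
import torch

def data_iter(batch_size, features, labels):
    """Yield random minibatches of (features, labels), each with batch_size examples."""
    num_examples = len(features)
    indices = list(range(num_examples))
    random.shuffle(indices)  # visit the examples in random order
    for i in range(0, num_examples, batch_size):
        batch_indices = torch.tensor(indices[i:i + batch_size])
        yield features[batch_indices], labels[batch_indices]

# batch_size is a hyperparameter: set by hand, not learned during training
features, labels = torch.randn(10, 2), torch.randn(10, 1)
for X_batch, y_batch in data_iter(4, features, labels):
    print(X_batch.shape, y_batch.shape)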

For example, for w and b in the squared loss:
$$\mathbf{w} \leftarrow \mathbf{w} - \frac{\eta}{|\mathcal{B}|}\sum_{i \in \mathcal{B}} \mathbf{x}^{(i)}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right), \qquad b \leftarrow b - \frac{\eta}{|\mathcal{B}|}\sum_{i \in \mathcal{B}} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)$$
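A minimal sketch of this update step, assuming the gradient of the summed minibatch loss has already been accumulated in each parameter's .grad attribute by a call to backward() (the helper name sgd is illustrative):

import torch

def sgd(params, lr, batch_size):
    """Minibatch SGD step: param <- param - (lr / batch_size) * param.grad."""
    with torch.no_grad():
        for param in params:
            param -= lr / batch_size * param.grad
            param.grad.zero_()  # clear the gradient for the next minibatch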
Linear regression happens to be a learning problem with only one minimum over the entire domain. But for complex models like deep neural networks, the loss surface usually contains many minima. It takes a lot of effort to find a set of parameters that minimizes the loss on the training set. In fact, what is harder is to find a set of parameters that achieves low loss on data we have never seen before, a challenge known as generalization.

1.2 Vectorization acceleration

In practice, this simply means using a linear algebra library instead of writing Python for loops, which simplifies the computation. We time two ways of adding two vectors to show the difference in efficiency between them:

import torch

n = 100000
a = torch.ones(n)
b = torch.ones(n)

A commonly used timer:

import time
import numpy as np

class Timer:  #@save
    """Record multiple running times."""
    def __init__(self):
        self.times = []
        self.start()

    def start(self):
        """Start the timer."""
        self.tik = time.time()

    def stop(self):
        """Stop the timer and record the elapsed time in a list."""
        self.times.append(time.time() - self.tik)
        return self.times[-1]

    def avg(self):
        """Return the average time."""
        return sum(self.times) / len(self.times)

    def sum(self):
        """Return the sum of the times."""
        return sum(self.times)

    def cumsum(self):
        """Return the cumulative times."""
        return np.array(self.times).cumsum().tolist()

The first: use a for loop to perform the addition one element at a time:

c = torch.zeros(n)
timer = Timer()
for i in range(n):
    c[i] = a[i] + b[i]
f'{timer.stop():.5f} sec'

Output duration:

'0.76498 sec'

The second: use the overloaded + operator to compute the element-wise sum.

timer.start()
d = a + b
f'{timer.stop():.5f} sec'

Output duration:

'0.00100 sec'

It turns out that the second method is much faster than the first. Vectorizing code often yields order-of-magnitude speedups. In addition, we push more of the math into the library instead of writing many calculations ourselves, reducing the chance of errors.

1.3 Normal distribution and squared loss

The normal distribution is closely related to linear regression. The normal distribution, also known as the Gaussian distribution, has the following probability density function:
$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(x - \mu)^2\right)$$
Below we define a Python function to compute the normal distribution.

import math

def normal(x, mu, sigma):
    p = 1 / math.sqrt(2 * math.pi * sigma**2)
    return p * np.exp(-0.5 / sigma**2 * (x - mu)**2)

To visualize:

from d2l import torch as d2l

# Visualize again using numpy
x = np.arange(-7, 7, 0.01)

# Pairs of mean and standard deviation
params = [(0, 1), (0, 2), (3, 1)]
d2l.plot(x, [normal(x, mu, sigma) for mu, sigma in params], xlabel='x',
         ylabel='p(x)', figsize=(4.5, 2.5),
         legend=[f'mu {mu}, sigma {sigma}' for mu, sigma in params])

[Figure: probability density curves of the normal distribution for (mu, sigma) = (0, 1), (0, 2), and (3, 1).]
As the figure shows, changing the mean (mu) shifts the curve along the x-axis, and increasing the standard deviation (sigma) spreads the distribution out and lowers its peak.
One reason the mean squared error loss can be used for linear regression is that we assume the observations contain noise and that the noise follows a normal distribution. The noise model is:
$$y = \mathbf{w}^\top \mathbf{x} + b + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2)$$
The likelihood of observing a particular y with a given x can be written:
$$P(y \mid \mathbf{x}) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}\left(y - \mathbf{w}^\top \mathbf{x} - b\right)^2\right)$$
According to the maximum likelihood estimation method, the optimal values of the parameters w and b are those that maximize the likelihood of the entire dataset:
$$P(\mathbf{y} \mid \mathbf{X}) = \prod_{i=1}^{n} P\left(y^{(i)} \mid \mathbf{x}^{(i)}\right)$$
Estimators chosen according to the maximum likelihood method are called maximum likelihood estimators. Although maximizing the product of many exponential functions may seem difficult, we can simplify it, without changing the objective, by maximizing the logarithm of the likelihood instead. For historical reasons, optimization is usually phrased as minimization rather than maximization, so we can equivalently minimize the negative log-likelihood $-\log P(\mathbf{y} \mid \mathbf{X})$:
$$-\log P(\mathbf{y} \mid \mathbf{X}) = \sum_{i=1}^{n} \left[\frac{1}{2}\log\left(2\pi\sigma^2\right) + \frac{1}{2\sigma^2}\left(y^{(i)} - \mathbf{w}^\top \mathbf{x}^{(i)} - b\right)^2\right]$$
Now we only need to assume that σ is some fixed constant, so we can ignore the first term because it does not depend on w and b. The second term is then identical to the mean squared error introduced earlier, apart from the constant factor 1/σ².
Therefore, under the assumption of Gaussian noise, minimizing the mean squared error is equivalent to maximum likelihood estimation of the linear model.
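Spelling out that last step as a worked equation, dropping the terms and constant factors that do not depend on $\mathbf{w}$ and $b$:

$$\operatorname*{argmin}_{\mathbf{w},\, b}\left(-\log P(\mathbf{y} \mid \mathbf{X})\right) = \operatorname*{argmin}_{\mathbf{w},\, b} \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y^{(i)} - \mathbf{w}^\top \mathbf{x}^{(i)} - b\right)^2 = \operatorname*{argmin}_{\mathbf{w},\, b} \sum_{i=1}^{n} \frac{1}{2}\left(\hat{y}^{(i)} - y^{(i)}\right)^2$$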

1.4 From Linear Regression to Deep Networks

Although neural networks cover many more and richer models, we can still describe the linear model in the same way we describe a neural network, and thus view the linear model as a neural network. The diagram below shows only the connectivity pattern, i.e. how each input is connected to the output, omitting the values of the weights and biases.

[Figure: linear regression as a single-layer neural network, with inputs x_1, ..., x_d and a single output o_1.]
In the figure, the inputs are x_1, …, x_d, so the number of inputs (also called the feature dimensionality) of the input layer is d. The output of the network is o_1, so the number of outputs of the output layer is 1. Note that the input values are all given, and there is only a single computational neuron.
Since the model focuses on where computation takes place, we usually do not count the input layer when counting the number of layers. So the number of layers of the neural network in the figure is 1. We can think of a linear regression model as a neural network consisting of only a single artificial neuron, also known as a single-layer neural network.
For linear regression, each input is connected to each output (in this case there is only one output); we call this transformation (the output layer in the figure) a fully-connected layer, also called a dense layer.
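As a sketch, this single fully-connected layer is exactly what PyTorch's built-in nn.Linear provides; the input dimension of 2 below comes from the area/age example above and is only illustrative:

import torch
from torch import nn

# A fully-connected (dense) layer with 2 inputs and 1 output
# computes o = x W^T + b, i.e. the linear regression model.
net = nn.Linear(2, 1)

X = torch.randn(5, 2)                     # 5 examples, 2 features each
print(net(X).shape)                       # torch.Size([5, 1])
print(net.weight.shape, net.bias.shape)   # torch.Size([1, 2]) torch.Size([1])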

1.5 Summary

  • The key elements in a machine learning model are the training data, the loss function, the optimization algorithm, and the model itself.
  • Vectorization makes the math simpler and faster.
  • Minimizing the objective function is equivalent to performing maximum likelihood estimation.
  • A linear regression model is also a simple neural network.

Origin blog.csdn.net/qq_52118067/article/details/122510573