x.1 Classification Problem Theory
x.1.1 The difference between Classification and Regression
Note that, broadly speaking, both softmax regression and linear regression are linear models. But in everyday usage people tend to say "Classification" for softmax regression and "Regression" for linear regression.
For example, predicting the sale price of a house is a typical linear regression problem. Sometimes, however, we care more about the category than about a quantity, such as judging whether a patient has cancer from some features; that kind of problem is classification.
x.1.2 One-hot Encoding
The output categories of a classification problem usually have no natural order, so the problem cannot simply be transformed into a regression problem. For this reason, statisticians invented a simple method for representing categorical data: one-hot encoding. A one-hot encoding is a vector whose length equals the total number of categories; the component at the position of the sample's category is set to 1, and all other components are set to 0. For example, $y \in \{(1, 0, 0), (0, 1, 0), (0, 0, 1)\}$.
x.1.3 Model Architecture
The difference between classification and regression models is that classification is multi-output, so we also need as many biases as there are outputs. For example, if we have 4 inputs and 3 outputs, we need 12 weight scalars and 3 bias scalars, a total of 15 learnable parameters, as follows:
To simplify the model, we express operations in matrix form, where W is a 3x4 matrix and b is a vector of length 3.
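The single-sample forward pass described above can be sketched in plain Python (no framework; the toy values for x, W, and b are made up for illustration):

```python
# A minimal sketch of the fully connected layer above:
# 4 inputs, 3 outputs -> 12 weight scalars + 3 bias scalars = 15 parameters.

def linear_forward(x, W, b):
    """Compute o = xW + b for one sample.
    x: list of d inputs, W: d x q nested list, b: list of q biases."""
    d, q = len(W), len(W[0])
    return [sum(x[i] * W[i][j] for i in range(d)) + b[j] for j in range(q)]

x = [1.0, 2.0, 3.0, 4.0]           # d = 4 input features
W = [[0.1, 0.2, 0.3],              # d x q = 4 x 3 = 12 weight scalars
     [0.4, 0.5, 0.6],
     [0.7, 0.8, 0.9],
     [1.0, 1.1, 1.2]]
b = [0.1, 0.2, 0.3]                # q = 3 bias scalars
o = linear_forward(x, W, b)        # 3 raw outputs
```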
It can be seen that for a fully connected layer, the parameter cost is always $O(dp)$, where $d$ is the number of inputs and $p$ is the number of outputs.
x.1.4 Softmax
Using the raw outputs directly as probabilities raises problems:
- The sum of the output values is not necessarily 1
- The output may be negative or even greater than 1
None of the above conforms to the basic axioms of probability theory. To solve this, softmax is introduced.
The benefits of softmax:
- Transform predictions to be non-negative and sum to 1.
- Keep the model differentiable, i.e. the derivatives are easy to compute.
- Softmax does not change the order among the unnormalized predictions, so the following formula still holds,
Finally, our network model introducing softmax is as follows:
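A minimal softmax sketch in plain Python (the max-subtraction is the usual numerical-stability trick and does not change the result):

```python
import math

def softmax(o):
    """Map raw outputs o to a probability distribution."""
    m = max(o)                             # subtract max for numerical stability
    exps = [math.exp(v - m) for v in o]
    s = sum(exps)
    return [e / s for e in exps]

probs = softmax([1.0, 2.0, 3.0])
# probs are non-negative, sum to 1, and preserve the ordering of the inputs
```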
x.1.5 Mini-batch Gradient Descent
- batch size = 1: stochastic gradient descent (SGD)
- batch size = n, 1 < n < N: mini-batch stochastic gradient descent (MBSGD)
- batch size = N, the whole dataset: batch gradient descent (BGD), because all of the data is used
The mini-batch stochastic gradient descent (hereinafter MBSGD) algorithm is used here, because compared with processing one sample at a time, MBSGD speeds up the matrix multiplication XW.
For example, we read a mini-batch of samples X, where the input feature dimension is d, the batch size is n, and the number of output categories is q. Then the mini-batch features are $X \in \mathbb{R}^{n \times d}$, the weights are $W \in \mathbb{R}^{d \times q}$, and the bias is $b \in \mathbb{R}^{1 \times q}$, as follows,
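The shapes above can be checked with a tiny plain-Python sketch (n, d, q, and all values are toy numbers chosen for illustration):

```python
# Shape check for the mini-batch forward pass O = XW + b,
# with n = 2 samples, d = 4 features, q = 3 classes.
n, d, q = 2, 4, 3
X = [[float(i + j) for j in range(d)] for i in range(n)]   # n x d
W = [[0.01] * q for _ in range(d)]                         # d x q
b = [0.1] * q                                              # 1 x q, broadcast over rows

O = [[sum(X[r][i] * W[i][c] for i in range(d)) + b[c] for c in range(q)]
     for r in range(n)]                                    # n x q
```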
A larger batch_size is not always better. When the network contains BatchNorm layers, the batch size generally needs to be greater than 8, and the batch size should be matched to the learning rate. Note that the number of learnable parameters has nothing to do with batch_size.
Here is a very good article about gradient descent algorithms: Deep Learning – Detailed Explanation of Optimizer Algorithms (BGD, SGD, MBGD, Momentum, NAG, Adagrad, Adadelta, RMSprop, Adam): https://www.cnblogs.com/guoyaohua/p/8542554.html
It is recommended to use the Adam optimizer, with hyperparameters β1 = 0.9, β2 = 0.999, ε = 1e−8.
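A single Adam update for one scalar parameter can be sketched as follows (a plain-Python illustration of the update rule, using the default hyperparameters recommended above; in practice you would use a framework's built-in optimizer such as torch.optim.Adam):

```python
import math

def adam_step(theta, grad, m, v, t, lr=1e-3,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter theta at step t (t >= 1)."""
    m = beta1 * m + (1 - beta1) * grad          # 1st-moment (momentum) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # 2nd-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = 1.0, 0.0, 0.0
theta, m, v = adam_step(theta, grad=2.0, m=m, v=v, t=1)
```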
x.1.6 Cross-entropy Loss
From maximum likelihood estimation (MLE) we obtain the cross-entropy loss as our loss function. Suppose we have q categories; the loss between the target y and the prediction y_hat is,
Note that y_i in the formula is in one-hot encoding form. We introduce the softmax activation function into formula (4.1.8), so that the output o_j also takes the form of a probability distribution,
Taking the partial derivative of the above loss with respect to a specific output o_j gives softmax(o)_j − y_j, i.e. the softmax probability minus the one-hot encoding of y_j, so the derivative of a network model with a softmax activation layer is easy to compute.
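The gradient claim above can be checked numerically with a small plain-Python sketch, comparing the analytic gradient softmax(o)_j − y_j against a central finite difference of the loss (toy values for o and y):

```python
import math

def softmax(o):
    m = max(o)
    e = [math.exp(v - m) for v in o]
    s = sum(e)
    return [x / s for x in e]

def ce_loss(o, y):
    """Cross-entropy between one-hot target y and softmax(o)."""
    p = softmax(o)
    return -sum(yi * math.log(pi) for yi, pi in zip(y, p))

o = [0.5, -1.0, 2.0]
y = [0.0, 0.0, 1.0]                       # one-hot target: class 2

# Analytic gradient: softmax(o)_j - y_j
analytic = [pj - yj for pj, yj in zip(softmax(o), y)]

# Numerical gradient via central differences
h = 1e-6
numeric = []
for j in range(3):
    op = o[:]; op[j] += h
    om = o[:]; om[j] -= h
    numeric.append((ce_loss(op, y) - ce_loss(om, y)) / (2 * h))
```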
x.2 Classification Problem Practice
Flatten the image data into one-dimensional and pass it into the fully connected layer for training.
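The flattening step can be illustrated with a toy 28×28 image (the size is an assumption for illustration, matching the common MNIST/Fashion-MNIST setting; frameworks do the same thing with reshape/flatten):

```python
# Flatten a toy 28x28 grayscale "image" (nested list) into a
# one-dimensional vector of length 784 for the fully connected layer.
H, W = 28, 28
image = [[0.5] * W for _ in range(H)]        # H x W pixel grid
flat = [px for row in image for px in row]   # length H * W = 784
```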
There is not much difference between F.cross_entropy() and nn.CrossEntropyLoss(), but it should be noted that there is a big difference between nn.Dropout and F.dropout(). Although dropout has no learnable parameters, model.train() and model.eval() cannot control F.dropout, but they can control nn.Dropout.
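A short PyTorch sketch of the difference: the nn.Dropout module responds to train()/eval(), while F.dropout defaults to training=True regardless of the module's mode (unless you pass training=self.training yourself):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.ones(1, 1000)

# nn.Dropout is a module: train()/eval() toggle it.
drop = nn.Dropout(p=0.5)
drop.train()
out_train = drop(x)          # roughly half the units zeroed (and scaled by 2)
drop.eval()
out_eval = drop(x)           # identity in eval mode

# F.dropout defaults to training=True, so it keeps dropping
# even inside a model that has been put into eval mode.
out_f = F.dropout(x, p=0.5)
```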