d2l_Chapter Four Learning_Classification/Softmax Regression

x.1 Classification Problem Theory

x.1.1 The difference between Classification and Regression

Note that, broadly speaking, both Softmax Regression (Classification) and Linear Regression are linear models. But in everyday speech people are more accustomed to saying Classification for Softmax Regression and Regression for Linear Regression.

For example, predicting the sale price of a house is a typical Linear Regression problem. Sometimes, however, we care more about which category than how much, such as judging whether a patient has cancer from some features; that kind of problem is Classification.

x.1.2 One-hot Encoding

The output of a Classification problem is often unrelated to any natural order between categories, so the problem cannot always be transformed into a Regression problem. For this reason, statisticians invented a simple method for representing categorical data: one-hot encoding. A one-hot encoding is a vector whose length equals the total number of categories; the component at the position of the category is set to 1 and all other components are set to 0, for example $y \in \{(1, 0, 0), (0, 1, 0), (0, 0, 1)\}$.
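As a quick illustration (not from the original text), one-hot vectors for a 3-class problem can be produced directly in PyTorch; the class indices below are made up for the example:

```python
import torch
import torch.nn.functional as F

# Hypothetical labels for a 3-class problem
labels = torch.tensor([0, 2, 1])
one_hot = F.one_hot(labels, num_classes=3)
print(one_hot)
# tensor([[1, 0, 0],
#         [0, 0, 1],
#         [0, 1, 0]])
```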

x.1.3 Model Architecture

The difference between Classification and Regression models is that Classification is multi-output, so we also need one bias per output. For example, if we have 4 inputs and 3 outputs, we need 12 weight scalars and 3 bias scalars, a total of 15 learnable parameters, as follows:

[Figure: fully connected network with 4 inputs and 3 outputs]

$o_j = x_1 w_{j1} + x_2 w_{j2} + x_3 w_{j3} + x_4 w_{j4} + b_j, \quad j = 1, 2, 3$

To simplify the model, we express operations in matrix form, where W is a 3x4 matrix and b is a vector of length 3.

$\mathbf{o} = \mathbf{W}\mathbf{x} + \mathbf{b}$

It can be seen that for this two-layer network built from a fully connected layer, the number of parameters is always $O(dp)$, where $d$ is the number of inputs and $p$ is the number of outputs.
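A minimal sketch to confirm the parameter count for the 4-input, 3-output example above (the layer here is only an illustration, not code from the post):

```python
import torch.nn as nn

# A fully connected layer with d=4 inputs and p=3 outputs:
# 4*3 = 12 weights plus 3 biases, i.e. O(dp) parameters overall.
layer = nn.Linear(4, 3)
num_params = sum(p.numel() for p in layer.parameters())
print(num_params)  # 15
```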

x.1.4 Softmax

However, the raw outputs have problems:

  1. The sum of the output values is not necessarily 1.
  2. An output may be negative or even greater than 1.

Neither of these conforms to the basic axioms of probability theory. To solve this, Softmax is introduced.

$\hat{y}_j = \mathrm{softmax}(\mathbf{o})_j = \dfrac{\exp(o_j)}{\sum_{k} \exp(o_k)}$

The benefits of softmax:

  1. Transform the predictions so they are non-negative and sum to 1.
  2. Keep the model differentiable, i.e., its derivatives are easy to compute.
  3. Softmax does not change the order among the unnormalized predictions, so the following formula still holds:

$\operatorname{argmax}_j \hat{y}_j = \operatorname{argmax}_j o_j$

Finally, our network model introducing softmax is as follows:

$\hat{\mathbf{y}} = \mathrm{softmax}(\mathbf{o}), \quad \mathbf{o} = \mathbf{W}\mathbf{x} + \mathbf{b}$
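A small numeric check of these properties (the values are illustrative, not from the original):

```python
import torch

o = torch.tensor([2.0, -1.0, 0.5])           # raw (unnormalized) outputs
y_hat = torch.softmax(o, dim=0)

print(y_hat)                                  # all entries non-negative
print(y_hat.sum())                            # sums to 1
print(torch.argmax(o), torch.argmax(y_hat))   # softmax preserves the ordering
```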

x.1.5 Mini-batch Gradient Descent

batch size = 1: stochastic gradient descent (SGD)

batch size = n, 1 < n < N: mini-batch stochastic gradient descent (MBGD/MBSGD)

batch size = N (e.g. 256 if that is the whole dataset): batch gradient descent (BGD), because the entire dataset is used for every update

The mini-batch stochastic gradient descent algorithm (hereafter referred to as MBSGD) is used here, because computing XW for a whole mini-batch at once is much faster than processing one sample at a time.
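A minimal MBSGD sketch with made-up data; the batch size, learning rate, and tensor shapes are assumed values for illustration only:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Fake dataset: 1000 samples, 4 features, 3 classes
X = torch.randn(1000, 4)
y = torch.randint(0, 3, (1000,))
loader = DataLoader(TensorDataset(X, y), batch_size=64, shuffle=True)

net = nn.Linear(4, 3)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(net.parameters(), lr=0.1)

for xb, yb in loader:                 # each xb is one mini-batch of 64 samples
    optimizer.zero_grad()
    loss_fn(net(xb), yb).backward()
    optimizer.step()
```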

For example, we read a mini-batch of samples X, where the input feature dimension is d, the batch size is n, and the number of output categories is q. Then the mini-batch features are $\mathbf{X} \in \mathbb{R}^{n \times d}$, the weights are $\mathbf{W} \in \mathbb{R}^{d \times q}$, and the bias is $\mathbf{b} \in \mathbb{R}^{1 \times q}$, as follows:

$\mathbf{O} = \mathbf{X}\mathbf{W} + \mathbf{b}, \quad \hat{\mathbf{Y}} = \mathrm{softmax}(\mathbf{O})$
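A shape check of the mini-batch computation above, under assumed sizes n = 256, d = 784, q = 10 (the numbers are only for illustration):

```python
import torch

n, d, q = 256, 784, 10
X = torch.randn(n, d)            # mini-batch features
W = torch.randn(d, q)            # weights
b = torch.randn(1, q)            # bias, broadcast over the batch
O = X @ W + b                    # logits, shape (n, q)
Y_hat = torch.softmax(O, dim=1)
print(O.shape, Y_hat.sum(dim=1)[:3])  # torch.Size([256, 10]); each row sums to 1
```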

A larger batch_size is not always better. When batch normalization (BN) layers are present, the batch size generally needs to be greater than 8, and the batch_size should be matched with the learning rate. Also note that the number of learnable parameters has nothing to do with batch_size.

Here is a very good article about gradient descent algorithms: Deep Learning - Detailed Explanation of Optimizer Algorithms (BGD, SGD, MBGD, Momentum, NAG, Adagrad, Adadelta, RMSprop, Adam): https://www.cnblogs.com/guoyaohua/p/8542554.html

It is recommended to use the Adam optimizer, with the hyperparameters set to β1 = 0.9, β2 = 0.999, ϵ = 1e-8.
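A minimal sketch of these Adam settings in PyTorch; the model and learning rate are placeholders, and only the betas and epsilon come from the recommendation above:

```python
import torch
import torch.nn as nn

model = nn.Linear(784, 10)       # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                             betas=(0.9, 0.999), eps=1e-8)
```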

x.1.6 Cross-entropy Loss

From maximum likelihood estimation (MLE) we obtain the cross-entropy loss as the loss function. Suppose we have q categories; the loss between the target y and the prediction y_hat is then:

$l(\mathbf{y}, \hat{\mathbf{y}}) = -\sum_{j=1}^{q} y_j \log \hat{y}_j$
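Because the target is one-hot, the sum picks out a single term: the negative log-probability of the true class. A small illustrative check (the numbers are made up):

```python
import torch

y_hat = torch.tensor([0.1, 0.7, 0.2])   # a predicted probability distribution
y = torch.tensor([0.0, 1.0, 0.0])       # one-hot target, true class is index 1
loss = -(y * torch.log(y_hat)).sum()
print(loss)                              # equals -log(0.7)
```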

Note that the $y_j$ in the formula are in one-hot encoded form. Substituting the softmax activation function into formula (4.1.8), the output $\hat{y}_j$ also becomes a probability distribution:

$l(\mathbf{y}, \hat{\mathbf{y}}) = -\sum_{j=1}^{q} y_j \log \dfrac{\exp(o_j)}{\sum_{k=1}^{q} \exp(o_k)} = \log \sum_{k=1}^{q} \exp(o_k) - \sum_{j=1}^{q} y_j o_j$

Taking the partial derivative of the above loss with respect to a specific output $o_j$, we find that it equals the softmax of $o_j$ minus the one-hot encoded $y_j$, so the gradient of a network that introduces a softmax activation layer is easy to compute:

$\dfrac{\partial l(\mathbf{y}, \hat{\mathbf{y}})}{\partial o_j} = \mathrm{softmax}(\mathbf{o})_j - y_j$
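A numerical check of this gradient using autograd (the logits and label are illustrative, not from the original):

```python
import torch
import torch.nn.functional as F

o = torch.tensor([1.0, 2.0, 0.5], requires_grad=True)   # logits
y = torch.tensor(1)                                      # true class index

# F.cross_entropy applies softmax internally; it expects (N, C) inputs and (N,) targets
loss = F.cross_entropy(o.unsqueeze(0), y.unsqueeze(0))
loss.backward()

print(o.grad)                                            # gradient w.r.t. the logits
print(torch.softmax(o.detach(), dim=0) - F.one_hot(y, num_classes=3).float())  # same values
```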

x.2 Classification Problem Practice

Flatten the image data into one dimension and pass it into the fully connected layer for training.
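A minimal sketch of this, assuming Fashion-MNIST-like 1×28×28 images and 10 classes (these sizes are assumptions for the example):

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))  # flatten, then fully connected
images = torch.randn(32, 1, 28, 28)                        # a dummy batch of images
logits = net(images)
print(logits.shape)                                        # torch.Size([32, 10])
```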

There is not much difference between F.cross_entropy() and nn.CrossEntropyLoss(), but note that there is a big difference between nn.Dropout and F.dropout(). Although dropout has no learnable parameters, model.train() and model.eval() cannot control F.dropout (unless you pass training=self.training yourself), whereas they do control nn.Dropout.
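A small sketch of that difference (the tensor is made up and p = 0.5 is chosen arbitrarily):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.ones(1, 8)
drop = nn.Dropout(p=0.5)

drop.eval()                                  # eval mode disables the module's dropout
print(drop(x))                               # unchanged
print(F.dropout(x, p=0.5))                   # still drops: F.dropout defaults to training=True
print(F.dropout(x, p=0.5, training=False))   # disabled only when told explicitly
```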

Origin blog.csdn.net/qq_43369406/article/details/131177046