Hands-on Deep Learning: Softmax Classification Model (PyTorch)

Softmax and the classification model

Content includes:

  1. Basic concepts of softmax regression
  2. How to obtain and read the Fashion-MNIST dataset
  3. Implementing a softmax regression model from scratch to classify images in the Fashion-MNIST training set
  4. Re-implementing the softmax regression model with PyTorch

Basic concepts of softmax regression

  • Classification problem
    Consider a simple image classification problem: the input images have a height and width of 2 pixels, and the color is grayscale.
    The 4 pixels in the image are denoted $x_1, x_2, x_3, x_4$.
    Assume the true labels are dog, cat, or chicken; these labels correspond to the discrete values $y_1, y_2, y_3$.
    We typically use discrete values to represent categories, e.g. $y_1 = 1, y_2 = 2, y_3 = 3$.

  • Weight vector

$$
\begin{aligned}
o_1 &= x_1 w_{11} + x_2 w_{21} + x_3 w_{31} + x_4 w_{41} + b_1,\\
o_2 &= x_1 w_{12} + x_2 w_{22} + x_3 w_{32} + x_4 w_{42} + b_2,\\
o_3 &= x_1 w_{13} + x_2 w_{23} + x_3 w_{33} + x_4 w_{43} + b_3.
\end{aligned}
$$

  • Neural network diagram
    The figure below uses a neural network diagram to depict the computation above. Like linear regression, softmax regression is a single-layer neural network. Since the computation of each output $o_1, o_2, o_3$ depends on all of the inputs $x_1, x_2, x_3, x_4$, the output layer of softmax regression is a fully connected layer.

[Figure: softmax regression is a single-layer neural network]

Since classification requires a discrete prediction as output, a simple approach is to treat the output value $o_i$ as the confidence that the predicted category is $i$, and to take the category with the largest output value as the prediction, i.e. $\arg\max_i o_i$. For example, if $o_1, o_2, o_3$ are $0.1, 10, 0.1$ respectively, then since $o_2$ is largest, the predicted category is 2, which represents a cat.

  • Problems with direct output
    Using the output layer's values directly has two problems:
    1. On the one hand, since the range of the output layer's values is uncertain, it is hard to judge intuitively what these values mean. For example, the output value of 10 in the example just given suggests we are "very confident" the image is of a cat, since that output is 100 times the other two. But if $o_1 = o_3 = 10^3$, an output value of 10 would instead indicate a very low probability that the image is of a cat.
    2. On the other hand, since the true labels are discrete values, the error between these discrete values and output values with an uncertain range is difficult to measure.

The softmax operator solves both of these problems. Via the formula below, it transforms the output values into positive values that sum to 1, i.e. a valid probability distribution:

$$\hat{y}_1, \hat{y}_2, \hat{y}_3 = \text{softmax}(o_1, o_2, o_3),$$

where

$$\hat{y}_1 = \frac{\exp(o_1)}{\sum_{i=1}^{3}\exp(o_i)}, \quad \hat{y}_2 = \frac{\exp(o_2)}{\sum_{i=1}^{3}\exp(o_i)}, \quad \hat{y}_3 = \frac{\exp(o_3)}{\sum_{i=1}^{3}\exp(o_i)}.$$

It is easy to see that $\hat{y}_1 + \hat{y}_2 + \hat{y}_3 = 1$ and $0 \le \hat{y}_1, \hat{y}_2, \hat{y}_3 \le 1$, so $(\hat{y}_1, \hat{y}_2, \hat{y}_3)$ is a valid probability distribution. Now, if $\hat{y}_2 = 0.8$, then regardless of the values of $\hat{y}_1$ and $\hat{y}_3$, we know the probability that the image is of a cat is 80%. In addition, we note that

$$\arg\max_i o_i = \arg\max_i \hat{y}_i,$$

so the softmax operation does not change the predicted category.
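As a sketch of how this operator behaves, here is a minimal PyTorch implementation (the naive exponentiation is for illustration only; a numerically stable version would subtract the row maximum before exponentiating):

```python
import torch

def softmax(X):
    # Exponentiate each output, then normalize each row so it sums to 1.
    X_exp = X.exp()
    partition = X_exp.sum(dim=1, keepdim=True)
    return X_exp / partition  # broadcasting divides each row by its row sum

O = torch.tensor([[0.1, 10.0, 0.1]])
print(softmax(O))                # ~[[5e-5, 0.9999, 5e-5]]: a probability distribution
print(softmax(O).argmax(dim=1))  # tensor([1]): the argmax is unchanged
```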

  • Computational efficiency
    • Vector expression for a single sample
      To improve computational efficiency, we can express the classification of a single sample with vector computations. In the image classification problem above, suppose the weight and bias parameters of softmax regression are

$$
W = \begin{bmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \\ w_{31} & w_{32} & w_{33} \\ w_{41} & w_{42} & w_{43} \end{bmatrix}, \quad
b = \begin{bmatrix} b_1 & b_2 & b_3 \end{bmatrix},
$$

Let the features of image sample $i$, whose height and width are both 2 pixels, be

$$x^{(i)} = \begin{bmatrix} x_1^{(i)} & x_2^{(i)} & x_3^{(i)} & x_4^{(i)} \end{bmatrix},$$

the output of the output layer be

$$o^{(i)} = \begin{bmatrix} o_1^{(i)} & o_2^{(i)} & o_3^{(i)} \end{bmatrix},$$

and the predicted probability distribution over dog, cat, and chicken be

$$\hat{y}^{(i)} = \begin{bmatrix} \hat{y}_1^{(i)} & \hat{y}_2^{(i)} & \hat{y}_3^{(i)} \end{bmatrix}.$$

The vector expression for classifying sample $i$ with softmax regression is then

$$
\begin{aligned}
o^{(i)} &= x^{(i)} W + b, \\
\hat{y}^{(i)} &= \text{softmax}\left(o^{(i)}\right).
\end{aligned}
$$

  • Vector expression for a mini-batch
    To further improve computational efficiency, we usually perform vector computations on mini-batches of data. Broadly speaking, given a mini-batch of samples with batch size $n$, number of inputs (features) $d$, and number of outputs (categories) $q$, let the batch features be $X \in \mathbb{R}^{n \times d}$. Suppose the weight and bias parameters of softmax regression are $W \in \mathbb{R}^{d \times q}$ and $b \in \mathbb{R}^{1 \times q}$. The vector expression for softmax regression is then

$$
\begin{aligned}
O &= XW + b, \\
\hat{Y} &= \text{softmax}(O),
\end{aligned}
$$

where the addition uses the broadcasting mechanism, $O, \hat{Y} \in \mathbb{R}^{n \times q}$, and row $i$ of these matrices is the output $o^{(i)}$ and the probability distribution $\hat{y}^{(i)}$ of sample $i$, respectively.
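A minimal sketch of this mini-batch computation $O = XW + b$, $\hat{Y} = \text{softmax}(O)$ in PyTorch, assuming flattened Fashion-MNIST images so that $d = 784$ and $q = 10$; the parameter names and initialization scale here are illustrative:

```python
import torch

num_inputs, num_outputs = 784, 10  # d = 28*28 flattened pixels, q = 10 classes

# Illustrative initialization; shapes follow W ∈ R^{d×q} and b ∈ R^{1×q}.
W = (0.01 * torch.randn(num_inputs, num_outputs)).requires_grad_(True)
b = torch.zeros(1, num_outputs, requires_grad=True)

def softmax(X):
    X_exp = X.exp()
    return X_exp / X_exp.sum(dim=1, keepdim=True)

def net(X):
    # Flatten each image into a row of X ∈ R^{n×d}; broadcasting adds b to every row.
    O = X.view(-1, num_inputs) @ W + b
    return softmax(O)
```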

Cross-entropy loss function

For a sample $i$, we construct a vector $y^{(i)} \in \mathbb{R}^q$ whose $y^{(i)}$-th element (where the unsubscripted $y^{(i)}$ is the discrete label of sample $i$) is 1 and whose remaining elements are 0. Our training goal can then be set as making the predicted probability distribution $\hat{y}^{(i)}$ as close as possible to the true label distribution $y^{(i)}$.

  • Squared loss

$$\text{Loss} = \frac{\left|\hat{y}^{(i)} - y^{(i)}\right|^2}{2}$$

However, to obtain correct classification results we do not actually need the predicted probabilities to exactly equal the label probabilities. In the image classification example, if $y^{(i)} = 3$, we only need $\hat{y}_3^{(i)}$ to be larger than the other two predicted values $\hat{y}_1^{(i)}$ and $\hat{y}_2^{(i)}$. Even if $\hat{y}_3^{(i)}$ is only 0.6, no matter what the other two predicted values are, the predicted category is correct. Squared loss, however, is too strict: for example, $\hat{y}_1^{(i)} = \hat{y}_2^{(i)} = 0.2$ gives a much smaller loss than $\hat{y}_1^{(i)} = 0, \hat{y}_2^{(i)} = 0.4$, even though both give the same, correct classification result.
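To make this concrete, here is the arithmetic for the two cases under the squared loss above (a worked example, taking the label as one-hot at category 3 and $\hat{y}_3^{(i)} = 0.6$ in both cases):

$$\frac{|(0.2, 0.2, 0.6) - (0, 0, 1)|^2}{2} = \frac{0.04 + 0.04 + 0.16}{2} = 0.12, \qquad \frac{|(0, 0.4, 0.6) - (0, 0, 1)|^2}{2} = \frac{0 + 0.16 + 0.16}{2} = 0.16.$$

Both predictions classify the sample correctly, yet they incur different squared losses.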

A way to improve on this is to use a measurement function better suited to measuring the difference between two probability distributions. Cross entropy is a commonly used measure:

$$H\left(y^{(i)}, \hat{y}^{(i)}\right) = -\sum_{j=1}^{q} y_j^{(i)} \log \hat{y}_j^{(i)},$$

where the subscripted $y_j^{(i)}$ denotes the elements of the vector $y^{(i)}$ (either 0 or 1); it should be distinguished from the unsubscripted $y^{(i)}$, the discrete label of sample $i$. In the formula above, since only the $y^{(i)}$-th element of the vector $y^{(i)}$ is 1 and the rest are 0, we have $H(y^{(i)}, \hat{y}^{(i)}) = -\log \hat{y}_{y^{(i)}}^{(i)}$. In other words, cross entropy cares only about the predicted probability of the correct category: as long as that value is large enough, the classification result is guaranteed to be correct. Of course, when a sample has multiple labels, for example when an image contains more than one object, we cannot make this simplification; but even then, cross entropy is concerned only with the predicted probabilities of the categories of objects that actually appear in the image.

Assuming the training dataset has $n$ samples, the cross-entropy loss function is defined as

$$\ell(\Theta) = \frac{1}{n} \sum_{i=1}^{n} H\left(y^{(i)}, \hat{y}^{(i)}\right),$$

where $\Theta$ denotes the model parameters. Likewise, if each sample has only one label, the cross-entropy loss can be abbreviated as $\ell(\Theta) = -\frac{1}{n}\sum_{i=1}^{n} \log \hat{y}_{y^{(i)}}^{(i)}$. From another perspective, minimizing $\ell(\Theta)$ is equivalent to maximizing $\exp(-n\ell(\Theta)) = \prod_{i=1}^{n} \hat{y}_{y^{(i)}}^{(i)}$, i.e., minimizing the cross-entropy loss is equivalent to maximizing the joint predicted probability of all label categories in the training dataset.
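A minimal sketch of this abbreviated loss in PyTorch, assuming y_hat holds one predicted distribution per row and y holds the integer class labels; gather picks out $\hat{y}_{y^{(i)}}^{(i)}$ for each sample:

```python
import torch

def cross_entropy(y_hat, y):
    # Select the predicted probability of each sample's true class, then average -log.
    return -torch.log(y_hat.gather(1, y.view(-1, 1))).mean()

y_hat = torch.tensor([[0.1, 0.3, 0.6], [0.3, 0.2, 0.5]])
y = torch.tensor([0, 2])        # true labels of the two samples
print(cross_entropy(y_hat, y))  # mean of -log(0.1) and -log(0.5)
```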

Model training and prediction

After training the softmax regression model, given any sample's features, we can predict the probability of each output category. Usually, we take the category with the highest predicted probability as the output category. If it matches the true category (the label), the prediction is correct. In the experiments of Section 3.6, we will use accuracy to evaluate model performance. It equals the number of correct predictions divided by the total number of predictions.
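A minimal accuracy helper along these lines (assuming, as above, one predicted distribution per row of y_hat and integer labels in y):

```python
import torch

def accuracy(y_hat, y):
    # A prediction counts as correct when the argmax category equals the label.
    return (y_hat.argmax(dim=1) == y).float().mean().item()

y_hat = torch.tensor([[0.1, 0.3, 0.6], [0.3, 0.2, 0.5]])
y = torch.tensor([2, 0])
print(accuracy(y_hat, y))  # 0.5: first prediction correct, second wrong
```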

Getting the Fashion-MNIST training set and reading the data

Before implementing softmax regression, let us first introduce a multi-class image classification dataset. It will be used repeatedly in later sections, so that we can observe differences between algorithms in model accuracy and computational efficiency. The most commonly used image classification dataset is the handwritten digit recognition dataset MNIST [1], but most models achieve a classification accuracy above 95% on it. To make the differences between algorithms easier to observe, we will instead use Fashion-MNIST [2], a dataset with more complex image content.

Here we will use the torchvision package, which serves the PyTorch deep learning framework and is mainly used to build computer vision models. torchvision consists mainly of the following parts (a loading sketch follows the list):

  1. torchvision.datasets: data-loading functions and interfaces to common datasets;
  2. torchvision.models: common model architectures (including pre-trained models), e.g. AlexNet, VGG, ResNet;
  3. torchvision.transforms: common image transformations, such as cropping and rotation;
  4. torchvision.utils: other useful methods.
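A minimal sketch of obtaining and reading Fashion-MNIST via torchvision.datasets; the root directory './data' and the batch size are illustrative choices:

```python
import torch
import torchvision
import torchvision.transforms as transforms

# Download Fashion-MNIST; ToTensor converts each PIL image to a float
# tensor of shape (1, 28, 28) with values scaled to [0, 1].
mnist_train = torchvision.datasets.FashionMNIST(
    root='./data', train=True, download=True, transform=transforms.ToTensor())
mnist_test = torchvision.datasets.FashionMNIST(
    root='./data', train=False, download=True, transform=transforms.ToTensor())
print(len(mnist_train), len(mnist_test))  # 60000 10000

# Wrap the datasets in DataLoaders to iterate over shuffled mini-batches.
batch_size = 256
train_iter = torch.utils.data.DataLoader(mnist_train, batch_size=batch_size, shuffle=True)
test_iter = torch.utils.data.DataLoader(mnist_test, batch_size=batch_size, shuffle=False)
```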