3.4. Softmax Regression (PyTorch)

In Section 3.1, we introduced linear regression, implemented it from scratch in Section 3.2, and then used the high-level APIs of a deep learning framework in Section 3.3 to do the heavy lifting.

Regression is the tool we reach for when we want to answer how much? or how many? questions. If you want to predict the dollar amount (price) at which a house will sell, or the number of wins a baseball team might have, or the number of days a patient will stay in the hospital before being discharged, then you are probably looking for a regression model.

In practice, though, we are more often interested in classification: asking not "how much" but "which one":
* Does this email belong in the spam folder or the inbox?
* Is this customer more likely to sign up or not to sign up for a subscription service?
* Does this image depict a donkey, a dog, a cat, or a rooster?
* Which movie is Aston most likely to watch next?

In layman's terms, machine learning practitioners overload the word classification to describe two subtly different problems:

  • (i) those where we are interested only in hard assignments of examples to categories (classes);
  • (ii) those where we wish to make soft assignments, i.e., to assess the probability that each category applies.

The distinction tends to blur, in part because often, even when we only care about hard assignments, we still use models that make soft assignments.

3.4.1. Classification problem

To get our feet wet, let's start with a simple image classification problem. Here, each input consists of a 2 x 2 grayscale image. We can represent each pixel value with a single scalar, giving us four features x1, x2, x3, x4. Also, let's assume that each image belongs to one of the categories "cat", "chicken" and "dog".

Next, we have to choose how to represent the labels. We have two obvious options. Perhaps the most natural impulse is to choose y ∈ {1, 2, 3}, where the integers represent {dog, cat, chicken} respectively. This is a great way of storing such information on a computer. If the categories had some natural ordering among them, say if we were trying to predict {baby, toddler, adolescent, young adult, adult, geriatric}, then it might even make sense to cast this problem as regression and keep the labels in this format.

But general classification problems do not come with natural orderings among the classes. Fortunately, statisticians long ago invented a simple way to represent categorical data: the one-hot encoding. A one-hot encoding is a vector with as many components as we have categories. The component corresponding to a particular instance's category is set to 1, and all other components are set to 0. In our case, the label y is a three-dimensional vector, with (1, 0, 0) corresponding to "cat", (0, 1, 0) to "chicken", and (0, 0, 1) to "dog":
$$\mathbf{y} \in \{(1, 0, 0),\ (0, 1, 0),\ (0, 0, 1)\}.$$
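For example, a minimal PyTorch sketch of such an encoding (assuming the arbitrary ordering 0 = "cat", 1 = "chicken", 2 = "dog") might look like this:

```python
import torch
import torch.nn.functional as F

# Integer labels for three examples: 0 = "cat", 1 = "chicken", 2 = "dog"
labels = torch.tensor([0, 2, 1])

# One-hot encoding: each label becomes a length-3 vector with a single 1
one_hot = F.one_hot(labels, num_classes=3)
print(one_hot)
# tensor([[1, 0, 0],
#         [0, 0, 1],
#         [0, 1, 0]])
```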

3.4.2. Network Architecture

To estimate the conditional probabilities associated with all possible classes, we need a model with multiple outputs, one per class. To address classification with linear models, we need as many affine functions as we have outputs. Each output will correspond to its own affine function. In our case, since we have 4 features and 3 possible output classes, we will need 12 scalars to represent the weights (w with subscripts) and 3 scalars to represent the biases (b with subscripts). We compute these three logits, o1, o2, and o3, for each input:
$$
\begin{aligned}
o_1 &= x_1 w_{11} + x_2 w_{12} + x_3 w_{13} + x_4 w_{14} + b_1,\\
o_2 &= x_1 w_{21} + x_2 w_{22} + x_3 w_{23} + x_4 w_{24} + b_2,\\
o_3 &= x_1 w_{31} + x_2 w_{32} + x_3 w_{33} + x_4 w_{34} + b_3.
\end{aligned}
$$
We can describe this computation with the neural network diagram shown in Fig. 3.4.1. Just like in linear regression, softmax regression is also a single-layer neural network. And since the computation of each output, o1, o2 and o3, depends on all inputs, x1, x2, x3, and x4, the output layer of softmax regression can also be described as a fully connected layer.

Fig. 3.4.1: Softmax regression is a single-layer neural network.

To express the model more compactly, we can use linear algebra notation. In vector form, we arrive at o = Wx + b, a form better suited both for mathematics and for writing code. Note that we have gathered all of our weights into a 3 × 4 matrix W. For the features x of a given data example, our output is given by the matrix-vector product of our weights and our input features, plus our bias b.
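For a single example, this computation might be sketched in PyTorch as follows (the weight, bias, and feature values below are arbitrary placeholders):

```python
import torch

torch.manual_seed(0)

# 4 features, 3 classes: W is 3 x 4, b has length 3 (random placeholder values)
W = torch.randn(3, 4)
b = torch.randn(3)

x = torch.tensor([0.2, 0.5, 0.1, 0.7])  # one example with features x1..x4

o = W @ x + b  # logits o1, o2, o3 via a matrix-vector product plus bias
print(o.shape)  # torch.Size([3])
```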

3.4.3. Parameterization cost of fully connected layers

As we will see in later chapters, fully connected layers are ubiquitous in deep learning, but they do not come cheap. For any fully connected layer with d inputs and q outputs, the parameterization cost is O(dq), which can be prohibitively expensive in practice when d and q are large. Fortunately, this cost of transforming d inputs into q outputs can be reduced to O(dq/n), where the hyperparameter n can be chosen to balance parameter savings against model effectiveness in real-world applications.
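A quick way to see this cost is to count the parameters of a fully connected layer directly; the dimensions below are arbitrary illustrative values:

```python
import torch.nn as nn

d, q = 1024, 1000          # hypothetical numbers of inputs and outputs
layer = nn.Linear(d, q)

num_params = sum(p.numel() for p in layer.parameters())
print(num_params)          # 1025000
print(d * q + q)           # weights (d*q) plus biases (q) -> the same number
```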

3.4.4 Softmax operation

The main approach we will take here is to interpret the output of the model as probabilities. We will optimize our parameters to produce probabilities that maximize the likelihood of observing the data. Then, to generate predictions, we will set a threshold, for example, to choose the label with the largest predicted probability.
Put formally, we would like each output $\hat{y}_j$ to be interpreted as the probability that a given item belongs to class $j$; we can then choose the class with the largest output value as our prediction, $\operatorname*{argmax}_j \hat{y}_j$.
You might be tempted to interpret the logits o directly as probabilities. However, there are some problems with treating the output of a linear layer as a probability. On the one hand, nothing constrains these numbers to sum to 1. On the other hand, depending on the inputs, they can take negative values. These violate basic axioms of probability presented in Section 2.6.
To interpret our outputs as probabilities, we must guarantee that (even on new data) they will be non-negative and sum to 1. Moreover, we need a training objective that encourages the model to estimate probabilities faithfully. Of all the instances where a classifier outputs 0.5, we hope that half of those examples will actually belong to the predicted class. This property is called calibration.

The softmax function, invented in 1959 by the social scientist R. Duncan Luce in the context of choice models, does precisely this. To transform our logits such that they become non-negative and sum to 1, while requiring that the model remains differentiable, we first exponentiate each logit (ensuring non-negativity) and then divide by their sum (ensuring that they sum to 1):
$$\hat{\mathbf{y}} = \mathrm{softmax}(\mathbf{o}) \quad \text{where} \quad \hat{y}_j = \frac{\exp(o_j)}{\sum_k \exp(o_k)}.$$
Although softmax is a nonlinear function, the output of softmax regression is still determined by an affine transformation of the input features; thus, softmax regression is a linear model.
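A minimal from-scratch sketch of the softmax operation (skipping the usual max-subtraction trick for numerical stability) might look as follows; PyTorch's built-in torch.softmax should give the same result:

```python
import torch

def softmax(o):
    # Exponentiate each logit, then normalize so the outputs sum to 1.
    # (A numerically robust version would first subtract o.max(); omitted for clarity.)
    exp_o = torch.exp(o)
    return exp_o / exp_o.sum()

o = torch.tensor([1.0, -2.0, 0.5])
y_hat = softmax(o)
print(y_hat, y_hat.sum())          # non-negative values summing to 1
print(torch.softmax(o, dim=0))     # PyTorch's built-in gives the same result
```

Note also that exponentiation and normalization preserve the ordering of the logits, so the most likely class according to softmax is simply the class with the largest logit.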

3.4.5 Mini-batch vectorization

To improve computational efficiency and take advantage of GPUs, we typically carry out vector calculations for minibatches of data. Assume that we are given a minibatch $\mathbf{X}$ of $n$ examples with feature dimensionality (number of inputs) $d$, and that we have $q$ categories in the output. Then the minibatch features $\mathbf{X}$ are in $\mathbb{R}^{n \times d}$, the weights $\mathbf{W} \in \mathbb{R}^{d \times q}$, and the bias $\mathbf{b} \in \mathbb{R}^{1 \times q}$, so that

$$\mathbf{O} = \mathbf{X}\mathbf{W} + \mathbf{b}, \qquad \hat{\mathbf{Y}} = \mathrm{softmax}(\mathbf{O}),$$

where the softmax is applied to each row of $\mathbf{O}$. The dominant operation is now the matrix-matrix product $\mathbf{X}\mathbf{W}$ rather than many separate matrix-vector products, which is what makes processing a minibatch at a time efficient.
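A small sketch of this minibatch form, with arbitrary batch size and dimensions, might look like this:

```python
import torch

n, d, q = 5, 4, 3                  # batch size, features, classes (arbitrary sizes)
X = torch.randn(n, d)              # minibatch of features, one example per row
W = torch.randn(d, q)              # note the d x q layout used for the batched form
b = torch.randn(q)

O = X @ W + b                      # logits for the whole minibatch, shape (n, q)
Y_hat = torch.softmax(O, dim=1)    # softmax applied row-wise
print(Y_hat.shape, Y_hat.sum(dim=1))  # each row sums to 1
```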

3.4.6 Loss function

Next, we need a loss function to measure the quality of our predicted probabilities. We will rely on maximum likelihood estimation, the very same concept that we encountered when providing a probabilistic justification for the mean squared error objective in linear regression (Section 3.1.3).

3.4.6.1 Likelihood function

The softmax function gives us a vector $\hat{\mathbf{y}}$, which we can interpret as the estimated conditional probabilities of each class given an input $\mathbf{x}$. We can compare the estimates with reality by checking how probable the actual classes are according to our model, given the features:

$$P(\mathbf{Y} \mid \mathbf{X}) = \prod_{i=1}^{n} P(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}).$$

According to maximum likelihood estimation, we maximize $P(\mathbf{Y} \mid \mathbf{X})$, which is equivalent to minimizing the negative log-likelihood

$$-\log P(\mathbf{Y} \mid \mathbf{X}) = \sum_{i=1}^{n} -\log P(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}) = \sum_{i=1}^{n} l(\mathbf{y}^{(i)}, \hat{\mathbf{y}}^{(i)}),$$

where, for any pair of label $\mathbf{y}$ and model prediction $\hat{\mathbf{y}}$ over $q$ classes, the loss function $l$ is

$$l(\mathbf{y}, \hat{\mathbf{y}}) = -\sum_{j=1}^{q} y_j \log \hat{y}_j. \tag{3.4.8}$$
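A minimal sketch of this loss, assuming the predictions are already valid probabilities and the labels are one-hot vectors (all values below are made up for illustration):

```python
import torch

def cross_entropy(y_hat, y):
    # l(y, y_hat) = -sum_j y_j * log(y_hat_j), averaged over the minibatch
    return -(y * torch.log(y_hat)).sum(dim=1).mean()

y_hat = torch.tensor([[0.7, 0.2, 0.1],
                      [0.1, 0.8, 0.1]])   # predicted probabilities
y = torch.tensor([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])       # one-hot labels

# Only the log-probability assigned to the true class contributes to each term.
print(cross_entropy(y_hat, y))
```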

3.4.6.2 Softmax and Derivatives

Since the softmax and the corresponding loss are so common, it is worth understanding a bit better how they are computed. Plugging (3.4.3) into the definition of the loss in (3.4.8) and using the definition of the softmax, we obtain:

$$
\begin{aligned}
l(\mathbf{y}, \hat{\mathbf{y}}) &= -\sum_{j} y_j \log \frac{\exp(o_j)}{\sum_{k} \exp(o_k)} \\
&= \sum_{j} y_j \log \sum_{k} \exp(o_k) - \sum_{j} y_j o_j \\
&= \log \sum_{k} \exp(o_k) - \sum_{j} y_j o_j.
\end{aligned}
$$

To understand what is going on a bit better, consider the derivative with respect to any logit $o_j$:

$$\partial_{o_j} l(\mathbf{y}, \hat{\mathbf{y}}) = \frac{\exp(o_j)}{\sum_{k} \exp(o_k)} - y_j = \mathrm{softmax}(\mathbf{o})_j - y_j.$$
In other words, the derivative is the difference between the probability assigned by our model (as expressed by the softmax operation) and what actually happened (as expressed by the elements of the one-hot label vector). In this sense, it is very similar to what we saw in regression, where the gradient was the difference between the observation y and the estimate ŷ. This is no coincidence. In any exponential family model (see the online appendix on distributions), the gradients of the log-likelihood are given by precisely this term. This fact makes computing gradients easy in practice.
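As a quick check, autograd should reproduce the analytical gradient softmax(o) − y on an arbitrary logit vector; a small sketch:

```python
import torch

o = torch.tensor([1.0, -0.5, 2.0], requires_grad=True)   # logits
y = torch.tensor([0.0, 0.0, 1.0])                        # one-hot label

y_hat = torch.softmax(o, dim=0)
loss = -(y * torch.log(y_hat)).sum()
loss.backward()

print(o.grad)                    # gradient computed by autograd
print((y_hat - y).detach())      # softmax(o) - y: matches the analytical result above
```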

3.4.6.3 Cross-entropy loss

Now consider the case where we observe not just a single outcome but an entire distribution over outcomes. We can use the same representation as before for the label y. The only difference is that, rather than a vector containing only binary entries, say (0, 0, 1), we now have a generic probability vector, say (0.1, 0.2, 0.7). The loss l that we defined before in (3.4.8) still works fine, just that the interpretation is slightly more general: it is the expected value of the loss for a distribution over labels. This loss is called the cross-entropy loss, and it is one of the most commonly used losses for classification problems. We can demystify the name by introducing just the basics of information theory. If you want more details, you can further refer to the online appendix on information theory.
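As a small illustration, the same formula from (3.4.8) applies unchanged when the label is a full distribution such as (0.1, 0.2, 0.7) rather than a one-hot vector (the predicted probabilities below are made up):

```python
import torch

y_hat = torch.tensor([0.2, 0.3, 0.5])    # model's predicted probabilities
y_soft = torch.tensor([0.1, 0.2, 0.7])   # a full label distribution instead of a one-hot vector

# The same formula as (3.4.8); it is now the expected loss under the label distribution.
loss = -(y_soft * torch.log(y_hat)).sum()
print(loss)
```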

3.4.7 Fundamentals of Information Theory

Information theory deals with the problem of encoding, decoding, transmitting, and manipulating information (also known as data) in the most concise form possible.

3.4.7.1 Entropy

The central idea of information theory is to quantify the information content in data. This quantity places a hard limit on our ability to compress the data. In information theory, this quantity is called the entropy of a distribution P, and it is captured by the following equation:
$$H[P] = \sum_{j} -P(j) \log P(j).$$
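As a rough illustration (with made-up distributions), a nearly deterministic distribution has much lower entropy than a uniform one, matching the intuition about compressibility:

```python
import torch

def entropy(P):
    # H[P] = sum_j -P(j) * log P(j)
    return -(P * torch.log(P)).sum()

uniform = torch.tensor([0.25, 0.25, 0.25, 0.25])
peaked = torch.tensor([0.97, 0.01, 0.01, 0.01])
print(entropy(uniform))   # ~1.39 nats: hard to predict, hard to compress
print(entropy(peaked))    # ~0.17 nats: predictable, hence compressible
```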

3.4.7.2 Surprisal

You might be wondering what compression has to do with prediction. Imagine we have a stream of data to compress. If we always predict the next token easily, then this data is easy to compress! As an extreme example, every token in the stream always takes the same value. That's a pretty boring stream of data! And not only is it boring, it's also very predictable. Because they are always the same, we don't have to transmit any information to convey the content of the stream. Easy to predict and easy to compress.

However, if we cannot perfectly predict every event, then we might sometimes be surprised. Our surprise is greater when we assigned an event a lower probability. Claude Shannon settled on $\log \frac{1}{P(j)} = -\log P(j)$ to quantify one's surprisal at observing an event $j$ after having assigned it a (subjective) probability $P(j)$. The entropy defined above is then the expected surprisal when one has assigned the correct probabilities that truly match the data-generating process.

3.4.7.3 Cross-entropy revisited

So if entropy is the level of surprise experienced by someone who knows the true probabilities, then you might be wondering: what is cross-entropy? The cross-entropy from P to Q, denoted $H(P, Q) = \sum_{j} -P(j) \log Q(j)$, is the expected surprisal of an observer with subjective probabilities Q upon seeing data that were actually generated according to probabilities P. The lowest possible cross-entropy is achieved when P = Q, in which case the cross-entropy from P to Q is $H(P, P) = H(P)$.
In short, we can consider the cross-entropy classification objective in two ways:

  • (i) maximizing the likelihood of the observed data;
  • (ii) minimizing our surprisal (and thus the number of bits) required to communicate the labels.
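A small sketch of the cross-entropy between two (made-up) distributions, illustrating that H(P, Q) is minimized when Q = P:

```python
import torch

def cross_entropy_dist(P, Q):
    # H(P, Q) = sum_j -P(j) * log Q(j): expected surprisal of an observer
    # with beliefs Q when the data actually follow P.
    return -(P * torch.log(Q)).sum()

P = torch.tensor([0.1, 0.2, 0.7])   # true label distribution (arbitrary example values)
Q = torch.tensor([0.3, 0.3, 0.4])   # model's believed distribution

print(cross_entropy_dist(P, Q))   # larger than...
print(cross_entropy_dist(P, P))   # ...H(P, P) = H(P), the minimum, attained when Q = P
```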

3.4.8 Model prediction and evaluation

After training the softmax regression model, given any example features, we can predict the probability of each output class. Normally, we use the class with the highest predicted probability as the output class. The prediction is correct if it is consistent with the actual class (label). In the upcoming experiments, we will use accuracy to evaluate the model's performance. This is equal to the ratio between the number of correct predictions and the total number of predictions.
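A minimal sketch of this accuracy computation on made-up predictions:

```python
import torch

# Predicted probabilities for 4 examples over 3 classes (made-up numbers)
y_hat = torch.tensor([[0.1, 0.8, 0.1],
                      [0.6, 0.3, 0.1],
                      [0.2, 0.2, 0.6],
                      [0.3, 0.4, 0.3]])
y = torch.tensor([1, 0, 2, 0])                 # actual class labels

predictions = y_hat.argmax(dim=1)              # pick the class with the highest probability
accuracy = (predictions == y).float().mean()   # correct predictions / total predictions
print(accuracy)                                # 0.75 in this example
```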

3.4.9. Summary

  • The softmax operation takes a vector and maps it to probabilities.

  • Softmax regression is suitable for classification problems. It uses the softmax operation to output a probability distribution over the output classes.

  • Cross-entropy is a good measure of the difference between two probability distributions. It measures the number of bits needed to encode the data given our model.


References

https://d2l.ai/chapter_linear-networks/softmax-regression.html

Origin blog.csdn.net/zgpeace/article/details/123787046