Hands-on deep learning notes (4) - classification problem (softmax regression)

Classification problems are usually divided into two categories: hard classification and soft classification;

  • Hard classification: only interested in the "hard" category of the sample, i.e. which category it belongs to;
  • Soft classification: that is, to get the probability of belonging to each category;

The line between the two is often blurred, because even though we only care about hard categories, we still use soft-category models.

1.1.1 Classification problems

Common classification problems:

  • Is an email in the spam folder?

  • Will a user sign up for a subscription service or not?

  • Does an image depict a donkey, a dog, a cat, or a chicken?

  • Which movie is someone most likely to watch next?

We start with an image classification problem. Suppose each input is a 2x2 grayscale image. We can represent each pixel value with a scalar, and each image corresponds to four features x1,x2,x3,x4. Also, assume that each image belongs to one of the categories "cat", "chicken" and "dog".
Next, we have to choose how to represent the labels. We have two obvious choices: the most straightforward idea is to choose y∈{1,2,3}, where the integers represent {dog, cat, chicken} respectively. This is an efficient way to store this type of information on your computer. If there is some natural ordering between categories, say we are trying to predict {infants, children, teens, young adults, middle-aged, elderly}, then it makes sense to turn this problem into a regression problem, and keep this format.
But general classification problems are not about natural ordering between categories. Fortunately, statisticians have long since invented a simple way to represent categorical data:
one-hot encoding. A one-hot encoding is a vector with as many components as there are categories: the component corresponding to the sample's category is set to 1, and all other components are set to 0. In our case, the label y is a three-dimensional vector, where (1, 0, 0) corresponds to "cat", (0, 1, 0) to "chicken", and (0, 0, 1) to "dog": y ∈ {(1, 0, 0), (0, 1, 0), (0, 0, 1)}.
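
As a quick sanity check, here is a minimal sketch of one-hot labels in code (assuming PyTorch; the cat/chicken/dog-to-index mapping is only for illustration):

```python
import torch
import torch.nn.functional as F

# Map each category to an integer index: cat -> 0, chicken -> 1, dog -> 2
labels = torch.tensor([0, 2, 1])            # three example labels
one_hot = F.one_hot(labels, num_classes=3)  # each row has a single 1 at the label's position
print(one_hot)
# tensor([[1, 0, 0],
#         [0, 0, 1],
#         [0, 1, 0]])
```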

1.1.2 Network Architecture

To estimate the conditional probabilities of all possible classes, we need a model with multiple outputs, one per class. To solve a classification problem with a linear model, we need as many affine functions as there are outputs; each output corresponds to its own affine function.
In our case, since we have 4 features and 3 possible output classes, we will need 12 scalars for weights (subscripted w) and 3 scalars for bias (subscripted b).
Below we compute three unnormalized predictions, called logits, for each input: o1, o2 and o3:

o1 = x1 w11 + x2 w12 + x3 w13 + x4 w14 + b1
o2 = x1 w21 + x2 w22 + x3 w23 + x4 w24 + b2
o3 = x1 w31 + x2 w32 + x3 w33 + x4 w34 + b3
This computational process can be described with a neural network diagram. Like linear regression, softmax regression is also a single-layer neural network. Since computing each output o1, o2 and o3 depends on all the inputs x1, x2, x3 and x4, the output layer of softmax regression is also a fully connected layer.
(Figure: softmax regression as a single-layer neural network; every output o1, o2, o3 is connected to every input x1, x2, x3, x4.)
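
The same computation in code (a minimal sketch; the pixel values, weights, and biases below are arbitrary placeholders, not trained parameters):

```python
import torch

x = torch.tensor([0.1, 0.2, 0.3, 0.4])  # one 2x2 grayscale image, flattened into 4 features
W = torch.randn(3, 4) * 0.01            # one row of weights per output class
b = torch.zeros(3)                      # one bias per output class

o = W @ x + b                           # o1, o2, o3: unnormalized predictions (logits)
print(o.shape)                          # torch.Size([3])
```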

1.1.3 softmax operation

The softmax function transforms unnormalized predictions to be non-negative and sum to 1, while requiring the model to remain differentiable.
We first exponentiate each unnormalized prediction, which ensures that the output is non-negative. To ensure that the final output sums to 1, we divide each exponentiated result by their sum.
ŷ = softmax(o),  where  ŷj = exp(oj) / Σk exp(ok)
Although softmax is a nonlinear function, the outputs of softmax regression are still determined by an affine transformation of the input features. Therefore, softmax regression is a linear model.
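
A minimal implementation of the operation defined above (a sketch assuming PyTorch; production code usually subtracts the largest logit before exponentiating to avoid overflow, which is omitted here):

```python
import torch

def softmax(O):
    # exponentiate every logit, then normalize each row so it sums to 1
    O_exp = torch.exp(O)
    partition = O_exp.sum(dim=1, keepdim=True)
    return O_exp / partition

O = torch.tensor([[1.0, 2.0, 3.0],
                  [1.0, 1.0, 1.0]])
Y_hat = softmax(O)
print(Y_hat.sum(dim=1))  # tensor([1., 1.]) -- each row is a valid probability distribution
```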

1.1.4 Vectorization of Mini-batch Samples

To improve computational efficiency and make full use of the GPU, we typically perform vector computations on small batches of data.
Suppose we read a mini-batch of samples X, where the feature dimension (number of inputs) is d and the batch size is n, and suppose there are q categories in the output.
Then the mini-batch features are X ∈ R^{n×d}, the weights are W ∈ R^{d×q}, and the bias is b ∈ R^{1×q}. The vectorized expression for softmax regression is:
O = XW + b
Ŷ = softmax(O)
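
In code, the forward pass for a whole mini-batch is a single matrix product followed by a row-wise softmax (a sketch using PyTorch's built-in `torch.softmax`; the sizes n=2, d=4, q=3 are arbitrary):

```python
import torch

n, d, q = 2, 4, 3
X = torch.randn(n, d)            # mini-batch of features, X in R^{n x d}
W = torch.randn(d, q) * 0.01     # weights, W in R^{d x q}
b = torch.zeros(1, q)            # bias, b in R^{1 x q}, broadcast over the batch

O = X @ W + b                    # logits for every sample: O in R^{n x q}
Y_hat = torch.softmax(O, dim=1)  # row-wise softmax; each row sums to 1
```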

1.1.5 Loss function

Use maximum likelihood estimation, as in linear regression.

1.1.5.1 Log-likelihood

The softmax function gives us a vector ŷ, which we can interpret as "the conditional probability of each class given any input x".
For example, ŷ1 = P(y = cat | x). Suppose the entire dataset {X, Y} has n samples. We can compare the estimates with the actual labels:
P(Y | X) = ∏_{i=1..n} P(y^(i) | x^(i))
According to maximum likelihood estimation, we maximize P(Y | X), which is equivalent to minimizing the negative log-likelihood:
−log P(Y | X) = Σ_{i=1..n} −log P(y^(i) | x^(i)) = Σ_{i=1..n} l(y^(i), ŷ^(i))
where, for any pair of label y and model prediction ŷ, the loss function is:
l(y, ŷ) = − Σ_{j=1..q} yj log ŷj
This loss function is often called the cross-entropy loss.
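
With one-hot labels, only the term for the true class survives the sum, so the loss reduces to the negative log of the probability predicted for the true class (a minimal sketch; the `cross_entropy` helper and the example numbers are for illustration only):

```python
import torch

def cross_entropy(y_hat, y):
    # y_hat: predicted probabilities, shape (n, q); y: integer class labels, shape (n,)
    return -torch.log(y_hat[range(len(y_hat)), y])

y_hat = torch.tensor([[0.1, 0.3, 0.6],
                      [0.3, 0.2, 0.5]])
y = torch.tensor([0, 2])
print(cross_entropy(y_hat, y))  # tensor([2.3026, 0.6931])
```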

1.1.5.2 softmax and its derivatives

Substituting the definition of ŷ from Section 1.1.3 into the loss function, we get:

l(y, ŷ) = − Σ_{j=1..q} yj log( exp(oj) / Σ_{k=1..q} exp(ok) ) = log Σ_{k=1..q} exp(ok) − Σ_{j=1..q} yj oj

Taking the derivative with respect to any logit oj, we get:

∂l/∂oj = exp(oj) / Σ_{k=1..q} exp(ok) − yj = softmax(o)j − yj

In other words, the gradient is the difference between the probability the model assigns to class j (the softmax output) and the true label yj.
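
This gradient is easy to check numerically with automatic differentiation (a small sketch assuming PyTorch; the logits and label below are arbitrary):

```python
import torch

o = torch.tensor([1.0, 2.0, 0.5], requires_grad=True)
y = torch.tensor([0.0, 1.0, 0.0])  # one-hot label for class 2

# loss written directly in the substituted form: log(sum_k exp(o_k)) - sum_j y_j * o_j
loss = torch.logsumexp(o, dim=0) - (y * o).sum()
loss.backward()

print(o.grad)                        # gradient computed by autograd
print(torch.softmax(o, dim=0) - y)   # softmax(o) - y: the two should match
```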

1.1.5.3 Cross-entropy loss

For the label y, we can use the same representation as before. The only difference is that we now allow it to be a probability vector such as (0.1, 0.2, 0.7), rather than a vector containing only binary entries such as (0, 0, 1).
l(y, ŷ) = − Σ_{j=1..q} yj log ŷj
The loss l defined above is the expected value of the loss over the distribution of labels. This loss is called the cross-entropy loss, and it is one of the most commonly used losses for classification problems.

1.1.6 Fundamentals of Information Theory

Information theory deals with encoding, decoding, transmitting, and processing information (data) as concisely as possible.

1.1.6.1 Entropy

The central idea of information theory is to quantify the amount of information contained in data. In information theory, this quantity is called the entropy of a distribution P. It is given by the following equation:
H(P) = Σj − P(j) log P(j)

The essence of entropy is the "internal chaos" of a system.
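
A quick numeric illustration of the entropy formula (a sketch; natural logarithms are used, so entropy is measured in nats):

```python
import torch

def entropy(P):
    # H(P) = sum_j -P(j) * log P(j)
    return -(P * torch.log(P)).sum()

uniform = torch.tensor([1/3, 1/3, 1/3])
peaked  = torch.tensor([0.98, 0.01, 0.01])
print(entropy(uniform))  # ~1.0986: maximal uncertainty over 3 outcomes
print(entropy(peaked))   # ~0.1120: a nearly deterministic distribution has low entropy
```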

1.1.6.2 Surprise

If we cannot perfectly predict every event, we may sometimes be "surprised". Claude Shannon chose log(1/P(j)) = −log P(j) to quantify the surprisal of observing an event j to which we had assigned a (subjective) probability P(j). The lower the probability we assign to an event, the greater our surprise when it occurs.
The entropy defined above is the expected surprisal when the assigned probabilities truly match the data-generating process.
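
For example (a tiny sketch using the Python standard library):

```python
import math

# surprisal of an event with subjective probability P(j): -log P(j)
for p in (0.5, 0.1, 0.01):
    print(p, -math.log(p))  # the less likely the event, the greater the surprise
```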

1.1.6.3 Revisiting cross-entropy

The cross-entropy from P to Q is denoted H(P, Q) = Σj − P(j) log Q(j). You can think of cross-entropy as "the expected surprisal of an observer with subjective probabilities Q upon seeing data that was actually generated according to probabilities P". When P = Q, the cross-entropy attains its lowest value.
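
A small numeric illustration (a sketch; the distributions P and Q are arbitrary examples):

```python
import torch

def cross_entropy_pq(P, Q):
    # H(P, Q) = sum_j -P(j) * log Q(j)
    return -(P * torch.log(Q)).sum()

P = torch.tensor([0.7, 0.2, 0.1])
print(cross_entropy_pq(P, P))                               # H(P, P) = H(P), the minimum
print(cross_entropy_pq(P, torch.tensor([0.1, 0.2, 0.7])))   # a mismatched Q gives a larger value
```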
In short, we can consider the cross-entropy classification objective in two ways:

  • (i) maximize the likelihood of the observed data;
  • (ii) minimize the surprisal required to convey the labels.

1.1.7 Model prediction and evaluation

After training a softmax regression model, given any sample features, we can predict the probability of each output class. Usually we take the class with the highest predicted probability as the output class. The prediction is correct if it agrees with the actual class (label). In the following experiments, we will use accuracy to evaluate the model's performance. Accuracy equals the ratio of the number of correct predictions to the total number of predictions.
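
A minimal accuracy computation might look like this (a sketch; the hypothetical `accuracy` helper takes predicted probabilities and integer labels):

```python
import torch

def accuracy(y_hat, y):
    # predicted class = index of the largest predicted probability in each row
    preds = y_hat.argmax(dim=1)
    return (preds == y).float().mean().item()

y_hat = torch.tensor([[0.1, 0.3, 0.6],
                      [0.3, 0.2, 0.5]])
y = torch.tensor([2, 1])
print(accuracy(y_hat, y))  # 0.5: one of the two predictions matches its label
```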

Summary

  • The softmax operation takes a vector and maps it to probabilities.
  • Softmax regression is suitable for classification problems, it uses the probability distribution of the output classes in the softmax operation.
  • Cross-entropy is a good measure of the difference between two probability distributions.

Source: blog.csdn.net/qq_52118067/article/details/122729275