Foreword
Softmax regression is in fact the general form of logistic regression: logistic regression is used for binary classification, while softmax regression is used for multi-class classification.
Softmax regression is a single-layer neural network with multiple outputs.
As a neural network: every output neuron is linearly connected to all input neurons, and there is no hidden layer.
For any output neuron $y_j$: $y_j = \sum_i w_{i,j}x_i + b_j$
Furthermore, the softmax regression model can be written as:
$\pmb{y} = \pmb{W}\pmb{x} + \pmb{b}$
Suppose $\pmb{x}\in\mathbb{R}^{n\times 1}$ and $\pmb{y}\in\mathbb{R}^{m\times 1}$.
Then $\pmb{W}\in\mathbb{R}^{m\times n}$ and $\pmb{b}\in\mathbb{R}^{m\times 1}$.
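These shapes can be sanity-checked with a minimal NumPy sketch (the concrete sizes $n = 4$ and $m = 3$ are arbitrary assumptions for illustration):

```python
import numpy as np

# Hypothetical sizes: n = 4 input features, m = 3 outputs.
n, m = 4, 3
rng = np.random.default_rng(0)

W = rng.normal(size=(m, n))   # W in R^{m x n}
x = rng.normal(size=(n, 1))   # x in R^{n x 1}
b = rng.normal(size=(m, 1))   # b in R^{m x 1}

y = W @ x + b                 # y in R^{m x 1}
print(y.shape)                # (3, 1)
```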
The output of softmax regression is not an arbitrary real value but a probability between 0 and 1.
This is the essential difference between softmax regression and linear regression: the outputs have different meanings.
Take iris as an example: $y_1, y_2, y_3$ represent the probabilities that $\pmb{x}$ belongs to Iris setosa, Iris versicolor, or Iris virginica, respectively.
Since they are probabilities, the values of $y_1, y_2, y_3$ must satisfy:
1. each should lie within the range $[0, 1]$;
2. their sum should be 1.
Treating the problem with linear regression cannot guarantee that the values of $y_1, y_2, y_3$ satisfy the above conditions.
The outputs therefore need to be processed with a softmax.
$y_i = \frac{e^{y_i}}{\sum_{j=1}^{3} e^{y_j}}$
The above formula is the softmax function .
Obviously $y_i \in (0, 1)$: $y_i$ can never be exactly 0 (or exactly 1), but it can be very close to 0.
Example 1: given the array $\{-0.5, 0, 10\}$, compute its softmax output.
$e^{-0.5}=0.6065,\quad e^{0}=1,\quad e^{10}=22026.4658$
$\sum_{j=1}^{3} e^{y_j}=22028.0723$
$y_1 = 0.00003$
$y_2 = 0.00005$
$y_3 = 0.99993$
Example 2: given the array $\{0.5, 0.8, 0.4\}$, compute its softmax output.
$e^{0.5}=1.649,\quad e^{0.8}=2.226,\quad e^{0.4}=1.492$
$\sum_{j=1}^{3} e^{y_j}=5.367$
$y_1 = 0.307$
$y_2 = 0.415$
$y_3 = 0.278$
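Both examples can be reproduced with a short NumPy sketch. The `softmax` helper below is not from the original text; subtracting the maximum before exponentiating is a standard numerical-stability trick that does not change the result:

```python
import numpy as np

def softmax(z):
    # Subtract the max first so exp() cannot overflow; the ratio is unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Example 1: {-0.5, 0, 10} — probabilities are about 0.00003, 0.00005, 0.99993;
# nearly all the mass lands on the largest entry.
print(softmax(np.array([-0.5, 0.0, 10.0])))

# Example 2: {0.5, 0.8, 0.4} — a much softer distribution.
print(np.round(softmax(np.array([0.5, 0.8, 0.4])), 3))  # [0.307 0.415 0.278]
```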
In summary, a more complete softmax regression model can be written as
$\pmb{y} = \text{softmax}(\pmb{W}\pmb{x} + \pmb{b})$
What kind of data do we expect the softmax model to output?
Linear regression uses the mean squared error (MSE) loss:
$l = \frac{1}{2}(\hat{y} - y)^2$
Its purpose is to push the predicted value closer to the true value. Obviously, this does not apply directly to the softmax regression model.
In the softmax regression model, the output is a probability distribution over discrete categories, and each element represents the probability of one category.
We expect:
- the predicted probability of the correct class to be higher $\rightarrow$ close to 1;
- the predicted probability of the wrong classes to be lower $\rightarrow$ close to 0.
In the extreme case, then, the expected output is a string of 0s and 1s in which exactly one element is 1, located at the position of the correct category.
$y_i = \begin{cases} 0, & \text{category} \neq i \\ 1, & \text{category} = i \end{cases}$
Example: in the iris dataset, the labels of Iris setosa, Iris versicolor, and Iris virginica are 1, 2, and 3, respectively. Then the best outputs we expect from the softmax model are:
Iris setosa = [1, 0, 0]
Iris versicolor = [0, 1, 0]
Iris virginica = [0, 0, 1]
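These target vectors are one-hot encodings of the labels. A minimal sketch (the `one_hot` helper is made up for illustration; it assumes labels numbered from 1, as above):

```python
import numpy as np

def one_hot(label, num_classes=3):
    # All zeros except a single 1 at the label's position (labels start at 1).
    y = np.zeros(num_classes)
    y[label - 1] = 1.0
    return y

print(one_hot(1))  # [1. 0. 0.]  Iris setosa
print(one_hot(2))  # [0. 1. 0.]  Iris versicolor
print(one_hot(3))  # [0. 0. 1.]  Iris virginica
```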
---
So, for a given softmax output, how do we measure the gap between it and the label?
Cross-entropy loss
Unlike the MSE or L1-norm loss, the cross-entropy loss reflects the gap between the prediction and the true value more sensitively:
the greater the difference between the prediction and the truth
$\downarrow$
the greater the loss
$\downarrow$
the larger the gradient
$\downarrow$
the stronger the parameter adjustment
The specific form of the cross-entropy loss is:
$H(p, q) = -\sum_{i} p_i \log(q_i)$
where $p$ and $q$ denote the true distribution and the predicted distribution, respectively.
Example 1
The true value is the first iris class, so its first dimension is 1; the prediction is [0.3, 0.2, 0.7], whose first dimension is 0.3.
The cross entropy (taking logs base 2) is $H([1, 0, 0], [0.3, 0.2, 0.7]) = -\log_2(0.3) \approx 1.74$
Example 2
The true value is the first iris class, so its first dimension is 1; the prediction is [0.6, 0.2, 0.2], whose first dimension is 0.6.
Then $H([1, 0, 0], [0.6, 0.2, 0.2]) = -\log_2(0.6) \approx 0.74$
Example 3
The true value is the first iris class, so its first dimension is 1; the prediction is [0.8, 0.1, 0.1], whose first dimension is 0.8.
Then $H([1, 0, 0], [0.8, 0.1, 0.1]) = -\log_2(0.8) \approx 0.32$
Obviously, the more accurate the prediction, the closer the predicted distribution is to the true one, and the lower the cross entropy.
Cross entropy is therefore very well suited as a loss function.
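The three examples can be reproduced with a short sketch. The `cross_entropy` helper is hypothetical; note that the numbers above only come out as 1.74 / 0.74 / 0.32 when the logarithm is taken base 2:

```python
import numpy as np

def cross_entropy(p, q):
    # H(p, q) = -sum_i p_i * log2(q_i); p is the true one-hot distribution.
    p, q = np.asarray(p, float), np.asarray(q, float)
    return -np.sum(p * np.log2(q))

print(round(cross_entropy([1, 0, 0], [0.3, 0.2, 0.7]), 2))  # 1.74
print(round(cross_entropy([1, 0, 0], [0.6, 0.2, 0.2]), 2))  # 0.74
print(round(cross_entropy([1, 0, 0], [0.8, 0.1, 0.1]), 2))  # 0.32
```

The loss gets smaller as the probability assigned to the correct class grows, which is exactly the behavior the examples illustrate.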
---
Theoretical summary
- Expression

$\begin{bmatrix} o_1 \\ \vdots \\ o_n \end{bmatrix} = \begin{bmatrix} w_{11} & \cdots & w_{1m} \\ \vdots & \ddots & \vdots \\ w_{n1} & \cdots & w_{nm} \end{bmatrix} \times \begin{bmatrix} x_1 \\ \vdots \\ x_m \end{bmatrix} + \begin{bmatrix} b_1 \\ \vdots \\ b_n \end{bmatrix}$
Written in matrix form:
$\pmb{o} = \pmb{W}\pmb{x} + \pmb{b}$
The final result is obtained by applying softmax:
$\pmb{y} = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} = \text{softmax}(\pmb{o}) = \frac{1}{\sum_{i} e^{o_i}} \begin{bmatrix} e^{o_1} \\ \vdots \\ e^{o_n} \end{bmatrix}$
- Loss function

Cross-entropy loss function:
$\mathcal{L}(\pmb{y}, \hat{\pmb{y}}) = -\sum_{i} y_i \log \hat{y}_i = -\log \hat{y}_{i|y_i=1}$
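Putting the summary together, here is a minimal end-to-end sketch of one forward pass plus the loss. The shapes, weights, and input are arbitrary assumptions, and the natural log is used in the loss, as is common in practice:

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, n_features = 3, 4                # assumed sizes
W = rng.normal(size=(n_classes, n_features))
b = rng.normal(size=n_classes)
x = rng.normal(size=n_features)

o = W @ x + b                                # linear scores: o = Wx + b
y_hat = np.exp(o - o.max())
y_hat = y_hat / y_hat.sum()                  # y_hat = softmax(o)

y = np.array([1.0, 0.0, 0.0])                # one-hot true label
loss = -np.sum(y * np.log(y_hat))            # -log of y_hat at the true class
print(y_hat.sum(), loss > 0)                 # probabilities sum to 1; loss > 0
```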