Deep learning | 1 | Softmax regression



Foreword

Softmax regression is the general form of logistic regression: logistic regression is used for binary classification, while softmax regression is used for multi-class classification.


Softmax regression can be viewed as a single-layer neural network with multiple outputs.

As a neural network: every neuron in the output layer is linearly connected to every neuron in the input layer, and there is no hidden layer.

For any output neuron $y_j$, we have $y_j = \sum_i w_{i,j}x_i + b_j$.

Furthermore, the softmax regression model can be written as:

$\pmb{y}= \pmb{W}\pmb{x}+\pmb{b}$

Suppose $\pmb{x}\in\mathbb{R}^{n\times 1}$ and $\pmb{y}\in\mathbb{R}^{m\times 1}$; then

$\pmb{W}\in\mathbb{R}^{m\times n},\quad \pmb{b}\in\mathbb{R}^{m\times 1}$
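As a minimal sketch of this linear mapping in NumPy (the dimensions n = 4 and m = 3 below are arbitrary choices for illustration, not values from the text):

```python
import numpy as np

# Hypothetical sizes for illustration: n input features, m output classes.
n, m = 4, 3

rng = np.random.default_rng(0)
W = rng.normal(size=(m, n))   # W ∈ R^{m×n}
b = rng.normal(size=(m, 1))   # b ∈ R^{m×1}
x = rng.normal(size=(n, 1))   # x ∈ R^{n×1}

y = W @ x + b                 # y ∈ R^{m×1}: one raw score per output neuron
print(y.shape)                # (3, 1)
```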


The output of softmax regression is not an arbitrary real value, but a probability between 0 and 1.

The essential difference between softmax regression and linear regression is the meaning of the output.

Take the iris dataset as an example: $y_1, y_2, y_3$ represent the probabilities that $\pmb{x}$ belongs to Iris setosa, Iris versicolor, or Iris virginica, respectively.

Since they are probabilities, the values of $y_1, y_2, y_3$ should satisfy:

1. each should lie between 0 and 1
2. they should sum to 1

Treating this as a linear regression problem does not guarantee that $y_1, y_2, y_3$ satisfy these conditions.

Therefore the linear outputs need to be processed with softmax:

$y_i = \frac{e^{o_i}}{\sum_{j=1}^{3} e^{o_j}}$

where $o_1, o_2, o_3$ are the linear outputs $\pmb{W}\pmb{x}+\pmb{b}$.

The above formula is the softmax function .

Obviously $y_i\in(0, 1)$: $y_i$ can never be exactly 0, but it may be very close to 0.

Example 1: Given the array {-0.5, 0, 10}, compute its softmax output.

$e^{-0.5}=0.6065,\quad e^{0}=1,\quad e^{10}=22026.4658$

$\sum_{j=1}^{3} e^{o_j}=22028.0723$

$y_1 \approx 0.00003$

$y_2 \approx 0.00005$

$y_3 \approx 0.99993$

Example 2: Given the array {0.5, 0.8, 0.4}, compute its softmax output.

$e^{0.5}=1.649,\quad e^{0.8}=2.226,\quad e^{0.4}=1.492$

$\sum_{j=1}^{3} e^{o_j}=5.367$

$y_1 \approx 0.307$

$y_2 \approx 0.415$

$y_3 \approx 0.278$
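The two worked examples can be checked with a few lines of NumPy. The softmax helper below is a sketch of my own (it subtracts the maximum before exponentiating for numerical stability), not code from the original post:

```python
import numpy as np

def softmax(o):
    """Softmax over a 1-D array of raw scores (logits)."""
    o = np.asarray(o, dtype=float)
    e = np.exp(o - o.max())        # subtract the max for numerical stability
    return e / e.sum()

print(softmax([-0.5, 0.0, 10.0]))  # ≈ [0.00003, 0.00005, 0.99993]
print(softmax([0.5, 0.8, 0.4]))    # ≈ [0.307, 0.415, 0.278]
```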

In summary, a more complete softmax regression model can be written as

$\pmb{y}= \mathrm{softmax}(\pmb{W}\pmb{x}+\pmb{b})$


What kind of data do we expect the softmax model to output?

Linear regression uses the mean squared error (MSE) loss:

$l = 0.5 (\hat{y}-y)^2$

Its purpose is to bring the predicted value closer to the true value. Obviously, this is not well suited to the softmax regression model.

In the softmax regression model, the prediction is a distribution over discrete categories: each output value represents the probability of one category.

We expect:

  • The predicted probability of the correct class should be high → close to 1

  • The predicted probabilities of the wrong classes should be low → close to 0

Then, in the extreme case, the expected output is a vector consisting of 0s and 1s, with exactly one element equal to 1, located at the position of the correct category.

$y_i= \begin{cases} 0, & \text{category} \neq i \\ 1, & \text{category} = i \end{cases}$

Example: In the iris dataset, the labels of Iris setosa, Iris versicolor, and Iris virginica are 1, 2, and 3, respectively. Then, in the softmax model, the ideal outputs we expect are:

Iris setosa = [1, 0, 0]

Iris versicolor = [0, 1, 0]

Iris virginica = [0, 0, 1]
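A small sketch of this one-hot encoding (the class names are from the example above; the 0-based indices are my own convention for the code, whereas the text labels the classes 1–3):

```python
import numpy as np

classes = ["Iris setosa", "Iris versicolor", "Iris virginica"]

def one_hot(index, num_classes=3):
    """Return the one-hot label vector for a 0-based class index."""
    v = np.zeros(num_classes)
    v[index] = 1.0
    return v

for i, name in enumerate(classes):
    print(name, one_hot(i))
# Iris setosa [1. 0. 0.]
# Iris versicolor [0. 1. 0.]
# Iris virginica [0. 0. 1.]
```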

So, for a given softmax output, how do we measure the gap between it and the label?

Cross-entropy loss

Unlike the MSE or L1-norm loss, the cross-entropy loss reflects the gap between the prediction and the true value more sensitively:

The greater the difference between the prediction and the label

↓

the greater the loss

↓

the larger the gradient of the model

↓

the stronger the parameter update

The specific form of the cross-entropy loss:

$H(p, q)=-\sum_{i} p_i\log(q_i)$

where $p$ and $q$ denote the true distribution (the label) and the predicted distribution, respectively. (The examples below use base-2 logarithms.)

Example 1

The true label is the first iris class, so the label vector is 1 in the first dimension; the prediction is [0.3, 0.2, 0.7], whose first component is 0.3.

Then the cross-entropy is $H([1, 0, 0], [0.3, 0.2, 0.7])= -\log_2(0.3)\approx 1.74$

Example 2

The true label is the first iris class, so the label vector is 1 in the first dimension; the prediction is [0.6, 0.2, 0.2], whose first component is 0.6.

Then the cross-entropy is $H([1, 0, 0], [0.6, 0.2, 0.2])= -\log_2(0.6)\approx 0.74$

Example 3

The true label is the first iris class, so the label vector is 1 in the first dimension; the prediction is [0.8, 0.1, 0.1], whose first component is 0.8.

Then the cross-entropy is $H([1, 0, 0], [0.8, 0.1, 0.1])= -\log_2(0.8)\approx 0.32$
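These three values can be reproduced with a short helper. To match the numbers above, the logarithm is taken in base 2 here; deep learning libraries normally use the natural logarithm instead, which changes the values but not the ordering:

```python
import numpy as np

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i * log2(q_i); p is the one-hot label, q the prediction."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return -np.sum(p * np.log2(q))

label = [1, 0, 0]
print(cross_entropy(label, [0.3, 0.2, 0.7]))  # ≈ 1.74
print(cross_entropy(label, [0.6, 0.2, 0.2]))  # ≈ 0.74
print(cross_entropy(label, [0.8, 0.1, 0.1]))  # ≈ 0.32
```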

Obviously, the more accurate the prediction, the closer the predicted distribution is to the true distribution, and the lower the cross-entropy.
Therefore, cross-entropy is well suited as a loss function.

Theoretical summary

  1. Expression

$\begin{bmatrix} o_1\\ \vdots\\ o_{n} \end{bmatrix} = \begin{bmatrix} w_{11}&\cdots&w_{1m}\\ \vdots&\ddots&\vdots\\ w_{n1}&\cdots&w_{nm} \end{bmatrix} \times \begin{bmatrix} x_1\\ \vdots\\ x_{m} \end{bmatrix}+\begin{bmatrix} b_1\\ \vdots\\ b_{n} \end{bmatrix}$

Written in matrix form:

$\pmb{o}=\pmb{W}\pmb{x}+\pmb{b}$

The final output is obtained by applying softmax:

$\pmb{y}=\begin{bmatrix} y_1\\ \vdots\\ y_{n} \end{bmatrix}=\text{softmax}(\pmb{o})=\frac{1}{\sum_{i}e^{o_i}}\begin{bmatrix} e^{o_1}\\ \vdots\\ e^{o_n} \end{bmatrix}$

  2. Loss function
    Cross-entropy loss:
    $\mathcal{L}(\pmb{y}, \hat{\pmb{y}})=-\sum_{i}y_i\log\hat{y}_i=-\log\hat{y}_{i|y_i=1}$
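Putting the pieces together, here is a minimal sketch of the forward pass and the cross-entropy loss for a single sample (the dimensions, the random data, and the use of the natural logarithm are my own choices for illustration):

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())        # stable softmax over raw scores
    return e / e.sum()

def cross_entropy(y, y_hat, eps=1e-12):
    """L(y, y_hat) = -sum_i y_i * log(y_hat_i), natural log here."""
    return -np.sum(y * np.log(y_hat + eps))

# Hypothetical sizes: m = 4 features, n = 3 classes.
m, n = 4, 3
rng = np.random.default_rng(0)
W = rng.normal(size=(n, m))
b = rng.normal(size=n)

x = rng.normal(size=m)             # one input sample
y = np.array([1.0, 0.0, 0.0])      # one-hot label: first class

o = W @ x + b                      # linear scores  o = Wx + b
y_hat = softmax(o)                 # predicted probabilities
loss = cross_entropy(y, y_hat)
print(y_hat, loss)
```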


Origin blog.csdn.net/weixin_51672245/article/details/131055384