Deriving the gradient formula for the logistic regression algorithm

While studying the logistic regression algorithm recently, I read a lot of material online, and I think this article, http://blog.kamidox.com/logistic-regression.html, is the best written. But one key question is left unclear: why use -log(h(x)) as the cost function (also called the loss function)?

Unlike linear regression, the prediction function of logistic regression is nonlinear, so the squared error cannot simply be reused as the cost function. Choosing a cost function for logistic regression therefore takes a bit more work.

Before the formal discussion of this issue, let's review some basics.

Some common derivatives

$$ \frac{dy}{dx}(x^n) = nx^{n-1} $$

$$ \frac{dy}{dx}log_b(x) = \frac{1}{xln(b)} \text{; if } b=e \text{, then } \frac{dy}{dx}log_e(x) = \frac{1}{x} $$

$$ \frac{dy}{dx}(b^x)= b^xln(b) \text{; if } b=e \text{, then } \frac{dy}{dx}(e^x) = e^x $$

Differentiation rules

Constant multiple

If f(x) = Cg(x), where C is a constant, then:


\[ \frac{dy}{dx}(f(x))=C\frac{dy}{dx}(g(x)) \]

Sum and difference of functions

If f(x) = g_1(x) + g_2(x) - g_3(x), then:


\[ \frac {dy}{dx}(f(x)) = \frac {dy}{dx}(g_1(x)) + \frac {dy}{dx}(g_2(x)) - \frac {dy}{dx}(g_3(x)) \]

Product rule

If h(x) = f(x)g(x), then:


\[ h^{'}(x) = f^{'}(x)g(x) + g^{'}(x)f(x) \]


Setting h(x) = y, f(x) = u, g(x) = v, then:


\[ \frac {dy}{dx} = v\frac {du}{dx} + u\frac {dv}{dx} \]

Quotient rule

If h(x) = f(x)/g(x), then:


\[ h^{'}(x) = \frac {f^{'}(x)g(x) - g^{'}(x)f(x)}{{(g(x))}^2} \]


Setting y = u/v, then:


\[ \frac{dy}{dx} = \frac{\frac{du}{dx}v - \frac{dv}{dx}u}{v^2} \]

Chain rule

If h(x) = f(g(x)), then:


\[ h^{'}(x) = f^{'}(g(x))g^{'}(x) \]


If y is a function of u, and u is a function of x, then:


\[ \frac{dy}{dx} = \frac{dy}{du}\frac{du}{dx} \]
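
As a quick sanity check of these rules, the derivatives can be compared against a numerical finite difference. The sketch below is illustrative only (it assumes NumPy and uses a composite function of my own choosing, not one from the original article):

```python
import numpy as np

# Finite-difference sanity check of two of the rules above:
#   1) d/dx ln(x) = 1/x
#   2) chain rule applied to h(x) = ln(1 + e^x): h'(x) = e^x / (1 + e^x)

def numeric_derivative(f, x, eps=1e-6):
    """Central-difference approximation of f'(x)."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

x = 0.7
print(numeric_derivative(np.log, x), 1 / x)                       # both ≈ 1.4286
print(numeric_derivative(lambda t: np.log(1 + np.exp(t)), x),
      np.exp(x) / (1 + np.exp(x)))                                # both ≈ 0.6682
```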

Basic functions involved in the logistic regression algorithm

The linear function of the feature vector x and the regression coefficient vector w


\[ L_w(x) = w^Tx \]


sigmoid function


\[ g(z) = \frac{1}{1 + e^{-z}} \]


The classification prediction function


\[ h_w(x) = \frac{1}{1 + e^{-w^Tx}} \]


The logistic regression algorithm is a binary classification algorithm; the two classes can be denoted 1 and 0. The final goal of the algorithm is to find a suitable regression coefficient vector w such that every sample \(x_i\) in the data set satisfies:


\[ \begin{cases} h_w(x_i) \geq 0.5 & \text{true classification y = 1} \\ h_w(x_i) < 0.5 & \text{true classification y = 0} \end{cases} \]


The value of the classification function \(h_w(x_i)\) lies in the interval (0, 1), so it can be interpreted as the probability that sample \(x_i\) belongs to class 1 given the coefficients w. Since there are only two classes, \(1 - h_w(x)\) is likewise the probability that x belongs to class 0 given w.
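
A minimal Python sketch of these functions (assuming NumPy; the helper names such as predict_proba are my own and are not from the original post):

```python
import numpy as np

def sigmoid(z):
    """g(z) = 1 / (1 + e^(-z))"""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, x):
    """h_w(x): probability that sample x belongs to class 1 under coefficients w."""
    return sigmoid(np.dot(w, x))            # w^T x fed through the sigmoid

def predict_class(w, x):
    """Class 1 if h_w(x) >= 0.5, otherwise class 0."""
    return 1 if predict_proba(w, x) >= 0.5 else 0
```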

Select the cost function

Now it is time to choose the cost function. There is no obvious way to pick one directly, but we can assume that such a cost function exists and see what conditions it must satisfy. Suppose the cost function is:


\[ J(w) = \begin{cases} \frac{1}{m}\sum^m_{i=1}f(h_w(x_i)) &\text{y=1} \\ \frac{1}{m}\sum^m_{i=1}f(1- h_w(x_i)) &\text{y=0} \end{cases} \]


This cost function looks a lot like the one for linear regression, except that it contains an unknown function f(u); for linear regression, \(f(u) = (h_w(x_i) - y)^2\), but here we do not yet know what f(u) is. However, working backwards from the characteristics of \(h_w(x_i)\), we can obtain the first property that f(u) should have:

When u approaches 1 (a probability of 100%), f(u) should tend to its minimum value.

In the gradient descent formula, computing the gradient of J(w) comes down to computing the gradient of f(u), which can be obtained with the chain rule:


\[ \frac{δ}{δw_j}f(u) = f'(u)u'x_{ij} \]


Here u may be \(h_w(x_i)\) or \(1 - h_w(x_i)\), and u' is correspondingly \(h'_w(x_i)\) or \(-h'_w(x_i)\); either way it comes down to the derivative of the sigmoid function:

Let \(g(x) = \frac{1}{1 + e^{-x}}\). Then:


$
\frac{dy}{dx}(g(x)) = \frac{0(1+e^{-x}) - 1(-e^{-x})}{(1+e^{-x})^2} = \frac{e^{-x}}{(1+e^{-x})^2}
$

$
= \frac{1}{1+e^{-x}}\frac{e^{-x}}{1+e^{-x}} = \frac{1}{1+e^{-x}}\frac{1+e^{-x}-1}{1+e^{-x}} = \frac{1}{1+e^{-x}}(\frac{1+e^{-x}}{1+e^{-x}} - \frac{1}{1+e^{-x}}) = g(x)(1-g(x))
$
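
The identity \(g'(x) = g(x)(1-g(x))\) can also be checked numerically with a finite difference (an illustrative sketch, assuming NumPy):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = 1.3
eps = 1e-6
finite_diff = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)  # numeric g'(x)
identity = sigmoid(x) * (1.0 - sigmoid(x))                       # g(x)(1 - g(x))
print(finite_diff, identity)                                     # both ≈ 0.1683
```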


So with u = g(x) we have \(u' = u(1-u)\); substituting this into the gradient equation above gives:


$
\frac{δ}{δw_j}f(u) = f'(u)u(1-u)x_{ij}
$


If this equation could eliminate u or (1-u) without introducing any other function, the gradient computation would be greatly simplified. This gives the second property that f(u) needs to satisfy:

It should satisfy \(f'(u) = \frac{A}{u}\), where A is a constant.

Among the derivatives listed earlier there happens to be a function that meets this requirement: \(\frac{dy}{du}(ln(u)) = \frac{1}{u}\). However, f(u) = ln(u) cannot satisfy the first property (it grows toward its maximum, not its minimum, as u approaches 1), so we simply add a minus sign: f(u) = -ln(u).
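
Indeed, substituting \(f'(u) = -\frac{1}{u}\) into the gradient equation above, the u in the denominator cancels and only a (1-u) factor remains:

\[ \frac{δ}{δw_j}f(u) = -\frac{1}{u}u(1-u)x_{ij} = -(1-u)x_{ij} \]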

Having found f(u), we can rewrite the cost function:

\[ J(w) = \begin{cases} \frac{1}{m}\sum^m_{i=1}-ln(h_w(x_i)) & \text{y=1} \\ \frac{1}{m}\sum^m_{i=1}-ln(1- h_w(x_i)) & \text{y=0} \end{cases} \]

Combined into a single formula:

$
J(w) = \frac{1}{m}\sum^m_{i=1}-y_iln(h_w(x_i)) - (1-y_i)ln(1- h_w(x_i))
$
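
A minimal Python sketch of this cost function (vectorized with NumPy; the name cross_entropy_cost and the argument layout are my own assumptions):

```python
import numpy as np

def cross_entropy_cost(w, X, y):
    """J(w) = (1/m) * sum( -y*ln(h) - (1-y)*ln(1-h) ).

    X is an (m, n) matrix of samples, y an (m,) vector of 0/1 labels,
    and w an (n,) coefficient vector.
    """
    m = X.shape[0]
    h = 1.0 / (1.0 + np.exp(-X.dot(w)))      # h_w(x_i) for every sample
    return (1.0 / m) * np.sum(-y * np.log(h) - (1 - y) * np.log(1 - h))
```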

Gradient descent equation


$
w_j := w_j - \alpha\frac{δ}{δw_j}J(w)
$


$
\frac{δ}{δw_j}J(w) = \frac{δ}{δw_j}\frac{1}{m}\sum^m_{i=1}-y_iln(h_w(x_i)) - (1-y_i)ln(1- h_w(x_i))
$


$
= \frac{1}{m}\sum^m_{i=1}\frac{-y_ih_w(x_i)(1-h_w(x_i))x_{ij}}{h_w(x_i)}+\frac{-(1-y_i)(1-h_w(x_i))(1-(1-h_w(x_i)))(-x_{ij})}{1-h_w(x_i)}
$


$
= \frac{1}{m}\sum^m_{i=1}-y_i(1-h_w(x_i))x_{ij}+(1-y_i)h_w(x_i)x_{ij}
$

$
= \frac{1}{m}\sum^m_{i=1}(-y_i + y_ih_w(x_i) + h_w(x_i) - y_ih_w(x_i))x_{ij}
= \frac{1}{m}\sum^m_{i=1}(h_w(x_i)-y_i)x_{ij}
$

The final gradient descent formula is obtained as follows:

$
w_j := w_j - \alpha\frac{δ}{δw_j}J(w) = w_j - \alpha\frac{1}{m}\sum^m_{i=1}(h_w(x_i)-y_i)x_{ij}
$
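
Putting it all together, a minimal batch gradient descent sketch based on this formula (assuming NumPy; the parameter names learning_rate and n_iters are my own, and no convergence check or numerical safeguards are included):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, learning_rate=0.1, n_iters=1000):
    """Batch gradient descent: w_j := w_j - alpha * (1/m) * sum_i (h_w(x_i) - y_i) * x_ij."""
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(n_iters):
        h = sigmoid(X.dot(w))                    # h_w(x_i) for all samples at once
        gradient = (1.0 / m) * X.T.dot(h - y)    # j-th entry: (1/m) * sum_i (h_i - y_i) * x_ij
        w -= learning_rate * gradient
    return w

# Usage on a tiny made-up data set (first column of X acts as the bias term):
# X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 3.0], [1.0, -2.0]])
# y = np.array([1, 0, 1, 0])
# w = train_logistic_regression(X, y)
```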
