Recently a lot of articles about the logistic regression learning algorithm have appeared on the Internet; I think this one, http://blog.kamidox.com/logistic-regression.html, is the best written. But it leaves one key question unexplained: why use -log(h(x)) as the cost function (also called the loss function)?
Unlike linear regression, the prediction function of logistic regression is nonlinear, so the squared error used there cannot simply be reused as the cost function. Choosing a cost function for logistic regression therefore takes some extra work.
Before formally discussing this question, let's review some basics.
Derivatives of some common functions
$$ \frac{d}{dx}(x^n) = nx^{n-1} $$
$$ \frac{d}{dx}\log_b(x) = \frac{1}{x\ln(b)} \qquad \text{if } b = e: \ \frac{d}{dx}\ln(x) = \frac{1}{x} $$
$$ \frac{d}{dx}(b^x) = b^x\ln(b) \qquad \text{if } b = e: \ \frac{d}{dx}(e^x) = e^x $$
Differentiation rules
Constant multiple rule
If f(x) = Cg(x), where C is a constant, then
\[ \frac{d}{dx}(f(x)) = C\frac{d}{dx}(g(x)) \]
Sum and difference rule
If \( f(x) = g_1(x) + g_2(x) - g_3(x) \), then
\[ \frac{d}{dx}(f(x)) = \frac{d}{dx}(g_1(x)) + \frac{d}{dx}(g_2(x)) - \frac{d}{dx}(g_3(x)) \]
Product rule
If h(x) = f(x)g(x), then:
\[ h^{'}(x) = f^{'}(x)g(x) + g^{'}(x)f(x) \]
Let h(x) = y, f(x) = u, g(x) = v; then:
\[ \frac {dy}{dx} = v\frac {du}{dx} + u\frac {dv}{dx} \]
Quotient rule
If h(x) = f(x) / g(x), then:
\[ h^{'}(x) = \frac {f^{'}(x)g(x) - g^{'}(x)f(x)}{{(g(x))}^2} \]
Let y = u / v; then:
\[ \frac{dy}{dx} = \frac{\frac{du}{dx}v - \frac{dv}{dx}u}{v^2} \]
Chain rule
If h(x) = f(g(x)), then:
\[ h^{'}(x) = f^{'}(g(x))g^{'}(x) \]
If y is a function of u, and u is a function of x, then:
\[ \frac{dy}{dx} = \frac{dy}{du}\frac{du}{dx} \]
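As a quick sanity check, the chain rule can be verified numerically. The sketch below (in Python, using made-up example functions f(u) = ln(u) and g(x) = x², chosen only for illustration) compares the derivative given by the chain rule against a finite-difference approximation:

```python
import math

def numerical_derivative(fn, x, eps=1e-6):
    """Central-difference approximation of fn'(x)."""
    return (fn(x + eps) - fn(x - eps)) / (2 * eps)

f = math.log           # f(u) = ln(u), so f'(u) = 1/u
g = lambda x: x ** 2   # g(x) = x^2,  so g'(x) = 2x
h = lambda x: f(g(x))  # h(x) = f(g(x)) = ln(x^2)

x = 1.5
# Chain rule: h'(x) = f'(g(x)) * g'(x) = (1 / x^2) * 2x = 2 / x
analytic = 2.0 / x
numeric = numerical_derivative(h, x)
print(abs(analytic - numeric) < 1e-6)  # the two derivatives agree
```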
Basic functions in the logistic regression algorithm
The linear function of the feature vector x and the regression coefficient vector w:
\[ L_w(x) = w^Tx \]
sigmoid function
\[ g(z) = \frac{1}{1 + e^{-z}} \]
The classification prediction function
\[ h_w(x) = \frac{1}{1 + e^{-w^Tx}} \]
The logistic regression algorithm is a binary classification algorithm; the two classes can be labeled 1 and 0. The final goal of the algorithm is to find a suitable regression coefficient vector w such that every data point \(x_i\) in the data set satisfies:
\[ \begin{cases} h_w(x_i) \geq 0.5 & \text{true class } y = 1 \\ h_w(x_i) < 0.5 & \text{true class } y = 0 \end{cases} \]
Since the classification function \(h_w(x_i)\) takes values in the interval (0, 1), it can be interpreted as the probability that \(x_i\) belongs to class 1 under the coefficients w. Since there are only two classes, \(1 - h_w(x)\) is likewise the probability that x belongs to class 0 under the coefficients w.
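The prediction function and the 0.5 decision threshold described above can be sketched in Python. The coefficients and data point below are made up purely for illustration:

```python
import math

def sigmoid(z):
    """g(z) = 1 / (1 + e^{-z}), maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, x):
    """h_w(x) = g(w^T x): the probability that x belongs to class 1."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)))

def predict_class(w, x):
    """Classify as 1 when h_w(x) >= 0.5, otherwise 0."""
    return 1 if predict_proba(w, x) >= 0.5 else 0

# Hypothetical coefficients and data point, for illustration only
w = [0.5, -1.2, 0.3]
x = [1.0, 0.4, 2.0]   # first component is the intercept feature
p = predict_proba(w, x)
print(predict_class(w, x), round(p, 3))
```

Since probability and class label come from the same value of \(h_w(x)\), the threshold at 0.5 is exactly the point where class 1 becomes more likely than class 0.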
Choosing the cost function
Now we choose the cost function. There is no obvious candidate, but we can assume that a suitable cost function exists and work out what conditions it must satisfy. Suppose the cost function is:
\[ J(w) = \begin{cases} \frac{1}{m}\sum^m_{i=1}f(h_w(x_i)) &\text{y=1} \\ \frac{1}{m}\sum^m_{i=1}f(1- h_w(x_i)) &\text{y=0} \end{cases} \]
This cost function looks a lot like the linear regression cost function, except that it contains an unknown function f(u). For linear regression, \( f(u) = (h_w(x_i) - y)^2 \); here f(u) is not yet known. However, reasoning backwards from the characteristics of \(h_w(x_i)\), we obtain the first property f(u) should have:
When u approaches 1 (probability 100%), f(u) should approach its minimum value.
In the gradient descent formula, computing the gradient of J(w) reduces to computing the gradient of f(u), which can be found with the chain rule:
\[ \frac{δ}{δw_j}f(u) = f'(u)u'x_{ij} \]
Here u may be \(h_w(x_i)\) or \(1 - h_w(x_i)\), and u' is correspondingly \(h'_w(x_i)\) or \(-h'_w(x_i)\). Either way, this comes down to the derivative of the sigmoid function:
Let \( g(x) = \frac{1}{1 + e^{-x}} \)
$
\frac{dy}{dx}(g(x)) = \frac{0(1+e^{-x}) - 1(-e^{-x})}{(1+e^{-x})^2} = \frac{e^{-x}}{(1+e^{-x})^2}
$
$
= \frac{1}{1+e^{-x}}\frac{e^{-x}}{1+e^{-x}} = \frac{1}{1+e^{-x}}\frac{1+e^{-x}-1}{1+e^{-x}} = \frac{1}{1+e^{-x}}(\frac{1+e^{-x}}{1+e^{-x}} - \frac{1}{1+e^{-x}}) = g(x)(1-g(x))
$
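The identity \(g'(x) = g(x)(1 - g(x))\) derived above can also be checked numerically. A minimal Python sketch:

```python
import math

def g(x):
    """The sigmoid function g(x) = 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + math.exp(-x))

def g_prime_numeric(x, eps=1e-6):
    """Central-difference approximation of g'(x)."""
    return (g(x + eps) - g(x - eps)) / (2 * eps)

# Verify g'(x) = g(x) * (1 - g(x)) at a few sample points
for x in (-2.0, 0.0, 1.5):
    identity = g(x) * (1.0 - g(x))
    assert abs(g_prime_numeric(x) - identity) < 1e-8
print("g'(x) = g(x)(1 - g(x)) holds numerically")
```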
So with u = g(x), we have \( u' = u(1-u) \); substituting into the gradient equation gives:
$
\frac{δ}{δw_j}f(u) = f'(u)u(1-u)x_{ij}
$
If f'(u) can cancel the factor u or (1-u) in this equation without introducing new functions, the gradient computation is greatly simplified. This gives the second property f(u) needs to satisfy:
\( f'(u) = \frac{A}{u} \), where A is a constant.
One of the functions listed earlier happens to meet this requirement: \( \frac{d}{du}(\ln(u)) = \frac{1}{u} \). But f(u) = ln(u) fails the first property, since ln(u) approaches its maximum on (0, 1), not its minimum, as u approaches 1. Adding a minus sign fixes this: f(u) = -ln(u).
Having found f(u), we rewrite the cost function:
\[ J(w) = \begin{cases} \frac{1}{m}\sum^m_{i=1} -\ln(h_w(x_i)) & \text{y=1} \\ \frac{1}{m}\sum^m_{i=1} -\ln(1 - h_w(x_i)) & \text{y=0} \end{cases} \]
Combined into a single function:
$
J(w) = \frac{1}{m}\sum^m_{i=1} -y_i\ln(h_w(x_i)) - (1-y_i)\ln(1 - h_w(x_i))
$
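This combined cost function translates directly into code. A minimal Python sketch, with a made-up toy data set (two features including an intercept column):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(w, X, y):
    """J(w) = (1/m) * sum( -y*ln(h) - (1-y)*ln(1-h) ) over the data set."""
    m = len(X)
    total = 0.0
    for xi, yi in zip(X, y):
        h = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
        total += -yi * math.log(h) - (1 - yi) * math.log(1 - h)
    return total / m

# Toy data set, made up for illustration
X = [[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]]
y = [1, 0, 1]
print(cost([0.0, 0.0], X, y))  # with w = 0, h = 0.5 everywhere, so J = ln(2)
```

Note that for each sample only one of the two terms is nonzero, which is exactly how the piecewise definition above collapses into one formula.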
Gradient descent equation
$
w_j := w_j - \alpha\frac{δ}{δw_j}J(w)
$
$
\frac{δ}{δw_j}J(w) = \frac{δ}{δw_j}\frac{1}{m}\sum^m_{i=1}-y_i\ln(h_w(x_i)) - (1-y_i)\ln(1- h_w(x_i))
$
$
= \frac{1}{m}\sum^m_{i=1}\frac{-y_ih_w(x_i)(1-h_w(x_i))x_{ij}}{h_w(x_i)}+\frac{-(1-y_i)(1-h_w(x_i))(1-(1-h_w(x_i)))(-x_{ij})}{1-h_w(x_i)}
$
$
= \frac{1}{m}\sum^m_{i=1}-y_i(1-h_w(x_i))x_{ij}+(1-y_i)h_w(x_i)x_{ij}
$
$
= \frac{1}{m}\sum^m_{i=1}(-y_i + y_ih_w(x_i) + h_w(x_i) - y_ih_w(x_i))x_{ij}
= \frac{1}{m}\sum^m_{i=1}(h_w(x_i)-y_i)x_{ij}
$
The final gradient descent formula is:
$
w_j := w_j - \alpha\frac{δ}{δw_j}J(w) = w_j - \alpha\frac{1}{m}\sum^m_{i=1}(h_w(x_i)-y_i)x_{ij}
$
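Putting it all together, the gradient descent update above can be implemented in a few lines of Python. This is a sketch on a made-up, linearly separable toy data set, not a production implementation:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_descent(X, y, alpha=0.1, iters=5000):
    """Fit logistic regression coefficients with the update
    w_j := w_j - alpha * (1/m) * sum_i (h_w(x_i) - y_i) * x_ij."""
    m, n = len(X), len(X[0])
    w = [0.0] * n
    for _ in range(iters):
        h = [sigmoid(sum(wj * xj for wj, xj in zip(w, xi))) for xi in X]
        for j in range(n):
            grad_j = sum((h[i] - y[i]) * X[i][j] for i in range(m)) / m
            w[j] -= alpha * grad_j
    return w

# Toy linearly separable data, made up for illustration: x > 0 -> class 1
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]  # intercept + one feature
y = [0, 0, 1, 1]
w = gradient_descent(X, y)
preds = [1 if sigmoid(w[0] + w[1] * xi[1]) >= 0.5 else 0 for xi in X]
print(preds)  # should recover the labels: [0, 0, 1, 1]
```

The learning rate alpha and iteration count here are arbitrary choices for the toy data; in practice they need tuning per data set.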