Following a Unified Machine Learning Framework to Understand Logistic Regression


Tags: machine learning, LR, classification

I. Introduction

  1. This blog is not meant to be a rigorous science write-up; it only records my ideas and thought process. Feel free to point out blind spots in my thinking, but I hope everyone develops their own understanding.
  2. This post draws on many materials from the internet.

II. Understanding

The unified machine learning framework:

1. Model
2. Strategy (loss function)
3. Algorithm

Within this framework, the core of LR lies in its loss function, which uses the Sigmoid function and Cross Entropy.

LR: Sigmoid + Cross Entropy

Model

Digression: see the previous post, follow a unified framework for understanding SVM machine learning. You will find that LR and SVM share the same model and algorithm; the difference lies in the loss function.

Given a data set \((x^1,\hat{y}^1),(x^2,\hat{y}^2),\dots,(x^n,\hat{y}^n)\), where \(\hat{y}^i \in \{0,1\}\) and \(y\) denotes the predicted label value, define the linear function:
\[f(x) = w^Tx + b\]

\[y = \begin{cases} 1, \quad &f(x)>0\\ 0, &f(x)<0 \end{cases}\]
At the same time: when \(\hat{y}=1\), \(f(x)\) should be as large as possible; when \(\hat{y}=0\), \(f(x)\) should be as small as possible.
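As a minimal sketch of the model part (the function names and the sample weights below are my own illustration, not from the post):

```python
import numpy as np

def f(x, w, b):
    """Linear score f(x) = w^T x + b."""
    return np.dot(w, x) + b

def predict_label(x, w, b):
    """Decision rule: label 1 if f(x) > 0, else 0."""
    return 1 if f(x, w, b) > 0 else 0

w = np.array([1.0, -2.0])
b = 0.5
print(predict_label(np.array([3.0, 1.0]), w, b))  # f = 3 - 2 + 0.5 = 1.5 > 0, so 1
print(predict_label(np.array([0.0, 1.0]), w, b))  # f = -2 + 0.5 = -1.5 < 0, so 0
```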

Loss

ERM (cross-entropy loss): Sigmoid + Cross Entropy.
The purpose of the Sigmoid is to squash the value of \(f(x)\) into the range (0, 1), so that it can be used to compute the cross-entropy loss.

\[ \begin{aligned} &z = \sigma(f(x))\\ &p(\hat{y}=1|x;w,b) = z\\ &p(\hat{y}=0|x;w,b) = 1-z \end{aligned} \]

\(z\) represents the predicted probability (likelihood).
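In code, the squashing step might look like this (a minimal sketch; `sigmoid` is just the standard definition):

```python
import numpy as np

def sigmoid(t):
    """Squash a raw score into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-t))

# z = sigma(f(x)) is read as p(y_hat = 1 | x); 1 - z as p(y_hat = 0 | x)
for score in (-4.0, 0.0, 4.0):
    z = sigmoid(score)
    print(f"f(x)={score:+.1f}  p(y=1)={z:.3f}  p(y=0)={1-z:.3f}")
```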

Empirical risk

1. The loss function using \(sigmoid + cross\ entropy\):
\[\hat{y}=\begin{cases} 1,\; &f(x)>0\; &\sigma(f(x))\longrightarrow 1, &Loss=-ln(z)\\ 0,\; &f(x)<0\; &\sigma(f(x))\longrightarrow 0, &Loss=-ln(1-z) \end{cases}\]

\[ Loss = -[\hat{y} ln z+(1-\hat{y})ln (1-z)] \]
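A minimal sketch of this per-sample loss (the `eps` clipping guard is my own addition, to avoid \(ln(0)\)):

```python
import numpy as np

def cross_entropy(y_hat, z, eps=1e-12):
    """Loss = -[y_hat*ln(z) + (1 - y_hat)*ln(1 - z)] for a single sample."""
    z = np.clip(z, eps, 1 - eps)  # guard against log(0)
    return -(y_hat * np.log(z) + (1 - y_hat) * np.log(1 - z))

print(cross_entropy(1, 0.9))  # confident and correct: small loss (~0.105)
print(cross_entropy(1, 0.1))  # confident and wrong: large loss (~2.303)
```

Note how a confidently wrong prediction is penalized far more heavily than a correct one, which is exactly what drives the gradient later.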

2. From the maximum-likelihood perspective.
Assume the training samples are independent; the likelihood function is then:

\[ \begin{aligned} Loss &= p(\hat{Y}|X;w,b) \\ &= \prod_{i=1}^n p(\hat{y}^i|x^i;w,b)\\ &= \prod_{i=1}^n z_i^{\hat{y}^i} (1-z_i)^{1-\hat{y}^i}\\ \ln Loss &= \sum_{i=1}^n \hat{y}^i\ln z_i + (1-\hat{y}^i)\ln(1-z_i) \end{aligned} \]
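A quick numeric check (with made-up predicted probabilities) that taking the logarithm turns the product form into the sum form:

```python
import numpy as np

y_hat = np.array([1, 0, 1, 1])               # labels
z = np.array([0.8, 0.3, 0.6, 0.9])           # made-up predicted probabilities

# likelihood as a product over independent samples
likelihood = np.prod(z**y_hat * (1 - z)**(1 - y_hat))
# log-likelihood as a sum
log_likelihood = np.sum(y_hat * np.log(z) + (1 - y_hat) * np.log(1 - z))

print(np.log(likelihood), log_likelihood)  # the two agree
```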

Thus the loss functions obtained from the cross-entropy perspective and from the maximum-likelihood perspective turn out to be identical (up to sign), which suggests a hidden link between the two.
Let us now explore that link.

3. The link between cross entropy and maximum likelihood
Entropy:

\[H(X) = -E_{x \sim P}[log {P(x)}]\]
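For intuition, a small sketch of the discrete case (the helper name `entropy` is mine):

```python
import numpy as np

def entropy(p):
    """H(X) = -sum_x p(x) log p(x) over the support of the distribution."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # the 0 * log(0) terms are taken as 0
    return -np.sum(p * np.log(p))

print(entropy([0.5, 0.5]))    # fair coin: ln 2, maximal uncertainty
print(entropy([0.99, 0.01]))  # near-certain outcome: close to 0
```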

KL divergence: the KL divergence measures the difference between two distributions:
\[ \begin{aligned} D_{KL}(P||Q) &= E_{x \sim P}[log \frac{P(x)}{Q(x)}]\\ &= E_{x \sim P}[log{P(x)} - log{Q(x)}] \end{aligned} \]

\(D_{KL}(P||Q)\) means choosing \(Q\) so that it assigns high probability wherever \(P\) assigns high probability. Put simply: find a set of parameters describing the distribution \(Q\) such that wherever \(P\) places high probability, \(Q\) also places high probability.

Cross entropy:
\[ H(P,Q) = H(P) + D_{KL}(P||Q) \]

In our specific scenario, the distribution of \(\hat{Y}\) plays the role of \(P\), and the distribution of \(Y\) plays the role of \(Q\). The distribution of \(\hat{Y}\) is fixed; the distribution of \(Y\) is what we solve for. In other words, we want the distribution of \(Y\) to approximate the distribution of \(\hat{Y}\) as closely as possible.

In our scenario, \(\hat{Y}\) is fixed but unknown (a prior distribution).

\[ \begin{aligned} H(\hat{Y},Y) &= H(\hat{Y})+D_{KL}(\hat{Y}||Y)\\ &=-E_{x \sim \hat{Y}}[log \hat{Y}]+E_{x \sim \hat{Y}}[log{\hat{Y}(x)}-log{Y(x)}]\\ &=E_{x \sim \hat{Y}}-log{Y(x)} \end{aligned} \]

When we minimize cross entropy:
\[ \begin{aligned} &min \;\; H(\hat{Y},Y)\\ &min \;\; D_{KL}(\hat{Y}||Y)\\ &min \;\; E_{x \sim \hat{Y}}[log{\hat{Y}(x)} - log{Y(x)}]\\ &min \;\; E_{x \sim \hat{Y}}[-log{Y(x)}] \end{aligned} \]

When the distribution of \(\hat{Y}\) is known, its entropy is a constant, and minimizing the cross entropy is then equivalent to minimizing the KL divergence.
With respect to \(Y\), minimizing the cross entropy is equivalent to minimizing the KL divergence, because \(H(\hat{Y})\) is independent of \(Y\).

Note the difference between the final expression \(E_{x \sim \hat{Y}}[-log{Y(x)}]\) and the entropy \(H(Y)\). Entropy starts from a variable x whose probability distribution is already known and computes the expected total amount of information generated by the distribution's events; in this expression, however, the distribution of \(Y\) is unknown: it is exactly what we are solving for. We only want \(Y\) to be as close to \(\hat{Y}\) as possible; we do not need to know the exact distribution of each (that is, we do not need an explicit expression for the probability distribution), so it suffices to define the difference between them directly through the KL divergence.
This recalls the kernel trick introduced with SVMs: there, too, a low-dimensional space is lifted into a high-dimensional one and the inner product is computed there, and for the whole process what we ultimately need is only the result of the inner product. To reduce the amount of computation while still achieving the final goal, the kernel skips the complicated intermediate step, so we never need to know what the lifted representation actually looks like.

Minimizing the KL divergence yields the same parameter estimates as maximum likelihood estimation, so cross entropy and maximum likelihood estimation are linked through the KL divergence.

Algorithm

Gradient descent

\(\sigma(x)' = \sigma(x)(1-\sigma(x))\)
\(min \;\; Loss = -\sum_{i=1}^n \hat{y}^i ln z_i + (1-\hat{y}^i) ln(1-z_i)\)
\(z = \sigma(f(x))\)
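A quick finite-difference sanity check of the derivative identity \(\sigma(x)' = \sigma(x)(1-\sigma(x))\) (the evaluation point is arbitrary):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

x, h = 0.7, 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)  # central difference
analytic = sigmoid(x) * (1 - sigmoid(x))               # sigma(x)(1 - sigma(x))
print(numeric, analytic)  # the two agree closely
```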

\[ \begin{aligned} \frac{\partial L}{\partial w} &= -\sum_{i=1}^n \hat{y}^i \frac{1}{z_i} z_i(1-z_i) x^i+(1-\hat{y}^i)\frac{1}{1-z_i} (-1) z_i(1-z_i)x^i \\ &= -\sum_{i=1}^n \hat{y}^i(1-z_i) x^i-(1-\hat{y}^i)z_ix^i\\ &= -\sum_{i=1}^n (\hat{y}^i-z_i)x^i\\ &= -\sum_{i=1}^n (\hat{y}^i-\sigma(w^Tx^i+b))x^i \end{aligned} \]

\[ \begin{aligned} w^{k+1} &= w^k - \eta \frac{\partial L}{\partial w} \\ &= w^k+\eta\sum_{i=1}^n (\hat{y}^i-z_i)x^i \end{aligned} \]
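Putting model, loss, and gradient together, here is a toy end-to-end sketch (the synthetic data, learning rate, and iteration count are all made up for illustration):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                  # toy inputs
y = (X[:, 0] + 2 * X[:, 1] > 0).astype(float)  # labels from a known linear rule

w, b, eta = np.zeros(2), 0.0, 0.1
for _ in range(500):
    z = sigmoid(X @ w + b)        # predicted probabilities z_i
    grad_w = -(y - z) @ X         # dL/dw = -sum_i (y_hat^i - z_i) x^i
    grad_b = -np.sum(y - z)
    w -= eta * grad_w / len(X)    # averaged gradient-descent step
    b -= eta * grad_b / len(X)

acc = np.mean((sigmoid(X @ w + b) > 0.5) == y)
print(f"training accuracy: {acc:.2f}")
```

The update `w -= eta * grad_w` is exactly \(w^{k+1} = w^k + \eta\sum_i(\hat{y}^i - z_i)x^i\) above, just averaged over the batch.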

The gradient update has a remarkable property: it is proportional to \(\hat{y}^i - z_i\), so the larger the gap between the label and the prediction, the larger the update.

III. Extension

The model above used \(\hat{y}^i \in \{0,1\}\). Under a different convention, \(\hat{y}^i \in \{1,-1\}\), the loss function can still be written in the sigmoid + cross entropy form.
In this case:
\[\hat{y}=\begin{cases} 1,\; &f(x)>0\; &\sigma(f(x))\longrightarrow 1, &Loss=-ln(z)\\ -1,\; &f(x)<0\; &\sigma(f(x))\longrightarrow 0, &Loss=-ln(1-z)=-ln(\sigma(-f(x))) \end{cases}\]

For the formula transformation above, refer to follow a unified framework for understanding SVM machine learning.

Combining the two cases:
\[ \begin{aligned} Loss &= -\sum_{i=1}^n ln(\sigma(\hat{y}^if(x^i)))\\&= - \sum_{i=1}^n ln \frac{1}{1+exp(-\hat{y}^if(x^i))}\\&=\sum_{i=1}^n ln(1+exp(-\hat{y}^if(x^i))) \end{aligned} \]
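A small numeric check (my own illustration) that the \(\{1,-1\}\) form \(ln(1+exp(-\hat{y}f(x)))\) agrees with the \(\{0,1\}\) cross-entropy when the labels are matched up:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def loss_01(y_hat, fx):
    """Cross-entropy with labels in {0, 1}."""
    z = sigmoid(fx)
    return -(y_hat * np.log(z) + (1 - y_hat) * np.log(1 - z))

def loss_pm1(y_hat, fx):
    """Loss = ln(1 + exp(-y_hat * f(x))) with labels in {1, -1}."""
    return np.log(1 + np.exp(-y_hat * fx))

# the two conventions give the same loss for corresponding labels
print(loss_01(1, 0.8), loss_pm1(1, 0.8))   # y=1 in both conventions
print(loss_01(0, 0.8), loss_pm1(-1, 0.8))  # y=0 corresponds to y=-1
```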

\[ \begin{aligned} \frac{\partial L}{\partial w} &= -\sum_{i=1}^n \frac{1}{\sigma(\hat{y}^if(x^i))}\sigma(\hat{y}^if(x^i))(1-\sigma(\hat{y}^if(x^i)))\hat{y}^ix^i\\ &=-\sum_{i=1}^n (\hat{y}^i-\hat{y}^i\sigma(\hat{y}^if(x^i)))x^i \end{aligned} \]

When \(\hat{y}^i=1\): \[\frac{\partial L}{\partial w}=-\sum_{i=1}^n (1-\sigma(f(x^i)))x^i\]

When \(\hat{y}^i=-1\): \[\frac{\partial L}{\partial w}=-\sum_{i=1}^n (-1+\sigma(-f(x^i)))x^i=-\sum_{i=1}^n (-1+1-\sigma(f(x^i)))x^i=-\sum_{i=1}^n -\sigma(f(x^i))x^i\]

This can be seen to be exactly the same as with \(\hat{y}^i \in \{0,1\}\).
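The gradient equivalence can also be checked numerically (the sample point and parameters below are arbitrary):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

x, w, b = np.array([1.5, -0.5]), np.array([0.3, 0.7]), 0.1
fx = w @ x + b

# {0,1} convention: grad = -(y_hat - sigma(f(x))) x, for y_hat = 0
grad_01 = -(0 - sigmoid(fx)) * x

# {1,-1} convention: grad = -(y_hat - y_hat*sigma(y_hat*f(x))) x, for y_hat = -1
grad_pm1 = -(-1 - (-1) * sigmoid(-fx)) * x

print(grad_01, grad_pm1)  # identical gradients
```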


Origin www.cnblogs.com/SpingC/p/11622726.html