
Main points of this article: Part 1 focuses on understanding the concept of logistic regression. Part 2 derives the formula from two perspectives: one starts from generalized linear models, the other from the Bernoulli distribution.

# Part 1: Understanding the concept of logistic regression

## Nonlinear mappings of linear regression

We often hear that logistic regression does classification rather than regression in the literal sense. So what does it have to do with regression?

The previous article on linear regression discussed why linear regression uses a linear combination of the attribute features:

- First, \(\omega\) shows directly how much weight each attribute carries.
- Second, the model is simple enough that nonlinear models can be built on top of linear regression by adding higher-dimensional mappings.

For example:

\[ y = w^Tx + b \]

Suppose the predicted value undergoes a nonlinear transformation, here an exponential one. In other words, **the logarithm of the predicted value is a linear function of x**, as follows:

\[ \ln y = w^Tx + b \]

The equation above can be viewed as a nonlinear mapping from the linear prediction \(y'\) to \(y\):

\[ y_i = e^{y_i'} = \ln^{-1}(y_i') \]

The mapping above is a special form of the "generalized linear model":

\[ y = g^{-1}(w^Tx + b) \]

\(g(\cdot)\) is the link function, which is required to be continuous and sufficiently smooth. The example above is the special case \(g(\cdot) = \ln(\cdot)\).
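As a minimal sketch of the log-linear case, the snippet below fits \(\ln y = wx + b\) by ordinary least squares on a single feature and maps predictions back with \(g^{-1} = \exp\). The synthetic data and the helper names `fit_line` and `predict` are assumptions made for illustration, not part of the original article.

```python
import math

def fit_line(xs, ts):
    """Ordinary least squares for t = w * x + b with a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_t = sum(ts) / n
    cov = sum((x - mean_x) * (t - mean_t) for x, t in zip(xs, ts))
    var = sum((x - mean_x) ** 2 for x in xs)
    w = cov / var
    b = mean_t - w * mean_x
    return w, b

# Synthetic data generated from y = exp(0.5 * x + 1.0) (assumed for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [math.exp(0.5 * x + 1.0) for x in xs]

# Link function g = ln: regressing ln(y) on x is a plain linear fit.
w, b = fit_line(xs, [math.log(y) for y in ys])

def predict(x):
    # The inverse link g^{-1} = exp maps the linear prediction back to y.
    return math.exp(w * x + b)

print(w, b, predict(2.0))
```

The model stays linear in its parameters; only the target is transformed, which is what makes the link-function view convenient.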

## The core of logistic regression is classification, so why is it called regression?

For binary classification, the simplest approach is a step function:

\[ y = \left\{ \begin{array}{cc} {0,} & {w^Tx + b < 0} \\ {0.5,} & {w^Tx + b = 0} \\ {1,} & {w^Tx + b > 0} \end{array} \right. \]

However, the step function has a fatal drawback: it is not differentiable. This matters especially in deep learning, where back-propagation needs the derivative of the function. A substitute is therefore used, the **log-odds function** (logistic function):

\[ y = \frac{1}{1+e^{-z}} = \frac{1}{1+e^{-(w^Tx+b)}} \]

The function above is also called the \(Sigmoid\) function.
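To make the contrast concrete, here is a small sketch of both functions. The names `step` and `sigmoid` are illustrative, and the branch on the sign of `z` is one common way to avoid overflow in `math.exp`, not something prescribed by the article.

```python
import math

def step(z):
    """Unit step: 0 below zero, 0.5 at zero, 1 above zero."""
    if z < 0:
        return 0.0
    if z == 0:
        return 0.5
    return 1.0

def sigmoid(z):
    """Logistic function 1 / (1 + e^{-z}), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# The step function jumps at z = 0; the sigmoid passes smoothly through 0.5.
for z in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(z, step(z), round(sigmoid(z), 4))
```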

Rearranging the \(Sigmoid\) function and comparing it with the first subsection gives:

\[ \ln \frac{y}{1-y} = \boldsymbol{w}^{\mathrm{T}} \boldsymbol{x} + b \]

This formula explains the name: the right-hand side is exactly the linear regression equation, so what the model actually does is use regression to predict the log-odds \(\ln \frac{y}{1-y}\), from which the classification follows.
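The rearrangement can be checked numerically: applying the log-odds \(\ln\frac{y}{1-y}\) to the sigmoid output should give back the linear score \(z = w^Tx + b\). The function names below are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_odds(y):
    """ln(y / (1 - y)) for 0 < y < 1."""
    return math.log(y / (1.0 - y))

for z in (-3.0, -0.5, 0.0, 2.0):
    y = sigmoid(z)
    # The recovered value should match z up to floating-point error.
    print(z, round(y, 4), round(log_odds(y), 4))
```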

Note: when calling sklearn's `LogisticRegression`, the fitted attribute `coef_` is the \(\omega\) here.

# Part 2: Deriving the logistic regression formula [core method: MLE]

The above describes the essence of logistic regression: a linear model plus the sigmoid function (a nonlinear mapping). So far, though, this only says that such a model can fit a \(y\) that changes at an exponential scale. Why this particular combination is logistic regression, and how to use it to train a binary classifier, has not yet been explained.

- The formula we need to derive:

\[ \varPhi = \frac{1}{1+e^{-w^Tx}}\\ \]

- The model parameter `w` that we need to obtain:

\[ \mathop{\arg\max}_{w} \sum_{i=1}^N [y_i \log p(y=1|x_i) + (1-y_i) \log p(y=0|x_i)] \]

This formula shows that logistic regression looks for the parameters `w` that maximize the negative cross-entropy (i.e., minimize the cross-entropy).
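As a rough sketch of this objective, the function below evaluates the log-likelihood \(\sum_i [y_i \log p(y=1|x_i) + (1-y_i) \log p(y=0|x_i)]\) for a given `w`, with \(p(y=1|x) = \varPhi\). The toy data and the convention of folding the bias into `w` through a constant first feature are assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(w, X, y):
    """Sum over samples of y_i*log p(y=1|x_i) + (1-y_i)*log p(y=0|x_i)."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        p1 = sigmoid(sum(w_j * x_j for w_j, x_j in zip(w, x_i)))
        total += y_i * math.log(p1) + (1 - y_i) * math.log(1.0 - p1)
    return total

# Toy data (assumed): the first component of each x is a constant 1 acting as bias.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 1, 1]

print(log_likelihood([0.0, 0.0], X, y))  # every predicted probability is 0.5
print(log_likelihood([0.0, 2.0], X, y))  # agrees better with the labels
```

A `w` that agrees better with the labels yields a larger (less negative) value, which is why the search is a maximization.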

## The complex version: generalized linear models

The logistic regression formula can be derived from two angles. We start with the more complex one, the generalized linear model.

To understand the `generalized linear model`, we must first know two concepts: the `exponential family of distributions` and the `three assumptions that must be satisfied`.

### Knowledge point 1: the exponential family of distributions

A distribution belongs to the exponential family only if it can be written in the following form:

\[ p(y;\eta) = b(y)\exp(\eta^T T(y) - a(\eta)) \]

where \(a(\eta)\) is the log-partition function, which normalizes the distribution so that its total probability is 1; \(\eta\) is the natural parameter of the distribution; and \(T(y)\) is the sufficient statistic.

#### Note 1: the `Bernoulli distribution` belongs to the exponential family

The Bernoulli distribution is also known as the **two-point distribution** or **0-1 distribution**. Before introducing it, we need to introduce the **Bernoulli trial**.

A **Bernoulli trial** is **a single random experiment** with only two possible outcomes, i.e., for a random variable X:

\[ \begin{aligned} P_r[X=1] &= p \\ P_r[X=0] &= 1-p \end{aligned} \]

A Bernoulli trial can be phrased as a "yes or no" question. For example: does a coin toss land heads up? Is a newborn child a girl? And so on.

If an experiment E is a Bernoulli trial, repeating E independently n times gives a sequence of repeated independent trials called **n-fold Bernoulli trials**.

For a single Bernoulli trial in which success (X = 1) has probability p (0 <= p <= 1) and failure (X = 0) has probability 1-p, the random variable X is said to follow the Bernoulli distribution. The Bernoulli distribution is a discrete probability distribution with probability mass function:

\[ f(x) = p^x(1-p)^{1-x} = \left\{ \begin{array}{ll} p, & x = 1 \\ 1-p, & x = 0 \\ 0, & \text{otherwise} \end{array} \right. \]

Therefore, the Bernoulli distribution can be written as:

\[ \begin{aligned} P(y) &= \varPhi^y(1-\varPhi)^{1-y} \\ &= \exp(\ln(\varPhi^y(1-\varPhi)^{1-y})) \\ &= \exp(y \cdot \ln(\frac{\varPhi}{1-\varPhi}) + \ln(1-\varPhi)) \end{aligned} \]

Comparing this with the definition of the exponential family, \(p(y;\eta) = b(y)\exp(\eta^T T(y) - a(\eta))\), the Bernoulli distribution is indeed a member, with:

\[ \begin{aligned} b(y) &= 1 \\ T(y) &= y \\ \eta &= \ln(\frac{\varPhi}{1-\varPhi}) \\ a(\eta) &= -\ln(1-\varPhi) = \ln(1+e^\eta) \end{aligned} \]
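The identification \(b(y) = 1\), \(T(y) = y\), \(\eta = \ln\frac{\varPhi}{1-\varPhi}\), \(a(\eta) = \ln(1+e^\eta)\) can be sanity-checked numerically. The sketch below compares the usual Bernoulli probability mass function with the exponential-family form; the function names are illustrative.

```python
import math

def bernoulli_pmf(y, phi):
    """phi^y * (1 - phi)^(1 - y) for y in {0, 1}."""
    return phi ** y * (1.0 - phi) ** (1 - y)

def exp_family_pmf(y, phi):
    """b(y) * exp(eta * T(y) - a(eta)) with the Bernoulli choices."""
    eta = math.log(phi / (1.0 - phi))   # natural parameter
    a = math.log(1.0 + math.exp(eta))   # a(eta) = ln(1 + e^eta)
    b, T = 1.0, y                       # b(y) = 1, T(y) = y
    return b * math.exp(eta * T - a)

for phi in (0.1, 0.5, 0.9):
    for y in (0, 1):
        # Both columns should agree up to floating-point error.
        print(phi, y, bernoulli_pmf(y, phi), exp_family_pmf(y, phi))
```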

### Knowledge point 2: the three assumptions of generalized linear models

- Given x, y must follow an exponential family distribution.
- Given x, the trained model equals the expectation of the sufficient statistic: \(h(x) = E[T(y)|x]\)
- The natural parameter \(\eta\) is linear in the observed variable x: \(\eta = w^Tx\)

By the first assumption: in a Bernoulli distribution, success (X = 1) has probability p and failure (X = 0) has probability 1-p. In the binary classification problem considered here the labels are 0/1; if label 1 has probability p, label 0 has probability 1-p. Clearly, the 0/1 labels of binary classification follow a typical Bernoulli distribution.

By the second assumption:

\[ \begin{aligned} h(x) &= E[T(y)|x] \\ &= E[y|x] = 1 \times p(y=1) + 0 \times p(y=0) = p(y=1) = \varPhi \end{aligned} \]

and:

\[ \eta = \ln(\frac{\varPhi}{1-\varPhi}) \Rightarrow \varPhi = \frac{1}{1+e^{-\eta}} \]

By the third assumption, substitute \(\eta = w^Tx\) into the equation above.

In summary:

\[ \varPhi = \frac{1}{1+e^{-w^Tx}} \]

**Next, how do we obtain the target model parameter `w`?**

### The distribution of y is as follows:

At first I did not understand these two formulas: \(p(y=1|x) = \sigma(w^Tx) = \frac{1}{1+e^{-w^Tx}}\) and \(p(y=0|x) = 1 - \sigma(w^Tx) = \frac{e^{-w^Tx}}{1+e^{-w^Tx}}\). They become understandable once you note that the output of \(\sigma(z)\) is a value such as 0.1, ..., 0.6, 0.9. The larger the output, the closer it is to 1, and the more likely the sample belongs to class 1. It is therefore natural to take \(\sigma(w^Tx)\) as the probability of \(y=1\).

**Zhou Zhihua's version:**

\[ p(y|x;\beta) = y \cdot p(y=1|x;\beta) + (1-y) \cdot p(y=0|x;\beta) \]

**Andrew Ng's version:**

\[ p(y|x;\beta) = [p(y=1|x;\beta)]^y \cdot [p(y=0|x;\beta)]^{1-y} \]

Both equations neatly combine the two cases \(y=1\) and \(y=0\).
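A quick check that the two ways of writing \(p(y|x;\beta)\) agree for \(y \in \{0, 1\}\): the weighted-sum form \(y \cdot p_1 + (1-y)(1-p_1)\) and the product-of-powers form \(p_1^y (1-p_1)^{1-y}\), where \(p_1 = p(y=1|x;\beta)\). The function names are illustrative.

```python
def zhou_form(y, p1):
    """y * p(y=1|x) + (1 - y) * p(y=0|x)"""
    return y * p1 + (1 - y) * (1.0 - p1)

def ng_form(y, p1):
    """p(y=1|x)^y * p(y=0|x)^(1 - y)"""
    return p1 ** y * (1.0 - p1) ** (1 - y)

for p1 in (0.2, 0.7):
    for y in (0, 1):
        # For y = 1 both give p1; for y = 0 both give 1 - p1.
        print(p1, y, zhou_form(y, p1), ng_form(y, p1))
```

The product form is usually preferred for maximum likelihood because its logarithm splits into a sum of two terms.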

**Based on the distribution assumed above, maximum likelihood estimation [MLE] gives:**

\[ \begin{aligned} \mathop{\arg\max}_{w} \log P(Y|X) &= \mathop{\arg\max}_{w} \sum_{i=1}^{N} \log p(y_i|x_i) \\ &= \mathop{\arg\max}_{w} \sum_{i=1}^N [y_i \log p(y=1|x_i) + (1-y_i) \log p(y=0|x_i)] \end{aligned} \]

Note: the derivation in Zhou Zhihua's "watermelon book" is rather complicated. Andrew Ng's version is much easier to derive and is also the one seen most often.
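The article does not say how the maximization is carried out. One simple option, used here purely as an assumption, is batch gradient ascent on the log-likelihood, whose gradient with respect to `w` is \(\sum_i (y_i - p_i) x_i\) with \(p_i = \sigma(w^Tx_i)\). The helper `fit_logistic`, the learning rate, the iteration count, and the toy data are all illustrative.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(X, y, lr=0.1, n_iter=2000):
    """Maximize the log-likelihood by batch gradient ascent.

    The gradient of sum_i [y_i log p_i + (1 - y_i) log(1 - p_i)]
    with respect to w is sum_i (y_i - p_i) * x_i.
    """
    w = [0.0] * len(X[0])
    for _ in range(n_iter):
        grad = [0.0] * len(w)
        for x_i, y_i in zip(X, y):
            p_i = sigmoid(sum(w_j * x_j for w_j, x_j in zip(w, x_i)))
            for j, x_ij in enumerate(x_i):
                grad[j] += (y_i - p_i) * x_ij
        # Ascent step: move along the gradient to increase the likelihood.
        w = [w_j + lr * g_j for w_j, g_j in zip(w, grad)]
    return w

# Toy, non-separable data (assumed); the leading 1.0 is a bias feature.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, -0.5], [1.0, 0.5], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 1, 0, 1, 1]

w = fit_logistic(X, y)
probs = [sigmoid(sum(w_j * x_j for w_j, x_j in zip(w, x_i))) for x_i in X]
print(w)
print([round(p, 3) for p in probs])
```

The data are deliberately not linearly separable, so the maximum-likelihood `w` is finite. On separable data the weights would keep growing without some form of regularization.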

# Additional knowledge: the derivative of the sigmoid function

It is easy to prove the following, where the prime denotes the derivative with respect to \(z\):

\[ \sigma(z)' = \sigma(z)(1-\sigma(z)) \]
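One way to see this, as a short sketch, is to apply the chain rule to \(\sigma(z) = \frac{1}{1+e^{-z}}\) and then split the result into two factors:

\[ \begin{aligned} \sigma(z)' &= \frac{e^{-z}}{(1+e^{-z})^2} \\ &= \frac{1}{1+e^{-z}} \cdot \frac{e^{-z}}{1+e^{-z}} \\ &= \sigma(z)(1-\sigma(z)) \end{aligned} \]

The last step uses \(1 - \sigma(z) = \frac{e^{-z}}{1+e^{-z}}\).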