
1 Logistic Regression

Logistic regression, often abbreviated LR, is a model suited to classification problems.

1.1 LR model

A linear regression model fits a function \(y = \theta x + b\) and is suited to continuous regression problems. For classification, the targets are a series of discrete labels; for example, to predict the classes 0 and 1, the sigmoid function can be used to turn the continuous output into a class probability:

\[g(z) = \frac{1}{1+e^{-z}}\]

(Figure: the S-shaped curve of the sigmoid function.)

As \(z \to +\infty\), \(g(z) \to 1\); as \(z \to -\infty\), \(g(z) \to 0\). The derivative of the sigmoid function is:
\[g'(z) = g(z)(1-g(z))\]
Let \(z = \theta x + b\); then:
\[h_\theta(x) = \frac{1}{1+e^{-(\theta x + b)}}\]
For binary classification, the output of this function can be read as the probability that the sample belongs to class 1. Writing \(x\) as \([x^{(1)}, x^{(2)}, ..., x^{(m)}]\) and the model parameters to be estimated as \(\theta = [\theta_1, \theta_2, ..., \theta_m]\), the matrix form of LR is:
\[h_\theta(X) = \frac{1}{1+e^{-(\theta X^T + b)}}\]
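As a minimal sketch (using NumPy; the function names `sigmoid` and `hypothesis` are my own), the model above can be written as:

```python
import numpy as np

def sigmoid(z):
    """g(z) = 1 / (1 + e^{-z}): maps any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(X, theta, b):
    """h_theta(x) = g(theta . x + b): probability that each sample is class 1."""
    return sigmoid(X @ theta + b)

# g(0) = 0.5, and the derivative identity g'(z) = g(z)(1 - g(z)) can be checked:
z = 2.0
g = sigmoid(z)
print(g, g * (1 - g))
```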

1.2 Loss function

Since \(h_\theta(x)\) is the probability of being classified as 1, the probability of being classified as 0 is \(1-h_\theta(x)\). For this discrete random variable we can write the distribution table:

| y | 1 | 0 |
| - | - | - |
| p | \(h_\theta(x)\) | \(1-h_\theta(x)\) |
Combining the two cases into a single formula:
\[p(y|x,\theta) = h_\theta(x)^y (1-h_\theta(x))^{1-y}\]
This is the distribution of \(y\), with \(\theta\) the parameter to be estimated. Anyone familiar with probability theory knows that the parameters of a distribution can be estimated in two ways, the method of moments and maximum likelihood; here we choose maximum likelihood, whose idea is as follows:

In other words, assume that the training labels \(y_1, y_2, ..., y_m\) correspond to random variables \(Y_1, Y_2, ..., Y_m\) that are independent and identically distributed with distribution function \(p(y|x,\theta)\). The joint distribution of iid random variables is the product of the individual distributions; this joint distribution is called the likelihood function, expressed as:
\[L(\theta) = \prod_{i=1}^{m} h_\theta(x^{(i)})^{\hat{y}^{(i)}} (1-h_\theta(x^{(i)}))^{1-\hat{y}^{(i)}}\]
Because the \(\theta\) that maximizes the likelihood and the \(\theta\) that maximizes its logarithm are the same, i.e.:
\[\operatorname{argmax} L(\theta) = \operatorname{argmax} \log L(\theta)\]
we generally work with the log-likelihood. The loss function is the negative of the log-likelihood, so maximizing the likelihood means minimizing the loss:
\[J(\theta) = -\ln L(\theta) = -\sum_{i=1}^{m} \left[ \hat{y}^{(i)} \ln h_\theta(x^{(i)}) + (1-\hat{y}^{(i)}) \ln(1-h_\theta(x^{(i)})) \right]\]

When does the loss function reach its minimum? When its derivative is 0, of course. Note that \(\theta\) stands for the m parameters to be estimated, \(\theta_1, \theta_2, ..., \theta_m\), so every partial derivative must be 0. In matrix form the loss function is:
\[J(\theta) = -Y^T \log h_\theta(X) - (E-Y)^T \log(E - h_\theta(X))\]
where \(E\) is a vector of ones, so that \(E-Y\) computes \(1-y\) elementwise.
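A minimal sketch of the loss \(J(\theta)\) above in NumPy (the clipping by a small epsilon is a numerical-safety detail the formula itself omits):

```python
import numpy as np

def bce_loss(h, y, eps=1e-12):
    """J = -sum[ y*ln(h) + (1-y)*ln(1-h) ] over the m samples."""
    h = np.clip(h, eps, 1 - eps)  # keep log() away from 0
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

# Confident correct predictions give a small loss; confident wrong ones a large loss.
y = np.array([1.0, 0.0, 1.0])
good = np.array([0.9, 0.1, 0.8])
bad = np.array([0.1, 0.9, 0.2])
print(bce_loss(good, y), bce_loss(bad, y))
```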

1.3 Optimization

Gradient descent updates \(\theta\) along the negative gradient with learning rate \(\alpha\):

$\theta = \theta - \alpha \frac{\partial J(\theta)}{\partial \theta}$

For LR the gradient has the closed form \(X^T(h_\theta(X)-Y)\), giving the update:

$\theta = \theta -\alpha X^T(h_\theta(X)-Y)$

Newton's method uses second-order information instead:

$\theta = \theta - \alpha \frac{J'(\theta)}{J''(\theta)}$
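A sketch of batch gradient descent using the closed-form gradient \(X^T(h_\theta(X)-Y)\) (NumPy; the bias gets its own update term, and the toy data, learning rate, and iteration count are my own choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_lr(X, y, alpha=0.1, iters=2000):
    """Batch gradient descent: theta <- theta - alpha * X^T (h - y) / m."""
    m, n = X.shape
    theta, b = np.zeros(n), 0.0
    for _ in range(iters):
        h = sigmoid(X @ theta + b)       # current predictions
        theta -= alpha * (X.T @ (h - y)) / m
        b -= alpha * np.sum(h - y) / m
    return theta, b

# Toy linearly separable data: the class depends on the sign of x.
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta, b = fit_lr(X, y)
preds = (sigmoid(X @ theta + b) >= 0.5).astype(float)
print(preds)  # -> [0. 0. 1. 1.]
```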

1.4 Regularization

$h_\theta(x) = \frac{1}{1+e^{-(\theta x+b)}}$

With $X$ fixed, if $\theta$ is particularly large the product becomes particularly large; at test time, if some test sample's distribution differs somewhat from the training samples, amplification by the parameters $\theta$ may produce a wildly unreasonable value. Overly large parameters make the model abnormally sensitive and prone to overfitting. How do we avoid this? One feasible approach: since we don't want the learned parameters $\theta=\{\theta_1,\theta_2,...,\theta_m\}$ to take large values, push them as close to 0 as possible, i.e.:

$\min \sum_{i=1}^{m} ||\theta_i||$

Alternatively, the squared L2 norm can be used:

$\frac{1}{2}||\theta||_2^2=\frac{1}{2}\sum_{i=1}^{m} ||\theta_i||^2$

Adding the regularization term to the loss, for example the L1 penalty:

$J(\theta) = -Y^T\log h_\theta(X)-(E-Y)^T \log (E-h_\theta(X))+\lambda_1 ||\theta||_1$

$\lambda_1$ is the weight of the regularization term. With the regularizer added, the learned parameters won't grow too large and the model is no longer so sensitive. Of course, if the regularization weight is too large, all the parameters $\theta$ will be very small and the model becomes abnormally insensitive, producing roughly the same output for almost any input, so this weight must also be chosen carefully.
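As a small sketch, the L1 and L2 penalty terms are easy to bolt onto the loss (NumPy; `lam` plays the role of $\lambda_1$ here, and the helper names are my own):

```python
import numpy as np

def l1_penalty(theta, lam):
    """lam * sum |theta_i| -- tends to push weights to exactly 0 (sparsity)."""
    return lam * np.sum(np.abs(theta))

def l2_penalty(theta, lam):
    """lam * (1/2) * ||theta||_2^2 -- shrinks all weights smoothly."""
    return lam * 0.5 * np.sum(theta ** 2)

theta = np.array([3.0, -4.0])
print(l1_penalty(theta, 0.1))  # 0.1 * (3 + 4)
print(l2_penalty(theta, 0.1))  # 0.1 * 0.5 * (9 + 16)
```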

1.5 Multiclass logistic regression

Multiclass logistic regression generalizes the binary case: the probability of each class is computed with the softmax function. Suppose there are k classes; the parameters to learn are \((\theta_1, b_1), (\theta_2, b_2), ..., (\theta_k, b_k)\). Write:
\[z_1 = \theta_1 x + b_1 \\ z_2 = \theta_2 x + b_2 \\ ... \\ z_k = \theta_k x + b_k\]
Then the probability that x belongs to each class can be computed as:
\[y_1 = \frac{e^{z_1}}{\sum_{i=1}^{k} e^{z_i}} \\ y_2 = \frac{e^{z_2}}{\sum_{i=1}^{k} e^{z_i}} \\ ... \\ y_k = \frac{e^{z_k}}{\sum_{i=1}^{k} e^{z_i}}\]

Softmax in effect amplifies the inputs before normalizing them.
For the loss function over multiple classes, cross entropy is used, this time with one-hot label vectors:
\[\hat{y}_1 = \begin{pmatrix} 1 \\ 0 \\ ... \\ 0 \end{pmatrix} \quad \hat{y}_2 = \begin{pmatrix} 0 \\ 1 \\ ... \\ 0 \end{pmatrix} \quad ... \quad \hat{y}_k = \begin{pmatrix} 0 \\ 0 \\ ... \\ 1 \end{pmatrix}\]
The quantity actually minimized is:
\[\min - \sum_{i=1}^{k} \hat{y}_i \ln y_i\]
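A minimal sketch of softmax plus the cross-entropy loss above (NumPy; the max-subtraction inside `softmax` is a standard numerical-stability trick the formula itself doesn't show):

```python
import numpy as np

def softmax(z):
    """y_j = e^{z_j} / sum_i e^{z_i}; subtracting max(z) avoids overflow."""
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def cross_entropy(y_hat, y, eps=1e-12):
    """-sum_i y_hat_i * ln(y_i), with y_hat a one-hot label vector."""
    return -np.sum(y_hat * np.log(y + eps))

z = np.array([2.0, 1.0, 0.1])      # scores z_1..z_k for one sample
y = softmax(z)                      # class probabilities, sums to 1
y_hat = np.array([1.0, 0.0, 0.0])   # one-hot true label: class 1
print(y.sum(), cross_entropy(y_hat, y))
```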

1.6 Summary

On my own experimental dataset of 230,000 samples, the results show LR's accuracy is quite high, and the key advantage is that it trains fast.

Origin www.cnblogs.com/cuiyirui/p/11920668.html