Machine Learning Notes: Logistic Regression


1 Logistic Regression

Logistic regression, commonly abbreviated as LR, is a model suited to classification problems.

1.1 LR model

A linear regression model fits a function \(y = x\theta + b\) and is suited to continuous regression problems. A classification problem, in contrast, needs to produce discrete labels; for example, binary classification needs to predict the categories 0 and 1. The sigmoid function can be used to turn the continuous output into such a discrete decision:

\[g(z) = \frac{1}{1+e^{-z}}\]

(Figure: graph of the sigmoid function.)
where \(g(z) \to 1\) as \(z \to +\infty\), and \(g(z) \to 0\) as \(z \to -\infty\). The derivative of the sigmoid function is:
\[g'(z) = g(z)(1-g(z))\]
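As a quick sanity check, here is a minimal NumPy sketch (the names `sigmoid` and `sigmoid_grad` are mine, not from the original notes) that evaluates \(g(z)\) and verifies the identity \(g'(z) = g(z)(1-g(z))\) against a finite-difference approximation:

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # g'(z) = g(z) * (1 - g(z))
    g = sigmoid(z)
    return g * (1.0 - g)

z = np.linspace(-5, 5, 11)
eps = 1e-6
# central finite difference as an independent estimate of the derivative
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
print(np.allclose(sigmoid_grad(z), numeric, atol=1e-8))  # True
```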
Let \(z = \theta x + b\); then:
\[h_\theta(x) = \frac{1}{1+e^{-(\theta x + b)}}\]
For binary classification, the output of this function can be read as the probability that the sample belongs to class 1. If the samples are written as \(X = [x^{(1)}, x^{(2)}, ..., x^{(m)}]\) and the parameters to be estimated as \(\theta = [\theta_1, \theta_2, ..., \theta_m]\), the matrix form of LR is:
\[h_\theta(X) = \frac{1}{1+e^{-(\theta X^T + b)}}\]
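The matrix form can be sketched in a few lines of NumPy. This is only an illustration under the common row-per-sample convention (each row of `X` is one sample), which differs slightly from the notes' notation; `predict_proba` and `predict` are names I made up:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, theta, b):
    # h_theta(X): one probability of class 1 per sample (rows of X are samples)
    return sigmoid(X @ theta + b)

def predict(X, theta, b, threshold=0.5):
    # label a sample 1 when its estimated probability reaches the threshold
    return (predict_proba(X, theta, b) >= threshold).astype(int)

X = np.array([[0.5, 1.2], [-1.0, 0.3]])  # 2 samples, 2 features
theta = np.array([0.8, -0.4])
b = 0.1
print(predict_proba(X, theta, b))
print(predict(X, theta, b))
```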

1.2 Loss function

Since \(h_\theta(x)\) denotes the probability of being classified as 1, the probability of being classified as 0 is \(1-h_\theta(x)\). For this discrete random variable we can write the distribution table:

| y | 1 | 0 |
| - | - | - |
| p | \(h_\theta(x)\) | \(1-h_\theta(x)\) |
Written as a single formula, this is
\[p(y|x,\theta) = h_\theta(x)^y (1-h_\theta(x))^{1-y}\]
This is the distribution of \(y\), with \(\theta\) the parameter to be estimated. As anyone familiar with probability theory knows, the parameters of a distribution can be estimated in two main ways: the method of moments and maximum likelihood. Here we choose maximum likelihood, also called maximum likelihood estimation. Its idea is as follows:
Maximum likelihood method: choose the parameter value under which the observed data are most probable.
In other words, assume the training samples \(y_1, y_2, ..., y_m\) correspond to random variables \(Y_1, Y_2, ..., Y_m\) that are independent and identically distributed with distribution \(p(y|x,\theta)\). The joint distribution of iid random variables is the product of their individual distributions; this joint distribution is called the likelihood function and is written as:
\[L(\theta) = \prod_{i=1}^{m} h_\theta(x^{(i)})^{\hat{y}^{(i)}} (1-h_\theta(x^{(i)}))^{1-\hat{y}^{(i)}}\]
Since the \(\theta\) that maximizes the likelihood function is the same as the \(\theta\) that maximizes its logarithm, that is,
\[\arg\max L(\theta) = \arg\max \log[L(\theta)]\]
we generally work with the log-likelihood. The loss function is the negative of the log-likelihood, so maximizing the likelihood is the same as minimizing the loss:
\[J(\theta) = -\ln L(\theta) = -\sum_{i=1}^{m} \left[ \hat{y}^{(i)} \ln h_\theta(x^{(i)}) + (1-\hat{y}^{(i)}) \ln(1-h_\theta(x^{(i)})) \right]\]

When does the loss function attain its minimum? At the point where its derivative is 0, of course. Note that \(\theta\) stands for the m parameters to be estimated, \(\theta_1, \theta_2, ..., \theta_m\), so every partial derivative must be 0 at the extremum. In matrix form the loss function is:
\[J(\theta) = -Y^T \log h_\theta(X) - (E-Y)^T \log(E - h_\theta(X))\]
where \(E\) denotes the all-ones vector, so that \(E-Y\) flips the labels elementwise.
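A minimal sketch of this loss in NumPy, assuming `y` holds the 0/1 labels and `h` the predicted probabilities; the small `eps` that guards the logarithm is my own addition and is not discussed in the notes:

```python
import numpy as np

def binary_cross_entropy(y, h, eps=1e-12):
    # J(theta) = -sum( y*ln(h) + (1-y)*ln(1-h) )
    h = np.clip(h, eps, 1.0 - eps)  # avoid log(0)
    return -np.sum(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))

y = np.array([1, 0, 1])
h = np.array([0.9, 0.2, 0.7])
print(binary_cross_entropy(y, h))  # small value, since predictions match the labels
```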

1.3 Optimization

For binary-classification LR, gradient descent, coordinate descent, Newton's method, and so on can all be used. Gradient descent is easy to understand: the parameters are updated in the direction in which the gradient decreases (a formal derivation gives):
\[\theta = \theta - \alpha \frac{\partial J(\theta)}{\partial \theta}\]
In LR, we gave the derivative of the sigmoid at the very beginning, so the gradient descent update of the parameters can be written as:
\[\theta = \theta -\alpha X^T(h_\theta(X)-Y)\]
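This update can be turned into a short training loop. The following is only a sketch under the row-per-sample convention used earlier; the learning rate, iteration count, averaging by \(m\), and the separate bias update are my own choices rather than part of the notes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_lr(X, y, alpha=0.1, n_iters=1000):
    m, n = X.shape
    theta = np.zeros(n)
    b = 0.0
    for _ in range(n_iters):
        h = sigmoid(X @ theta + b)
        # gradient of the loss: X^T (h_theta(X) - y) for theta, sum of residuals for b
        theta -= alpha * X.T @ (h - y) / m
        b -= alpha * np.sum(h - y) / m
    return theta, b

# toy data: class 1 roughly when x1 + x2 is large
X = np.array([[0.1, 0.2], [0.9, 0.8], [0.4, 0.3], [1.0, 0.9]])
y = np.array([0, 1, 0, 1])
theta, b = fit_lr(X, y)
print(sigmoid(X @ theta + b).round(2))  # fitted probabilities
```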
Newton's method was originally a method for finding zeros of a function; since an extremum is a zero of the first derivative, Newton's method can also be used here. Writing \(J'(\theta)\) for the first derivative and \(J''(\theta)\) for the second derivative, we have:
\[\theta = \theta - \alpha \frac{J'(\theta)}{J''(\theta)}\]
Coordinate descent holds the other coordinates fixed and searches for the optimum along one coordinate at a time; it is suitable when the derivative is not continuous.

1.4 Regularization

Why regularize? In a machine learning model, the learned parameters \(\theta\) are multiplied directly with the feature vector \(X\); in LR, for example:
\[h_\theta(x) = \frac{1}{1+e^{-(\theta x+b)}}\]
With \(X\) fixed, if \(\theta\) is very large the product becomes very large. Suppose at test time some test sample's distribution differs a bit from that of the training samples; after amplification by the parameters \(\theta\) we may get a wildly unreasonable value. Overly large parameter values make the model extremely sensitive and prone to overfitting. How can this be avoided? One feasible approach: we do not want the learned parameters \(\theta = \{\theta_1, \theta_2, ..., \theta_m\}\) to take large values, so we push them toward 0, i.e.:
\[\min \sum_{i=1}^{m} ||\theta_i||\]
In matrix notation this is \(\min ||\theta||_1\), which we call the L1 regularization term. Similarly, there is an L2 regularization term:
\[\frac{1}{2}||\theta||_2^2=\frac{1}{2}\sum_{i=1}^{m} ||\theta_i||^2\]
Since the regularization term is also a function of \(\theta\) and is one of the objectives we optimize (we want it to be small), it can be merged into the loss function:
\[J(\theta) = -Y^T\log h_\theta(X)-(E-Y)^T \log (E-h_\theta(X))+\lambda_1 ||\theta||_1\]

\(\lambda_1\) is the weight of the regularization term. With the regularization term added, the learned parameters will not become too large and the model will no longer be so sensitive. Of course, if the regularization weight is too large, all parameters \(\theta\) become very small and the model becomes extremely insensitive: almost every input yields roughly the same output. So this weight also has to be chosen carefully.
In addition, since \(b\) is added directly to the linear function, it merely shifts the function, so there is no need to regularize this parameter.
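Here is a sketch of how regularization changes a single gradient step, using the L2 penalty (whose gradient is simply \(\lambda\theta\)) rather than the L1 term shown above, and deliberately leaving the bias \(b\) unregularized; the function name and the learning-rate/penalty values are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_step(theta, b, X, y, alpha=0.1, lam=0.01):
    m = X.shape[0]
    h = sigmoid(X @ theta + b)
    grad_theta = X.T @ (h - y) / m + lam * theta  # L2 penalty adds lambda * theta
    grad_b = np.sum(h - y) / m                    # bias term is not regularized
    return theta - alpha * grad_theta, b - alpha * grad_b

theta = np.array([2.0, -3.0])
b = 0.5
X = np.array([[0.2, 0.1], [0.8, 0.9]])
y = np.array([0, 1])
theta, b = regularized_step(theta, b, X, y)
print(theta, b)
```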

1.5 Multinomial logistic regression

Multinomial logistic regression generalizes the binary case: the probability of each class is computed with the softmax function. Suppose there are K classes; the parameters to learn are \((\theta_1, b_1), (\theta_2, b_2), ..., (\theta_k, b_k)\). Write
\[z_1 = \theta_1 x + b_1 \\ z_2 = \theta_2 x + b_2 \\ ... \\ z_k = \theta_k x + b_k\]
then the probability that x belongs to each class can be computed as:
\[y_1 = \frac{e^{z_1}}{\sum_{i=1}^{k} e^{z_i}} \\ y_2 = \frac{e^{z_2}}{\sum_{i=1}^{k} e^{z_i}} \\ ... \\ y_k = \frac{e^{z_k}}{\sum_{i=1}^{k} e^{z_i}}\]
Softmax in effect amplifies the inputs (via the exponential) and then normalizes them.
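A small softmax sketch in NumPy; subtracting the maximum before exponentiating is a standard numerical-stability trick and is my own addition, not something the notes mention:

```python
import numpy as np

def softmax(z):
    # y_j = exp(z_j) / sum_i exp(z_i), shifted by max(z) for numerical stability
    z = z - np.max(z)
    e = np.exp(z)
    return e / np.sum(e)

z = np.array([2.0, 1.0, 0.1])
p = softmax(z)
print(p, p.sum())  # probabilities that sum to 1
```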
When computing the loss over multiple classes, the cross entropy is used; the labels are then one-hot vectors:
\[\hat{y}_1 = \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix}, \quad \hat{y}_2 = \begin{pmatrix} 0 \\ 1 \\ \vdots \\ 0 \end{pmatrix}, \quad ..., \quad \hat{y}_k = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 1 \end{pmatrix}\]
The quantity actually minimized is then:
\[\min\; -\sum_{i=1}^{k} \hat{y}_i \ln y_i\]
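A sketch of the multiclass cross-entropy with a one-hot label, tying the two formulas above together (the names are illustrative, and the `eps` guard on the logarithm is my own addition):

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / np.sum(e)

def cross_entropy(y_onehot, z, eps=1e-12):
    # -sum_i yhat_i * ln(y_i), where y = softmax(z)
    p = np.clip(softmax(z), eps, 1.0)
    return -np.sum(y_onehot * np.log(p))

y_onehot = np.array([0, 1, 0])  # the true class is class 2
z = np.array([0.5, 2.0, -1.0])  # raw scores for the 3 classes
print(cross_entropy(y_onehot, z))
```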

1.6 Summary

On my own experimental dataset of 230,000 samples, the results show that the accuracy of LR is quite high, and its key advantage is that it trains fast.


Source: www.cnblogs.com/cuiyirui/p/11920668.html