After reading this, you will understand about 80% of logistic regression

1. What is logistic regression

Logistic regression is a classification algorithm. We are all familiar with linear regression, whose general form is Y = aX + b, where Y ranges over (-∞, +∞). With so many possible values, how can we use it to classify? Don't worry, the great mathematicians have found a way for us.

Pass Y through a non-linear transformation, the Sigmoid function, and we obtain a number S in the range [0, 1]. S can be seen as a probability value: if we set the probability threshold to 0.5, then a sample with S greater than 0.5 can be regarded as positive and one with S less than 0.5 as negative, and classification is achieved.

2. What is the Sigmoid function

The formula of the function is as follows:

$$S(t) = \frac{1}{1 + e^{-t}}$$

No matter what value t takes, the result lies in the [0, 1] range. Recall that a binary classification problem has two answers: one is "yes", one is "no"; 0 corresponds to "no" and 1 corresponds to "yes". Then someone asks: isn't this the whole [0, 1] interval, how can there be only 0 and 1? Good question. We assume the classification threshold is 0.5, so anything above 0.5 is classified as category 1 and anything below 0.5 as category 0; the threshold value can be set as you like.
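
As a quick illustration, here is a minimal NumPy sketch of the Sigmoid function (the function name and sample values are our own, not from the linked repo):

```python
import numpy as np

def sigmoid(t):
    """Squash any real value t into the (0, 1) range."""
    return 1.0 / (1.0 + np.exp(-t))

print(sigmoid(np.array([-10.0, 0.0, 10.0])))  # -> [~0.00005, 0.5, ~0.99995]
```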

Well, then we substitute t = aX + b and get the general form of our logistic regression model:

$$P = S(aX + b) = \frac{1}{1 + e^{-(aX + b)}}$$

The result can be understood as a probability P: in other words, a sample whose probability of belonging to category 1 is greater than 0.5 is assigned to category 1, and one whose probability is less than 0.5 is assigned to category 0, which achieves the purpose of classification.
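
Putting the two pieces together, a minimal sketch of the hypothesis and the 0.5-threshold decision rule (the parameter names a and b follow the text; the function names are illustrative):

```python
import numpy as np

def predict_proba(X, a, b):
    """The logistic regression hypothesis: P(y=1 | x) = sigmoid(a*x + b)."""
    return 1.0 / (1.0 + np.exp(-(X @ a + b)))

def predict(X, a, b, threshold=0.5):
    """Classify as category 1 when the probability exceeds the threshold."""
    return (predict_proba(X, a, b) >= threshold).astype(int)
```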

3. What is the loss function

The loss function of logistic regression is the log loss, i.e. the negative log-likelihood. The formula is as follows:

$$\mathrm{Cost}(h(x), y) = \begin{cases} -\log(h(x)) & \text{if } y = 1 \\ -\log(1 - h(x)) & \text{if } y = 0 \end{cases}$$

In the formula, y = 1 means the true value is 1 and we use the first formula; y = 0 means the true value is 0 and we use the second formula to calculate the loss. Why use the log function? Imagine that when the true label is 1 but the predicted probability h = 0, then -log(0) = ∞, the largest possible penalty for the model; when h = 1, then -log(1) = 0, equivalent to no penalty, i.e. no loss, the optimal result. That is why mathematicians chose the log function to represent the loss.
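
A minimal sketch of this loss in NumPy (the eps clipping is our own practical addition to avoid the infinite log(0) mentioned above):

```python
import numpy as np

def log_loss(y_true, y_pred, eps=1e-15):
    """Average log loss; eps keeps log() away from log(0) = -inf."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
```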

Finally, use the gradient descent method to solve for the minimum point, and we obtain a model with the desired effect.
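
A minimal sketch of batch gradient descent for this loss (the learning rate, iteration count, and function name are illustrative assumptions):

```python
import numpy as np

def train(X, y, lr=0.1, n_iters=1000):
    """Fit weights w and bias b by batch gradient descent on the log loss."""
    n_samples, n_features = X.shape
    w, b = np.zeros(n_features), 0.0
    for _ in range(n_iters):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # current probabilities
        error = p - y                           # gradient of log loss w.r.t. logits
        w -= lr * (X.T @ error) / n_samples     # gradient step for weights
        b -= lr * error.mean()                  # gradient step for bias
    return w, b
```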

4. Can it do multi-classification?

Yes. In fact, we can go from the binary classification problem to the multi-classification problem (one vs rest). The idea is as follows:

1. Treat type class1 as the positive samples and all other types as negative samples; then we can obtain the probability p1 that a sample's label is that type.

2. Then treat type class2 as the positive samples and all other types as negative samples, and obtain p2 in the same way.

3. Continuing this cycle, for a sample to be predicted we can obtain the probability pi that its label is type class i. Finally, we take the type corresponding to the maximum pi as the predicted type of the sample.

In short, we still reduce it to binary classifications in turn and take the result with the maximum probability, as the sketch below shows.
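
A minimal one-vs-rest sketch; here we use scikit-learn's binary LogisticRegression as the base classifier (the function name is our own, not from the linked repo):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def one_vs_rest_predict(X_train, y_train, X_test):
    """Train one binary classifier per class; predict the class with max pi."""
    classes = np.unique(y_train)
    probs = []
    for c in classes:
        clf = LogisticRegression().fit(X_train, (y_train == c).astype(int))
        probs.append(clf.predict_proba(X_test)[:, 1])  # pi: P(sample is class c)
    return classes[np.argmax(np.vstack(probs), axis=0)]
```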

5. What are the advantages of logistic regression

  • LR can output results in the form of probabilities, rather than just a 0/1 judgment.
  • LR has strong interpretability and high controllability (useful when you need to explain things to the boss...).
  • Training is fast, and the results after feature engineering are well regarded.
  • Because the result is a probability, it can be used as a ranking model.

6. What are the applications of logistic regression

  • CTR estimation / ranking in recommendation systems / classification in various scenarios.
  • The baseline version of ad CTR estimation at a certain search engine company is LR.
  • The baseline version of search ranking / ad CTR estimation at a certain e-commerce company is LR.
  • Many of the shopping-combination recommendations at a certain e-commerce site use LR.
  • The baseline ranking model of a certain news app that now earns 10 million+ RMB a day in advertising is LR.

7. What optimization methods are commonly used for logistic regression

7.1 First-order methods

Gradient descent, stochastic gradient descent, and mini-batch stochastic gradient descent. Compared with the original gradient descent, stochastic gradient descent is not only faster, but can also, to a certain extent, suppress convergence to a local optimum when local optima are a problem.
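
A minimal sketch of one mini-batch update (it reuses the gradient from the batch sketch above, but only on a small slice of the data; the helper in the usage comment is hypothetical):

```python
import numpy as np

def sgd_step(w, b, X_batch, y_batch, lr=0.1):
    """One mini-batch update; only len(y_batch) samples touch the gradient."""
    p = 1.0 / (1.0 + np.exp(-(X_batch @ w + b)))
    error = p - y_batch
    w -= lr * (X_batch.T @ error) / len(y_batch)
    b -= lr * error.mean()
    return w, b

# Hypothetical usage: shuffle the data each epoch and update per small batch.
# for epoch in range(n_epochs):
#     for X_batch, y_batch in iterate_minibatches(X, y, batch_size=32):
#         w, b = sgd_step(w, b, X_batch, y_batch)
```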

7.2 Second-order methods: Newton's method and quasi-Newton methods

Here we explain in detail the basic principle of Newton's method and how it is applied. Newton's method repeatedly updates the position of a tangent line through its intersection with the x-axis, until reaching the intersection of the curve itself with the x-axis, which gives the solution of the equation. In practice, we often need to solve convex optimization problems, i.e. find where the first derivative of a function is 0, and Newton's method provides exactly such a solution. In practical use, Newton's method first chooses a starting point, performs a second-order Taylor expansion, updates to the point where the derivative of the expansion is 0, and repeats until the requirement is met. Newton's method is thus a second-order solver, faster than first-order methods. The x we usually see is a multidimensional vector, which introduces the concept of the Hessian matrix (the matrix of second derivatives with respect to x).

Disadvantages: Newton's method uses fixed-length iterations with no step-size factor, so it cannot guarantee a stable decrease of the function value, and in severe cases it may even fail. It also requires the function to be twice differentiable. Moreover, computing the inverse of the Hessian matrix is very expensive.

Quasi-Newton methods: methods that construct an approximate positive-definite symmetric matrix of the Hessian, instead of using second partial derivatives, are called quasi-Newton methods. The idea is to use a special expression to approximate the Hessian matrix or its inverse so that the expression satisfies the quasi-Newton condition. The main variants are DFP (approximates the inverse of the Hessian), BFGS (directly approximates the Hessian), and L-BFGS (reduces the storage space BFGS requires).
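
In practice you rarely implement these by hand. As a hedged sketch, SciPy's minimize can fit the logistic regression loss with L-BFGS (the loss and gradient functions here are our own, under the same log-loss setup as section 3):

```python
import numpy as np
from scipy.optimize import minimize

def fit_lbfgs(X, y):
    """Fit logistic regression weights (bias folded in) with L-BFGS."""
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append a constant bias column

    def loss_and_grad(w):
        p = 1.0 / (1.0 + np.exp(-(Xb @ w)))
        loss = -np.mean(y * np.log(p + 1e-15) + (1 - y) * np.log(1 - p + 1e-15))
        grad = Xb.T @ (p - y) / len(y)
        return loss, grad

    # jac=True tells SciPy the objective returns (loss, gradient) together.
    return minimize(loss_and_grad, np.zeros(Xb.shape[1]), jac=True,
                    method='L-BFGS-B').x
```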

8. Why does logistic regression discretize features

  1. Non-linearity! Non-linearity! Non-linearity! Logistic regression is a generalized linear model with limited expressive power. After a single variable is discretized into N variables, each variable has its own weight, which is equivalent to introducing non-linearity into the model, improving its expressive power and fit. Discrete features are also easy to add and remove, which enables fast model iteration.
  2. Speed! Speed! Speed! The inner product of sparse vectors is fast to compute, the results are easy to store, and it scales easily.
  3. Robustness! Robustness! Robustness! Discretized features are very robust to abnormal data. For example, a feature "age > 30" is 1, otherwise 0. Without discretization, an abnormal data point "age = 300" would greatly disturb the model (see the binning sketch after this list).
  4. Convenient crossing and feature combination: after discretization you can do feature crosses, going from M + N variables to M × N variables, further introducing non-linearity and improving expressive power.
  5. Stability: after discretization the model is more stable. For example, if user age is discretized with 20-30 as one bucket, a user does not become a completely different person just because they are one year older. Of course, samples near bucket boundaries behave in exactly the opposite way, so how to split the buckets is an art.
  6. Model simplification: feature discretization simplifies the logistic regression model and reduces the risk of overfitting.
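
A minimal binning sketch for the age example in item 3 (the bucket boundaries and values are illustrative):

```python
import numpy as np

ages = np.array([18, 25, 31, 45, 300])    # 300 is an abnormal value
bins = np.array([20, 30, 40, 50])         # illustrative bucket boundaries
buckets = np.digitize(ages, bins)         # bucket index for each sample
one_hot = np.eye(len(bins) + 1)[buckets]  # one separate weight per bucket
print(buckets)  # [0 1 2 3 4] -- the outlier simply lands in the last bucket
```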

9. What happens if L1 regularization in the logistic regression objective is increased

As the L1 regularization strength grows large enough, all parameters w become 0.
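
A small check of this effect with scikit-learn (synthetic data; note that in scikit-learn a smaller C means a stronger penalty):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
for C in [1.0, 0.1, 0.001]:  # smaller C = stronger L1 regularization
    clf = LogisticRegression(penalty='l1', solver='liblinear', C=C).fit(X, y)
    print(C, np.count_nonzero(clf.coef_))  # nonzero weight count drops toward 0
```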

10. Code implementation

GitHub:https://github.com/NLP-LOVE/ML-NLP/blob/master/Machine%20Learning/2.Logistics%20Regression/demo/CreditScoring.ipynb


Author: @mantchs

GitHub:https://github.com/NLP-LOVE/ML-NLP

Everyone is welcome to join the discussion and improve this project together! QQ group: 541954936
