Logistic regression and gradient descent

Logistic regression is a generalized linear regression model, commonly used in data mining, automatic disease diagnosis, economic forecasting, and other fields.

Because logistic regression is a generalized linear model, it has a lot in common with multiple linear regression analysis.

The formula is as follows:

g(z) = 1 / (1 + e^(−z))

We also know that its derivative can be expressed in terms of the function itself (this is needed later in the gradient descent derivation for logistic regression):

g'(z) = g(z) · (1 − g(z))
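This self-referencing derivative is easy to check numerically. A minimal sketch in Python (the function names are my own), comparing the closed form g·(1−g) against a finite difference:

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    """Closed form g'(z) = g(z) * (1 - g(z))."""
    g = sigmoid(z)
    return g * (1.0 - g)

# Compare against a central finite difference at an arbitrary point.
z, h = 0.7, 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
print(abs(numeric - sigmoid_derivative(z)) < 1e-8)  # prints True
```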

The graph is:

[Figure: the S-shaped sigmoid curve, approaching 0 as z → −∞, passing through (0, 0.5), and approaching 1 as z → +∞]

Observing the graph above, we find that the range of the logistic function is (0, 1). When the input is 0, the output is 0.5; as the input drops below 0 and keeps decreasing, the output gets closer and closer to 0; conversely, as the input rises above 0 and keeps increasing, the output approaches 1.

Usually we use linear regression to predict continuous values, but logistic regression, despite the word "regression" in its name, is most often used to solve binary classification problems.

When the output is greater than 0.5, we can assume the sample belongs to class A; when it is less than 0.5, we consider the sample to belong to class B.
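The threshold rule above can be sketched in a few lines of Python (the `classify` helper and the class labels are my own illustration):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def classify(z, threshold=0.5):
    """Map the sigmoid output to one of two classes."""
    return "A" if sigmoid(z) >= threshold else "B"

print(sigmoid(0.0))   # 0.5, exactly on the decision boundary
print(classify(2.0))  # positive input -> output above 0.5 -> "A"
print(classify(-3.0)) # negative input -> output below 0.5 -> "B"
```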

However, since a sample usually has multiple features, we cannot substitute it into the logistic equation directly. We therefore use the linear regression described earlier to combine the sample's feature values into a single value z, substitute that into the equation, and classify accordingly. z is expressed as follows:

z = θ^T x = θ_0·x_0 + θ_1·x_1 + … + θ_n·x_n (where x_0 = 1)

Substituting z gives the detailed expression of logistic regression for such data:

h_θ(x) = g(θ^T x) = 1 / (1 + e^(−θ^T x))
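As an illustration, the hypothesis h_θ(x) = g(θ^T x) can be evaluated for one sample like this (the θ and x values are made up; x's leading 1 supplies the intercept term θ_0):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x); by convention x[0] = 1 so theta[0] is the intercept."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

theta = [-1.0, 2.0, 0.5]  # hypothetical parameters [theta0, theta1, theta2]
x = [1.0, 0.8, -0.4]      # one sample: leading 1, then two feature values
p = hypothesis(theta, x)  # estimated probability that y = 1
print(round(p, 3))
```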

With this formula we can run logistic regression on any data, but a problem remains: the value of θ. The formula assumes θ is already known, so before we can use it to classify data, how do we obtain θ?

Consider the following derivation.

Two. Derivation of the Logistic Regression Formula

Above we obtained the hypothesis h_θ(x); we still need to obtain θ. How to do so is analyzed in detail here.

In machine learning there is usually a process called training: from data whose classifications (or labels) are known, we derive a model (or classifier), and then use this model to label (or classify) data whose labels are unknown.

Therefore, we use the samples (i.e., the data with known classifications) to obtain an estimate of θ. In probability theory this process is called parameter estimation.

Here we use the derivation of maximum likelihood estimation to obtain a formula for computing θ:

(1) First, let:

P(y = 1 | x; θ) = h_θ(x)
P(y = 0 | x; θ) = 1 − h_θ(x)

(2) The two expressions above combine into one:

P(y | x; θ) = h_θ(x)^y · (1 − h_θ(x))^(1−y)

(3) The likelihood function over m samples:

L(θ) = ∏(i=1..m) h_θ(x_i)^(y_i) · (1 − h_θ(x_i))^(1−y_i)

(4) Taking the logarithm of the likelihood:

l(θ) = log L(θ) = Σ(i=1..m) [ y_i·log h_θ(x_i) + (1 − y_i)·log(1 − h_θ(x_i)) ]

(5) The θ that maximizes the likelihood function can be taken as the model's parameters. To maximize the likelihood we could use gradient ascent, but with a small transformation we can turn the problem into gradient descent and solve it with the gradient descent approach. The transformed expression is:

J(θ) = −(1/m) · l(θ)

(Because we multiplied by a negative coefficient, gradient ascent becomes gradient descent.)

(6) Because we obtain a new θ by updating the current one, we need to know the direction of the update (i.e., whether adding or subtracting a number brings the current θ closer to the final result). Differentiating J(θ) gives this direction. (Why is the direction obtained this way, and why is it then applied as in the formula below? See the deductive derivation of the gradient descent formula in section Three.) The differentiation yields:

∂J(θ)/∂θ_j = (1/m) · Σ(i=1..m) (h_θ(x_i) − y_i) · x_i^(j)

(7) With the update direction in hand, we iterate the following update until convergence to obtain the final result:

θ_j := θ_j − α · (1/m) · Σ(i=1..m) (h_θ(x_i) − y_i) · x_i^(j)
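The whole iteration of step (7) can be sketched as a small training loop (the toy data and hyperparameters are my own; this is an illustration, not a production implementation):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_descent(X, y, alpha=0.1, iterations=5000):
    """Batch gradient descent for logistic regression.

    Update rule: theta_j := theta_j - alpha * (1/m) * sum_i (h(x_i) - y_i) * x_i[j]
    """
    m, n = len(X), len(X[0])
    theta = [0.0] * n
    for _ in range(iterations):
        # Predictions under the current theta (computed once per iteration).
        h = [sigmoid(sum(t * xij for t, xij in zip(theta, xi))) for xi in X]
        # Simultaneous update of every parameter.
        for j in range(n):
            grad_j = sum((h[i] - y[i]) * X[i][j] for i in range(m)) / m
            theta[j] -= alpha * grad_j
    return theta

# Toy 1-D data (leading 1 is the intercept feature): labels flip around 2.5.
X = [[1, 0], [1, 1], [1, 2], [1, 3], [1, 4], [1, 5]]
y = [0, 0, 0, 1, 1, 1]
theta = gradient_descent(X, y)
print(sigmoid(theta[0] + theta[1] * 4) > 0.5)  # feature value 4 -> positive class
```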

Three. Deductive Derivation of the Gradient Descent Formula

To find a function's optimum (maximum or minimum), in mathematics we usually differentiate the function, set the derivative equal to 0 to obtain an equation, and solve that equation to get the result directly. In machine learning, however, our functions are often multi-dimensional and high-order, so the equation obtained by setting the derivative to 0 is hard (sometimes impossible) to solve directly. We then need some other method to reach the result, and gradient descent is one of them.

Take the simplest function, y = x². How do we find the x at which y is smallest, without solving 2x = 0?

(1) First pick an arbitrary value of x, say x = −4, which gives a value of y.

(2) Find the update direction. (If we update x without a direction, say x − 0.5 or x + 0.5, we get the following picture.)

We can see that if we update x in the negative direction, we move away from the final result; here we should update in the positive direction. So before updating x we need to find its update direction. (The direction is not fixed; it depends on the current value: when x = 4, for example, we should update in the negative direction.) We take the derivative and evaluate it at the current point: y' = 2x, so at x = −4, y' = −8. The update direction is given by y', and to update x we simply compute x := x − α·y', where α (greater than 0) is the update step size, which in machine learning we call the learning rate. PS: earlier we said that multi-dimensional, high-order equations cannot be solved, not that they cannot be differentiated; so we can still differentiate and substitute in the current x.

(3) Repeat steps (1) and (2) until x converges.
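Steps (1)–(3) for y = x², starting from x = −4, look like this in Python:

```python
def gradient_descent_1d(x0, alpha=0.1, steps=100):
    """Minimize y = x^2 by iterating x := x - alpha * y'(x), where y'(x) = 2x."""
    x = x0
    for _ in range(steps):
        x -= alpha * 2 * x  # step against the gradient direction
    return x

x_min = gradient_descent_1d(-4.0)
print(abs(x_min) < 1e-6)  # converges to the minimum at x = 0; prints True
```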

The gradient descent method:

θ_j := θ_j − α · (1/m) · Σ(i=1..m) (h_θ(x_i) − y_i) · x_i^(j)

For this expression, if:

(1) m is the total number of samples, i.e., each iteration's update considers all samples: this is called batch gradient descent (BGD). It easily finds the global optimum, but when the number of samples is large, training is very slow. Use it when the sample count is small.

(2) m = 1, i.e., each iteration's update considers only one sample: the formula becomes θ_j := θ_j − α · (h_θ(x_i) − y_i) · x_i^(j), and the method is called stochastic gradient descent (SGD). It trains fast, but accuracy drops and the result is not necessarily the global optimum. For example, for the function pictured below (when starting at x = 9.5, what is finally found is a local optimum):

(3) Combining the two methods above: when m is a fraction of the total sample count (say m = 10), i.e., each iteration's update considers a small batch of samples, the formula is θ_j := θ_j − α · (1/m) · Σ(i∈batch) (h_θ(x_i) − y_i) · x_i^(j), and the method is called mini-batch gradient descent (MBGD). It overcomes the drawbacks of the two methods above while retaining their advantages, and it is the most commonly used in practice.
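A sketch of MBGD on the same kind of toy data (the batch size, data, and hyperparameters are my own choices):

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def minibatch_gd(X, y, batch_size=2, alpha=0.1, epochs=2000, seed=0):
    """Mini-batch gradient descent: each update uses a small random batch."""
    rng = random.Random(seed)
    n = len(X[0])
    theta = [0.0] * n
    indices = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the samples in a fresh random order
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            h = {i: sigmoid(sum(t * v for t, v in zip(theta, X[i]))) for i in batch}
            for j in range(n):
                grad_j = sum((h[i] - y[i]) * X[i][j] for i in batch) / len(batch)
                theta[j] -= alpha * grad_j
    return theta

X = [[1, 0], [1, 1], [1, 2], [1, 3], [1, 4], [1, 5]]
y = [0, 0, 0, 1, 1, 1]
theta = minibatch_gd(X, y)
print(sigmoid(theta[0] + theta[1] * 5) > 0.5)  # feature value 5 -> positive class
```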


Origin blog.csdn.net/qq_41282102/article/details/104320253