## 1. Logistic Regression and Gradient Descent

Logistic regression is a generalized linear model, commonly used in data mining, automated disease diagnosis, economic forecasting, and other fields.

As a generalized linear model, logistic regression has a lot in common with multiple linear regression analysis.

The formula is as follows:

$$g(z) = \frac{1}{1 + e^{-z}}$$

We also know that the derivative of this function can be expressed in terms of the function itself (a property we will need when deriving gradient descent for logistic regression):

$$g'(z) = g(z)\bigl(1 - g(z)\bigr)$$

Its graph is an S-shaped curve running between 0 and 1.

Observing this graph, we see that the output of the logistic function lies in (0, 1). When the input is 0, the output is 0.5; as the input drops below 0 and keeps decreasing, the output gets closer and closer to 0; conversely, as the input rises above 0 and keeps increasing, the output approaches 1.
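These properties are easy to verify numerically; a minimal sketch of the sigmoid function (all names here are our own, chosen for illustration):

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0))   # exactly 0.5
print(sigmoid(-6))  # ~0.0025, approaching 0
print(sigmoid(6))   # ~0.9975, approaching 1
```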

Linear regression is usually used to predict continuous values; logistic regression, despite the word "regression" in its name, is most often used to solve binary classification problems.

When the output is greater than 0.5, we consider the sample to belong to class A; when it is less than 0.5, to class B.

However, since a sample usually has multiple features, we cannot feed it into the logistic function directly. We therefore use the linear combination from linear regression, described earlier, to reduce the sample's feature values to a single number z, which is then passed into the function for classification. So z is expressed as:

$$z = \theta^{T}x = \theta_0 x_0 + \theta_1 x_1 + \cdots + \theta_n x_n$$

This gives the full logistic regression expression for the data:

$$h_\theta(x) = g(\theta^{T}x) = \frac{1}{1 + e^{-\theta^{T}x}}$$
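Putting the pieces together, a minimal sketch of computing $h_\theta(x)$ and thresholding it at 0.5 (the parameter values and sample below are hypothetical, chosen only for illustration):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(theta, x):
    """h_theta(x) = g(theta^T x): probability that the sample belongs to class 1."""
    z = sum(t * xi for t, xi in zip(theta, x))  # z = theta^T x
    return sigmoid(z)

theta = [0.5, -1.0, 2.0]   # hypothetical parameters
x = [1.0, 2.0, 3.0]        # x[0] = 1 is the intercept term
p = predict(theta, x)      # here z = 4.5, so p is close to 1
label = 1 if p > 0.5 else 0
```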

With this formula we can classify any data, but one problem remains: the formula involves the parameter vector θ, and θ only appears in the formula, so we cannot classify data without first knowing it. How, then, do we obtain θ?

This is derived in the next section.

## 2. Logistic Regression Formula Derivation

Above we obtained the hypothesis $h_\theta(x) = g(\theta^{T}x)$; what remains is to find θ. How to obtain θ is analyzed in detail here.

In machine learning there is usually a process called training: using data with known classes (or labels), we fit a model (or classifier), and then use that model to label (or classify) data whose labels are unknown.

Therefore, we use the samples (i.e., the data with known classes) to estimate θ. In probability theory this process is called parameter estimation.

(1) First, let:

$$P(y=1 \mid x;\theta) = h_\theta(x), \qquad P(y=0 \mid x;\theta) = 1 - h_\theta(x)$$

(2) Combine the two expressions above into one:

$$P(y \mid x;\theta) = \bigl(h_\theta(x)\bigr)^{y}\bigl(1 - h_\theta(x)\bigr)^{1-y}$$

(3) Take its likelihood function:

$$L(\theta) = \prod_{i=1}^{m} \bigl(h_\theta(x^{(i)})\bigr)^{y^{(i)}}\bigl(1 - h_\theta(x^{(i)})\bigr)^{1-y^{(i)}}$$

(4) Take the logarithm of the likelihood:

$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^{m}\Bigl[y^{(i)}\log h_\theta(x^{(i)}) + \bigl(1-y^{(i)}\bigr)\log\bigl(1 - h_\theta(x^{(i)})\bigr)\Bigr]$$

(5) The θ that maximizes the likelihood can be taken as the model's parameters. To maximize the likelihood we could use gradient ascent, but with a small transformation of the likelihood we can turn the problem into gradient descent and solve it with that approach instead. The transformed expression is:

$$J(\theta) = -\frac{1}{m}\,\ell(\theta)$$

(Because we multiplied by a negative coefficient, gradient ascent becomes gradient descent.)

(6) Since we obtain the new θ by updating the current θ, we need to know the update direction (i.e., whether adding or subtracting a number from the current θ brings it closer to the final result). Differentiating J(θ) gives the update direction. (Why the direction is obtained this way, and why it is then used in the update formula below, is explained in the gradient descent walkthrough in the next section.) Using the property $g'(z) = g(z)(1 - g(z))$, the derivative works out to:

$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)x_j^{(i)}$$

(7) With the update direction in hand, we iterate the following update until convergence to obtain the final result:

$$\theta_j := \theta_j - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)x_j^{(i)}$$
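The whole derivation can be collected into a training loop. A minimal pure-Python sketch of the batch update above, on a tiny hypothetical dataset (function names and data are our own, for illustration only):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(X, y, alpha=0.1, iters=5000):
    """Batch gradient descent on J(theta); each row of X includes the
    intercept feature x0 = 1."""
    m, n = len(X), len(X[0])
    theta = [0.0] * n
    for _ in range(iters):
        # h_theta(x^(i)) - y^(i) for every sample
        errors = [sigmoid(sum(t * xi for t, xi in zip(theta, row))) - yi
                  for row, yi in zip(X, y)]
        # theta_j := theta_j - alpha * (1/m) * sum_i errors_i * x_j^(i)
        theta = [theta[j] - alpha * sum(errors[i] * X[i][j] for i in range(m)) / m
                 for j in range(n)]
    return theta

# Hypothetical separable data: label is 1 when the feature exceeds 2.5.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]]
y = [0, 0, 1, 1]
theta = train(X, y)
preds = [1 if sigmoid(sum(t * xi for t, xi in zip(theta, row))) > 0.5 else 0
         for row in X]  # should reproduce y on this toy set
```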

## 3. A Walkthrough of the Gradient Descent Formula

(1) First pick an arbitrary value of x, say x = -4, and compute the corresponding y.

(2) Compute the update direction. (If x were updated without computing the direction, e.g., always x - 0.5 or always x + 0.5, the trajectory would not reliably approach the minimum.)

(3) Keep repeating steps (1) and (2) until x converges.
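The three steps above can be sketched in a few lines. The function being minimized is not stated in the text, so this sketch assumes f(x) = x² purely for illustration:

```python
def f_prime(x):
    """Derivative of the illustrative function f(x) = x**2 (assumed example)."""
    return 2 * x

x = -4.0           # step (1): pick an initial x
alpha = 0.1        # learning rate
for _ in range(200):            # step (3): repeat until convergence
    x = x - alpha * f_prime(x)  # step (2): move against the gradient
print(x)  # very close to the minimum at x = 0
```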

Returning to the update rule for θ, the choice of m gives three variants:

(1) When m is the total number of samples, i.e., each iteration considers all samples, the method is called batch gradient descent (BGD). It readily finds the global optimum (for a convex objective), but training becomes slow when there are many samples, so it is best used when the sample count is small.

(2) When m = 1, i.e., each iteration considers only one sample, the update becomes

$$\theta_j := \theta_j - \alpha\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)x_j^{(i)}$$

This is called stochastic gradient descent (SGD). Training is fast, but accuracy drops and the result is not necessarily the global optimum. For example, when minimizing a function with several local minima, starting at x = 9.5 can converge to a local rather than the global optimum.

(3) Combining the two methods above: when m is a small fraction of the sample count (e.g., m = 10), i.e., each iteration considers a small batch of samples, the update is

$$\theta_j := \theta_j - \alpha\,\frac{1}{10}\sum_{k=i}^{i+9}\bigl(h_\theta(x^{(k)}) - y^{(k)}\bigr)x_j^{(k)}$$

This is called mini-batch gradient descent (MBGD). It overcomes the drawbacks of the two methods above while retaining their advantages, and is the variant most commonly used in practice.
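The three variants differ only in how many samples each update draws. A minimal sketch of that batching choice (`make_batches` is a hypothetical helper, not from the text):

```python
import random

def make_batches(samples, batch_size, seed=0):
    """Split shuffled sample indices into batches:
    batch_size == len(samples) -> BGD, batch_size == 1 -> SGD,
    anything in between        -> MBGD."""
    rng = random.Random(seed)
    idx = list(range(len(samples)))
    rng.shuffle(idx)
    return [idx[i:i + batch_size] for i in range(0, len(idx), batch_size)]

data = list(range(100))        # 100 hypothetical sample indices
bgd  = make_batches(data, 100)  # 1 batch of 100   -> batch GD
sgd  = make_batches(data, 1)    # 100 batches of 1 -> stochastic GD
mbgd = make_batches(data, 10)   # 10 batches of 10 -> mini-batch GD
```

Each returned batch is then plugged into the update formula in place of the full sum over all m samples.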

Origin blog.csdn.net/qq_41282102/article/details/104320253