Logistic Regression: A Classification Algorithm
1. Binary Classification
Suppose a hospital wants to analyze its patients' conditions, and one judgment to be made is whether a tumor is benign or malignant. We have a data set recording tumor sizes, and the task is to decide from the size alone whether a tumor is benign or malignant. This is a typical binary classification problem: there are only two possible output values, benign and malignant (usually encoded as the numbers 0 and 1). From Figure 1, we can make a visual judgment: if the tumor size is greater than 5, the tumor is malignant (output 1); if it is 5 or less, it is benign (output 0).
2. The Nature of the Classification Problem
Classification is essentially a form of supervised learning: given a data set whose classes are known, we let the computer learn from that data set through a classification algorithm, so that it can predict the classes of new data. In the tumor example, the existing data set is shown in Figure 1; to diagnose a new patient, the computer only needs to compare the patient's tumor size with 5, and it can then infer whether the tumor is malignant or benign. Classification and regression problems have some similarities: both predict unknown results by learning from a data set. The difference lies in the output values: a regression outputs continuous values (e.g. house prices), while a classification outputs discrete values (e.g. malignant or benign). Given these similarities, can we build classification on top of regression? The answer is yes. One possible idea is to use a linear fit and then quantize its predictions, that is, quantize the continuous values into discrete ones.
3. The Hypothesis Function for Classification
Although classification and regression problems have some similarities, we cannot directly use a regression hypothesis function as the hypothesis function of a classification problem. Take the example of Figure 1 again: if we fit the data with a simple linear function (i.e. \(h(x)=\theta_0+\theta_1x\)), the result could look like \(h_\theta(x)=\dfrac{5}{33}x-\dfrac{1}{3}\), shown in Figure 2:
You might then make the following judgment: for this linear hypothesis function, given a tumor size, plug it into the hypothesis function and compare the output with 0.5; if the output is greater than 0.5, output 1, and if it is less than 0.5, output 0. On the data set of Figure 1 this method does work. But if we change the data set a little, as shown in Figure 3, the linear fit comes out differently:
With the same method, a tumor of size 6 would now be misclassified as benign. So we cannot quantize simply by comparing the output of a linear fit against some fixed threshold. In logistic regression, the quantizing function is the sigmoid function (also called the logistic function, or S-function). Its mathematical expression is \(S(x) = \dfrac{1}{1+e^{-x}}\), and its graph is shown in Figure 4:
As the graph shows, the output of the sigmoid function lies between 0 and 1. In logistic regression, we use the sigmoid function to quantize the output of the linear fit, so the hypothesis function of logistic regression is:
\[h_\theta(x)=\dfrac{1}{1+e^{-\theta^Tx}}=\dfrac{1}{1+e^{-\sum_{i=0}^n\theta_ix_i}} \tag{3.1}\]
where \(x\) is the augmented feature vector ((n+1)×1) and \(\theta\) is the augmented weight vector ((n+1)×1). The mathematical meaning of this hypothesis function is: for a particular sample \(x\) and parameter vector \(\theta\), it gives the probability that the class is 1 (assuming y takes only the two values 0 and 1), i.e. \(h_\theta(x) = P(y=1|x;\theta)\). From this meaning we can decide as follows: if \(h_\theta(x)>0.5\), predict y = 1; if \(h_\theta(x)<0.5\), predict y = 0.
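To make this concrete, here is a minimal NumPy sketch of the hypothesis function and the 0.5 decision rule; the parameter values are made up purely for illustration:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function S(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = S(theta^T x), where x is an augmented feature
    vector whose first component is the constant 1 (bias term)."""
    return sigmoid(theta @ x)

# Decision rule: predict y = 1 when h_theta(x) > 0.5, else y = 0.
theta = np.array([-5.0, 1.0])      # made-up parameters, illustration only
x = np.array([1.0, 6.0])           # augmented sample: bias 1, tumor size 6
print(hypothesis(theta, x))        # ~0.73 -> probability that y = 1
print(hypothesis(theta, x) > 0.5)  # True  -> classify as malignant (y = 1)
```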
4. The Cost Function of Logistic Regression
The cost function is also known as the loss function. In logistic regression, the cost function is not the mean squared error (mainly because the mean squared error of logistic regression is not a convex function, which makes gradient descent inconvenient); it is instead a function derived by maximum likelihood estimation, with the following mathematical expression:
\[J(\theta)= \dfrac{1}{m}\sum_{i=1}^m[-yln(h_\theta(x))-(1-y)ln(1-h_\theta(x))] \tag{4.1}\]
This function looks a bit complicated, so let us break it down step by step. Denote the error between a single sample point and the hypothesis function by \(Cost(h_\theta(x),y)=-yln(h_\theta(x))-(1-y)ln(1-h_\theta(x))\); the cost function can then be understood as the mean of these errors. Let us look at this error function in more detail. Since y can only take the value 0 or 1, we can write the error function in piecewise form:
\[Cost(h_\theta(x),y)=\begin{cases} -ln(h_\theta(x)),\quad &y = 1 \\ -ln(1-h_\theta(x)), &y=0 \end{cases} \tag{4.2}\]
Eq. 4.2 is equivalent to Eq. 4.1. From Eq. 4.2 it is easy to see: when y=1, if we predict y=1 (i.e. \(h_\theta(x) = 1\)), the error is 0; if we mispredict y=0 (i.e. \(h_\theta(x) = 0\)), the error goes to positive infinity. When y=0, if we predict y=0 (i.e. \(h_\theta(x) = 0\)), the error is 0; if we mispredict y=1 (i.e. \(h_\theta(x) = 1\)), the error goes to positive infinity. (Note: \(h_\theta(x) = 1\) means the probability that y equals 1 is 1, which is equivalent to predicting y=1; \(h_\theta(x) = 0\) means the probability that y equals 1 is 0, which is equivalent to predicting y=0.)
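This behavior is easy to verify numerically. A small sketch (the probability values are made up to probe the two extremes):

```python
import numpy as np

def cost_single(h, y):
    """Per-sample error: Cost(h, y) = -y*ln(h) - (1-y)*ln(1-h)."""
    return -y * np.log(h) - (1 - y) * np.log(1 - h)

# When the prediction matches the label the error is near 0; as the
# prediction approaches the opposite label the error grows without bound.
print(cost_single(0.999, 1))   # ~0.001: confident, correct
print(cost_single(0.5,   1))   # ~0.693: undecided
print(cost_single(0.001, 1))   # ~6.9:   confident, wrong
print(cost_single(1e-10, 1))   # ~23:    approaching +infinity
```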
Expressed with matrices, the cost function is:
\[J(\theta)=-\dfrac{1}{m}\left[Y^Tln(h_\theta(X))+(E-Y)^Tln(E-h_\theta(X))\right] \tag{4.3}\]
where \(Y\) is the column vector of the true sample values (m×1), \(X\) is the augmented sample matrix (m×(n+1), one sample per row), and \(E\) is the all-ones column vector (m×1).
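Here is a sketch of Eq. 4.3 in NumPy, assuming \(X\) is laid out with one sample per row; the toy data set (tumor sizes with 0/1 labels) and the parameter values are invented for illustration:

```python
import numpy as np

def cost(theta, X, Y):
    """Vectorized cost J(theta) following Eq. 4.3.
    X: augmented sample matrix, m x (n+1); Y: label column vector, m x 1."""
    m = X.shape[0]
    h = 1.0 / (1.0 + np.exp(-X @ theta))   # h_theta(X), an m x 1 vector
    E = np.ones_like(Y)                    # all-ones column vector
    return (-(Y.T @ np.log(h) + (E - Y).T @ np.log(E - h)) / m).item()

# Toy data: bias column plus tumor size; labels 0 (benign) / 1 (malignant).
X = np.array([[1.0, 1.0], [1.0, 3.0], [1.0, 6.0], [1.0, 8.0]])
Y = np.array([[0.0], [0.0], [1.0], [1.0]])
print(cost(np.array([[-5.0], [1.0]]), X, Y))  # ~0.13 for these parameters
```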
5. Gradient Descent for Logistic Regression
The cost function of logistic regression, like the loss function of linear regression, is convex, so we can use gradient descent to find the parameter vector \(\theta\) that minimizes the cost function \(J(\theta)\). The algorithm is essentially the same as gradient descent for linear regression (see my other post on gradient descent for linear regression); only the partial derivatives differ. The partial derivatives of the logistic regression cost function are:
\[\dfrac{\partial J(\theta)}{\partial\theta_i} = \dfrac{1}{m}\sum_{j=1}^m(h_\theta(x^{(j)})-y^{(j)})x_i^{(j)} = \dfrac{1}{m}\sum_{j=1}^m\left(\dfrac{1}{1+e^{-\sum_{k=0}^n\theta_kx_k^{(j)}}}-y^{(j)}\right)x_i^{(j)}\quad (i=0,1,\dots,n)\tag{5.1}\]
The corresponding parameter update rule is:
\[\theta_i = \theta_i-\alpha\dfrac{\partial J(\theta)}{\partial\theta_i} = \theta_i-\alpha\dfrac{1}{m}\sum_{j=1}^m(h_\theta(x^{(j)})-y^{(j)})x_i^{(j)}\quad (i=0,1,\dots,n)\tag{5.2}\]
In matrix form:
\[\dfrac{\partial J(\theta)}{\partial\theta} = \dfrac{1}{m}X^T(h_\theta(X)-Y),\quad \theta=\theta-\alpha\dfrac{1}{m}X^T(h_\theta(X)-Y) \tag{5.3}\]
where \(\alpha\) is the step size (learning rate).
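Putting Eq. 5.3 into code gives a complete training loop. This is a minimal batch-gradient-descent sketch reusing the toy data from above; the step size and iteration count are arbitrary choices, not tuned values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, Y, alpha=0.1, iterations=10000):
    """Batch gradient descent following Eq. 5.3:
    theta <- theta - alpha * (1/m) * X^T (h_theta(X) - Y)."""
    m, n_plus_1 = X.shape
    theta = np.zeros((n_plus_1, 1))
    for _ in range(iterations):
        grad = X.T @ (sigmoid(X @ theta) - Y) / m   # Eq. 5.3, left half
        theta -= alpha * grad                       # Eq. 5.3, right half
    return theta

# Same toy data as before; the learned decision boundary
# h_theta(x) = 0.5 sits at tumor size -theta[0]/theta[1].
X = np.array([[1.0, 1.0], [1.0, 3.0], [1.0, 6.0], [1.0, 8.0]])
Y = np.array([[0.0], [0.0], [1.0], [1.0]])
theta = gradient_descent(X, Y)
print(-theta[0, 0] / theta[1, 0])  # boundary between 3 and 6, as expected
```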
6. Multiclass Logistic Regression
For multiclass logistic regression, one possible idea is to reduce the problem to binary classification. For example, suppose the data set contains three classes: 1, 2, and 3. If we want to determine whether a sample belongs to class 1, we can view the data as having just two classes, class 1 and non-class-1 (i.e. classes 2 and 3), and obtain a hypothesis function \(h^{(1)}_\theta(x)\) for class 1. In the same way we obtain \(h^{(2)}_\theta(x)\) and \(h^{(3)}_\theta(x)\). The decision rule then becomes:
\[\text{if}\quad \max_i\{h^{(i)}_\theta(x)\} = h^{(j)}_\theta(x),\quad \text{then}\quad y = j\quad(i,j=1,2,3) \tag{6.1}\]
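A sketch of this one-vs-rest decision rule; the per-class parameter vectors here are invented for illustration, whereas in practice each would come from its own binary training run:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_one_vs_rest(thetas, x):
    """Decision rule 6.1: evaluate each per-class hypothesis
    h^(i)_theta(x) and return the class with the largest value."""
    scores = {label: sigmoid(theta @ x) for label, theta in thetas.items()}
    return max(scores, key=scores.get)

# Made-up parameter vectors for classes 1, 2, 3 (illustration only).
thetas = {1: np.array([2.0, -1.0]),
          2: np.array([-1.0, 0.5]),
          3: np.array([-4.0, 0.9])}
print(predict_one_vs_rest(thetas, np.array([1.0, 1.0])))  # -> 1
```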
7. Summary
Although logistic regression has the word "regression" in its name, it is in fact a classification algorithm. The idea behind logistic regression is very similar to that of discriminant functions in pattern recognition, and the two can be studied together.