In classification, the output takes one of several discrete values. Linear regression cannot handle this directly, so a new model is required: logistic regression.
1. Introduction to binary classification problems
Binary classification model
Many questions only admit one of two answers.
For example, the answers below are only yes/no (1/0):
For such a problem, the dataset represents the labels as:
1: it is a tumor
0: it is not a tumor
The following describes how to model this data to get the desired result.
logistic regression model
$$f_{\vec{w},b}(\vec{x}) = g(\vec{w}\cdot\vec{x}+b) = \frac{1}{1+e^{-(\vec{w}\cdot\vec{x}+b)}}$$
Since our target output lies in the interval **0 (no) to 1 (yes)**, we want a function whose output is also in 0-1:
First, let's look at the function that logistic regression is built on:
$$g(z) = \frac{1}{1+e^{-z}}$$
g(z) tends to 1 when z is large, and tends to 0 when z is small.
Substituting the linear regression expression $z = \vec{w}\cdot\vec{x}+b$ into g gives the logistic regression model.
In this way, the output of the linear expression is mapped into the range 0-1.
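As a quick illustration, here is a minimal NumPy sketch of the sigmoid and the logistic model (the names `sigmoid`, `f_wb`, and the sample values are chosen for this example, not taken from the original notes):

```python
import numpy as np

def sigmoid(z):
    """Map any real z into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def f_wb(x, w, b):
    """Logistic regression model: g(w . x + b)."""
    return sigmoid(np.dot(w, x) + b)

# Example: a single sample with two features
w = np.array([1.0, 1.0])
b = -3.0
x = np.array([2.0, 2.5])
print(f_wb(x, w, b))  # a value between 0 and 1, interpretable as a probability
```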
Example analysis of the function's output:
For example:
For a patient where x is the tumor size and the output y is the probability (0-1) that it is a tumor, the example below means: there is a 70% probability of disease.
Combined with probability theory, this is usually written as:
$$f_{\vec{w},b}(\vec{x}) = P(y=1 \mid \vec{x};\ \vec{w},b)$$
which denotes the probability that y = 1 given x (with parameters w and b).
decision boundary
If we only want 0/1 results, we can set a threshold.
For example:
$$\hat{y} = \begin{cases} 1, & \text{if } f(x) \ge 0.5 \\ 0, & \text{if } f(x) < 0.5 \end{cases}$$
When f(x) = 0.5, it means $\vec{w}\cdot\vec{x}+b = 0$ (as shown below).
It also means that f(x) assigns equal probability to y = 0 and y = 1.
What is the significance of such a decision rule? Consider the following:
Example: the training set is shown below, where x marks label 1 and o marks label 0. Suppose the model has two features, with w1 = 1, w2 = 1, b = -3.
The decision boundary can be expressed as:
$$z = \vec{w}\cdot\vec{x}+b = 0$$
Substituting w1, w2, b:
$$x_1 + x_2 - 3 = 0 \implies x_1 + x_2 = 3$$
This line (purple in the plot) is the decision boundary.
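A small sketch of how this decision rule plays out in code, using the example's w1 = 1, w2 = 1, b = -3 (the `predict` name and threshold default are mine):

```python
import numpy as np

def predict(x, w, b):
    """Return 1 if the model output is at or above 0.5, else 0.
    Since g(z) >= 0.5 exactly when z >= 0, thresholding f(x) at 0.5
    is equivalent to checking the sign of w . x + b."""
    z = np.dot(w, x) + b
    return 1 if z >= 0 else 0

w = np.array([1.0, 1.0])
b = -3.0
print(predict(np.array([1.0, 1.0]), w, b))  # x1 + x2 = 2 < 3 -> 0
print(predict(np.array([2.0, 2.0]), w, b))  # x1 + x2 = 4 > 3 -> 1
```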
This is the basic behavior of the logistic regression model. Next, we introduce how to train it in detail:
- Build the cost function
- Implement the gradient descent algorithm
2. Cost function in logistic regression
Cost Function Implementation in Logistic Regression
If we reuse the squared-error cost function from linear regression:
$$J(\vec{w},b) = \frac{1}{2m}\sum_{i=1}^{m}\left(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\right)^2$$
there will be many local minima, which makes it inconvenient to optimize.
Moving the 1/2 inside the summation defines the per-example loss function.
Using a convex cost function instead guarantees that gradient descent can reach the global minimum.
The loss function tells us how well the model does on a single sample.
The loss function is defined piecewise:
$$L\left(f_{\vec{w},b}(\vec{x}^{(i)}), y^{(i)}\right) = \begin{cases} -\log\left(f_{\vec{w},b}(\vec{x}^{(i)})\right), & \text{if } y^{(i)} = 1 \\ -\log\left(1 - f_{\vec{w},b}(\vec{x}^{(i)})\right), & \text{if } y^{(i)} = 0 \end{cases}$$
To understand it:
If the value of y is 1, the loss function is:
$$-\log(f), \quad 0 < f < 1$$
That is, when the model output is f(x) and the true result is y = 1, this is the corresponding loss value.
If the value of y is 0, the loss function is:
$$-\log(1-f), \quad 0 < f < 1$$
That is, when the model output is f(x) and the true result is y = 0, this is the corresponding loss value.
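A minimal sketch of this per-example loss (the name `logistic_loss` is chosen for illustration), which makes the asymmetry easy to check numerically:

```python
import numpy as np

def logistic_loss(f, y):
    """Per-example loss: -log(f) when y = 1, -log(1 - f) when y = 0."""
    return -np.log(f) if y == 1 else -np.log(1.0 - f)

print(logistic_loss(0.9, 1))  # small loss: confident and correct
print(logistic_loss(0.9, 0))  # large loss: confident and wrong
```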
After building the loss function, if we can find the w and b that minimize J(w,b), we can implement logistic regression.
Simplify the cost function
The piecewise definition above can be simplified into a single formula:
$$L(f, y) = -y\log(f) - (1-y)\log(1-f)$$
Substituting this into the cost function gives the final form:
$$J(\vec{w},b) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\left(f_{\vec{w},b}(\vec{x}^{(i)})\right) + \left(1-y^{(i)}\right)\log\left(1-f_{\vec{w},b}(\vec{x}^{(i)})\right)\right]$$
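For reference, a vectorized sketch of this cost (the name `compute_cost` and the shapes are my assumptions: `X` is an (m, n) feature matrix, `y` holds 0/1 labels):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def compute_cost(X, y, w, b):
    """Logistic cost J(w, b) averaged over all m samples."""
    f = sigmoid(X @ w + b)  # model output for every sample at once
    return -np.mean(y * np.log(f) + (1 - y) * np.log(1 - f))
```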
This cost function is derived from maximum likelihood estimation; interested readers can look into it on their own.
3. Implement Gradient Descent
Through the gradient descent algorithm, we can find the w and b that minimize J(w,b).
Taking the partial derivatives of the formula above:
$$\frac{\partial J}{\partial w_j} = \frac{1}{m}\sum_{i=1}^{m}\left(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\right)x_j^{(i)} \qquad \frac{\partial J}{\partial b} = \frac{1}{m}\sum_{i=1}^{m}\left(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\right)$$
Substituting the partial derivatives into gradient descent gives the update rule:
$$w_j := w_j - \alpha\,\frac{\partial J}{\partial w_j} \qquad b := b - \alpha\,\frac{\partial J}{\partial b}$$
**Note:** the updates look identical to linear regression's, but here $f_{\vec{w},b}$ is different:
$$f_{\vec{w},b}(\vec{x}) = \frac{1}{1+e^{-(\vec{w}\cdot\vec{x}+b)}}$$
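Putting the pieces together, a sketch of the whole training loop in NumPy (the function name, `alpha`, and `iterations` defaults are chosen for this example):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, iterations=1000):
    """Train logistic regression with batch gradient descent.
    X: (m, n) feature matrix, y: (m,) array of 0/1 labels."""
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(iterations):
        f = sigmoid(X @ w + b)        # predictions for all samples
        err = f - y                   # (f(x_i) - y_i) for each i
        w -= alpha * (X.T @ err) / m  # dJ/dw_j = (1/m) sum(err * x_j)
        b -= alpha * np.mean(err)     # dJ/db   = (1/m) sum(err)
    return w, b
```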
4. Overfitting problem
Introducing overfitting
Terminology introduction:

- Generalization: the ability of the model to reason about unseen samples
- Underfitting (underfit)
The following housing-price models illustrate these terms:
A model that fits the data with large error is underfitting; this is also called high bias.
A model that fits the training data too closely and loses predictive power is overfitting; this is also called high variance.
So, we should look for a model that has neither high bias nor high variance
Solve the overfitting problem
One approach is to select fewer features. Disadvantage: some important features may be discarded.
Another approach is to shrink the parameter values as much as possible:
keep all the features, but prevent the feature weights from becoming too large.
Regularization
Regularization only considers the influence of w, not b (b has little influence on the overall fit).
If a quadratic model fits normally
but a quartic model overfits,
we need to shrink w3 and w4 to prevent the model from overfitting.
So we can add penalty terms for w3 and w4 to the cost function:
to minimize the cost function, w3 and w4 must then take small values.
λ: the regularization parameter, which must be chosen manually.
λ cannot be too large (all weights are pushed toward 0, underfitting) nor too small (the penalty has no effect, and overfitting remains).
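Putting this together, regularization adds a penalty term to whatever cost function is being used. A sketch of the general form (here n denotes the number of features, a symbol I am introducing for clarity):

$$J_{reg}(\vec{w},b) = J(\vec{w},b) + \frac{\lambda}{2m}\sum_{j=1}^{n} w_j^2$$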
1. Regularized linear regression
Formula:
$$J(\vec{w},b) = \frac{1}{2m}\sum_{i=1}^{m}\left(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\right)^2 + \frac{\lambda}{2m}\sum_{j=1}^{n} w_j^2$$
Substituting into the gradient descent algorithm:
$$w_j := w_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}\left(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m}w_j\right] \qquad b := b - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\left(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\right)$$
Only w is regularized.
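A minimal sketch of this update loop (the function name and the `lam`/`alpha`/`iterations` defaults are mine, chosen for illustration):

```python
import numpy as np

def gradient_descent_reg_linear(X, y, lam=1.0, alpha=0.01, iterations=1000):
    """Regularized linear regression via gradient descent.
    lam is the regularization parameter lambda; only w is regularized."""
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(iterations):
        err = X @ w + b - y                           # f(x_i) - y_i
        w -= alpha * ((X.T @ err) / m + lam / m * w)  # extra (lambda/m) * w_j term
        b -= alpha * np.mean(err)                     # b is not regularized
    return w, b
```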
2. Regularized logistic regression
Overfitting Logistic Regression Plot
Formula:
$$J(\vec{w},b) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\left(f_{\vec{w},b}(\vec{x}^{(i)})\right) + \left(1-y^{(i)}\right)\log\left(1-f_{\vec{w},b}(\vec{x}^{(i)})\right)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n} w_j^2$$
Substituting into the gradient descent algorithm gives updates of exactly the same form as in regularized linear regression.
Only w is regularized.
**Note:** again, the updates look the same as linear regression's, but here $f_{\vec{w},b}$ is different:
$$f_{\vec{w},b}(\vec{x}) = \frac{1}{1+e^{-(\vec{w}\cdot\vec{x}+b)}}$$
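To close the loop, a sketch of the regularized logistic cost and its gradients (the function name and `lam` default are chosen for this example):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_gradient_reg_logistic(X, y, w, b, lam=1.0):
    """Regularized logistic cost and its gradients; only w is penalized."""
    m = X.shape[0]
    f = sigmoid(X @ w + b)
    cost = -np.mean(y * np.log(f) + (1 - y) * np.log(1 - f)) \
           + lam / (2 * m) * np.sum(w ** 2)
    err = f - y
    dw = (X.T @ err) / m + lam / m * w  # regularization term added to dJ/dw
    db = np.mean(err)                   # dJ/db is unchanged
    return cost, dw, db
```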