Chapter 1, Part 3: Classification Models

In a classification problem, the output takes one of a small set of discrete values. Linear regression is not suited to this setting, so a new model is needed: logistic regression.

1. Introduction to binary classification problems

Binary classification model

Many problems demand a categorical answer.

For example, each of the questions below can only be answered yes/no (1/0):

*(figure: example yes/no classification questions)*

For the tumor example above, the dataset is labeled as:

1: it is a tumor

0: not a tumor

*(figure: tumor dataset plotted with 0/1 labels)*

  • It is clearly unreasonable to use linear regression for this prediction:

*(figure: a linear regression fit on the 0/1 data)*

The following describes how to model this data to get the desired result.

Logistic regression model

$$f_{\vec{w},b}(\vec{x}) = g(\vec{w} \cdot \vec{x} + b) = \frac{1}{1 + e^{-(\vec{w} \cdot \vec{x} + b)}}$$

Since the desired output lies in the interval **0 (no) to 1 (yes)**, we want a function whose output is also confined to 0-1:

  • Logistic function

First, consider the function on which logistic regression is built:

$$g(z) = \frac{1}{1 + e^{-z}}$$

$g(z)$ approaches 1 when $z$ is large and approaches 0 when $z$ is very negative.

*(figure: plot of the sigmoid curve $g(z)$)*
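A minimal sketch in NumPy (illustrative, not from the original post):

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))    # 0.5 -- the midpoint
print(sigmoid(10))   # ~0.99995, approaches 1 for large z
print(sigmoid(-10))  # ~0.000045, approaches 0 for very negative z
```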

  • Building the logistic regression algorithm

  1. Start from the linear regression model

Set the linear model equal to z:

*(figure: $z = \vec{w} \cdot \vec{x} + b$)*

  2. Substitute z into the logistic function

The resulting function is the logistic regression model:

*(figure: $f_{\vec{w},b}(\vec{x}) = g(\vec{w} \cdot \vec{x} + b)$)*

In this way, the output of the linear part is squashed into the range 0-1.
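Continuing the sketch above (reusing `sigmoid`; the function name is illustrative), the model applies the sigmoid to the linear part:

```python
def predict_proba(x, w, b):
    """Logistic regression model: f_{w,b}(x) = g(w . x + b)."""
    z = np.dot(w, x) + b   # the linear-regression part
    return sigmoid(z)      # squashed into the open interval (0, 1)
```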

Example: interpreting the model output

For a patient, x is the tumor size and the output y is the probability (0-1) that the tumor is malignant. In the example below, the patient has a 70% probability of disease.

*(figure: model output of 0.7 for a given tumor size)*

Using notation from probability theory, this is usually written as:

$$f_{\vec{w},b}(\vec{x}) = P(y = 1 \mid \vec{x};\ \vec{w}, b)$$

which reads as the probability that y = 1 given x, with parameters w and b.

Decision boundary

If we only want a hard 0/1 answer, we can set a threshold.

For example:

*(figure: sigmoid with the 0.5 threshold marked)*

$$\hat{y} = \begin{cases} 1, & \text{if } f_{\vec{w},b}(\vec{x}) \ge 0.5 \\ 0, & \text{if } f_{\vec{w},b}(\vec{x}) < 0.5 \end{cases}$$
  • Definition

When f(x) = 0.5, it means w·x + b = 0 (as shown below):

*(figure: sigmoid curve crossing 0.5 at z = 0)*

At that point the model assigns equal probability to y = 0 and y = 1.
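A thresholded prediction, as a sketch on top of `predict_proba` above:

```python
def predict(x, w, b, threshold=0.5):
    """Hard 0/1 prediction by thresholding the predicted probability."""
    return 1 if predict_proba(x, w, b) >= threshold else 0
```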

What does such a boundary look like in practice? Consider the following example:

Example: the training set is shown below, where x marks label 1 and o marks label 0. Suppose the model has two features, with w1 = 1, w2 = 1, b = -3.

*(figure: training-set scatter plot and the two-feature model $f_{\vec{w},b}(\vec{x}) = g(w_1 x_1 + w_2 x_2 + b)$)*

The decision boundary can be expressed as:

$$z = \vec{w} \cdot \vec{x} + b = 0$$

Substituting $w_1 = 1$, $w_2 = 1$, $b = -3$:

$$x_1 + x_2 - 3 = 0 \quad\Longrightarrow\quad x_1 + x_2 = 3$$

The purple line in the figure below is the decision boundary:

*(figure: the line $x_1 + x_2 = 3$ separating the two classes)*
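Checking the boundary numerically with the sketch functions above (values taken from the example):

```python
w = np.array([1.0, 1.0])   # w1 = 1, w2 = 1
b = -3.0
# Points above the line x1 + x2 = 3 are classified 1, points below it 0:
print(predict(np.array([2.0, 2.0]), w, b))  # x1 + x2 = 4 > 3 -> 1
print(predict(np.array([1.0, 1.0]), w, b))  # x1 + x2 = 2 < 3 -> 0
```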

That covers how the logistic regression model makes predictions. Next, we look at how to train it:

  • Build the cost function
  • Implement the gradient descent algorithm

2. Cost function in logistic regression

Cost Function Implementation in Logistic Regression

If we reuse the squared-error cost function from linear regression:

$$J(\vec{w},b) = \frac{1}{2m}\sum_{i=1}^{m}\left(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\right)^2$$

then, with the sigmoid inside, the cost surface has many local minima, which makes it inconvenient for gradient descent:

*(figure: non-convex squared-error cost with many local minima)*

  • Optimize the cost function:

Move the 1/2 inside the summation, so the cost becomes the average of a per-example loss; then choose a loss that makes the cost convex, which guarantees gradient descent can reach the global minimum.

*(figure: convex logistic cost curve)*

  • Loss function L

The loss tells us how well the model does on a single training example.

The loss is defined piecewise on the true label y:

*(figure: piecewise definition of the logistic loss)*

Understanding the two cases:

If the value of y is 1, the loss is:

$$-\log(f), \qquad 0 < f < 1$$

That is, when the model outputs f(x) for an example whose true label is y = 1, this curve gives the loss: the closer f is to 1, the smaller the loss.

*(figure: plot of $-\log(f)$ on $0 < f < 1$)*

If the value of y is 0, the loss is:

$$-\log(1-f), \qquad 0 < f < 1$$

That is, when the model outputs f(x) for an example whose true label is y = 0, this curve gives the loss: the closer f is to 0, the smaller the loss.

*(figure: plot of $-\log(1-f)$ on $0 < f < 1$)*
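A sketch of the per-example loss, continuing the earlier NumPy code:

```python
def loss(f, y):
    """Per-example logistic loss: -log(f) if y == 1, else -log(1 - f)."""
    return -np.log(f) if y == 1 else -np.log(1.0 - f)

print(loss(0.9, 1))  # ~0.105: confident and correct, small loss
print(loss(0.9, 0))  # ~2.303: confident and wrong, large loss
```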

Once the loss is built, logistic regression reduces to finding the w and b that minimize J(w, b).


Simplify the cost function

*(figure: piecewise loss $L(f_{\vec{w},b}(\vec{x}^{(i)}), y^{(i)})$)*

The above formula can be simplified into one formula:

*(figure: $L = -y^{(i)}\log(f) - (1-y^{(i)})\log(1-f)$)*

Then substitute the above formula into the cost function

*(figure: the loss substituted into $J(\vec{w},b) = \frac{1}{m}\sum_{i=1}^{m} L$)*

This yields the final form:

$$J(\vec{w},b) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\left(f_{\vec{w},b}(\vec{x}^{(i)})\right) + \left(1-y^{(i)}\right)\log\left(1-f_{\vec{w},b}(\vec{x}^{(i)})\right)\right]$$

This cost function follows from maximum likelihood estimation; interested readers can look up the derivation.
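A vectorized sketch of this cost (assuming X is an (m, n) feature matrix and y an (m,) vector of 0/1 labels; names are illustrative):

```python
def cost(X, y, w, b):
    """Cross-entropy cost J(w, b), averaged over the m training examples."""
    f = sigmoid(X @ w + b)  # predictions for every row of X at once
    return -np.mean(y * np.log(f) + (1 - y) * np.log(1 - f))
```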


3. Implement Gradient Descent

Through the gradient descent algorithm, we can find the w and b that minimize J(w, b):

*(figure: simultaneous updates $w_j := w_j - \alpha \frac{\partial J}{\partial w_j}$ and $b := b - \alpha \frac{\partial J}{\partial b}$)*

The partial derivatives of the above formulas are calculated separately:

*(figure: $\frac{\partial J}{\partial w_j} = \frac{1}{m}\sum_{i=1}^{m}\left(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\right)x_j^{(i)}$ and $\frac{\partial J}{\partial b} = \frac{1}{m}\sum_{i=1}^{m}\left(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\right)$)*

Substituting partial derivatives, the following formula is obtained:

*(figure: the update rules with the derivatives substituted in)*

**Note:** the updates look identical to those of linear regression, but here $f_{\vec{w},b}$ is different:

$$f_{\vec{w},b}(\vec{x}) = \frac{1}{1 + e^{-(\vec{w} \cdot \vec{x} + b)}}$$
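A sketch of the full training loop, applying the updates above (step size and iteration count are arbitrary choices):

```python
def gradient_descent(X, y, w, b, alpha=0.1, iters=1000):
    """Batch gradient descent for logistic regression."""
    m = X.shape[0]
    for _ in range(iters):
        err = sigmoid(X @ w + b) - y     # f(x^(i)) - y^(i) for every example
        w = w - alpha * (X.T @ err) / m  # dJ/dw_j = (1/m) * sum(err * x_j)
        b = b - alpha * np.sum(err) / m  # dJ/db   = (1/m) * sum(err)
    return w, b
```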


4. Overfitting problem

Introducing overfitting

  • Terminology
  • Generalization: the ability of the model to make good predictions on unseen samples
  • Underfitting (underfit)

The following examples use a housing-price model.

A model that fails to fit even the training data well has a systematic error, so it is also called high bias:

*(figure: an underfit model on the housing-price data)*
  • Overfitting (overfit)

A model that fits the training data too closely and loses predictive power on new data is also called high variance:

*(figure: an overfit model on the housing-price data)*

So, we should look for a model that has neither high bias nor high variance

Solving the overfitting problem

*(figure: overview of remedies for overfitting)*

  1. Collect more training data

*(figure: a larger training set smoothing out the fit)*

  2. Check whether fewer features (x1, x2, x3, ...) can be used

*(figure: selecting a subset of the features)*

Disadvantage: some important features may be discarded

  3. Regularization

Shrink the parameter values as much as possible: keep all the features, but prevent any feature weight from becoming too large.

*(figure: a regularized fit that keeps all features with smaller weights)*

Regularization

Regularization penalizes only w, not b (b has little influence on the overall fit).

  • Principle

Suppose a quadratic model fits the data well:

*(figure: quadratic model fitting the data)*

but a fourth-order model overfits.

We need to shrink w3 and w4 to keep the model from overfitting:

*(figure: fourth-order model overfitting the data)*

So we can append penalty terms for w3 and w4 to the cost function:

To minimize the cost function, w3 and w4 are then forced to take small values.

*(figure: cost function with added penalty terms for w3 and w4)*

*(figure: the penalized cost driving w3 and w4 toward 0)*
  • Formula

λ: the regularization parameter, which must be chosen manually.

*(figure: regularized cost $J(\vec{w},b) = \frac{1}{2m}\sum_{i=1}^{m}\left(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\right)^2 + \frac{\lambda}{2m}\sum_{j=1}^{n} w_j^2$)*

λ cannot be too large (all weights are driven toward 0 and the model underfits), nor too small (the penalty has no effect).
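A sketch of the penalty term itself, to see how λ scales it (values are arbitrary):

```python
def l2_penalty(w, lam, m):
    """Regularization term (lambda / (2m)) * sum(w_j^2); b is not included."""
    return (lam / (2 * m)) * np.sum(w ** 2)

w = np.array([0.5, -2.0, 3.0])
print(l2_penalty(w, lam=1.0, m=10))     # 0.6625 -- mild penalty
print(l2_penalty(w, lam=1000.0, m=10))  # 662.5  -- dominates the cost
```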

1. Regularized linear regression

Formula:

*(figure: regularized linear-regression cost function)*

Substituting into the gradient descent updates (regularize only w):

*(figure: $w_j := w_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}\left(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m} w_j\right]$ and $b := b - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\right)$)*
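A sketch of one regularized update step for linear regression (note: no sigmoid in the model here):

```python
def step_linear_reg(X, y, w, b, alpha, lam):
    """One regularized gradient step; the (lam / m) * w term shrinks each
    weight slightly on every step, while b is updated without the penalty."""
    m = X.shape[0]
    err = X @ w + b - y  # plain linear model
    w = w - alpha * ((X.T @ err) / m + (lam / m) * w)
    b = b - alpha * np.sum(err) / m
    return w, b
```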

2. Regularized logistic regression

An overfit logistic regression decision boundary:

*(figure: an overly complex decision boundary on the training set)*

Formula:

*(figure: regularized logistic-regression cost function)*

Substituting into the gradient descent updates (regularize only w):

*(figure: the same update rules as regularized linear regression)*

**Note:** the updates again look identical to linear regression, but here $f_{\vec{w},b}$ is the sigmoid model:

$$f_{\vec{w},b}(\vec{x}) = \frac{1}{1 + e^{-(\vec{w} \cdot \vec{x} + b)}}$$
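The corresponding sketch for logistic regression; only the model output changes:

```python
def step_logistic_reg(X, y, w, b, alpha, lam):
    """One regularized gradient step for logistic regression."""
    m = X.shape[0]
    err = sigmoid(X @ w + b) - y  # the only difference from the linear case
    w = w - alpha * ((X.T @ err) / m + (lam / m) * w)
    b = b - alpha * np.sum(err) / m
    return w, b
```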
