Machine Learning Notes (5) Logistic Regression, Decision Boundary, OvR, OvO

1. Classification problem

There are three major classes of problems in machine learning: regression, classification and clustering. Linear regression belongs to regression tasks, while logistic regression and the k-nearest neighbors algorithm belong to classification tasks. Logistic regression is mainly used for classification problems, such as classifying email as spam or not spam, or judging whether a tumor is malignant or not. In binary classification problems, we usually use 1 to represent the positive class and 0 to represent the negative class.

2. Logistic regression

Logistic regression is a generalized linear model, so it has a lot in common with linear regression. Their model forms are basically the same, both of the form $\hat y = wx + b$, where $w$ and $b$ are the parameters to be learned. The difference is the range of the output: linear regression outputs values in $(-\infty, +\infty)$, while logistic regression outputs values in $[0, 1]$. Logistic regression can be regarded as both a regression algorithm and a classification algorithm; it is usually used as a classifier and, by itself, can only solve binary classification problems. So how does a regression model solve a classification problem? As a classification task, we associate the features of a sample with the probability that it belongs to the positive class, and a probability is a number in $[0, 1]$: if $\hat p \geq 0.5$ the sample is classified as the positive class 1, otherwise as 0, that is:
$$\hat p = f(x) \qquad \hat y = \begin{cases} 1, & \hat p \geq 0.5 \\ 0, & \hat p < 0.5 \end{cases}$$
So, since logistic regression is so closely tied to linear regression, how does it turn the unbounded output $(-\infty, +\infty)$ of linear regression into a probability value in $[0, 1]$? Take the linear regression hypothesis $\hat y = \theta^T x$ and require its output to satisfy $0 \leq \hat y \leq 1$; to achieve this, we introduce the sigmoid function:
$$\mathrm{sigmoid}(t) = \frac{1}{1 + e^{-t}}, \qquad t = \theta^T x$$
The graph of the sigmoid function is shown below. From the graph, the sigmoid function meets the basic requirements of logistic regression: its range is $(0, 1)$; when $t > 0$, $\hat p > 0.5$; and when $t < 0$, $\hat p < 0.5$. So the hypothesis function of logistic regression is:
$$\hat p = \mathrm{sigmoid}(\theta^T \cdot x) = \frac{1}{1 + e^{-\theta^T \cdot x}} \qquad \hat y = \begin{cases} 1, & \hat p \geq 0.5 \\ 0, & \hat p < 0.5 \end{cases}$$
[Figure: the sigmoid function curve]
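As a quick illustration, the hypothesis above can be sketched in a few lines of NumPy; here `x` is assumed to already include the constant feature 1 so that `theta` carries the intercept:

```python
import numpy as np

def sigmoid(t):
    """Map any real-valued score t to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-t))

def predict(x, theta):
    """p_hat = sigmoid(theta . x); classify as 1 when p_hat >= 0.5, else 0."""
    p_hat = sigmoid(np.dot(theta, x))
    return 1 if p_hat >= 0.5 else 0
```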

3. Loss function

  • Representation of the cost function

For a given sample data set $X$, $y$, how do we find the parameters $\theta$? As in the linear regression problem, finding a suitable $\theta$ so that the hypothesis function fits the training set well is essential, and that comes down to minimizing a loss function. So the question is: how should the loss function be defined? For logistic regression, we require the cost of a single sample to satisfy the following:
$$cost = \begin{cases} \text{if } y = 1, \text{ the smaller } \hat p \text{ is, the larger the cost} \\ \text{if } y = 0, \text{ the larger } \hat p \text{ is, the larger the cost} \end{cases}$$
At this point you may recall the $-\log(x)$ function from high school: on $(0, 1)$ it satisfies exactly these requirements, so the cost can be written as follows:
$$cost = \begin{cases} -\log(\hat p) & \text{if } y = 1 \\ -\log(1 - \hat p) & \text{if } y = 0 \end{cases} \iff cost = -y\log(\hat p) - (1 - y)\log(1 - \hat p)$$
According to the two cases $y = 1$ and $y = 0$, the curves of the cost look as follows:
[Figure: cost curves $-\log(\hat p)$ for $y = 1$ and $-\log(1 - \hat p)$ for $y = 0$]

It is easy to see that when $y = 1$, the term $-\log(\hat p)$ tends to $\infty$ as $\hat p \to 0$ and equals 0 at $\hat p = 1$ (no loss); when $y = 0$, the term $-\log(1 - \hat p)$ tends to $\infty$ as $\hat p \to 1$ and equals 0 at $\hat p = 0$. Averaging the cost over all $m$ samples, we obtain the cost function of logistic regression:
$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)}\log\left(\mathrm{sigmoid}(X_b^{(i)}\theta)\right) + \left(1 - y^{(i)}\right)\log\left(1 - \mathrm{sigmoid}(X_b^{(i)}\theta)\right) \right]$$
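A minimal NumPy sketch of this cost, reusing the `sigmoid` helper above; `X_b` is assumed to be the sample matrix with a leading column of ones and `y` the vector of 0/1 labels:

```python
def cost(theta, X_b, y, eps=1e-12):
    """Cross-entropy cost J(theta), averaged over the m samples."""
    p_hat = sigmoid(X_b.dot(theta))
    p_hat = np.clip(p_hat, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))
```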

  • Minimization of the cost function
Same as linear regression: after obtaining the cost function, we need to find its minimum, again using the gradient descent algorithm. The derivatives work out as follows:
$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m}\left(\mathrm{sigmoid}(X_b^{(i)}\theta) - y^{(i)}\right) X_j^{(i)}$$
$$\nabla J(\theta) = \frac{1}{m}\cdot\begin{bmatrix} \sum_{i=1}^{m}(\hat y^{(i)} - y^{(i)})\cdot X_0^{(i)} \\ \sum_{i=1}^{m}(\hat y^{(i)} - y^{(i)})\cdot X_1^{(i)} \\ \sum_{i=1}^{m}(\hat y^{(i)} - y^{(i)})\cdot X_2^{(i)} \\ \vdots \\ \sum_{i=1}^{m}(\hat y^{(i)} - y^{(i)})\cdot X_n^{(i)} \end{bmatrix} = \frac{1}{m}\cdot X_b^T\cdot\left(\mathrm{sigmoid}(X_b\theta) - y\right)$$
where
$$X_b = \begin{bmatrix} 1 & X_1^{(1)} & X_2^{(1)} & \cdots & X_n^{(1)} \\ 1 & X_1^{(2)} & X_2^{(2)} & \cdots & X_n^{(2)} \\ \vdots & & & & \vdots \\ 1 & X_1^{(m)} & X_2^{(m)} & \cdots & X_n^{(m)} \end{bmatrix} \qquad \theta = \begin{bmatrix} \theta_0 & \theta_1 & \theta_2 & \cdots & \theta_n \end{bmatrix}^T$$
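Putting the vectorized gradient above into a plain batch gradient-descent loop might look like the sketch below (the learning rate and iteration count are illustrative choices; `sigmoid` is the helper defined earlier):

```python
def gradient(theta, X_b, y):
    """Vectorized gradient: (1/m) * X_b^T . (sigmoid(X_b . theta) - y)."""
    return X_b.T.dot(sigmoid(X_b.dot(theta)) - y) / len(y)

def fit_logistic_regression(X, y, eta=0.1, n_iters=10_000):
    """Fit theta = [theta_0, ..., theta_n] by batch gradient descent."""
    X_b = np.c_[np.ones(len(X)), X]        # prepend the column of ones
    theta = np.zeros(X_b.shape[1])
    for _ in range(n_iters):
        theta -= eta * gradient(theta, X_b, y)
    return theta
```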

4. Decision boundary

The so-called decision boundary is the boundary that separates samples of different classes; it can be linear or nonlinear. Here is an example:
For $\mathrm{sigmoid}(t)$: when $t > 0$ the predicted probability is $\hat p > 0.5$ and the model classifies the sample as 1; when $t < 0$ the predicted probability is $\hat p < 0.5$ and the model classifies the sample as 0; so the two classes are separated at $t = 0$. For logistic regression $t = \theta^T \cdot x_b$, so the decision boundary is $\theta^T \cdot x_b = 0$. If $x$ has two features, this is $\theta_0 + \theta_1 x_1 + \theta_2 x_2 = 0 \Rightarrow x_2 = \dfrac{-\theta_0 - \theta_1 x_1}{\theta_2}$. Writing $x_2$ as $y$ and $x_1$ as $x$, the boundary is the line $y = \dfrac{-\theta_0 - \theta_1 x}{\theta_2}$, which we can draw directly:
[Figure: two-class scatter plot with the line $\theta_0 + \theta_1 x_1 + \theta_2 x_2 = 0$ as the linear decision boundary]
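Continuing the gradient-descent sketch above, the boundary line can be plotted directly for a two-feature dataset (`X` and `y` below are placeholders for such a dataset and its 0/1 labels):

```python
import matplotlib.pyplot as plt

theta = fit_logistic_regression(X, y)                # theta = [theta_0, theta_1, theta_2]
x1 = np.linspace(X[:, 0].min(), X[:, 0].max(), 100)
x2 = (-theta[0] - theta[1] * x1) / theta[2]          # the linear decision boundary

plt.scatter(X[y == 0, 0], X[y == 0, 1], label="class 0")
plt.scatter(X[y == 1, 0], X[y == 1, 1], label="class 1")
plt.plot(x1, x2, color="black", label="decision boundary")
plt.legend()
plt.show()
```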
Nonlinear decision boundary:
[Figure: examples of nonlinear decision boundaries]
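One common way to obtain such a nonlinear boundary is to add polynomial features before the logistic regression; a sketch with scikit-learn, reusing the placeholder `X`, `y` from above (degree 2 is just an illustrative choice):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

poly_log_reg = Pipeline([
    ("poly", PolynomialFeatures(degree=2)),   # adds x1^2, x1*x2, x2^2, ...
    ("log_reg", LogisticRegression()),
])
poly_log_reg.fit(X, y)   # the boundary in the original x1-x2 plane is now a curve
```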

5. OvR and OvO

By itself, the logistic regression algorithm can only solve binary classification problems, but with some strategies it can also be made to handle multi-class problems. There are two such strategies for turning a binary classifier into a multi-class classifier: OvR and OvO. Of course, these two methods are not specific to logistic regression; they are general schemes that work for almost any binary classification algorithm.

  • OvR(One vs Rest)

Let's first look at OvR (One vs Rest). The English name tells us what it does: one class against all the remaining classes. In some other machine learning textbooks or materials, OvR may be called OvA (One vs All); the two mean the same thing. OvR is the more accurate name, however, and it is also the name used in the scikit-learn documentation.
What does one class versus all the rest mean? Take the four-class task in the figure below as an example.
[Figure: a four-class data set shown as points of four different colors]
Here, points of four different colors represent four different categories. Obviously, the logistic regression algorithm, which can only solve binary problems, cannot be applied to such a four-class task directly. However, we can convert it into a binary problem: pick one of the categories, say the red category, and treat the remaining three categories together as the 'other' category. That is exactly the meaning of One vs Rest: the red category is the 'One', and all the remaining categories are the 'Rest'.

[Figure: the red class ('One') against the three remaining classes ('Rest')]
At this point, the four-class problem has been converted into a binary classification problem. Of course, this conversion is not done only for the red category; the corresponding binary problem is built for each of the other three categories as well, giving four binary classification tasks in total, as shown in the figure below. For a new sample, each of the four classifiers estimates the probability that the sample belongs to its own 'One' class, and the class with the highest estimated probability is taken as the prediction.
[Figure: the four One-vs-Rest binary classification tasks]
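scikit-learn exposes this strategy directly; a sketch using `OneVsRestClassifier` to wrap logistic regression (the iris data is used here only as a convenient multi-class example):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)                     # a 3-class example dataset
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X, y)
print(len(ovr.estimators_))                           # one binary classifier per class -> 3
print(ovr.predict(X[:5]))                             # predicted classes for a few samples
```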

  • OvO (One vs One)

OvO (One vs One): the English name tells us it means one versus one. Here we still use the same four-class task as an example, with each category represented by sample points of a different color.
[Figure: the same four-class data set]
OvO directly picks two of the categories each time. For example, pick out the red and blue categories here and train a binary classifier on just those two classes.
[Figure: binary classification between the red and blue classes]
At this point, the four-class problem has again been reduced to a binary classification problem. For the four categories this process can be repeated for every pair of categories: there are C(4, 2) = 6 different pairs in total, that is, 6 binary classification tasks.
[Figure: the six One-vs-One pairwise classification tasks]
For a new sample, each of the 6 binary classifiers predicts which of its two categories the sample belongs to; the 6 results then vote, and the category that receives the most votes is taken as the category of the new sample.
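The one-vs-one strategy is available in the same way through `OneVsOneClassifier`, which trains the C(k, 2) pairwise classifiers and lets them vote; a sketch on the same illustrative iris data:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier

X, y = load_iris(return_X_y=True)
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000))
ovo.fit(X, y)
print(len(ovo.estimators_))                           # C(3, 2) = 3 pairwise classifiers
```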

