Learning Classification in Machine Learning

  This is the third article in the "Mathematics of Machine Learning" column. After briefly introducing why machine learning exists and what it can do, and after covering regression, this article summarizes classification in machine learning.

1. Set up the model

To make the problem easy to describe, images are classified by their size into two categories, landscape and portrait, so this is a binary (two-class) classification problem.

1.1 Data

Here, 6 images are used as training data, as shown in the following table.

Width   Height   Shape
80      150      portrait
60      110      portrait
35      130      portrait
160     50       landscape
160     20       landscape
125     30       landscape

1.2 Graphical description

Plot the 6 data points above in a two-dimensional coordinate system, as shown in the figure below.
data graphics
Q1: How to classify?
A1: Our goal is to find a straight line that separates the existing landscape and portrait data into two parts. From the figure above it is easy to see that a straight line can separate the two types of data; suppose it is the line shown in the figure below.
Data Graphics 2
Q2: How to find this straight line?
A2: In the regression part, a line was found by determining its slope and intercept (for details, see the article on regression in machine learning), but the classification part works differently. Here we work with vectors and introduce the weight vector $w$; the goal can now be stated more precisely: find the line that has the weight vector as its normal vector.

Q3: What is a weight vector?
A3: The weight vector is the parameter we are looking for, playing the same role as the parameter $\theta$ in regression. Once the weight vector is known, the classification line is known, so we need to find the most suitable parameter vector $w$.

Q4: What is the mathematical relationship between the weight vector and the classification line we want?
A4: $w \cdot x = 0$; that is, every vector $x$ lying on the line is perpendicular to the weight vector. The weight vector is shown in the figure below.
weight vector

2. Inner product

Next, we expand on the calculation of $w \cdot x = 0$ in the context of this problem.

  1. $w \cdot x = \sum_{i=1}^{n} w_i x_i = 0$. In this problem only the width and height of the image are considered, so it is a two-dimensional problem, and the expression can be written as $w \cdot x = w_1 x_1 + w_2 x_2 = 0$.

    Suppose $w_1 = 1$ and $w_2 = 1$. Then the expression is $x_1 + x_2 = 0$, i.e. $x_1 = -x_2$. Therefore, as long as the weight vector is known, the classification line is found. The resulting line and weight vector are shown below:
    weight vector 3
    There are infinitely many vectors perpendicular to $w$, and together they form a straight line. This is what "finding the line that has the weight vector as its normal vector" means.

  2. $w \cdot x = |w| \cdot |x| \cdot \cos\theta = 0$. Since $|w|$ and $|x|$ are non-negative, if the result is 0 then $\theta = 90^\circ$ (or $270^\circ$); if the result is negative then $90^\circ < \theta < 270^\circ$; and if the result is positive then $0^\circ < \theta < 90^\circ$ or $270^\circ < \theta < 360^\circ$.
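As a quick numerical check of these two forms of the inner product, here is a minimal NumPy sketch; the specific vectors are chosen only for illustration:

```python
import numpy as np

w = np.array([1.0, 1.0])   # the weight vector from the example above
x = np.array([2.0, -2.0])  # a point on the line x1 = -x2, chosen for illustration

dot = np.dot(w, x)                                         # w . x = sum_i w_i * x_i
cos_theta = dot / (np.linalg.norm(w) * np.linalg.norm(x))  # w . x = |w||x|cos(theta)

print(dot)                                # 0.0 -> the point lies on the classification line
print(np.degrees(np.arccos(cos_theta)))   # 90.0 -> the angle to the weight vector is 90 degrees
```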

3. Perceptron

Q5: What is a perceptron?
A5: A perceptron is a model that receives multiple input variables and outputs the sum of the products of each variable and its respective weight. Its graphical representation is as follows:
Perceptron
   The perceptron model above is also called a simple perceptron or a single-layer perceptron.

Q6: What are the advantages and disadvantages of the perceptron?
A6: The advantage is that the principle is simple, and it is the model underlying deep learning and neural networks. The disadvantage is also obvious: it can only solve linearly separable problems (see Part 4 for details). In terms of the expression, $x$ can only appear to the first power, and most real problems are multidimensional and linearly inseparable, so the perceptron is rarely used in practice.

Q7: Why introduce the perceptron?
A7: The perceptron computes the products of multiple inputs with their weights and outputs based on their sum, which matches the goal of our classification algorithm (obtaining the classification line from the weight vector).

Here we use the perceptron model to find the target straight line. Before that, the training data needs to be prepared.

3.1 Prepare training data

   Classification is a method that uses labeled data (data whose inputs and desired outputs are known) to classify unknown data. Here the width of the image is $x_1$, the height is $x_2$, and the label $y$ is 1 for landscape images and -1 for portrait images. The training data is then as follows:

image size   shape       x1 (width)   x2 (height)   y
80x150       portrait    80           150           -1
60x110       portrait    60           110           -1
35x130       portrait    35           130           -1
160x50       landscape   160          50            1
160x20       landscape   160          20            1
125x30       landscape   125          30            1

   The next step is to use the parameter vector to judge whether an image is landscape or portrait. A function that judges data in this way is called a discriminant function.
   Let's first understand it graphically. In the figure below, the portrait images lie to the upper left of the line, so for data $(x_1, x_2)$ in that region the angle with the weight vector is less than 90°, i.e. $w \cdot x^{(i)} \ge 0$, as shown in the figure below.
discriminant function
  Similarly, the landscape images lie below the line, and the angle between them and the weight vector is greater than 90° and less than 270°, i.e. $w \cdot x^{(i)} < 0$. The discriminant function can therefore be written as:
$$f_w(x^{(i)}) = \begin{cases} 1 & w \cdot x^{(i)} \ge 0 \\ -1 & w \cdot x^{(i)} < 0 \end{cases}$$

3.2 Update expression of weight vector

  With the discriminant function, we can update the parameter $w$ according to the value the discriminant function returns and the true label, in order to find a $w$ that classifies the data well. The conclusion first; the update expression for the weight vector is:
$$w := \begin{cases} w + y^{(i)} \cdot x^{(i)} & f_w(x^{(i)}) \ne y^{(i)} \\ w & f_w(x^{(i)}) = y^{(i)} \end{cases}$$
   Here $f_w(x^{(i)}) \ne y^{(i)}$ means that the result of the discriminant function $f_w(x^{(i)})$ differs from the true label $y^{(i)}$, and $f_w(x^{(i)}) = y^{(i)}$ means the two agree. Looking at the formula again, $w$ is updated only when the two disagree; when they agree, $w$ stays unchanged.

Q8: Why, when the two results disagree, is the update formula for $w$ given by $w + y^{(i)} \cdot x^{(i)}$?
A8: Let's look at it graphically. The weight vector is randomly determined at the beginning (just as the initial value of $\theta$ in regression was random); suppose it is the weight vector in the figure below.
Weight Vector Update 1
   The target classification line can be obtained from the weight vector, as shown in the figure below.
Weight Vector Update 2
   Now the data point (80, 150) is evaluated with the discriminant function defined by this weight vector; the point (80, 150) is located as shown in the figure.
Update of the weight vector 1
   The data is also represented as a vector, as shown in the figure below.
Weight Vector Update 2
   For the data point (80, 150):

  1. The discriminant function $f_w(x^{(i)})$ depends on $w \cdot x^{(i)}$; since the angle between the two vectors is less than 90°, $f_w(x^{(i)}) = 1$.
  2. Since it is a portrait image, the true label is $y^{(i)} = -1$.

    The classification result therefore differs from the true label, so the weight vector needs to be updated. When the two differ, the update expression is $w := w + y^{(i)} \cdot x^{(i)}$. Because $y^{(i)} = -1$, this becomes $w := w - x^{(i)}$, a vector subtraction, which can be represented in the figure as:
    Weight Vector Update 3
    Using the parallelogram rule, the red vector in the figure is the updated weight vector.
    For the data point (80, 150) again:
  3. The discriminant function $f_w(x^{(i)})$ depends on $w \cdot x^{(i)}$; since the angle between the two vectors is now greater than 90°, $f_w(x^{(i)}) = -1$.
  4. Since it is a portrait image, the true label is $y^{(i)} = -1$.

    The updated classification result now matches the true label, so there is no need to update the weight vector; we then move on to the next data point. A method like this, which repeatedly updates the parameters over all the data, is the perceptron model.
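To make the procedure concrete, here is a minimal sketch of the perceptron update loop in Python, using the six labeled images from the table above. The random initialization, the number of passes, and the variable names are illustrative choices, not part of the original article.

```python
import numpy as np

# Training data from the table above:
# x = (width x1, height x2), y = 1 for landscape, -1 for portrait
X = np.array([[80, 150], [60, 110], [35, 130],
              [160, 50], [160, 20], [125, 30]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1])

def f(w, x):
    """Discriminant function: 1 if w . x >= 0, otherwise -1."""
    return 1 if np.dot(w, x) >= 0 else -1

rng = np.random.default_rng(0)
w = rng.normal(size=2)            # the weight vector starts out random

for _ in range(100):              # 100 passes is more than enough for this tiny data set
    for x_i, y_i in zip(X, y):
        if f(w, x_i) != y_i:      # update only when the prediction is wrong
            w = w + y_i * x_i     # w := w + y^(i) * x^(i)

print(w)                          # learned weight vector (normal vector of the line)
print([f(w, x_i) for x_i in X])   # should match y once the loop has converged
```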

4. Linearly separable

   Linearly separable means that the data can be separated with a straight line. The perceptron model can only solve linearly separable problems, which limits its range of use, because real problems are mostly affected by many variables and are usually not linearly separable. This leads to a question: how do we solve linearly inseparable problems?

5. Logistic regression

  When studying regression, we defined an objective function, differentiated it to obtain the parameter update rule, updated the parameters with the data, and then used the learned parameters to make predictions for new data $x$. Logistic regression applies the same approach to classification; it is a method different from the perceptron and interprets the classification result from a probability perspective. For ease of understanding, for now only the height and width of the image are considered, dividing images into the portrait and landscape categories.

5.1 sigmoid function

  First we introduce the sigmoid function:
$$f_\theta(x) = \frac{1}{1 + \exp(-\theta^T x)}$$
  Here $\exp(-\theta^T x)$ is just $e^{-\theta^T x}$; since $\theta^T x$ written as a superscript can be hard to read, the $\exp$ form is used. The graph of the sigmoid function is shown below:
sigmoid
  From the figure, two main characteristics of the sigmoid function can be observed:
    1. When $\theta^T x = 0$, $f_\theta(x) = 0.5$.
    2. The range of $f_\theta(x)$ is $(0, 1)$.
  The second property makes the sigmoid function well suited to being treated as a probability.
  In the perceptron part, a discriminant function was used to decide the category of the data; here, that means deciding whether an image is portrait or landscape. For ease of handling, the width is set to $x_1$, the height to $x_2$, and the label is 1 for landscape and 0 for portrait (unlike the perceptron, where portrait was -1; the labels can be chosen flexibly for convenience). With logistic regression there is a similar classification function, but it works from the probability perspective: if an image is, say, 70% likely to be landscape, it should be classified as landscape. The discriminant/classification function is therefore:
$$y = \begin{cases} 1 & f_\theta(x) \ge 0.5 \\ 0 & f_\theta(x) < 0.5 \end{cases}$$
  From the graph of the sigmoid function, $f_\theta(x) \ge 0.5$ exactly when $\theta^T x \ge 0$, so the discriminant/classification function can also be written as:
$$y = \begin{cases} 1 & \theta^T x \ge 0 \\ 0 & \theta^T x < 0 \end{cases}$$
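A small sketch of the sigmoid function and this classification rule; the function names and the sample values are my own, for illustration:

```python
import numpy as np

def sigmoid(z):
    # f_theta(x) = 1 / (1 + exp(-theta^T x)), written here in terms of z = theta^T x
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))                     # 0.5  -> property 1: theta^T x = 0 gives 0.5
print(sigmoid(-10.0), sigmoid(10.0))    # close to 0 and close to 1 -> the range is (0, 1)

# Classification rule: 1 (landscape) if f_theta(x) >= 0.5, otherwise 0 (portrait)
print(1 if sigmoid(0.7) >= 0.5 else 0)  # 1
```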

5.2 Decision boundary

  Since $f_\theta(x)$ is treated as a probability, we define the probability that unknown data $x$ is a landscape image as $f_\theta(x)$, that is, $f_\theta(x) = P(y = 1 \mid x)$.
  With the discriminant function in hand, we can plug in concrete values to get a feel for it. First determine $\theta$ randomly, say:
$$\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \end{bmatrix} = \begin{bmatrix} -100 \\ 2 \\ 1 \end{bmatrix}, \qquad x = \begin{bmatrix} 1 \\ x_1 \\ x_2 \end{bmatrix}$$
  Then $\theta^T x = -100 + 2x_1 + x_2$. If the image is landscape, $f_\theta(x) \ge 0.5$, so $\theta^T x \ge 0$, i.e. $-100 + 2x_1 + x_2 \ge 0$, i.e. $x_2 \ge -2x_1 + 100$. The classification line is shown in the figure below.
classification line
  As can be seen from the figure, the data is classified with $\theta^T x = 0$ as the boundary; a straight line like this is called a decision boundary. This classification line cannot classify all the data correctly, because the parameter $\theta$ was chosen randomly. Therefore, to find the right parameters, we need to define an objective function for the parameters and differentiate it to obtain the parameter update rule.
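Using the randomly chosen $\theta$ above, a small sketch can check on which side of the decision boundary a given image falls; the sample image sizes are made up for illustration:

```python
import numpy as np

theta = np.array([-100.0, 2.0, 1.0])    # [theta_0, theta_1, theta_2] from the example above

def classify(width, height):
    x = np.array([1.0, width, height])  # x_0 = 1 for the bias term
    return 1 if np.dot(theta, x) >= 0 else 0   # 1 = landscape, 0 = portrait

print(classify(200, 30))   # theta^T x = -100 + 400 + 30 > 0  -> 1 (landscape)
print(classify(20, 40))    # theta^T x = -100 + 40 + 40 < 0   -> 0 (portrait)
```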

6. Likelihood function

  The objective function here is different from the error function of the least-squares method in regression. The error function came from making the error between the discriminant function's output and the true value as small as possible. What, then, is the most ideal result in classification? Consider the conditional probability $P(y = 1 \mid x)$, the probability that data $x$ is classified as landscape. The ideal result is:
    when the image is landscape, i.e. $y = 1$, we want $P(y = 1 \mid x)$ to be as large as possible;
    when the image is portrait, i.e. $y = 0$, we want $P(y = 0 \mid x)$ to be as large as possible.
  The existing data is shown in the table below:

image size   shape       y    probability
80x150       portrait    0    P(y=0 | x)
60x110       portrait    0    P(y=0 | x)
35x130       portrait    0    P(y=0 | x)
160x50       landscape   1    P(y=1 | x)
160x20       landscape   1    P(y=1 | x)
125x30       landscape   1    P(y=1 | x)

  Therefore, over all the data, we want the overall probability, i.e. the joint probability, to be as large as possible:
$$L(\theta) = P(y^{(1)} \mid x^{(1)}) \cdot P(y^{(2)} \mid x^{(2)}) \cdots P(y^{(6)} \mid x^{(6)})$$
  The $L$ in $L(\theta)$ comes from "likelihood", and $L(\theta)$ is called the likelihood function. Its generalized form is:
$$L(\theta) = \prod_{i=1}^{n} P(y^{(i)}=1 \mid x^{(i)})^{y^{(i)}} \cdot P(y^{(i)}=0 \mid x^{(i)})^{1-y^{(i)}}$$
Q9: What do the exponents $y^{(i)}$ and $1 - y^{(i)}$ in the formula mean?
A9: The two are exponents that select the right factor. To understand them, consider the case $y^{(i)} = 1$:
$$P(y^{(i)}=1 \mid x^{(i)})^{y^{(i)}} \cdot P(y^{(i)}=0 \mid x^{(i)})^{1-y^{(i)}} = P(y^{(i)}=1 \mid x^{(i)})$$
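Similarly, when $y^{(i)} = 0$, the exponents select the other factor:
$$P(y^{(i)}=1 \mid x^{(i)})^{0} \cdot P(y^{(i)}=0 \mid x^{(i)})^{1-0} = P(y^{(i)}=0 \mid x^{(i)})$$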
  Now our objective function is the likelihood function, and our goal is to maximize its value. We differentiate the objective function: if the derivative is negative, the parameter is moved to the left (decreased); if the derivative is positive, the parameter is moved to the right (increased); in this way we search for the best parameters. This is exactly the opposite of regression, where the parameters were updated in the direction that minimizes the objective function.
  But working with the likelihood function directly has two problems. First, all the probabilities are less than 1, so the joint probability gets smaller and smaller as they are multiplied, and numerical precision problems appear. Second, computers take longer to multiply than to add. Therefore, we take the logarithm of the likelihood function.
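A quick sketch of the numerical issue; the probabilities here are made up purely for illustration:

```python
import numpy as np

p = np.full(1000, 0.01)      # 1000 small probabilities, made up for illustration

print(np.prod(p))            # 0.0 -- the raw product underflows to zero
print(np.sum(np.log(p)))     # about -4605.2 -- the log form stays perfectly usable
```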

7. Log-likelihood function

Q10: Can we take the logarithm of the likelihood function without affecting the final result?
A10: The logarithmic function is monotonically increasing, so if $L(\theta_1) < L(\theta_2)$, then $\log L(\theta_1) < \log L(\theta_2)$; therefore maximizing $L(\theta)$ is the same as maximizing $\log L(\theta)$. Now let's transform $\log L(\theta)$.
log-likelihood function
  The transformation above mainly uses the basic properties of logarithms; the fourth line uses the fact that the two probabilities sum to 1, i.e. $P(y^{(i)}=0 \mid x^{(i)}) = 1 - P(y^{(i)}=1 \mid x^{(i)})$; and the fifth line uses our earlier definition $f_\theta(x) = P(y=1 \mid x)$ (see Section 5.2, Decision boundary).
  So our objective function is the log-likelihood function:
$$\log L(\theta) = \sum_{i=1}^{n}\Bigl( y^{(i)} \log f_\theta(x^{(i)}) + (1 - y^{(i)}) \log\bigl(1 - f_\theta(x^{(i)})\bigr) \Bigr)$$
  Next, we differentiate the objective function to find the update expression for the parameter $\theta$.
  Differentiation of the log-likelihood function:
differential 1
  Let $u = \log L(\theta)$ and $v = f_\theta(x)$, as follows:
Differential 2
  Let $z = \theta^T x$; the derivative of the sigmoid function is known, namely $\frac{d\sigma(x)}{dx} = \sigma(x)\bigl(1 - \sigma(x)\bigr)$:
Differential 3
  Finally, the differential of each part is combined and simplified, as follows:
Differential 4
  Since we are looking for a maximum, the parameters should move in the same direction as the sign of the derivative. Therefore, the parameter update expression is:
$$\theta_j := \theta_j + \eta \sum_{i=1}^{n}\bigl(y^{(i)} - f_\theta(x^{(i)})\bigr)x_j^{(i)}$$
  To stay consistent with the parameter update expression of regression, the negative sign can be moved inside the sum, giving:
$$\theta_j := \theta_j - \eta \sum_{i=1}^{n}\bigl(f_\theta(x^{(i)}) - y^{(i)}\bigr)x_j^{(i)}$$
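Putting the pieces together, here is a minimal sketch of logistic regression trained with this update rule on the six images. The learning rate, the iteration count, and the variable names are my own illustrative choices; the original article does not specify them.

```python
import numpy as np

# Training data: each row is [x0 = 1, x1 = width, x2 = height]; y = 1 landscape, 0 portrait
X = np.array([[1, 80, 150], [1, 60, 110], [1, 35, 130],
              [1, 160, 50], [1, 160, 20], [1, 125, 30]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

def f(theta, X):
    """Sigmoid of theta^T x for every row of X."""
    return 1.0 / (1.0 + np.exp(-X @ theta))

eta = 1e-5            # learning rate, kept small because the features are unscaled
theta = np.zeros(3)   # initial parameters

for _ in range(10000):
    # theta_j := theta_j - eta * sum_i (f_theta(x^(i)) - y^(i)) * x_j^(i)
    theta -= eta * X.T @ (f(theta, X) - y)

print(theta)
print((f(theta, X) >= 0.5).astype(int))   # predicted labels, ideally [0 0 0 1 1 1]
```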

8. Linear inseparability

  From the discussion so far, the perceptron model can only solve linearly separable problems, which makes it hard to apply to practical problems. Logistic regression, on the other hand, can be used to solve linearly inseparable problems; in the previous example, for ease of understanding, we still used the linearly separable problem of dividing images into landscape and portrait. Now consider the figure below:
linear inseparability 1
  The classification in the figure above is clearly not a linearly separable problem. So we can add $x_1^2$ to the training data, similar to increasing the degree of the polynomial in polynomial regression. The vectors $\theta$ and $x$ are then:
$$\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \\ \theta_3 \end{bmatrix} \qquad x = \begin{bmatrix} 1 \\ x_1 \\ x_2 \\ x_1^2 \end{bmatrix}$$
  Define $\theta^T x$ as $\theta^T x = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2$. Now assign specific values to $\theta$, say:
$$\theta = \begin{bmatrix} 0 \\ 0 \\ 1 \\ -1 \end{bmatrix}$$
   Then $\theta^T x = x_2 - x_1^2$, and the decision boundary is $\theta^T x = x_2 - x_1^2 = 0$, i.e. $x_2 = x_1^2$, as shown in the figure below:
linear inseparability 2
  As can be seen from the figure, the decision boundary becomes a curve. Since the parameters were set randomly, the data is not yet classified perfectly; the parameters can be learned with gradient descent or stochastic gradient descent, as in regression. The higher the degree of the polynomial terms, the more complex the curves that can be represented and the more complex the problems that can be solved.
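A small sketch of this idea, using the $\theta$ values above together with the extra $x_1^2$ feature; the sample points are made up for illustration:

```python
import numpy as np

theta = np.array([0.0, 0.0, 1.0, -1.0])   # [theta_0, theta_1, theta_2, theta_3] from above

def features(x1, x2):
    # augmented feature vector [1, x1, x2, x1^2]
    return np.array([1.0, x1, x2, x1 ** 2])

def classify(x1, x2):
    # decision boundary: theta^T x = x2 - x1^2 = 0, i.e. the parabola x2 = x1^2
    return 1 if np.dot(theta, features(x1, x2)) >= 0 else 0

print(classify(2.0, 5.0))   # x2 = 5 > x1^2 = 4 -> 1 (above the parabola)
print(classify(2.0, 3.0))   # x2 = 3 < x1^2 = 4 -> 0 (below the parabola)
```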

Original article: blog.csdn.net/weixin_43943476/article/details/121159134