Machine learning notes (2): the k-nearest neighbor algorithm (kNN) and linear regression (Linear Regression)

1. The k-nearest neighbor algorithm (kNN)

The k-nearest neighbor (kNN, k-NearestNeighbor) classification algorithm is one of the simplest methods in data-mining classification techniques. "k nearest neighbors" literally means the k closest samples: each sample can be represented by its k closest neighbors, and the nearest-neighbor algorithm classifies each record in the data set according to them.

1.1. Core idea

The core idea of the kNN algorithm is that if most of the k nearest neighbor samples of a sample in the feature space belong to a certain category, then the sample also belongs to that category and shares the characteristics of the samples in it. When making a classification decision, the method determines the category of the sample to be classified only from the categories of its nearest one or few samples, so the decision depends on a very small number of neighboring samples. Because kNN relies mainly on these limited surrounding samples rather than on discriminating class regions, it is better suited than other methods to sample sets whose class regions cross or overlap heavily.

1.2. Algorithm process

In general, the kNN classification algorithm includes the following four steps:
① Prepare the data and preprocess it.
② Calculate the distance from the test sample point (the point to be classified) to every other sample point.
③ Sort the distances and select the K points with the smallest distance.
④ Compare the categories of those K points and, by majority vote, assign the test sample point to the category with the highest proportion among the K points.
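
These four steps translate almost directly into code. Below is a minimal NumPy sketch (not from the original notes; the function name `knn_classify` and the Euclidean distance are assumptions for illustration, and the training data is assumed to be NumPy arrays):

```python
import numpy as np
from collections import Counter

def knn_classify(k, X_train, y_train, x):
    """Classify one point x by a majority vote among its k nearest training samples."""
    # Step 2: distance from x to every training sample (Euclidean distance assumed)
    distances = np.sqrt(np.sum((X_train - x) ** 2, axis=1))
    # Step 3: indices of the K smallest distances
    nearest = np.argsort(distances)[:k]
    # Step 4: majority vote among the labels of the K neighbors
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]
```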

1.3. Definition of distance

  • Euclidean distance
    $\sqrt{\sum\limits_{i=1}^{n}(X_i^{(a)}-X_i^{(b)})^2}$
  • Manhattan distance
    $\sum\limits_{i=1}^{n}|X_i^{(a)}-X_i^{(b)}|$
  • Minkowski distance
    $\left(\sum\limits_{i=1}^{n}|X_i^{(a)}-X_i^{(b)}|^p\right)^{\frac{1}{p}}$
  • Cosine similarity (Cosine Similarity)
  • Adjusted cosine similarity (Adjusted Cosine Similarity)
  • Pearson correlation coefficient (Pearson Correlation Coefficient)
  • Jaccard similarity coefficient (Jaccard Coefficient)
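
A minimal NumPy sketch of the first three distances (the function names are only illustrative; `a` and `b` are assumed to be feature vectors of equal length):

```python
import numpy as np

def euclidean(a, b):
    # square root of the sum of squared coordinate differences
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    # sum of absolute coordinate differences
    return np.sum(np.abs(a - b))

def minkowski(a, b, p=2):
    # generalizes the two above: p=1 gives Manhattan, p=2 gives Euclidean
    return np.sum(np.abs(a - b) ** p) ** (1 / p)
```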

1.4. Advantages and disadvantages

Advantages
The kNN method is simple in concept, easy to understand, easy to implement, and requires no parameter estimation.
Disadvantages
The main disadvantage of this algorithm in classification shows up when the samples are unbalanced: if one class has a very large sample size while the other classes are small, the samples of the large class may dominate among the K neighbors of a newly input sample.
Another disadvantage is the heavy computation: for every sample to be classified, the distance to all known samples must be computed in order to find its K nearest neighbors.

1.5. Hyperparameters and model parameters

  • Hyperparameters
    Parameters that need to be decided before the algorithm runs
  • Model parameters
    Parameters learned by the algorithm during training

The kNN algorithm has no model parameters and is a non-parametric learning algorithm. In kNN, k is a typical hyperparameter.
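
In practice the hyperparameter k is usually chosen by trying several values and keeping the one that scores best on held-out data. A sketch with scikit-learn (the iris dataset and the search range 1 to 10 are arbitrary choices, not part of the original notes):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

best_k, best_score = 0, 0.0
for k in range(1, 11):                        # candidate values of the hyperparameter k
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)                 # "training" here just stores the samples
    score = knn.score(X_test, y_test)
    if score > best_score:
        best_k, best_score = k, score

print("best k =", best_k, "accuracy =", best_score)
```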

2. Linear Regression

2.1. Simple linear regression

Simple linear regression looks for a straight line that "fits" the relationship between the sample feature and the sample output label as well as possible. When there is only one sample feature, this is called simple linear regression.

Suppose we find the best-fit line $y = ax + b$. Then for each sample point $x^{(i)}$, the value predicted by the line equation is $\hat{y}^{(i)} = ax^{(i)} + b$, while the true value is $y^{(i)}$. We want the gap between $y^{(i)}$ and $\hat{y}^{(i)}$ to be as small as possible. The difference $y^{(i)} - \hat{y}^{(i)}$ can be positive or negative, so simply summing the differences over all samples is unreasonable: positive and negative gaps can cancel out to 0. To remove the sign we might use $|y^{(i)} - \hat{y}^{(i)}|$, but the absolute value is not differentiable everywhere. To be differentiable everywhere while still removing the sign, we use the squared error $(y^{(i)} - \hat{y}^{(i)})^2$ and, considering all samples, $\sum\limits_{i=1}^{m}(y^{(i)} - \hat{y}^{(i)})^2$. Our goal is to make $\sum\limits_{i=1}^{m}(y^{(i)} - \hat{y}^{(i)})^2$ as small as possible, where $\hat{y}^{(i)} = ax^{(i)} + b$; that is, to find $a$ and $b$ such that $\sum\limits_{i=1}^{m}(y^{(i)} - ax^{(i)} - b)^2$ is as small as possible.

2.1.1. The basic idea of a class of machine learning algorithms

The goal is to find $a$ and $b$ such that $\sum\limits_{i=1}^{m}(y^{(i)} - ax^{(i)} - b)^2$ is as small as possible. By analyzing the problem, we determine its loss function or utility function: here $\sum\limits_{i=1}^{m}(y^{(i)} - ax^{(i)} - b)^2$ is called the loss function (when the objective is maximized instead, it is called a utility function). By optimizing the loss function or utility function, a machine learning model is obtained. Almost all parameter learning algorithms follow this pattern (linear regression, polynomial regression, logistic regression, SVM, neural networks, ...).
For simple linear regression, finding $a$ and $b$ such that $\sum\limits_{i=1}^{m}(y^{(i)} - ax^{(i)} - b)^2$ is as small as possible is a typical least squares problem: minimize the sum of squared errors. The solution is
$a = \dfrac{\sum\limits_{i=1}^{m}(x^{(i)}-\bar{x})(y^{(i)}-\bar{y})}{\sum\limits_{i=1}^{m}(x^{(i)}-\bar{x})^2} \qquad b = \bar{y} - a\bar{x}$

Vectorized operation:

$\sum\limits_{i=1}^{m}(x^{(i)}-\bar{x})(y^{(i)}-\bar{y}) = \sum\limits_{i=1}^{m} w^{(i)} \cdot v^{(i)} = w \cdot v$

$w = (w^{(1)}, w^{(2)}, \cdots, w^{(m)}) \qquad v = (v^{(1)}, v^{(2)}, \cdots, v^{(m)})$
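
A minimal NumPy sketch of this closed-form, vectorized fit, where $w$ and $v$ are the mean-centered $x$ and $y$ (the function name and toy data are made up for illustration):

```python
import numpy as np

def fit_simple_linear_regression(x, y):
    """Least-squares fit of y = a*x + b using the vectorized closed-form formulas."""
    x_mean, y_mean = np.mean(x), np.mean(y)
    w = x - x_mean                 # w^(i) = x^(i) - x_bar
    v = y - y_mean                 # v^(i) = y^(i) - y_bar
    a = w.dot(v) / w.dot(w)        # a = (w . v) / (w . w)
    b = y_mean - a * x_mean        # b = y_bar - a * x_bar
    return a, b

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.0, 3.0, 2.0, 3.0, 5.0])
a, b = fit_simple_linear_regression(x, y)
```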

2.1.2. Metrics for regression algorithms

  1. Mean Squared Error (MSE)

$\frac{1}{m}\sum\limits_{i=1}^{m}(y_{test}^{(i)} - \hat{y}_{test}^{(i)})^2$

  2. Root Mean Squared Error (RMSE)

$\sqrt{\frac{1}{m}\sum\limits_{i=1}^{m}(y_{test}^{(i)} - \hat{y}_{test}^{(i)})^2} = \sqrt{MSE_{test}}$

  3. Mean Absolute Error (MAE)

$\frac{1}{m}\sum\limits_{i=1}^{m}|y_{test}^{(i)} - \hat{y}_{test}^{(i)}|$

  4. R Squared ($R^2$)

     $R^2 = 1 - \frac{SS_{residual}}{SS_{total}}$

     $R^2 = 1 - \frac{\sum\limits_i (y^{(i)} - \hat{y}^{(i)})^2}{\sum\limits_i (\bar{y} - y^{(i)})^2} = 1 - \frac{\left(\sum\limits_i^m (y^{(i)} - \hat{y}^{(i)})^2\right)/m}{\left(\sum\limits_i^m (\bar{y} - y^{(i)})^2\right)/m} = 1 - \frac{MSE(\hat{y}, y)}{Var(y)}$

     $\sum\limits_i (y^{(i)} - \hat{y}^{(i)})^2$ is the error produced by our model's predictions

     $\sum\limits_i (\bar{y} - y^{(i)})^2$ is the error produced by predicting with $y = \bar{y}$ (the Baseline Model)

  • $R^2 \le 1$
  • The larger $R^2$, the better. When our predictive model makes no errors at all, $R^2$ reaches its maximum value of 1
  • When our model performs the same as the baseline model, $R^2$ is 0
  • If $R^2 < 0$, the model we learned is worse than the baseline model; in that case the data most likely has no linear relationship
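
All four metrics map directly onto a few lines of NumPy (a sketch; scikit-learn also provides equivalents such as mean_squared_error, mean_absolute_error and r2_score in sklearn.metrics):

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """MSE, RMSE, MAE and R^2, computed from the formulas above."""
    mse = np.mean((y_true - y_pred) ** 2)
    rmse = np.sqrt(mse)
    mae = np.mean(np.abs(y_true - y_pred))
    r2 = 1.0 - mse / np.var(y_true)      # R^2 = 1 - MSE(y_hat, y) / Var(y)
    return mse, rmse, mae, r2
```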

2.2. Multiple linear regression

When a sample has multiple features, the problem becomes multiple linear regression.
Goal: make $\sum\limits_{i=1}^{m}(y^{(i)} - \hat{y}^{(i)})^2$ as small as possible, where $\hat{y}^{(i)} = \theta_0 + \theta_1 X_1^{(i)} + \theta_2 X_2^{(i)} + \cdots + \theta_n X_n^{(i)}$; that is, find $\theta_0, \theta_1, \theta_2, \cdots, \theta_n$ such that $\sum\limits_{i=1}^{m}(y^{(i)} - \hat{y}^{(i)})^2$ is as small as possible. Converting this to vectorized form:

$\hat{y}^{(i)} = \theta_0 X_0^{(i)} + \theta_1 X_1^{(i)} + \theta_2 X_2^{(i)} + \cdots + \theta_n X_n^{(i)}, \quad X_0^{(i)} \equiv 1$

     $\theta = (\theta_0, \theta_1, \theta_2, \cdots, \theta_n)^T$

     $X^{(i)} = (X_0^{(i)}, X_1^{(i)}, X_2^{(i)}, \cdots, X_n^{(i)})$

$\hat{y}^{(i)} = X^{(i)} \cdot \theta$

$X_b = \left[\begin{matrix} 1 & X_1^{(1)} & X_2^{(1)} & \cdots & X_n^{(1)} \\ 1 & X_1^{(2)} & X_2^{(2)} & \cdots & X_n^{(2)} \\ \cdots & & & & \cdots \\ 1 & X_1^{(m)} & X_2^{(m)} & \cdots & X_n^{(m)} \end{matrix}\right] \qquad \theta = \left[\begin{matrix} \theta_0 \\ \theta_1 \\ \theta_2 \\ \cdots \\ \theta_n \end{matrix}\right]$

$\hat{y} = X_b \cdot \theta$

Making $\sum\limits_{i=1}^{m}(y^{(i)} - \hat{y}^{(i)})^2$ as small as possible is equivalent to minimizing $(y - X_b \cdot \theta)^T (y - X_b \cdot \theta)$, from which we obtain the normal equation solution (Normal Equation) of multiple linear regression:
$\theta = (X_b^T X_b)^{-1} X_b^T y$
Problem with this approach: high time complexity, $O(n^3)$ (roughly $O(n^{2.4})$ with optimizations).
Advantage: no need to normalize the data.
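
A minimal NumPy sketch of the normal-equation solution (illustrative only; in practice np.linalg.pinv or np.linalg.lstsq is numerically more robust than an explicit inverse):

```python
import numpy as np

def fit_normal_equation(X, y):
    """Multiple linear regression: theta = (X_b^T X_b)^-1 X_b^T y."""
    X_b = np.hstack([np.ones((len(X), 1)), X])        # prepend the X_0 = 1 column
    theta = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)
    return theta                                      # theta[0] is the intercept theta_0

def predict(X, theta):
    X_b = np.hstack([np.ones((len(X), 1)), X])
    return X_b.dot(theta)                             # y_hat = X_b . theta
```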

2.3. Polynomial regression

Regression analysis in which the relationship between a dependent variable and one or more independent variables is modeled by a polynomial is called polynomial regression (Polynomial Regression). With a single independent variable it is called univariate polynomial regression; with several, multivariate polynomial regression. In univariate regression analysis, if the relationship between the dependent variable y and the independent variable x is nonlinear but no suitable function curve can be found to fit it, univariate polynomial regression can be used.
The univariate polynomial regression equation of degree m is $\hat{y} = b_0 + b_1 x + b_2 x^2 + \cdots + b_m x^m$.
The bivariate quadratic polynomial regression equation is $\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_1^2 + b_4 x_2^2 + b_5 x_1 x_2$.
The biggest advantage of polynomial regression is that higher-order terms of x can be added until the measured points are approximated satisfactorily. In fact, polynomial regression can handle a fairly wide class of nonlinear problems and occupies an important place in regression analysis, because any function can be approximated piecewise by polynomials. Therefore, in common practical problems, whatever the relationship between the dependent variable and the other independent variables, polynomial regression can always be used for the analysis.
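
A common way to fit a polynomial regression in practice is to expand the features with higher-order terms and then run an ordinary linear regression on them. A sketch with scikit-learn, using made-up noisy quadratic data (the degree and the data are assumptions, not from the original post):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Toy data: y is roughly 0.5*x^2 + x + 2 plus noise
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(100, 1))
y = 0.5 * x[:, 0] ** 2 + x[:, 0] + 2 + rng.normal(0, 1, size=100)

# Degree-2 polynomial regression: add x^2 as a feature, then fit linear regression
poly_reg = Pipeline([
    ("poly", PolynomialFeatures(degree=2)),
    ("lin_reg", LinearRegression()),
])
poly_reg.fit(x, y)
print(poly_reg.named_steps["lin_reg"].coef_, poly_reg.named_steps["lin_reg"].intercept_)
```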


Origin blog.csdn.net/qq_45723275/article/details/123681455