"Machine Learning in Practice" - Chapter 8 Predicting Numerical Data: Regression

8.1 Finding the best-fit line with linear regression

Linear regression
Pros: The results are easy to interpret and the computation is not complicated.
Cons: It does not fit nonlinear data well.
Applicable data types: numeric and nominal values.

The purpose of regression is to predict a numerical target value. The most direct way is to write a calculation formula for the target value based on the input. A regression equation is given below:

HorsePower = 0.0015*annualSalary - 0.99*hoursListeningToPublicRadio

Here, 0.0015 and -0.99 are called regression coefficients, and the process of finding the regression coefficients is regression.
Regression usually refers to linear regression, so in this chapter the two terms mean the same thing. Linear regression means that each input term is multiplied by a constant and the results are added together to produce the output.
Assume the input data are stored in the matrix X and the regression coefficients in the vector w. Then for a given input X_{1}, the prediction is Y_{1}=X_{1}^{T}w. Given some known x values and their corresponding y values, how do we find w? A common approach starts from the difference between the predicted y and the true y. Simply summing these errors would let positive and negative differences cancel out, so the squared error is used instead.
The squared error can be written as:

\sum_{i=1}^{m}\left(y_{i}-x_{i}^{\mathrm{T}} w\right)^{2}

It can also be written in matrix form as (y-X w)^{T}(y-X w). Taking the derivative with respect to w gives X^{T}(y-Xw) (up to a constant factor); setting it to zero and solving for w yields:

\hat{w}=\left(\boldsymbol{X}^{\mathrm{T}} \boldsymbol{X}\right)^{-1} \boldsymbol{X}^{\mathrm{T}} y

The hat in \hat{w} indicates that this is the best estimate of w that can be made from the data at hand. The w estimated from a finite sample may not equal the true underlying w, so the hat marks it as an estimate only.
The formula above involves (X^{T}X)^{-1}, that is, it requires inverting a matrix, so it applies only when the inverse exists. Since the inverse may not exist, the code must check for this condition.
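For reference, the solution above follows from setting the derivative of the squared error to zero (a standard derivation, not spelled out in the original text):

\frac{\partial}{\partial w}(y-X w)^{\mathrm{T}}(y-X w)=-2 X^{\mathrm{T}}(y-X w)=0 \Rightarrow X^{\mathrm{T}} X w=X^{\mathrm{T}} y \Rightarrow \hat{w}=\left(X^{\mathrm{T}} X\right)^{-1} X^{\mathrm{T}} y
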
A scatterplot is given below to describe how to draw the best-fit straight line for the data.

Create a new file regression.py and add the code:

from numpy import *

def loadDataSet(fileName):      #general function to parse tab -delimited floats
    numFeat = len(open(fileName).readline().split('\t')) - 1 #get number of fields 
    dataMat = []; labelMat = []
    fr = open(fileName)
    for line in fr.readlines():
        lineArr =[]
        curLine = line.strip().split('\t')
        for i in range(numFeat):
            lineArr.append(float(curLine[i]))
        dataMat.append(lineArr)
        labelMat.append(float(curLine[-1]))
    return dataMat,labelMat

def standRegres(xArr,yArr):
    xMat = mat(xArr); yMat = mat(yArr).T
    xTx = xMat.T*xMat
    if linalg.det(xTx) == 0.0:
        print("This matrix is singular, cannot do inverse")
        return
    ws = xTx.I * (xMat.T*yMat)
    return ws

The first function, loadDataSet(), works like earlier functions of the same name: it opens a tab-delimited text file and assumes that the last value on each line is the target value. The second function, standRegres(), computes the best-fit line. It first reads x and y into matrices, then computes X^{T}X and checks whether its determinant is zero; if the determinant is zero, computing the inverse would fail. NumPy's linear algebra module linalg contains many useful functions, and linalg.det() can be called directly to compute the determinant. Finally, if the determinant is non-zero, w is computed and returned. Trying to invert the matrix without checking the determinant would raise an error. NumPy's linear algebra module also provides a function for solving linear systems; with it, the line ws = xTx.I * (xMat.T*yMat) can be written as ws = linalg.solve(xTx, xMat.T*yMat).
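
As a sketch of that alternative (the wrapper function standRegresSolve() here is our own, not from the book):

from numpy import mat, linalg

def standRegresSolve(xArr, yArr):
    xMat = mat(xArr); yMat = mat(yArr).T
    xTx = xMat.T * xMat
    if linalg.det(xTx) == 0.0:
        print("This matrix is singular, cannot do inverse")
        return
    # solve the normal equations xTx * ws = xMat.T * yMat directly
    return linalg.solve(xTx, xMat.T * yMat)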

Calling loadDataSet() reads two lists from the data file, stored in x and y respectively. Similar to the class labels in classification algorithms, y is the target value here.

import regression
from numpy import *
xArr,yArr = regression.loadDataSet('ex0.txt')
ws = regression.standRegres(xArr,yArr)
print(xArr[0:2])
print(ws)

ws holds the regression coefficients. When predicting y with an inner product, the first coefficient multiplies the constant term x0 and the second multiplies the input variable x1. Since it is assumed that x0 = 1, the prediction is y = ws[0] + ws[1]*x1. This y is a prediction; to distinguish it from the true y values it is written yHat. The following uses the new ws values to compute yHat and draws the dataset's scatter plot together with the best-fit line:

import regression
from numpy import *
import matplotlib.pyplot as plt
xArr,yArr = regression.loadDataSet('ex0.txt')
ws = regression.standRegres(xArr,yArr)
xMat = mat(xArr)
yMat = mat(yArr)
yHat = xMat*ws
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(xMat[:,1].flatten().A[0],yMat.T[:,0].flatten().A[0])
xCopy = xMat.copy()
xCopy.sort(0)
yHat = xCopy*ws
ax.plot(xCopy[:,1],yHat)
plt.show()

Almost any dataset can be modeled this way, so how do we judge the quality of the model? Consider two different datasets for which linear regression produces exactly the same model (the same fitted line): the data are clearly different, so how do we compare how well the model fits each? Computing the correlation coefficient between the predicted sequence yHat and the true sequence y measures how well they match.

In Python, the NumPy library provides a calculation method for the correlation coefficient: the correlation between the predicted value and the actual value can be calculated by the command corrcoef(yEstimate, yActual).

import regression
from numpy import *
import matplotlib.pyplot as plt
xArr,yArr = regression.loadDataSet('ex0.txt')
ws = regression.standRegres(xArr,yArr)
xMat = mat(xArr)
yMat = mat(yArr)
yHat = xMat*ws
print(corrcoef(yHat.T,yMat)) # compute the correlation coefficient

The output matrix contains the correlation coefficients for all pairwise combinations. The diagonal has a correlation of 1 because yMat matches itself perfectly, while yHat and yMat have a correlation of 0.98.

8.2 Locally Weighted Linear Regression

One problem with linear regression is that it can underfit the data, because it seeks the unbiased estimate with the smallest mean squared error. A model that underfits will clearly not give the best predictions, so some methods allow a little bias to be introduced into the estimate in exchange for a lower prediction mean squared error.
One such method is locally weighted linear regression (LWLR). In this algorithm, we assign a weight to each point near the point to be predicted; then, as in Section 8.1, an ordinary least-squares regression is performed on this subset. Like kNN, the algorithm must select the relevant subset of the data for every prediction. The resulting formula for the regression coefficients w is:

\hat{w}=\left(\boldsymbol{X}^{\mathrm{T}} \boldsymbol{W} \boldsymbol{X}\right)^{-1} \boldsymbol{X}^{\mathrm{T}} \boldsymbol{W} y

where W is a diagonal matrix used to assign a weight to each data point.
LWLR uses a "kernel" to give higher weight to nearby points. The type of kernel can be freely selected. The most commonly used kernel is the Gaussian kernel. The corresponding weight of the Gaussian kernel is as follows:

w(i, i)=\exp \left(\frac{\left|x^{(i)}-x\right|^{2}}{-2 k^{2}}\right)

In this way, a weight matrix W with non-zero entries only on the diagonal is constructed, and the closer the point x is to x(i), the larger w(i,i) will be. The formula contains a user-specified parameter k, which controls how much weight is given to nearby points; it is the only parameter that must be chosen when using LWLR. The figure below shows the relationship between the parameter k and the weights.
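
To get a feel for how k affects the weights, here is a small sketch (not from the original text) that prints the Gaussian weight at a couple of distances for several values of k:

from numpy import exp

def gaussWeight(d, k):
    # Gaussian kernel weight for a point at distance d from the query point
    return exp(-(d**2) / (2.0 * k**2))

for k in (1.0, 0.1, 0.01):
    # weights at distances 0.1 and 0.5: small k makes distant points nearly irrelevant
    print(k, gaussWeight(0.1, k), gaussWeight(0.5, k))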

Add the following code to regression.py:

def lwlr(testPoint,xArr,yArr,k=1.0):
    xMat = mat(xArr); yMat = mat(yArr).T
    m = shape(xMat)[0]
    weights = mat(eye((m)))   # eye(m) creates an m x m identity matrix
    for j in range(m):                      #next 2 lines create weights matrix
        diffMat = testPoint - xMat[j,:]     #
        weights[j,j] = exp(diffMat*diffMat.T/(-2.0*k**2))
    xTx = xMat.T * (weights * xMat)
    if linalg.det(xTx) == 0.0:
        print("This matrix is singular, cannot do inverse")
        return
    ws = xTx.I * (xMat.T * (weights * yMat))
    return testPoint * ws

def lwlrTest(testArr,xArr,yArr,k=1.0):  #loops over all the data points and applies lwlr to each one
    m = shape(testArr)[0]
    yHat = zeros(m)
    for i in range(m):
        yHat[i] = lwlr(testArr[i],xArr,yArr,k)
    return yHat

The code above computes the predicted value yHat for any given point in x space. The beginning of lwlr() is similar to the earlier standRegres(): it reads in the data and creates the needed matrices, then creates the diagonal weight matrix weights. The weight matrix is a square matrix whose order equals the number of sample points, i.e. it initializes one weight per sample point. The algorithm then iterates over the dataset and computes the weight of each sample point: the weight decays exponentially as the distance between the sample point and the point to be predicted increases. The input parameter k controls how quickly the decay happens. As in standRegres(), once the weight matrix has been computed, an estimate of the regression coefficients ws is obtained.
The other function, lwlrTest(), calls lwlr() for every point in a dataset; it is useful for finding a good value of k.

import regression
from numpy import *
import matplotlib.pyplot as plt
xArr,yArr = regression.loadDataSet('ex0.txt')
# estimate the value at a single point
print(yArr[0])
print(regression.lwlr(xArr[0],xArr,yArr,1.0))
print(regression.lwlr(xArr[0],xArr,yArr,0.001))
# to get estimates for every point in the dataset, call lwlrTest()
yHat = regression.lwlrTest(xArr,xArr,yArr,1.0)
# plot the estimates together with the original values to see how well yHat fits; the plotting function needs the points in order, so sort xArr first:
xMat = mat(xArr)
srtInd = xMat[:,1].argsort(0)
xSort = xMat[srtInd][:,0,:]
# plot with Matplotlib
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(xSort[:,1],yHat[srtInd])
ax.scatter(xMat[:,1].flatten().A[0],mat(yArr).T.flatten().A[0],s=2,c='red')
plt.show()

The figures show the results for k = 1.0, 0.01, and 0.003. With k = 1.0 the weights are so large that all points are treated as nearly equally weighted, and the best-fit line coincides with standard regression. Using k = 0.01 gives very good results, capturing the underlying pattern of the data. With k = 0.003 too much noise is incorporated and the fitted curve follows the individual data points too closely. The three cases therefore correspond to underfitting, a good fit, and overfitting, respectively.

8.3 Example: Predicting the Age of Abalone

Add the following code to regression.py:

def rssError(yArr,yHatArr): #yArr and yHatArr both need to be arrays
    return ((yArr-yHatArr)**2).sum()

To analyze the size of the prediction error, this measure can be computed with rssError():

import regression
abX,abY = regression.loadDataSet('abalone.txt')
yHat01 = regression.lwlrTest(abX[0:99],abX[0:99],abY[0:99],0.1)
yHat1 = regression.lwlrTest(abX[0:99],abX[0:99],abY[0:99],1)
yHat10=regression.lwlrTest(abX[0:99],abX[0:99],abY[0:99],10)
print(regression.rssError(abY[0:99],yHat01.T))
print(regression.rssError(abY[0:99],yHat1.T))
print(regression.rssError(abY[0:99],yHat10.T))

 

Using a smaller kernel gives a lower error on the training data. However, the smallest kernel overfits and may not give the best predictions on new data.

 

import regression
from numpy import *
import matplotlib.pyplot as plt
abX,abY = regression.loadDataSet('abalone.txt')
yHat01 = regression.lwlrTest(abX[100:199],abX[0:99],abY[0:99],0.1)
print(regression.rssError(abY[100:199],yHat01.T))
yHat1 = regression.lwlrTest(abX[100:199],abX[0:99],abY[0:99],1)
print(regression.rssError(abY[100:199],yHat1.T))
yHat10 = regression.lwlrTest(abX[100:199],abX[0:99],abY[0:99],10)
print(regression.rssError(abY[100:199],yHat10.T))
ws = regression.standRegres(abX[0:99],abY[0:99])
yHat = mat(abX[100:199]) * ws
print(regression.rssError(abY[100:199],yHat.T.A))

Of the three kernel sizes, the test error is smallest when the kernel size is 10, even though that kernel gives the largest error on the training set.
Simple linear regression achieves an effect similar to locally weighted linear regression on this data.

8.4 Reducing coefficients to "understand" the data

If the data has more features than sample points, the methods described so far can no longer be used to make predictions, because computing (X^{T}X)^{-1} fails.
If there are more features than samples (n > m), the input matrix X is not full rank, and a non-full-rank matrix causes problems when inverting.
To solve this problem, statisticians introduced ridge regression, the first reduction method covered in this section. The lasso follows; it works well but is computationally demanding. Finally, a second reduction method called forward stepwise regression is introduced, which achieves nearly the same effect as the lasso and is easier to implement.

8.4.1 Ridge regression

Simply put, ridge regression adds \lambda I to the matrix X^{T}X so that the matrix becomes non-singular, and then X^{T}X+\lambda I can be inverted. Here I is an identity matrix of the same size as X^{T}X: all diagonal elements are 1 and all other elements are 0. λ is a user-defined value. With it, the formula for the regression coefficients becomes:

\hat{w}=\left(\boldsymbol{X}^{\mathrm{T}} \boldsymbol{X}+\lambda \boldsymbol{I}\right)^{-1} \boldsymbol{X}^{\mathrm{T}} y

Ridge regression was originally developed to handle the case where the number of features exceeds the number of samples; it is now also used to add bias to the estimate, which can yield a better estimate. Here λ is introduced to limit the sum of the squared coefficients; this penalty term shrinks the unimportant parameters, a technique known in statistics as shrinkage (reduction).
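
Equivalently (a standard reformulation, added here for reference), the ridge coefficients minimize a penalized squared error, and the formula above is its closed-form solution:

\hat{w}=\underset{w}{\arg \min }\left[\sum_{i=1}^{m}\left(y_{i}-x_{i}^{\mathrm{T}} w\right)^{2}+\lambda \sum_{k=1}^{n} w_{k}^{2}\right]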

What is a ridge in ridge regression?
Ridge regression adds the identity matrix multiplied by the constant λ. In the identity matrix I, the value 1 runs along the entire diagonal while every other element is 0. Visually, there is a "ridge" of 1s across a plane of 0s, and that is the "ridge" in ridge regression.

Reduction methods remove unimportant parameters and thus provide a better understanding of the data. In addition, the reduction method can achieve better prediction results than simple linear regression.
As with the parameters trained in earlier chapters, λ is chosen by minimizing the prediction error: part of the data is held out for testing, and the rest is used as a training set to estimate the coefficients w. After training, predictive performance is measured on the test set. This is repeated for different values of λ, and the λ that minimizes the prediction error is kept.

Add the following code to regression.py:

def ridgeRegres(xMat, yMat, lam=0.2):
    xTx = xMat.T * xMat
    denom = xTx + eye(shape(xMat)[1]) * lam
    if linalg.det(denom) == 0.0:
        print("This matrix is singular, cannot do inverse")
        return
    ws = denom.I * (xMat.T * yMat)
    return ws

def ridgeTest(xArr, yArr):
    xMat = mat(xArr);
    yMat = mat(yArr).T
    yMean = mean(yMat, 0)
    yMat = yMat - yMean  # to eliminate X0 take mean off of Y
    # regularize X's
    xMeans = mean(xMat, 0)  # calc mean then subtract it off
    xVar = var(xMat, 0)  # calc variance of Xi then divide by it
    xMat = (xMat - xMeans) / xVar
    numTestPts = 30
    wMat = zeros((numTestPts, shape(xMat)[1]))
    for i in range(numTestPts):
        ws = ridgeRegres(xMat, yMat, exp(i - 10))
        wMat[i, :] = ws.T
    return wMat

The above code contains two functions: the function ridgeRegress() is used to calculate the regression coefficients, and the function ridgeTest() is used to test the results on a set of λ.
The first function, ridgeRegres(), implements the ridge regression solution for a given lambda, defaulting to 0.2 if none is specified. Since lambda is a reserved keyword in Python, lam is used in the code instead. The function first constructs the matrix X^{T}X, then adds the identity matrix (created with NumPy's eye()) multiplied by lam. Ridge regression still works where ordinary regression fails, but if lambda is set to 0 the matrix can again be singular, so the determinant check is still needed. Finally, if the matrix is non-singular, the regression coefficients are computed and returned.
To use ridge regression and other reduction techniques, the features first need to be standardized. The ridgeTest() function demonstrates this: each feature has its mean subtracted and is then divided by its variance.
After standardization, ridgeRegres() is called with 30 different lambda values. The lambdas vary exponentially, so we can see the effect of lambda both when it is very small and when it is very large. All the regression coefficients are collected into a matrix, which is returned.

import regression
from numpy import *
import matplotlib.pyplot as plt
# get the regression coefficients for 30 different values of lambda
abX,abY = regression.loadDataSet('abalone.txt')
ridgeWeights = regression.ridgeTest(abX,abY)

fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(ridgeWeights)
plt.show()

The figure plots the regression coefficients against log(λ). At the far left, where λ is smallest, the coefficients take their original values (the same as plain linear regression); at the far right, all coefficients shrink to 0; somewhere in between is a value that gives the best predictions. To find the optimal λ quantitatively, cross-validation is needed. To judge which variables influence the prediction the most, simply look at their coefficients in the figure.
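
As a minimal sketch of that selection procedure (a simple hold-out split rather than full cross-validation; the split sizes and variable names here are our own, not from the original text):

import regression
from numpy import mat, mean, var, exp

abX, abY = regression.loadDataSet('abalone.txt')
trainX, trainY = abX[0:300], abY[0:300]
testX, testY = abX[300:400], abY[300:400]

# standardize with the training set's statistics, as ridgeTest() does
xMatTrain = mat(trainX); yMatTrain = mat(trainY).T
xMeans = mean(xMatTrain, 0); xVar = var(xMatTrain, 0)
xMatTrain = (xMatTrain - xMeans) / xVar
yMean = mean(yMatTrain, 0)
xMatTest = (mat(testX) - xMeans) / xVar

bestErr = float('inf'); bestLam = None
for i in range(30):
    lam = exp(i - 10)
    ws = regression.ridgeRegres(xMatTrain, yMatTrain - yMean, lam)
    yHat = xMatTest * ws + yMean
    err = regression.rssError(mat(testY).T.A, yHat.A)
    if err < bestErr:
        bestErr, bestLam = err, lam
print(bestLam, bestErr)
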
There are other reduction methods such as the lasso, LAR, PCA regression, and subset selection. Like ridge regression, these methods can not only improve prediction accuracy but also make the regression coefficients easier to interpret.

8.4.2 lasso

It is not difficult to show that adding the following constraint to ordinary least squares yields the same solution as ridge regression:

\sum_{k=1}^{n} w_{k}^{2} \leqslant \lambda

The formula above limits the sum of the squared regression coefficients to be no greater than λ. When two or more features are correlated, ordinary least squares may produce one very large positive coefficient and one very large negative coefficient; it is exactly this constraint that lets ridge regression avoid the problem.
Similar to ridge regression, another reduction method, the lasso, also constrains the regression coefficients, using the following constraint instead:

\sum_{k=1}^{n}\left|w_{k}\right| \leqslant \lambda

The difference is that this constraint uses absolute values instead of squares. Although the change in the constraint looks slight, the results differ dramatically: when λ is small enough, some coefficients are forced to shrink exactly to 0, which helps us understand the data better. The two constraints look similar on paper, but the small change greatly increases the computational cost (solving for the regression coefficients under the new constraint requires a quadratic programming algorithm).

8.4.3 Forward Stepwise Regression

The forward stepwise regression algorithm achieves an effect similar to the lasso, but is simpler. It is a greedy algorithm: at each step it reduces the error as much as possible. The weights start from an initial value (zero in the implementation below), and at each step the decision is to increase or decrease one weight by a small amount. The pseudocode, which mirrors the stageWise() implementation below, is as follows:
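
Standardize the data to zero mean and unit variance
Repeat numIt times:
    Set lowestError to +infinity
    For every feature:
        For increasing and decreasing that feature's weight:
            Change one coefficient by eps to get a new W
            Compute the error under the new W
            If the error is less than lowestError: set Wbest to the current W
    Set W to Wbest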

Open the regression.py file and add the following code:

def stageWise(xArr,yArr,eps=0.01,numIt=100):
    xMat = mat(xArr); yMat=mat(yArr).T
    yMean = mean(yMat,0)
    yMat = yMat - yMean     #can also regularize ys but will get smaller coef
    xMat = regularize(xMat)
    m,n=shape(xMat)
    returnMat = zeros((numIt,n)) #testing code remove
    ws = zeros((n,1)); wsTest = ws.copy(); wsMax = ws.copy()
    for i in range(numIt):
        print(ws.T)
        lowestError = inf;
        for j in range(n):
            for sign in [-1,1]:
                wsTest = ws.copy()
                wsTest[j] += eps*sign
                yTest = xMat*wsTest
                rssE = rssError(yMat.A,yTest.A)
                if rssE < lowestError:
                    lowestError = rssE
                    wsMax = wsTest
        ws = wsMax.copy()
        returnMat[i,:]=ws.T
    return returnMat

def regularize(xMat):#regularize by columns
    inMat = xMat.copy()
    inMeans = mean(inMat,0)   #calc mean then subtract it off
    inVar = var(inMat,0)      #calc variance of Xi then divide by it
    inMat = (inMat - inMeans)/inVar
    return inMat

The function stageWise() implements the stepwise linear regression algorithm, which is similar to the lasso but computationally simple. Its inputs are the data xArr and the target values yArr, plus two parameters: eps, the step size applied at each iteration, and numIt, the number of iterations.
The function first converts the input data into matrices and standardizes the features to zero mean and unit variance. It then creates a vector ws to hold the coefficients, along with two copies of ws needed by the greedy algorithm. The optimization loop runs numIt times, printing the w vector at every iteration so the algorithm's progress and effect can be followed.
The greedy inner loops iterate over every feature and over both signs, computing the effect of increasing or decreasing that feature's weight on the error. The error used is the squared error, obtained from the earlier rssError() function. The error is initialized to positive infinity, and the change giving the smallest error is kept. The whole process repeats iteratively.
Verify the code:

import regression
# run forward stepwise regression on the abalone data
xArr,yArr = regression.loadDataSet('abalone.txt')
print(regression.stageWise(xArr,yArr,0.01,200))

Both w1 and w6 are 0, which indicates that they do not contribute to the target value; these features are probably not needed. In addition, with eps set to 0.01, the coefficients saturate after a while and then oscillate back and forth between particular values because the step size is too large. For example, the first weight oscillates between 0.04 and 0.05.
Let's try using a smaller step size and a larger number of steps, and compare the results with the least squares method:

import regression
from numpy import *
import matplotlib.pyplot as plt
# smaller step size, more iterations; then compare with ordinary least squares
xArr,yArr = regression.loadDataSet('abalone.txt')
print(regression.stageWise(xArr,yArr,0.001,5000))
xMat = mat(xArr)
yMat = mat(yArr).T
xMat = regression.regularize(xMat)
yM = mean(yMat,0)
yMat = yMat - yM
weights = regression.standRegres(xMat,yMat.T)
print(weights.T)

 

The results show that the iterative stepwise linear regression estimates are similar to those of conventional least squares. The same comparison can be repeated with an epsilon of 0.005 and 1,000 iterations.

The appeal of the stepwise linear regression algorithm is that it helps people understand an existing model and improve it. Once a model has been built, the algorithm can be run to identify the important features, making it possible to stop collecting the unimportant ones. Finally, when used for evaluation, the algorithm can build a model every 100 iterations, and these models can be compared with a method similar to 10-fold cross-validation, keeping the model that minimizes the error.
When a reduction method is applied, the model's bias increases while its variance decreases.

8.5 Balancing Bias and Variance

As an example, suppose data are generated by the following formula:

y=3.0+1.7 x+0.1 \sin (30 x)+0.06 N(0,1)

where N(0,1) is a normal distribution with mean 0 and variance 1. If we try to fit this data with nothing but a straight line, the best possible fit is 3.0 + 1.7x, and the error comes from the 0.1sin(30x) + 0.06N(0,1) term. The corresponding figure shows the training-error and test-error curves, with the test error on top and the training error below. From the experiments in Section 8.3 we know that shrinking the kernel reduces the training error; moving from left to right in the figure corresponds to gradually shrinking the kernel.
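
A small sketch of how such data could be generated and fit with standRegres() (the sample size and random seed are arbitrary choices, not from the original text):

import regression
from numpy import sin, random

random.seed(0)
n = 200
x = random.rand(n)                                   # inputs in [0, 1)
y = 3.0 + 1.7*x + 0.1*sin(30*x) + 0.06*random.randn(n)

# build [1.0, x] rows so that ws[0] is the intercept, as in ex0.txt
xArr = [[1.0, xi] for xi in x]
ws = regression.standRegres(xArr, y.tolist())
print(ws.T)   # the fitted line should be close to y = 3.0 + 1.7x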

It is generally believed that these two errors consist of three components: bias, measurement error, and random noise. Earlier, we increased the model's variance by introducing three progressively smaller kernels.
Section 8.4 introduced reduction methods, which can shrink some coefficients to very small values or to 0 exactly; this is an example of increasing the model's bias. By shrinking the regression coefficients of some features to 0, the model's complexity is also reduced. The example has 8 features, and eliminating two of them not only makes the model easier to understand but also lowers the prediction error. In the corresponding figure, the left side shows the result of shrinking the parameters too aggressively, while the right side shows no shrinkage at all.
Variance can be measured. If you draw a random subset of the abalone data (say 100 samples) and fit a linear model, you get one set of regression coefficients. Drawing another random subset and fitting again gives another set of coefficients. The amount by which these coefficient sets differ reflects the variance of the model. This tradeoff between bias and variance is a pervasive idea in machine learning and appears again and again.
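
As a minimal sketch of measuring that variance (the subset size and number of repetitions are arbitrary choices, not from the original text):

import regression
from numpy import random, array

abX, abY = regression.loadDataSet('abalone.txt')
coefs = []
for _ in range(10):
    # draw a random subset of 100 samples and fit an ordinary linear model
    idx = random.permutation(len(abX))[:100]
    subX = [abX[i] for i in idx]
    subY = [abY[i] for i in idx]
    ws = regression.standRegres(subX, subY)
    if ws is not None:              # skip singular cases
        coefs.append(ws.T.A[0])

# the spread of the coefficients across subsets reflects the model's variance
print(array(coefs).var(0))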
 
