Machine Learning Practical Tutorial (8): Polynomial Regression

Polynomial regression

Concept

Linear regression studies the linear relationship between a dependent variable and the independent variables.
Polynomial regression fits nonlinear data by adding nonlinear features on top of linear regression: the model uses an n-th degree polynomial to approximately describe the relationship between the target variable and the input variable. For example, with a single independent variable x, the fitted function can be written as:
y = w0 + w1*x + w2*x^2 + ... + wn*x^n

where y is the target variable, x is the independent variable, and w0, w1, ..., wn are the model parameters. The goal of the model is to adjust these parameters so as to minimize the error between the predicted values and the true values.
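As a concrete illustration of this formula, here is a minimal sketch (the weights below are made up purely for demonstration) that evaluates such a polynomial in NumPy as a weighted sum of powers of x:

import numpy as np

# Hypothetical parameters w0..w3 of a degree-3 polynomial model
w = np.array([2.0, 3.0, 2.0, 0.5])                 # w[j] multiplies x**j

x = np.linspace(-3, 3, 5)                          # a few sample inputs
y_hat = sum(w[j] * x**j for j in range(len(w)))    # y = w0 + w1*x + w2*x^2 + w3*x^3
print(y_hat)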

Polynomial regression can be implemented through the PolynomialFeatures class of Scikit-Learn, which can transform the original independent variable data into new independent variable data containing polynomial features. In this way, we can use the linear regression algorithm to process the augmented nonlinear features to obtain a polynomial regression model.

Fitting example

Generate simulated data following y = 3x + 2x^2, plus Gaussian noise:

import numpy as np
import matplotlib.pyplot as plt

x = np.random.uniform(-3, 3, size=100)
X = x.reshape(-1, 1)
y =  3*x+ 2*x**2+ np.random.normal(0, 1, size=100)
plt.scatter(x, y)
plt.show()

(Figure: scatter plot of the simulated data)
If you use linear regression directly, take a look at the effect:

from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(X, y)
y_predict = lin_reg.predict(X)
plt.scatter(x, y)
plt.plot(x, y_predict, color='r')
plt.show()

(Figure: straight-line fit through the curved data)
Clearly the fit is poor. What is the solution?
Solution: add a new feature, x**2.

X2 = np.hstack([X, X**2])
lin_reg2 = LinearRegression()
lin_reg2.fit(X2, y)
y_predict2 = lin_reg2.predict(X2)
plt.scatter(x, y)
plt.plot(np.sort(x), y_predict2[np.argsort(x)], color='r')
plt.show()

(Figure: quadratic fit through the data)
This fits much better than a straight line. The fitted coefficients and intercept are:

[2.9391452  1.94366894]
0.04332751905483523
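As a quick sanity check (a sketch, not part of the original walkthrough), the model's predictions are simply intercept_ plus the feature matrix multiplied by coef_:

import numpy as np

# Reconstruct the predictions manually from the fitted parameters
y_manual = lin_reg2.intercept_ + X2.dot(lin_reg2.coef_)
print(np.allclose(y_manual, lin_reg2.predict(X2)))   # True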

Polynomial regression in scikit-learn

PolynomialFeatures

PolynomialFeatures is a transformer in Scikit-Learn that converts the input data into a set of polynomial features. It lets us fit nonlinear data by adding nonlinear features and then running ordinary linear regression on them.

Specifically, PolynomialFeatures converts the original feature vector into a new feature vector containing all polynomial combinations up to the given degree. For example, if the original feature vector is [a, b] and degree=2, the new feature vector is [1, a, b, a^2, ab, b^2]. If the original feature vector is [x] and degree=2, the new feature vector is [1, x, x^2].

Because the new feature vector contains all polynomial combinations of the original features, a nonlinear function can be fitted much better.

PolynomialFeatures has the following main parameters:

  • degree: the degree of the polynomial, which determines the highest-order terms that will be generated (default 2).
  • interaction_only: defaults to False, meaning the new feature vector contains both cross terms (such as a*b) and pure power terms (such as a^2); if True, only interaction terms are kept.
  • include_bias: defaults to True, meaning a bias column of constant 1 is added.

In short, PolynomialFeatures is a very useful transformer that helps us handle nonlinear data and thereby improve the predictive power of the model.
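As a quick check of the mapping and of these two flags (a minimal sketch, not from the original article), we can transform a single sample [2, 3] with different settings:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

sample = np.array([[2, 3]])   # one sample with features a=2, b=3

# Default: degree=2, include_bias=True -> [1, a, b, a^2, ab, b^2]
print(PolynomialFeatures(degree=2).fit_transform(sample))
# [[1. 2. 3. 4. 6. 9.]]

# interaction_only=True keeps only the cross terms (no a^2 or b^2)
print(PolynomialFeatures(degree=2, interaction_only=True).fit_transform(sample))
# [[1. 2. 3. 6.]]

# include_bias=False drops the constant-1 column
print(PolynomialFeatures(degree=2, include_bias=False).fit_transform(sample))
# [[2. 3. 4. 6. 9.]]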

from sklearn.preprocessing import PolynomialFeatures
# degree specifies the highest polynomial power to generate
poly = PolynomialFeatures(degree=2)    
poly.fit(X)
X2 = poly.transform(X)
print(X2.shape)
print(X2)

Output (only the first few of the 100 rows are shown; the first column is the constant 1, the second column is the original x, and the third column is x**2):


(100, 3)
[[ 1.00000000e+00 -2.37462045e+00  5.63882230e+00]
 [ 1.00000000e+00 -7.90962247e-01  6.25621276e-01]
 [ 1.00000000e+00 -7.02888543e-01  4.94052304e-01]
 [ 1.00000000e+00 -6.54589498e-01  4.28487411e-01]]

Use linear regression to fit

from sklearn.linear_model import LinearRegression

reg = LinearRegression()
reg.fit(X2, y)
y_predict = reg.predict(X2)
plt.scatter(x, y)
plt.plot(np.sort(x), y_predict[np.argsort(x)], color='r')
plt.show()
print(reg.coef_)
print(reg.intercept_)

(Figure: quadratic fit produced with PolynomialFeatures)
Output coefficients and intercept:

[3.02468873 1.94228967]
0.41539122650325755

Previously we used 1-dimensional data. What if the data has 2, 3, or even more dimensions?
Generate two-dimensional data (the integers 1 to 10, reshaped into 5 rows and 2 columns):

import numpy as np
x = np.arange(1, 11).reshape(5, 2)
print(x)

output

[[ 1  2]
 [ 3  4]
 [ 5  6]
 [ 7  8]
 [ 9 10]]

Transform it with PolynomialFeatures (the default degree is 2):

from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures()
poly.fit(x)
x2 = poly.transform(x)
print(x2)

output

[[  1.   1.   2.   1.   2.   4.]
 [  1.   3.   4.   9.  12.  16.]
 [  1.   5.   6.  25.  30.  36.]
 [  1.   7.   8.  49.  56.  64.]
 [  1.   9.  10.  81.  90. 100.]]

When the input has 2 features, polynomial preprocessing with degree=2 produces 6 features. The first column is the constant term, the second and third columns are x1 and x2, the fourth and sixth columns are x1^2 and x2^2, and the fifth column is the cross term x1*x2, for 6 columns in total. From this, can you guess what happens with degree=3?

poly = PolynomialFeatures(degree=3)
poly.fit(x)
x3 = poly.transform(x)
print(x3)

output

[[   1.    1.    2.    1.    2.    4.    1.    2.    4.    8.]
 [   1.    3.    4.    9.   12.   16.   27.   36.   48.   64.]
 [   1.    5.    6.   25.   30.   36.  125.  150.  180.  216.]
 [   1.    7.    8.   49.   56.   64.  343.  392.  448.  512.]
 [   1.    9.   10.   81.   90.  100.  729.  810.  900. 1000.]]

So what do these 10 columns correspond to? For degree=3 they are 1, x1, x2, x1^2, x1*x2, x2^2, x1^3, x1^2*x2, x1*x2^2 and x2^3. PolynomialFeatures generates every possible combination up to the given degree, so the number of features grows very quickly as the degree increases. This brings its own problem: if left uncontrolled, the terms differ wildly in scale (imagine the gap between x and x^100), and the model ends up chasing the training points. This is the notorious overfitting.
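To see which column is which, and how fast the feature count grows, you can ask the fitted transformer for its feature names. A short sketch, assuming a scikit-learn version that provides get_feature_names_out:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.arange(1, 11).reshape(5, 2)

for degree in (2, 3, 5, 10):
    poly = PolynomialFeatures(degree=degree).fit(x)
    print(degree, poly.transform(x).shape[1])   # the number of generated features grows quickly
    if degree == 3:
        # column names such as ['1' 'x0' 'x1' 'x0^2' 'x0 x1' 'x1^2' 'x0^3' ...]
        print(poly.get_feature_names_out())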

Pipeline in sklearn

Pipeline in sklearn is a tool that can connect multiple data preprocessing steps (transformer) and a machine learning model (estimator) together to form a complete machine learning process. Each step in the Pipeline is an object containing fit and transform methods, where the fit method is used to fit the training data and the transform method is used to transform the data.

Through Pipeline, multiple preprocessing algorithms and machine learning algorithms can be combined, making the entire process standardized and simplified, and operations such as cross-validation and parameter adjustment can be easily performed. Each step in the Pipeline can be identified by a string, which can be used to adjust hyperparameters in the model.

Pipeline is often used for feature engineering in machine learning: it builds a complete processing workflow during data preprocessing and applies it to both the training set and the test set. By combining multiple steps together, it avoids manually wiring up feature engineering and model selection each time, and it improves code reusability and maintainability.

In general, for polynomial regression we generate the polynomial features, standardize the data, and then run linear regression. Because sklearn does not ship a dedicated polynomial-regression estimator, a Pipeline can be used to chain these operations together.

#%%

import numpy as np
import matplotlib.pyplot as plt

x = np.random.uniform(-3, 3, size=100)
X = x.reshape(-1, 1)
y =  3*x+ 2*x**2+ np.random.normal(0, 1, size=100)
plt.scatter(x, y)
plt.show()

#%%

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

poly_reg = Pipeline([
    ('poly', PolynomialFeatures(degree=2)),   
    ('std_scale', StandardScaler()),
    ('lin_reg', LinearRegression())
])  
poly_reg.fit(X, y)
y_predict = poly_reg.predict(X)

plt.scatter(x, y)
plt.plot(np.sort(x), y_predict[np.argsort(x)], color='r')
plt.show()
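Because each step is identified by a string, you can reach into the fitted pipeline by name; a small sketch using the poly_reg pipeline defined above:

# Access the fitted linear-regression step by name; note its coefficients live in the
# standardized polynomial feature space, not on the original x scale
print(poly_reg.named_steps['lin_reg'].coef_)
print(poly_reg.named_steps['lin_reg'].intercept_)

# Hyperparameters are addressed as '<step name>__<parameter>', e.g. in a grid search
print(poly_reg.get_params()['poly__degree'])   # 2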

Overfitting and underfitting

The biggest advantage of polynomial regression is that we can keep adding higher-order terms of x until the fit to the observed points is satisfactory. But that is also its biggest weakness: fitting the data with too high a degree gives good performance on the training set, while the test set is usually not so ideal. This is overfitting, and we need ways to deal with it.

Mean squared error

mean_squared_error is a function in sklearn.metrics that computes the Mean Squared Error (MSE) between two arrays. It is a common metric for evaluating the accuracy of regression models.
Its input parameters are:

  • y_true: true value array;
  • y_pred: array of predicted values;
  • sample_weight: optional array of sample weights; defaults to None.

The function returns a single number representing the mean squared error between the two arrays.

The mean squared error is a widely used metric for measuring the predictive ability of a regression model: the smaller the MSE, the more accurate the model's predictions. It is defined as the average of the squared deviations between each sample's predicted value and its true value:

MSE = (1/m) * Σ (y_i - ŷ_i)^2, summed over the m samples.
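The definition is easy to verify by hand; a minimal sketch comparing a plain NumPy computation with sklearn's mean_squared_error (the small arrays are made up for illustration):

import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

mse_manual = np.mean((y_true - y_pred) ** 2)    # (1/m) * sum of squared errors
print(mse_manual)                               # 0.375
print(mean_squared_error(y_true, y_pred))       # 0.375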

Fitting effect

After generating the data set with the same method as before, we fit it with several different models and use the mean squared error to compare the results.

Linear fit

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
lin_reg = LinearRegression()
lin_reg.fit(X, y)
y_predict = lin_reg.predict(X)
plt.scatter(x, y)
plt.plot(np.sort(x), y_predict[np.argsort(x)], color='r')
plt.show()
print("线性均方误差",mean_squared_error(y, y_predict))

Output: 3.0750025765636577
Clearly, plain linear regression underfits this data.
(Figure: straight-line fit, underfitting)

Quadratic polynomial fitting

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

def PolynomialRegression(degree):
    return Pipeline([
        ('poly', PolynomialFeatures(degree=degree)),
        ('std_scale', StandardScaler()),
        ('lin_reg', LinearRegression())
    ])  

poly_reg = PolynomialRegression(degree=2)
poly_reg.fit(X, y)
y_predict = poly_reg.predict(X)
print("2次多项式均方误差",mean_squared_error(y, y_predict))
plt.scatter(x, y)
plt.plot(np.sort(x), y_predict[np.argsort(x)], color='r')
plt.show()

Output: 1.0987392142417856
Quadratic polynomial regression has a better fit than linear regression.
(Figure: degree-2 polynomial fit)

Degree-10 polynomial fit

poly10_reg = PolynomialRegression(degree=10)
poly10_reg.fit(X, y)

y10_predict = poly10_reg.predict(X)
print("10次多项式均方误差",mean_squared_error(y, y10_predict))
plt.scatter(x, y)
plt.plot(np.sort(x), y10_predict[np.argsort(x)], color='r')
plt.show()

Output: 1.0508466763764202
(Figure: degree-10 polynomial fit)

Degree-100 polynomial fit

poly100_reg = PolynomialRegression(degree=100)
poly100_reg.fit(X, y)

y100_predict = poly100_reg.predict(X)
print("Degree-100 polynomial MSE:", mean_squared_error(y, y100_predict))
plt.scatter(x, y)
plt.plot(np.sort(x), y100_predict[np.argsort(x)], color='r')
plt.show()

Output: 0.6870911922673567
(Figure: degree-100 polynomial fit)

Testing the degree-100 model on new data

From the plot above, the y values of the data generated with np.random.uniform fall roughly between -1 and 10. Let's use the same degree-100 model to predict the value at x = 3:

y_plot = poly100_reg.predict([[3]])
print(y_plot)

Output: [-2.49133715e+06 -6.32965634e+24]
That is, -2.49133715e+06 is about -2,491,337, far outside the range of the training data.

For x >= 3 the fitted curve clearly behaves abnormally.
Let's generate an evenly spaced sequence as a test set:

from sklearn.preprocessing import PolynomialFeatures
x_plot = np.linspace(-3, 3, 100).reshape(100, 1)
y_plot = poly100_reg.predict(x_plot)
plt.scatter(x, y)
plt.plot(x_plot[:,0], y_plot, color='r')
# plt.axis([-3, 3, -1, 10])
plt.show()

(Figure: degree-100 predictions on the evenly spaced points; extreme values distort the plot)
The plot is distorted by the extreme values near x = 3, so we limit the axes to x in [-3, 3] and y in [-1, 10]:

from sklearn.preprocessing import PolynomialFeatures
x_plot = np.linspace(-3, 3, 100).reshape(100, 1)
y_plot = poly100_reg.predict(x_plot)
plt.scatter(x, y)
plt.plot(x_plot[:,0], y_plot, color='r')
plt.axis([-3, 3, -1, 10])
plt.show()

(Figure: the same plot with the axes limited to x in [-3, 3] and y in [-1, 10])

This shows that a model trained on the training set does not necessarily perform well on new data.

Solve the overfitting problem

In machine learning the main concern is usually overfitting, because it relates to the model's generalization ability: the ability to give good answers on data outside the training set. How well the model fits the training data alone is meaningless; what we care about is how well the model generalizes.

Why do we need training data set and test data set?

We usually split the data set into a training set and a test set. A model trained on the training data is only meaningful if it also performs well on the test set.

Below we use train_test_split to divide the generated data into a training set and a test set, retrain the models, and compute the mean squared error on the test set.

Using linear regression:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(666)
x = np.random.uniform(-3.0, 3.0, size=100)
X = x.reshape(-1, 1)
y = 0.5 * x**2 + x + 2 + np.random.normal(0, 1, size=100)

x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=666)
lin_reg = LinearRegression()
lin_reg.fit(x_train, y_train)
y_predict = lin_reg.predict(x_test)
mean_squared_error(y_test, y_predict)

Output result: 2.2199965269396573

Using a degree-2 polynomial

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

def PolynomialRegression(degree):
    return Pipeline([
        ('poly', PolynomialFeatures(degree=degree)),
        ('std_scale', StandardScaler()),
        ('lin_reg', LinearRegression())
    ])
poly2_reg = PolynomialRegression(degree=2)
poly2_reg.fit(x_train, y_train)
y2_predict = poly2_reg.predict(x_test)
mean_squared_error(y_test, y2_predict)

Output result: 0.8035641056297901

Using a degree-10 polynomial

poly10_reg = PolynomialRegression(degree=10)
poly10_reg.fit(x_train, y_train)
y10_predict = poly10_reg.predict(x_test)
mean_squared_error(y_test, y10_predict)

Output result: 0.9212930722150781

From this example we can see that with degree=2 the test-set mean squared error is much better than the straight-line fit, but with degree=10 the test-set error is worse than with degree=2, which means the degree-10 model has overfit.

Using a degree-100 polynomial

poly100_reg = PolynomialRegression(degree=100)
poly100_reg.fit(x_train, y_train)
y100_predict = poly100_reg.predict(x_test)
mean_squared_error(y_test, y100_predict)

Output result: 14440175276.314638

Summary: we want to find the point of best generalization, trading off model complexity against accuracy on the training data.

  1. Underfitting: the trained model is too simple to fully capture the relationships in the data.
  2. Overfitting: the trained model captures too much of the noise in the data.
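To look for that sweet spot numerically, one option is to sweep over several degrees and compare the test-set error. A sketch, assuming the x_train, x_test, y_train, y_test split and the PolynomialRegression helper defined above:

from sklearn.metrics import mean_squared_error

# Compare the test MSE across a few polynomial degrees
for degree in (1, 2, 3, 5, 10, 20):
    model = PolynomialRegression(degree=degree)
    model.fit(x_train, y_train)
    test_mse = mean_squared_error(y_test, model.predict(x_test))
    print(degree, test_mse)   # the smallest test error marks the best-generalizing degree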

Learning curve

A learning curve is a graphical way of showing how a machine learning algorithm performs as it is trained. The horizontal axis is usually the size of the training set (or the number of training iterations), and the vertical axis is a performance metric such as accuracy or error. The curve helps us understand the learning process, evaluate how well the algorithm is learning, and tune the model.

As the training set grows, we hope to see the model's performance keep improving. If performance is good on the training set but poor on the test set, overfitting has occurred; if performance is poor on both the training set and the test set, we may need to revisit data preprocessing, feature engineering, or the model structure.

Learning curves are closely related to bias and variance, which are often used to diagnose and tune models. When the model's bias is large, the model is too simple to fit either the training set or the test set accurately, and its complexity should be increased; when the variance is large, the model is too complex and overfits, and we should reduce its complexity or enlarge the training set.

Let's split the whole data set into a training set and a test set, grow the training set from 1 sample up to its full size, and at each size fit a model (linear regression, a degree-2 polynomial, and a degree-20 polynomial). Each model is evaluated on the same test set, and we plot the number of training samples on the x-axis against the error (the square root of the MSE) on the y-axis. This lets us see underfitting (linear regression), the best fit (degree 2), and overfitting (degree 20).

The function for drawing the learning curve is encapsulated below for easy calling later.


def plot_learning_curve(algo, x_train, x_test, y_train, y_test):
    """Train algo on growing subsets of the training set and plot train/test RMSE."""
    train_score = []
    test_score = []
    for i in range(1, len(x_train)+1):
        # Fit on the first i training samples only
        algo.fit(x_train[:i], y_train[:i])

        y_train_predict = algo.predict(x_train[:i])
        train_score.append(mean_squared_error(y_train[:i], y_train_predict))

        # Always evaluate on the full test set
        y_test_predict = algo.predict(x_test)
        test_score.append(mean_squared_error(y_test, y_test_predict))

    # Plot the square root of the MSE (i.e. RMSE) against the training-set size
    plt.plot([i for i in range(1, len(x_train)+1)], np.sqrt(train_score), label='train')
    plt.plot([i for i in range(1, len(x_train)+1)], np.sqrt(test_score), label='test')
    plt.legend()
    plt.axis([0, len(x_train)+1, 0, 4])
    plt.show()

plot_learning_curve(LinearRegression(), x_train, x_test, y_train, y_test)

Generate dataset

np.random.seed(666)
x = np.random.uniform(-3.0, 3.0, size=100)
X = x.reshape(-1, 1)
y = 0.5 * x**2 + x + 2 + np.random.normal(0, 1, size=100)

x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=10)

Using linear regression, the learning curve shows underfitting (underfitting means the model cannot reach a low enough error even on the training set); the training error levels off around 2.0.

(Figure: learning curve for linear regression)

Using a degree-2 polynomial, the learning curve shows the best fit.

def PolynomialRegression(degree):
    return Pipeline([
        ('poly', PolynomialFeatures(degree=degree)),
        ('std_scale', StandardScaler()),
        ('lin_reg', LinearRegression())
    ])
poly2_reg = PolynomialRegression(degree=2)
plot_learning_curve(poly2_reg, x_train, x_test, y_train, y_test)

(Figure: learning curve for the degree-2 polynomial)
Using a degree-20 polynomial, the curve shows overfitting (overfitting means the model performs well on the training set but poorly on the test set):

poly20_reg = PolynomialRegression(degree=20)
plot_learning_curve(poly20_reg, x_train, x_test, y_train, y_test)

(Figure: learning curve for the degree-20 polynomial)


Origin: blog.csdn.net/liaomin416100569/article/details/130227742