Study notes on polynomial regression, Pipeline, and bias and variance

Today we briefly cover polynomial regression and how to apply it with Pipeline.
We previously studied linear regression, which assumes that the data follows a linear relationship. Not all data is linear, however. To keep using regression in that case, we can raise the dimensionality of the features (for example, by adding x^2 as a new feature), which turns the problem into polynomial regression.

1. Polynomial regression

Regression analysis that models the relationship between a dependent variable and one or more independent variables as a polynomial is called polynomial regression. Polynomial regression is still a linear regression model, because its regression function is linear in the regression coefficients. The relationship between the independent variable x and the dependent variable y is modeled as a polynomial of degree n.
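
The point that polynomial regression is "linear in the coefficients" can be sketched as follows. The data here is hypothetical (a noise-free quadratic chosen for illustration, not taken from the original post): the single feature x is expanded into the columns [1, x, x^2], and ordinary linear regression is then fitted on those columns.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Hypothetical noise-free data from y = 0.5*x^2 + x + 2 (an assumption for illustration).
x = np.linspace(-3, 3, 50)
X = x.reshape(-1, 1)          # sklearn expects a 2-D feature matrix
y = 0.5 * x**2 + x + 2

# Expand the single feature x into the columns [1, x, x^2].
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)

# Ordinary linear regression on the expanded features.
# fit_intercept=False because the column of ones already plays the intercept's role.
lin_reg = LinearRegression(fit_intercept=False)
lin_reg.fit(X_poly, y)

print(X_poly.shape)     # 50 rows, 3 columns
print(lin_reg.coef_)    # coefficients for [1, x, x^2]
```

Because the data contains no noise, the fitted coefficients should come out very close to 2, 1 and 0.5, the values used to generate y.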

2. Pipeline

When modeling with sklearn, simple data processing, feature processing, and model fitting can be chained together into a pipeline. This is what the Pipeline class is for.

Pipeline puts all these steps together. Its argument is a list in which each element is one step of the pipeline. Each element is a tuple: the first item is the step's name (a string) and the second is an instantiated object (a transformer or an estimator).
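
As a minimal sketch of that (name, instance) structure, the example below builds a two-step pipeline and looks the steps up by name. The step names 'scale' and 'model' and the tiny data set are assumptions made up for illustration.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# Each step is a (name, instance) tuple; the names are arbitrary labels.
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('model', LinearRegression()),
])

# The steps stay accessible by position (pipe.steps) and by name (pipe.named_steps).
step_names = [name for name, _ in pipe.steps]
print(step_names)

# Hypothetical, perfectly linear data: y = 2 * x.
X_small = np.array([[1.0], [2.0], [3.0], [4.0]])
y_small = np.array([2.0, 4.0, 6.0, 8.0])

# fit() runs fit_transform on every transformer in order, then fits the final estimator.
pipe.fit(X_small, y_small)
pred = pipe.predict(np.array([[5.0]]))
print(pred)
```

Calling predict applies the already-fitted scaler to the new input before passing it to the regression step, so the same preprocessing is used at training and prediction time.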

3. Code implementation

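import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Sample data (assumed for illustration; the post does not show how x, X and y were created)
np.random.seed(666)
x = np.random.uniform(-3, 3, size=100)
X = x.reshape(-1, 1)
y = 0.5 * x**2 + x + 2 + np.random.normal(0, 1, size=100)

# Chain the steps: polynomial features -> standardization -> linear regression
poly_reg = Pipeline([
    ('poly', PolynomialFeatures(degree=2)),
    ('scaler', StandardScaler()),
    ('lr_reg', LinearRegression()),
])
poly_reg.fit(X, y)
y_predict = poly_reg.predict(X)

# Plot the raw points and the fitted curve, sorted by x so the line is drawn left to right
plt.scatter(x, y)
plt.plot(np.sort(x), y_predict[np.argsort(x)], color='r')
plt.show()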

4. Bias and variance

Model error = bias + variance + irreducible error (noise). More precisely, for squared error the bias term enters as bias squared. Generally speaking, as model complexity increases, variance gradually increases and bias gradually decreases.

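Bias: bias measures how far the model's predictions deviate from the actual values. For example, a model with 96% accuracy has low bias; conversely, a model with only 70% accuracy has high bias.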

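Variance: variance describes how much the predictions fluctuate (that is, how dispersed they are) across the models obtained at different training iterations on the training data. Mathematically, it can be understood as the mean of the squared differences between each prediction and the mean prediction. Typically, in the early stage of training the model's complexity is low and so is its variance; as training continues, the model fits the training data more and more closely, its complexity rises, and the variance gradually increases.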
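
The two definitions above can be estimated empirically. The sketch below is a simulation under stated assumptions, not the original author's code: the ground-truth function, the noise level, the sample sizes and the helper bias2_and_variance are all made up for illustration. It refits the same pipeline on many freshly drawn training sets, then measures how far the average prediction is from the truth (bias squared) and how much the individual predictions scatter around that average (variance).

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

def true_f(x):
    # Hypothetical ground-truth function, chosen for illustration.
    return 0.5 * x**2 + x + 2

# Fixed points at which every fitted model is evaluated.
x_test = np.linspace(-3, 3, 30).reshape(-1, 1)

def bias2_and_variance(degree, n_repeats=200, n_samples=30):
    """Estimate bias^2 and variance of a polynomial model of the given degree."""
    preds = np.empty((n_repeats, len(x_test)))
    for i in range(n_repeats):
        # Draw a fresh noisy training set in every round.
        x_train = rng.uniform(-3, 3, size=(n_samples, 1))
        y_train = true_f(x_train[:, 0]) + rng.normal(0, 1, size=n_samples)
        model = Pipeline([
            ('poly', PolynomialFeatures(degree=degree)),
            ('scaler', StandardScaler()),
            ('lr_reg', LinearRegression()),
        ])
        model.fit(x_train, y_train)
        preds[i] = model.predict(x_test)
    mean_pred = preds.mean(axis=0)
    # Bias^2: squared gap between the average prediction and the truth.
    bias2 = np.mean((mean_pred - true_f(x_test[:, 0])) ** 2)
    # Variance: spread of the predictions around their own average.
    variance = np.mean(preds.var(axis=0))
    return bias2, variance

results = {d: bias2_and_variance(d) for d in (1, 2, 10)}
for d, (b2, var) in results.items():
    print(f"degree={d:2d}  bias^2={b2:.3f}  variance={var:.3f}")
```

In this setup the degree-1 model cannot represent the quadratic, so its bias is the largest, while the degree-10 model reacts strongly to each training sample, so its variance is the largest. The exact numbers depend on the random seed and on the assumed noise level.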

The trade-off between bias and variance: bias and variance cannot be eliminated completely; their impact can only be minimized. The main challenge comes from variance. Common ways to deal with high variance are:

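Reduce the model complexity
Reduce the dimensionality of the data; remove noise
Increase the number of samples
Use a validation set
Apply regularization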
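
Two of the remedies listed above, a validation set and regularization, can be sketched together. Everything specific here is an assumption for illustration: the synthetic data, the deliberately high degree of 15, the choice of Ridge (L2 regularization) with alpha=1.0, and the helper make_model. The original post does not name a particular regularizer.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error

# Hypothetical noisy quadratic data.
rng = np.random.default_rng(42)
x = rng.uniform(-3, 3, size=100)
X = x.reshape(-1, 1)
y = 0.5 * x**2 + x + 2 + rng.normal(0, 1, size=100)

# Hold out 30% of the samples as a validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0
)

def make_model(regressor, degree=15):
    # Same pipeline shape for both models; only the final regressor differs.
    return Pipeline([
        ('poly', PolynomialFeatures(degree=degree, include_bias=False)),
        ('scaler', StandardScaler()),
        ('reg', regressor),
    ])

models = {
    'plain': make_model(LinearRegression()),
    'ridge': make_model(Ridge(alpha=1.0)),
}

coef_norms, val_mse = {}, {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Size of the fitted coefficient vector: regularization shrinks it.
    coef_norms[name] = np.linalg.norm(model.named_steps['reg'].coef_)
    # Error on data the model has not seen during fitting.
    val_mse[name] = mean_squared_error(y_val, model.predict(X_val))
    print(f"{name}: coef norm={coef_norms[name]:.3f}  validation MSE={val_mse[name]:.3f}")
```

The L2 penalty pulls the coefficients toward zero, so the ridge model ends up with a smaller coefficient norm than the unregularized fit. Whether that also lowers the validation error depends on the data and on alpha, which is why the validation set is used to compare the candidates.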

Reference articles:
https://mp.weixin.qq.com/s/KnOZ2mK15G1w9fRZCVHJmQ
https://mp.weixin.qq.com/s/K_4DH7BC7jIF2-ltHBWGmA

Origin blog.csdn.net/sun91019718/article/details/105279128