Machine Learning (5) - Polynomial Regression and Model Generalization

What is polynomial regression?

import numpy as np
import matplotlib.pyplot as plt

x=np.random.uniform(-3,3,size=100)
X=x.reshape(-1,1)

y=0.5* x**2 +x+ 2 + np.random.normal(0,1,size=100)

plt.scatter(x,y)
plt.show()


Try using linear regression:

from sklearn.linear_model import LinearRegression
lin_reg=LinearRegression()
lin_reg.fit(X,y)

y_predict=lin_reg.predict(X)

plt.scatter(x,y)
plt.plot(x,y_predict,color='r')
plt.show()

The line clearly does not fit the data. We can treat x**2 as an additional feature, stack it together with the original feature, and train on the combined matrix; that is polynomial regression:

X2=np.hstack([X,X**2])

lin_reg2=LinearRegression()
lin_reg2.fit(X2,y)
y_predict2=lin_reg2.predict(X2)

plt.scatter(x,y)
plt.plot(np.sort(x),y_predict2[np.argsort(x)],color='r') # sort x first, otherwise the line plot is a jumbled zigzag
plt.show()

Take a look at the fitted coefficients and the intercept:

lin_reg2.coef_

lin_reg2.intercept_

They are very close to the values we used to generate the data (coefficient of x about 1, coefficient of x² about 0.5, intercept about 2).

Polynomial regression and Pipeline in scikit-learn

Polynomial regression in scikit-learn

from sklearn.preprocessing import PolynomialFeatures

poly=PolynomialFeatures(degree=2)
poly.fit(X)
X2=poly.transform(X)

X2.shape

X2[:5,:] # the columns are x^0, x^1, and x^2

from sklearn.linear_model import LinearRegression
lin_reg2=LinearRegression()
lin_reg2.fit(X2,y)
y_predict2=lin_reg2.predict(X2)

plt.scatter(x,y)
plt.plot(np.sort(x),y_predict2[np.argsort(x)],color='r')
plt.show()

The coefficient of the x^0 column is 0, which is correct: the constant term is carried by the intercept instead.
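As a side note (an optional variation, not in the original post), PolynomialFeatures can drop the bias column entirely:

poly_no_bias=PolynomialFeatures(degree=2,include_bias=False)
poly_no_bias.fit_transform(X)[:5,:] # only the x^1 and x^2 columns remain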

X=np.arange(1,11).reshape(-1,2)
poly=PolynomialFeatures(degree=2)
poly.fit(X)
X2=poly.transform(X)

X2.shape
X2 # with two original features a and b, the degree-2 terms are three: a^2, a*b, b^2

So with degree=3 there should be 10 columns (1, a, b, a², ab, b², a³, a²b, ab², b³).
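A quick check of that count (a hypothetical extra cell, following the same pattern as above):

poly3=PolynomialFeatures(degree=3)
poly3.fit(X)
poly3.transform(X).shape # (5, 10)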

Pipeline

Pipeline is a very handy tool provided by scikit-learn: it chains several processing steps so they run in sequence:

x=np.random.uniform(-3,3,size=100)
X=x.reshape(-1,1)
y=0.5* x**2 + x + 2 +np.random.normal(0,1,100)

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
poly_reg=Pipeline([
    ("poly",PolynomialFeatures(degree=2)),#最大为2次方
    ("std_scaler",StandardScaler()),#归一化
    ("lin_reg",LinearRegression()) #线性回归
])

poly_reg.fit(X,y)
y_predict=poly_reg.predict(X)

plt.scatter(x,y)
plt.plot(np.sort(x),y_predict[np.argsort(x)],color='r')
plt.show()

Overfitting and underfitting

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(666)
x=np.random.uniform(-3.0,3.0,size=100)
X=x.reshape(-1,1)
y=0.5* x**2 + x +2 +np.random.normal(0,1,size=100)

plt.scatter(x,y)
plt.show()

from sklearn.linear_model import LinearRegression
lin_reg=LinearRegression()
lin_reg.fit(X,y)
lin_reg.score(X,y) # the R^2 score

Here we use the mean squared error instead, so that different models can be compared on the same scale.

The straight line (linear regression):

from sklearn.metrics import mean_squared_error

y_predict=lin_reg.predict(X)
mean_squared_error(y,y_predict)


The quadratic polynomial:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler

def PolynomialRegression(degree):
    return Pipeline([
        ("poly",PolynomialFeatures(degree=degree)),
        ("std_scaler",StandardScaler()),
        ("lin_reg",LinearRegression())
    ])

poly2_reg=PolynomialRegression(degree=2)
poly2_reg.fit(X,y)

y2_predict=poly2_reg.predict(X)
mean_squared_error(y,y2_predict)

Try pushing the degree up to 100:

poly100_reg=PolynomialRegression(degree=100)
poly100_reg.fit(X,y)

y100_predict=poly100_reg.predict(X)
mean_squared_error(y,y100_predict)

plt.scatter(x,y)
plt.plot(np.sort(x),y100_predict[np.argsort(x)],'r')
plt.show()

Test with an evenly spaced sequence from -3 to 3 to see the whole curve:

X_plot=np.linspace(-3,3,100).reshape(100,1)
y_plot=poly100_reg.predict(X_plot)
plt.scatter(x,y)
plt.plot(X_plot[:,0],y_plot,color="r")
plt.axis([-3,3,-1,10])
plt.show()

The degree-100 curve has become extremely complicated: that is overfitting. The straight line is too simple: that is underfitting.

The significance of train_test_split

As the overfitted curve above shows, such a model predicts new data very poorly; we say its generalization ability is weak.

The point of a test set is therefore to split the data into training data and test data: if a model trained only on the training data still achieves high accuracy on the test data, its generalization ability is strong.

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,random_state=666)

lin_reg=LinearRegression()
lin_reg.fit(X_train,y_train)
y_predict=lin_reg.predict(X_test)
mean_squared_error(y_test,y_predict)

That is the test-set error for linear regression on the train/test split.

Now the quadratic polynomial regression:

poly2_reg=PolynomialRegression(degree=2)
poly2_reg.fit(X_train,y_train)
y2_predict=poly2_reg.predict(X_test)
mean_squared_error(y_test,y2_predict)

Passing degree=10:
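A minimal sketch of that step, since the original code is not reproduced here (same PolynomialRegression helper as above):

poly10_reg=PolynomialRegression(degree=10)
poly10_reg.fit(X_train,y_train)
y10_predict=poly10_reg.predict(X_test)
mean_squared_error(y_test,y10_predict)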

When the whole data set was used as training data, the degree-10 model had a smaller error than the quadratic one; but after splitting off a test set, its test error turns out to be larger than that of the quadratic polynomial.

Learning curves

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(666)
x=np.random.uniform(-3.0,3.0,size=100)
X=x.reshape(-1,1)
y=0.5* x**2 + x + 2 + np.random.normal(0,1,size=100)

plt.scatter(x,y)
plt.show()

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,random_state=10)

X_train.shape

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

train_score=[]
test_score=[]
for i in range (1,76):
    lin_reg=LinearRegression()
    lin_reg.fit(X_train[:i],y_train[:i])
    y_train_predict=lin_reg.predict(X_train[:i])
    train_score.append(mean_squared_error(y_train[:i],y_train_predict))
    
    y_test_predict=lin_reg.predict(X_test)
    test_score.append(mean_squared_error(y_test,y_test_predict))
    
plt.plot([i for i in range(1,76)],np.sqrt(train_score),label="train")
plt.plot([i for i in range(1,76)],np.sqrt(test_score),label="test")
plt.legend()
plt.show()

Wrap the plotting of the learning curve into a function:

def plot_learning_curve(algo,X_train,X_test,y_train,y_test):
    train_score=[]
    test_score=[]
    for i in range (1,len(X_train)+1):
        algo.fit(X_train[:i],y_train[:i])
        y_train_predict=algo.predict(X_train[:i])
        train_score.append(mean_squared_error(y_train[:i],y_train_predict))
    
        y_test_predict=algo.predict(X_test)
        test_score.append(mean_squared_error(y_test,y_test_predict))
        
    plt.plot([i for i in range(1,len(X_train)+1)],np.sqrt(train_score),label="train")
    plt.plot([i for i in range(1,len(X_train)+1)],np.sqrt(test_score),label="test")
    plt.legend()
    plt.axis([0,len(X_train)+1,0,4])
    plt.show()

The learning curve of linear regression (underfitting):

plot_learning_curve(LinearRegression(),X_train,X_test,y_train,y_test)

The learning curve of quadratic polynomial regression (the best fit):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler

def PolynomialRegression(degree):
    return Pipeline([
        ("poly",PolynomialFeatures(degree=degree)),
        ("std_scaler",StandardScaler()),
        ("lin_reg",LinearRegression())
    ])

poly2_reg=PolynomialRegression(degree=2)

plot_learning_curve(poly2_reg,X_train,X_test,y_train,y_test)

The learning curve of degree-20 polynomial regression (overfitting):

poly20_reg=PolynomialRegression(degree=20)

plot_learning_curve(poly20_reg,X_train,X_test,y_train,y_test)

Notice that for both underfitting and overfitting, the level at which the train and test curves converge (i.e. the error) is higher than for the optimal quadratic model.

Validation set and cross-validation

With the train_test_split approach we adjust the hyperparameters and refit again and again according to the accuracy on the test set, so in the end we may overfit the test set itself. We can therefore split out an additional validation set to take over the role the test set used to play, and keep the test set as the final judge of the model:

Cross-validation:

The training data is split into k parts and k models are trained, each validated on a different part; the mean of the k scores is used as the basis for tuning the hyperparameters.

import numpy as np
from sklearn import datasets

digits=datasets.load_digits()
X=digits.data
y=digits.target

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.4,random_state=666) 

# search for the best hyperparameters with the plain train/test split approach
from sklearn.neighbors import KNeighborsClassifier
best_score,best_p,best_k=0,0,0
for k in range(2,11):
    for p in range(1,6):
        knn_clf=KNeighborsClassifier(weights="distance",n_neighbors=k,p=p)
        knn_clf.fit(X_train,y_train)
        score=knn_clf.score(X_test,y_test)
        if score>best_score:
            best_score,best_p,best_k=score,p,k
        
print("Best k=",best_k)
print("Best p=",best_p)
print("Best Score=",best_score)

Calling cross-validation:

from sklearn.model_selection import cross_val_score
knn_clf=KNeighborsClassifier()
cross_val_score(knn_clf,X_train,y_train)

By default (in this version of scikit-learn) the training data is split into three folds, so three scores are returned, as shown above.

# search for the best hyperparameters with cross-validation
best_score,best_p,best_k=0,0,0
for k in range(2,11):
    for p in range(1,6):
        knn_clf=KNeighborsClassifier(weights="distance",n_neighbors=k,p=p)
        scores=cross_val_score(knn_clf,X_train,y_train)
        score=np.mean(scores)
        if score>best_score:
            best_score,best_p,best_k=score,p,k
        
print("Best k=",best_k)
print("Best p=",best_p)
print("Best Score=",best_score)

best_knn_clf=KNeighborsClassifier(weights="distance",n_neighbors=2,p=2)

best_knn_clf.fit(X_train,y_train)
best_knn_clf.score(X_test,y_test)

Although the accuracy is a bit lower than the one obtained with the plain train_test_split method above, that method may have overfitted the test set; the cross-validated result is more credible.

Recall grid search:

from sklearn.model_selection import GridSearchCV
param_grid=[
    {
        'weights':['distance'],
        'n_neighbors':[i for i in range(2,11)],
        'p':[i for i in range(1,6)]
    }
]

grid_search=GridSearchCV(knn_clf,param_grid,verbose=1)
grid_search.fit(X_train,y_train)

The CV in GridSearchCV stands for cross-validation. The training data is split into three folds, and there are 9 × 5 = 45 parameter combinations, i.e. 45 candidate models, so 3 × 45 = 135 fits are performed in total.

grid_search.best_score_
grid_search.best_params_
best_knn_clf=grid_search.best_estimator_
best_knn_clf.score(X_test,y_test)

As we can see, this matches the result we obtained with cross_val_score above.

Splitting the data into 5 folds instead:

cross_val_score(knn_clf,X_train,y_train,cv=5)

GridSearchCV(knn_clf,param_grid,verbose=1,cv=5)

k-fold cross-validation: the training data is split into k parts. The drawback is that k models have to be trained each time, so overall it is roughly k times slower.

Leave-one-out cross-validation (LOO-CV): the training data with m samples is split into m parts. It is completely free of randomness and comes closest to the model's true performance; the drawback is the enormous amount of computation.
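As a minimal sketch (not in the original post), leave-one-out can be requested through the cv argument of cross_val_score; note that this trains one model per sample, so it is slow on the digits data:

from sklearn.model_selection import LeaveOneOut
scores=cross_val_score(knn_clf,X_train,y_train,cv=LeaveOneOut())
np.mean(scores)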

Bias-variance trade-off

Bias and variance

Model error = bias + variance + irreducible error

The main cause of bias: the assumptions made about the problem itself are wrong, for example using linear regression on non-linear data; this corresponds to underfitting.

The main cause of variance: a small disturbance in the data has a large impact on the model, usually because the model used is too complex, for example high-degree polynomial regression; this corresponds to overfitting.

Some algorithms are inherently high-variance, for example kNN.

Non-parametric learning algorithms are usually high-variance, because they make no assumptions about the data.

Some algorithms are inherently high-bias, for example linear regression.

Parametric learning algorithms are usually high-bias, because they make strong assumptions about the data.

Most algorithms have parameters that let us trade bias against variance, for example k in kNN, or the degree when replacing linear regression with polynomial regression.

Bias and variance are usually in conflict: reducing bias increases variance, and reducing variance increases bias.

The main challenge in machine learning comes from variance!

The usual ways to deal with high variance:

  1. Reduce the complexity of the model
  2. Reduce data dimensions, noise reduction
  3. Increase the number of samples
  4. Use validation set
  5. Model regularization

Model regularization and ridge regression

Model regularization (regularization): limiting the size of the model's parameters.

An overfitted curve is very steep because the learned coefficients are huge and wildly different in magnitude.

Use linear data plus noise to test:

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)
x=np.random.uniform(-3.0,3.0,size=100)
X=x.reshape(-1,1)
y=0.5* x + 3 +np.random.normal(0,1,size=100)

plt.scatter(x,y)
plt.show()

Polynomial regression:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

def PolynomialRegression(degree):
    return Pipeline([
        ("poly",PolynomialFeatures(degree=degree)),
        ("std_scaler",StandardScaler()),
        ("lin_reg",LinearRegression())
    ])

from sklearn.model_selection import train_test_split
np.random.seed(666)
X_train,X_test,y_train,y_test=train_test_split(X,y)

from sklearn.metrics import mean_squared_error
poly_reg=PolynomialRegression(degree=20)
poly_reg.fit(X_train,y_train)
y_poly_predict=poly_reg.predict(X_test)
mean_squared_error(y_test,y_poly_predict)
    

The MSE is very large.

Wrap the plotting of the model's fitted curve into a function:

def plot_model(model):
    X_plot=np.linspace(-3,3,100).reshape(100,1)
    y_plot=model.predict(X_plot)
    plt.scatter(x,y)
    plt.plot(X_plot[:,0],y_plot,color="r")
    plt.axis([-3,3,0,6])
    plt.show()
    
plot_model(poly_reg)

Clearly, the degree-20 polynomial overfits the training data.

Now let's bring in ridge regression.
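Ridge regression adds an L2 penalty on the coefficients to the loss function; roughly sketched (the post's original formula is not reproduced here, the exact scaling of the error term varies by implementation, and α is the alpha parameter passed to Ridge below):

J(\theta) = \mathrm{MSE}(y, \hat{y}) + \alpha \sum_{i=1}^{n} \theta_i^2

Build a ridge pipeline in the same style as before: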

from sklearn.linear_model import Ridge
def RidgeRegression(degree,alpha): # alpha is the α in the regularization term above
    return Pipeline([
        ("poly",PolynomialFeatures(degree=degree)),
        ("std_scaler",StandardScaler()),
        ("ridge_reg",Ridge(alpha=alpha))
    ])
    
ridge1_reg=RidgeRegression(20,0.0001)
ridge1_reg.fit(X_train,y_train)
y1_predict=ridge1_reg.predict(X_test)
mean_squared_error(y_test,y1_predict)

The MSE drops dramatically!

plot_model(ridge1_reg)

If alpha = 1:

ridge2_reg=RidgeRegression(20,1)
ridge2_reg.fit(X_train,y_train)
y2_predict=ridge2_reg.predict(X_test)
mean_squared_error(y_test,y2_predict)
plot_model(ridge2_reg)

alpha=100:

ridge3_reg=RidgeRegression(20,100)
ridge3_reg.fit(X_train,y_train)
y3_predict=ridge3_reg.predict(X_test)
mean_squared_error(y_test,y3_predict)

plot_model(ridge3_reg)

We can see that as alpha increases, the regularization term accounts for more and more of the loss function, so the optimization focuses on shrinking the coefficients; in the end they are all close to zero and the fit becomes almost a flat straight line.
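For example, with an extremely large alpha the fit degenerates to an almost horizontal line (a hypothetical extra step, following the same pattern):

ridge4_reg=RidgeRegression(20,1000000)
ridge4_reg.fit(X_train,y_train)
plot_model(ridge4_reg)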

LASSO regression

Continue with the same data used for ridge regression above:

from sklearn.linear_model import Lasso

def LassoRegression(degree,alpha):
    return Pipeline([
        ("poly",PolynomialFeatures(degree=degree)),
        ("std_scaler",StandardScaler()),
        ("lasso_reg",Lasso(alpha=alpha))
    ])
    
lasso1_reg=LassoRegression(20,0.01)
lasso1_reg.fit(X_train,y_train)

y1_predict=lasso1_reg.predict(X_test)
mean_squared_error(y_test,y1_predict)

plot_model(lasso1_reg)

Passing alpha = 0.1:

lasso2_reg=LassoRegression(20,0.1)
lasso2_reg.fit(X_train,y_train)

y2_predict=lasso2_reg.predict(X_test)
mean_squared_error(y_test,y2_predict)

plot_model(lasso2_reg)

Here the fit already looks almost like a straight line, whereas ridge regression with a comparable amount of regularization still produced a curve.

Comparing Ridge and LASSO:
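Schematically (a sketch, not reproduced from the post's figure), ridge penalizes \alpha \sum_i \theta_i^2 while LASSO penalizes \alpha \sum_i |\theta_i|; the L1 penalty tends to drive some coefficients exactly to zero, which is why LASSO can be used for feature selection.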

L1, L2, and elastic net regularization

L0 regularization makes the number of non-zero coefficients θ as small as possible.

The elastic net combines the advantages of ridge regression (the computation is relatively stable and accurate, but since it does not perform feature selection it becomes expensive as the number of features grows) and LASSO regression (it performs feature selection by pushing coefficients to exactly zero, which may however discard useful features by mistake).
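A minimal sketch of an elastic-net pipeline in the same style as the helpers above (ElasticNet and its l1_ratio parameter come from scikit-learn; the degree, alpha, and l1_ratio values here are only illustrative):

from sklearn.linear_model import ElasticNet

def ElasticNetRegression(degree,alpha,l1_ratio):
    return Pipeline([
        ("poly",PolynomialFeatures(degree=degree)),
        ("std_scaler",StandardScaler()),
        ("elastic_net",ElasticNet(alpha=alpha,l1_ratio=l1_ratio)) # l1_ratio mixes the L1 and L2 penalties
    ])

elastic_reg=ElasticNetRegression(20,0.1,0.5)
elastic_reg.fit(X_train,y_train)
mean_squared_error(y_test,elastic_reg.predict(X_test))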
