[Machine Learning] Model Optimization and Regularization Based on Linear Regression


Foreword

In machine learning, optimizing model parameters is a critical step. For different models and datasets we need to choose an appropriate optimization method to obtain the optimal model parameters. At the same time, regularization can effectively reduce the complexity of the model, avoid overfitting, and improve the model's generalization ability.

1. Implementing a simple linear regression equation

We can implement this process with Python code.

  • Build a dataset (y = 8 + 3x + noise)
import numpy as np
import matplotlib.pyplot as plt

X = 2 * np.random.rand(100, 1)           # random matrix of shape (100, 1), values uniform in [0, 2)
y = 8 + 3 * X + np.random.randn(100, 1)  # y = 8 + 3x plus standard Gaussian noise

Let's plot the data first to observe how it is distributed.

plt.plot(X,y,'r.')
plt.show()

Adding a bias term

X_b = np.c_[np.ones((100,1)),X]

X_b = np.c_[np.ones((100,1)), X] adds a column of constant ones to the dataset X, i.e. the bias term, producing a new design matrix X_b. Here np.ones((100,1)) creates a 100-row, 1-column array of ones.
In machine learning, a linear regression model is usually written as y = Xw + b, where y is the target variable, X the independent variables (features), w the weights, and b the bias. Given X, the goal of linear regression is to find the weights and bias that bring the predictions as close as possible to the true values.
In practice the bias is often folded into the feature matrix by extending X to [1, X]: the constant column can be viewed as an extra feature that always equals 1, and its weight plays the role of the intercept b.
Adding this constant column lets us solve for the intercept and the weights together, and X_b = np.c_[np.ones((100,1)),X] is the code that does exactly that.
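As a minimal illustration of what np.c_ does here (my own tiny example with two hypothetical samples):

import numpy as np

X_demo = np.array([[1.5],
                   [0.3]])                  # two hypothetical samples with a single feature
X_demo_b = np.c_[np.ones((2, 1)), X_demo]   # prepend a column of ones (the bias column)
print(X_demo_b)
# [[1.  1.5]
#  [1.  0.3]]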

Solving the regression equation (normal equation)
The closed-form solution is theta_best = (X_b^T * X_b)^(-1) * X_b^T * y, which the line below implements:

theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)

Printing theta_best shows the fitted parameters: the first value is the bias (intercept), which should be close to 8, and the second is the weight, which should be close to 3; the noise keeps them from matching exactly. These two numbers are the parameters of our model.
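As an optional cross-check of my own (not part of the original walkthrough), sklearn's LinearRegression should recover essentially the same parameters:

from sklearn.linear_model import LinearRegression

lin_check = LinearRegression()     # fits the intercept separately, so we pass X rather than X_b
lin_check.fit(X, y)
print(lin_check.intercept_, lin_check.coef_)   # should be close to theta_best (roughly 8 and 3)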

  • Generate test data, predict, observe
X_new = np.array([[0],[2]])
X_new_b = np.c_[np.ones((2,1)),X_new]
y_predict = X_new_b.dot(theta_best)
y_predict
plt.plot(X_new,y_predict,'b--')
plt.plot(X,y,'r.')
plt.axis([0,2,6,15])
plt.show()

As the figure shows, the predicted line basically fits the data.

2. Implementation and comparison of three gradient descent variants

1. Batch Gradient Descent

  • Formula
    gradients = (2/m) * X_b^T * (X_b * theta - y),   theta = theta - eta * gradients
    Python implementation:
eta = 0.1                      # learning rate
n_iterations = 1000
m = 100                        # number of samples
theta = np.random.rand(2, 1)   # random initialization
for iteration in range(n_iterations):
    gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y)
    theta = theta - eta * gradients

We look at the effect of different learning rates on the fitting process

theta_path_bgd = []
def plot_gradient_descent(theta, eta, theta_path=None):
    m = len(X_b)
    plt.plot(X, y, 'r.')
    n_iterations = 1000
    for iteration in range(n_iterations):
        y_predict = X_new_b.dot(theta)
        plt.plot(X_new, y_predict, 'b-')
        gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y)
        theta = theta - eta * gradients
        if theta_path is not None:   # record the parameter trajectory if a list is passed in
            theta_path.append(theta)
    plt.xlabel('X_1')
    plt.axis([0, 2, 0, 15])
    plt.title('eta = {}'.format(eta))

theta = np.random.randn(2,1)

plt.figure(figsize=(10,4))
plt.subplot(131)
plot_gradient_descent(theta,eta = 0.01)
plt.subplot(132)
plot_gradient_descent(theta,eta = 0.1,theta_path=theta_path_bgd)
plt.subplot(133)
plot_gradient_descent(theta,eta = 0.4)
plt.show()

From the three plots we can see that the smaller the learning rate, the more updates are needed and the longer training takes, while too large a rate makes the steps overshoot. We want the learning rate as large as possible while still converging accurately; of the three, eta = 0.1 works best.

2. Stochastic Gradient Descent

theta_path_sgd=[]
m = len(X_b)
np.random.seed(42)
n_epochs = 100

t0 = 5
t1 = 50

def learning_schedule(t):
    return t0/(t1+t)

theta = np.random.randn(2,1)

for epoch in range(n_epochs):
    for i in range(m):
        if epoch < 10 and i<10:
            y_predict = X_new_b.dot(theta)
            plt.plot(X_new,y_predict,'b-')
        random_index = np.random.randint(m)
        xi = X_b[random_index:random_index+1]
        yi = y[random_index:random_index+1]
        gradients = 2* xi.T.dot(xi.dot(theta)-yi)
        eta = learning_schedule(epoch*m+i)
        theta = theta-eta*gradients
        theta_path_sgd.append(theta)

plt.plot(X,y,'r.')
plt.axis([0,2,0,15])
plt.show()

This code implements Stochastic Gradient Descent (SGD) to fit the same simple linear regression model. Explanation of the code:

theta_path_sgd=[]: initialize theta_path_sgd as an empty list used to store the model parameters at each step.
m = len(X_b): the size of the dataset.
np.random.seed(42): set the random seed so that each run produces the same random numbers.
n_epochs = 100: the number of training epochs.
t0 = 5 and t1 = 50: hyperparameters of the learning-rate schedule.
def learning_schedule(t): define the learning-rate schedule, which returns the learning rate for a given step t (a small check of its decay follows after this list).
theta = np.random.randn(2,1): initialize the parameter vector theta as a random 2x1 vector.
for epoch in range(n_epochs): run n_epochs training epochs.
for i in range(m): perform m gradient updates per epoch.
if epoch < 10 and i < 10: visualization code that draws the fitted line during the first few steps.
random_index = np.random.randint(m): randomly pick the index of one sample.
xi = X_b[random_index:random_index+1] and yi = y[random_index:random_index+1]: select that single random sample for the update.
gradients = 2 * xi.T.dot(xi.dot(theta) - yi): compute the gradient on the single sample.
eta = learning_schedule(epoch*m + i): compute the learning rate for the current step.
theta = theta - eta * gradients: update the parameters with a gradient step.
theta_path_sgd.append(theta): append the updated theta to theta_path_sgd.
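As a small aside of my own, printing the schedule at a few steps shows how the step size shrinks (values follow directly from t0 = 5, t1 = 50):

for t in (0, 50, 500, 5000):
    print(t, learning_schedule(t))
# 0    0.1
# 50   0.05
# 500  ~0.0091
# 5000 ~0.00099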

The figure shows that the parameter updates are large at the beginning and become smaller and smaller as training approaches convergence, because the learning-rate schedule gradually shrinks the step size: be bold at first, then careful near the end.

3. Mini-batch gradient descent

theta_path_mgd=[]
n_epochs = 100
minibatch = 16
theta = np.random.randn(2,1)
t0, t1 = 200, 1000
def learning_schedule(t):
    return t0 / (t + t1)
np.random.seed(42)
t = 0
for epoch in range(n_epochs):
    shuffled_indices = np.random.permutation(m)
    X_b_shuffled = X_b[shuffled_indices]
    y_shuffled = y[shuffled_indices]
    for i in range(0,m,minibatch):
        t+=1
        xi = X_b_shuffled[i:i+minibatch]
        yi = y_shuffled[i:i+minibatch]
        gradients = 2/minibatch* xi.T.dot(xi.dot(theta)-yi)
        eta = learning_schedule(t)
        theta = theta-eta*gradients
        theta_path_mgd.append(theta)

minibatch = 16 sets the batch size: each parameter update uses 16 samples (see the small note below on how the 100 samples split into batches).
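A small detail worth noting (my own aside): with m = 100 and a batch size of 16, each epoch produces 7 mini-batches, and the last one holds only the 4 remaining samples:

print(list(range(0, m, minibatch)))      # [0, 16, 32, 48, 64, 80, 96] -> 7 mini-batches per epoch
print(len(X_b[96:96 + minibatch]))       # 4 -- the final, smaller batch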

4. Comparison of three gradient descent methods

theta_path_bgd = np.array(theta_path_bgd)
theta_path_sgd = np.array(theta_path_sgd)
theta_path_mgd = np.array(theta_path_mgd)
plt.figure(figsize=(12, 6))
plt.plot(theta_path_sgd[:, 0], theta_path_sgd[:, 1], 'r-s', linewidth=1, label='SGD')
plt.plot(theta_path_mgd[:, 0], theta_path_mgd[:, 1], 'g-+', linewidth=2, label='MINIGD')
plt.plot(theta_path_bgd[:, 0], theta_path_bgd[:, 1], 'b-o', linewidth=3, label='BGD')
plt.legend(loc='upper left')
plt.axis([4.0, 10, 2.0, 6.0])
plt.show()

The three lists theta_path_bgd, theta_path_sgd and theta_path_mgd store the parameter values recorded by the three methods above; converting them to NumPy arrays makes them easy to plot.

In the figure, batch gradient descent is the smooth blue curve, stochastic gradient descent the jittery red one, and mini-batch gradient descent the green one. In practice we usually choose mini-batch gradient descent, and within memory limits a larger batch size generally gives smoother, more stable updates.
The advantages of mini-batch gradient descent are:

  • Less computation per update: only the gradient of a small batch of samples is computed each time, which reduces the amount of calculation and speeds up training.

  • Lower memory consumption: full-batch gradient descent needs the entire training set in memory at once, whereas mini-batch gradient descent only needs one batch at a time.

  • Easier to escape local minima: because each update only looks at part of the data, the added noise makes it easier to jump out of local minima and move toward a better solution.

  • Better generalization: the randomness introduced during training tends to improve the model's ability to generalize to unseen data.

3. Implementing a polynomial regression equation

  • Construct the dataset (y = 0.5*x^2 + x + noise)
m = 100
X = 6 * np.random.rand(m, 1) - 2
y = 0.5 * X ** 2 + X + np.random.randn(m, 1)
plt.plot(X, y, 'b.')
plt.xlabel('X_1')
plt.ylabel('y')
plt.axis([-2, 4, -5, 10])
plt.show()


  • Polynomial expansion of features

from sklearn.preprocessing import PolynomialFeatures
# degree=2 adds the squared term; include_bias=False leaves out the constant column
poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)

In machine learning, the quantity and quality of features have a large impact on model performance. Polynomial regression expands the original features into a higher-dimensional space so that a linear model can fit nonlinear relationships.
Concretely, polynomial expansion combines the original features into new ones. For example, for a quadratic relationship y = ax^2 + bx + c, we can expand x into [x, x^2] and fit y = a1*x + a2*x^2 + b, extending the feature space from one dimension to two.
PolynomialFeatures is a class in the Scikit-Learn library for constructing polynomial features. It converts the original features into a new feature matrix whose columns are powers and cross products of the original features. The degree parameter specifies the degree of the expansion, and the include_bias parameter specifies whether to add a bias column (x^0, corresponding to the intercept of the linear equation).

For example, suppose the original features are x1 and x2. With degree=2 and include_bias=False, PolynomialFeatures generates a new feature matrix X_poly whose columns are the original features plus their powers and cross terms: [x1, x2, x1^2, x1*x2, x2^2]. The two original features are therefore expanded into five, which lets the model fit nonlinear relationships better.
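A quick check of this (using a tiny hypothetical two-feature sample of my own) makes the expansion concrete:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X_demo = np.array([[2.0, 3.0]])                     # one sample with hypothetical features x1=2, x2=3
poly_demo = PolynomialFeatures(degree=2, include_bias=False)
print(poly_demo.fit_transform(X_demo))
# [[2. 3. 4. 6. 9.]]  i.e. x1, x2, x1^2, x1*x2, x2^2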

  • Model
    We expand the features polynomially and train the model. To see the effect, we generate some test data, apply the same polynomial transformation to its features, and then plot the predictions.
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(X_poly, y)  # train the model

X_new=np.linspace(-2,2,100).reshape(100,1)
X_new_poly=poly_features.transform(X_new)
y_new=lin_reg.predict(X_new_poly)
plt.plot(X,y,'r.')
plt.plot(X_new,y_new,'b.',label='predictions')

plt.axis([-3,3,-5,10])
plt.show()

The resulting curve shows that the model fits the data well.

4. Standardization and the effect of feature dimensionality (polynomial degree)

In machine learning, standardization is a commonly used preprocessing technique that transforms data so that each feature has zero mean and unit variance. Standardization makes the values of different features comparable and can improve the performance of many machine learning algorithms, such as support vector machines and k-nearest neighbors.

The implementation of standardization usually involves the following two steps:

Centering: shift the mean of the data to 0.

Scaling: scale the variance (standard deviation) of the data to 1.

When a model has multiple features, standardization can effectively speed up the fitting (convergence) process.

The implementation of standardization is usually done using the StandardScaler class in the Scikit-Learn library. This class can center and scale the data, and at the same time save the mean and standard deviation of the training set so that subsequent test sets can use the same standardization method.

from sklearn.preprocessing import StandardScaler

# create a StandardScaler object
scaler = StandardScaler()

# standardize the training set (X_train is a placeholder for your own training split)
X_train_scaled = scaler.fit_transform(X_train)

# apply the same standardization to the test set
X_test_scaled = scaler.transform(X_test)
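Under the hood the transformation is just (x - mean) / std. As a small check of my own on the polynomial features created above:

scaler_check = StandardScaler().fit(X_poly)
manual = (X_poly - scaler_check.mean_) / scaler_check.scale_
print(np.allclose(manual, scaler_check.transform(X_poly)))   # True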

For convenience of observation, we continue to use the polynomial dataset generated above
and compare the effect of different feature degrees.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
for style, width, degree in (('g-', 1, 100), ('b--', 2, 2), ('r-+', 3, 1)):
    poly_features = PolynomialFeatures(degree=degree, include_bias=False)
    std = StandardScaler()
    lin_reg = LinearRegression()
    Polynomial_reg = Pipeline([('poly_features', poly_features),
                               ('StandardScaler', std),
                               ('lin_reg', lin_reg)])
    Polynomial_reg.fit(X, y)
    y_new_2 = Polynomial_reg.predict(X_new)
    plt.plot(X_new, y_new_2, style, label='degree ' + str(degree), linewidth=width)
plt.plot(X, y, 'r.')
plt.axis([-3, 3, -5, 10])
plt.legend(loc='upper center')
plt.show()

Pipeline is a class in scikit-learn used to chain several processing steps into a single workflow. Each step can be a transformer (for preprocessing the data) or, as the final step, an estimator (for training and prediction).
Combining data standardization, feature construction and model training into one Pipeline object greatly simplifies building machine learning models: training, prediction and parameter tuning are then done on a single object, which keeps the code concise and easy to maintain.
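As an illustration (my own sketch, not from the original code), the degree-2 pipeline above is equivalent to chaining the steps by hand:

# manual chain for degree = 2, equivalent to the Pipeline above
poly2 = PolynomialFeatures(degree=2, include_bias=False)
X_poly2 = poly2.fit_transform(X)
std2 = StandardScaler()
X_poly2_std = std2.fit_transform(X_poly2)
lin2 = LinearRegression().fit(X_poly2_std, y)
y_manual = lin2.predict(std2.transform(poly2.transform(X_new)))   # matches the degree-2 curve above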

We can see that the higher the degree of the features, the higher the risk of overfitting, so an appropriate degree must be chosen. In the figure, degree = 2 gives the best fit.

5. The influence of sample size on model results

The number of samples has a very important impact on the results of machine learning. In general, the larger the number of samples, the better the machine learning algorithm can learn the characteristics and laws of the data, so as to obtain a more accurate and generalized model. In addition, increasing the number of samples can also effectively avoid the problem of overfitting, thereby improving the predictive ability of the model.

from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def plot_learning_curves(model, X, y):
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=100)
    train_errors, val_errors = [], []
    for m in range(1, len(X_train)):
        model.fit(X_train[:m], y_train[:m])
        y_train_predict = model.predict(X_train[:m])
        y_val_predict = model.predict(X_val)
        train_errors.append(mean_squared_error(y_train[:m], y_train_predict[:m]))
        val_errors.append(mean_squared_error(y_val, y_val_predict))
    plt.plot(np.sqrt(train_errors), 'r-+', linewidth=2, label='train_error')
    plt.plot(np.sqrt(val_errors), 'b-', linewidth=3, label='val_error')
    plt.xlabel('Training set size')
    plt.ylabel('RMSE')
    plt.legend()


lin_reg = LinearRegression()
plot_learning_curves(lin_reg, X_poly, y)
plt.axis([-1, 80, 0, 1.5])
plt.show()

The train_test_split function in this code is used to divide the dataset into a training set and a validation set, where the validation set size is 20% of the total number of samples. Then in the plot_learning_curves function, the learning curves are plotted by fitting the model multiple times with different training set sizes and calculating the training set error and validation set error. For each training set size m, the function first fits the model with the first m samples, and then calculates the error of the model on the training set and the validation set. The root mean square error (RMSE) is used here as the error metric. Finally, the function plots the training set error and validation set error as a function of the size of the training set to observe the performance of the model under different sample sizes.

mean_squared_error is a function in the sklearn.metrics module that is used to calculate the mean squared error (Mean Squared Error, MSE).
The mean squared error is a standard metric for how well a regression model fits. Its formula is MSE = (1/n) * Σ_{i=1}^{n} (y_i − ŷ_i)^2, where y_i is the true target value, ŷ_i is the value predicted by the model, and n is the sample size.
In machine learning, we usually use the mean square error to measure the prediction error of the regression model. The smaller the value of MSE, the more accurate the prediction result of the model. Therefore, the function of mean_squared_error is to calculate the MSE value between the prediction result of the regression model and the real result.
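For reference (a tiny check of my own), mean_squared_error is just the average of the squared residuals:

from sklearn.metrics import mean_squared_error
import numpy as np

y_true = np.array([3.0, 5.0, 2.5])
y_pred = np.array([2.5, 5.0, 4.0])
print(mean_squared_error(y_true, y_pred))   # 0.8333...
print(np.mean((y_true - y_pred) ** 2))      # same result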

Running the code plots the learning curves for the linear model.

The plotted curves show how the training error and validation error change as the training set grows, which lets us evaluate the model's generalization performance and detect underfitting or overfitting.

When both errors are relatively large and close to each other, the model is underfitting, and adding more training data will not improve it much. When the model is overfitting, the training error is small but the validation error is large, with a clear gap between them; in that case adding more training data can alleviate the overfitting to some extent.
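To see the overfitting case described above, one option (my own extension of the same helper, not part of the original post) is to feed plot_learning_curves a high-degree polynomial pipeline; the gap between the training and validation curves should then be noticeably wider:

polynomial_reg_10 = Pipeline([
    ('poly_features', PolynomialFeatures(degree=10, include_bias=False)),
    ('std_scaler', StandardScaler()),
    ('lin_reg', LinearRegression()),
])
plot_learning_curves(polynomial_reg_10, X, y)
plt.axis([-1, 80, 0, 1.5])
plt.show()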

6. Regularization

In machine learning, regularization is a commonly used technique to control the complexity of the model and prevent the model from overfitting. Regularization penalizes the weight of the model by adding a regular term to the loss function of the model, thereby making the model less complex and improving generalization performance.

In linear regression, the commonly used regularization methods are L1 regularization and L2 regularization. L1 regularization drives some weights exactly to 0, so it can be used for feature selection; L2 regularization shrinks the weights toward small values but does not zero them out.
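To make this concrete, the regularized objectives can be written in the same plain style as the formulas above (the exact scaling of the two terms varies between libraries, so treat this as the general form rather than Scikit-learn's precise definition):

J_ridge(w) = MSE(w) + alpha * (w_1^2 + w_2^2 + ... + w_n^2)     (L2 penalty)
J_lasso(w) = MSE(w) + alpha * (|w_1| + |w_2| + ... + |w_n|)     (L1 penalty)

In both cases the bias term is normally left unpenalized.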

1. Ridge

Ridge is a class that implements ridge regression in Scikit-learn, which is used to regularize the linear regression model to prevent overfitting.
Ridge regression is an improved linear regression method that limits the complexity of the model by adding an L2 regularization term to the loss function. The parameter alpha of the Ridge class controls the strength of the regularization term. The larger the alpha, the higher the regularization strength and the lower the model complexity.
Ridge regression still has a closed-form solution (the normal equation with an added alpha term), but it can also be solved with iterative numerical methods. The Ridge class in Scikit-learn encapsulates the solution process and offers several solvers, such as a Cholesky-based closed-form solver and iterative solvers like 'sag' and 'lsqr'. The Ridge class also works inside a Pipeline for data preprocessing and feature construction.

from sklearn.linear_model import Ridge
np.random.seed(42)
m = 20
X = 3*np.random.rand(m,1)
y = 0.5 * X +np.random.randn(m,1)/1.5 +1
X_new = np.linspace(0,3,100).reshape(100,1)

def plot_model(model_class, polynomial, alphas, **model_kwargs):
    for alpha, style in zip(alphas, ('b-', 'g--', 'r.')):
        model = model_class(alpha, **model_kwargs)
        if polynomial:
            model = Pipeline([('poly_features', PolynomialFeatures(degree=10, include_bias=False)),
                              ('StandardScaler', StandardScaler()),
                              ('lin_reg', model)])
        model.fit(X, y)
        y_new_regul = model.predict(X_new)
        lw = 2 if alpha > 0 else 1
        plt.plot(X_new, y_new_regul, style, linewidth=lw, label='alpha = {}'.format(alpha))
    plt.plot(X, y, 'b.', linewidth=3)
    plt.legend()

plt.figure(figsize=(14,6))
plt.subplot(121)
plot_model(Ridge,polynomial=False,alphas = (0,1,10))
plt.subplot(122)
plot_model(Ridge,polynomial=True,alphas = (0,10**-5,1))
plt.show()

This code defines a function plot_model used to plot the predictions of ridge regression models with different regularization parameters (alpha). The function accepts the following parameters:

model_class: a model class, such as Ridge or Lasso.
polynomial: a boolean indicating whether to expand the features polynomially.
alphas: a tuple of regularization parameters alpha, used to train the model at different regularization strengths.
**model_kwargs: other model parameters.

Inside the function, a model instance is created for each value of alpha and, if required, wrapped in a pipeline that performs polynomial expansion and standardization. Each model is fitted to the data and its predictions are plotted with matplotlib. The final output is a figure with two subplots, one for the raw feature and one for the polynomially expanded features, each containing three curves corresponding to the different alpha values.

Because our feature is one-dimensional, the plot on the right is the more interesting one: it compares different penalty strengths after the feature has been expanded to many dimensions. The larger alpha is, the stronger the penalty and the smoother and more stable the fitted curve. Here the fit with alpha = 1 is the most stable.
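One more note on the closed form mentioned above: the ridge solution can be written down directly, analogous to the normal equation from the first section (a sketch of my own; the top-left entry of the identity matrix is zeroed so the bias term is not penalized):

Xb_ridge = np.c_[np.ones((len(X), 1)), X]     # add the bias column to the ridge-section data
alpha = 1.0
A = np.identity(Xb_ridge.shape[1])
A[0, 0] = 0                                   # do not penalize the bias term
theta_ridge = np.linalg.inv(Xb_ridge.T.dot(Xb_ridge) + alpha * A).dot(Xb_ridge.T).dot(y)
print(theta_ridge)                            # should be close to Ridge(alpha=1.0).fit(X, y)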

2. Lasso

Lasso is a linear regression model based on L1 regularization. Its goal is to shrink the model coefficients as much as possible, driving the coefficients of useless features to exactly zero; in machine learning this is also known as feature selection. This is the key difference from Ridge, whose L2 penalty shrinks coefficients but does not zero them.

Like Ridge, Lasso limits model complexity by adding a penalty term to prevent overfitting. The strength of the penalty is controlled by the regularization parameter alpha: when alpha is small the penalty has little effect and the model is more likely to overfit; when alpha is large the penalty dominates and the model is more likely to underfit.

from sklearn.linear_model import Lasso

plt.figure(figsize=(14,6))
plt.subplot(121)
plot_model(Lasso,polynomial=False,alphas = (0,1,10))
plt.subplot(122)
plot_model(Lasso,polynomial=True,alphas = (0,0.1,1))
plt.show()

From the plot on the right we can see that a penalty of alpha = 1 is too strong and the model underfits; alpha = 0.1 gives the most stable fit.
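To make the feature-selection behaviour mentioned above visible (my own check, not part of the original post), we can fit Lasso on the degree-10 polynomial features and look at how many coefficients are driven exactly to zero:

lasso_check = Pipeline([
    ('poly_features', PolynomialFeatures(degree=10, include_bias=False)),
    ('std_scaler', StandardScaler()),
    ('lasso', Lasso(alpha=0.1)),
])
lasso_check.fit(X, y.ravel())
coefs = lasso_check.named_steps['lasso'].coef_
print(coefs)                                              # many of the high-degree terms should be exactly 0
print((coefs == 0).sum(), 'of', coefs.size, 'coefficients are zero')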

7. Summary

In machine learning, whether the model is linear or nonlinear, we need to choose an appropriate optimization method to obtain the optimal parameters. Gradient descent is one of the most common choices, and among its variants mini-batch gradient descent is an efficient one: each update uses only a small batch of data, which reduces the amount of computation and speeds up training.

When training on datasets with many features, we usually standardize the data, scaling each feature so that training converges to the optimal solution faster and fitting problems caused by features of very different magnitudes are avoided.

The number of samples has a great impact on the performance of the machine learning model. In theory, the larger the number of samples, the better the generalization ability of the model, which can effectively avoid the problem of overfitting, thereby improving the predictive ability of the model.

For overly complex models, we can use regularization to effectively reduce the complexity of the model, thereby avoiding overfitting. Regularization methods can limit the complexity of the model by penalizing the size of the model parameters, making it more robust and better able to adapt to unknown data. Common regularization methods include L1 regularization (Lasso) and L2 regularization (Ridge).

  • My knowledge is limited, so please point out any mistakes in the above.

Thanks for your support; more interesting content will be shared in the future.

Origin blog.csdn.net/qq_61260911/article/details/129988514