"Machine Learning in Practice: Based on Scikit-Learn, Keras and TensorFlow Version 2" - Study Notes (4): Training Model

Chapter 4 Training Model

Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition, by Aurélien Géron (O'Reilly). Copyright 2019 Aurélien Géron, 978-1-492-03264-9.
Environment: Anaconda (Python 3.8) + PyCharm
Learning time: 2022.04.16

So far, we have explored different machine learning models, but their training algorithms have largely remained a black box. Looking back at the examples in the previous chapters, you may be surprised by how much you achieved without knowing anything about the internals: you optimized a regression system, improved a digit image classifier, and even started building a spam classifier, all without knowing how they actually work. Indeed, in many cases you don't need to know the implementation details.

However, a good understanding of how things work is very helpful. It lets you quickly home in on the right model, the right training algorithm, and a good set of hyperparameters for your task. It also helps you debug issues and perform error analysis more efficiently. Finally, most of the topics explored in this chapter are essential for understanding, building, and training neural networks (Part II of the book).

In this chapter we'll start with one of the simplest models, the linear regression model, and introduce two very different approaches to training models:

  • Using a direct "closed-form" equation that computes the model parameters that best fit the training set (that is, the model parameters that minimize the cost function over the training set);
  • Using an iterative optimization approach called gradient descent (GD), which gradually tweaks the model parameters to minimize the cost function over the training set, eventually converging to the same parameters as the first method. We will also look at several variants of gradient descent: batch gradient descent, mini-batch gradient descent, and stochastic gradient descent. These variants will be used again and again when we study neural networks in Part II.

Then we will move into a discussion of polynomial regression, which is a more complex model that is more suitable for non-linear data sets. Since this model has more parameters than a linear model, it is more prone to overfitting the training data, and we will use learning curves to tell if this is happening. Then, several regularization techniques are introduced to reduce the risk of overfitting the training data.

Finally, we'll learn about two models that are often used for classification tasks: Logistic Regression and Softmax Regression.

This chapter contains quite a few mathematical equations that use basic notions of linear algebra and calculus. To understand them, you need to know what vectors and matrices are, how to transpose them, and what dot products, matrix inverses, and partial derivatives are. If you are unfamiliar with these concepts, start with the introductory linear algebra and calculus tutorials available as Jupyter notebooks in the online supplement. If you really dislike math, you should still go through this chapter; just skip the equations, and hopefully the text will be enough to help you understand most of the concepts.

4.1 Linear regression

A linear model is simply a weighted sum of the input features, plus a constant we call a bias term (also known as an intercept term) to make predictions.
Linear regression model prediction: $\hat{y} = θ_0 + θ_1 x_1 + θ_2 x_2 + \dots + θ_n x_n$

Linear regression model prediction (vectorized form): $\hat{y} = h_θ(x) = θ \cdot x$

Here $\hat{y}$ is the predicted value, n is the number of features, $x_i$ is the i-th feature value, and $θ_j$ is the j-th model parameter (with $θ_0$ being the bias term). In machine learning, vectors are often represented as column vectors, which are two-dimensional arrays with a single column. If $θ$ and $x$ are column vectors, then the prediction is $\hat{y} = θ^T x$, where $θ^T$ is the transpose of $θ$ (a row vector instead of a column vector) and $θ^T x$ is the matrix product of $θ^T$ and $x$.

That is the linear regression model. How do we train it? Recall that training a model means setting its parameters so that the model best fits the training set. To do this, we first need a measure of how well (or poorly) the model fits the training data. In Chapter 2, we saw that the most common performance metric for regression models is the root mean square error (RMSE). Therefore, to train a linear regression model, you need to find the value of θ that minimizes the RMSE. In practice, it is simpler to minimize the mean squared error (MSE) than the RMSE, and it leads to the same result (because the value that minimizes a non-negative function also minimizes its square root).

The MSE cost function of the linear regression model: $MSE(X, h_θ) = \frac{1}{m}\sum_{i=1}^{m}(θ^T x^{(i)} - y^{(i)})^2$
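To make the two formulas above concrete, here is a minimal NumPy sketch that computes the vectorized prediction and the MSE. The data and the parameter values are made up purely for illustration; they do not come from the book.

```python
import numpy as np

# Hypothetical toy data: 3 instances, 2 features (values chosen for illustration).
X = np.array([[1.0, 2.0],
              [2.0, 0.0],
              [3.0, 1.0]])
y = np.array([[8.0], [5.0], [10.0]])

# Prepend x0 = 1 to every instance so the bias term theta_0 is part of the dot product.
X_b = np.c_[np.ones((X.shape[0], 1)), X]

# Hypothetical parameter column vector: theta_0 (bias), theta_1, theta_2.
theta = np.array([[1.0], [2.0], [3.0]])

y_pred = X_b.dot(theta)            # theta^T x computed for all instances at once
mse = np.mean((y_pred - y) ** 2)   # (1/m) * sum of squared errors
rmse = np.sqrt(mse)

print(y_pred.ravel(), mse, rmse)
```

Stacking the instances as rows of `X_b` lets a single matrix product replace the loop over instances; the same trick is used by the training code later in these notes.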

4.1.1 Standard equation

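To find the value of θ that minimizes the cost function, there is a closed-form solution, that is, a mathematical equation that gives the result directly. This is the standard equation (usually called the Normal Equation in English):

$θ' = (X^T X)^{-1} X^T y$

Here θ' is the value of θ that minimizes the cost function, and y is the vector of target values $y^{(1)}$ to $y^{(m)}$. Let's generate some linear-looking data to test this formula: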

import numpy as np
import matplotlib.pyplot as plt

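# Generate random linear-looking data
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

X_b = np.c_[np.ones((100, 1)), X]  # add x0 = 1 to each instance
theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)  # dot() computes the matrix product

print(theta_best)
# Output: we hoped for θ0=4 and θ1=3 and got θ0=3.6 and θ1=3.2 (the exact values vary from run to run). Very close, but the noise makes it impossible to recover the exact parameters of the original function

# Make predictions using the fitted parameters
X_new = np.array([[0], [2]])
X_new_b = np.c_[np.ones((2, 1)), X_new]  # add x0 = 1 to each instance
y_predict = X_new_b.dot(theta_best)
print(y_predict)

# Plot the model's predictions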
plt.plot(X_new, y_predict, "r-")
plt.plot(X, y, "b.")
plt.axis([0, 2, 0, 15])
plt.show()

Performing linear regression with Scikit-Learn is simple:

from sklearn.linear_model import LinearRegression

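lin_reg = LinearRegression()  # instantiate the linear model
lin_reg.fit(X, y)  # train the model

print(lin_reg.intercept_, lin_reg.coef_)  # print the trained parameters
print(lin_reg.predict(X_new))  # make predictions

# The LinearRegression class is based on the scipy.linalg.lstsq() function (the name stands for "least squares").
# NumPy offers an equivalent, np.linalg.lstsq(), which you can call directly:
theta_best_svd, residuals, rank, s = np.linalg.lstsq(X_b, y, rcond=1e-6)
print(theta_best_svd)  # print the optimal parameters

# The parameters found by LinearRegression and by the lstsq() function are the same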

The scipy.linalg.lstsq() function computes $θ' = X^+ y$, where $X^+$ is the pseudoinverse of $X$ (specifically, the Moore-Penrose inverse). The pseudoinverse itself is computed using a standard matrix factorization technique called **Singular Value Decomposition (SVD)**, which decomposes the training set matrix $X$ into the product of three matrices, $UΣV^T$. The pseudoinverse is then $X^+ = VΣ^+U^T$. To compute the matrix $Σ^+$, the algorithm takes $Σ$, sets to zero all values smaller than a tiny threshold, replaces all the nonzero values with their inverses, and finally transposes the resulting matrix. This approach is more efficient than computing the standard equation, and it also handles edge cases well: the standard equation fails when $X^T X$ is not invertible (for example, when some features are redundant), whereas the pseudoinverse is always defined.
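The steps described above can be sketched by hand with NumPy. This is only an illustration of the idea, not how SciPy implements it internally; the threshold value and the synthetic data are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X + rng.standard_normal((100, 1))
X_b = np.c_[np.ones((100, 1)), X]  # add x0 = 1 to each instance

# Thin SVD: X_b = U @ diag(s) @ Vt, where s holds the singular values.
U, s, Vt = np.linalg.svd(X_b, full_matrices=False)

# Build Sigma^+: zero out tiny singular values, invert the others.
threshold = 1e-10 * s.max()  # assumed cutoff, chosen for illustration
s_plus = np.zeros_like(s)
keep = s > threshold
s_plus[keep] = 1.0 / s[keep]

# Pseudoinverse X^+ = V @ Sigma^+ @ U^T, then theta = X^+ @ y.
X_pinv = Vt.T @ np.diag(s_plus) @ U.T
theta_svd = X_pinv @ y

print(theta_svd.ravel())
print(np.allclose(theta_svd, np.linalg.pinv(X_b) @ y))  # compare with NumPy's pinv
```

Because the thin SVD is used here, Σ is a square diagonal matrix, so the final "transpose" step leaves it unchanged; it only matters when the full rectangular Σ is kept.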

4.1.2 Computational Complexity

The standard equation computes the inverse of $X^T X$, which is an (n+1) × (n+1) matrix (where n is the number of features). The computational complexity of inverting such a matrix is typically between $O(n^{2.4})$ and $O(n^3)$, depending on the implementation. In other words, if you double the number of features, the computation time is multiplied by roughly $2^{2.4} ≈ 5.3$ to $2^3 = 8$.

The SVD approach used by Scikit-Learn's LinearRegression class is about $O(n^2)$. If you double the number of features, the computation time is multiplied by roughly 4.

Both the standard equation and the SVD approach get very slow when the number of features grows large (for example, 100,000). On the bright side, both are linear in the number of instances in the training set (they are O(m)), so they handle large training sets efficiently, provided the data fits in memory.

Also, once a linear regression model is trained (with the standard equation or any other algorithm), predictions are very fast: the computational complexity is linear in both the number of instances you want to predict on and the number of features. In other words, making predictions on twice as many instances (or twice as many features) takes roughly twice as long.

Now let's look at several different ways to train a linear regression model, which are better suited to cases where there are a large number of features or too many training instances to fit in memory.

4.2 Gradient descent and its algorithm

Gradient descent is a very general optimization algorithm capable of finding optimal solutions to a wide range of problems. The central idea of gradient descent is to adjust parameters iteratively in order to minimize a cost function.

**Suppose you are lost in the mountains in a dense fog, and all you can feel is the slope of the ground beneath your feet. A good strategy for reaching the bottom of the valley quickly is to go downhill in the direction of the steepest slope.** This is exactly what gradient descent does: it measures the local gradient of the error function with respect to the parameter vector θ and moves in the direction of descending gradient. Once the gradient is zero, you have reached a minimum!

Concretely, you start by filling θ with random values (this is called random initialization). Then you improve it gradually, taking one small step at a time, each step attempting to decrease the cost function (e.g., the MSE), until the algorithm converges to a minimum (see Figure 4-3).

insert image description here

An important parameter in gradient descent is the size of the steps, determined by the learning rate hyperparameter. If the learning rate is too small, the algorithm has to go through many iterations to converge, which takes a long time. Conversely, if the learning rate is too high, you might jump across the valley and end up on the other side, possibly even higher up than you were before. This can make the algorithm diverge, with larger and larger values, so that it fails to find a good solution.
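A tiny one-dimensional experiment illustrates both failure modes. The sketch below minimizes the made-up cost function f(θ) = θ², whose gradient is 2θ; the function, starting point, and learning rates are assumptions chosen only to show the effect.

```python
def gradient_descent_1d(eta, n_steps=20, theta0=5.0):
    """Minimize f(theta) = theta**2 (gradient: 2 * theta) with a fixed learning rate."""
    theta = theta0
    for _ in range(n_steps):
        theta = theta - eta * 2 * theta
    return theta


for eta in (0.01, 0.4, 1.1):
    final_theta = gradient_descent_1d(eta)
    print(f"eta={eta}: theta after 20 steps = {final_theta:.4g}")
```

With eta=0.01 the parameter has barely moved toward the minimum at 0 after 20 steps, with eta=0.4 it is essentially there, and with eta=1.1 every step overshoots further, so the value grows instead of shrinking.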

Finally, not all cost functions look like a pretty bowl. Some may look like caves, mountains, plateaus or various irregular terrains, making it difficult to converge to the minimum. The figure below shows the two main challenges of gradient descent: If the initialization is random and the algorithm starts from the left, it will converge to a local minimum instead of a global minimum. If the algorithm starts on the right side, it will take a long time to cross the entire plateau, and if you stop too early, it will never reach the global minimum.

insert image description here

Fortunately, the MSE cost function of a linear regression model happens to be a convex function, which means that if you pick any two points on the curve, the line segment joining them never lies below the curve. This implies that there are no local minima, only one global minimum. It is also a continuous function with a slope that never changes abruptly. These two facts have an important consequence: gradient descent is guaranteed to approach the global minimum arbitrarily closely (provided you wait long enough and the learning rate is not too high).
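One way to check the convexity claim numerically: the matrix of second derivatives (the Hessian) of the MSE with respect to θ is (2/m) XᵀX, and a function whose Hessian has no negative eigenvalues is convex. The sketch below is an illustration under that standard fact, using synthetic data generated the same way as earlier in these notes.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100
X = 2 * rng.random((m, 1))
X_b = np.c_[np.ones((m, 1)), X]  # add x0 = 1 to each instance

# Hessian of the MSE cost function; it does not depend on theta or on y.
hessian = 2 / m * X_b.T @ X_b

# eigvalsh is meant for symmetric matrices and returns real eigenvalues.
eigenvalues = np.linalg.eigvalsh(hessian)
print(eigenvalues)
```

All eigenvalues come out positive here, which matches the bowl shape described above: the cost curves upward in every direction of parameter space.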

In fact, the cost function has the shape of a bowl, but it can be an elongated bowl if the features have very different scales. Figure 4-7 shows gradient descent on a training set where features 1 and 2 have the same scale (on the left), and on a training set where feature 1 has much smaller values than feature 2 (on the right). (Note: since feature 1 is smaller, it takes a larger change in θ1 to affect the cost function, which is why the bowl is elongated along the θ1 axis.)

insert image description here

As you can see, the gradient descent algorithm on the left goes straight to the minimum, which can be reached quickly. In the picture on the right, it first advances in a direction nearly perpendicular to the direction of the global minimum, followed by a long, almost flat valley. The minimum will still be reached eventually, but it will take a lot of time.

When using gradient descent, you should ensure that all features have a similar scale (e.g., using Scikit-Learn's StandardScaler class), or else it will take much longer to converge.

The figure above also shows that training the model is about finding the parameter combination that minimizes the cost function (on the training set). This is a search at the level of the model parameter space: the more parameters the model has, the more dimensions the space has, and the harder it is to search. Similarly, finding a needle in a haystack is much trickier in a three-hundred-dimensional space than in a three-dimensional space. Fortunately, the cost function of a linear regression model is convex, and the needle lies at the bottom of the bowl.
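As a quick illustration of the scaling advice above, here is a minimal sketch that standardizes two features living on very different scales. The feature values are made up for the example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical features: the first ranges over [0, 1), the second over [0, 1000).
X = np.c_[rng.random(100), 1000 * rng.random(100)]

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # learn mean/std on the training set, then scale it

print(X.std(axis=0))                                  # very different spreads
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))    # roughly 0 and 1 per feature
```

In practice the scaler should be fitted on the training set only, and the same fitted scaler reused (via `transform`) on validation and test data.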

4.2.1 Batch Gradient Descent (BGD)

To implement gradient descent, you need to compute the gradient of the cost function with respect to each model parameter $θ_j$. In other words, you need to calculate how much the cost function will change if you change $θ_j$ just a little bit. This is called a partial derivative.

Partial derivative of the cost function: $\frac{∂}{∂θ_j} MSE(θ) = \frac{2}{m}\sum_{i=1}^{m}(θ^T x^{(i)} - y^{(i)})\, x_j^{(i)}$

Instead of computing these partial derivatives individually, you can use a formula to compute them all in one go. The gradient vector, denoted $∇_θ MSE(θ)$, contains all the partial derivatives of the cost function (one for each model parameter).

Gradient vector of the cost function: $∇_θ MSE(θ) = \frac{2}{m} X^T (Xθ - y)$

Note that each step of gradient descent is calculated on the full training set X. This is why the algorithm is called batch gradient descent: the entire batch of training data is used at each step (actually, full gradient descent might be a better name). As a result, the algorithm becomes extremely slow when faced with very large training sets (although we are about to see much faster gradient descent algorithms). However, the gradient descent algorithm scales better with the number of features. If the linear model to be trained has hundreds of thousands of features, using gradient descent is much faster than standard equations or SVD.

Once you have the gradient vector, which points uphill, just go in the opposite direction to go downhill. This means subtracting $∇_θ MSE(θ)$ from $θ$. This is where the learning rate $η$ comes into play: multiply the gradient vector by $η$ to determine the size of the downhill step.

Gradient descent step: $θ^{(\text{next step})} = θ - η∇_θ MSE(θ)$

Let's look at a quick implementation of the algorithm:

eta = 0.1  # learning rate
n_iterations = 1000
m = 100
theta = np.random.randn(2, 1)  # random initialization
for iteration in range(n_iterations):
    gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y)
    theta = theta - eta * gradients
print(theta)  # print the final parameters once, after the loop

The figure below shows the first ten steps of gradient descent using three different learning rates (the dotted line indicates the starting point).

insert image description here

To find a suitable learning rate, you can use ⭐Grid Search⭐ (see Chapter 2). But you may want to limit the number of iterations so that the grid search can weed out models that take too long to converge.

You may wonder how to set the number of iterations. If it is too low, the algorithm will still be far from the optimal solution when it stops. If it is too high, you will waste time while the model parameters no longer change. A simple solution is to set a very large number of iterations but to interrupt the algorithm when the gradient vector becomes tiny, that is, when its norm becomes smaller than a tiny number ϵ (called the tolerance), because this happens when gradient descent has (almost) reached the minimum.
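The stopping rule described above can be sketched as follows. This is a minimal variant of the batch gradient descent loop; the tolerance value, the iteration cap, and the synthetic data are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(42)
m = 100
X = 2 * rng.random((m, 1))
y = 4 + 3 * X + rng.standard_normal((m, 1))
X_b = np.c_[np.ones((m, 1)), X]  # add x0 = 1 to each instance

eta = 0.1                  # learning rate
max_iterations = 100_000   # deliberately large upper bound
tolerance = 1e-6           # assumed value for epsilon

theta = rng.standard_normal((2, 1))  # random initialization
for iteration in range(max_iterations):
    gradients = 2 / m * X_b.T.dot(X_b.dot(theta) - y)
    if np.linalg.norm(gradients) < tolerance:
        break  # gradient is tiny: we are (almost) at the minimum
    theta = theta - eta * gradients

print(iteration, theta.ravel())
```

The loop stops long before the iteration cap is reached, and the resulting parameters agree with the least-squares solution to within the chosen tolerance.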

Convergence rate:

When the cost function is convex and its slope does not change abruptly (as is the case for the MSE cost function), batch gradient descent with a fixed learning rate will eventually converge to the optimal solution, but you may have to wait a while: it can take O(1/ϵ) iterations to reach the optimum within a range of ϵ, depending on the shape of the cost function. In other words, if you divide the tolerance by 10 to get a more precise solution, the algorithm may have to run about 10 times longer.

4.2.2 Stochastic Gradient Descent (SGD)

The main problem with batch gradient descent is that it uses the whole training set to compute the gradients at every step, which makes it very slow when the training set is large. At the opposite extreme, stochastic gradient descent picks a random instance in the training set at every step and computes the gradients based only on that single instance. Obviously, working on a single instance at a time makes the algorithm much faster, because it has very little data to manipulate at every iteration. It also makes it possible to train on huge training sets, since only one instance needs to be in memory at each iteration (SGD can be implemented as an out-of-core algorithm; see Chapter 1).

On the other hand, due to its stochastic (i.e., random) nature, this algorithm is much less regular than batch gradient descent: instead of gently decreasing until it reaches the minimum, the cost function bounces up and down, decreasing only on average. Over time it ends up very close to the minimum, but once it gets there it continues to bounce around, never settling down (see Figure 4-9). So once the algorithm stops, the final parameter values are good, but not optimal.

When the cost function is very irregular (see Figure 4-6), this can actually help the algorithm jump out of local minima, so stochastic gradient descent has a better chance of finding the global minimum than batch gradient descent does.

Therefore, randomness is good for escaping local optima, but bad because it means that the algorithm can never settle at the minimum. One solution to this dilemma is to gradually reduce the learning rate. The steps start out large (which helps make quick progress and escape local minima), then get smaller and smaller, allowing the algorithm to settle at the global minimum. This process is akin to simulated annealing, an algorithm inspired by the annealing process in metallurgy, in which molten metal is slowly cooled down. The function that determines the learning rate at each iteration is called the learning schedule. If the learning rate is reduced too quickly, you may get stuck in a local minimum, or even end up frozen halfway to the minimum. If the learning rate is reduced too slowly, you may jump around the minimum for a long time and end up with a suboptimal solution if you halt training too early.

The following code implements stochastic gradient descent using a simple learning rate schedule:

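n_epochs = 50  # number of passes over the training set
t0, t1 = 5, 50  # learning schedule hyperparameters (assumed values, not given in these notes)


def learning_schedule(t):
    return t0 / (t + t1)


theta = np.random.randn(2, 1)  # random initialization
for epoch in range(n_epochs):
    for i in range(m):
        random_index = np.random.randint(m)  # pick one training instance at random
        xi = X_b[random_index:random_index+1]
        yi = y[random_index:random_index+1]
        gradients = 2 * xi.T.dot(xi.dot(theta) - yi)  # gradient on that single instance
        eta = learning_schedule(epoch * m + i)  # learning rate shrinks over time
        theta = theta - eta * gradients

print(theta)  # final parameters after the last epoch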

By convention, we iterate in rounds of m iterations; each round is called an epoch. While the batch gradient descent code iterated 1,000 times through the whole training set, this code goes through the training set only 50 times and reaches a pretty good solution.

When using stochastic gradient descent, the training instances must be independent and identically distributed (IID) to ensure that the parameters get pulled toward the global optimum, on average. A simple way to ensure this is to shuffle the instances during training (e.g., pick each instance randomly, or shuffle the training set at the beginning of each epoch). If you do not shuffle the instances (for example, if they are sorted by label), then SGD will start by optimizing for one label, then the next, and so on, and it will not settle close to the global minimum.
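The second shuffling option mentioned above (reshuffle the training set at the start of each epoch, then visit every instance exactly once) can be sketched like this. The hyperparameter values and the synthetic data are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(42)
m = 100
X = 2 * rng.random((m, 1))
y = 4 + 3 * X + rng.standard_normal((m, 1))
X_b = np.c_[np.ones((m, 1)), X]  # add x0 = 1 to each instance

n_epochs = 50
t0, t1 = 5, 50  # assumed learning schedule hyperparameters


def learning_schedule(t):
    return t0 / (t + t1)


theta = rng.standard_normal((2, 1))  # random initialization
for epoch in range(n_epochs):
    shuffled_indices = rng.permutation(m)  # new random order for this epoch
    for i, idx in enumerate(shuffled_indices):
        xi = X_b[idx:idx + 1]
        yi = y[idx:idx + 1]
        gradients = 2 * xi.T.dot(xi.dot(theta) - yi)
        eta = learning_schedule(epoch * m + i)
        theta = theta - eta * gradients

print(theta.ravel())
```

Unlike drawing a random index at every step, this variant guarantees that each instance is used exactly once per epoch, while still presenting them in a different order each time.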

To perform linear regression using stochastic gradient descent with Scikit-Learn, you can use the SGDRegressor class, which defaults to optimizing the squared error cost function. The following code runs for at most 100,000 epochs, or until the loss drops by less than 0.001 during one epoch (max_iter=100000, tol=1e-3). It starts with a learning rate of 0.1 (eta0=0.1), using the default learning schedule (different from the one used above). Lastly, it does not use any regularization (penalty=None; more on this later):

from sklearn.linear_model import SGDRegressor

sgd_reg = SGDRegressor(max_iter=100000, tol=1e-3, penalty=None, eta0=0.1)  # instantiate the SGDRegressor class
sgd_reg.fit(X, y.ravel())  # train the model (y must be a 1D array)
print(sgd_reg.intercept_, sgd_reg.coef_)  # print the model parameters

4.2.3 Mini-Batch Gradient Descent

The last gradient descent algorithm we will look at is called mini-batch gradient descent. It is easy to understand once you know batch and stochastic gradient descent: at each step, instead of computing the gradients based on the full training set (as in batch gradient descent) or based on just one instance (as in stochastic gradient descent), mini-batch gradient descent computes the gradients on small random sets of instances called mini-batches. The main advantage of mini-batch gradient descent over stochastic gradient descent is that you can get a performance boost from hardware optimization of matrix operations, especially when using GPUs.

The algorithm's progress in parameter space is less erratic than with stochastic gradient descent, especially with fairly large mini-batches. As a result, mini-batch gradient descent will end up walking around a bit closer to the minimum than stochastic gradient descent, but it may be harder for it to escape from local minima (in the case of problems that suffer from local minima, unlike linear regression). The figure below shows the paths taken by the three gradient descent algorithms in parameter space during training. They all end up near the minimum, but batch gradient descent's path actually stops at the minimum, while both stochastic gradient descent and mini-batch gradient descent continue to walk around. However, don't forget that batch gradient descent takes a lot of time to take each step, and stochastic gradient descent and mini-batch gradient descent would also reach the minimum if you used a good learning schedule.

insert image description here
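These notes do not include code for mini-batch gradient descent, so here is a minimal sketch of how it could look for the same linear regression problem. The batch size, learning rate, number of epochs, and synthetic data are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(42)
m = 100
X = 2 * rng.random((m, 1))
y = 4 + 3 * X + rng.standard_normal((m, 1))
X_b = np.c_[np.ones((m, 1)), X]  # add x0 = 1 to each instance

n_epochs = 50
batch_size = 20  # assumed mini-batch size
eta = 0.1        # assumed fixed learning rate

theta = rng.standard_normal((2, 1))  # random initialization
for epoch in range(n_epochs):
    shuffled_indices = rng.permutation(m)  # reshuffle at the start of each epoch
    X_shuffled, y_shuffled = X_b[shuffled_indices], y[shuffled_indices]
    for start in range(0, m, batch_size):
        xi = X_shuffled[start:start + batch_size]
        yi = y_shuffled[start:start + batch_size]
        # Same gradient formula as batch GD, averaged over the mini-batch only.
        gradients = 2 / len(xi) * xi.T.dot(xi.dot(theta) - yi)
        theta = theta - eta * gradients

print(theta.ravel())
```

Setting `batch_size = m` turns this loop into batch gradient descent, and `batch_size = 1` into stochastic gradient descent, which is a handy way to see the three algorithms as points on one spectrum.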

Let's compare the linear regression algorithms discussed so far (recall that m is the number of training instances and n is the number of features).

insert image description here

After training there is little difference: all these algorithms end up with very similar models and predict in exactly the same way.

4.3 Polynomial regression

What if your data is more complex than a straight line? Surprisingly, you can use a linear model to fit nonlinear data. A simple way to do this is to add powers of each feature as new features, then train a linear model on this extended set of features. This technique is called polynomial regression.

Let's look at an example.

First, let's generate some nonlinear data based on a simple quadratic equation (note: a quadratic equation has the form y = ax² + bx + c), plus some noise:

m = 100
X = 6 * np.random.rand(m, 1) - 3
y = 0.5 * X**2 + X + 2 + np.random.randn(m, 1)

insert image description here

Obviously, a straight line will never fit this data correctly. So let's use Scikit-Learn's PolynomialFeatures class to transform the training data, adding the square (quadratic polynomial) of each feature in the training set as a new feature (in this case, only one feature):

# Add the square of each feature as a new feature
from sklearn.preprocessing import PolynomialFeatures

poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)
print(X[0])
print(X_poly[0])

X_poly now contains the original feature of X as well as the square of that feature. Now you can fit a LinearRegression model to this extended training data:

# Fit a linear model to the extended training data
lin_reg = LinearRegression()
lin_reg.fit(X_poly, y)
print(lin_reg.intercept_, lin_reg.coef_)  # 输出参数

insert image description here

Note that when there are multiple features, polynomial regression is capable of finding relationships between features (something a plain linear regression model cannot do). This is made possible by the fact that PolynomialFeatures also adds all combinations of features up to the given degree. For example, if there are two features a and b, PolynomialFeatures with degree=3 adds not only the features a², a³, b², and b³, but also the combinations ab, a²b, and ab².

PolynomialFeatures(degree=d) transforms an array containing n features into an array containing $\frac{(n+d)!}{d!\,n!}$ features, where n! is the factorial of n, equal to 1 × 2 × 3 × … × n. Beware of the combinatorial explosion of the number of features!
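The feature-count formula can be checked on the two-feature example above. The sketch below assumes the default `include_bias=True`, so the constant column is counted as one of the output features; the input values are made up.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n, d = 2, 3
X = np.array([[2.0, 3.0]])  # one hypothetical instance with a = 2, b = 3

poly = PolynomialFeatures(degree=d)  # include_bias=True by default
X_poly = poly.fit_transform(X)

# (n + d)! / (d! n!) is the binomial coefficient C(n + d, d).
print(X_poly.shape[1], comb(n + d, d))

# Each row of powers_ gives the exponents of (a, b) for one output feature.
print(poly.powers_)
```

For n=2 and d=3 this gives 10 output features: the constant term plus a, b, a², ab, b², a³, a²b, ab², and b³. With `include_bias=False`, as used earlier in these notes, the constant column is dropped and the count is one less.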

4.4 Learning Curve

If you perform a higher order polynomial regression, it will probably fit the data better than a normal linear regression.

This higher-order polynomial regression model severely overfits the training data, while the linear model underfits. In this case, the model that generalizes best is a quadratic model because the data were generated using a quadratic model. But in general, you don't know what function the data was generated by, so how do you determine the complexity of the model? How do you tell if a model is overfitting the data or underfitting the data?

In Chapter 2, you used cross-validation to estimate a model's generalization performance. ⭐**If a model performs well on the training data but generalizes poorly according to the cross-validation metrics, then it is overfitting. If it performs poorly on both, then it is underfitting.**⭐ This is one way to tell whether a model is too simple or too complex.
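This diagnostic can be sketched by comparing the training error with the cross-validated error for models of different complexity. The polynomial degrees, the 5-fold setting, the added StandardScaler step, and the synthetic quadratic data are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(42)
m = 100
X = 6 * rng.random((m, 1)) - 3
y = (0.5 * X**2 + X + 2 + rng.standard_normal((m, 1))).ravel()

results = {}
for degree in (1, 2, 10):
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        StandardScaler(),
        LinearRegression(),
    )
    model.fit(X, y)
    train_rmse = np.sqrt(np.mean((model.predict(X) - y) ** 2))
    # cross_val_score returns negated MSE values for this scoring string.
    cv_mse = -cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=5)
    cv_rmse = np.sqrt(cv_mse.mean())
    results[degree] = (train_rmse, cv_rmse)
    print(f"degree={degree}: train RMSE={train_rmse:.3f}, CV RMSE={cv_rmse:.3f}")
```

A model whose training and cross-validation errors are both high (degree 1 here) is underfitting; a model whose training error is clearly lower than its cross-validation error is showing signs of overfitting.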

Another approach is to look at the learning curve: this curve plots the performance of the model on the training and validation sets as a function of the training set size (or training iterations). To generate this curve, it is only necessary to train the model multiple times on training subsets of different sizes. The following code defines a function that plots the learning curve of the model given the training set:

from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def plot_learning_curves(model, X, y):
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
    train_errors, val_errors = [], []
    for m in range(1, len(X_train)):
        model.fit(X_train[:m], y_train[:m])
        y_train_predict = model.predict(X_train[:m])
        y_val_predict = model.predict(X_val)
        train_errors.append(mean_squared_error(y_train[:m], y_train_predict))
        val_errors.append(mean_squared_error(y_val, y_val_predict))
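    plt.plot(np.sqrt(train_errors), "r-+", linewidth=2, label="train")
    plt.plot(np.sqrt(val_errors), "b-", linewidth=3, label="val")
    plt.xlabel("Training set size")
    plt.ylabel("RMSE")
    plt.legend()  # show the labels defined above
    plt.show()


# Learning curves of a plain linear regression model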
lin_reg = LinearRegression()
plot_learning_curves(lin_reg, X, y)

insert image description here

This underfitting model deserves some explanation. First, let's look at the performance on the training data: when there are only one or two instances in the training set, the model fits them well, which is why the curve starts from zero. However, as new instances are added to the training set, it is impossible for the model to fit the training data perfectly, both because the data is noisy and because it is not linear at all. Therefore, the error on the training data keeps going up until it reaches a plateau, at which point adding new instances to the training set does not make the average error better or worse. Now let's look at the performance of the model on the validation data. When the model is trained on very few training examples, it fails to generalize correctly, which is why the validation error is initially large. Then, as the model goes through more training examples, it starts to learn, so the validation error gradually decreases. However, straight lines don't model the data well, so the error eventually plateaus very close to the other curve.

These learning curves are typical of an underfitting model. Both curves have reached a plateau; they are close to each other and fairly high.

If your model is underfitting the training data, adding more training examples won't help. You need to use more complex models or provide better features.

Now let's look at the learning curve of a polynomial model of order 10 on the same data:

# Learning curves of a 10th-degree polynomial model on the same data
from sklearn.pipeline import Pipeline

polynomial_regression = Pipeline([
    ("poly_features", PolynomialFeatures(degree=10, include_bias=False)),
    ("lin_reg", LinearRegression()),
])
plot_learning_curves(polynomial_regression, X, y)

insert image description here

These learning curves look a bit like the previous ones, but there are two very important differences:

  • The error on the training data is much lower compared to the linear regression model.

  • There is a gap between the curves. This means that the model performs significantly better on the training data than on the validation data, which is the hallmark of an overfitting model. However, if you used a much larger training set, the two curves would continue to get closer.

One way to improve an overfit model is to feed it more training data until the validation error reaches the training error.

Bias/variance trade-off

An important theoretical result in statistics and machine learning is the fact that the generalization error of a model can be expressed as the sum of three very different errors:

  • Bias: This part of the generalization error is due to wrong assumptions, such as assuming that the data is linear when it is actually quadratic. A high-bias model is most likely to underfit the training data.
  • Variance: This part is due to the model's excessive sensitivity to small variations in the training data. A model with many degrees of freedom (such as a high-degree polynomial model) is likely to have high variance and thus overfit the training data.
  • Irreducible error: This part is due to the noisiness of the data itself. The only way to reduce this part of the error is to clean up the data (e.g., fix the data sources, such as broken sensors, or detect and remove outliers).

Increasing the complexity of the model usually significantly increases the variance of the model and reduces bias. Conversely, reducing model complexity increases model bias and reduces variance. That's why it's called a tradeoff.

Do not confuse the concept of bias here with the concept of a bias term in linear models.

4.5 Regularization of Linear Models

As we saw in Chapters 1 and 2, a good way to reduce overfitting is to regularize the model (i.e., to constrain it): the fewer degrees of freedom it has, the harder it will be for it to overfit the data. A simple way to regularize a polynomial model is to reduce the degree of the polynomial.

For linear models, regularization is usually achieved by constraining the weights of the model. Now, we look at Ridge Regression, Lasso Regression, and Elastic Net, which implement three methods of constraining weights.

4.5.1 Ridge regression

Ridge regression (also known as Tikhonov regularization) is a regularized version of linear regression: a regularization term equal to $\frac{α}{2}\sum_{i=1}^{n} θ_i^2$ is added to the cost function. This forces the learning algorithm to not only fit the data but also keep the model weights as small as possible. Note: the regularization term should only be added to the cost function during training. Once the model is trained, you evaluate its performance using the unregularized performance measure.

It is very common that the cost function used during training differs from the performance metric used for testing. Besides regularization, another reason why they may differ is that a good training cost function should have an optimization-friendly derivative, while the performance metric used for testing should be as close as possible to the final goal. For example, it is common to train a classifier using a cost function such as log loss (discussed later), but evaluate it using precision/recall.

The hyperparameter α controls how much you want to regularize the model. If α = 0, ridge regression is just linear regression. If α is very large, all the weights end up very close to zero and the result is a flat line going through the mean of the data. The ridge regression cost function and its closed-form solution are:

$$J(θ) = \mathrm{MSE}(θ) + α\frac{1}{2}\sum_{i=1}^{n}θ_i^2$$

$$θ' = (X^TX + αA)^{-1}X^Ty$$

The sum starts at i = 1, so the bias term $θ_0$ is not regularized. In the closed-form solution, A is the (n + 1) × (n + 1) identity matrix, except with a 0 in the top-left cell, which corresponds to the bias term.
⭐ It is important to scale the data (e.g. using StandardScaler) before performing ridge regression, as it is sensitive to the scaling of the input features. This is required for most regularized models.
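The two limiting cases of α described above can be checked numerically. This is a minimal NumPy sketch of the closed-form solution $θ' = (X^TX + αA)^{-1}X^Ty$ on hypothetical linear data. The helper name ridge_closed_form and the data are assumptions made for illustration, and A is taken to be the identity matrix with a 0 in the top-left cell so that the bias term is not regularized.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100
x = rng.uniform(0, 3, size=(m, 1))                  # hypothetical single feature
y = 1 + 0.5 * x[:, 0] + rng.normal(0, 0.5, size=m)  # hypothetical noisy linear target

X_b = np.c_[np.ones((m, 1)), x]  # add x0 = 1 to each instance (bias term)

def ridge_closed_form(X_b, y, alpha):
    # A is the identity matrix with a 0 in the top-left cell (bias not regularized)
    A = np.eye(X_b.shape[1])
    A[0, 0] = 0.0
    # Solve (X^T X + alpha A) theta = X^T y instead of inverting explicitly
    return np.linalg.solve(X_b.T @ X_b + alpha * A, X_b.T @ y)

theta_ols = np.linalg.lstsq(X_b, y, rcond=None)[0]   # plain linear regression
theta_0 = ridge_closed_form(X_b, y, alpha=0.0)       # alpha = 0
theta_big = ridge_closed_form(X_b, y, alpha=1e8)     # very large alpha

print("least squares:", theta_ols)
print("alpha = 0    :", theta_0)
print("alpha = 1e8  :", theta_big, " mean of y:", y.mean())
```

With α = 0 the result matches ordinary least squares, and with a very large α the slope shrinks toward zero while the bias term approaches the mean of y, giving the flat line through the mean of the data.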

Here's how to perform ridge regression with Scikit-Learn using a closed-form solution (a variant of Equation 4-9 that uses a matrix factorization technique by André-Louis Cholesky):

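# Ridge regression
from sklearn.linear_model import Ridge, SGDRegressor

# Closed-form solution via the Ridge class (X and y are the training data defined earlier)
ridge_reg = Ridge(alpha=1, solver="cholesky")
ridge_reg.fit(X, y)
print(ridge_reg.predict([[1.5]]))

# Alternatively, use stochastic gradient descent:
sgd_reg = SGDRegressor(penalty="l2")  # penalty sets the type of regularization term
# "l2" adds a regularization term to the cost function equal to half the square
# of the l2 norm of the weight vector, which is simply ridge regression.
sgd_reg.fit(X, y.ravel())  # SGDRegressor expects a 1D target array, hence ravel()
print(sgd_reg.predict([[1.5]]))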
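To illustrate why scaling matters for ridge regression, here is a small sketch that trains the same model on a feature expressed in two different units. The data, the factor of 1000, the value alpha=100, and the helper fit_predict are all assumptions made for this illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 3, size=(50, 1))                        # hypothetical feature
y_demo = 1 + 2 * X_demo[:, 0] + rng.normal(0, 0.3, size=50)     # hypothetical target
X_new_demo = np.array([[0.0], [3.0]])
scale = 1000.0  # the same feature expressed in a different unit

def fit_predict(model, factor):
    # Train on the rescaled feature and predict for the rescaled new instances
    model.fit(X_demo * factor, y_demo)
    return model.predict(X_new_demo * factor)

plain_a = fit_predict(Ridge(alpha=100), 1.0)
plain_b = fit_predict(Ridge(alpha=100), scale)
scaled_a = fit_predict(make_pipeline(StandardScaler(), Ridge(alpha=100)), 1.0)
scaled_b = fit_predict(make_pipeline(StandardScaler(), Ridge(alpha=100)), scale)

print("Ridge alone:          ", plain_a, plain_b)
print("StandardScaler + Ridge:", scaled_a, scaled_b)
```

Without scaling, merely changing the unit of the feature changes the predictions, because the same α penalizes the rescaled weight very differently. With StandardScaler in front, both versions give the same predictions.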

4.5.2 Lasso regression

Another regularized version of linear regression is Least Absolute Shrinkage and Selection Operator Regression (Lasso regression for short). Like ridge regression, it adds a regularization term to the cost function, but it uses the $\ell_1$ norm of the weight vector instead of half the square of the $\ell_2$ norm:

$$J(θ) = \mathrm{MSE}(θ) + α\sum_{i=1}^{n}|θ_i|$$


An important characteristic of Lasso regression is that it tends to eliminate the weights of the least important features (i.e., set them to zero). In other words, Lasso regression automatically performs feature selection and outputs a sparse model (i.e., one with only a few nonzero feature weights).

To keep gradient descent from bouncing around the optimum at the end of training when using Lasso, you need to gradually reduce the learning rate during training (it will still bounce around the optimum, but the steps will get smaller and smaller, so it will converge).

Here's a small Scikit-Learn example using the Lasso class:

# Lasso regression
from sklearn.linear_model import Lasso, SGDRegressor

# Use the Lasso class directly
lasso_reg = Lasso(alpha=0.1)
lasso_reg.fit(X, y)
print(lasso_reg.predict([[1.5]]))

# Alternatively, use SGDRegressor(penalty="l1")
sgd_lasso_reg = SGDRegressor(penalty="l1")
sgd_lasso_reg.fit(X, y.ravel())  # SGDRegressor expects a 1D target array
print(sgd_lasso_reg.predict([[1.5]]))
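The sparsity property of Lasso described above can be seen on a toy problem. This sketch uses hypothetical data with ten features of which only the first two influence the target (the data and the alpha values are assumptions for illustration), and compares the weights learned by Lasso and Ridge.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(7)
X_demo = rng.normal(size=(200, 10))  # 10 hypothetical features
# Only features 0 and 1 actually matter; the other 8 are pure noise
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + rng.normal(0, 0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X_demo, y_demo)
ridge = Ridge(alpha=0.1).fit(X_demo, y_demo)

print("Lasso weights:", np.round(lasso.coef_, 3))
print("Ridge weights:", np.round(ridge.coef_, 3))
print("Zero weights (Lasso):", int(np.sum(lasso.coef_ == 0)))
print("Zero weights (Ridge):", int(np.sum(ridge.coef_ == 0)))
```

Lasso sets the weights of the eight useless features to exactly zero, while Ridge only makes them small.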

4.5.3 Elastic Net

Elastic Net is a middle ground between Ridge Regression and Lasso Regression. The regularization term is a simple mix of the Ridge and Lasso regularization terms, and you can control the mix ratio r. When r = 0, Elastic Net is equivalent to Ridge Regression, and when r = 1, it is equivalent to Lasso Regression:

$$J(θ) = \mathrm{MSE}(θ) + rα\sum_{i=1}^{n}|θ_i| + \frac{1-r}{2}α\sum_{i=1}^{n}θ_i^2$$

So when should you use plain linear regression (i.e., without any regularization), Ridge, Lasso, or Elastic Net?

⭐ In general, it is preferable to have at least a little regularization, so most of the time you should avoid plain linear regression. Ridge regression is a good default, but if you suspect that only a few features are actually useful, you should prefer Lasso regression or Elastic Net, because they tend to reduce the weights of useless features to zero. In general, Elastic Net is preferred over Lasso, because Lasso may behave erratically when the number of features exceeds the number of training instances or when several features are strongly correlated.

Here is a small example using Scikit-Learn's ElasticNet (l1_ratio corresponds to the mix ratio r):

# Elastic Net regression
from sklearn.linear_model import ElasticNet, SGDRegressor

# Use the ElasticNet class directly
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic_net.fit(X, y)
print(elastic_net.predict([[1.5]]))

# Alternatively, use SGDRegressor(penalty="elasticnet")
sgd_elastic_reg = SGDRegressor(penalty="elasticnet")
sgd_elastic_reg.fit(X, y.ravel())  # SGDRegressor expects a 1D target array
print(sgd_elastic_reg.predict([[1.5]]))
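As a quick check of the statement that r = 1 turns Elastic Net into Lasso, the sketch below fits both Scikit-Learn classes on the same hypothetical data (the data and alpha value are assumptions for illustration) and compares the learned weights. It also fits an intermediate mix for comparison.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(3)
X_demo = rng.normal(size=(200, 10))  # hypothetical features
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + rng.normal(0, 0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X_demo, y_demo)
enet_r1 = ElasticNet(alpha=0.1, l1_ratio=1.0).fit(X_demo, y_demo)   # r = 1
enet_mix = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X_demo, y_demo)  # r = 0.5

print("Lasso        :", np.round(lasso.coef_, 3))
print("ElasticNet r=1:", np.round(enet_r1.coef_, 3))
print("ElasticNet r=0.5:", np.round(enet_mix.coef_, 3))
```

With l1_ratio=1.0 the two models learn the same weights. With l1_ratio=0.5 the penalty is half Lasso-like and half Ridge-like, so the weights differ.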

4.5.4 Early stopping

A very different way to regularize iterative learning algorithms such as gradient descent is to stop training as soon as the validation error reaches a minimum. This is called early stopping. Consider a complex model (in this case, a high-degree polynomial regression model) being trained with batch gradient descent. As the epochs go by, the algorithm learns, and its prediction error (RMSE) on the training set goes down, along with its prediction error on the validation set. After a while, though, the validation error stops decreasing and starts to go back up. This indicates that the model has started to overfit the training data. With early stopping, you just stop training as soon as the validation error reaches the minimum. It is such a simple and efficient regularization technique that Geoffrey Hinton called it a "beautiful free lunch".

With stochastic and mini-batch gradient descent, the curve is not as smooth and it can be difficult to know if you hit a minimum. One solution is to stop only after the validation error exceeds the minimum for a period of time (when you are sure the model will not do better), then roll back the model parameters to where the validation error was minimal. The following is a basic implementation of the early stopping method:

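# Basic implementation of early stopping
from copy import deepcopy

from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Prepare the data (X_train, y_train, X_val, y_val are assumed to come from an earlier train/validation split)
poly_scaler = Pipeline([
    ("poly_features", PolynomialFeatures(degree=90, include_bias=False)),
    ("std_scaler", StandardScaler()),
])
X_train_poly_scaled = poly_scaler.fit_transform(X_train)  # fit the preprocessing on the training set
X_val_poly_scaled = poly_scaler.transform(X_val)  # apply the same preprocessing to the validation set
# tol=None disables the built-in stopping criterion (the original tol=-np.infty relies on
# np.infty, which no longer exists in NumPy 2, and newer Scikit-Learn releases may reject negative tol).
sgd_reg = SGDRegressor(max_iter=1, tol=None, warm_start=True, penalty=None,
                       learning_rate="constant", eta0=0.0005)
# With warm_start=True, each call to fit() continues training where it left off
# instead of restarting from scratch.
minimum_val_error = float("inf")
best_epoch = None
best_model = None
for epoch in range(1000):
    sgd_reg.fit(X_train_poly_scaled, y_train)  # train for one more epoch on the training set
    y_val_predict = sgd_reg.predict(X_val_poly_scaled)  # predict on the validation set
    val_error = mean_squared_error(y_val, y_val_predict)  # validation MSE
    if val_error < minimum_val_error:
        minimum_val_error = val_error
        best_epoch = epoch
        best_model = deepcopy(sgd_reg)  # clone() would copy only the hyperparameters, not the learned weights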
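The loop above always runs for 1,000 epochs and only records the best model. The variant described earlier, which stops once the validation error has stayed above its minimum for a while and then rolls back, could be sketched as follows. This is one possible implementation, not the book's: the patience value, the degree-10 features, and the synthetic data are assumptions, and partial_fit() is used here to run one epoch at a time.

```python
from copy import deepcopy

import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Hypothetical noisy quadratic data
rng = np.random.default_rng(42)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = 0.5 * X_demo[:, 0] ** 2 + X_demo[:, 0] + 2 + rng.normal(0, 1, size=200)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.5, random_state=42)

prep = make_pipeline(PolynomialFeatures(degree=10, include_bias=False), StandardScaler())
X_tr_p = prep.fit_transform(X_tr)
X_va_p = prep.transform(X_va)

sgd = SGDRegressor(penalty=None, learning_rate="constant", eta0=0.0005, random_state=42)
patience = 20  # assumed value: epochs to wait without improvement before stopping
best_val, best_epoch, best_model = float("inf"), None, None
history = []

for epoch in range(1000):
    sgd.partial_fit(X_tr_p, y_tr)  # one pass over the training set
    val_error = mean_squared_error(y_va, sgd.predict(X_va_p))
    history.append(val_error)
    if val_error < best_val:
        # New record: remember the error and keep a fitted copy to roll back to
        best_val, best_epoch, best_model = val_error, epoch, deepcopy(sgd)
    elif epoch - best_epoch >= patience:
        break  # no improvement for `patience` epochs: stop training

print(f"stopped after epoch {len(history) - 1}, best epoch {best_epoch}, "
      f"best validation MSE {best_val:.4f}")
```

After the loop, best_model holds the parameters from the epoch with the lowest validation error, which is the rollback step.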

4.6 Logistic regression

As discussed in Chapter 1, some regression algorithms can be used for classification (and vice versa). Logistic regression (also called logit regression) is commonly used to estimate the probability that an instance belongs to a particular class (e.g., what is the probability that this email is spam?). If the estimated probability is greater than 50%, the model predicts that the instance belongs to that class (called the positive class, labeled "1"); otherwise, it predicts that it does not (i.e., it belongs to the negative class, labeled "0"). This makes it a binary classifier.

4.6.1 Estimated probability

So how does logistic regression work? Just like a linear regression model, a logistic regression model computes a weighted sum of the input features (plus a bias term), but instead of outputting the result directly as the linear regression model does, it outputs the logistic of this result. The estimated probability (in vectorized form) and the logistic function are:

$$p' = h_θ(x) = σ(x^Tθ)$$

$$σ(t) = \frac{1}{1 + \exp(-t)}$$

Once the logistic regression model has estimated the probability p' that an instance x belongs to the positive class, it can make its prediction easily:

$$y' = \begin{cases}0 & \text{if } p' < 0.5\\1 & \text{if } p' ≥ 0.5\end{cases}$$

Since σ(t) < 0.5 when t < 0 and σ(t) ≥ 0.5 when t ≥ 0, the model predicts 1 if $x^Tθ$ is positive and 0 if it is negative.

The score t is often called the logit. The name comes from the fact that the logit function, defined as logit(p) = log(p / (1 – p)), is the inverse of the logistic function. Indeed, if you compute the logit of the estimated probability p', you will find that the result is t. The logit is also called the log-odds, since it is the log of the ratio between the estimated probability for the positive class and the estimated probability for the negative class.
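The relationship between the logistic function, the logit, and the 0.5 threshold can be verified in a few lines of NumPy. The function names sigmoid and logit and the sample scores are chosen here for illustration.

```python
import numpy as np

def sigmoid(t):
    # Logistic function: maps any score t to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-t))

def logit(p):
    # Log-odds: log of the ratio between p and 1 - p
    return np.log(p / (1.0 - p))

t = np.array([-4.0, -1.0, 0.0, 0.5, 3.0])  # example scores x^T theta
p = sigmoid(t)                             # estimated probabilities
y_pred = (p >= 0.5).astype(int)            # predict 1 when p' >= 0.5

print("p      :", np.round(p, 4))
print("logit  :", logit(p))   # recovers t
print("y_pred :", y_pred)
```

Applying logit to the estimated probabilities recovers the original scores, and thresholding the probability at 0.5 is the same as checking whether the score is non-negative.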

4.6.2 Training and cost function

Now you know how a logistic regression model estimates probabilities and makes predictions. But how is it trained? The objective of training is to set the parameter vector θ so that the model estimates high probabilities for positive instances (y = 1) and low probabilities for negative instances (y = 0). This idea is captured by the following cost function for a single training instance x:

$$c(θ) = \begin{cases}-\log(p') & \text{if } y = 1\\-\log(1 - p') & \text{if } y = 0\end{cases}$$

This cost function makes sense because -log(t) grows very large when t approaches 0, so the cost is large if the model estimates a probability close to 0 for a positive instance, and it is also large if the model estimates a probability close to 1 for a negative instance. On the other hand, -log(t) is close to 0 when t is close to 1, so the cost is close to 0 if the estimated probability is close to 0 for a negative instance or close to 1 for a positive instance, which is precisely what we want. The cost function over the whole training set is the average cost over all training instances. It can be written as a single expression called the log loss (Equation 4-17):

$$J(θ) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\left(p'^{(i)}\right) + \left(1 - y^{(i)}\right)\log\left(1 - p'^{(i)}\right)\right]$$

The bad news is that there is no known closed-form equation to compute the value of θ that minimizes this cost function (there is no equivalent of the Normal Equation). The good news is that this cost function is convex, so gradient descent (or any other optimization algorithm) is guaranteed to find the global minimum (if the learning rate is not too large and you wait long enough).
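To make the training procedure concrete, here is a minimal NumPy sketch of batch gradient descent on the log loss. It relies on the standard gradient of the log loss, $\frac{1}{m}X^T(σ(Xθ) - y)$, which is not written out in these notes. The synthetic data, learning rate, and number of iterations are assumptions for illustration.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def log_loss(theta, X_b, y, eps=1e-12):
    # Average cost over all instances; clipping avoids log(0)
    p = np.clip(sigmoid(X_b @ theta), eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

# Hypothetical binary classification data with overlapping classes
rng = np.random.default_rng(0)
m = 200
x = rng.normal(size=(m, 1))
y = (x[:, 0] + rng.normal(0, 0.5, size=m) > 0).astype(float)
X_b = np.c_[np.ones((m, 1)), x]  # add the bias feature x0 = 1

theta = np.zeros(2)
eta = 0.5  # learning rate (assumed value)
losses = [log_loss(theta, X_b, y)]
for _ in range(500):
    gradients = X_b.T @ (sigmoid(X_b @ theta) - y) / m  # gradient of the log loss
    theta -= eta * gradients
    losses.append(log_loss(theta, X_b, y))

accuracy = float(np.mean((sigmoid(X_b @ theta) >= 0.5) == y))
print(f"initial loss {losses[0]:.4f}, final loss {losses[-1]:.4f}, "
      f"training accuracy {accuracy:.2f}")
```

With θ initialized to zero, every estimated probability is 0.5, so the initial loss is log(2). Because the cost function is convex and the learning rate is small enough, the loss decreases at every step.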

4.6.3 Decision Boundary (Iris Plant Dataset)

Let's use the iris dataset to illustrate logistic regression. This is a famous dataset that contains the sepal and petal length and width of 150 iris flowers of three different species: Iris setosa, Iris versicolor, and Iris virginica.

Let's try to build a classifier to detect the Iris virginica type based only on the petal width feature.

# Logistic regression on the iris dataset
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets

# Load the data
iris = datasets.load_iris()
print(list(iris.keys()))
# Output (the exact keys depend on the Scikit-Learn version):
# ['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename']
X = iris["data"][:, 3:]  # petal width
y = (iris["target"] == 2).astype(int)  # 1 if Iris virginica, else 0

# Train a logistic regression model
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression()
log_reg.fit(X, y)

# Look at the model's estimated probabilities for flowers with petal widths from 0 to 3 cm
X_new = np.linspace(0, 3, 1000).reshape(-1, 1)
y_proba = log_reg.predict_proba(X_new)
plt.plot(X_new, y_proba[:, 1], "g-", label="Iris virginica")
plt.plot(X_new, y_proba[:, 0], "b--", label="Not Iris virginica")
plt.legend()
plt.show()

[Figure: estimated probabilities and decision boundary as a function of petal width]

The full plotting code for this figure is as follows:

# + more Matplotlib code to make the image look pretty
X_new = np.linspace(0, 3, 1000).reshape(-1, 1)
y_proba = log_reg.predict_proba(X_new)
# First petal width whose estimated probability reaches 50%, extracted as a plain float
decision_boundary = X_new[y_proba[:, 1] >= 0.5][0, 0]
plt.figure(figsize=(8, 3))
plt.plot(X[y == 0], y[y == 0], "bs")
plt.plot(X[y == 1], y[y == 1], "g^")
plt.plot([decision_boundary, decision_boundary], [-1, 2], "k:", linewidth=2)
plt.plot(X_new, y_proba[:, 1], "g-", linewidth=2, label="Iris virginica")
plt.plot(X_new, y_proba[:, 0], "b--", linewidth=2, label="Not Iris virginica")
plt.text(decision_boundary+0.02, 0.15, "Decision  boundary", fontsize=14, color="k", ha="center")
plt.arrow(decision_boundary, 0.08, -0.3, 0, head_width=0.05, head_length=0.1, fc='b', ec='b')
plt.arrow(decision_boundary, 0.92, 0.3, 0, head_width=0.05, head_length=0.1, fc='g', ec='g')
plt.xlabel("Petal width (cm)", fontsize=14)
plt.ylabel("Probability", fontsize=14)
plt.legend(loc="center left", fontsize=14)
plt.axis([0, 3, -0.02, 1.02])
plt.show()

The petal width of Iris virginica flowers (represented by triangles) ranges from 1.4 cm to 2.5 cm, while the other iris flowers (represented by squares) generally have a smaller petal width, ranging from 0.1 cm to 1.8 cm. Notice that there is a bit of overlap. Above about 2 cm the classifier is highly confident that the flower is an Iris virginica (it outputs a high probability for that class), while below 1 cm it is highly confident that it is not (it outputs a high probability for the "Not Iris virginica" class). In between these extremes, the classifier is unsure. However, if you ask it to predict the class (using the predict() method rather than the predict_proba() method), it will return whichever class is the most likely. Therefore, there is a decision boundary at around 1.6 cm where both probabilities are equal to 50%: if the petal width is greater than 1.6 cm, the classifier will predict that the flower is an Iris virginica, and otherwise it will predict that it is not (even if it is not very confident):

# Test the predictions
print(log_reg.predict([[1.7], [1.5]]))  # predicted classes for petal widths of 1.7 cm and 1.5 cm

The figure below shows the same dataset, but this time displaying two features: petal width and petal length. Once trained, the logistic regression classifier can estimate, based on these two features, the probability that a new flower is an Iris virginica. The dashed line represents the points where the model estimates a 50% probability: this is the model's decision boundary. Note that it is a linear boundary (it is the set of points x such that θ0 + θ1x1 + θ2x2 = 0, which defines a straight line). Each parallel line represents the points where the model outputs a specific probability, from 15% (bottom left) to 90% (top right). All the flowers beyond the top-right line have more than a 90% chance of being Iris virginica, according to the model.

[Figure: linear decision boundary for petal length and petal width]

Just like the other linear models, logistic regression models can be regularized using $\ell_1$ or $\ell_2$ penalties. Scikit-Learn actually adds an $\ell_2$ penalty by default.

The hyperparameter controlling the regularization strength of a Scikit-Learn LogisticRegression model is not alpha (as in other linear models), but its inverse: C. The higher the value of C, the less the model is regularized.
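The effect of C can be observed directly on the learned weight. This sketch trains the petal-width classifier with three values of C (the specific values 0.01, 1, and 100 are arbitrary choices for illustration) and prints the resulting coefficient.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris_demo = load_iris()
X_pw = iris_demo["data"][:, 3:]                        # petal width
y_virginica = (iris_demo["target"] == 2).astype(int)  # 1 if Iris virginica, else 0

coefs = {}
for C in (0.01, 1, 100):
    clf = LogisticRegression(C=C).fit(X_pw, y_virginica)
    coefs[C] = float(clf.coef_[0, 0])
    print(f"C={C:<6} weight={coefs[C]:.3f}")
```

A small C means strong regularization, so the weight is pulled toward zero and the probability curve is flatter. A large C leaves the weight almost unconstrained.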

4.6.4 Softmax regression

The logistic regression model can be generalized to support multiple classes directly, without having to train and combine multiple binary classifiers (as discussed in Chapter 3). This is called Softmax regression, or multinomial logistic regression.

The idea is simple: given an instance x, the Softmax regression model first computes a score $s_k(x)$ for each class k, then estimates the probability of each class by applying the softmax function (also called the normalized exponential) to the scores. The equation to compute $s_k(x)$ should look familiar, as it is just like the equation for linear regression prediction:

$$s_k(x) = x^Tθ^{(k)}$$

Note that each class has its own dedicated parameter vector $θ^{(k)}$. All these vectors are typically stored as rows in a parameter matrix Θ.

Once you have computed the score of every class for the instance x, you can estimate the probability that the instance belongs to class k by running the scores through the softmax function. The function computes the exponential of every score, then normalizes them (dividing by the sum of all the exponentials). The scores are generally called logits or log-odds (although they are actually unnormalized log-odds).

Just like logistic regression classifiers, softmax regression classifiers predict the class with the highest estimated probability (in simple terms, the class with the highest score).
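In symbols, the softmax step described above estimates the probability of class k as $\exp(s_k(x)) / \sum_j \exp(s_j(x))$. The following NumPy sketch computes the scores, the probabilities, and the predicted class for two instances. The parameter matrix Theta and the instances are made-up values for illustration, and subtracting the row maximum before exponentiating is a common numerical-stability trick added here, not something from the notes.

```python
import numpy as np

def softmax(scores):
    # Subtracting the row maximum does not change the result but avoids overflow
    shifted = scores - np.max(scores, axis=1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=1, keepdims=True)

# Hypothetical parameter matrix: one row theta^(k) per class (3 classes, bias + 2 features)
Theta = np.array([[0.0, 1.0, -1.0],
                  [0.5, 0.0, -0.5],
                  [-1.0, 0.5, 2.0]])
# Two hypothetical instances, each with the bias feature x0 = 1 prepended
X_b = np.array([[1.0, 2.0, 0.5],
                [1.0, -1.0, 3.0]])

scores = X_b @ Theta.T              # s_k(x) = x^T theta^(k) for every class k
proba = softmax(scores)             # estimated class probabilities
y_pred = np.argmax(proba, axis=1)   # class with the highest estimated probability

print("scores:\n", scores)
print("probabilities:\n", np.round(proba, 4))
print("predicted classes:", y_pred)
```

Each row of probabilities sums to 1, and the class with the highest probability is always the class with the highest score.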

The Softmax regression classifier predicts only one class at a time (i.e., it is multiclass, not multioutput), so it should be used only with mutually exclusive classes, such as different types of plants. You cannot use it to recognize multiple people in one picture.

Let's use Softmax regression to classify the iris flowers into all three classes. In the Scikit-Learn version used by the book, LogisticRegression uses one-versus-the-rest by default when you train it on more than two classes, but you can set the multi_class hyperparameter to "multinomial" to switch it to Softmax regression (newer Scikit-Learn releases may already default to Softmax regression with this solver). You must also specify a solver that supports Softmax regression, such as the "lbfgs" solver. It also applies $\ell_2$ regularization by default, which you can control using the hyperparameter C:

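# Softmax regression on the iris dataset
X = iris["data"][:, (2, 3)]  # petal length, petal width
y = iris["target"]
softmax_reg = LogisticRegression(multi_class="multinomial", solver="lbfgs", C=10)
softmax_reg.fit(X, y)

# The next time you find an iris with petals 5 cm long and 2 cm wide, you can ask the model
# what type it is: it answers Iris virginica (class 2) with 94.2% probability,
# or Iris versicolor with 5.8% probability.
print(softmax_reg.predict([[5, 2]]))  # predicted class
print(softmax_reg.predict_proba([[5, 2]]))  # estimated probability of each class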

The figure below shows the resulting decision boundaries, represented by the background colors. Notice that the decision boundaries between any two classes are linear. The curved lines show the estimated probabilities for the Iris versicolor class (e.g., the line labeled 0.45 represents the 45% probability boundary). Notice that the model can predict a class that has an estimated probability below 50%. For example, at the point where all decision boundaries meet, all classes have an equal estimated probability of 33%.

[Figure: Softmax regression decision boundaries for petal length and petal width]

4.7 Exercises

Questions

  1. Which linear regression training algorithm can be used if the training set has millions of features?

  2. Suppose the features in your training set have very different scales. Which algorithms might suffer from this, and how? What can you do about it?

  3. Can gradient descent get stuck in local minima when training a logistic regression model?

  4. Do all gradient descent algorithms produce the same model if you let them run long enough?

  5. Suppose you use batch gradient descent and plot the validation error at each epoch. If you see validation errors keep going up, what might be going on? How do you solve it?

  6. Is it a good idea to stop mini-batch gradient descent immediately when the validation error rises?

  7. Which gradient descent algorithm (among the ones we discussed) will get near the optimal solution the fastest? Which will actually converge? How to make the others converge too?

  8. Suppose you are using polynomial regression. You plot the learning curves and notice that there is a large gap between the training error and the validation error. What is happening? What are three ways to solve this?

  9. Suppose you are using ridge regression, and you notice that the training error and validation error are almost equal and quite high. Would you say the model has high bias or high variance? Should you increase the regularization hyperparameter α or decrease it?

  10. Why would you want to use:

    • Ridge regression instead of plain linear regression (i.e., without any regularization)?
    • Lasso instead of ridge regression?
    • Elastic Net instead of Lasso?

  11. Suppose you want to classify pictures as outdoor/indoor and daytime/nighttime. Should you implement two logistic regression classifiers or one Softmax regression classifier?

  12. Implement batch gradient descent with early stopping for Softmax regression (without using Scikit-Learn).

Answers

  1. If your training set has millions of features, you can use stochastic gradient descent or mini-batch gradient descent, and perhaps batch gradient descent if the training set fits in memory. But you cannot use the Normal Equation or the SVD approach, because the computational complexity grows quickly (more than quadratically) with the number of features.

  2. If the features in your training set have very different scales, the cost function has the shape of an elongated bowl, so the gradient descent algorithms take a long time to converge. To solve this, you should scale the data before training the model. Note that the Normal Equation and the SVD approach work fine without scaling. Moreover, regularized models may converge to a suboptimal solution if the features are not scaled: since regularization penalizes large weights, features with smaller values tend to be ignored compared to features with larger values.

  3. When training a logistic regression model, gradient descent does not get stuck in a local minimum because the cost function is convex.

  4. If the optimization problem is convex (such as linear regression or logistic regression), and assuming the learning rate is not too high, then all gradient descent algorithms will approach the global optimum and end up producing very similar models. However, stochastic gradient descent and mini-batch gradient descent will never truly converge unless the learning rate is gradually reduced. Instead, they keep bouncing back and forth around the global optimum. This means that even if you let them run for a long time, these gradient descent algorithms will produce slightly different models.

  5. If the validation error keeps rising after each epoch, one possibility is that the learning rate is too high and the algorithm is diverging. If the training error also increases, then this is clearly the problem and you should reduce the learning rate. However, if the training error does not increase, your model has overfit the training set and you should stop training.

  6. Due to their random nature, neither stochastic gradient descent nor mini-batch gradient descent is guaranteed to make progress at every training iteration. So if you stop training as soon as the validation error goes up, you may stop much too early, before the optimum is reached. A better option is to save the model at regular intervals. Then, when it has not improved for a long time (meaning it will probably never beat the record), you can revert to the best saved model.

  7. Stochastic gradient descent has the fastest training iterations because it considers only one training instance at a time, so it is generally the first to reach the vicinity of the global optimum (or mini-batch gradient descent with a very small mini-batch size). However, only batch gradient descent will actually converge, given enough training time. As mentioned, stochastic and mini-batch gradient descent will bounce around the optimum unless you gradually reduce the learning rate.

  8. If the validation error is much higher than the training error, it is likely because the model is overfitting the training set. One way to fix this is to reduce the polynomial degree: a model with fewer degrees of freedom is less likely to overfit. Another is to regularize the model, for example by adding an $\ell_2$ penalty (Ridge) or an $\ell_1$ penalty (Lasso) to the cost function, which also reduces the degrees of freedom of the model. Lastly, you can try to increase the size of the training set.

  9. If the training error and the validation error are almost equal and fairly high, the model is likely underfitting the training set, which means it has high bias. You should try reducing the regularization hyperparameter α.

  10. Let's see:

    • A model with some regularization typically performs better than a model without any regularization, so you should generally prefer ridge regression over plain linear regression.
    • Lasso regression uses an $\ell_1$ penalty, which tends to push the weights down to exactly zero. This leads to sparse models, where all weights are zero except for the most important ones. It is a way to perform feature selection automatically, which is good if you suspect that only a few features actually matter. When you are not sure, you should prefer ridge regression.
    • Elastic Net is generally preferred over Lasso because Lasso may behave erratically in some cases (when several features are strongly correlated or when there are more features than training instances). However, it does add an extra hyperparameter to tune. If you want Lasso without the erratic behavior, you can just use Elastic Net with an l1_ratio close to 1.

  11. If you want to classify pictures as outdoor/indoor and daytime/nighttime, you should train two logistic regression classifiers, since these are not exclusive classes (i.e., all four combinations are possible).

  12. See the Jupyter notebooks at https://github.com/ageron/handson-ml2 .


Origin blog.csdn.net/Morganfs/article/details/124221601