[Python] How to use the scikit-learn library for linear regression training and prediction

What is Scikit-learn?

Scikit-learn is a machine learning library for the Python programming language. It provides a variety of supervised and unsupervised learning algorithms, covering classification, regression, clustering, dimensionality reduction, and more. It is easy to use, and it also provides tools for data preprocessing, model selection, and model evaluation. Scikit-learn is a common first choice in many machine learning projects.

Install scikit-learn

pip install scikit-learn

For other installation methods, see "Installing scikit-learn" in the scikit-learn 1.3.2 documentation.
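
A quick way to confirm that the installation worked is to import the package and print its version. This is only a minimal sketch; the variable name installed_version is chosen for illustration, and the version printed depends on your environment.

```python
# Minimal install check: import the package and read its version string.
import sklearn

installed_version = sklearn.__version__
print("scikit-learn version:", installed_version)
```

If the import fails with ModuleNotFoundError, pip most likely installed into a different Python environment than the one running the script.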

Linear regression using scikit-learn

The following is a detailed example of linear regression using scikit-learn:

1) First, we need to import the required libraries:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

2) Next, we create a simple data set where x is the independent variable and y is the dependent variable:

# Create the dataset: y = 2 + 3x plus uniform noise in [0, 1)
np.random.seed(0)
x = np.random.rand(100, 1)
y = 2 + 3 * x + np.random.rand(100, 1)

# Plot the dataset
plt.scatter(x, y)
plt.xlabel('x')
plt.ylabel('y')
plt.show()

3) Divide the data set into a training set and a test set:

# Split the dataset into a training set (80%) and a test set (20%)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
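
To make the effect of test_size=0.2 concrete, the sketch below rebuilds the same synthetic data and prints the shapes of the four arrays returned by train_test_split. With 100 samples, 20 go to the test set and 80 to the training set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Same synthetic data as in the walkthrough above.
np.random.seed(0)
x = np.random.rand(100, 1)
y = 2 + 3 * x + np.random.rand(100, 1)

# random_state fixes the shuffle, so the split is reproducible.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)

print(x_train.shape, x_test.shape)  # (80, 1) (20, 1)
print(y_train.shape, y_test.shape)  # (80, 1) (20, 1)
```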

4) Create a linear regression model and fit the training data:

# Create the linear regression model and fit it to the training data
regressor = LinearRegression()
regressor.fit(x_train, y_train)
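
After fitting, the learned line can be read from the coef_ and intercept_ attributes of the model. The sketch below repeats the setup so that it runs on its own; the names slope and intercept are introduced here for illustration. Because the noise term np.random.rand is uniform on [0, 1) and has mean 0.5, the fitted intercept should come out near 2.5 rather than 2, while the slope should be close to 3.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Same synthetic data and split as in the walkthrough above.
np.random.seed(0)
x = np.random.rand(100, 1)
y = 2 + 3 * x + np.random.rand(100, 1)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)

regressor = LinearRegression()
regressor.fit(x_train, y_train)

# y is 2-D here, so coef_ and intercept_ are arrays; flatten them to scalars.
slope = float(np.ravel(regressor.coef_)[0])
intercept = float(np.ravel(regressor.intercept_)[0])
print("slope:", slope)
print("intercept:", intercept)
```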

5) Use the model to make predictions:

# Use the model to make predictions on the test set
y_pred = regressor.predict(x_test)
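
The same predict call also works for values that were not in the dataset. The sketch below is self-contained and uses a hypothetical array x_new. Note that predict expects a 2-D array of shape (n_samples, n_features), so each value is wrapped in its own row.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same synthetic data as in the walkthrough above (fit on all of it for brevity).
np.random.seed(0)
x = np.random.rand(100, 1)
y = 2 + 3 * x + np.random.rand(100, 1)

regressor = LinearRegression()
regressor.fit(x, y)

# Hypothetical new inputs: three samples, one feature each.
x_new = np.array([[0.0], [0.5], [1.0]])
y_new = regressor.predict(x_new)

print(y_new.shape)  # (3, 1), because y was 2-D during fitting
print(y_new.ravel())
```

Passing a 1-D array such as np.array([0.0, 0.5, 1.0]) raises an error; reshape(-1, 1) converts it to the expected shape.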

6) Evaluate the performance of the model:

# Evaluate the performance of the model on the test set
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print('Mean squared error:', mse)
print('R2 score:', r2)

The mean squared error is never negative. The smaller it is, the better the fit, and a value of 0 means the predictions match the actual values exactly.

The R2 score works in the opposite direction: the larger the better, with a maximum value of 1.

Mean Squared Error (MSE) is a statistical measure of how well a regression model fits the data. It is the average of the squared differences between the predicted values and the actual values, that is, the average of the squared errors.

The smaller the MSE, the better the model's predictions; the larger the MSE, the worse they are. Because the errors are squared, the MSE is expressed in squared units of y and is sensitive to a few large errors.
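
As a worked example of the definition, the sketch below computes the MSE by hand with NumPy on three made-up values (y_true and y_hat are illustrative, not taken from the dataset above). The result should agree with what mean_squared_error returns for the same inputs.

```python
import numpy as np

# Illustrative values only.
y_true = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.5, 5.0, 8.0])

# Squared error per sample: [0.25, 0.0, 1.0]
squared_errors = (y_true - y_hat) ** 2

# MSE is the mean of the squared errors: 1.25 / 3
mse_manual = squared_errors.mean()
print(mse_manual)
```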

The R2 score, also known as the coefficient of determination, is a statistical measure of how well a regression model fits the data. It represents the proportion of the variation in the dependent variable that the model explains through the independent variable.

The R2 score is at most 1. The closer it is to 1, the better the fit. A score of 0 means the model does no better than always predicting the mean of y, and a model that fits worse than that gets a negative score. For a reasonable model the score usually falls between 0 and 1.

The R2 score can also be used to compare different models on the same data. Generally, the larger the R2 score, the better the fit.
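
The R2 score can be written as 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the actual values from their mean. The sketch below implements this with a helper r2_manual (a name introduced here for illustration) on made-up values. It also shows that the score is 0 for a model that always predicts the mean and can drop below 0 for a model that is worse than that.

```python
import numpy as np

def r2_manual(y_true, y_hat):
    """R2 = 1 - SS_res / SS_tot, computed by hand."""
    ss_res = np.sum((y_true - y_hat) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return 1.0 - ss_res / ss_tot

# Illustrative values only.
y_true = np.array([1.0, 2.0, 3.0, 4.0])

good = r2_manual(y_true, np.array([1.1, 1.9, 3.2, 3.8]))   # close predictions
baseline = r2_manual(y_true, np.full(4, y_true.mean()))    # always predict the mean
bad = r2_manual(y_true, np.array([4.0, 3.0, 2.0, 1.0]))    # reversed predictions

print(good, baseline, bad)  # approximately 0.98, 0.0, -3.0
```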

7) Finally, we can plot the fitting results:

# Plot the fitted line against the test data
plt.scatter(x_test, y_test, color='blue', label='actual')
plt.plot(x_test, y_pred, color='red', linewidth=2, label='predict')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.show()

This example shows how to use scikit-learn to perform linear regression. First, we created a simple dataset and split it into training and test sets. Next, we created a linear regression model and fit it to the training data. Finally, we evaluated the model on the test data and plotted the fitted line.

Origin blog.csdn.net/u011775793/article/details/135436120