Machine Learning in Practice: Prediction with Linear Regression (LR) in Python (10)

1 Introduction

Note that LR here refers to Linear Regression, not Logistic Regression. Both are abbreviated LR, but the latter is more commonly referred to simply as "Logistic".

1.1 Introduction to LR

Linear regression is a statistical and machine learning method used to model the linear relationship between one or more independent variables and a continuous dependent variable. It is one of the simplest and most common regression analysis methods.

The goal of linear regression is to describe the relationship between independent variables and dependent variables by fitting the optimal straight line ( univariate linear regression ) or hyperplane ( multiple linear regression ). It assumes that there is a linear relationship between the independent variable and the dependent variable, that is, the dependent variable can be explained by a linear combination of the independent variables.

The mathematical expression of the univariate linear regression model is: Y = β0 + β1*X + ε, where Y is the dependent variable, X is the independent variable, β0 and β1 are the regression coefficients, and ε is the error term. This model describes the linear relationship between the dependent variable Y and the independent variable X, with β0 being the intercept and β1 being the slope.
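As a minimal sketch of how β0 and β1 can be estimated, the snippet below applies the usual least-squares formulas to a small synthetic dataset. The data values and variable names are assumptions chosen for illustration and are not taken from the weather dataset used later.

```python
import numpy as np

# Synthetic data (an assumption for illustration): y = 2 + 3x with no noise
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 + 3.0 * x

# Least-squares estimates for a single independent variable
x_mean, y_mean = x.mean(), y.mean()
beta1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)  # slope
beta0 = y_mean - beta1 * x_mean  # intercept

print(beta0, beta1)
```

Because the synthetic data contain no noise, the estimates recover the intercept and slope used to generate them.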

Multiple linear regression extends univariate linear regression to handle multiple independent variables. The mathematical expression is: Y = β0 + β1*X1 + β2*X2 + ... + βn*Xn + ε, where Y is the dependent variable, X1, X2, ..., Xn are the independent variables, β0, β1, β2, ..., βn are the regression coefficients, and ε is the error term.
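The same idea carries over to several independent variables. The sketch below is one possible way to obtain the coefficients with NumPy's least-squares solver. The data are synthetic, and the names X_design and beta are hypothetical.

```python
import numpy as np

# Synthetic data (an assumption for illustration): Y = 1 + 2*X1 - 1*X2, no noise
X = np.array([[1.0, 2.0],
              [2.0, 0.0],
              [3.0, 1.0],
              [4.0, 3.0],
              [5.0, 5.0]])
Y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1]

# Prepend a column of ones so that the first coefficient is the intercept β0
X_design = np.column_stack([np.ones(len(X)), X])

# Least-squares solution of X_design @ beta ≈ Y
beta, residuals, rank, _ = np.linalg.lstsq(X_design, Y, rcond=None)

print(beta)  # order: β0, β1, β2
```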

Advantages:

Simplicity: Linear regression is a simple and intuitive method that is easy to understand. It establishes a linear relationship between the independent variables and the dependent variable, and the size and direction of each independent variable's influence can be read from its regression coefficient.
Computationally efficient: Linear regression is often computationally efficient, especially with large samples and low-dimensional feature spaces. Fitting a linear regression model has low computational complexity and can handle large datasets.
Strong interpretability: Linear regression provides quantitative information about the relationships between variables. Each regression coefficient quantifies the contribution of an independent variable to the dependent variable.
Good accuracy on linear data: When the relationship between the independent variables and the dependent variable is actually linear, linear regression can fit the data well and predict accurately.

Disadvantages:

Linear Assumption Limitation: Linear regression assumes that there is a linear relationship between the independent and dependent variables, which is not always true in practical problems. If the true relationship in the data is non-linear, linear regression models may fail to capture complex patterns and associations.
Sensitive to outliers: Linear regression is sensitive to outliers (extreme values in the dependent or independent variables). Outliers can have a significant impact on the fit of the model, leading to model inaccuracy.
Difficulty with high-dimensional features: Linear regression faces challenges in high-dimensional feature spaces. When the number of independent variables is close to or larger than the sample size, the least-squares solution becomes unstable or non-unique, and the model tends to overfit.
Lack of flexibility: Linear regression is less flexible and cannot capture complex nonlinear relationships. For nonlinear problems, additional, more complex models are required to improve the fit.
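The sensitivity to outliers mentioned above can be illustrated with a small experiment. This is only a sketch on synthetic data: a single point is corrupted and the straight line is fitted again with np.polyfit, which returns the slope first and the intercept second for degree 1.

```python
import numpy as np

x = np.arange(10, dtype=float)
y_clean = 1.0 + 2.0 * x          # perfectly linear synthetic data
y_outlier = y_clean.copy()
y_outlier[-1] += 50.0            # corrupt a single observation

# Degree-1 least-squares fit: returns [slope, intercept]
slope_clean, intercept_clean = np.polyfit(x, y_clean, 1)
slope_outlier, intercept_outlier = np.polyfit(x, y_outlier, 1)

print('slope without outlier:', slope_clean)
print('slope with one outlier:', slope_outlier)
```

One extreme value out of ten is enough to pull the fitted slope well away from the true value of 2.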

1.2 Application of LR

Linear regression is one of the most basic machine learning algorithms and has a wide range of applications:

Economics and finance: Linear regression can be used to model the relationship between economic indicators (such as GDP and the inflation rate) and independent variables (such as consumption, investment, and exports) for economic forecasting and policy analysis. In finance, it can be used to predict indicators such as stock prices and interest rates.
Marketing: Linear regression can be used in marketing research, for example to model how sales volume depends on advertising spend, price, and other factors, and to support market demand analysis and marketing strategy.
Medicine and health sciences: Linear regression can be used to analyze medical and health data, for example to model the relationship between disease progression and risk factors, evaluate the effect of treatments, and analyze biomedical data.
Social science: Linear regression can be used in social science research, for example to model how student grades relate to study time, family background, and other factors in educational research, or how income relates to education level and occupation in socioeconomics.
Environmental science: Linear regression can be used to analyze environmental data, for example to model the relationship between temperature and greenhouse gas emissions or atmospheric pollutants, and to evaluate the impact of environmental factors on ecosystems.
Engineering and physical sciences: Linear regression can be used to build physical models and make predictions in engineering design, for example to model how material strength depends on temperature and pressure, or how the performance of electronic components depends on design parameters.

2 Hands-on demonstration with a weather dataset

2.1 Import libraries

import pandas as pd  
import numpy as np  
import matplotlib.pyplot as plt  
import seaborn as seabornInstance 
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression
from sklearn import metrics
%matplotlib inline

2.2 Import data

The weather dataset includes precipitation, snowfall, temperature, wind speed, and whether the day included thunderstorms or other severe weather conditions. The task is to predict the maximum temperature, using the minimum temperature as the only input feature.
Download address: https://github.com/Vaibhav-Mehta-19/linear-regression-weather-dataset

dataset = pd.read_csv('weather.csv')
print(dataset.shape)
dataset.describe()
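Before fitting, it can help to check the two columns used below for missing values, since scikit-learn's LinearRegression does not accept NaN inputs. The snippet is a sketch on a tiny made-up DataFrame named sample. Whether the real weather.csv contains missing values in these columns is not stated here, so treat this as an optional check.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the weather data (values are made up for illustration)
sample = pd.DataFrame({
    'MinTemp': [10.0, 12.5, np.nan, 8.0],
    'MaxTemp': [20.0, 24.0, 18.0, np.nan],
})

# Count missing values in each of the two columns
missing = sample[['MinTemp', 'MaxTemp']].isnull().sum()
print(missing)

# Keep only the rows where both columns are present
cleaned = sample.dropna(subset=['MinTemp', 'MaxTemp'])
print(cleaned.shape)
```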

2.3 Overall data visualization

# 2-D scatter plot of minimum vs. maximum temperature
dataset.plot(x='MinTemp', y='MaxTemp', style='o')  
plt.title('MinTemp vs MaxTemp')  
plt.xlabel('MinTemp')  
plt.ylabel('MaxTemp')  
plt.show()

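# Check the distribution of the maximum temperature
plt.figure(figsize=(15,10))
# histplot with kde=True replaces the deprecated distplot
seabornInstance.histplot(dataset['MaxTemp'], kde=True)
plt.tight_layout()
plt.show()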

According to the distribution plot, the average maximum temperature is about 15-20.

2.4 Training model

X = dataset['MinTemp'].values.reshape(-1,1)
y = dataset['MaxTemp'].values.reshape(-1,1)
# As usual, split the data into training and test sets at a 70/30 ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
regressor = LinearRegression()  
regressor.fit(X_train, y_train)

Compute the intercept and slope:

print(regressor.intercept_)
print(regressor.coef_)

The slope of about 0.82 means that for every one-unit increase in the minimum temperature, the predicted maximum temperature increases by about 0.82.
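To make the role of the intercept and slope concrete, the sketch below fits a model on synthetic stand-in data and reproduces one prediction by hand with Y = β0 + β1*X. The data and the names model, b0, and b1 are assumptions for illustration. The real coefficients come from the weather dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (assumption): max = 10 + 0.8 * min, no noise
min_temp = np.array([0.0, 5.0, 10.0, 15.0, 20.0]).reshape(-1, 1)
max_temp = (10.0 + 0.8 * min_temp).reshape(-1, 1)

model = LinearRegression().fit(min_temp, max_temp)

# With a 2-D target, intercept_ has shape (1,) and coef_ has shape (1, 1)
b0 = model.intercept_[0]
b1 = model.coef_[0][0]

# Manual prediction for a minimum temperature of 12
manual = b0 + b1 * 12.0
from_model = model.predict(np.array([[12.0]]))[0][0]

print(manual, from_model)
```

The two printed values agree, which shows that predict only evaluates the fitted straight line.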

2.5 Prediction

y_pred = regressor.predict(X_test)
df = pd.DataFrame({'Actual': y_test.flatten(), 'Predicted': y_pred.flatten()})
df

# Visualize the first 25 results as a bar chart
df1 = df.head(25)
df1.plot(kind='bar', figsize=(16,10))
plt.grid(which='major', linestyle='-', linewidth=0.5, color='green')
plt.grid(which='minor', linestyle=':', linewidth=0.5, color='black')
plt.show()

The first 25 samples are shown here. The accuracy of the model is a bit low, but the predicted values are still relatively close to the actual values.

# Plot the fitted regression line over the test data
plt.scatter(X_test, y_test,  color='gray')
plt.plot(X_test, y_pred, color='red', linewidth=2)
plt.show()

2.6 Model evaluation

Mean absolute error (MAE), the average of the absolute values of the errors:

MAE = (1/n) * Σ|ŷi - yi|

Mean squared error (MSE), the average of the squared errors:

MSE = (1/n) * Σ(ŷi - yi)^2

Root mean squared error (RMSE), the square root of the mean of the squared errors:

RMSE = √(MSE)

Here n is the number of samples, yi is the actual value, and ŷi is the predicted value of the i-th sample.
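The three metrics can also be computed directly with NumPy, which makes the definitions easy to check. The arrays y_true and y_hat below are made-up values for illustration, not results from the weather model.

```python
import numpy as np

# Hypothetical actual and predicted values
y_true = np.array([20.0, 25.0, 30.0, 18.0])
y_hat = np.array([22.0, 24.0, 27.0, 18.0])

errors = y_hat - y_true          # [2, -1, -3, 0]

mae = np.mean(np.abs(errors))    # (2 + 1 + 3 + 0) / 4
mse = np.mean(errors ** 2)       # (4 + 1 + 9 + 0) / 4
rmse = np.sqrt(mse)              # square root of the MSE

print(mae, mse, rmse)
```

Because squaring gives large errors more weight, the RMSE is never smaller than the MAE.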

The model is evaluated mainly with these three metrics, which can be calculated with the built-in functions of the Scikit-Learn library:

print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))  
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

The smaller these three metrics, the better. The results are not ideal, but with a root mean squared error of 4.42 and a mean absolute error of 3.76, the prediction error of the model can be considered relatively small.

3 Discussion

I think linear regression is one of the most fundamental and common models in machine learning. Linear regression models make predictions by establishing a linear relationship between features and a target variable. Their simplicity and interpretability make them widely used in many application domains.

However, linear regression models also have some limitations. They assume that the relationship between features and target is linear, and they are sensitive to outliers. As can be seen in the scatter plot in Section 2.3, the data are widely scattered. Furthermore, linear regression cannot capture nonlinear relationships or complex interactions between features. In these cases, it may be necessary to consider more complex models or to transform the features.
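As one hedged example of transforming the features, the sketch below adds a squared term with scikit-learn's PolynomialFeatures and compares the mean squared error with that of a plain straight-line fit. The data are synthetic and deliberately quadratic, so this shows the idea only and says nothing about the weather dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic nonlinear data (assumption): y = 1 + x^2, no noise
x = np.linspace(-3, 3, 30).reshape(-1, 1)
y = 1.0 + x.ravel() ** 2

# Plain linear regression on the raw feature
linear = LinearRegression().fit(x, y)

# Linear regression on the expanded features [1, x, x^2]
quadratic = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(x, y)

mse_linear = np.mean((linear.predict(x) - y) ** 2)
mse_quadratic = np.mean((quadratic.predict(x) - y) ** 2)

print('MSE, straight line:', mse_linear)
print('MSE, with squared feature:', mse_quadratic)
```

The model is still linear in its coefficients. Only the input features change, which is why the same LinearRegression class can fit the curve.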