Foreword
Environment
- Microsoft Excel
- Python
Data set
weights_heights(身高-体重数据集).xls (height-weight data set)
1 Excel implements linear regression
1.1 Add data analysis tools
Before starting, check whether the Data Analysis tools have been added to your Excel, as shown in the figure below.
If not, select "Insert" -> "My Add-ins" -> "Manage Other Add-ins" in the menu bar, check "Analysis ToolPak", "Analysis ToolPak - VBA", and "Solver Add-in", and confirm.
1.2 Excel completes linear regression analysis
Open the weights_heights(身高-体重数据集).xls file mentioned in the foreword and select Data -> Data Analysis -> Regression -> OK from the menu:
- 20 sets of measurement data
Use the weight as the Y value and the height as the X value, select 20 data points, and choose the output range. Here I chose to output to a new sheet, checked "Line Fit Plots", and clicked OK. The output chart and results are as follows:
After adding the trend line, the result is as follows:
- 200 sets of measurement data
In the same way as above, select 200 data points; the resulting chart is as follows:
- 2000 sets of measurement data
Continue with 2000 data points; the steps are the same, and the output is as follows:
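2 Python programming, without the help of sklearn
2.1 Without using sklearn
The regression coefficients are computed directly from the least-squares formulas; pandas, numpy, and matplotlib are used only to load the data, do basic arithmetic, and plot.
1. 20 sets of data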
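import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import math
# Prepare the data (reading .xls files with pandas requires the xlrd package)
p = pd.read_excel('weights_heights(身高-体重数据集).xls', sheet_name='weights_heights')
# Read the first 20 rows
p1 = p.head(20)
x = p1["Height"]
y = p1["Weight"]
# Means
x_mean = np.mean(x)
y_mean = np.mean(y)
# Number of samples in the x (or y) column
xsize = x.size
# Sums of products of deviations: zi = Sxy, mu = Sxx, n = Syy
zi = ((x - x_mean) * (y - y_mean)).sum()
mu = ((x - x_mean) * (x - x_mean)).sum()
n = ((y - y_mean) * (y - y_mean)).sum()
# Parameters: slope a and intercept b
a = zi / mu
b = y_mean - a * x_mean
# Square of the correlation coefficient R
m = (zi / math.sqrt(mu * n)) ** 2
# Round the parameters to 4 decimal places
a = np.around(a, decimals=4)
b = np.around(b, decimals=4)
m = np.around(m, decimals=4)
print(f'Regression line equation: y = {a}x + ({b})')
print(f'The correlation regression coefficient is {m}')
# Plot the data points and the fitted line with matplotlib
y1 = a * x + b
plt.scatter(x, y)
plt.plot(x, y1, c='r')
plt.show()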
The result is as follows:
Linear regression equation: y=4.128x-152.2338
Coefficient of determination: R^2=0.3254
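For reference, the quantities computed in the code above (zi, mu, and n) correspond to the standard closed-form least-squares expressions for a line y = ax + b. The notation S_xy, S_xx, S_yy and the sample count N are introduced here for convenience and do not appear in the original code:

```latex
\begin{aligned}
S_{xy} &= \sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y}) && \text{(zi)} \\
S_{xx} &= \sum_{i=1}^{N} (x_i - \bar{x})^2 && \text{(mu)} \\
S_{yy} &= \sum_{i=1}^{N} (y_i - \bar{y})^2 && \text{(n)} \\
a &= \frac{S_{xy}}{S_{xx}}, \qquad b = \bar{y} - a\,\bar{x} \\
R^2 &= \left( \frac{S_{xy}}{\sqrt{S_{xx}\, S_{yy}}} \right)^2
\end{aligned}
```

Because R^2 here is the square of the Pearson correlation coefficient, a value such as 0.3254 means that height explains roughly a third of the variance in weight in that sample.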
2. 200 sets of data
Just change the 20 in p.head(20) to 200 to get the following output:
Linear regression equation: y=3.4317x-105.959
Coefficient of determination: R^2=0.31
3. 2000 sets of data
Similarly, change the argument of p.head() to 2000; the result is as follows:
Linear regression equation: y=2.9555x-73.6608
Coefficient of determination: R^2=0.2483
4. 20000 sets of data
The results are as follows:
Regression line equation: y = 3.071x + (-81.691)
The correlation regression coefficient is 0.2513
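Rather than editing the row count by hand for each run, the same calculation can be wrapped in a function and looped over the sample sizes. This is a minimal sketch under stated assumptions: fit_line is a hypothetical helper name, and the height and weight arrays are synthetic stand-ins generated with numpy, so the printed numbers will not match the results above.

```python
import numpy as np

def fit_line(x, y):
    """Return (a, b, r2) for y = a*x + b using closed-form least squares."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    sxy = (dx * dy).sum()   # same role as zi in the original code
    sxx = (dx * dx).sum()   # same role as mu
    syy = (dy * dy).sum()   # same role as n
    a = sxy / sxx
    b = y.mean() - a * x.mean()
    r2 = sxy ** 2 / (sxx * syy)
    return a, b, r2

# Synthetic stand-in for the spreadsheet (assumption: not the real data).
rng = np.random.default_rng(0)
height = rng.normal(68.0, 2.0, size=20000)
weight = 3.0 * height - 80.0 + rng.normal(0.0, 10.0, size=20000)

results = {}
for size in (20, 200, 2000, 20000):
    a, b, r2 = fit_line(height[:size], weight[:size])
    results[size] = (a, b, r2)
    print(f"n={size}: y = {a:.4f}x + ({b:.4f}), R^2 = {r2:.4f}")
```

With the real file, height and weight would instead come from the "Height" and "Weight" columns of the DataFrame returned by pd.read_excel.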
3 Python programming, with sklearn
3.1 Introduction to sklearn
sklearn (full name scikit-learn) is a Python package for machine learning, covering both supervised and unsupervised learning. It is built on top of NumPy, SciPy and Matplotlib, accepts Pandas data as input, and also integrates data preprocessing and feature selection. sklearn has six task modules and a data import module:
- Classification tasks for supervised learning
- Regression tasks for supervised learning
- Clustering tasks for unsupervised learning
- Dimensionality reduction tasks for unsupervised learning
- Data preprocessing tasks
- Model selection tasks
- Data import
The specific process is as follows:
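One common way to chain these modules is: import data, split it, preprocess, fit a model, and evaluate. The sketch below illustrates that order on synthetic data. It is an assumption about a typical workflow, not necessarily the exact process the author had in mind, and the generated X and y are stand-ins for the real height and weight columns.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 1. Data import: synthetic stand-in data (assumption, not the real data set).
rng = np.random.default_rng(0)
X = rng.normal(68.0, 2.0, size=(1000, 1))                 # one feature: height
y = 3.0 * X[:, 0] - 80.0 + rng.normal(0.0, 10.0, 1000)    # target: weight

# 2. Model selection utilities: hold out 25% of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 3. Preprocessing and regression chained into a single estimator.
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)

# 4. Evaluation on the held-out rows.
r2 = r2_score(y_test, model.predict(X_test))
print(f"held-out R^2 = {r2:.4f}")
```

The rest of this article skips the split and scaling steps and fits LinearRegression on all selected rows, which is enough for comparing against the Excel output.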
3.2 sklearn installation
sklearn depends on numpy and scipy. With a recent pip these are installed automatically as dependencies; otherwise install them first. In a notebook cell:
! pip install scikit-learn
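A quick way to confirm that the installation worked is to import the packages and print their versions. This is a small sanity-check sketch; the versions dictionary is just an illustrative name.

```python
import numpy
import scipy
import sklearn

# Each package exposes its version string as __version__.
versions = {
    "numpy": numpy.__version__,
    "scipy": scipy.__version__,
    "scikit-learn": sklearn.__version__,
}
for name, version in versions.items():
    print(f"{name}: {version}")
```

If any of the imports raises ModuleNotFoundError, that package is missing from the active environment.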
3.3 Code
20 sets of data
# Import the required modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
# Reading .xls files with pandas requires the xlrd package
p = pd.read_excel('weights_heights(身高-体重数据集).xls', sheet_name='weights_heights')
# Number of rows to read
p1 = p.head(20)
x = p1["Height"]
y = p1["Weight"]
# Data preparation
# sklearn generally expects 2-D arrays for fitting, so reshape the 1-D columns to 2-D.
y = np.array(y).reshape(-1, 1)
x = np.array(x).reshape(-1, 1)
# Fit
reg = LinearRegression()
reg.fit(x, y)
a = reg.coef_[0][0]  # slope
b = reg.intercept_[0]  # intercept
print('The fitting equation is: Y = %.4fX + (%.4f)' % (a, b))
c = reg.score(x, y)  # coefficient of determination R^2
print('The correlation regression coefficient is %.4f' % c)
# Visualization
prediction = reg.predict(x)  # predicted weight for each height on the fitted line
plt.xlabel('Height')
plt.ylabel('Weight')
plt.scatter(x, y)
plt.plot(x, prediction, c='r')
plt.show()
Output result:
Linear regression equation: y=4.128x-152.2338
Coefficient of determination: R^2=0.3254
200 sets of data, after changing the code to read 200 rows:
The fitting equation is: Y = 3.4317X + (-105.9590)
The correlation regression coefficient is 0.3100
2000 sets of data
After modifying the code, the output is as follows
The fitting equation is: Y = 2.9555X + (-73.6608)
The correlation regression coefficient is 0.2483
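The sklearn figures match the hand-computed ones from section 2 because both solve the same least-squares problem, and reg.score returns R^2, which for a single predictor equals the squared Pearson correlation. A minimal sketch that checks this agreement against numpy.polyfit and numpy.corrcoef, using synthetic data (the generated x and y are assumptions, not the real data set):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (assumption: not the real height/weight columns).
rng = np.random.default_rng(42)
x = rng.normal(68.0, 2.0, size=200)
y = 3.0 * x - 80.0 + rng.normal(0.0, 10.0, size=200)

# sklearn needs a 2-D feature array; the target may stay 1-D.
X = x.reshape(-1, 1)
reg = LinearRegression().fit(X, y)
slope_sk = reg.coef_[0]
intercept_sk = reg.intercept_
r2_sk = reg.score(X, y)

# Independent checks: degree-1 polynomial fit and Pearson correlation.
slope_np, intercept_np = np.polyfit(x, y, deg=1)
r = np.corrcoef(x, y)[0, 1]

print(f"sklearn: y = {slope_sk:.4f}x + ({intercept_sk:.4f}), R^2 = {r2_sk:.4f}")
print(f"numpy:   y = {slope_np:.4f}x + ({intercept_np:.4f}), r^2 = {r ** 2:.4f}")
```

Note that with a 1-D target, coef_ is a 1-D array and intercept_ is a scalar, whereas the article's code reshapes y to 2-D and therefore indexes coef_[0][0] and intercept_[0].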
Summary
In this article, linear regression is performed with Excel and with Python. A comparison shows that both approaches solve the linear regression problem and give roughly the same results. In Excel, you only need to select the data to get the result, which is very simple; the functions provided by the sklearn library are also very convenient to use.
References:
https://blog.csdn.net/weixin_56102526/article/details/120495151
https://blog.csdn.net/weixin_46129506/article/details/120468232
https://blog.csdn.net/weixin_44838881/article/details/124836755