[Artificial Intelligence and Machine Learning] Introduction to Linear Regression

Foreword

Environment

  • Microsoft Excel
  • Python

Data set
weights_heights(身高-体重数据集).xls (height-weight data set)

1 Linear regression in Excel

1.1 Add the data analysis tools

Before starting, check whether the data analysis tools are enabled in Excel. When they are, a Data Analysis button appears on the Data tab.

If it is missing, select "Insert" -> "My Add-ins" -> "Manage Other Add-ins" in the menu bar, check "Analysis ToolPak", "Analysis ToolPak - VBA", and "Solver Add-in", and confirm.

1.2 Linear regression analysis in Excel

Open the weights_heights(身高-体重数据集).xls file mentioned above, then select Data -> Data Analysis -> Regression -> OK.

  1. 20 sets of measurement data
    Use the weight as the Y values and the height as the X values, select 20 rows of data, and choose the output range. Here I output to a new sheet, check "Line Fit Plots", and click OK. Excel writes the regression results and the line fit plot to the new sheet. A trend line can then be added to the plot.

  2. 200 sets of measurement data
    In the same way as above, select 200 rows of data and run the regression again.

  3. 2000 sets of measurement data
    Select 2000 rows and repeat the same steps.
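
The slope, intercept, and R Square that Excel reports can be cross-checked outside Excel. The sketch below is only an illustration under stated assumptions: it uses synthetic height/weight numbers (not the data set from this post) and NumPy's np.polyfit, which fits the same ordinary least-squares line as a linear trend line.

```python
import numpy as np

# Synthetic stand-in data (hypothetical values, NOT the weights_heights file).
rng = np.random.default_rng(0)
height = rng.normal(68.0, 2.0, size=20)
weight = 3.0 * height - 80.0 + rng.normal(0.0, 10.0, size=20)

# A degree-1 polynomial fit is the ordinary least-squares line.
slope, intercept = np.polyfit(height, weight, deg=1)

# With a single predictor, R Square is the squared Pearson correlation.
r_square = np.corrcoef(height, weight)[0, 1] ** 2

print(f"y = {slope:.4f}x + ({intercept:.4f}), R^2 = {r_square:.4f}")
```

To compare against a real sheet, the two synthetic arrays would be replaced by the same rows that were selected in Excel.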

2 Python programming, without the help of third-party libraries

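2.1 Without using third-party libraries

The fit itself is computed directly from the least-squares formulas, without a fitting library. pandas, NumPy, and matplotlib are only used to load the data, do the arithmetic, and plot.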
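
For reference, the code in this section follows the closed-form least-squares estimates for a line y = ax + b. Written with the same variable names the code uses (zi, mu, n), and with x̄ and ȳ as the column means, they are:

```latex
\begin{aligned}
\mathrm{zi} &= \sum_{i}(x_i-\bar{x})(y_i-\bar{y}), \qquad
\mathrm{mu} = \sum_{i}(x_i-\bar{x})^2, \qquad
\mathrm{n} = \sum_{i}(y_i-\bar{y})^2,\\
a &= \frac{\mathrm{zi}}{\mathrm{mu}}, \qquad
b = \bar{y} - a\,\bar{x}, \qquad
R^2 = \left(\frac{\mathrm{zi}}{\sqrt{\mathrm{mu}\cdot\mathrm{n}}}\right)^2 .
\end{aligned}
```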

  1. 20 sets of data
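import math

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Load the data
p = pd.read_excel('weights_heights(身高-体重数据集).xls', sheet_name='weights_heights')
# Take the first 20 rows
p1 = p.head(20)
x = p1["Height"]
y = p1["Weight"]
# Means
x_mean = np.mean(x)
y_mean = np.mean(y)
# Number of values in the x (or y) column, i.e. the sample size
xsize = x.size
zi = ((x - x_mean) * (y - y_mean)).sum()   # sum of cross deviations
mu = ((x - x_mean) * (x - x_mean)).sum()   # sum of squared x deviations
n = ((y - y_mean) * (y - y_mean)).sum()    # sum of squared y deviations
# Slope a and intercept b
a = zi / mu
b = y_mean - a * x_mean
# Square of the correlation coefficient R
m = (zi / math.sqrt(mu * n)) ** 2
# Round the parameters to 4 decimal places
a = np.around(a, decimals=4)
b = np.around(b, decimals=4)
m = np.around(m, decimals=4)
print(f'Regression line equation: y = {a}x + ({b})')
print(f'Coefficient of determination: R^2 = {m}')
# Plot the data and the fitted line with matplotlib
y1 = a * x + b
plt.scatter(x, y)
plt.plot(x, y1, c='r')
plt.show()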

The result is as follows:

Linear regression equation: y=4.128x-152.2338
Coefficient of determination: R^2=0.3254

  2. 200 sets of data
Change the 20 in p.head(20) to 200 to get the following output:

Linear regression equation: y=3.4317x-105.959
Coefficient of determination: R^2=0.31

  3. 2000 sets of data
Similarly, change the 20 in p.head(20) to 2000. The result is as follows:

Linear regression equation: y=2.9555x-73.6608
Coefficient of determination: R^2=0.2483

  4. 20000 sets of data
The results are as follows:

Regression line equation: y = 3.071x + (-81.691)
Coefficient of determination: R^2=0.2513
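
Instead of editing the row count by hand for each run, the four sample sizes could be handled in one loop. The following is a minimal sketch, not the code used for the results above: fit_line is a hypothetical helper name, and the arrays are synthetic stand-ins for the Height and Weight columns, so the printed numbers will differ from the ones in this post.

```python
import numpy as np


def fit_line(x, y):
    """Closed-form least-squares fit; returns slope a, intercept b and R^2."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    sxy = (dx * dy).sum()   # plays the role of zi
    sxx = (dx * dx).sum()   # plays the role of mu
    syy = (dy * dy).sum()   # plays the role of n
    a = sxy / sxx
    b = y.mean() - a * x.mean()
    r2 = sxy ** 2 / (sxx * syy)
    return a, b, r2


# Synthetic stand-in for the Height and Weight columns (hypothetical values).
rng = np.random.default_rng(42)
height = rng.normal(68.0, 2.0, size=20000)
weight = 3.0 * height - 80.0 + rng.normal(0.0, 10.0, size=20000)

results = {}
for size in (20, 200, 2000, 20000):
    # Slicing the first `size` values mirrors p.head(size) on the DataFrame.
    results[size] = fit_line(height[:size], weight[:size])
    a, b, r2 = results[size]
    print(f"n={size:>5}: y = {a:.4f}x + ({b:.4f}), R^2 = {r2:.4f}")
```

With the real data, height and weight would be replaced by p["Height"] and p["Weight"].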

3 Python programming, with sklearn

3.1 Introduction to sklearn

sklearn (full name scikit-learn) is a Python machine learning package that covers both supervised and unsupervised learning. It is built on top of NumPy and SciPy, works alongside pandas and Matplotlib, and also bundles data preprocessing and feature selection. sklearn has six task modules and a data import module:

  • Classification tasks (supervised learning)
  • Regression tasks (supervised learning)
  • Clustering tasks (unsupervised learning)
  • Dimensionality reduction tasks (unsupervised learning)
  • Data preprocessing tasks
  • Model selection tasks
  • Data import

The specific process is as follows:

[Figure: sklearn process diagram]
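
To make the module list above concrete, here is a small sketch that touches four of the modules in one pass: data import, model selection, preprocessing, and regression. It is an illustration on generated data, not part of the height-weight analysis, and the parameter values are arbitrary choices.

```python
from sklearn.datasets import make_regression          # data import / generation
from sklearn.model_selection import train_test_split  # model selection
from sklearn.preprocessing import StandardScaler      # data preprocessing
from sklearn.linear_model import LinearRegression     # supervised regression

# Synthetic one-feature regression problem (illustrative data only).
X, y = make_regression(n_samples=200, n_features=1, noise=10.0, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler().fit(X_train)
model = LinearRegression().fit(scaler.transform(X_train), y_train)

test_r2 = model.score(scaler.transform(X_test), y_test)
print(f"R^2 on held-out data: {test_r2:.4f}")
```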

3.2 sklearn installation

sklearn requires NumPy and SciPy. Installing with pip pulls them in automatically if they are missing (the leading ! runs the command from a Jupyter notebook cell):

! pip install scikit-learn
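
A quick way to confirm the installation, assuming the command above finished without errors, is to import the package and print its version:

```python
import sklearn

# The version string confirms that the package can be imported.
version = sklearn.__version__
print("scikit-learn version:", version)
```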

3.3 Code

20 sets of data

# Import the required modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

p = pd.read_excel('weights_heights(身高-体重数据集).xls', sheet_name='weights_heights')
# Number of rows to read
p1 = p.head(20)
x = p1["Height"]
y = p1["Weight"]
# Data preparation
# sklearn generally expects 2-D arrays for fitting, so convert 1-D to 2-D.
y = np.array(y).reshape(-1, 1)
x = np.array(x).reshape(-1, 1)
# Fit
reg = LinearRegression()
reg.fit(x, y)
a = reg.coef_[0][0]     # slope
b = reg.intercept_[0]   # intercept
print('The fitting equation is: Y = %.4fX + (%.4f)' % (a, b))
c = reg.score(x, y)     # coefficient of determination R^2
print('The coefficient of determination R^2 is %.4f' % c)

# Visualization
prediction = reg.predict(x)   # predict weight from height with the fitted line
plt.xlabel('Height')
plt.ylabel('Weight')
plt.scatter(x, y)
plt.plot(x, prediction, c='r')
plt.show()

Output result:

Linear regression equation: y=4.128x-152.2338
Coefficient of determination: R^2=0.3254
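
The reshape(-1, 1) step in the code above is what decides the shapes of coef_ and intercept_, which is why the code indexes them as coef_[0][0] and intercept_[0]. The sketch below uses hypothetical toy values that lie exactly on y = 2x + 1 to show both cases.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy values lying exactly on y = 2x + 1.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

X = x.reshape(-1, 1)   # shape (4,) -> (4, 1): one column, one row per sample
print(x.shape, X.shape)

# y reshaped to 2-D, as in the code above: coef_ is 2-D, intercept_ is 1-D.
reg_2d = LinearRegression().fit(X, y.reshape(-1, 1))
print(reg_2d.coef_.shape, reg_2d.intercept_.shape)

# y left 1-D: coef_ is 1-D and intercept_ is a scalar.
reg_1d = LinearRegression().fit(X, y)
print(reg_1d.coef_.shape, np.ndim(reg_1d.intercept_))
```

Only the feature array has to be 2-D. Leaving y one-dimensional is also accepted and simplifies the indexing to coef_[0] and intercept_.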

200 sets of data, with the code modified to read 200 rows

The fitting equation is: Y = 3.4317X + (-105.9590)
The coefficient of determination R^2 is 0.3100

2000 sets of data
After modifying the code to read 2000 rows, the output is as follows

The fitting equation is: Y = 2.9555X + (-73.6608)
The coefficient of determination R^2 is 0.2483

Summary

In this article, linear regression is performed with Excel and with Python. Comparing them shows that they solve the same linear regression problem and give roughly the same results. In Excel you only need to select the data to get the result, which is very simple. The functions provided by the sklearn library are also very convenient to use.
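
The agreement between the two Python approaches can also be checked numerically rather than by eye. The sketch below does this on synthetic data (hypothetical values, not the data set from this post), comparing the closed-form formulas from section 2 with the sklearn fit from section 3.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (hypothetical values, not the post's data set).
rng = np.random.default_rng(1)
x = rng.normal(68.0, 2.0, size=200)
y = 3.0 * x - 80.0 + rng.normal(0.0, 10.0, size=200)

# Closed-form least squares, as in section 2.
dx, dy = x - x.mean(), y - y.mean()
a_manual = (dx * dy).sum() / (dx * dx).sum()
b_manual = y.mean() - a_manual * x.mean()
r2_manual = (dx * dy).sum() ** 2 / ((dx * dx).sum() * (dy * dy).sum())

# sklearn fit, as in section 3 (y is left 1-D here).
X = x.reshape(-1, 1)
reg = LinearRegression().fit(X, y)
a_sk, b_sk, r2_sk = reg.coef_[0], reg.intercept_, reg.score(X, y)

same = np.allclose([a_manual, b_manual, r2_manual], [a_sk, b_sk, r2_sk])
print("manual and sklearn agree:", same)
```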



Origin blog.csdn.net/apple_52030329/article/details/129547942