Jupyter - Simple Linear Regression Analysis

We introduced what linear regression is in a previous post: Regression Analysis.

This post looks at linear regression from a programming perspective. There are two main approaches: using the sklearn library, and writing the algorithm by hand without sklearn.

Linear regression analysis with the sklearn library

First, we read the local data:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Read the data from the Excel file
data = pd.read_excel(r"D:\code-file\conda\data\weights_heights.xls")
data.shape

Here, the output (25000, 3) indicates that there are 25,000 rows in total, each with three values, corresponding to the three columns of the Excel sheet.
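If we want a quick sanity check on what was read in, pandas has a few one-liners (an optional check I have added; only the 'Height' and 'Weight' columns are used below, and the name of the third column is whatever the sheet contains):

data.head()        # first five rows of the table
data.columns       # the three column names; 'Height' and 'Weight' are used below
data.describe()    # summary statistics for each numeric column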

# Take the first 20 rows
y = np.array(data[['Weight']].values[:20, :])
x = np.array(data[['Height']].values[:20, :])
# Call the linear regression function
# (normalize is deprecated; see the warning discussed below)
model = LinearRegression(fit_intercept=True, normalize=True)
model.fit(x,y)

Here we call the LinearRegression() function from sklearn's linear_model to run the regression analysis. Two of its parameters are set here:

  • fit_intercept: whether to fit an intercept term; if it is False, the fitted line is forced through the origin (see the short sketch after this list)

  • normalize: whether to normalize the data before fitting
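To see what fit_intercept does, here is a minimal sketch with small made-up numbers (my own example, not the height/weight data):

import numpy as np
from sklearn.linear_model import LinearRegression

x_demo = np.array([[1.0], [2.0], [3.0]])
y_demo = np.array([[3.0], [5.0], [7.0]])          # exactly y = 2x + 1

with_icpt = LinearRegression(fit_intercept=True).fit(x_demo, y_demo)
no_icpt = LinearRegression(fit_intercept=False).fit(x_demo, y_demo)

print(with_icpt.coef_, with_icpt.intercept_)      # about 2.0 and 1.0
print(no_icpt.coef_, no_icpt.intercept_)          # intercept_ is forced to 0.0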

If we use the normalize parameter, sklearn issues a deprecation warning. It roughly says that 'normalize' was deprecated in version 1.0 and will be removed in version 1.2; to keep the same behaviour, we should replace it as follows:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# The replacement suggested by the warning:
model = make_pipeline(StandardScaler(with_mean=False), LinearRegression())
model.fit(x, y)

# For this simple single-feature example we can also just drop normalize and
# refit a plain LinearRegression, which is what the complete code below does:
model = LinearRegression(fit_intercept=True)
model.fit(x, y)

# Print the slope and intercept
print("slope:", model.coef_)
print("intercept:", model.intercept_)
b = model.coef_
a = model.intercept_
y1 = b * x + a        # points on the fitted line (renamed so the data in y is not overwritten)
print("The linear regression equation is: y =", float(b), "* x +", float(a))

Finally, we output the linear regression equation and compute $R^2$.
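The $R^2$ value comes from model.score, which for a regressor returns the coefficient of determination (this line also appears in the complete code below):

R = model.score(x, y)          # coefficient of determination R^2
print("R-squared: %.4f" % R)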

# Plot the figure
import matplotlib.pyplot as plt
prediction = model.predict(x)   # predictions must be made from the heights in x
plt.xlabel('身高')   # height
plt.ylabel('体重')   # weight
# Scatter plot of the original data
plt.scatter(x, y)
# Plot the regression line
y1 = b * x + a
plt.plot(x, y1, c='r')

Here we use the matplotlib library for plotting; the figure shows the original scatter of the data together with the trend line given by the regression equation we calculated.

If the Chinese characters in the figure show up garbled, just add the following statement:

plt.rcParams['font.sans-serif']=['Simhei']
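A setting that is commonly paired with it (my addition, not in the original post) keeps minus signs from rendering as boxes once a Chinese font is active:

plt.rcParams['font.sans-serif'] = ['Simhei']     # use a Chinese font for the labels
plt.rcParams['axes.unicode_minus'] = False       # render minus signs correctly with that font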

The complete code is as follows:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from numpy import array
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# Read the data from the Excel file (sheet 'weights_heights')
data = pd.read_excel(r"D:\code-file\conda\data\weights_heights.xls", 'weights_heights')
data.shape
# Take the first 20 rows
y = array(data[['Weight']].values[:20, :])
x = array(data[['Height']].values[:20, :])
# Call the linear regression function
# (deprecated form: model = LinearRegression(fit_intercept=True, normalize=True))
# (pipeline alternative: model = make_pipeline(StandardScaler(with_mean=False), LinearRegression()))
model = LinearRegression(fit_intercept=True)
model.fit(x, y)
# Print the slope and intercept
print("slope:", model.coef_)
print("intercept:", model.intercept_)
b = model.coef_
a = model.intercept_
print("The linear regression equation is: y =", float(b), "* x +", float(a))
R = model.score(x, y)
print("R-squared: %.4f" % R)
# Plot the figure
prediction = model.predict(x)        # predict from the heights in x
plt.rcParams['font.sans-serif'] = ['Simhei']   # so the Chinese axis labels display correctly
plt.xlabel('身高')   # height
plt.ylabel('体重')   # weight
# Scatter plot of the original data
plt.scatter(x, y)
# Plot the regression line
y1 = b * x + a
plt.plot(x, y1, c='r')

If we want to analyze a different number of rows, we can simply change the slice: replace [:20, :] with [:n, :] to use the first n rows.
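For example, the whole fit can be wrapped in a small helper so the number of rows becomes a parameter (a sketch of my own; the function name fit_first_n is made up for illustration):

def fit_first_n(data, n):
    # Fit a simple linear regression on the first n rows of the height/weight table.
    x = data[['Height']].values[:n, :]
    y = data[['Weight']].values[:n, :]
    return LinearRegression(fit_intercept=True).fit(x, y)

model_50 = fit_first_n(data, 50)
print(model_50.coef_, model_50.intercept_)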

Linear Regression Analysis with a Handwritten Algorithm

First, import the required packages and read the data from the Excel file:

# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math
# Read the data
data = pd.read_excel(r"D:\code-file\conda\data\weights_heights.xls", 'weights_heights')
# First 20 rows; change the parameter if more data is needed
d = data.head(20)
d.shape
x = d["Height"]
y = d["Weight"]

We first draw a scatter plot to confirm that the data was read correctly:

# Draw the scatter plot
plt.scatter(x,y)
plt.axis([65,72,0,180])
plt.show()

The data has been loaded successfully. Next we compute the regression equation and the correlation coefficient directly from the standard formulas:

\[b = \frac{\sum^n_{i = 1}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i = 1}^n(x_i - \bar{x})^2} \]
\[a = \bar{y} - b \bar{x} \]
\[r = \frac{\sum^n_{i = 1}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i = 1}^n(x_i - \bar{x})^2}\sqrt{\sum_{i = 1}^n(y_i - \bar{y})^2}} \]
# Compute b (slope), a (intercept), and r from the formulas above
x_mean = np.mean(x)
y_mean = np.mean(y)
num = 0.0       # numerator
den = 0.0       # denominator
m = 0.0
for x_i, y_i in zip(x, y):
    num += (x_i - x_mean) * (y_i - y_mean)
    den += (x_i - x_mean) ** 2
    m += (y_i - y_mean) ** 2
b = num / den                                    # slope
a = y_mean - b * x_mean                          # intercept
y1 = b * x + a                                   # points on the fitted line
r = (num / ((den ** 0.5) * (m ** 0.5))) ** 2     # squared correlation coefficient, i.e. R^2
print("The regression equation is: y =", b, "x +", a)
print("R-squared:", r)

Finally, draw the scatter plot together with the trend line:

# Final scatter plot with the trend line
plt.rcParams['font.sans-serif'] = ['Simhei']   # so the Chinese axis labels display correctly
plt.xlabel('身高')   # height
plt.ylabel('体重')   # weight
plt.scatter(x, y)
plt.plot(x, y1, color='r')
plt.axis([65, 72, 0, 180])
plt.show()

