Python implementation of the basic principles of univariate linear regression

1. I studied Andrew Ng's machine learning course on linear regression; this article works through the course's linear regression exercise.
2. The code follows a Python version of linear regression that someone shared on the Internet; this article explains it together with that code.
3. The relevant attachments have been uploaded to my personal download section.

1 Environment preparation

  1. First, create a virtual environment:
# Create the virtual environment
conda create -n exec_py36 pip python=3.6
  2. The steps below make the newly created virtual environment available inside Jupyter Notebook:
# 1. Open the Anaconda prompt and activate the virtual environment
conda activate exec_py36

# 2. Install ipykernel, which is used to manage Jupyter kernels
pip install ipykernel -i https://pypi.douban.com/simple  # using the Douban mirror

# 3. Register the virtual environment with Jupyter Notebook
python -m ipykernel install --user --name exec_py36 --display-name "Python [conda env:exec_py36]"


  3. Reopen Jupyter Notebook and you will see the virtual environment; just click to switch to it.


  4. After switching, you can see that the kernel has changed.


2 Changing the Jupyter working directory

  1. First create a new ipynb file and enter the following code in it to view the current default working directory:
import os
print(os.path.abspath('.'))
# Output:
# C:\Users\yan
  2. You can see that the default location is under my username, so let's change the default location to make file management easier:
# Open the Anaconda prompt and enter the following command
jupyter notebook --generate-config
# It produces the following output:
# Writing default config to: C:\Users\yan\.jupyter\jupyter_notebook_config.py
  3. Open the configuration file above, find the line # c.NotebookApp.notebook_dir = '', delete the leading comment marker, fill in the path of your newly created folder between the single quotes, and save the file.
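For example, after editing, the line might look like the following (the folder path here is a hypothetical one; substitute the path you created):
# In jupyter_notebook_config.py; the path below is just an example
c.NotebookApp.notebook_dir = 'D:\\jupyter_projects'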


  1. Find the "Jupyte Notebook" shortcut key in the start menu, right-click – More – open the file location


  5. Then find the corresponding "Jupyter Notebook" shortcut icon, right-click – Properties – Target, remove the trailing "%USERPROFILE%/", click "Apply" and "OK", and finally restart Jupyter Notebook.
  6. Now create a new ipynb file, write a bit of code, and save it; go to the working directory you created earlier and you will see the newly created file there.


  7. This makes files easier to manage, and it is also convenient to place data files in this directory so they can be loaded directly from Jupyter Notebook.


3 Univariate linear regression

Dataset introduction: the first column is the population of each city, and the second column is the profit of a food truck in each city.

3.1 Reading the data

  1. Install pandas, numpy, matplotlib, and seaborn (a visualization library built on top of matplotlib) in the virtual environment:
# I installed these in the Anaconda prompt
conda install pandas
conda install matplotlib
conda install seaborn

When installing pandas, numpy is installed as a dependency, so there is no need to install numpy separately (in fact, installing matplotlib also pulls in pandas and numpy automatically):


  2. Read the data:
import pandas as pd
df = pd.read_csv('ex1data1.txt', names=['population', 'profit'])  # read the data and assign column names
df.head()  # look at the first five rows
df.info()  # view summary information about the data


  3. Visualize the data:
import seaborn as sns
sns.set(context="notebook", style="whitegrid", palette="dark")  # basic plotting configuration
import matplotlib.pyplot as plt
# Since the data has only two columns, a scatter plot is a quick way to see what it looks like
sns.lmplot(x='population', y='profit', data=df, height=6, fit_reg=False)
plt.show()
# The points fall roughly along a straight line, so linear regression is a reasonable model to fit


3.2 Feature construction

  1. The hypothesis function $h_\theta(x)$ of multivariate linear regression is shown in formula (1):

$$h_\theta(x)=\theta_0+\theta_1x_1+\theta_2x_2+...+\theta_nx_n \tag{1}$$

  2. To allow vectorization, introduce $x_0=1$, so that $h_\theta(x)$ becomes formula (2):

$$h_\theta(x)=\theta_0x_0+\theta_1x_1+\theta_2x_2+...+\theta_nx_n \tag{2}$$

  3. The parameter vector $\theta$ then has dimension $\theta \in R^{n+1}$, and the feature vector $x$ of any training instance also has dimension $x \in R^{n+1}$, so the vectorized expression of $h_\theta(x)$ is formula (3):

$$h_\theta(x)=\theta^TX \tag{3}$$

  4. Based on the description above, actually construct $x_0=1$ in the data set we read in:
# Read the features
import numpy as np

def get_X(df):
    """
    use concat to add the intercept term and avoid side effects
    on the original dataframe; not efficient for big datasets though
    """
    ones = pd.DataFrame({'ones': np.ones(len(df))})  # ones is an m-row, 1-column dataframe
    data = pd.concat([ones, df], axis=1)  # merge the data column-wise
    return data.iloc[:, :-1].values  # take all the feature columns and return an ndarray, not a matrix
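As an aside, the same intercept column can also be added in place with pandas' DataFrame.insert; the sketch below is an alternative I am adding for comparison, not part of the original code, and note that it mutates df, which is the side effect the concat version deliberately avoids:
# Hypothetical alternative: add the intercept column in place.
# Unlike the concat version above, this modifies df directly.
def get_X_inplace(df):
    if 'ones' not in df.columns:       # avoid inserting twice on re-runs
        df.insert(0, 'ones', 1)        # new first column of 1s
    return df.iloc[:, :-1].values      # all columns except the label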

3.3 Other preparations

  1. We also need a function to get the label (that is, the last column, the regression target):
# Read the label
def get_y(df):
    """
    assume the last column is the target
    """
    return np.array(df.iloc[:, -1])  # df.iloc[:, -1] is the last column of df

3.4 Linear regression main body

  1. Use the functions defined above to obtain the features and labels:
X = get_X(df)
print(X.shape, type(X))  # check the data dimensions

y = get_y(df)
print(y.shape, type(y))
# Output:
# (97, 2) <class 'numpy.ndarray'>
# (97,) <class 'numpy.ndarray'>
  2. Construct the parameter vector $\theta$:
# From the linear regression hypothesis function, the dimension of the parameter vector
# is the number of features in the original data set plus one for the intercept term.
# In this univariate example the parameter vector therefore has dimension 1+1=2.
theta = np.zeros(X.shape[1])  # X.shape[1]=2: number of features plus the intercept column
print(theta)
# Output:
# [ 0.  0.]

3.4.1 Computing the cost function

  1. The cost function $J(\theta)$ of univariate linear regression is computed as shown in formula (4):

$$J(\theta)=\frac{1}{2m}\sum\limits_{i=1}^m{({h_\theta(x^{(i)})-y^{(i)}})^2} \tag{4}$$

where $h_\theta(x)=\theta^TX=\theta_0x_0+\theta_1x_1$.
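As a side note (this vectorized rewriting is my own addition for clarity, not from the course notes), formula (4) is equivalent to the fully vectorized form below, which is exactly what the code in the next step computes:

$$J(\theta)=\frac{1}{2m}(X\theta-y)^T(X\theta-y)$$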

  2. Following the cost function formula, we can define the function that computes the linear regression cost:
# Define the cost function
def lr_cost(theta, X: np.ndarray, y: np.ndarray):
    '''
    :param theta: dimension R(n), the linear regression parameters
    :param X: dimension R(m*n), m is the number of samples, n the number of features
    :param y: dimension R(m)
    :return: the scalar cost
    '''
    m = X.shape[0]  # m is the number of samples
    # Compute the residual of each sample: prediction minus label
    inner = X.dot(theta) - y  # X.dot(theta) is equivalent to np.dot(X, theta); inner has dimension R(m*1)
    # Square and sum the residuals. Note:
    # 1*m @ m*1 = 1*1 in matrix multiplication,
    # but numpy does not transpose a 1d array, so this is just a
    # vector inner product of inner with itself
    square_sum = np.dot(inner.T, inner)  # square_sum has dimension R(1*1)
    cost = square_sum / (2 * m)
    return cost
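As a quick sanity check of lr_cost (a toy example of my own, not part of the original exercise), the cost on a two-sample dataset is easy to verify by hand:
# Toy check: with theta = [0, 0] every prediction is 0, so
# J = (1 / (2*2)) * ((0-1)^2 + (0-3)^2) = 10 / 4 = 2.5
X_toy = np.array([[1.0, 2.0], [1.0, 4.0]])  # intercept column plus one feature
y_toy = np.array([1.0, 3.0])
print(lr_cost(np.zeros(2), X_toy, y_toy))  # expected: 2.5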
  3. Then use this function to see the cost of the initial parameters:
lr_cost(theta, X, y)  # what is the cost of the initial parameters?
# Output:
# 32.072733877455669
  4. The dimension changes during the whole computation are shown in formulas (5) and (6):

$$inner_{(m,1)}=X_{(m,n+1)}.dot(\theta_{(n+1,1)})-y_{(m,1)} \tag{5}$$

$$square\_sum_{(1,1)}=(inner.T)_{(1,m)}.dot(inner_{(m,1)}) \tag{6}$$

3.4.2 Gradient descent and fitting

  1. The gradient descent update rule of multivariate linear regression is shown in formula (7):

$$\theta_j=\theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\theta) \tag{7}$$

  2. Working out the partial derivative in the formula above gives the directly computable formula (8):

$$\theta_j=\theta_j-\alpha\frac{1}{m}\sum\limits_{i=1}^m((h_\theta(x^{(i)})-y^{(i)})x^{(i)}_j) \tag{8}$$

  3. First define a function to compute the summation part of formula (8):
# First define a function to compute the summation part of the gradient descent update
def gradient(theta, X, y):
    '''
    :param theta: dimension R(n), the linear regression parameters
    :param X: dimension R(m*n), m is the number of samples, n the number of features
    :param y: dimension R(m)
    :return: dimension R(n+1,1), i.e. the same dimension as the parameter vector theta
    '''
    m = X.shape[0]
    inner = np.dot(X.T, (np.dot(X, theta) - y))
    return inner / m
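Before looking at the dimensions and meaning below, a quick numerical check (my own addition, not part of the original exercise) can confirm the analytic gradient against a central finite-difference approximation of the cost:
# Sanity check (hypothetical helper): approximate dJ/dtheta_j numerically
def numeric_gradient(theta, X, y, eps=1e-6):
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (lr_cost(theta + step, X, y) - lr_cost(theta - step, X, y)) / (2 * eps)
    return grad

print(gradient(theta, X, y))
print(numeric_gradient(theta, X, y))  # the two outputs should match closely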
  • The dimension changes during this computation are shown in formula (9):

$$inner_{(n+1,1)}=(X_{(m,n+1)})^T.dot((X_{(m,n+1)}.dot(\theta_{(n+1,1)})-y_{(m,1)})) \tag{9}$$

  • The actual meaning of the process (that is, why it is valid to write it this way; it takes a little thought to see, but it is not too difficult; see the sketch after this list):
    • First consider the part $h_\theta(x^{(i)})-y^{(i)}$. No matter which element of the $\theta$ vector is being updated, all samples must be involved, and each row of the matrix $X$ is one sample. By the rules of matrix multiplication, each row is multiplied element-wise with the $\theta$ vector and summed, so the product of every sample with the parameters is completed in one matrix multiplication. Subtracting the label vector $y$ then yields a residual vector of dimension m, where each element is the difference between the predicted value and the actual value for one sample.
    • Next, how does updating $\theta_j$ use the corresponding feature $x_j$? After transposing, the first row of $X^T$ holds the first feature of all samples, and so on. Multiplying the first row of $X^T$ with the residual vector sums, over all samples, the product of the first feature and the corresponding residual, and this first feature corresponds to the first element of the parameter vector. Multiplying the whole of $X^T$ by the residual vector therefore yields the so-called "gradient" for every element of the parameter vector at once (the summation is completed automatically by the matrix operation).
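The loop-versus-vectorized equivalence is easy to verify on a tiny example (the numbers below are made up purely for illustration):
# Tiny illustration: X.T.dot(residual) equals the per-feature sums
# computed with an explicit loop over samples
X_demo = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, 5.0]])  # 3 samples, 2 columns
residual = np.array([0.5, -1.0, 2.0])  # made-up residual vector

vectorized = X_demo.T.dot(residual)
looped = np.array([sum(X_demo[i, j] * residual[i] for i in range(3)) for j in range(2)])
print(vectorized, looped)  # both print [ 1.5  8. ]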
  4. Then define the complete gradient descent process and fit the parameters (here the fit simply runs for a fixed number of iterations):
# Batch gradient descent function
def batch_gradient_descent(theta, X, y, epoch, alpha=0.01):
    '''
    :param theta: dimension R(n), the linear regression parameters
    :param X: dimension R(m*n), m is the number of samples, n the number of features
    :param y: dimension R(m)
    :param epoch: number of iterations over the batch
    :param alpha: learning rate, the alpha in the gradient descent update formula
    :return: the fitted parameters and the cost history
    '''
    cost_data = [lr_cost(theta, X, y)]
    _theta = theta.copy()  # work on a copy so the original theta is not modified

    for _ in range(epoch):
        _theta = _theta - alpha * gradient(_theta, X, y)
        cost_data.append(lr_cost(_theta, X, y))

    return _theta, cost_data
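As a design note, fitting for a fixed iteration count is the simplest choice; a variant that stops once the cost improvement falls below a tolerance is sketched below (my own addition, with a made-up tolerance parameter):
# Hypothetical variant: stop early once the cost barely changes
def batch_gradient_descent_tol(theta, X, y, max_epoch=10000, alpha=0.01, tol=1e-9):
    cost_data = [lr_cost(theta, X, y)]
    _theta = theta.copy()
    for _ in range(max_epoch):
        _theta = _theta - alpha * gradient(_theta, X, y)
        cost_data.append(lr_cost(_theta, X, y))
        if cost_data[-2] - cost_data[-1] < tol:  # converged
            break
    return _theta, cost_data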

3.4.3 Calling and fitting the linear regression

  1. Fit the univariate linear regression on the actual sample data set:
epoch = 500
final_theta, cost_data = batch_gradient_descent(theta, X, y, epoch)
print(final_theta)
# Output:
# [-2.28286727  1.03099898]
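As a cross-check (my addition, not part of the original exercise), the normal equation $\theta=(X^TX)^{-1}X^Ty$ gives the exact least-squares solution that gradient descent approaches as the number of epochs grows:
# Closed-form solution for comparison with the gradient descent result
theta_exact = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)
print(theta_exact)  # gradient descent should converge toward this vector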
  2. After fitting, you can observe how the cost changed; it drops rapidly in the early iterations and then gradually stabilizes.
ax = sns.lineplot(x=np.arange(epoch+1), y=cost_data)
ax.set_xlabel('epoch')
ax.set_ylabel('cost')
plt.show()


  3. Use the fitted parameters to draw the fitted line:
# Look at the final fitted line
b = final_theta[0]  # intercept on the y axis
m = final_theta[1]  # slope

plt.scatter(df.population, df.profit, label="Training data")
plt.plot(df.population, df.population*m + b, label="Prediction")
plt.legend(loc=2)
plt.show()
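With the intercept b and slope m in hand, making a prediction is a one-liner; the population value below is made up purely for illustration:
# Hypothetical usage: predict the profit for a city with population 6.5
# (in the same units as the training data)
population = 6.5
predicted_profit = b + m * population
print(predicted_profit)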


Reference articles

Illustrated guide to switching the Python kernel used by Jupyter Notebook on Windows 10 (changing the original Python environment) – Python Researchers – 博客园 (cnblogs.com)

The default storage path of Jupyter Notebook files and how to change it (360doc.com)

Detailed explanation of the dot() function in Python's NumPy library
