Guide to Python Machine Learning from Scratch (2) - OLS Regression in Supervised Learning

Introduction

This blog will introduce the first branch of 监督学习/Supervised Learning (SL) with examples: 回归/Regression.

Preparation before starting

Before starting, please make sure you have the following packages in your Python environment:
pandas, numpy, matplotlib, and sklearn (scikit-learn).

All the code in this article can be run in Jupyter Lab under Anaconda.

Main text

First, let's understand: why is 回归/Regression a kind of supervised learning? What is the nature of this problem?

We first need to understand the nature of the regression problem. Simply put, for a mapping
$$f: \mathbb{R}^n \mapsto \mathbb{R},$$
we have some data drawn from its 定义域/Domain and 对应域/Codomain. We want to use these existing data to find a model/regression function that best fits them.

From another perspective, for each piece of data, the domain of the mapping can be understood as the 特征集/Feature Set, and the codomain as the 标签集/Label Set. What we are looking for is the unknown mapping function, which corresponds to the 黑箱模型/Blackbox Model in machine learning.

But to turn this into a fully specified supervised learning problem, we also need to define what we want to optimize. Since we want the mapping that best fits the data, we can minimize the 误差平方和/Sum of Squared Errors (SSE). For a data set of $m$ records and a corresponding mapping $f$, the sum of squared errors of this mapping on this data set is defined as

$$\ell = \sum_{i=1}^m \Big(y^{(i)} - f\big({\bf x}^{(i)}\big)\Big)^2,$$

that is, the squared difference between the true label of each data point and the label predicted by the mapping, summed over all data points. If we take the sum of squared errors as the 损失函数/Loss Function of this supervised learning problem, then this regression problem is called 普通最小二乘法线性回归/OLS Linear Regression. All we have to do is find the mapping $f$ that minimizes the sum of squared errors.

Let us define this problem more rigorously. Let ${\bf x}$ be the feature vector and ${\bf w}$ the vector of feature weights. Then for the linear mapping
$$f({\bf x}; {\bf w}) = {\bf w}^T{\bf x} = w_0x_0 + w_1x_1 + \dots + w_nx_n,$$
the loss function $\ell$ we want to minimize can be defined as:
$$\ell({\bf w}) = \sum_{i=1}^m \Big(y^{(i)} - f\big({\bf x}^{(i)}\big)\Big)^2 = \sum_{i=1}^m \Big(y^{(i)} - {\bf w}^T{\bf x}^{(i)}\Big)^2 = \sum_{i=1}^m \Big(y^{(i)} - \hat{y}^{(i)}\Big)^2.$$
In other words, the essence of linear regression under supervised learning can be understood as an optimization problem:
$${\bf w}^* = \arg\min_{\bf w}\; \ell({\bf w}).$$

So how can we find the most appropriate linear regression function?
First of all, this problem has a unique, closed-form optimal solution (as long as the feature matrix has full column rank), and in the language of P/NP it belongs to class P. From this point of view alone, linear regression is a relatively simple and easy-to-solve problem in the machine learning world.

The optimal solution to the 普通最小二乘法线性回归问题/OLS Linear Regression problem can be found with nothing more than some linear algebra. The derivation is relatively tedious, and the blogger will not show it here; interested readers can look it up themselves. The conclusion is that the weight vector ${\bf w}^*$ that best fits the data (i.e., minimizes the sum of squared errors) can be obtained by the following matrix computation:
$${\bf w}^* = ({\bf X}^T{\bf X})^{-1}{\bf X}^T{\bf Y},$$
where ${\bf X}$ is the feature matrix (one row per data point and one column per feature) and ${\bf Y}$ is the vector of $m$ labels. For a computer, this kind of matrix computation is trivial.
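To make the closed-form solution concrete, here is a minimal NumPy sketch (independent of the sklearn code later in this article; the data and weights are made up for illustration). A column of ones is appended to the feature matrix so that the intercept $w_0$ is learned along with the other weights:

import numpy as np

rng = np.random.default_rng(0)

# A tiny synthetic data set: 100 rows, 2 features, known weights plus noise
X_demo = rng.normal(size=(100, 2))
true_w = np.array([3.0, 12.5, 7.6])            # [w0 (intercept), w1, w2]
Y_demo = true_w[0] + X_demo @ true_w[1:] + rng.normal(scale=0.5, size=100)

# Append a column of ones so the intercept is learned as w0
X_aug = np.hstack([np.ones((X_demo.shape[0], 1)), X_demo])

# Closed-form OLS solution: w* = (X^T X)^(-1) X^T Y
# (np.linalg.solve is preferred over an explicit inverse for numerical stability)
w_star = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ Y_demo)
print(w_star)   # should be close to [3.0, 12.5, 7.6]

In practice you would use np.linalg.lstsq or sklearn's LinearRegression rather than forming $({\bf X}^T{\bf X})^{-1}$ explicitly, but they arrive at the same solution.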

However, without an iterative learning process, is it still machine learning?
Many students, including the blogger himself, had this question at this point. The model does not seem to go through an iterative, step-by-step process; it jumps from zero to the optimal solution in one step. But strictly speaking, there is still a learning process: the model takes in the feature set and the label set and uses these data to improve its ability to generalize. Although it does not move toward better performance step by step, the computer still learns the characteristics of the data in an effective way. So, in summary, linear regression is also a type of machine learning.

For a data set with an unknown regression function, once we obtain the optimal regression function (i.e., the one that minimizes the sum of squared errors) through the method above, we gain a certain ability to predict new data.

So what are the application scenarios of linear regression?
There are many, such as:

  1. House price prediction. If you have past housing-market data (such as annual average temperature, average daily daylight hours, distance to the nearest convenience store, etc.) together with the yearly prices, and you believe prices have a (multidimensional) linear relationship with these housing characteristics, then you can use OLS to predict property prices (see the sketch after this list).
  2. Stock analysis. Similar to house prices, certain characteristics of a stock (such as the company's net profit, time since listing, number of limit-up days, etc.) are important indicators of its price. Note, however, that because time and some unobtainable information (such as the chance that the owner intends to let the company fail) strongly affect stock prices, a simple linear regression model may not be suitable for a problem as complex as stock prediction.
  3. Customer Lifetime Value (CLV) prediction. The total revenue a customer brings to a business is largely linked to the customer's income level, age, average daily spending, and other information. With customer information as the feature set and the corresponding total revenue as the label set, this information can be used to predict CLV.
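To make the house-price example above concrete, here is a minimal, purely hypothetical sketch: the file name house_prices.csv and its column names are invented for illustration and do not refer to any real data set.

import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical CSV with housing features and prices (column names are made up)
df = pd.read_csv("house_prices.csv")
features = df[["avg_temperature", "daily_daylight_hours", "distance_to_store"]]
prices = df["price"]

model = LinearRegression()
model.fit(features, prices)   # fit the OLS model to the historical data

# Predict the price of a new property from its feature values
new_house = pd.DataFrame([[16.5, 7.2, 0.8]], columns=features.columns)
print(model.predict(new_house))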

Code

After understanding the principle, we can use Python to implement the regression learning described above.

First, let's generate some simple data.

import numpy as np                 # numerical computation
import pandas as pd                # store data as data frames
import matplotlib                  # plotting
import matplotlib.pyplot as plt
import sklearn.datasets as skd     # data generators (needed for make_regression below)
from sklearn import linear_model   # the model we need

# X and Y hold the generated data. X can be understood as the features, Y as the labels.
# Our job is to train a model that accurately predicts the value of Y from X.
# coeff holds the coefficients used by the generating function.
X, Y, coeff = skd.make_regression(n_samples=5000, n_features=2, 
                                  noise=5.0, 
                                  coef=True,
                                  random_state=114514)

print(f"特征集大小为{
      
      X.shape},标签集大小为{
      
      Y.shape}。")
# 特征集大小为(5000,2),标签集大小为(5000,)。

After generating the data, we can check the data content:

from mpl_toolkits.mplot3d import Axes3D   # utilities for 3D plotting

fig = plt.figure()
plot3d = fig.add_subplot(projection='3d')  # create a 3D axes object
plot3d.view_init(elev=15., azim=35)        # viewing angle

x1 = X[:, 0]  # all rows of column 0 of X
x2 = X[:, 1]  # all rows of column 1 of X

# Scatter plot of the data. Note that x1, x2 and Y must have the same number of rows.
plot3d.scatter3D(x1, x2, Y, c=Y, cmap='Greens', label='Noised Samples');

# Plot the true reference plane
x1min, x1max = int(np.floor(min(x1))), int(np.ceil(max(x1)))
x2min, x2max = int(np.floor(min(x2))), int(np.ceil(max(x2)))
xx1, xx2 = np.meshgrid(range(x1min, x1max), range(x2min, x2max))
x12_coord = np.array([xx1, xx2])
y = coeff[0] * x12_coord[0] + coeff[1] * x12_coord[1]
surf = plot3d.plot_surface(xx1, xx2, y, alpha=0.2, label="True model")

# Workaround so the surface can appear in the legend
surf._facecolors2d = surf._facecolor3d
surf._edgecolors2d = surf._edgecolor3d 
plot3d.legend()

# Axis labels
plot3d.set_xlabel('$x_1$')
plot3d.set_ylabel('$x_2$')
plot3d.set_zlabel('$y$')

plt.show()

We get the following graph:
[Figure: 3D scatter of the noised samples together with the true model plane]
The next step is to train the linear regression model with sklearn. The code is very simple, only two lines.

# Create the linear regression object
OLS = linear_model.LinearRegression(fit_intercept=True)

# Train the model
OLS.fit(X, Y)

We can manually compare the difference between the real model and the trained model:

print('True model parameters:')
print('\tw0: {:.2f}, w1: {:.2f}, w2: {:.2f}'.format(0, *coeff))
print('Trained model parameters:')
print('\tw0: {:.2f}, w1: {:.2f}, w2: {:.2f}'.format(OLS.intercept_, *OLS.coef_))
'''
True model parameters:
	w0: 0.00, w1: 12.56, w2: 7.64
Trained model parameters:
	w0: 0.00, w1: 12.58, w2: 7.77
'''

A more scientific and comprehensive approach is to use 交叉验证/Cross-validation. The idea is: each time, the data is randomly split into a training set and a validation set, the model is fit on the training set, and its error is evaluated on both. Statistically speaking, this reduces the randomness of a single random split and gives a more comprehensive assessment of model performance. The code is as follows:

from sklearn.model_selection import cross_validate

# cv is the number of folds. For example, cv=10 splits the data into ten parts,
# using one part as the validation set and the other nine as the training set each time.
cv_results = cross_validate(OLS, X, Y, cv=10, scoring="r2",
                            return_train_score=True)

print('Mean test score: {:.3f} (std: {:.3f})'
      '\nMean train score: {:.3f} (std: {:.3f})'.format(
                                              np.mean(cv_results['test_score']),
                                              np.std(cv_results['test_score']),
                                              np.mean(cv_results['train_score']),
                                              np.std(cv_results['train_score'])))
'''
Mean test score: 0.894 (std: 0.009)
Mean train score: 0.895 (std: 0.001)
'''

Our model has a mean validation R² of 0.894 and a mean training R² of 0.895. The two scores are close and both reasonably high, so overall the model is neither overfitting nor underfitting.
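The score reported by scoring="r2" is the coefficient of determination (R²), not a classification accuracy. As a quick sanity check (a sketch, assuming the OLS, X, and Y objects from the earlier cells), the same quantity can be computed on the full data set:

from sklearn.metrics import r2_score

# R² of the fitted model on the full data set; the cross-validated scores above
# are the more trustworthy estimate, because they score data the model was not fit on
print(r2_score(Y, OLS.predict(X)))
print(OLS.score(X, Y))   # equivalent shortcut provided by LinearRegression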

We can also use the trained model to make predictions, such as:

print(OLS.predict([[42, 24]])[0])   # model prediction
print(coeff[0]*42 + coeff[1]*24)    # true value
'''
714.8098105630355
710.6836496552116
'''

It can be observed that even though the optimal solution is guaranteed, the model's predictions are still not 100% accurate. The cause is the 随机噪声/Random Noise in the data set: the data themselves do not lie exactly on the underlying model. Sometimes we can reduce the impact of noise on the algorithm; for example, in image processing we can use a 高斯滤波/Gaussian Filter to reduce the interference of image noise with the learning algorithm.
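As an aside, here is a minimal sketch of that denoising idea (assuming SciPy is installed; the noisy image here is randomly generated purely for illustration):

import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(0)
noisy_image = rng.normal(loc=0.5, scale=0.2, size=(64, 64))  # stand-in for a noisy image

# sigma controls the width of the Gaussian kernel: larger sigma means stronger smoothing
smoothed = gaussian_filter(noisy_image, sigma=1.5)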

Conclusion

In the next blog, the blogger will introduce how to use ML methods to solve simple 分类/Classification problems. If you have any questions or suggestions, please feel free to comment or send a private message. Coding is not easy; if you like the blogger's content, please like and support!
