DLT-03-Multiple Linear Regression

This is the third article in the **Introduction to Deep Learning (deep learning tutorial, DLT)** series, and it mainly introduces multiple linear regression. Readers who want to learn deep learning or machine learning can follow the official account GeodataAnalysis; I will gradually update this series of articles.

The model expression of multiple linear regression is essentially the same as that of single-variable linear regression; the only difference is that there are more input variables. As a result, the hypothesis function and gradient descent differ slightly between the two, and this article explains those differences in detail, building on the previous article on single-variable linear regression. To measure the accuracy of the model precisely, this article also introduces the least squares method, which yields the hypothesis function that minimizes the loss function. In addition, the input variables do not necessarily share the same scale, so this article also covers the role of normalization (making features dimensionless) in machine learning.

1 Test dataset

Follow the official account GeodataAnalysis and reply 20230302 to download the test data. The test data is the Boston housing dataset. Its dependent variable is the house price, and there are thirteen independent variables representing thirteen factors that affect house prices.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('./housing.data', delim_whitespace=True)
df.head()
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 36.2
x = df[df.columns[:-1]].to_numpy().T
y = df[df.columns[-1]].to_numpy()
x.shape, y.shape
((13, 506), (506,))

2 Hypothesis function

In multiple linear regression the input has multiple features, and we use the following notation:

$$
\begin{align}
x_j^{(i)} &= \text{value of feature } j \text{ in the } i^{th} \text{ training example} \newline
x^{(i)} &= \text{the input (features) of the } i^{th} \text{ training example} \newline
m &= \text{the number of training examples} \newline
n &= \text{the number of features}
\end{align}
$$

The multivariate form of the hypothesis function that accommodates these multiple features is as follows:

$$
h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \dots + \theta_n x_n
$$

Using the definition of matrix multiplication, the multivariate hypothesis function can be expressed succinctly as follows, where $x_0 = 1$ (a row of 1s when the samples are stacked) represents the bias term:

$$
h_\theta(x) = \begin{bmatrix} \theta_0 & \theta_1 & \dots & \theta_n \end{bmatrix} \begin{bmatrix} x_0 \newline x_1 \newline \vdots \newline x_n \end{bmatrix} = \theta^T x = x^T \theta
$$

The code representation is as follows:

def hypothesis_fun(x, parameters):
    # parameters has shape (n+1,) and x has shape (n+1, m); returns one prediction per sample
    return parameters @ x
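
As a quick check (a sketch, not part of the original article, that reuses the `x` loaded above; `x_with_bias` and `theta_demo` are illustrative names), the vectorized hypothesis should return one prediction per sample:

# Sketch: shape check for the vectorized hypothesis function.
x_with_bias = np.vstack((np.ones(x.shape[1]), x))      # (14, 506): bias row + 13 features
theta_demo = np.zeros(x_with_bias.shape[0])            # (14,): one parameter per feature + bias
print(hypothesis_fun(x_with_bias, theta_demo).shape)   # (506,): one prediction per sample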

3 Loss function

The loss function of multiple linear regression is the same as that of single-variable linear regression: the squared error function, namely:

$$
J(\theta_0, \theta_1, \dots, \theta_n) = \frac{1}{2m} \sum\limits_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2
$$

The code representation is as follows:

def loss_fun(x, y, parameters):
    # Mean squared error divided by 2 (the 1/2 simplifies the gradient)
    y_predict = hypothesis_fun(x, parameters)
    return np.sum(np.square(y_predict-y))/(2*y.size)
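
A small sanity check (a sketch that reuses `x_with_bias` from above; `theta_zero` is an illustrative all-zero parameter vector): with zero parameters every prediction is 0, so the loss should reduce to $\frac{1}{2m}\sum y^2$:

# Sketch: with zero parameters the loss equals mean(y**2) / 2.
theta_zero = np.zeros(x_with_bias.shape[0])
print(np.isclose(loss_fun(x_with_bias, y, theta_zero), np.mean(y**2) / 2))  # True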

4 Gradient descent

The gradient descent algorithm for multiple linear regression is essentially the same as for single-variable linear regression; the only difference is that there are more parameters to update:

$$
\begin{aligned}
\theta_j :&= \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \dots, \theta_n) \newline
&= \theta_j - \alpha \frac{1}{m} \sum\limits_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}, \qquad j = 0, 1, \dots, n
\end{aligned}
$$

In vectorized form, all parameters are updated at once:

$$
\theta := \theta - \alpha \frac{1}{m} \left( h_\theta(x) - y \right) x^T
$$

The code representation is as follows:

def gradient_decent(x, y, parameters, learning_rate):
    # Vectorized update: theta := theta - alpha/m * (h(x) - y) @ x^T
    y_predict = hypothesis_fun(x, parameters)
    return parameters - learning_rate*(y_predict-y) @ x.T/y.size
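
With a sufficiently small learning rate, a single gradient step should lower the loss. A minimal sketch (reusing `x_with_bias` and `theta_zero` from the earlier checks; the step size `1e-7` is an illustrative value chosen small enough for the unnormalized features):

# Sketch: one gradient step from the zero vector should decrease the loss.
theta_step = gradient_decent(x_with_bias, y, theta_zero, learning_rate=1e-7)
print(loss_fun(x_with_bias, y, theta_step) < loss_fun(x_with_bias, y, theta_zero))  # True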

5 Least squares method

The least squares method finds a set of parameters $(\theta_0, \theta_1, \dots, \theta_n)$ such that the residual sum of squares $\sum\limits_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$ is minimized, that is, it solves $\min \sum\limits_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$.

The algebraic solution of the least squares method takes the partial derivative of the loss with respect to each $\theta_i$, sets each partial derivative to 0, and then solves the resulting system of equations for the $\theta_i$. The matrix method is simpler than the algebraic method, so the following explains the matrix solution.

The hypothesis function $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \dots + \theta_n x_n$ can be written in matrix form as:

$$
h_\theta(X) = \theta^T X
$$

Here $X$ is an $(n+1) \times m$ matrix, where $n+1$ is the number of features per sample (including the bias term) and $m$ is the number of samples. The hypothesis $h_\theta(X)$ is a vector of $m$ predictions, and $\theta$ is an $(n+1) \times 1$ vector containing the $n+1$ model parameters.

The loss function is $J(\theta) = \frac{1}{2} (X^T \theta - Y)^T (X^T \theta - Y)$, where $Y$ is the vector of sample outputs with dimension $m \times 1$. The factor $\frac{1}{2}$ is included so that the coefficient becomes 1 after differentiation, which simplifies the calculation.

According to the principle of the least squares method, we take the derivative of this loss function with respect to the vector $\theta$ and set it to 0. The result is as follows:

$$
\frac{\partial}{\partial \theta} J(\theta) = X (X^T \theta - Y) = 0
$$

Rearranging this equation gives:

$$
\theta = (X X^T)^{-1} X Y
$$

If $X X^T$ is not invertible, it is usually for one of the following two reasons:

  • Redundant features, where two features are very closely related (i.e. they are linearly dependent)
  • Too many features (e.g. m ≤ n). In this case, some features should be removed or "regularization" should be used.

The code to calculate the parameters of the hypothesis function using the least squares method is as follows:

# Stack a row of 1s for the bias term, then apply the closed-form solution
x2 = np.vstack((np.ones(x.shape[1]), x))
parameters = np.linalg.inv(x2 @ x2.T) @ x2 @ y
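
When $X X^T$ is singular or poorly conditioned (for the reasons listed above), explicitly inverting it can fail or be numerically unstable. As an alternative sketch (not part of the original article), `np.linalg.lstsq` solves the same least squares problem without forming the inverse:

# Sketch: solve min ||x2.T @ theta - y||^2 directly instead of inverting x2 @ x2.T.
parameters_lstsq, *_ = np.linalg.lstsq(x2.T, y, rcond=None)
print(np.abs(parameters_lstsq - parameters).max())  # should be very small here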

6 Normalization

This article introduces two normalization (dimensionless) methods. The first is feature scaling, whose formula is:

$$
x_i := \frac{x_i - \min(x)}{\max(x) - \min(x)}
$$

The second is mean normalization, whose formula is:

$$
x_i := \frac{x_i - \text{mean}(x)}{\max(x) - \min(x)}
$$

The code for these two normalization methods is as follows:

def feature_scale(x):
    # Scale each feature (row) to the range [0, 1]
    x_min = np.min(x, axis=1)[..., np.newaxis]
    x_max = np.max(x, axis=1)[..., np.newaxis]
    return (x - x_min)/(x_max - x_min)

def mean_normalization(x):
    # Center each feature (row) at its mean and divide by its range
    x_min = np.min(x, axis=1)[..., np.newaxis]
    x_max = np.max(x, axis=1)[..., np.newaxis]
    x_mean = np.mean(x, axis=1)[..., np.newaxis]
    return (x - x_mean)/(x_max - x_min)
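
A quick check (a sketch using the `x` loaded earlier) confirms what each method does: feature scaling maps every feature into $[0, 1]$, and mean normalization gives every feature a mean of approximately 0:

# Sketch: verify the ranges produced by the two normalization methods.
x_scaled = feature_scale(x)
x_centered = mean_normalization(x)
print(x_scaled.min(), x_scaled.max())         # 0.0 1.0
print(np.abs(x_centered.mean(axis=1)).max())  # close to 0 for every feature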

7 Model Training and Prediction

The training procedure for multiple linear regression is basically the same as for single-variable linear regression; the only difference is that the input variables are normalized before training. In addition, a row of 1s is inserted above the first row of the input matrix `x` to represent the bias term.

# Initialize the parameters randomly (one per feature plus the bias term)
parameters = np.random.rand(x.shape[0]+1)
learning_rate = 0.1
losses = []

# Normalize the features and add a row of 1s for the bias term
x2 = feature_scale(x)
x2 = np.vstack((np.ones(x2.shape[1]), x2))

batch_size = 300
epoch_size = 100

for epoch in range(epoch_size):
    for i in range(y.size//batch_size+1):
        # Draw a random mini-batch, take one gradient step, and record its loss
        random_samples = np.random.choice(x2.shape[1], batch_size)
        parameters = gradient_decent(x2[:, random_samples],
                                     y[random_samples],
                                     parameters,
                                     learning_rate)
        losses.append(loss_fun(x2[:, random_samples],
                               y[random_samples],
                               parameters))
plt.plot(losses);

Now that the model has been trained, let us encapsulate the prediction code as follows:

def predict(x, parameters):
    # Normalize the inputs, add the bias row, then apply the learned parameters
    x = feature_scale(x)
    x = np.vstack((np.ones(x.shape[1]), x))
    return parameters @ x
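
As a rough goodness-of-fit check (a sketch evaluated on the training data itself rather than on a held-out set), we can compute the root mean squared error of the predictions:

# Sketch: RMSE of the trained model on the training data.
y_hat = predict(x, parameters)
rmse = np.sqrt(np.mean((y_hat - y)**2))
print(rmse)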

The code for visualizing the prediction results is as follows. Here we compare the least squares solution with the results of our gradient descent model, and we can see that there is not much difference between the two:

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(24, 9))
ax1.plot(losses)
ax1.set_title('Loss', fontdict={'size': 30})
ax2.scatter(predict(x, parameters), y)
ax2.plot(y, y, 'r-')
ax2.set_title('Linear Regression', fontdict={'size': 30})

# Best theta
x2 = np.vstack((np.ones(x.shape[1]), x))
theta = np.linalg.inv(x2 @ x2.T) @ x2 @ y
ax3.scatter(theta @ x2, y)
ax3.plot(y, y, 'r-')
ax3.set_title('Least squares', fontdict={'size': 30});

8 Encapsulate the model

In this section, we encapsulate the model built above into a Python class so that it is convenient to call for training, prediction, and other tasks. At present, the class is fairly simple, and it can be regarded as a prototype of a linear neural network model. Understanding the structure of this class will be very helpful for understanding the code in later articles, so it is worth studying carefully.

class MutiLinearRe():

    def __init__(self, input_shape):
        self.input_shape = input_shape
        self.parameters = np.random.rand(self.input_shape[0]+1)
  
    def hypothesis_fun(self, x, parameters):
        return parameters @ x

    def loss_fun(self, x, y, parameters):
        y_predict = self.hypothesis_fun(x, parameters)
        return np.sum(np.square(y_predict-y))/(2*y.size)
  
    def gradient_decent(self, x, y, parameters, learning_rate):
        y_predict = self.hypothesis_fun(x, parameters)
        return parameters - learning_rate*(y_predict-y) @ x.T/y.size
  
    def feature_scale(self, x):
        x_min = np.min(x, axis=1)[..., np.newaxis]
        x_max = np.max(x, axis=1)[..., np.newaxis]
        return (x - x_min)/(x_max - x_min)

    def mean_normalization(self, x):
        x_min = np.min(x, axis=1)[..., np.newaxis]
        x_max = np.max(x, axis=1)[..., np.newaxis]
        x_mean = np.mean(x, axis=1)[..., np.newaxis]
        return (x - x_mean)/(x_max - x_min)

    def _normalization(self, x, normalization):
        if normalization:
            if 'feature' == normalization:
                x = self.feature_scale(x)
            elif 'mean' == normalization:
                x = self.mean_normalization(x)
        return x

    def fit(self, x, y, epoch_size, batch_size, learning_rate, normalization="feature"):
        self.learning_rate = learning_rate
        self.loss = []
        self.normalization = normalization
        self.x = self._normalization(x, self.normalization)
        self.x = np.vstack((np.ones(self.input_shape[1]), self.x))
        self.y = y.copy()
        
        for epoch in range(epoch_size):
            for i in range(y.size//batch_size+1):
                # Draw a random mini-batch from the stored (normalized) inputs
                random_samples = np.random.choice(self.x.shape[1], batch_size)
                self.parameters = self.gradient_decent(self.x[:, random_samples],
                                                       self.y[random_samples],
                                                       self.parameters,
                                                       self.learning_rate)
                loss_ = self.loss_fun(self.x[:, random_samples],
                                      self.y[random_samples],
                                      self.parameters)
                if np.isinf(loss_):
                    raise ValueError("Overflow, please change the learning_rate")
                self.loss.append(loss_)
    
    def predict(self, x):
        x = self._normalization(x, self.normalization)
        x = np.vstack((np.ones(x.shape[1]), x))
        return self.parameters @ x

model = MutiLinearRe(input_shape=x.shape)
model.fit(x, y, epoch_size=100, batch_size=300,
          learning_rate=0.1, normalization='feature')
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(24, 9))
ax1.plot(model.loss)
ax1.set_title('Loss', fontdict={'size': 30})
ax2.scatter(model.predict(x), y)
ax2.plot(y, y, 'r-')
ax2.set_title('Linear Regression', fontdict={'size': 30})

# Least squares: Best theta
x2 = np.vstack((np.ones(x.shape[1]), x))
theta = np.linalg.inv(x2 @ x2.T) @ x2 @ y
ax3.scatter(theta @ x2, y)
ax3.plot(y, y, 'r-')
ax3.set_title('Least squares', fontdict={'size': 30});
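
As a quick variation (a sketch, not part of the original walkthrough), the same class can be trained with mean normalization instead of feature scaling by changing the `normalization` argument:

# Sketch: train a second model using mean normalization.
model2 = MutiLinearRe(input_shape=x.shape)
model2.fit(x, y, epoch_size=100, batch_size=300,
           learning_rate=0.1, normalization='mean')
print(model2.loss[-1])  # final mini-batch loss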
