Introduction to Machine Learning in Python: Polynomial Regression

1. Description

        Polynomial regression can identify non-linear relationships between independent and dependent variables. This article is the third in a series on regression, gradient descent, and MSE. The previous articles covered simple linear regression, the normal equation for regression, and multiple linear regression.

2. Polynomial regression

        Polynomial regression is used for more complex, non-linear data that is better modeled by a curve than by a straight line. It can be considered a special case of multiple linear regression.

        The general form of the model is the same as in multiple linear regression:

Ŷ = w₀X₀ + w₁X₁ + w₂X₂ + … + wₙXₙ

        Note that X₀ is a column of ones used for the bias term; this allows for the generalized formulation discussed in the first post. In polynomial regression, each "independent variable" can be viewed as a power of X₁.

        This means the same model from multiple linear regression can be reused, since only the coefficients for each variable need to be identified. A simple third-order polynomial model can be created as an example. Its equation is as follows:

Ŷ = w₀X₀ + w₁X₁ + w₂X₂ + w₃X₃, where X₂ = X₁² and X₃ = X₁³
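        As a minimal sketch of this idea (the helper below is illustrative and not part of the original series), a single input column can be expanded into the powers that act as the "independent variables":

import torch

def polynomial_features(x, degree=3):
  """
    Inputs:
      x: array of raw inputs | (n samples, 1)
      degree: highest power to include

    Output:
      returns the design matrix [1, x, x^2, ..., x^degree] | (n samples, degree+1)
  """

  columns = [torch.ones_like(x)] + [x**d for d in range(1, degree + 1)]
  return torch.hstack(columns)

# example: expand five points into a cubic design matrix of shape (5, 4)
x = torch.linspace(-2, 2, 5).reshape(-1, 1)
print(polynomial_features(x, degree=3))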

        The generalized functions for the model, MSE, and gradient descent from the previous articles can be reused:

import torch

# line of best fit
def model(w, X):
  """
    Inputs:
      w: array of weights | (num features, 1)
      X: array of inputs  | (n samples, num features)

    Output:
      returns the output of X@w | (n samples, 1)
  """

  return torch.matmul(X, w)

# mean squared error (MSE)
def MSE(Yhat, Y):
  """
    Inputs:
      Yhat: array of predictions | (n samples, 1)
      Y: array of expected outputs | (n samples, 1)
    Output:
      returns the loss of the model, which is a scalar
  """
  return torch.mean((Yhat-Y)**2) # mean((error)^2)

# optimizer
def gradient_descent(w):
  """
    Inputs:
      w: array of weights | (num features, 1)

    Global Variables / Constants:
      X: array of inputs  | (n samples, num features)
      Y: array of expected outputs | (n samples, 1)
      lr: learning rate to scale the gradient

    Output:
      returns the updated weights
  """ 

  n = X.shape[0]
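  # gradient of MSE w.r.t. w is (2/n) * X.T @ (X @ w - Y);
  # it is written below in row-vector form and reshaped to match w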

  return w - (lr * 2/n) * (torch.matmul(-Y.T, X) + torch.matmul(torch.matmul(w.T, X.T), X)).reshape(w.shape)

3. Create data

        Now, all that is needed is some data to train the model on. As in the previous article, a blueprint function can be defined and randomness added to it. The blueprint looks like this:

Y = 3X₁³ + 2X₁² + X₁ + 5 + ε, where ε is normally distributed noise with a mean of 0 and a standard deviation of 8

        A training set of size (800, 4) and a test set of size (200, 4) can be created. Note that each feature, except the bias, is a power of the first feature.

import torch

torch.manual_seed(5)
torch.set_printoptions(precision=2)

# features
X0 = torch.ones((1000,1))
X1 = (100*(torch.rand(1000) - 0.5)).reshape(-1,1) # generates 1000 random numbers from -50 to 50
X2, X3 = X1**2, X1**3
X = torch.hstack((X0,X1,X2,X3))

# normal distribution with a mean of 0 and std of 8
normal = torch.distributions.Normal(loc=0, scale=8)

# targets
Y = (3*X[:,3] + 2*X[:,2] + 1*X[:,1] + 5 + normal.sample(torch.ones(1000).shape)).reshape(-1,1)

# train, test
Xtrain, Xtest = X[:800], X[800:]
Ytrain, Ytest = Y[:800], Y[800:]

        After defining the initial weights, the data can be plotted along with the current line of best fit.

torch.manual_seed(5)
w = torch.rand(size=(4, 1))
w
tensor([[0.83],
        [0.13],
        [0.91],
        [0.82]])
import matplotlib.pyplot as plt

def plot_lbf():
  """
    Output:
      prints the line of best fit in comparison to the train and test data
  """

  # plot the train and test sets
  plt.scatter(Xtrain[:,1],Ytrain,label="train")
  plt.scatter(Xtest[:,1],Ytest,label="test")

  # plot the line of best fit
  X1_plot = torch.arange(-50, 50.1,.1).reshape(-1,1) 
  X2_plot, X3_plot = X1_plot**2, X1_plot**3
  X0_plot = torch.ones(X1_plot.shape)
  X_plot = torch.hstack((X0_plot,X1_plot,X2_plot,X3_plot))

  plt.plot(X1_plot.flatten(), model(w, X_plot).flatten(), color="red", zorder=4)

  plt.xlim(-50, 50)
  plt.xlabel("$X$")
  plt.ylabel("$Y$")
  plt.legend()
  plt.show()

plot_lbf()
[Figure: train and test data with the initial line of best fit. Image source: author]

4. Train the model

        To partially minimize the cost function, a learning rate of 5e-11 and 500,000 epochs can be used with gradient descent.

lr = 5e-11
epochs = 500000

# update the weights for the specified number of epochs
for i in range(0, epochs):
  # update the weights
  w = gradient_descent(w)

  # print the new values every 100,000 iterations
  if (i+1) % 100000 == 0:
    print("epoch:", i+1)
    print("weights:", w)
    print("Train MSE:", MSE(model(w,Xtrain), Ytrain))
    print("Test MSE:", MSE(model(w,Xtest), Ytest))
    print("="*10)

plot_lbf()
epoch: 100000
weights: tensor([[0.83],
        [0.13],
        [2.00],
        [3.00]])
Train MSE: tensor(163.87)
Test MSE: tensor(162.55)
==========
epoch: 200000
weights: tensor([[0.83],
        [0.13],
        [2.00],
        [3.00]])
Train MSE: tensor(163.52)
Test MSE: tensor(162.22)
==========
epoch: 300000
weights: tensor([[0.83],
        [0.13],
        [2.00],
        [3.00]])
Train MSE: tensor(163.19)
Test MSE: tensor(161.89)
==========
epoch: 400000
weights: tensor([[0.83],
        [0.13],
        [2.00],
        [3.00]])
Train MSE: tensor(162.85)
Test MSE: tensor(161.57)
==========
epoch: 500000
weights: tensor([[0.83],
        [0.13],
        [2.00],
        [3.00]])
Train MSE: tensor(162.51)
Test MSE: tensor(161.24)
==========
[Figure: train and test data with the line of best fit after 500,000 epochs of gradient descent. Image source: author]

        Even with 500,000 epochs and an extremely small learning rate, the model fails to identify the first two weights (the bias and the coefficient of X₁). While the current solution is still reasonably accurate, with a test MSE of 161.24, it could take millions more epochs to fully minimize the loss. The root cause is that the polynomial features span very different scales, so a learning rate small enough to keep the higher-order weights stable barely moves the lower-order ones. This is one of the limitations of gradient descent for polynomial regression.
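        A quick check of the average magnitude of each feature column (an added illustration, not part of the original article) makes this scale problem concrete:

# mean absolute value of each column of Xtrain: bias, X1, X1^2, X1^3
print(Xtrain.abs().mean(dim=0))
# for X1 roughly uniform on (-50, 50), these are approximately 1, 25, 8.3e2, and 3.1e4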

5. Normal equation

        As an alternative, the optimal weights can be calculated directly using the normal equation from the second article:

def NormalEquation(X, Y):
  """
    Inputs:
      X: array of input values | (n samples, num features)
      Y: array of expected outputs | (n samples, 1)
      
    Output:
      returns the optimized weights | (num features, 1)
  """
  
  return torch.inverse(X.T @ X) @ X.T @ Y

w = NormalEquation(Xtrain, Ytrain)
w
tensor([[4.57],
        [0.98],
        [2.00],
        [3.00]])

        The normal equation immediately recovers close estimates of the true weights, and the MSE on both the train and test sets is about 100 points lower than with gradient descent, roughly the variance of the injected noise (8² = 64):

MSE(model(w,Xtrain), Ytrain), MSE(model(w,Xtest), Ytest)
(tensor(60.64), tensor(63.84))

6. Conclusion

        The next two articles cover lasso and ridge regression, which build on the implementations of simple linear, multiple linear, and polynomial regression. These types of regression introduce two important concepts in machine learning: overfitting and regularization.
