Writing Linear Regression in Python from Scratch



Let's set Scikit-learn aside for a moment and look at the technology underneath.

For most data scientists, linear regression is the starting point for statistical modeling and predictive analysis. The method has existed for more than 200 years and has been studied extensively, yet it remains an active area of research. Because it is easy to interpret, linear regression is widely used on business data, and regression analysis is just as common in biology, industry, and other fields.

Meanwhile, Python has become the programming language of choice for data scientists, so it is especially important to be able to fit linear models to large data sets in more than one way.

If you have only just stepped through the door of machine learning, coding the entire linear regression algorithm from scratch in Python is a worthwhile exercise. Let's see how it's done.

Data

The first step in any machine learning problem is obtaining data; without data to learn from, there is no machine learning. This article uses a classic linear regression data set: house price prediction.

This is a simple data set containing house prices in Portland, Oregon. The first column is the area of the house (in square feet), the second column is the number of bedrooms, and the third column is the house price. Since the data set has multiple features (house size and number of rooms), this is a multiple linear regression problem, and the label (y) is the house price we will predict.

First define the function for loading the data set:


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def load_data(filename):
    # read the three comma-separated columns and name them
    df = pd.read_csv(filename, sep=",", index_col=False)
    df.columns = ["housesize", "rooms", "price"]
    data = np.array(df, dtype=float)
    plot_data(data[:,:2], data[:, -1])   # visualize before normalizing
    normalize(data)                      # scale the features in place
    return data[:,:2], data[:, -1]       # features x, label y

We will call this function later to load the data set. It returns the feature matrix x and the label vector y.
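
For orientation, a call would look like this (the shapes assume the 46-row Portland data set used throughout this article):

x, y = load_data("house_price_data.txt")
print(x.shape, y.shape)   # (46, 2) (46,)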

Normalizing the data

The code above not only loads the data, it also normalizes it and plots the data points. Before looking at the plot, let's first understand the normalize(data) call.

If you look at the raw data set, you will notice that the values in the second column (the number of rooms) are much smaller than those in the first (the area of the house). The model does not interpret these as rooms or square feet; to the model they are just numbers. When some columns (features) take much larger values than others, they can introduce unwanted bias and unbalance the variances and means. For these reasons, and to simplify the work, we recommend scaling or normalizing the features so they all lie in the same range (for example [-1, 1] or [0, 1]), which makes training much easier. We will therefore use feature normalization, which is expressed mathematically as:

Z = (x − μ) / σ

where Z is the normalized feature, x is the raw (non-normalized) feature, μ is the feature's mean, and σ is its standard deviation. With the formula in hand, we can write the normalization function:
def normalize(data):
    # scale every feature column (all but the last, which is the label)
    for i in range(0, data.shape[1] - 1):
        data[:,i] = (data[:,i] - np.mean(data[:,i])) / np.std(data[:, i])

This code iterates over each feature column and normalizes it with that column's mean and standard deviation. (In the complete code at the end of the article, the function also records each column's mean and standard deviation, so that test inputs can later be scaled the same way.)
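
A quick sanity check on a made-up three-row array (the numbers are invented purely for illustration) shows each feature column ending up with mean ≈ 0 and standard deviation ≈ 1:

toy = np.array([[1000.0, 2.0, 200000.0],
                [2000.0, 3.0, 300000.0],
                [3000.0, 4.0, 400000.0]])
normalize(toy)
print(np.mean(toy[:, 0]), np.std(toy[:, 0]))   # ~0.0 and ~1.0
print(toy[:, -1])                               # the price column is untouched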

Plotting the data

Before coding the linear regression model, we need to ask "why" first.

Why use linear regression to solve this problem? This is a very useful question. Before writing any specific code, you should be very clear about which algorithm to use and whether this is really the best choice given the data set and the problem to be solved.

We can plot the data to show why linear regression works for this data set. To do that, load_data above calls the plot_data function, which we now define:

def plot_data(x, y):
    plt.xlabel('house size')
    plt.ylabel('price')
    plt.plot(x[:,0], y, 'bo')   # blue dots: price vs. house size
    plt.show()

Calling this function generates the following figure:

[Figure: scatter plot of house price vs. house size]

As the figure shows, we can roughly fit a straight line through the points. This means a linear approximation can produce reasonably accurate predictions, so linear regression is a sensible choice.

After the data is prepared, the next step is to write code for the algorithm.

Hypothesis

First we need to define the hypothesis function, which we will use later to compute the cost. For linear regression with n features, the hypothesis is:

hθ(x) = θ0 + θ1x1 + θ2x2 + … + θnxn

Our data set has only 2 features, so for the current problem the hypothesis is:

hθ(x) = θ0 + θ1x1 + θ2x2

where x1 and x2 are the two features (the area of the house and the number of rooms). Now let's write a simple Python function that computes this hypothesis:


def h(x, theta):
    # x is expected to carry a leading column of ones for the intercept θ0
    return np.matmul(x, theta)
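
As a quick illustration (the θ values below are invented, not fitted from the data):

x_one = np.array([[1.0, 0.5, -0.2]])   # one sample: bias 1, normalized size 0.5, normalized rooms -0.2
theta_demo = np.array([[340000.0], [110000.0], [-6500.0]])
print(h(x_one, theta_demo))            # [[396300.]] = θ0 + 0.5·θ1 − 0.2·θ2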

Next we look at the cost function.

Cost function

The purpose of the cost function is to evaluate the quality of the model.

The equation of the cost function is:

J(θ) = (1/2m) · Σᵢ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)²

The code of the cost function is as follows:

def cost_function(x, y, theta):
    # (1/2m) · Σ (hθ(x) − y)², written as a matrix product
    return ((h(x, theta)-y).T@(h(x, theta)-y))/(2*y.shape[0])
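
As an optional sanity check (not part of the original pipeline), the vectorized expression should agree with a naive per-sample loop; note that cost_function returns a 1×1 array while this helper returns a plain float:

def cost_loop(x, y, theta):
    # same quantity, computed one sample at a time
    m = y.shape[0]
    total = 0.0
    for i in range(m):
        err = h(x[i:i+1], theta) - y[i]
        total += (err ** 2).item()
    return total / (2 * m)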

So far, every Python function we have defined corresponds exactly to the linear regression math above. Next we need to minimize the cost, and for that we need gradient descent.

Gradient descent

Gradient descent is an optimization algorithm designed to adjust parameters to minimize the cost function.

The main update step of gradient descent is:

θ := θ − α · ∂J(θ)/∂θ

For our cost function, this derivative works out to (1/m) · Xᵀ(hθ(x) − y), which is exactly what the code below computes.

In other words, we multiply the derivative of the cost function by the learning rate (α) and subtract the result from the current parameter value (θ) to obtain the updated parameters (θ).


def gradient_descent(x, y, theta, learning_rate=0.1, num_epochs=10):
    m = x.shape[0]
    J_all = []

    for _ in range(num_epochs):
        h_x = h(x, theta)
        gradient = (1/m)*(x.T@(h_x - y))   # ∂J/∂θ for the squared-error cost
        theta = theta - learning_rate*gradient
        J_all.append(cost_function(x, y, theta))

    return theta, J_all

The gradient_descent function returns theta and J_all. theta is the parameter vector containing the fitted θ values, and J_all is a list holding the cost after each epoch. The J_all variable is not essential, but it makes it easier to analyze the model.
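
To see it in action, here is a toy run on made-up data that follows y = 1 + 2x exactly (all names and numbers here are invented for illustration):

x_toy = np.hstack((np.ones((4, 1)), np.arange(4, dtype=float).reshape(4, 1)))
y_toy = np.array([[1.0], [3.0], [5.0], [7.0]])   # exactly y = 1 + 2x
theta_toy, J_hist = gradient_descent(x_toy, y_toy, np.zeros((2, 1)),
                                     learning_rate=0.1, num_epochs=500)
print(J_hist[0].item(), J_hist[-1].item())       # the cost drops toward 0
print(theta_toy.ravel())                         # approaches [1, 2]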

Putting it all together

The next thing to do is to call the functions in the correct order:


x, y = load_data("house_price_data.txt")
y = np.reshape(y, (46, 1))                      # this data set has 46 samples
x = np.hstack((np.ones((x.shape[0], 1)), x))    # prepend the bias column of ones
theta = np.zeros((x.shape[1], 1))               # initialize all parameters to zero
learning_rate = 0.1
num_epochs = 50
theta, J_all = gradient_descent(x, y, theta, learning_rate, num_epochs)
J = cost_function(x, y, theta)
print("Cost:", J)
print("Parameters:", theta)

# for testing and plotting cost
n_epochs = []
jplot = []
count = 0
for i in J_all:                                  # each entry is a (1, 1) array
    jplot.append(i[0][0])
    n_epochs.append(count)
    count += 1
jplot = np.array(jplot)
n_epochs = np.array(n_epochs)
plot_cost(jplot, n_epochs)

test(theta, [1600, 3])

First, the load_data function is called to load the x and y values: x contains the training samples and y contains the labels (here, the house prices).

You may have noticed that throughout the code we use matrix multiplication to express what we need. For example, to compute the hypothesis we must multiply the parameter vector (θ) with every feature vector (x). We could loop over each sample and multiply one at a time, but with many training samples this would not be the most efficient approach.

The more efficient way is matrix multiplication. The data set used in this article has two features, house area and number of rooms, so we have (2+1) = 3 parameters. Think of the hypothesis graphically as a line: the extra parameter θ0 is the intercept, which lets the line shift away from the origin so that it can actually fit the data.

[Figure: a fitted line whose intercept θ0 shifts it away from the origin]

Now we have three parameters and two features. This means theta, the parameter vector, has dimension (3, 1), while the feature matrix has dimension (46, 2). You will surely notice that it is mathematically impossible to multiply these two matrices. Look at our hypothesis again:

hθ(x) = θ0·x0 + θ1·x1 + θ2·x2, with x0 = 1
If you look closely, this is actually quite intuitive: if you add an extra column of ones at the beginning of the feature matrix x, giving it dimension (46, 3), then the matrix product of x and theta yields exactly the hθ(x) equation.

Remember that when you actually run the code, it does not return an expression like hθ(x); it returns the numerical value that expression evaluates to. In the code above, the line x = np.hstack((np.ones((x.shape[0], 1)), x)) adds that extra column of ones to x in preparation for the matrix multiplication.
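
A quick shape check with made-up data (illustrative only) confirms that the dimensions line up:

x_demo = np.random.rand(46, 2)                   # 46 samples, 2 features
x_demo = np.hstack((np.ones((46, 1)), x_demo))   # now (46, 3)
theta_demo = np.zeros((3, 1))
print((x_demo @ theta_demo).shape)               # (46, 1): one prediction per sample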

After this, we initialize the theta vector with zeros; you could also initialize it with small random values. We also specify the learning rate and the number of training epochs.

Once all the hyperparameters are defined, we call the gradient descent function, which returns the history of cost values and the final parameter vector theta. This theta vector defines our final hypothesis. You may notice that the theta returned by gradient descent has dimension (3, 1).

Remember our hypothesis function?

hθ(x) = θ0 + θ1x1 + θ2x2

So we need three θs. The theta vector has dimension (3, 1), and theta[0], theta[1], and theta[2] are θ0, θ1, and θ2 respectively. J_all holds the history of the cost; you can print the J_all array to see how the cost decreases with each epoch of gradient descent.

[Figure: cost decreasing over training epochs]

We can draw this graph by defining and calling the plot_cost function, as shown below:


def plot_cost(J_all, num_epochs):
    plt.xlabel('Epochs')
    plt.ylabel('Cost')
    plt.plot(num_epochs, J_all, 'm', linewidth=5)   # magenta cost curve
    plt.show()

Now we can use the fitted parameters to predict the label: the price of a house, given its size and number of rooms.

Test

Now you can test the code by calling the test function, which takes the house area, the number of rooms, and the final theta vector returned by the linear regression model as input, and outputs the house price. (The mu and std lists hold the per-feature mean and standard deviation saved during normalization; see the complete code below.)


def test(theta, x):
    # scale the raw inputs with the same mean and std used during training
    x[0] = (x[0] - mu[0])/std[0]
    x[1] = (x[1] - mu[1])/std[1]

    y = theta[0] + theta[1]*x[0] + theta[2]*x[1]
    print("Price of house:", y)

Complete code


import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

#variables to store mean and standard deviation for each feature
mu = []
std = []

def load_data(filename):
    df = pd.read_csv(filename, sep=",", index_col=False)
    df.columns = ["housesize", "rooms", "price"]
    data = np.array(df, dtype=float)
    plot_data(data[:,:2], data[:, -1])
    normalize(data)
    return data[:,:2], data[:, -1]

def plot_data(x, y):
    plt.xlabel('house size')
    plt.ylabel('price')
    plt.plot(x[:,0], y, 'bo')
    plt.show()

def normalize(data):
    for i in range(0, data.shape[1] - 1):
        # record the statistics BEFORE normalizing, so test inputs can be scaled the same way
        mu.append(np.mean(data[:, i]))
        std.append(np.std(data[:, i]))
        data[:,i] = (data[:,i] - mu[i]) / std[i]


def h(x,theta):
    return np.matmul(x, theta)

def cost_function(x, y, theta):
    return ((h(x, theta)-y).T@(h(x, theta)-y))/(2*y.shape[0])

def gradient_descent(x, y, theta, learning_rate=0.1, num_epochs=10):
    m = x.shape[0]
    J_all = []

    for _ in range(num_epochs):
        h_x = h(x, theta)
        gradient = (1/m)*(x.T@(h_x - y))   # ∂J/∂θ for the squared-error cost
        theta = theta - learning_rate*gradient
        J_all.append(cost_function(x, y, theta))

    return theta, J_all

def plot_cost(J_all, num_epochs):
    plt.xlabel('Epochs')
    plt.ylabel('Cost')
    plt.plot(num_epochs, J_all, 'm', linewidth=5)
    plt.show()

def test(theta, x):
    x[0] = (x[0] - mu[0])/std[0]
    x[1] = (x[1] - mu[1])/std[1]

    y = theta[0] + theta[1]*x[0] + theta[2]*x[1]
    print("Price of house:", y)

x,y = load_data("house_price_data.txt")
y = np.reshape(y, (46,1))
x = np.hstack((np.ones((x.shape[0],1)), x))
theta = np.zeros((x.shape[1], 1))
learning_rate = 0.1
num_epochs = 50
theta, J_all = gradient_descent(x, y, theta, learning_rate, num_epochs)
J = cost_function(x, y, theta)
print("Cost:", J)
print("Parameters:", theta)

#for testing and plotting cost 
n_epochs = []
jplot = []
count = 0
for i in J_all:
    jplot.append(i[0][0])
    n_epochs.append(count)
    count += 1
jplot = np.array(jplot)
n_epochs = np.array(n_epochs)
plot_cost(jplot, n_epochs)

test(theta, [1600, 3])

Summary

This is the entire code for linear regression.

You have now written a linear regression model from scratch. Understanding and coding the entire algorithm is not easy, and you may need to revisit it a few times before it fully sinks in, but the effort is worthwhile. Linear regression is usually the first machine learning algorithm people learn; afterwards, pick another data set suited to linear regression and try the algorithm you just wrote on it.
