An article to learn machine learning algorithms from scratch: Linear Regression

Regression algorithm

Regression refers to a statistical analysis method that studies the relationship between a set of random variables (Y1, Y2, ..., Yi) and another set of variables (X1, X2, ..., Xk); when several predictors are involved it is also called multiple regression analysis. Usually Y1, Y2, ..., Yi are the dependent variables and X1, X2, ..., Xk are the independent variables.
Regression analysis is a predictive modeling technique that studies the relationship between a dependent variable (target) and one or more independent variables (predictors). It is commonly used for predictive analysis, time-series modeling, and discovering causal relationships between variables. For example, regression is a natural way to study the relationship between reckless driving and the number of road traffic accidents.

Common regression algorithms

Linear Regression
Linear regression is one of the most well-known modeling techniques and is usually one of the first techniques people learn for predictive modeling. In this technique the dependent variable is continuous, the independent variables can be continuous or discrete, and the regression line is linear by nature.
Linear regression uses a best-fit straight line (the regression line) to establish a relationship between the dependent variable (Y) and one or more independent variables (X).
Expressed as an equation, Y = aX + b + ε, where a is the slope of the line, b is the intercept, and ε is the error term. This equation can predict the value of the target variable from the given predictor variable(s).
Logistic Regression
Logistic regression is used to estimate the probability of "event = success" versus "event = failure". When the dependent variable is binary (1/0, true/false, yes/no), we should use logistic regression. It is widely used for classification problems.
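As a quick aside (not part of the original article), here is a minimal logistic regression sketch using scikit-learn; the toy data below is invented purely for illustration:

# Minimal logistic regression sketch (illustrative only; toy data)
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy binary data: hours studied -> exam passed (1) or failed (0)
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression()
clf.fit(X, y)

print(clf.predict_proba([[3.5]]))  # probabilities of "failure" and "success"
print(clf.predict([[5.5]]))        # predicted class label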
Polynomial Regression
If the power of the independent variable in a regression equation is greater than 1, it is a polynomial regression equation, for example:
y = a*x^2 + b
In this technique the best-fit line is not a straight line but a curve fitted to the data points. Over-fitting and under-fitting are very common problems in polynomial regression.
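To make this concrete, here is a small sketch (not from the original article) that fits a degree-2 polynomial with NumPy; the data points are made up:

# Polynomial regression sketch using numpy.polyfit (illustrative only)
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 2.9, 9.2, 19.1, 33.0, 51.2])  # roughly y = 2x^2 + 1 plus noise

coeffs = np.polyfit(x, y, deg=2)  # highest-degree coefficient first
poly = np.poly1d(coeffs)          # callable polynomial built from the coefficients

print(coeffs)
print(poly(2.5))                  # predicted value at x = 2.5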
Stepwise Regression
When dealing with multiple independent variables, we can use this form of regression. In this technique the selection of independent variables is carried out by an automatic process, without human intervention.
This is achieved by identifying important variables from statistics such as R-squared, t-statistics, and the AIC criterion. Stepwise regression fits the model by adding or removing covariates according to a specified criterion. Some of the most commonly used stepwise regression methods are listed below (a small forward-selection sketch follows this list):
The standard stepwise method does both: at each step it may add or remove a predictor.
The forward selection method starts with the most significant predictor in the model and adds a variable at each step.
The backward elimination method starts with all predictors in the model and removes the least significant variable at each step.
The aim of this modeling technique is to maximize predictive power with the fewest predictors. It is also one way of dealing with high-dimensional data sets.
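As a rough sketch of the forward-selection idea (not from the original article, and selecting by cross-validated score rather than AIC; scikit-learn 0.24+ and its bundled diabetes dataset are assumed):

# Forward stepwise selection sketch (illustrative only)
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)

# Greedily add one feature at a time, keeping the subset that scores best
selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=4, direction="forward", cv=5
)
selector.fit(X, y)

print(selector.get_support())  # boolean mask of the selected features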
Ridge Regression

Ridge regression is a technique used for data with multicollinearity (highly correlated independent variables). With multicollinearity, even though the ordinary least squares (OLS) estimates are unbiased, their variances are large, which pushes the observed values far away from the true values. Ridge regression reduces the standard error by adding a degree of bias to the regression estimates.
Lasso Regression
Lasso (Least Absolute Shrinkage and Selection Operator) is similar to ridge regression, but it penalizes the absolute value of the regression coefficients. In addition, it can reduce variability and improve the accuracy of linear regression models.
ElasticNet Regression
ElasticNet is a hybrid of the Lasso and Ridge regression techniques: it uses both the L1 and L2 penalties as the regularizer. ElasticNet is useful when there are multiple correlated features; Lasso tends to pick one of them at random, while ElasticNet tends to keep both.
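As a hedged illustration (not from the original article; the hyperparameters alpha and l1_ratio are arbitrary), the three penalized regressions can be compared on synthetic data with two highly correlated columns:

# Ridge / Lasso / ElasticNet comparison sketch (illustrative only)
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 4] = X[:, 3] + 0.01 * rng.normal(size=100)  # make columns 3 and 4 highly correlated
y = 3 * X[:, 0] + 2 * X[:, 3] + rng.normal(size=100)

for model in (Ridge(alpha=1.0), Lasso(alpha=0.1), ElasticNet(alpha=0.1, l1_ratio=0.5)):
    model.fit(X, y)
    # Lasso tends to zero out one of the correlated columns; ElasticNet tends to keep both
    print(type(model).__name__, np.round(model.coef_, 2))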
——The content above is adapted from the Tencent Cloud Community article "Commonly used regression algorithms" published by Lu Qin (Dataren.com).

Linear regression

Concept

Linear Regression, as the name suggests, is regression analysis based on a linear model. It is one of the most well-known modeling techniques and usually one of the first techniques people learn for predictive modeling. In this technique the dependent variable is continuous, the independent variables can be continuous or discrete, and the regression line is linear by nature.
Linear regression uses a best-fit straight line (the regression line) to establish a relationship between the dependent variable (y) and one or more independent variables (x).
Expressed as an equation, y = ax + b + ε, where a is the slope of the line, b is the intercept, and ε is the error term. This equation can predict the value of the target variable from the given predictor variable(s).

(The picture referenced here showed two formulas.) The first is very familiar: y = ax + b, the linear function we learned in school, whose graph in two-dimensional space is a straight line. The second is just as easy to understand; it only differs in dimension, generalizing the line to a multi-dimensional space (y = a1x1 + a2x2 + ... + anxn + b). We call both of these linear. To judge whether a function is linear, you only need to check that every variable appears to the first power; as long as that holds, no matter how many dimensions the space has, the graph stays "straight". In addition, based on the number of independent variables, linear regression is divided into simple (unary) linear regression and multiple linear regression.

Linear equation knowledge review

Suppose we have the linear equation
y = ax + b
and we know two points on its graph,
A: (1, 3) and B: (2, 5).
Substituting the two points into the equation gives
a + b = 3
2a + b = 5
The solution is a = 2, b = 1,
so the equation is y = 2x + 1.
Now, given the x coordinate of any point on the graph, we can use this equation to find the corresponding y coordinate.
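The same small system can also be solved programmatically; below is a minimal NumPy sketch that simply redoes the hand calculation above:

# Solve a + b = 3 and 2a + b = 5 with NumPy (redoing the hand calculation)
import numpy as np

A = np.array([[1.0, 1.0],   # coefficients of a and b in "a + b = 3"
              [2.0, 1.0]])  # coefficients of a and b in "2a + b = 5"
rhs = np.array([3.0, 5.0])

a, b = np.linalg.solve(A, rhs)
print(a, b)  # 2.0 1.0, i.e. y = 2x + 1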

From this simple case we can see what regression really does: based on the existing data (the two points A and B), we choose an appropriate mathematical model (f(x) = ax + b), solve for the model's parameters from the relationship between the data and the model to obtain a concrete function, and then use the solved function to predict the approximate response value corresponding to other values of the independent variable.

Linear model

Unfortunately, such a perfect model generally exists only in the mathematical world. In real life, data usually follows some underlying law but is disturbed by other factors; this kind of disturbance is called noise.
A function containing noise is written as: f(x) = ax + b + ε
ε (epsilon) ~ N(0, 1): the noise term, which follows a Gaussian (normal) distribution.
Note: what is a Gaussian (normal) distribution? In a set of related sample data, the closer a value is to the mean, the denser the samples are, and the farther from the mean, the sparser they are (the original article shows the bell-shaped curve here).

The assumption ε ~ N(0, 1) means that for most of the data the noise is close to 0.
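To get a feel for "linear law plus Gaussian noise", here is a small sketch (not from the original article) that generates such data with made-up parameters a = 2 and b = 1:

# Generate y = 2x + 1 + ε with ε ~ N(0, 1)  (illustrative parameters)
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 50)
epsilon = np.random.normal(loc=0.0, scale=1.0, size=x.shape)  # Gaussian noise
y = 2 * x + 1 + epsilon

plt.scatter(x, y)              # noisy points scattered around the line
plt.plot(x, 2 * x + 1, c='r')  # the underlying noise-free line
plt.show()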

Now we have the following set of data:

ID,Education,Income
1,10.000000 ,26.658839
2,10.401338 ,27.306435
3,10.842809 ,22.132410
4,11.244147 ,21.169841
5,11.645449 ,15.192634
6,12.086957 ,26.398951
7,12.048829 ,17.435307
8,12.889632 ,25.507885
9,13.290970 ,36.884595
10,13.732441 ,39.666109
11,14.133779 ,34.396281
12,14.635117 ,41.497994
13,14.978589 ,44.981575
14,15.377926 ,47.039595
15,15.779264 ,48.252578
16,16.220736 ,57.034251
17,16.622074 ,51.490919
18,17.023411 ,51.336621
19,17.464883 ,57.681998
20,17.866221 ,68.553714
21,18.267559 ,64.310925
22,18.709030 ,68.959009
23,19.110368 ,74.614639
24,19.511706 ,71.867195
25,19.913043 ,76.098135
26,20.354515 ,75.775216
27,20.755853 ,72.486055
28,21.167191 ,77.355021
29,21.598662 ,72.118790
30,22.000000 ,80.260571
 

Image drawing based on data

import pandas as pd
import matplotlib.pyplot as plt

# Read the file through pd
data = pd.read_csv('./data.csv')
x = data.Education
y = data.Income
# Draw an image through plt
plt.scatter(x, y)
plt.show()
 

By observation, we find that this set of data roughly follows the shape of a linear function, although individual points may contain noise.
So we can try to model it with the function f(x) = ax + b + ε.
However, although the function image and the data look similar to the simple case above, the presence of noise means we cannot solve for the parameters by simply substituting data points into the equation. So how do we solve it in this situation?

Least squares method

Solving the model by minimizing the mean squared error is called "ordinary least squares" (OLS). Its main idea is to choose the unknown parameters so that the sum of the squared differences between the theoretical values and the observed values is minimized.

Let's first organize our thinking:

  1. In this set of data it is impossible to draw a straight line through every point, because the data contains noise. Looking at the function formula, we can see that on the image the effect of noise is to shift a point that would otherwise lie on the line up or down by a distance of |ε|.
  2. Since ε ~ N(0, 1), we can first assume that some straight line is the model we are looking for, substitute all of the data into this candidate line, and compute the ε value of every point. When the sum of the ε terms is smallest, the error between the function image and the sample data is smallest, and at that point we can determine the values of the two unknowns a and b.

So when can the sum of the ε terms be minimized?
Rearranging the formula
f(x) = ax + b + ε,
let's first compute the value of ε:
ε = y_i - f(x_i)  ⇒  ε = y_i - a·x_i - b
For different points (x_i, y_i), ε may be positive or negative. Since we need to sum ε over all points, we square it to remove the effect of the sign:
ε² = (y_i - a·x_i - b)²
Summing ε² over all points gives:
E = Σ_{i=1}^{m} (y_i - a·x_i - b)²
which is the loss function, also written as:
loss = Σ_{i=1}^{m} (y_i - a·x_i - b)²
where i ∈ [1, m] runs over all points. When E is smallest, the values of a and b have a unique solution (in the original figure, w denotes a).

So how do we find the minimum value of loss? Here we need a basic understanding of the quadratic function (its graph is a parabola; the original article shows one here).
Recall that when the coefficient of the quadratic term is positive, the parabola opens upward and has a minimum. Now look again at the formula we just derived:
loss = Σ_{i=1}^{m} (y_i - a·x_i - b)²
Here x_i and y_i are known constants and the unknowns are a and b. Viewed as a quadratic function of a (or of b), the coefficient of the quadratic term is the sum of the squares of -x_i (or of -1), which is positive, so the parabola opens upward and a minimum exists.
Finding the minimum of such a function is relatively simple: take the derivative, and the minimum of loss is reached where the derivative (the slope of the tangent line) is 0, as marked by the red line in the original figure.
There are two unknowns. We first solve for w (that is, a, the coefficient of x), then substitute w into the derivative with respect to b to obtain b.

Setting the partial derivatives with respect to w and b to 0 and solving gives:

w = Σ_{i=1}^{m} y_i·(x_i - x̄) / (Σ_{i=1}^{m} x_i² - m·x̄²)
b = (1/m) Σ_{i=1}^{m} (y_i - w·x_i)

where x̄ = (1/m) Σ_{i=1}^{m} x_i is the mean of all x values.

At this point we have obtained the best-fitting model (in the least-squares sense) for this set of sample data.

Code

# Linear regression solution of least squares method
# 1. Introduce dependency
import pandas as pd
import matplotlib.pyplot as plt

# 2. Read data
points = pd.read_csv('data.csv')
x = points.Education
y = points.Income


# *3. Display image
# plt.scatter(x, y)
# plt.show()

# 3. Define the function to calculate the average
def average(data):
    total = 0
    num = len(data)
    for i in range(num):
        total += data[i]
    return total / num


# 4. Define the fitting function, and find w and b
def fit_function(x, y):
    """
    Fitting function
    :param x: collection of x coordinates
    :param y: collection of y coordinates
    :return: w, b: slope and intercept
    """
    m = len(x)
    # Find the average of all x
    x_bar = average(x)

    # Prepare data according to the formula
    sum_yxxbar = 0
    sum_x2 = 0
    sum_ywx = 0

    for i in range(m):
        sum_yxxbar += y[i] * (x[i] - x_bar)
        sum_x2 += x[i] ** 2

    w = sum_yxxbar / (sum_x2 - m * (x_bar ** 2))

    for i in range(m):
        sum_ywx += (y[i] - w * x[i])

    b = sum_ywx / m

    return w, b


w, b = fit_function(x, y)

# 5. Define the error function and find the error value
def MSE(w, b, x, y):
    """
    Mean squared error
    :param w: slope
    :param b: intercept
    :param x: collection of x coordinates
    :param y: collection of y coordinates
    :return: mean squared error
    """
    total_loss = 0
    m = len(x)
    for i in range(m):
        total_loss += (y[i] - w * x[i] - b) ** 2
    return total_loss / m


print("w is :", w)
print("b is :", b)

loss = MSE(w, b, x, y)
print("loss is :", loss)

# 6. Draw the fitting function image on the sample data image
plt.scatter(x, y)
y = w * x + b
plt.plot(x, y, c='r')
plt.show()

Result of the run: the code prints w, b, and the loss, and draws the fitted red line over the scatter plot (output figure omitted here).

The least squares calculation for unary (single-variable) linear regression is exact and accurate, but if the model is a multivariate linear model, the image becomes a line (hyperplane) in a multi-dimensional space, as illustrated in the original figure (omitted here).

In that case the amount of computation (solving the system of derivative equations) is also very large. To avoid the complicated calculation, we can give up the exact result and settle for the next best thing, solving the problem approximately. How? This brings us to another, more commonly used way of solving linear regression: the gradient descent algorithm.
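For reference, here is a sketch (not from the original article; the synthetic data is made up) of the exact multivariate least-squares solution via the normal equation. Building and solving this system is precisely the computation that becomes expensive as the number of features grows:

# Closed-form multivariate least squares via the normal equation (illustrative only)
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))  # 3 independent variables
y = X @ np.array([1.5, -2.0, 0.5]) + 4.0 + rng.normal(scale=0.5, size=200)

Xb = np.hstack([X, np.ones((200, 1))])       # append a column of ones for the intercept b
# Solve (Xb^T Xb) theta = Xb^T y rather than inverting the matrix explicitly
theta = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)

print(theta)  # approximately [1.5, -2.0, 0.5, 4.0]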

Gradient descent algorithm

Gradient descent is an iterative method that can be used to solve least squares problems (both linear and nonlinear). For solving the model parameters of machine learning algorithms, i.e. unconstrained optimization problems, gradient descent is one of the most commonly used methods; the other commonly used method is the least squares method. To minimize a loss function, gradient descent can be used to solve for the parameters step by step, yielding the minimized loss and the model parameter values. Conversely, if we need to maximize a function, we iterate with gradient ascent instead. In machine learning, two further variants have been developed on top of the basic method: stochastic gradient descent and batch gradient descent.
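For the mean squared error used in the code below, loss = (1/m) Σ_{i=1}^{m} (y_i - w·x_i - b)², the gradients that drive each descent step are (written out here for reference; step_gradient_descent below computes exactly these):
∂loss/∂w = (2/m) Σ_{i=1}^{m} (w·x_i + b - y_i)·x_i
∂loss/∂b = (2/m) Σ_{i=1}^{m} (w·x_i + b - y_i)
Each iteration then updates w ← w - α·∂loss/∂w and b ← b - α·∂loss/∂b, where α (alpha in the code) is the learning rate that controls the step size.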

Code

# Linear regression solution of gradient descent method
# 1. Introduce dependency
import pandas as pd
import matplotlib.pyplot as plt

# 2. Read data
points = pd.read_csv('data.csv')
x = points.Education
y = points.Income


# *3. Display image
# plt.scatter(x, y)
# plt.show()

# 3. Define loss function (MSE)
def MSE(w, b, x, y):
    """
    Mean squared error
    :param w: slope
    :param b: intercept
    :param x: collection of x coordinates
    :param y: collection of y coordinates
    :return: mean squared error
    """
    total_loss = 0
    m = len(x)
    for i in range(m):
        total_loss += (y[i] - w * x[i] - b) ** 2
    return total_loss / m


# 4. Initialize the model's hyperparameters
alpha = 0.0001  # Learning rate (step-length coefficient), controls the descent speed
initial_w = range(2, 8)  # Candidate initial values of w (searched over below)
initial_b = range(-50, -30)  # Candidate initial values of b (searched over below)
num_iter = 200  # Number of iterations per candidate


# 5. Define the gradient descent algorithm
def gradient_descent(x, y, initial_w, initial_b, alpha, num_iter):
    """
    Gradient descent algorithm
    :param x: x coordinates of the sample points
    :param y: y coordinates of the sample points
    :param initial_w: initial w
    :param initial_b: initial b
    :param alpha: learning rate (step-length coefficient)
    :param num_iter: number of iterations
    :return: final w, b, and the list of loss values (recording how the loss decreases)
    """
    w = initial_w
    b = initial_b
    loss_list = []
    for i in range(num_iter):
        loss = MSE(w, b, x, y)
        if i % 10 == 0:  # Print the current result every ten iterations
            print("step:", i, "|w| is :", w, "|b| is :", b, "|loss| is :", loss)
        loss_list.append(loss)
        w, b = step_gradient_descent(w, b, alpha, x, y)
    return [w, b, loss_list]  # Return the result list


# 6. Gradient descent process
def step_gradient_descent(w, b, alpha, x, y):
    """
    A single gradient descent step
    :param w: current slope
    :param b: current intercept
    :param alpha: learning rate (step-length coefficient), controls the descent speed
    :param x: x coordinates of the sample points
    :param y: y coordinates of the sample points
    :return: step_w, step_b: the slope and intercept after this step
    """
    sum_grad_w = 0  # Accumulators for the gradient sums start from 0
    sum_grad_b = 0
    m = len(x)

    # Compute the partial derivatives of the loss with respect to w and b
    for i in range(m):
        sum_grad_w += (w * x[i] + b - y[i]) * x[i]
        sum_grad_b += w * x[i] + b - y[i]
    grad_w = 2 / m * sum_grad_w
    grad_b = 2 / m * sum_grad_b

    # Gradient descent update
    step_w = w - alpha * grad_w
    step_b = b - alpha * grad_b

    return step_w, step_b

# 7. Multi-start search
# Define the final result
final_loss_list = [1000]  # Sentinel: a large initial loss so any real result replaces it
final_w = 0
final_b = 0
# Traverse multiple points to calculate and retain the optimal solution
def comput_multipoint(x, y, initial_w, initial_b, alpha, num_iter):
    "" "
    Traverse multiple points to calculate and retain the optimal solution
    : param x: sample point x coordinate
    : param y: sample point y coordinate
    : param initial_w: initial slope range
    : param initial_b: initial intercept range
    : param alpha: step coefficient To control the descending speed
    : param num_iter: number of iterations
    : return: None
    """

    # Introduce global variables
    global final_loss_list
    global final_w
    global final_b

    # Try all combinations of w and b
    for w in initial_w:
        for b in initial_b:
            result_list = gradient_descent(x, y, w, b, alpha, num_iter)
            if min(result_list[2]) < min(final_loss_list):
                final_w = result_list[0]
                final_b = result_list[1]
                final_loss_list = result_list[2]
                print("=" * 10 + "最新loss:", min(result_list[2]), "=" * 10)


# w, b, loss_list = gradient_descent(x, y, initial_w, initial_b, alpha, num_iter)
# 8. Call the function and output the result
comput_multipoint(x, y, initial_w, initial_b, alpha, num_iter)
# Log output
print("w is :", final_w)
print("b is :", final_b)
print("loss is :", min(final_loss_list))

# Loss change image drawing
plt.plot(final_loss_list)
plt.show()

# Draw the fitting function image on the sample data image
plt.scatter(x, y)
y = final_w * x + final_b
plt.plot(x, y, c='r')
plt.show()

 

Origin: blog.csdn.net/abu1216/article/details/111026993