DLT-02-Unary Linear Regression

This article is the second in the **Deep Learning Tutorial (DLT)** series and mainly introduces linear neural networks. Readers who want to learn deep learning or machine learning can follow the official account GeodataAnalysis, where I will gradually publish the rest of this series.

Before introducing deep learning, we need to understand some basics of neural network training. To ease into the subject, we will start from a classic algorithm, the linear neural network: first unary (one-variable) linear regression, then gradually extending to multiple linear regression and nonlinear regression. Both linear regression and softmax regression from classical statistical learning can be viewed as linear neural networks, and this knowledge will form the basis for the more complex techniques covered later in this tutorial series.

The catalog of this article is as follows:

  • 1 Linear regression
  • 2 Linear regression model expression
  • 3 Generate test data
    • 3.1 Long-term trends
    • 3.2 Seasonal changes
    • 3.3 Irregular changes
  • 4 Hypothesis function
  • 5 Loss function
  • 6 Gradient Descent
  • 7 Stochastic Gradient Descent
  • 8 Training
    • 8.1 Without stochastic gradient descent
    • 8.2 Using stochastic gradient descent

1 Linear regression

Regression is a class of methods that can model the relationship between one or more independent variables and a dependent variable. In the natural and social sciences, regression is often used to represent the relationship between inputs and outputs.

Most tasks in machine learning are related to prediction. Regression problems arise when we want to predict a numeric value. A common example: given the living area, predict the price of a house. But not all prediction is regression; in a later article we will introduce classification problems, whose goal is to predict which of a set of classes a data point belongs to, for example: given the living area, predict whether a dwelling is a house or an apartment.

Linear regression is based on a few simple assumptions. First, we assume that the relationship between the independent variable x and the dependent variable y is linear, that is, y can be expressed as a weighted sum of the elements of x, usually allowing for some noise in the observations. Second, we assume that any noise is well behaved, for example that it follows a normal distribution.

2 Linear regression model expression

Usually, we use $x^{(i)}$ to denote the input variable, $y^{(i)}$ the output or target variable, and $(x^{(i)}, y^{(i)})$ a training example from the training set. We will also use $X$ to denote the space of input values and $Y$ the space of output values. In this example, $X = Y = \mathbb{R}$.

Supervised learning means learning, given a training set, a function $h: X \to Y$ such that $h(x)$ is a "good" predictor of the corresponding value of $y$. For historical reasons, $h(x)$ is called the hypothesis function.

3 Generate test data

For simplicity, we will construct an artificial dataset based on a linear model with noise. Our task is to recover the parameters of this model using this finite sample dataset. We will use low-dimensional data so that it can be easily visualized. In the code below, we generate a time series dataset with 1461 samples. Since time series data = long-term trend + seasonal change + irregular change, we generate test data according to this rule.

3.1 Long-term trends

The long-term trend is $y = 0.1x$, which we use to generate four years of daily data.

import numpy as np
import matplotlib.pyplot as plt

def trend(time, slope=0):
    return slope * time

time = np.arange(4 * 365 + 1)
baseline = 10
trend_series = trend(time, 0.1)+baseline
plt.plot(time, trend_series);


3.2 Seasonal changes

Seasonality, also known as cyclical fluctuation, is a wave-like or oscillating change around the long-term trend of a time series.

def seasonal_pattern(season_time):
    return np.where(season_time < 0.4,
                    np.cos(season_time * 2 * np.pi),
                    1 / np.exp(3 * season_time))

def seasonality(time, period, amplitude=1, phase=0):
    season_time = ((time + phase) % period) / period
    return amplitude * seasonal_pattern(season_time)

amplitude = 40
season_series = seasonality(time, period=365, amplitude=amplitude)
plt.plot(time, season_series, time, trend_series+season_series);

3.3 Irregular changes

Irregular changes are unsystematic fluctuations, and come in two types: strictly random variation, and irregular sudden changes with a large impact.

def noise(time, noise_level=1):
    return np.random.randn(len(time)) * noise_level

noise_level = 15
noisy_series = trend_series + season_series + noise(time, noise_level)
plt.plot(time, noisy_series);

The final generated data is as follows, with `x` representing the input variable and `y` the target variable, both of which are one-dimensional `numpy` arrays.

x = time.copy()
y = noisy_series.copy()
x.shape, y.shape
((1461,), (1461,))

4 Hypothesis function

The linear assumption means that the target (house price) can be expressed as a weighted sum of features (house area), as shown in the following formula:

$$\mathrm{price} = w_{\mathrm{area}} \cdot \mathrm{area} + b$$

Here $w_{\mathrm{area}}$ is called the weight; the weights determine the influence of each feature on the predicted value. $b$ is called the bias, offset, or intercept: it is what the predicted value should be when all features are 0. Even though in reality no house will ever have an area of 0 (or an age of exactly 0 years), we still need the bias term; without it, the expressive power of our model would be limited. Strictly speaking, the formula above is an affine transformation of the input features: a linear transformation of the features via the weighted sum, combined with a translation via the bias term.
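As a toy illustration of this affine form (the numbers below are made up for the example, not fitted to any data):

# Hypothetical weight and bias, purely for illustration.
w_area = 150.0  # price increase per unit of living area
b = 5000.0      # predicted price when the area is 0

area = 100.0
price = w_area * area + b
print(price)  # 20000.0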

Given a dataset, our goal is to find the weight $w$ and bias $b$ of the model such that the predictions made by the model roughly match the real prices in the data. The predicted output is determined by the affine transformation of the input features through the linear model, and the affine transformation is determined by the chosen weights and biases.

A hypothesis function is used to represent this linear model so that it matches the distribution of the data as closely as possible. For unary linear regression, the hypothesis function can be written as $h(x) = \theta_0 + \theta_1 x$, where $\theta_0$ is the bias and $\theta_1$ is the weight. The code is as follows:

def predict_fun(x, parameters):
    return parameters['theta0']+parameters['theta1']*x

5 Loss function

Before we start thinking about how to fit the model to the data, we need a measure of the goodness of fit. The loss function, also known as the cost function, quantifies the gap between the actual target value and the predicted value. We usually choose a non-negative number as the loss: the smaller the value, the smaller the loss, with a loss of 0 for a perfect prediction. The most commonly used loss function in regression problems is the squared error. When the predicted value for sample $x^{(i)}$ is $\hat{y}^{(i)}$ and its corresponding true label is $y^{(i)}$, the squared error can be defined by the following formula:

$$J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\left(\hat{y}^{(i)} - y^{(i)}\right)^2 = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$

This function is also known as the mean squared error. The factor $\frac{1}{2}$ is included to simplify gradient descent: when the squared term is differentiated, the exponent 2 cancels the $\frac{1}{2}$.
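A one-line check of this cancellation, differentiating a single squared-error term with respect to $\theta_1$ (recall $h_\theta(x) = \theta_0 + \theta_1 x$, so $\frac{\partial h_\theta(x)}{\partial \theta_1} = x$):

$$\frac{\partial}{\partial \theta_1}\,\frac{1}{2}\left(h_\theta(x) - y\right)^2 = \left(h_\theta(x) - y\right)\frac{\partial h_\theta(x)}{\partial \theta_1} = \left(h_\theta(x) - y\right)x$$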

The purpose of training the model is to find a set of parameters $(\theta_0, \theta_1)$ that minimizes the total loss over all training samples. The code for the loss function is as follows:

def loss_fun(y, y_predict):
    return np.sum(np.square(y_predict-y))/(2*y.size)
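For instance, a quick sanity check of these two functions on the `x` and `y` arrays generated above, with an arbitrary initial guess for the parameters:

# Arbitrary initial parameters, just to exercise predict_fun and loss_fun.
parameters = {'theta0': 0.0, 'theta1': 1.0}
y_predict = predict_fun(x, parameters)
print(loss_fun(y, y_predict))  # a large positive number before training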

6 Gradient Descent

Most deep learning algorithms involve some form of optimization. Optimization is the task of changing $x$ to minimize or maximize some function $f(x)$. We usually phrase optimization problems in terms of minimizing $f(x)$; maximization can then be achieved by minimizing $-f(x)$.

We call the function we want to minimize or maximize the objective function or criterion. When we are minimizing it, we may also call it the cost function, loss function, or error function. Although some machine learning literature assigns special meanings to these names, we use the terms interchangeably throughout this tutorial. We usually use a superscript $*$ to denote the value that minimizes or maximizes a function, writing $x^* = \arg\min f(x)$.

Suppose we have a function $y = f(x)$, where $x$ and $y$ are real numbers. The derivative of this function is written $f'(x)$ or $\frac{dy}{dx}$. The derivative $f'(x)$ gives the slope of $f(x)$ at the point $x$. In other words, it specifies how to scale a small change in the input to obtain the corresponding change in the output: $f(x + \epsilon) \approx f(x) + \epsilon f'(x)$.

The derivative is therefore useful for minimizing a function, because it tells us how to change $x$ slightly in order to make a small improvement in $y$. We know that $f(x - \epsilon \cdot \mathrm{sign}(f'(x)))$ is smaller than $f(x)$ for sufficiently small $\epsilon$, so we can reduce $f(x)$ by moving $x$ in small steps in the direction opposite to the derivative. This technique is called gradient descent.
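As a minimal sketch of this idea (a toy example, separate from the regression problem of this article), gradient descent on $f(x) = x^2$, whose derivative is $f'(x) = 2x$:

# Gradient descent on f(x) = x**2; the minimum is at x = 0.
f_prime = lambda x: 2 * x

x_val = 5.0     # arbitrary starting point
epsilon = 0.1   # learning rate (step size)
for _ in range(50):
    x_val = x_val - epsilon * f_prime(x_val)
print(x_val)  # roughly 7e-05, very close to 0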

When $f'(x) = 0$, the derivative provides no information about which direction to move. Points where $f'(x) = 0$ are called critical points or stationary points. A local minimum is a point where $f(x)$ is smaller than at all neighboring points, so it is no longer possible to reduce $f(x)$ by taking infinitesimally small steps. A local maximum is a point where $f(x)$ is larger than at all neighboring points, so it is no longer possible to increase $f(x)$ by taking infinitesimally small steps. Some critical points are neither minima nor maxima; these are called saddle points.

The point at which f(x) attains its absolute minimum (relative to all other values) is the global minimum. The function may have only one global minimum point or multiple global minimum points, and there may also be local minimum points that are not globally optimal. In the context of deep learning, the function we want to optimize may contain many suboptimal local minima, or many saddle points in very flat regions. All of this will make optimization difficult, especially when the input is multidimensional. Our aim is therefore to find the point where f is very small, but not necessarily minimal in any formal sense.

We often minimize functions with multidimensional inputs: $f: \mathbb{R}^n \to \mathbb{R}$. For the concept of "minimization" to make sense, the output must be one-dimensional (a scalar).

For functions with multidimensional inputs, we need the concept of partial derivatives. The partial derivative $\frac{\partial}{\partial x_i} f(x)$ measures how $f(x)$ changes as only $x_i$ increases. The gradient generalizes the derivative to vectors: the gradient of $f$ is the vector of all its partial derivatives, written $\nabla_x f(x)$; its $i$-th element is the partial derivative of $f$ with respect to $x_i$. In the multidimensional case, a critical point is a point where every element of the gradient is zero.

The directional derivative in the direction of a unit vector $u$ is the slope of $f$ in the direction $u$; that is, it is the derivative of the function $f(x + \alpha u)$ with respect to $\alpha$, evaluated at $\alpha = 0$. Using the chain rule, we can see that at $\alpha = 0$, $\frac{\partial}{\partial \alpha} f(x + \alpha u) = u^T \nabla_x f(x)$.

To minimize $f$, we want to find the direction in which $f$ decreases fastest. We can do this using the directional derivative:

$$\min_{u,\, u^T u = 1} u^T \nabla_x f(x) = \min_{u,\, u^T u = 1} \left\| u \right\|_2 \left\| \nabla_x f(x) \right\|_2 \cos\theta$$

where $\theta$ is the angle between $u$ and the gradient. Substituting $\left\| u \right\|_2 = 1$ and ignoring the factors that do not depend on $u$, this simplifies to $\min_u \cos\theta$, which is minimized when $u$ points in the direction opposite to the gradient. In other words, the gradient vector points uphill and the negative gradient points downhill. We can therefore decrease $f$ by moving in the direction of the negative gradient. This is known as the method of steepest descent, or gradient descent.

The suggested new point for steepest descent is:

$$x' = x - \epsilon \nabla_x f(x)$$

where $\epsilon$ is the learning rate, a positive scalar that determines the size of the step. We can choose $\epsilon$ in several different ways. A common approach is to pick a small constant. Sometimes we can solve for the step size that makes the directional derivative vanish. Another approach is to evaluate $f(x - \epsilon \nabla_x f(x))$ for several values of $\epsilon$ and choose the one that yields the smallest objective value; this strategy is called line search.
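A minimal sketch of such a line search, assuming the function `f` and its gradient `grad_f` are supplied by the caller (the candidate step sizes below are arbitrary):

import numpy as np

def line_search_step(f, grad_f, x, candidates=(1e-3, 1e-2, 1e-1, 1.0)):
    # Try several step sizes and keep the one giving the smallest f value.
    g = grad_f(x)
    best_eps = min(candidates, key=lambda eps: f(x - eps * g))
    return x - best_eps * g

# Usage on a simple quadratic bowl f(x) = x^T x.
f = lambda x: np.sum(x ** 2)
grad_f = lambda x: 2 * x
x_point = np.array([3.0, -2.0])
for _ in range(20):
    x_point = line_search_step(f, grad_f, x_point)
print(x_point)  # close to [0, 0]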

Steepest descent converges when every element of the gradient is zero (or, in practice, very close to zero). In some cases we can avoid running this iterative algorithm altogether and jump directly to a critical point by solving the equation $\nabla_x f(x) = 0$ for $x$.

While gradient descent is restricted to optimization problems in continuous spaces, the general concept of continuous small steps toward a better situation (i.e., small moves that approximate optimality) can be generalized to discrete spaces. Increasing an objective function with discrete parameters is called a hill climbing algorithm.

For the unary linear regression discussed here, the gradient is the derivative of the loss function (the mean loss over all samples in the dataset) with respect to the model parameters, so we must traverse the entire dataset before each parameter update. For the loss function $J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^{m}(h_\theta(x_i) - y_i)^2$, the gradients are:

$$\begin{aligned} \frac{\partial J(\theta_0, \theta_1)}{\partial \theta_0} &= \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x_i) - y_i\right) \\ \frac{\partial J(\theta_0, \theta_1)}{\partial \theta_1} &= \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x_i) - y_i\right) x_i \end{aligned}$$

To sum up, the steps of the gradient descent algorithm are: (1) initialize the model parameters, for example randomly; (2) choose a small positive number as the learning rate; (3) using the entire dataset as the sample, update the parameters in the direction of the negative gradient, and iterate this step repeatedly. For unary linear regression, we can write the parameter updates explicitly as follows:

$$\begin{aligned} \theta_0 &\gets \theta_0 - \epsilon \cdot \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x_i) - y_i\right) \\ \theta_1 &\gets \theta_1 - \epsilon \cdot \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x_i) - y_i\right) x_i \end{aligned}$$

The code to update the model parameters using gradient descent is as follows:

def gradient(x, y, parameters, learning_rate):
    y_predict = predict_fun(x, parameters)

    # Gradients of the loss with respect to each parameter,
    # already scaled by the learning rate.
    theta0_d = learning_rate * np.sum(y_predict - y) / y.size
    theta1_d = learning_rate * np.sum((y_predict - y) * x) / y.size

    # Step in the direction of the negative gradient.
    parameters['theta0'] = parameters['theta0'] - theta0_d
    parameters['theta1'] = parameters['theta1'] - theta1_d

    return parameters
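As a sanity check (optional, and not required for training), we can compare these analytic gradients against central-difference approximations of the loss:

def numerical_grad(x, y, parameters, eps=1e-6):
    # Central-difference approximation of dJ/dtheta0 and dJ/dtheta1.
    grads = {}
    for key in ('theta0', 'theta1'):
        p_plus = dict(parameters); p_plus[key] += eps
        p_minus = dict(parameters); p_minus[key] -= eps
        j_plus = loss_fun(y, predict_fun(x, p_plus))
        j_minus = loss_fun(y, predict_fun(x, p_minus))
        grads[key] = (j_plus - j_minus) / (2 * eps)
    return grads

params = {'theta0': 0.0, 'theta1': 1.0}
print(numerical_grad(x, y, params))
# These should closely match the analytic gradients:
residual = predict_fun(x, params) - y
print(np.sum(residual) / y.size, np.sum(residual * x) / y.size)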

7 Stochastic Gradient Descent

A recurring problem in machine learning is that good generalization requires large training sets, but large training sets are also more computationally expensive.

Cost functions in machine learning algorithms can often be decomposed into a sum of cost functions for each sample. For example,

$$J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x_i) - y_i\right)^2$$

For these additive cost functions, gradient descent needs to compute:

$$\nabla_\theta J(\theta) = \frac{1}{m}\sum_{i=1}^{m} \nabla_\theta L(x^{(i)}, y^{(i)}, \theta)$$

The computational cost of this operation is O(m). As the size of the training set grows to billions of samples, computing the gradient for one step also takes a considerable amount of time.

The core insight of stochastic gradient descent (SGD) is that the gradient is an expectation, and an expectation can be approximated from a small number of samples. Specifically, at each step of the algorithm, we uniformly sample a minibatch of samples $B = \{x^{(1)}, \dots, x^{(m')}\}$ from the training set. The minibatch size $m'$ is usually a relatively small number, from one up to a few hundred. Importantly, $m'$ is usually held fixed as the training set size $m$ grows: when fitting billions of samples, each update may use only a few hundred of them.

The gradient estimate can be expressed as:

$$g = \frac{1}{m'} \nabla_\theta \sum_{i=1}^{m'} L(x^{(i)}, y^{(i)}, \theta)$$

using the samples from minibatch $B$. The stochastic gradient descent algorithm then follows the estimated gradient downhill:

$$\theta \gets \theta - \epsilon g$$

where $\epsilon$ is the learning rate.
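To make the approximation concrete, here is a small sketch (reusing the `x` and `y` arrays generated earlier) comparing the full-batch gradient of $\theta_1$ with a minibatch estimate:

params = {'theta0': 0.0, 'theta1': 1.0}
residual = predict_fun(x, params) - y

# Full-batch gradient of theta1: uses all m samples.
full_grad = np.sum(residual * x) / x.size

# Minibatch estimate: m' = 100 samples drawn uniformly at random.
idx = np.random.choice(x.size, 100)
mini_grad = np.sum(residual[idx] * x[idx]) / idx.size

print(full_grad, mini_grad)  # similar values, at a fraction of the cost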

Gradient descent has often been considered slow or unreliable; in the past, applying it to non-convex optimization problems was regarded as reckless or unprincipled. Now we know that gradient descent works quite well for the kind of training we carry out in Section 8 of this article. An optimization algorithm is not guaranteed to reach even a local minimum in a reasonable amount of time, but it can often find a usefully small value of the cost function quickly enough to be practical.

Stochastic gradient descent has many important applications outside of deep learning. It is the main method for training large linear models on large-scale data. For a fixed-size model, the computational cost of each stochastic gradient descent update does not depend on the training set size m. In practice, when the training set size grows, we usually use a larger model, but this is not necessary. The number of updates required to reach convergence generally increases with the size of the training set. However, as m tends to infinity, the model eventually converges to the best possible test error before stochastic gradient descent has sampled all the samples on the training set. Continuing to increase m does not prolong the time to reach the best possible test error for the model. From this point of view, we can consider the asymptotic cost of training a model with SGD to be of the order O(1) as a function of m.

8 Training

8.1 Without stochastic gradient descent

Now that we have all the elements needed for model training ready, we can implement the main part of the training process. Understanding this code is crucial because when you do deep learning, you see pretty much the same training process over and over again. We first initialize the model parameters and learning rate. In each iteration, we read the training samples and run them through our model (the hypothesis function) to obtain a set of predictions. The gradient of each parameter is then calculated, and then the model parameters are updated using the gradient descent algorithm. After each iteration, we will calculate and save the loss value, so that we can check the training effect later.

parameters = {'theta0': np.random.random(),
              'theta1': np.random.random()}
learning_rate = 0.000001
losses = []

for i in range(10):
    parameters = gradient(x, y, parameters, learning_rate)
    losses.append(loss_fun(y, predict_fun(x, parameters)))
plt.plot(losses);

The visualization results of the loss value of each iteration are as follows, and the loss value has dropped to a very low level after the second iteration.

The final linear regression visualization results are as follows:

plt.scatter(x, y, s=2)
plt.plot(x, predict_fun(x, parameters), 'r-');

8.2 Using stochastic gradient descent

Deep learning uses ever larger amounts of training data, which makes stochastic gradient descent an almost inevitable choice, so understanding the following code is important for later articles.

First, initialize the model parameters and learning rate.

But before starting the training, we also define two parameters: `sample_num`, the number of samples drawn at each step of stochastic gradient descent; and `epoch_size`, the number of epochs, where one epoch means all the data has been fed through the model and used for parameter updates once. We use `epoch_size` because a single pass through the data is usually not enough to converge; the data must be revisited several times.

Then the first epoch of training starts, with the number of batches determined by `sample_num` (a batch is the set of samples drawn by SGD at one step). Note that since we are using time-series data, we should not extract the batches as contiguous blocks, for example taking the first `sample_num` samples as the first batch and the next `sample_num` samples as the second; instead, each batch is drawn at random from the whole series.

The rest of the process is the same as without SGD: first, the minibatch of training samples is run through our model to obtain a set of predictions; then the gradient of each parameter is calculated, and the model parameters are updated with the gradient descent step.

parameters = {'theta0': 0, 'theta1': 1}
learning_rate = 0.000001
losses = []

sample_num = 365
epoch_size = 2

for epoch in range(epoch_size):
    for i in range(x.size//sample_num+1):
        random_samples = np.random.choice(x.size, 
                                          sample_num)
        parameters = gradient(x[random_samples], 
                              y[random_samples], 
                              parameters, 
                              learning_rate)
        losses.append(loss_fun(x[random_samples], 
                               y[random_samples], 
                               parameters))
plt.plot(losses);

plt.scatter(x, y, s=2)
plt.plot(x, predict_fun(x, parameters), 'r-');

References:

Dive into Deep Learning, https://github.com/d2l-ai/d2l-zh.
Deep Learning, https://github.com/exacity/deeplearningbook-chinese.
