2. Optimization method: gradient descent (BGD, SGD, Min_BGD)

One, loss function

The loss function is used to estimate the degree of inconsistency between the model's predicted value f(X) and the true value Y. It is a non-negative real-valued function, usually denoted L(Y, f(X)). The smaller the loss value, the better the robustness of the model.

The Loss Function is the core part of the empirical risk function and an important part of the structural risk function.

The model's overall risk function (the structural risk) consists of an empirical risk term plus a regularization term, and is usually expressed as follows:
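The original figure is not reproduced here; a standard way to write this regularized objective, with $N$ training samples, a per-sample loss $L$, a regularizer $\Phi(\theta)$ and a regularization weight $\lambda$ (notation assumed, not taken from the missing figure), is:

$\theta^{*}=\arg\min_{\theta}\ \frac{1}{N}\sum_{i=1}^{N}L\big(y_{i},f(x_{i};\theta)\big)+\lambda\,\Phi(\theta)$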

1.1 Mean square loss function

It is the loss used by the least squares method. The basic principle is that the best-fit curve should minimize the sum of the distances from all points to the regression line, where Euclidean distance is usually used to measure this distance. The squared loss function is as follows:
$L(Y|f(X))=\sum (Y-f(X))^{2}$
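As a tiny illustration (not part of the original post), the squared loss can be computed directly with NumPy; the arrays below are made-up example values:

import numpy as np

def squared_loss(y_true, y_pred):
    # L(Y|f(X)) = sum of squared residuals (Y - f(X))^2
    return np.sum((y_true - y_pred) ** 2)

# made-up example: true values vs. model predictions
print(squared_loss(np.array([1.0, 2.0, 3.0]), np.array([1.1, 1.9, 3.2])))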

1.2 Log loss function

The logistic regression loss function is the log loss function.

$L(Y|f(X))=-\log P(Y|X)$
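A minimal sketch (not from the original post) of the log loss for binary labels y in {0, 1}, where p_pred is the model's predicted probability P(y = 1 | x); the values are made up:

import numpy as np

def log_loss(y_true, p_pred, eps=1e-12):
    # negative log-likelihood -log P(Y|X), averaged over the samples
    p_pred = np.clip(p_pred, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))

print(log_loss(np.array([1, 0, 1]), np.array([0.9, 0.2, 0.7])))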

1.3 Exponential loss function

AdaBoost uses the exponential loss function. Its standard form is as follows:
$L(Y|f(X))=\exp[-Yf(X)]$
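A minimal sketch (not from the original post) of the exponential loss with labels y in {-1, +1}, averaged over the samples here; the values are made up:

import numpy as np

def exp_loss(y_true, f_pred):
    # exponential loss exp(-y * f(x)), as used by AdaBoost
    return np.mean(np.exp(-y_true * f_pred))

print(exp_loss(np.array([1, -1, 1]), np.array([0.8, -0.5, 2.0])))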

Two, optimization method

The optimization problem: given an objective function $L(Y|f(X))$, we need to find the model parameters that make the value of $L(Y|f(X))$ smallest. In the machine learning field, there are three common forms of gradient descent: batch gradient descent (BGD), stochastic gradient descent (SGD), and mini-batch gradient descent (Min_BGD). All three are variants of the basic gradient descent update sketched below.
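As a common reference form (not written out in the original post), the generic gradient descent update for the parameters $\theta$ with learning rate (step size) $\alpha$ is:

$\theta \leftarrow \theta - \alpha\,\nabla_{\theta}L(\theta)$

BGD estimates the gradient from all samples, SGD from a single sample, and Min_BGD from a small batch.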

2.1 Batch gradient descent (BGD)

The basic idea: this is the most primitive form of gradient descent; every parameter update uses all of the samples (see the update rule and the code below).

Advantages: converges to the global optimum for convex objectives; easy to parallelize;

Disadvantages: When the number of samples is large, the training process will be very slow.

Applicable scenario: the sample size is relatively small.
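For the linear model $f(x)=\theta^{T}x$ used in the code below (with a column of ones folded into $x$ so that $\theta_{0}$ is the bias), the per-iteration BGD update the code performs is, as a sketch inferred from the code (note it does not average over $m$):

$\theta \leftarrow \theta - \alpha\sum_{j=1}^{m}\big(\theta^{T}x_{j}-y_{j}\big)\,x_{j}$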

import numpy as np
import matplotlib.pyplot as plt

def BGD(x_vals, y_vals):

    alpha = 0.001      # learning rate (step size)
    loop_max = 100     # maximum number of iterations
    theta = np.random.randn(2)  # parameters: theta[0] is the bias, theta[1] the weight
    m = len(x_vals)
    b = np.full(m, 1.0)                 # bias column of ones
    x_vals = np.vstack([b, x_vals]).T   # design matrix: each row is [1, x]
    error = np.zeros(2)                 # previous theta, used for the convergence check
    for i in range(loop_max):
        sum_m = np.zeros(2)
        for j in range(m):
            # gradient contribution of sample j: (theta . x_j - y_j) * x_j
            dif = (np.matmul(theta, x_vals[j]) - y_vals[j]) * x_vals[j]
            sum_m = sum_m + dif
        theta = theta - alpha * sum_m   # update once per iteration, using ALL samples
        if np.linalg.norm(theta - error) < 0.001:   # stop when theta barely changes
            break
        else:
            error = theta
        print('loop count = %d' % i, '\t theta:',theta)
    return theta

if __name__ == '__main__':
    np.random.seed(0)
    # iris = datasets.load_iris()
    # x_vals = np.array([x[3] for x in iris.data])
    # y_vals = np.array([y[0] for y in iris.data])
    x_vals = np.arange(0., 10., 0.2)
    y_vals = 2 * x_vals + 5 + np.random.randn(len(x_vals))
    theta = BGD(x_vals, y_vals)

    #  plot the samples and the fitted line
    plt.plot(x_vals, y_vals, 'g*')
    plt.plot(x_vals, theta[1] * x_vals + theta[0], 'r')
    plt.show()

[Figure: sample points and the line fitted by BGD]

2.2 Stochastic Gradient Descent (SGD)

The basic idea: in each iteration, the parameters are updated using a single sample (see the update rule and the code below).

Advantages: fast training speed;

Disadvantages: lower accuracy, and the solution is not necessarily the global optimum; not easy to parallelize.

Applicable scenario: the sample size is very large, or online learning.
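For the same linear model $f(x)=\theta^{T}x$, the per-sample SGD update the code below performs is, as a sketch inferred from the code:

$\theta \leftarrow \theta - \alpha\big(\theta^{T}x_{j}-y_{j}\big)\,x_{j}$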

import numpy as np
import matplotlib.pyplot as plt

def SGD(x_vals, y_vals):

    alpha = 0.01       # learning rate (step size)
    loop_max = 100     # maximum number of iterations
    theta = np.random.randn(2)  # parameters: theta[0] is the bias, theta[1] the weight
    m = len(x_vals)
    b = np.full(m, 1.0)                 # bias column of ones
    x_vals = np.vstack([b, x_vals]).T   # design matrix: each row is [1, x]
    error = np.zeros(2)                 # previous theta, used for the convergence check
    np.random.seed(0)
    for i in range(loop_max):

        for j in range(m):
            # update theta immediately after each single sample
            dif = np.matmul(theta, x_vals[j]) - y_vals[j]
            theta = theta - alpha * dif * x_vals[j]

        if np.linalg.norm(theta - error) < 0.001:
            break
        else:
            error = theta

        print('loop count = %d' % i, '\t theta:',theta)
    return theta

if __name__ == '__main__':

    x_vals = np.arange(0., 10., 0.2)
    y_vals = 2 * x_vals + 5 + np.random.randn(len(x_vals))

    theta = SGD(x_vals, y_vals)

    #  plot the samples and the fitted line
    plt.plot(x_vals, y_vals, 'g*')
    plt.plot(x_vals, theta[1] * x_vals + theta[0], 'r')
    plt.show()

[Figure: sample points and the line fitted by SGD]

2.3 Mini-batch gradient descent (Min_BGD)

The basic idea: it combines the ideas of BGD and SGD; each iteration uses a small batch of the data to perform a BGD-style update (see the update rule and the code below).

Advantages: it overcomes the shortcomings of the two methods above while retaining the advantages of both;

Applicable scenario: most practical situations.
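With a mini-batch $B$ of size $b$ ($b$ = batch_size = 5 in the code below), the update performed for each mini-batch is, as a sketch inferred from the code:

$\theta \leftarrow \theta - \frac{\alpha}{b}\sum_{k \in B}\big(\theta^{T}x_{k}-y_{k}\big)\,x_{k}$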

import numpy as np
import matplotlib.pyplot as plt

def min_batch(x_vals, y_vals):

    alpha = 0.01       # learning rate (step size)
    loop_max = 100     # maximum number of iterations
    batch_size = 5     # number of samples per mini-batch
    theta = np.random.randn(2)  # parameters: theta[0] is the bias (b), theta[1] the weight (w)
    m = len(x_vals)

    b = np.full(m, 1.0)                 # bias column of ones
    x_vals = np.vstack([b, x_vals]).T   # design matrix: each row is [1, x]

    error = np.zeros(2)                 # previous theta, used for the convergence check
    np.random.seed(0)
    for j in range(loop_max):
        for i in range(1, m, batch_size):
            sum_m = np.zeros(2)
            # accumulate the gradient over one mini-batch of batch_size samples
            for k in range(i - 1, i + batch_size - 1, 1):
                dif = (np.dot(theta, x_vals[k]) - y_vals[k]) * x_vals[k]
                sum_m = sum_m + dif
            theta = theta - alpha * (1.0 / batch_size) * sum_m   # averaged mini-batch update

        if np.linalg.norm(theta - error) < 0.001:
            break
        else:
            error = theta

        print('loop count = %d' % j, '\t theta:',theta)
    return theta

if __name__ == '__main__':

    x_vals = np.arange(0., 10., 0.2)
    y_vals = 2 * x_vals + 5 + np.random.randn(len(x_vals))

    theta = min_batch(x_vals, y_vals)

    #  plot the samples and the fitted line
    plt.plot(x_vals, y_vals, 'g*')
    plt.plot(x_vals, theta[1] * x_vals + theta[0], 'r')
    plt.show()

[Figure: sample points and the line fitted by Min_BGD]


Origin blog.csdn.net/weixin_41044112/article/details/108063783