Introductory machine learning (2): Linear Unit and Gradient Descent

In the previous article we studied a simple perceptron, its step activation function (which really just means binary classification, simple and clear, haha), and the perceptron rule used to train it. This article covers another kind of perceptron, the linear unit, and uses it to introduce some basic concepts of machine learning: the model, the objective function, the optimization algorithm, and so on. Together these give a simple first picture of machine learning.

1. Linear unit

The perceptron rule from the previous article cannot converge on data sets that are not linearly separable, and in that case the perceptron cannot be trained. To solve this problem, the step function of the perceptron is replaced with a differentiable linear function. The resulting perceptron is called a linear unit. When faced with a data set that is not linearly separable, a linear unit converges to the best available approximation.

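Here we use the simplest possible activation function f for the linear unit: f(x) = x

The linear unit is then expressed as follows:

y = f(w * x + b) = w * x + b

Compared with the perceptron of the previous section, the only change is the activation function: the perceptron passes w * x + b through a step function, while the linear unit outputs it unchanged.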

After this change of activation function, the linear unit returns a real value instead of a 0/1 class label, so it is used to solve regression problems rather than classification problems.
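The difference between the two activation functions can be sketched in a few lines of Python. This is only an illustration: the names `step`, `identity`, and `unit_output`, the weights, and the inputs are all made up here, and the step function is assumed to output 1 for positive input and 0 otherwise.

```python
def step(z):
    """Step activation (assumed form): outputs a 0/1 class label."""
    return 1 if z > 0 else 0

def identity(z):
    """Linear-unit activation f(x) = x: outputs the value unchanged."""
    return z

def unit_output(weights, bias, inputs, activation):
    """Weighted sum of the inputs plus the bias, passed through the activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

weights, bias = [0.5, -1.0], 0.25   # illustrative values, not trained
inputs = [2.0, 0.5]

perceptron_y = unit_output(weights, bias, inputs, step)      # a class label
linear_y = unit_output(weights, bias, inputs, identity)      # a real value
print(perceptron_y, linear_y)
```

With the same weights and inputs, the perceptron can only say which class the input falls into, while the linear unit reports the weighted sum itself.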

2. Linear unit model

A so-called model is really just a function (ps: calling it a model simply sounds a bit fancier~~haha).

In real life, we want an algorithm that predicts an output y from an input x. For example, x can be a person's working years and y their monthly salary; with a suitable algorithm we can predict a person's income from their working years. For example:

y = h(x) = w*x +b

The function h(x) is called a hypothesis, and w and b are its parameters. Suppose w = 1000 and b = 500, and x is a person's working years. For a person who has worked for 5 years, substituting into the model gives a predicted monthly salary of 1000 * 5 + 500 = 5500. Such a model is not very reliable, because it considers only working years and ignores other factors. Working years, industry, company, and rank are all pieces of information called features. A person who has worked for 3 years in the IT industry, at Alibaba, with rank T3, can be represented by a feature vector:

X = (3, IT, Ali, T3)

The input X is now a vector with four features. Correspondingly, a single parameter w is no longer enough: three more are needed, for a total of four parameters (w1, w2, w3, w4), where the subscript matches each parameter to its feature. The model becomes:

y = h(x) = w1 * x1+ w2 *x2 + w3 * x3 + w4 * x4 + b

Here x1 is the working years, x2 the industry, x3 the company, and x4 the rank.

Now let w0 equal b and let w0 correspond to a feature x0. Since x0 does not actually exist, we can simply set x0 = 1, that is:

b = w0 * x0 where x0 = 1

So the above formula becomes the following:

y = h(x) = w1 * x1 + w2 * x2 + w3 * x3 + w4 * x4 + b
         = w0 * x0 + w1 * x1 + w2 * x2 + w3 * x3 + w4 * x4

Written in vector form:

y = h(x) = w^T x

A model of this form is called a linear model, because the output y is a linear combination of the input features x1, x2, x3, ...
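As a rough sketch, the linear model with the bias folded in as w0 (and x0 = 1) is just a dot product. The function name `h` follows the text; the one-feature example assumes 5 working years, which is the value that reproduces the 5500 quoted earlier. The numeric codes for industry, company, and rank, and all the four-feature weights, are invented purely for illustration, since the text does not say how these features are turned into numbers.

```python
def h(w, x):
    """Linear model h(x) = w^T x, where x already includes x0 = 1."""
    return sum(wi * xi for wi, xi in zip(w, x))

# One-feature salary example: w = 1000, b = 500.
w = [500, 1000]        # [w0, w1], with w0 playing the role of b
x = [1, 5]             # [x0, x1]: x0 is always 1, x1 = 5 working years
salary_one_feature = h(w, x)
print(salary_one_feature)

# Four-feature example with hypothetical numeric codes and weights.
w4 = [500, 1000, 200, 300, 500]
x4 = [1, 3, 1, 2, 3]   # [x0, years, industry code, company code, rank code]
salary_four_features = h(w4, x4)
print(salary_four_features)
```

Treating b as w0 means the training code only ever has to deal with one parameter vector.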

3. Supervised learning and unsupervised learning

We now have a model, but how do we train it? How do we choose the parameters w?

Supervised learning: a class of machine learning methods in which the model is trained on many samples, each of which includes both the input features X and the corresponding output y (y is also called the label). In other words, we need data on many people: their working years, industry, company, and rank, together with their income. We train the model on these samples, so that it sees both the features of each input (X) and the answer to the corresponding question (the label y). Given enough samples, the model can summarize the underlying rules and then predict the answer for inputs (X) it has never seen.

Unsupervised learning: in this class of methods the training samples contain only X, with no y. The model can summarize some regularities of the features x, but it cannot know the corresponding answer y. (That is, given an input x the model still produces an output y, but you do not know whether that value of y is correct.)

In many cases, very few samples have both x and y, and most samples have only x. For example, in a speech-to-text (STT) recognition task, x is the speech and y is the text corresponding to the speech. We can obtain a lot of speech data, but transcribing the speech and labeling it with the corresponding text is very time-consuming. To make up for the lack of labeled samples, we can first use unsupervised learning to do some clustering, letting the model work out which syllables are similar, and then use a small number of labeled training samples to tell the model which text some of those syllables correspond to. The model can then map similar syllables to the same text, which completes the training.

4. The objective function of the linear unit

For now we consider only supervised learning:

In supervised learning, for each sample we know its features x and its label y. We can also compute an output $\bar{y}$ from the model h(x). Here y is the label in the training sample, that is, the actual value, while $\bar{y}$ (written with a bar) is the predicted value computed by the model. If $\bar{y}$ is very close to y, the model is good.

How do we express how close $\bar{y}$ is to y? We can use one half of the square of their difference:

$$e = \frac{1}{2}\left(y - \bar{y}\right)^2$$

Here e is the error of a single sample (the factor 1/2 is there only to make later calculations easier).

The training data contains many samples. If there are N of them, we can use the sum of the errors of all the training samples to represent the error E of the model, as follows:

$$E = e^{(1)} + e^{(2)} + e^{(3)} + \dots + e^{(N)}$$

Here $e^{(1)}$ is the error of the first sample, $e^{(2)}$ the error of the second, and so on up to the last sample.

Written as a sum:

$$E = \sum_{i=1}^{N} e^{(i)} = \frac{1}{2}\sum_{i=1}^{N}\left(y^{(i)} - \bar{y}^{(i)}\right)^2 \qquad \text{(Equation 2)}$$

In Equation 2, $x^{(i)}$ is the feature vector of the i-th training sample and $y^{(i)}$ is the label of the i-th sample, so the i-th training sample can be written as the tuple $(x^{(i)}, y^{(i)})$.

$\bar{y}^{(i)} = h(x^{(i)}) = w^T x^{(i)}$ is the model's predicted value for the i-th sample.

For a given training set, the smaller the error, the better the model. For a specific training data set, the values of $x^{(i)}$ and $y^{(i)}$ are all known, so Equation 2 is really a function of the parameters w.

In summary, training the model really means finding a suitable w that makes the value of Equation 2 as small as possible. Finding this minimum is a mathematical optimization problem, and E(w) is the optimization objective, that is, our objective function.
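The objective function can be evaluated directly for any candidate w. The sketch below assumes the form E(w) = 1/2 * sum of (y - y_bar)^2 and uses the five (working years, salary) samples from the test code later in this article; the function names `predict` and `objective` are illustrative.

```python
def predict(w, x):
    """Model prediction y_bar = w^T x, with x[0] = 1 as the bias feature."""
    return sum(wi * xi for wi, xi in zip(w, x))

def objective(w, samples):
    """E(w) = 1/2 * sum over all samples of (y - y_bar)^2."""
    return 0.5 * sum((y - predict(w, x)) ** 2 for x, y in samples)

# (x, y) tuples: x = [x0, working years], y = monthly salary
samples = [([1, 5], 5500), ([1, 3], 2300), ([1, 8], 7600),
           ([1, 1.4], 1800), ([1, 10.1], 11400)]

E_guess = objective([500, 1000], samples)   # w0 = 500, w1 = 1000 from the text
E_zero = objective([0, 0], samples)         # an untrained all-zero model
print(E_guess, E_zero)
```

The hand-picked parameters give a much smaller E than the all-zero ones, which is exactly the sense in which one w is better than another.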

 

5. Gradient descent optimization algorithm

In mathematics class we learned how to find the extreme values of a function. An extreme point of y = f(x) is a point where its derivative f'(x) = 0, so we can find the extreme points of the function by solving the equation f'(x) = 0.

A computer instead finds the extreme point numerically, through repeated calculation, as follows:

First, we pick a starting point at random, say x0. Each iteration then moves x to x1, x2, x3, ... and after many iterations we reach the minimum point of the function.

Every time we modify x, we need to move toward the minimum of the function, which means moving x in the direction opposite to the gradient of y = f(x). The gradient is a vector that points in the direction in which the function value rises fastest, so the opposite direction is the one in which the function value drops fastest. By moving x against the gradient at every step, we can walk to the vicinity of the minimum.

We say the vicinity of the minimum rather than the minimum point itself because the step length of each move is never exactly right: the last iteration may go too far and step straight over the minimum point. The choice of step length matters. If it is too small, many rounds of iteration are needed to get near the minimum; if it is too large, the iterates may jump far past the minimum and fail to converge to a good point.

The gradient descent algorithm is given by:

$$x_{new} = x_{old} - \eta \nabla f(x)$$

where $\nabla$ is the gradient operator, $\nabla f(x)$ is the gradient of f(x), and $\eta$ is the step size (learning rate).
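To see the effect of the step size, here is a minimal one-dimensional sketch. It assumes the update rule x_new = x_old - eta * f'(x_old) and an example function f(x) = (x - 2)^2 chosen only for illustration; the function name `gradient_descent` and the three learning rates are likewise assumptions.

```python
def gradient_descent(grad, x0, eta, steps):
    """Repeat x_new = x_old - eta * grad(x_old) and return the final x."""
    x = x0
    for _ in range(steps):
        x = x - eta * grad(x)
    return x

# Example function f(x) = (x - 2)^2, whose minimum is at x = 2.
def grad_f(x):
    return 2 * (x - 2)

x_small = gradient_descent(grad_f, x0=10.0, eta=0.01, steps=50)  # too small
x_good = gradient_descent(grad_f, x0=10.0, eta=0.1, steps=50)    # reasonable
x_large = gradient_descent(grad_f, x0=10.0, eta=1.1, steps=50)   # too large
print(x_small, x_good, x_large)
```

After the same 50 steps, the small step size is still far from x = 2, the moderate one is very close to it, and the large one has overshot further and further on every step and ended up farther away than where it started.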

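The objective function from the previous section (Equation 2) can be written as:

$$E(w) = \frac{1}{2}\sum_{i=1}^{N}\left(y^{(i)} - \bar{y}^{(i)}\right)^2$$

The gradient descent algorithm can then be written as:

$$w_{new} = w_{old} - \eta \nabla E(w)$$

Similarly, if we want the maximum value instead, we can use the gradient ascent algorithm, whose parameter update rule is:

$$w_{new} = w_{old} + \eta \nabla E(w)$$

Next we need to obtain $\nabla E(w)$ and substitute it into the rule above to get the parameter update rule of the linear unit.

The gradient of the objective function is:

$$\nabla E(w) = -\sum_{i=1}^{N}\left(y^{(i)} - \bar{y}^{(i)}\right) x^{(i)}$$

So the final parameter update rule of the linear unit is:

$$w_{new} = w_{old} + \eta \sum_{i=1}^{N}\left(y^{(i)} - \bar{y}^{(i)}\right) x^{(i)} \qquad \text{(Equation 3)}$$

With this rule in hand, you can write the code for training the linear unit.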

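Note that if each sample has M features, then x and w in the formulas above are both (M+1)-dimensional vectors (because we added a dummy feature x0 that is always 1, see the earlier section), while y is a scalar. In mathematical notation:

$$x, w \in \mathbb{R}^{M+1}, \qquad y \in \mathbb{R}$$

That is, w and x are (M+1)-dimensional column vectors, and Equation 3 can be written out as:

$$\begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_M \end{bmatrix}_{new} = \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_M \end{bmatrix}_{old} + \eta \sum_{i=1}^{N}\left(y^{(i)} - \bar{y}^{(i)}\right) \begin{bmatrix} 1 \\ x_1^{(i)} \\ \vdots \\ x_M^{(i)} \end{bmatrix}$$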
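A minimal sketch of how this update might look in code, assuming Equation 3 is the rule w_new = w_old + eta * sum of (y - y_bar) * x over all samples. The names `predict` and `batch_update`, the toy data (generated from y = 2 * x1 + 1), the learning rate, and the iteration count are all assumptions made for illustration.

```python
def predict(w, x):
    """y_bar = w^T x, where x[0] = 1 is the dummy bias feature."""
    return sum(wi * xi for wi, xi in zip(w, x))

def batch_update(w, samples, eta):
    """One application of the update rule, using every training sample."""
    # total[j] accumulates sum_i (y_i - y_bar_i) * x_i[j]
    total = [0.0] * len(w)
    for x, y in samples:
        error = y - predict(w, x)
        for j, xj in enumerate(x):
            total[j] += error * xj
    return [wj + eta * tj for wj, tj in zip(w, total)]

# Toy data generated from y = 2 * x1 + 1 (illustration only).
samples = [([1.0, 0.0], 1.0), ([1.0, 1.0], 3.0),
           ([1.0, 2.0], 5.0), ([1.0, 3.0], 7.0)]

w = [0.0, 0.0]                  # [w0, w1]
for _ in range(500):
    w = batch_update(w, samples, eta=0.05)
print(w)                        # close to [1.0, 2.0]
```

Every component of w, including w0, is updated by the same formula, which is the practical benefit of adding the dummy feature x0 = 1.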

6. Derivation of the objective function

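Whether or not you remember your calculus, here is the key fact: the gradient of a function is defined as its partial derivatives with respect to each of its variables.

So we write:

$$\nabla E(w) = \frac{\partial}{\partial w} E(w) = \frac{\partial}{\partial w} \frac{1}{2}\sum_{i=1}^{N}\left(y^{(i)} - \bar{y}^{(i)}\right)^2$$

To solve this, first note that the derivative of a sum equals the sum of the derivatives (I will not go into the details, look it up yourself, writing it all out would be exhausting~~), so we can first compute the derivative inside the summation sign and then add the results up:

$$\frac{\partial}{\partial w} \frac{1}{2}\sum_{i=1}^{N}\left(y^{(i)} - \bar{y}^{(i)}\right)^2 = \frac{1}{2}\sum_{i=1}^{N} \frac{\partial}{\partial w}\left(y^{(i)} - \bar{y}^{(i)}\right)^2$$

Now find the derivative inside the sum:

$$\frac{\partial}{\partial w}\left(y^{(i)} - \bar{y}^{(i)}\right)^2 = \frac{\partial}{\partial w}\left({y^{(i)}}^2 - 2\bar{y}^{(i)} y^{(i)} + {\bar{y}^{(i)}}^2\right)$$

We know that y is a constant that has nothing to do with w, while $\bar{y}^{(i)} = w^T x^{(i)}$ does depend on w. Applying the chain rule (the rule for differentiating a composite function):

$$\frac{\partial}{\partial w}\left(y^{(i)} - \bar{y}^{(i)}\right)^2 = \frac{\partial \left(y^{(i)} - \bar{y}^{(i)}\right)^2}{\partial \bar{y}^{(i)}} \cdot \frac{\partial \bar{y}^{(i)}}{\partial w}$$

Calculate the two partial derivatives on the right-hand side separately:

$$\frac{\partial \left(y^{(i)} - \bar{y}^{(i)}\right)^2}{\partial \bar{y}^{(i)}} = \frac{\partial}{\partial \bar{y}^{(i)}}\left({y^{(i)}}^2 - 2\bar{y}^{(i)} y^{(i)} + {\bar{y}^{(i)}}^2\right) = -2y^{(i)} + 2\bar{y}^{(i)}$$

$$\frac{\partial \bar{y}^{(i)}}{\partial w} = \frac{\partial}{\partial w} w^T x^{(i)} = x^{(i)}$$

Substituting, the partial derivative inside the sum is:

$$\frac{\partial}{\partial w}\left(y^{(i)} - \bar{y}^{(i)}\right)^2 = 2\left(-y^{(i)} + \bar{y}^{(i)}\right) x^{(i)}$$

Finally, putting this back into the sum gives the gradient:

$$\nabla E(w) = \frac{1}{2}\sum_{i=1}^{N} 2\left(-y^{(i)} + \bar{y}^{(i)}\right) x^{(i)} = -\sum_{i=1}^{N}\left(y^{(i)} - \bar{y}^{(i)}\right) x^{(i)}$$

Phew~~ it's finally over, I'm worn out.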
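One way to gain confidence in a derivation like this is to compare the resulting formula with a numerical estimate. The sketch below assumes the result of the derivation is grad E(w) = -sum of (y - y_bar) * x and checks it against central finite differences; the function names, the three samples, and the test point w are invented for the check.

```python
def predict(w, x):
    """y_bar = w^T x, with x[0] = 1 as the bias feature."""
    return sum(wi * xi for wi, xi in zip(w, x))

def objective(w, samples):
    """E(w) = 1/2 * sum_i (y_i - y_bar_i)^2."""
    return 0.5 * sum((y - predict(w, x)) ** 2 for x, y in samples)

def analytic_gradient(w, samples):
    """grad E(w) = -sum_i (y_i - y_bar_i) * x_i (the formula being checked)."""
    grad = [0.0] * len(w)
    for x, y in samples:
        error = y - predict(w, x)
        for j, xj in enumerate(x):
            grad[j] -= error * xj
    return grad

def numeric_gradient(w, samples, h=1e-6):
    """Central finite differences: (E(w + h*e_j) - E(w - h*e_j)) / (2h)."""
    grad = []
    for j in range(len(w)):
        w_plus, w_minus = list(w), list(w)
        w_plus[j] += h
        w_minus[j] -= h
        diff = objective(w_plus, samples) - objective(w_minus, samples)
        grad.append(diff / (2 * h))
    return grad

samples = [([1.0, 0.5], 2.0), ([1.0, 1.5], 3.5), ([1.0, 3.0], 7.0)]
w = [0.3, -0.2]
g_analytic = analytic_gradient(w, samples)
g_numeric = numeric_gradient(w, samples)
print(g_analytic)
print(g_numeric)
```

The two printed vectors should agree to several decimal places; a sign error or a missing factor in the derivation would show up immediately as a mismatch.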

7. Stochastic Gradient Descent (SGD)

If we train the model with Equation 3 from the gradient descent section, then every update of w has to traverse all of the training samples. This algorithm is called batch gradient descent (BGD).

But when the number of samples is particularly large, reaching millions or even hundreds of millions, the SGD algorithm is more commonly used. In SGD, each update of w uses only one sample. For training data with millions of samples, one pass over the data therefore updates w millions of times, which greatly improves efficiency. Because of the noise and randomness of individual samples, a single update of w does not necessarily move in the direction of reducing E. Even so, a large number of updates moves in the direction of decreasing E on the whole, so the algorithm can finally converge to near the minimum. The following figure shows the difference between SGD and BGD:

As shown in the figure above, the ellipses represent contour lines of the function value, and the center of the ellipses is the minimum point of the function. Red is the approximation curve of BGD, and purple is the approximation curve of SGD. We can see that BGD keeps moving toward the lowest point, while SGD fluctuates a lot but still approaches the lowest point in the end.

Finally, note that SGD is not only efficient: its randomness is sometimes a good thing. The objective function in this article is a convex function, so the unique global minimum can be found by moving against the gradient. For non-convex functions, however, there are many local minima, and randomness helps us escape some bad local minima and thus obtain a better model.
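The single-sample update can be sketched as follows. This is an illustration only, assuming the SGD rule is the batch rule with the sum dropped, w_new = w_old + eta * (y - y_bar) * x; the name `sgd_epoch`, the toy data (generated from y = 2 * x1 + 1), the seed, and the hyperparameters are all assumptions.

```python
import random

def predict(w, x):
    """y_bar = w^T x, with x[0] = 1 as the bias feature."""
    return sum(wi * xi for wi, xi in zip(w, x))

def sgd_epoch(w, samples, eta, rng):
    """One pass over the data, updating w after every single sample."""
    order = list(samples)
    rng.shuffle(order)                   # visit the samples in random order
    for x, y in order:
        error = y - predict(w, x)
        # Single-sample update: w <- w + eta * (y - y_bar) * x
        w = [wj + eta * error * xj for wj, xj in zip(w, x)]
    return w

# Toy data generated from y = 2 * x1 + 1 (illustration only).
samples = [([1.0, 0.0], 1.0), ([1.0, 1.0], 3.0),
           ([1.0, 2.0], 5.0), ([1.0, 3.0], 7.0)]

rng = random.Random(0)                   # fixed seed for repeatability
w = [0.0, 0.0]
for _ in range(500):
    w = sgd_epoch(w, samples, eta=0.05, rng=rng)
print(w)                                 # close to [1.0, 2.0]
```

Compared with the batch version, w changes once per sample instead of once per pass, so one pass over four samples produces four updates.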

8. Implementing the linear unit

Compare with the perceptron model of the previous article:

Comparing the two in the table above, we find that apart from the activation function f, the models and training rules of the two are the same (in the table above, the optimization algorithm of the linear unit is the SGD algorithm). So all we need to do is replace the activation function of the perceptron.

The linear unit is implemented by inheriting from the Perceptron class of the previous article:

# -*- coding: utf-8 -*-
# !/usr/bin/env python
# @Time    : 2019/4/29 15:23
# @Author  : xhh
# @Desc    :  
# @File    : test_linearModel.py
# @Software: PyCharm
from test_perceptorn import Perceptron


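# Define the activation function f(x) = x
f = lambda x: x

class LinearUnit(Perceptron):
    def __init__(self, input_num):
        """
        Initialize the linear unit and set the number of input parameters.
        :param input_num: number of input features
        """
        Perceptron.__init__(self, input_num, f)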
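The Perceptron class itself lives in the previous article and is not shown here. For readers who do not have it at hand, the following is a hypothetical stand-in, not the author's actual class: it only assumes the interface this article uses (a constructor taking `input_num` and an activation function, `predict`, `train(input_vecs, labels, iteration, rate)`, and the attributes `weight` and `bias`) and a per-sample update of the form w <- w + rate * (label - output) * x.

```python
class Perceptron(object):
    """Hypothetical stand-in for the Perceptron class of the previous article."""

    def __init__(self, input_num, activator):
        self.activator = activator
        self.weight = [0.0] * input_num      # one weight per input feature
        self.bias = 0.0

    def __str__(self):
        return 'weight: %s\nbias: %f' % (self.weight, self.bias)

    def predict(self, input_vec):
        """Return activator(w . x + b)."""
        z = sum(w * x for w, x in zip(self.weight, input_vec)) + self.bias
        return self.activator(z)

    def train(self, input_vecs, labels, iteration, rate):
        """Run `iteration` passes over the data, updating after every sample."""
        for _ in range(iteration):
            for input_vec, label in zip(input_vecs, labels):
                delta = label - self.predict(input_vec)
                self.weight = [w + rate * delta * x
                               for w, x in zip(self.weight, input_vec)]
                self.bias += rate * delta


# Quick check with the identity activation, i.e. a linear unit.
unit = Perceptron(1, lambda x: x)
unit.train([[5], [3], [8], [1.4], [10.1]],
           [5500, 2300, 7600, 1800, 11400], 10, 0.01)
print(unit)
```

With a class of this shape, LinearUnit needs no code of its own beyond passing f(x) = x to the parent constructor.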

Simulate some data to test:

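def get_training_dataset():
    """
    Made-up income data for 5 people.
    :return: (input_vecs, labels)
    """
    # Build the training data.
    # List of input vectors; each item is a number of working years.
    input_vecs = [[5], [3], [8], [1.4], [10.1]]
    # List of expected outputs (monthly salary), in the same order as the inputs.
    labels = [5500, 2300, 7600, 1800, 11400]
    return input_vecs, labels

def train_linear_unit():
    """
    Train a linear unit on the data.
    :return: the trained linear unit
    """
    # Create the linear unit with 1 input feature (working years).
    lu = LinearUnit(1)

    # Train for 10 iterations with a learning rate of 0.01.
    input_vecs, labels = get_training_dataset()
    lu.train(input_vecs, labels, 10, 0.01)

    # Return the trained linear unit.
    return lu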

def plot(linear_unit):
    import matplotlib.pyplot as plt
    input_vecs, labels = get_training_dataset()
    fig = plt.figure()
    ax = fig.add_subplot(111)
    # Build real lists: in Python 3, map() returns an iterator,
    # which matplotlib cannot plot directly.
    ax.scatter([vec[0] for vec in input_vecs], labels)
    weights = linear_unit.weight
    bias = linear_unit.bias
    x = list(range(0, 12, 1))
    y = [weights[0] * xi + bias for xi in x]
    ax.plot(x, y)
    plt.show()


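if __name__ == '__main__':
    # Train the linear unit.
    linear_unit = train_linear_unit()
    # Print the weights obtained by training.
    print(linear_unit)
    # Test. Each message states the same number of years that is passed to predict().
    print('Worked 3.4 years, monthly income = %.2f' % linear_unit.predict([3.4]))
    print('Worked 10 years, monthly income = %.2f' % linear_unit.predict([10]))
    print('Worked 6.3 years, monthly income = %.2f' % linear_unit.predict([6.3]))
    plot(linear_unit)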

The results are as follows:

The fitted straight line is as follows:

Summary:

From what has been written above, a machine learning algorithm really has only two parts:

 1. The model: the function h(x) that predicts the output y from the input features x.

 2. The objective function: the parameter values at which the objective function takes its minimum (maximum) value are the optimal values of the model's parameters. Usually, however, we can only find a local minimum (maximum) of the objective function, so we only get locally optimal values of the model parameters.

What remains is to use an optimization algorithm to find the minimum (maximum) value of the objective function. The [stochastic] gradient {descent|ascent} algorithm is one such optimization algorithm. For the same objective function, different optimization algorithms lead to different training rules.

 

Reference materials:

  1. Tom M. Mitchell, "Machine Learning", translated by Zeng Huajun et al., China Machinery Industry Press

https://www.zybuluo.com/hanbingtao/note/448086


Origin blog.csdn.net/weixin_39121325/article/details/89638932