Ten commonly used deep learning algorithms

In the past decade, interest in machine learning has exploded. We see machine learning in computer programs, industry conferences, and the media almost every day. Much of the discussion about machine learning confuses "what machine learning can do" with "what humans want machine learning to do". Fundamentally, machine learning uses algorithms to extract information from raw data and represent it with some type of model, and then uses that model to make inferences about other data that the model has not yet seen.

    Neural networks are one class of machine learning models, and they have existed for at least 50 years. The basic unit of a neural network is the node, an idea loosely inspired by the biological neurons in the mammalian brain. The connections between neurons in a biological brain evolve over time; the connections between nodes in a neural network borrow this idea and also evolve, through "training".

    Many of the important architectures in neural networks were established and refined in the mid-1980s and early 1990s. However, getting good results required large amounts of time and data, and since computers were not powerful enough at the time, progress stalled and interest faded. In the early 2000s, computing power grew exponentially, and the industry witnessed a "Cambrian explosion" of computing technology that would have been unimaginable before. Deep learning emerged as a serious contender during this decade of explosive growth in compute, winning many important machine learning competitions. Its popularity showed no sign of cooling in 2017; today, deep learning appears everywhere machine learning does.

    Here is a small example made by 柳猫: a t-SNE projection of word vectors, clustered by similarity.![t-SNE projection of word vectors, clustered by similarity](https://img-blog.csdnimg.cn/20210311162451871.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl80MzIxNDY0NA==,size_16,color_FFFFFF,t_70)


https://img4.mukewang.com/5c06355e0001943706900664.jpg

    Recently, I started reading academic papers on deep learning. Based on my own research, the following papers have had a huge impact on the development of the field:

In 1998, NYU's paper "Gradient-Based Learning Applied to Document Recognition" introduced the application of convolutional neural networks to machine learning.

Toronto's 2009 paper "Deep Boltzmann Machines" proposed a new learning algorithm for Boltzmann machines that contain many layers of hidden variables.

The article "Building High-Level Features Using Large-Scale Unsupervised Learning" jointly published by Stanford and Google in 2012 solves the problem of using only unlabeled data to build advanced, specific class feature detectors. The problem.

Berkeley's 2013 paper "DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition" released DeCAF, an open-source implementation of deep convolutional activation features together with the associated network parameters, allowing vision researchers to run in-depth experiments across a range of visual concept learning paradigms.

DeepMind's 2013 paper "Playing Atari with Deep Reinforcement Learning" presented the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning.

    柳猫 has compiled 10 powerful deep learning methods that AI engineers use to solve machine learning problems. But first, we need to define what deep learning is.

    Defining deep learning is a challenge for many people, because its form has slowly changed over the past decade. The figure below shows the relationship between artificial intelligence, machine learning, and deep learning.

https://img3.mukewang.com/5c0635b3000135f406900699.jpg

    Artificial intelligence is a broad field that has been around for a long time. Deep learning is a subset of machine learning, and machine learning is a subset of artificial intelligence. Deep learning networks are usually distinguished from "typical" feedforward multilayer networks in the following ways:

Deep learning networks have more neurons than feedforward networks

The way in which deep learning networks connect layers is more complicated

Deep learning networks require "Cambrian explosion"-scale computing power for training

Deep learning networks can extract features automatically

    The "more neurons" mentioned above means that the number of neurons has grown over the years, so more complex models can be expressed. Layers have also evolved: from every layer being fully connected in multilayer networks, to locally connected patches of neurons in convolutional neural networks, and to recurrent connections back to the same neuron in recurrent neural networks (in addition to the connections from the previous layer).

    Deep learning can therefore be defined as neural networks with a large number of parameters and layers, in one of the following four basic network architectures:

Unsupervised pre-trained network

Convolutional Neural Network

Recurrent neural network

Recursive neural network

    In this article, I will mainly discuss three of these architectures:

A Convolutional Neural Network is basically a standard neural network extended across space using shared weights. Convolutional neural networks recognize images by applying convolutions internally, and those convolutions can pick out the edges of the objects being recognized in the image.

A Recurrent Neural Network is basically a standard neural network extended across time. It has edges that feed into the next time step rather than into the next layer at the same time step. Recurrent neural networks are mainly used to recognize sequences, such as speech signals or text; their internal loops mean the network has a form of short-term memory.

A Recursive Neural Network is more like a hierarchical network, in which the input sequence has no real time dimension; instead, the input must be processed hierarchically in a tree-like fashion. The following 10 methods can be applied to all of these architectures.

1. Backpropagation
Backpropagation is simply a method of computing the partial derivatives (the gradient) of a function that has the form of a composition of functions (as a neural network does). When you solve an optimization problem with a gradient-based method (gradient descent is just one of them), you need to compute the gradient of the function at each iteration.

https://img3.mukewang.com/5c0635da00011eee06000460.jpg

    For a neural network, the objective function has a compositional form. So how should the gradient be computed? There are two standard approaches, sketched in code after the list:

(1) Analytic differentiation. When the form of the function is known, the derivatives are computed directly with the chain rule (basic calculus).

(2) Approximate differentiation using finite differences. This method is computationally expensive, because the number of function evaluations is O(N), where N is the number of parameters. It is far more expensive than analytic differentiation, but finite differences are commonly used during debugging to verify a backpropagation implementation.
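A minimal sketch of both approaches, using a toy one-layer least-squares model (the data, names, and tolerances here are illustrative, not from the original article): the analytic gradient comes from the chain rule, and a finite-difference check verifies it.

```python
import numpy as np

# A tiny composite function: f(w) = mean((x·w - y)^2), a one-layer linear model.
def loss(w, x, y):
    return np.mean((x @ w - y) ** 2)

# (1) Analytic gradient via the chain rule: dL/dw = (2/N) * X^T (Xw - y)
def analytic_grad(w, x, y):
    n = x.shape[0]
    return (2.0 / n) * x.T @ (x @ w - y)

# (2) Finite-difference approximation: two loss evaluations per parameter,
#     so the cost grows with the number of parameters N.
def numeric_grad(w, x, y, eps=1e-6):
    grad = np.zeros_like(w)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        grad[i] = (loss(w_plus, x, y) - loss(w_minus, x, y)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
x, y, w = rng.normal(size=(8, 3)), rng.normal(size=8), rng.normal(size=3)
print(np.allclose(analytic_grad(w, x, y), numeric_grad(w, x, y), atol=1e-5))  # True
```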

2. Stochastic gradient descent
An intuitive way to understand gradient descent is to imagine a river starting from the top of a mountain. The river flows downhill toward the lowest point at the foot of the mountain, and that lowest point is exactly what gradient descent is trying to reach.

    Ideally, the river would not stop before reaching its final destination (the lowest point). In machine learning, this corresponds to finding the global minimum (or optimum) starting from the initial point (the mountain top). However, because of the terrain, the river's path may contain many pits that bring it to a standstill. In machine learning terms, these pits are called local optima, and they are not the result we want. There are many ways to deal with the local-optimum problem.

https://img2.mukewang.com/5c06362a00014d8706900512.jpg

    So, because of the terrain (that is, the nature of the function), gradient descent can easily get stuck in a local minimum. But if the mountain has a special shape (like a bowl, technically called a convex function), the algorithm is always able to find the optimum; when optimizing, this kind of special terrain (a convex function) is naturally the best case. In addition, where you start on the mountain top (the initial value of the function) determines the path you take to the bottom, and different flow speeds (the learning rate, or step size, of gradient descent) change how you reach the destination. Whether you fall into a pit (a local minimum) or manage to avoid it is affected by both of these factors.
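To make the analogy concrete, here is a minimal stochastic gradient descent loop on a synthetic least-squares problem (the data, mini-batch size, and step size are illustrative assumptions): at each step the gradient is estimated from one random mini-batch and the weights take a step of size `lr` "downhill".

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
true_w = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X @ true_w + 0.01 * rng.normal(size=1000)

w = np.zeros(5)
lr = 0.1                                      # the "flow speed" / step size
for step in range(500):
    idx = rng.integers(0, len(X), size=32)    # random mini-batch
    xb, yb = X[idx], y[idx]
    grad = (2.0 / len(xb)) * xb.T @ (xb @ w - yb)
    w -= lr * grad                            # move against the gradient
print(w.round(2))                             # close to true_w
```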

3. Learning rate decay
Adjusting the learning rate of the stochastic gradient descent optimization procedure can improve performance and reduce training time. This is called learning rate annealing or adaptive learning rates. The simplest and most commonly used adaptation during training is to gradually reduce the learning rate: a larger learning rate early in training makes large adjustments to the weights; later in training, the learning rate is reduced so that the weights are updated at a smaller rate. This way, good weights are learned quickly at the start and fine-tuned toward the end.

https://img1.mukewang.com/5c0636580001f7f106000337.jpg

    Two popular and simple learning rate decay schedules are the following (a small sketch follows the list):

Gradually reduce the learning rate in a linear fashion

Sharply drop the learning rate at specific points in training
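A small sketch of both schedules as plain functions (the initial rate, decay factor, and milestones are illustrative choices):

```python
# Two simple learning-rate schedules.

def linear_decay(initial_lr, step, total_steps):
    """Linearly shrink the learning rate from initial_lr toward 0."""
    return initial_lr * (1.0 - step / total_steps)

def step_decay(initial_lr, step, drop=0.1, every=30):
    """Cut the learning rate sharply (multiply by `drop`) at fixed milestones."""
    return initial_lr * (drop ** (step // every))

for step in (0, 30, 60, 90):
    print(step, round(linear_decay(0.1, step, 100), 4), step_decay(0.1, step))
```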

4. Dropout
A deep neural network with a large number of parameters is a very powerful machine learning system. However, overfitting is a serious problem in such networks. Large networks are also slow to run, which makes it very hard to deal with overfitting at test time by combining the predictions of many different large neural networks. Dropout addresses this problem.

https://img3.mukewang.com/5c0636700001b21806140328.jpg

    The key idea is to randomly drop units (along with their connections) from the neural network during training, which prevents units from co-adapting too much. During training, samples are drawn from an exponential number of differently "thinned" networks. At test time, the effect of averaging the predictions of all these thinned networks can easily be approximated by using a single unthinned network with smaller weights. This significantly reduces overfitting and gives larger improvements than other regularization methods. Dropout has been shown to improve the performance of neural networks on supervised learning tasks in computer vision, speech recognition, text classification, and computational biology, achieving state-of-the-art results on many benchmark datasets.
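A minimal sketch of the training-time behavior, using the common "inverted dropout" variant (an assumption on my part: it rescales the surviving units during training so nothing needs to change at test time, whereas the original formulation scales the weights down at test time instead):

```python
import numpy as np

def dropout_forward(activations, p_drop=0.5, training=True):
    """Inverted dropout: randomly zero units during training and rescale
    the survivors by 1/(1 - p_drop); at test time, pass activations through."""
    if not training or p_drop == 0.0:
        return activations
    mask = (np.random.rand(*activations.shape) >= p_drop)
    return activations * mask / (1.0 - p_drop)

h = np.ones((2, 8))                              # pretend hidden-layer activations
print(dropout_forward(h, 0.5))                   # about half the units are zeroed
print(dropout_forward(h, 0.5, training=False))   # unchanged at test time
```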

5. Max pooling
Max pooling is a sample-based discretization process. The goal is to down-sample an input representation (an image, a hidden-layer output matrix, etc.), reducing its dimensionality and allowing assumptions to be made about the features contained in the pooled sub-regions.

https://img2.mukewang.com/5c0636870001512e05140406.jpg

    By providing an abstracted form of the representation, this approach helps reduce overfitting to some extent. It also reduces the computational cost by cutting the number of parameters to learn, and it provides basic translation invariance in the internal representation. Max pooling is done by applying a max filter to (usually) non-overlapping sub-regions of the initial representation.
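A minimal sketch of 2x2 max pooling with stride 2 over non-overlapping regions (the input values are illustrative):

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Each output value is the maximum of a non-overlapping 2x2 patch."""
    h, w = feature_map.shape
    h2, w2 = h // 2, w // 2
    patches = feature_map[:h2 * 2, :w2 * 2].reshape(h2, 2, w2, 2)
    return patches.max(axis=(1, 3))

x = np.array([[1, 3, 2, 0],
              [4, 2, 1, 1],
              [0, 1, 5, 6],
              [2, 2, 7, 8]])
print(max_pool_2x2(x))
# [[4 2]
#  [2 8]]
```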

6. Batch normalization
Naturally, neural networks, including deep ones, require careful tuning of weight initialization and learning parameters. Batch normalization makes this process easier.

    The weight problem:

However the weights are initialized, whether randomly or chosen empirically, they end up far from the learned weights. Consider a mini-batch: in the early iterations, there will be many outliers relative to the feature activations that are actually needed.

Deep neural networks are themselves ill-conditioned: a small change in the early layers leads to a huge change in the later layers.

    During backpropagation, these phenomena distract the gradients, which means the gradients have to compensate for the outliers before they can learn weights that produce the desired output. This demands extra time to converge.

https://img1.mukewang.com/5c06371e0001d81f06900497.jpg

    Batch normalization pulls these gradients back from outliers toward normal values and steers them toward the common goal (by normalizing them) within each mini-batch.

    The learning rate problem:

Generally speaking, learning rates are kept small so that only a small fraction of the gradient corrects the weights, because the gradients from outlier activations should not affect weights that have already been learned.

With batch normalization, these outlier activations become less likely, so a larger learning rate can be used to speed up the learning process.
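A minimal sketch of the training-time forward pass of batch normalization (running statistics for inference and the backward pass are omitted; the data here is illustrative): each feature is normalized over the mini-batch and then rescaled by the learned parameters gamma and beta.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch to zero mean and unit
    variance, then apply a learned scale (gamma) and shift (beta)."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.randn(32, 4) * 10 + 5          # activations with outlier scale
out = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))  # ~0 and ~1
```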

7. Long short-term memory
A neuron in a long short-term memory (LSTM) network differs from the neurons used in other recurrent neural networks in three ways:

It can decide when to let input into the neuron

It can decide when to remember what was calculated in the previous time step

It can decide when to pass the output on to the next time step. The power of the LSTM is that it makes all of these decisions based on the current input alone. Take a look at the chart below:

https://img3.mukewang.com/5c0637380001003806900394.jpg

    The input signal x(t) at the current time step determines all three of the points above.

The input gate determines the first point,

The forget gate determines the second point,

The output gate determines the third point. All three decisions are made from the input alone. This is inspired by how the brain works: it can handle sudden context switches based on what it takes in.
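A minimal sketch of a single LSTM time step, showing how the input, forget, and output gates correspond to the three decisions above (the dimensions and random weights are illustrative; a real network learns W, U, and b):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U, b stack the parameters of the input,
    forget, and output gates plus the candidate cell update."""
    z = W @ x_t + U @ h_prev + b                   # all four pre-activations at once
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)   # input / forget / output gates
    g = np.tanh(g)                                 # candidate cell state
    c_t = f * c_prev + i * g                       # forget old memory, admit new input
    h_t = o * np.tanh(c_t)                         # decide what to pass on
    return h_t, c_t

n_in, n_hid = 3, 5
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4 * n_hid, n_in))
U = rng.normal(scale=0.1, size=(4 * n_hid, n_hid))
b = np.zeros(4 * n_hid)
h, c = np.zeros(n_hid), np.zeros(n_hid)
for x_t in rng.normal(size=(7, n_in)):             # a toy sequence of length 7
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h.round(3))
```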

8. Skip-gram
The goal of a word embedding model is to learn a dense, high-dimensional representation for each word, in which the similarity between embedding vectors reflects the semantic or syntactic similarity between the corresponding words. Skip-gram is one model for learning word embeddings. The main idea behind the skip-gram model (and many other word embedding models) is that two words are similar if they share similar contexts.

https://img3.mukewang.com/5c0638020001fcc905950404.jpg

    In other words, take a sentence such as "cats are mammals". If we replace "cats" with "dogs", the sentence is still meaningful. So in this example, "dogs" and "cats" share a similar context (namely "are mammals").

    Based on this assumption, we can consider a context window (containing K consecutive terms). We then skip one of the words and try to learn a neural network that takes every word except the skipped one and predicts the skipped word. As a result, if two words repeatedly share similar contexts across a large corpus, their embedding vectors end up similar.
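A minimal sketch of how skip-gram training pairs are generated from a toy corpus (the corpus and window size are illustrative): for every center word we emit one (center, context) pair per word inside the window, and the network is then trained to predict each context word from its center word.

```python
corpus = [["cats", "are", "mammals"],
          ["dogs", "are", "mammals"]]
window = 2

pairs = []
for sentence in corpus:
    for i, center in enumerate(sentence):
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if j != i:
                pairs.append((center, sentence[j]))

print(pairs)
# [('cats', 'are'), ('cats', 'mammals'), ('are', 'cats'), ...]
# "cats" and "dogs" end up with similar embedding vectors because they
# share the same context words ("are", "mammals").
```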

9. Continuous bag-of-words model
In natural language processing, we want to represent each word in a document as a numerical vector, so that words appearing in similar contexts have vector representations that are close to one another. In the continuous bag-of-words model, the goal is to use the context around a particular word to predict that word.

https://img2.mukewang.com/5c06377e000156f406000337.jpg

    We first sample a large number of sentences from a big corpus; every time we see a word, we also extract its context. We then feed the context words into a neural network and predict the word at the center of that context.

    Once we have thousands upon thousands of such context words and center words, we have a dataset for the neural network. We train the network, and the final output of the encoded hidden layer gives us the embedding of a particular word. When we train over a large number of sentences, it turns out that words in similar contexts get similar vectors.
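A minimal sketch of one CBOW step (the vocabulary, embedding dimension, and random weights are illustrative; a real model learns the two embedding matrices by minimizing a softmax loss over many such examples): the embeddings of the context words are averaged, and that average scores every candidate center word.

```python
import numpy as np

vocab = {"cats": 0, "dogs": 1, "are": 2, "mammals": 3}
dim = 4
rng = np.random.default_rng(0)
E_in = rng.normal(scale=0.1, size=(len(vocab), dim))    # context embeddings
E_out = rng.normal(scale=0.1, size=(len(vocab), dim))   # output (center) embeddings

context = ["cats", "mammals"]                            # window around the missing center word
h = E_in[[vocab[w] for w in context]].mean(axis=0)       # averaged context vector
scores = E_out @ h                                       # score every candidate center word
probs = np.exp(scores) / np.exp(scores).sum()            # softmax over the vocabulary
print(dict(zip(vocab, probs.round(3))))                  # training pushes probability toward "are"
```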

10. Transfer learning
Let's consider how a convolutional neural network processes an image. Suppose we have an image and apply a convolution to it; the output is a combination of pixels. Suppose those outputs are edges, and we apply another convolution: the output is now a combination of edges, that is, lines. Apply another convolution, and the output is a combination of lines, and so on. Think of each layer as looking for a specific pattern. The last layer of the network typically becomes very specialized.

    If the network is trained on ImageNet, its last layer may be looking for complete objects such as children, dogs, or airplanes. Step back a few layers, and you may find the network looking for component parts such as eyes, ears, mouths, or wheels.


https://img1.mukewang.com/5c0637ac0001ca1306380359.jpg

    Each layer of a deep convolutional network progressively builds higher-level feature representations. The last few layers are usually specialized for the input data, while the earlier layers are more general, mainly finding simple patterns shared across a wide class of images.

    Transfer learning means taking a convolutional network trained on one dataset, removing the last layer, and retraining the model's last layer on a different dataset. Intuitively, you are retraining the model to recognize different high-level features. As a result, training time drops dramatically, which makes transfer learning a useful tool when you do not have enough data or when training from scratch would take too many resources.
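A minimal transfer-learning sketch using PyTorch/torchvision (assumed to be installed; depending on the torchvision version, the pretrained weights may need to be requested with `pretrained=True` instead of the `weights` string): load a network pretrained on ImageNet, freeze its feature-extraction layers, and replace only the final layer for a new task. The number of classes is an illustrative assumption, and the dataset loading and training loop are omitted.

```python
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1")   # convolutional net pretrained on ImageNet

for param in model.parameters():                   # freeze the general early layers
    param.requires_grad = False

num_classes = 10                                   # illustrative target-task size
model.fc = nn.Linear(model.fc.in_features, num_classes)   # new, trainable last layer

# Only model.fc.parameters() are then passed to the optimizer, so training
# updates just the replaced layer and finishes much faster.
```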


Origin blog.csdn.net/weixin_43214644/article/details/114671546