Complete notes for all five courses of Andrew Ng's deeplearning.ai specialization


Source: Heart of the Machine

This article is about 3,744 words long; estimated reading time is 8 minutes.
This article explains how to build models for natural language, audio, and other sequence data.

Since Andrew Ng released the deeplearning.ai specialization, many learners have completed all of its courses and taken careful notes. Last month the fifth course of deeplearning.ai was released, bringing the series to a close. Mahmoud Badry has open-sourced complete notes for all five courses on GitHub, covering the knowledge points in detail, including sequence models. Below we briefly describe the project and then focus on the fifth course, Sequence Models.


Project address: https://github.com/mbadry1/DeepLearning.ai-Summary


Last week, Andrew Ng shared on Twitter an infographic for the deep learning specialization created by Tess Ferrandez, which beautifully summarizes the knowledge points and highlights of the courses. For a detailed introduction to that infographic, see: the deep learning course notes that Andrew Ng liked.


Ever since the deeplearning.ai courses launched, they have attracted a great deal of attention, and many readers have actively taken part in learning. The notes completed by Mahmoud Badry are divided into five parts, corresponding to the basics of neural networks and deep learning, techniques and methods for improving the performance of deep neural networks, structuring machine learning projects, convolutional neural networks, and sequence models. Notably, the notes are very detailed and cover essentially all the knowledge points of the five courses. For example, the notes for the first course record everything from an introduction to neural networks to the interview with Ian Goodfellow, organized week by week according to topic.


Since many of the knowledge points from the first four courses have already been introduced, this article focuses on an outline of the notes for the fifth course. Readers can refer to the GitHub project for the complete notes.


Course 5: Introduction to Sequence Models


This course teaches how to build models for natural language, audio, and other sequence data. Thanks to deep learning, sequence algorithms now work far better than they did two years ago, and they power a wide range of interesting applications such as speech recognition, music synthesis, chatbots, machine translation, and natural language understanding. By the end of this course, you will:


  • Learn how to build and train Recurrent Neural Networks (RNNs) and their common variants such as GRUs and LSTMs.

  • Use sequence models for natural language problems such as text synthesis.

  • Apply sequence models to audio applications such as speech recognition and music synthesis.

  • This is the fifth and final course of the Deep Learning Specialization.


Who this course is for:


  • Learners who have completed Courses 1, 2, and 4. Studying Course 3 is also recommended.

  • People who already have a solid understanding of neural networks (including CNNs) and want to learn how to develop recurrent neural networks.


This course introduces Recurrent Neural Networks (RNNs), Natural Language Processing and Word Embeddings, as well as Sequence Models and Attention Mechanisms. Below, we briefly introduce the sequence-model notes completed by Mahmoud Badry.


Sequence models:


Sequence models, such as RNNs and LSTMs, have dramatically changed sequence learning, which can be augmented by attention mechanisms. Sequence models are used in speech recognition, music generation, sentiment classification, DNA sequence analysis, machine translation, video activity recognition, named entity recognition, and more.


Recurrent Neural Network Model (RNN)


RNNs emerged in the 1980s and have recently become more popular due to advances in network design and increased computing power on graphics processing units. Such networks are especially useful for sequential data, since each neuron or unit can use its internal storage to hold relevant information about previous inputs. In the case of language, the sentence "I had washed my house" has a very different meaning than "I had my house washed". This allows the network to gain a deeper understanding of the expression.


RNNs have many applications and perform well in the field of natural language processing (NLP). The figure below is an RNN network used to solve the task of named entity recognition.


[Figure: An RNN architecture used for a named entity recognition task.]


[Figure: Simplified RNN notation.]
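As a rough illustration of the forward computation these figures depict, here is a minimal NumPy sketch of a single RNN step. The parameter names (Waa, Wax, Wya, ba, by) follow the course's usual notation, but the code itself is an illustrative sketch rather than something taken from the notes.

```python
import numpy as np

def rnn_cell_forward(xt, a_prev, Waa, Wax, Wya, ba, by):
    """One forward step of a basic RNN cell.

    xt:     input at time step t, shape (n_x, m)
    a_prev: activation from the previous time step, shape (n_a, m)
    The weights Waa, Wax, Wya and biases ba, by are shared across all time steps.
    """
    a_t = np.tanh(Wax @ xt + Waa @ a_prev + ba)       # new hidden activation
    z_t = Wya @ a_t + by
    y_t = np.exp(z_t) / np.sum(np.exp(z_t), axis=0)   # softmax prediction for this step
    return a_t, y_t
```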


Backpropagation Through Time (BPTT)


[Figure: Backpropagation through the RNN architecture; the parameters w_a, b_a, w_y, b_y are shared by all elements in the sequence.]


Here the cross-entropy loss function is used:


L^{<t>}(ŷ^{<t>}, y^{<t>}) = −y^{<t>} log ŷ^{<t>} − (1 − y^{<t>}) log(1 − ŷ^{<t>})

L(ŷ, y) = Σ_t L^{<t>}(ŷ^{<t>}, y^{<t>})


The first formula is the loss for a single element of the sequence; the loss for the entire sequence is the sum of the losses over all elements.


[Figure: Backpropagation through time, in which the activation a is passed from one sequence element to the next.]


Types of RNNs


[Figure: The different types of RNNs.]


Vanishing gradients in RNNs


Vanishing gradients refer to the phenomenon in which the norm of the parameter gradients shrinks exponentially as the network gets deeper. Tiny gradients mean the parameters change very slowly, so the learning process stalls. Recurrent neural networks are very powerful on sequence problems such as language modeling, but they also suffer severely from vanishing gradients. Gated RNNs such as the LSTM and GRU therefore have great potential: they use gating mechanisms to retain or forget information from earlier time steps and form a memory that is fed into the current computation.
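As a toy illustration of this effect (not from the notes, and ignoring the nonlinearity's derivative for simplicity), repeatedly multiplying a gradient by the same recurrent weight matrix shows how its norm can shrink roughly exponentially with the number of time steps:

```python
import numpy as np

rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((64, 64))    # recurrent weights, spectral radius below 1
grad = rng.standard_normal(64)             # gradient arriving at the final time step

for t in range(1, 101):
    grad = W.T @ grad                      # one step of backpropagation through time
    if t % 20 == 0:
        print(t, np.linalg.norm(grad))     # the norm shrinks roughly exponentially
```

With a larger weight matrix the same loop would show the opposite failure mode, exploding gradients.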


Gated Recurrent Unit (GRU)


The GRU is designed to address the vanishing-gradient problem that appears in standard RNNs. The idea behind the GRU is very similar to that of the LSTM: gating mechanisms control how the input, the memory, and other information are used to make the prediction at the current time step. The expressions are as follows:


[Figure: The GRU equations.]


The GRU has two gates: a reset gate and an update gate. Intuitively, the reset gate determines how the new input is combined with the previous memory, while the update gate determines how much of the previous memory is carried over to the current time step. If we set the reset gate to all ones and the update gate to all zeros, we recover the standard RNN model. The basic idea of using gating mechanisms to learn long-term dependencies is the same as in the LSTM, but there are a few key differences:


  • The GRU has two gates (a reset gate and an update gate), while the LSTM has three (input, forget, and output gates).

  • The GRU does not maintain a separate internal memory (c_t) and has no output gate like the LSTM's.

  • The LSTM's input and forget gates correspond to the GRU's update gate, and the reset gate acts directly on the previous hidden state.

  • No second nonlinearity is applied when computing the output.


To address the vanishing-gradient problem of standard RNNs, the GRU uses an update gate and a reset gate. Essentially, these two gating vectors decide which information ends up in the output of the gated recurrent unit. What makes these gates special is that they can preserve information from long sequences without it being washed away over time or removed because it is irrelevant to the prediction.
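To make the gating concrete, here is a minimal NumPy sketch of one GRU step written in the update-gate/reset-gate style described above; the parameter names (Wu, Wr, Wc and the biases) are our own labels for this illustration, not something defined in the notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_cell_forward(xt, c_prev, Wu, Wr, Wc, bu, br, bc):
    """One forward step of a GRU cell with an update gate and a reset gate."""
    concat = np.concatenate([c_prev, xt])
    gamma_u = sigmoid(Wu @ concat + bu)                    # update gate
    gamma_r = sigmoid(Wr @ concat + br)                    # reset (relevance) gate
    c_tilde = np.tanh(Wc @ np.concatenate([gamma_r * c_prev, xt]) + bc)
    c_t = gamma_u * c_tilde + (1.0 - gamma_u) * c_prev     # blend candidate and old memory
    return c_t                                             # in a GRU this is also the hidden state
```

When the update gate is close to zero, the cell simply copies c_prev forward, which is exactly how the GRU preserves information across many time steps.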


[Figure: A recurrent neural network with gated recurrent units.]


The figure below shows the detailed structure of a single gated recurrent unit.


[Figure: A gated recurrent unit.]


LSTM


With classic backpropagation through time (BPTT) or Real-Time Recurrent Learning (RTRL), error signals flowing backward through time tend to either explode or vanish. The LSTM alleviates these problems through its mechanisms for forgetting and retaining memory.


An LSTM cell generally passes two states to the next cell: the cell state and the hidden state. The memory block is responsible for remembering the hidden states or events of earlier time steps, and this memory is usually implemented with three gating mechanisms: the input gate, the forget gate, and the output gate.


The detailed structure of an LSTM cell is shown below, where Z is the input and Z_i, Z_o, and Z_f are the values that control the three gates; that is, they filter information through an activation function f. The activation function is usually chosen to be the sigmoid function, because its output lies between 0 and 1 and therefore expresses how far each of the three gates is open.


[Figure: The structure of an LSTM cell; image from Hung-yi Lee's machine learning lecture notes.]


If we feed in the input Z, then the product g(Z)f(Z_i) of the activation g(Z) of that input vector and the input gate f(Z_i) represents the information retained from the input after filtering. The forget gate controlled by Z_f determines how much of the previous memory needs to be kept, and the retained memory can be written as c·f(Z_f). The previously retained information plus the meaningful part of the current input is passed on to the next LSTM cell, so the updated memory can be written as c' = g(Z)f(Z_i) + c·f(Z_f), where c' represents all the useful information retained from the past and the present. We then take the activation h(c') of this updated memory as a candidate output, usually choosing tanh as the activation function h. What remains is the output gate controlled by Z_o, which decides which parts of the activated current memory are useful as output. The final LSTM output can therefore be written as a = h(c')f(Z_o).
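The update just described translates almost line for line into code. The following NumPy sketch uses the same symbols Z, Z_i, Z_f, Z_o and c; in a real LSTM each of these pre-activation values would be a learned linear function of the current input and the previous hidden state, which is omitted here for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_forward(Z, Z_i, Z_f, Z_o, c_prev):
    """Sketch of the update a = h(c') f(Z_o) with c' = g(Z) f(Z_i) + c f(Z_f)."""
    g = np.tanh(Z)                  # candidate information extracted from the input
    i = sigmoid(Z_i)                # input gate: how much new information to let in
    f = sigmoid(Z_f)                # forget gate: how much of the old memory to keep
    o = sigmoid(Z_o)                # output gate: how much of the memory to expose
    c_new = g * i + c_prev * f      # updated cell state c'
    a = np.tanh(c_new) * o          # hidden state / output of the cell, h(c') f(Z_o)
    return a, c_new
```

Setting the forget gate close to 1 and the input gate close to 0 lets the cell carry its memory almost unchanged across many time steps, which is what mitigates the vanishing-gradient problem.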


Bidirectional RNN (BRNN)


Bidirectional RNNs and deep RNNs are effective ways to build powerful sequence models. The figure below shows an RNN model for a named entity recognition task:


[Figure: An RNN model for a named entity recognition task.]


[Figure: The BRNN architecture.]


A drawback of the BRNN is that the entire sequence is needed before it can be processed.
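For concreteness, here is a minimal NumPy sketch of a bidirectional RNN layer built from simple tanh cells (the parameter names Wf, Uf, bf and Wb, Ub, bb are our own for this illustration). Note how the backward pass cannot start until the whole sequence is available, which is exactly the drawback mentioned above.

```python
import numpy as np

def brnn_forward(xs, Wf, Uf, bf, Wb, Ub, bb):
    """A bidirectional RNN layer built from simple tanh cells.

    One pass runs left to right, another right to left, and the two hidden
    states are concatenated at every time step."""
    a_f = np.zeros(bf.shape[0])
    a_b = np.zeros(bb.shape[0])
    fwd, bwd = [], []
    for xt in xs:                                    # forward direction
        a_f = np.tanh(Wf @ xt + Uf @ a_f + bf)
        fwd.append(a_f)
    for xt in reversed(xs):                          # backward direction needs the full sequence
        a_b = np.tanh(Wb @ xt + Ub @ a_b + bb)
        bwd.append(a_b)
    bwd.reverse()
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```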


Deep RNNs


Deep RNNs help build powerful sequence models.


[Figure: A 3-layer deep RNN.]
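The stacking idea in the figure can be sketched in a few lines: at every time step, the hidden state of layer l becomes the input to layer l + 1. Again, this is an illustrative sketch with simple tanh cells and made-up parameter names (Ws, Us, bs), not code from the notes.

```python
import numpy as np

def deep_rnn_forward(xs, Ws, Us, bs):
    """A stacked (deep) RNN with simple tanh cells.

    Ws[l], Us[l], bs[l] are the input, recurrent, and bias parameters of layer l."""
    states = [np.zeros(b.shape[0]) for b in bs]
    outputs = []
    for xt in xs:
        layer_input = xt
        for l in range(len(Ws)):
            states[l] = np.tanh(Ws[l] @ layer_input + Us[l] @ states[l] + bs[l])
            layer_input = states[l]            # pass this layer's activation upward
        outputs.append(layer_input)            # top-layer activation at this time step
    return outputs
```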


Backpropagation in RNNs


In modern deep learning frameworks you only need to implement the forward pass; the framework carries out backpropagation for you, so most machine learning engineers never have to worry about it. However, if you are a calculus expert and want to see the details of backpropagation in RNNs, refer to this notebook: https://www.coursera.org/learn/nlp-sequence-models/notebook/X20PE/building-a-recurrent-neural-network-step-by-step.


Natural Language Processing and Word Representations


Word representations are an essential part of natural language processing. From early one-hot encodings to today's popular word embeddings, researchers have kept looking for efficient ways to represent words. Mahmoud Badry's notes document embedding methods in detail, including embeddings used for named entity recognition, face recognition, and translation systems. The figure below shows the embedding structure used for face recognition:


[Figure: An embedding network used for face recognition.]


With this kind of embedding, we can compress different faces into vectors and then compare those vectors to decide whether two images show the same face.


Word embeddings have many useful properties. For example, given a word-embedding table, the notes use examples to explain the semantic relationships that embeddings capture. As shown in the figure below, the shift from "man" to "woman" points in the same direction in the embedding space as the shift from "king" to "queen", which means word embeddings capture the semantic relationships between words.


[Figure: Word-embedding analogies such as man → woman and king → queen.]
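This analogy can be checked directly with vector arithmetic and cosine similarity. The sketch below uses random vectors purely as stand-ins; with embeddings from a trained model (Word2Vec, GloVe, etc.) the nearest word to e_king − e_man + e_woman would typically be "queen".

```python
import numpy as np

def cosine_similarity(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy 300-d vectors (random here purely as placeholders; real values would come
# from a trained embedding matrix).
rng = np.random.default_rng(0)
emb = {w: rng.standard_normal(300) for w in ["man", "woman", "king", "queen"]}

# "man is to woman as king is to ?" -> find the word closest to e_king - e_man + e_woman
target = emb["king"] - emb["man"] + emb["woman"]
best = max((w for w in emb if w != "king"), key=lambda w: cosine_similarity(emb[w], target))
print(best)   # arbitrary with random vectors; with trained embeddings it is typically "queen"
```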


Generally speaking, the Word2Vec method consists of two parts. The first is to map words from a high-dimensional one-hot representation into low-dimensional vectors, for example converting a 10,000-column matrix into a 300-column one; this mapping is called a word embedding. The second goal is to preserve the meaning of words to some extent while keeping their context. Word2Vec achieves these two goals with the skip-gram and CBOW models. Skip-gram takes a word as input and tries to estimate the probability that other words appear near it. The opposite model, the continuous bag-of-words (CBOW), takes some context words as input and finds the single word that best fits (has the highest probability given) that context.


For the continuous bag-of-words model, Mikolov et al. use the n words before and after the target word to jointly predict it. They call this model Continuous Bag of Words (CBOW) because it represents words in a continuous space and the order of the words does not matter. CBOW can be seen as a language model that also looks ahead at future context, while skip-gram completely inverts the objective: instead of predicting the middle word from its surrounding words as CBOW does, it uses the center word to predict the surrounding words.
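As a small illustration of the two training objectives (our own sketch, not from the notes), the following helper enumerates (center, context) pairs from a token list; skip-gram trains a model to predict the context word from the center word, whereas CBOW does the reverse.

```python
def skipgram_pairs(tokens, window=2):
    """Enumerate (center, context) pairs: skip-gram predicts the context word
    from the center word, while CBOW predicts the center word from its context."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs("the quick brown fox jumps".split()))
```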


Mahmoud Badry also covers another method for learning word embeddings, GloVe, which is not used as widely as the language-model-based approaches, but whose compact structure is easy to understand:


[Figure: The GloVe word-embedding model.]


Sequence Models and Attention Mechanisms


In the last part, the notes focus on the attention mechanism, including the shortcomings of the encoder-decoder architecture and how introducing attention addresses them. The figure below shows the process of encoding information with a context vector C or with attention weights.


[Figure: Encoding information with a context vector C or with attention weights.]


In fact, when we translate a sentence, we pay special attention to the word being translated. Neural networks can achieve the same behavior by paying attention to a subset of the received information.


We typically generate the attention distribution with content-based attention: the attending RNN produces a query describing what it wants to focus on, each item is dot-multiplied with the query to produce a score describing how well it matches the query, and the scores are fed into a softmax to produce the attention distribution.
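The scoring-and-softmax step described above fits in a few lines. The following sketch (with made-up shapes) computes a content-based attention distribution over a set of encoder states and the resulting context vector:

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())
    return e / e.sum()

def content_based_attention(query, items):
    """Dot each item with the query, softmax the scores, and return the
    resulting context vector together with the attention weights."""
    scores = items @ query          # how well each item matches the query
    weights = softmax(scores)       # attention distribution
    context = weights @ items       # weighted combination of the items
    return context, weights

# Example: a decoder query attending over 5 encoder states of dimension 8.
rng = np.random.default_rng(0)
items = rng.standard_normal((5, 8))
query = rng.standard_normal(8)
context, weights = content_based_attention(query, items)
print(weights.round(3), context.shape)
```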

