embedding sentence and documentation

Link: data set extraction code: 6cgu

1. A paper REVIEW
2. The two papers Intensive
3. The three code implementation
4. Four Issues thinking

"Distributed Representations of Sentences and Documents"
- sentences and documents distributed learns
Author: Quoc Le and Tomas Mokolov
Unit: Google
published the meeting and time: ICML 2014

A paper REVIEW

Distributed sentence expressed Profile

Distributed sentence represents the correlation method

Pre-knowledge

1. Introduction sentence Distributed representation

Distributed sentence said: Distributed sentence is to express a sentence or paragraph (this sentence will be treated equally and documents, the document equivalent to a longer sentence) with 固定长度的向量representation
Meaning: If you can accurately represent a word with a vector, it can be directly used for text classification, information retrieval, machine translation, and so with this vector field

As shown below:
Here Insert Picture Description

2. The sentence represents the correlation method distributed
a historical model:

1 sentence based on statistical distributed representation:

Bag-of-words
Bag-of-n-grams

2 sentence based on the depth study of distributed representation

Weighted average method
Depth learning model

(1) Bag of words
algorithm:

Build a vocabulary, vocabulary each element is a word
For the number of times a word s, the statistics for each vocabulary word appears in s
The number of times each word appears in the vocabulary in s, construct a word vector table size

The example of FIG:
Here Insert Picture Description
Reflection: Bag-of-words disadvantage, how to improve
(2) Bag-of-n-gram, the word elements in the table may also be word n-gram phrases

(3) plus whole average method:
algorithm:

Build vocabulary, vocabulary word each element
使用词向量学习方法（skip-gram等）学习每个词的词向量表示
对于句子s中的每个词（w1,w2,w3,…,wn)对应的词向量（e1,e2,e3,…,en)加权平均，结果为句子的分布式表示：
（下图公式只有平均，没有加权）

（4）深度学习方法：
算法:

构建词表，词表中每个元素都是词
使用词向量学习方法（skip-gram等）学习每个词向量表示
将句子s中的每个向量作为输入送进深度神经网络模型（CNN或RNN),然后通过监督学习，学习每个句子的分布式表示。

模型一般形式如下图，在concatenation部分将句子的每个词进行了加权平均得到了句子的分布式表示
Here Insert Picture Description

3. 前期知识

熟悉词向量的相关知识
了解使用语言模型训练词向量的方法
训练模型如下图：

二论文精读

论文整体框架

传统/经典算法模型

论文提出改进后的模型

实验结果

讨论和总结

1. 论文整体框架
0.摘要
Here Insert Picture Description
1.介绍
2.句子分布式表示模型
3.实验
4.相关工作
5.结论

2. 传统/经典算法模型

Bag-of-words
其模型的缺点：
一因为是词袋模型，所以丢失了词之前的位置信息
二句向量知识单纯地利用了统计信息，而没有得到语义信息，或者只得到很少的语义信息
Bag-of-n-gram模型的缺点：
一因为使用了n-gram,所以保留了位置信息，但是n-gram不会太大，最多是4-gram,所以保留的位置信息很少
二 N-gram同样没有学习到语义信息
加权平均法的缺点
对所有的词向量进行平均，丢失了词之前的顺序信息及词与词之间的关系信息
基于深度学习模型的缺点
只能使用标注数据训练每个句子的句向量，这样训练得到的向量都是任务导向的，不具有通用性
基于语言模型的词向量训练
语言模型：语言模型可以给出每个句子是句子的概率：

而每个词的概率定义成n-gram形式，即每个词出现只与前n-1个词有关：

评价语言模型的好坏的指标困惑度（perplexity)
接下来就是基于语言模型的词向量训练
算法：
对于每个词随机初始化一个词向量
取得一个连续的n-1个词，将n-1个词对应的词向量连接（concatenate）在一起形成向量e
将e作为输入，送入一个单隐层神经网络，隐层的激活函数为tanh,输出层的神经元个数为词表的大小
优点：就像原文提到的，即训练出一组词向量，又得到一个语言模型，其次不需要标注数据，可以使用很大的数据集
论文：《A Neural Probabilistic Language Model 》

Here Insert Picture Description
3. 论文提出改进后的模型

本文的模型就是基于语言模型改进而来的分布式句向量训练模型。
算法：

类似于前面提到的基于语言模型的词向量训练模型，这里的句向量训练模型也是利用前几个预测后一个词
不同的是，这里将每句话映射成一个句向量，联合预测后一个词出现的概率

这样就学习到了每个词的词向量和每句话的句向量
Here Insert Picture Description
橘色的是句向量，右边三个是词向量
句向量+词向量得到映射的一个词
本模型可以学到语义和语法信息

窗口大小包括预测的那个词

训练阶段：
通过训练集构建词表，并随机初始化词向量W和训练集的句向量矩阵D。设置n-gram，文中为窗口大小，然后利用句向量训练模型，训练矩阵模型的所有参数，包括词向量矩阵和句向量矩阵

最后将学习到的句向量用于分类器预测句子的类别概率

测试阶段：
固定词向量矩阵W和模型的其他参数，重新构建句向量矩阵D并随机初始化，然后利用梯度训练矩阵D，从而得到测试集每个句子的句向量

无序句向量训练模型
文本还提出了一种Bag-of-words，即忽略词序信息的模型
算法：

每个句子通过随机初始化句向量矩阵映射成一个句向量，然后通过句向量每次随机预测句子中的一个词。
然后将学习到的句向量送到已经训练好的分类器，预测句子的概率

本文分别使用提出的两种模型训练得到两个句向量，然后将两个句向量合并（concatenate）,得到最终的句向量表示

4. 实验结果

一数据集：
SST
IMDB
评价方法：SST：5分类也可以2类 IMDB：2分类

二实验结果

第二列为二分类结果，第三列为五分类结果，实验结果显示本文提出的句向量方法优于朴素贝叶斯、SVM、词向量平均法、神经网络方法，在二分类和五分类任务都取得了最好的结果。
Here Insert Picture Description
在IMDB也取得了最好结果

Here Insert Picture Description

5. 讨论和总结

目前主流的句向量表示方法：
基于神经网络的句向量学习方法（多快好省），使用预训练的词向量，神经网络可以得到非常好的句向量表示
训练过程还需要训练，大大降低了效率? 是
使用基于神经网络的句向量学习方法，当前流行的ELMO，BERT
Are there any other sentence vectors training methods?
Seq2seq sentence vector model training method proposed by posterity
The main innovation
A proposes a new unsupervised training methods sentence vector
B can be used directly in downstream task
C SOTA results have been achieved at the time of publication of papers

Three code implementation

Four Issues on thinking

Thoughts: Bag-of-words shortcomings, how to improve

Major_s

Released 1031 original articles · won praise 22 · views 20000 +

Private letter concerns