Abstract

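A series of experiments applying CNNs to text classification shows that a simple CNN with little hyperparameter tuning and static word vectors performs very well on multiple classification benchmarks. Fine-tuning the word vectors improves performance further. The paper also uses fine-tuned and static word vectors together (multi-channel).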

1 Introduction

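Word vectors: trained by Mikolov et al. on 100 billion words of Google News.

With the word vectors kept static and only the parameters of the other layers learned, the model achieves very good results on multiple benchmarks. This also suggests that the pretrained word vectors can be reused across a variety of classification tasks.

Task-specific vectors: word vectors fine-tuned for the task being learned.

The final model takes multiple channels as input: static pretrained word vectors and task-specific vectors.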

2 Model

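The architecture follows the paper by Collobert et al. Let $x_i \in R^k$ be the k-dimensional word vector of the i-th word in the sentence. A sentence of length n (padded where necessary) is represented as: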

$x_{1:n} = x_1 \oplus x_2 \oplus \dots \oplus x_n \tag 1$
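As a minimal sketch of equation (1), the snippet below builds a padded sentence from a toy lookup table and concatenates the word vectors. The names `sentence_rows`, `concat`, and `toy_vectors` are hypothetical, and mapping unknown words to the zero vector is an assumption made only to keep the example short.

```python
from typing import Dict, List

def sentence_rows(tokens: List[str], vectors: Dict[str, List[float]],
                  n: int, k: int) -> List[List[float]]:
    """Return x_1 ... x_n as n rows of k floats, truncating or zero-padding to length n."""
    pad = [0.0] * k
    # Assumption: words missing from `vectors` also map to the zero vector here.
    rows = [list(vectors.get(tok, pad)) for tok in tokens[:n]]
    rows += [list(pad) for _ in range(n - len(rows))]
    return rows

def concat(rows: List[List[float]]) -> List[float]:
    """x_{1:n} = x_1 (+) x_2 (+) ... (+) x_n as one flat vector of length n*k."""
    return [v for row in rows for v in row]

toy_vectors = {"good": [1.0, 0.0], "movie": [0.0, 1.0]}  # hypothetical 2-d vectors
x = sentence_rows(["good", "movie"], toy_vectors, n=4, k=2)
x_flat = concat(x)
```

Keeping the sentence as n rows of k values is usually more convenient in code; the flat vector is shown only to mirror the concatenation operator in the formula.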

A convolution filter $w \in R^{hk}$ is applied to a window of h words to extract a feature. For example, feature $c_i$ is produced by:

$c_i = f(w \cdot x_{i:i+h-1} + b) \tag 2$

where b is a real-valued bias and f is a nonlinear function. The filter is applied to every possible window $\{x_{1:h}, x_{2:h+1}, \dots, x_{n-h+1:n}\}$ to produce a feature map:

$c = [c_1, c_2, \dots, c_{n-h+1}] \tag 3$

where $c \in R^{n-h+1}$. Max pooling, $\hat{c} = \max\{c\}$, then captures the strongest feature in each feature map.
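The convolution and pooling steps can be sketched in a few lines of plain Python. This is an illustration rather than the paper's implementation: the function name `feature_map`, the toy input, and the filter values are all made up, and each window covers the h consecutive words starting at position i.

```python
from typing import Callable, List

def relu(a: float) -> float:
    return max(0.0, a)

def feature_map(x: List[List[float]], w: List[float], b: float, h: int,
                f: Callable[[float], float] = relu) -> List[float]:
    """Slide a filter over every window of h consecutive words; returns n - h + 1 features."""
    n = len(x)
    c = []
    for i in range(n - h + 1):
        # Concatenate the h word vectors in the window into one vector of length h*k.
        window = [v for row in x[i:i + h] for v in row]
        c.append(f(sum(w_j * v_j for w_j, v_j in zip(w, window)) + b))
    return c

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]  # toy sentence: n = 4, k = 2
w = [1.0, -1.0, 0.5, 0.5]                              # one filter with h = 2, so h*k = 4 weights
c = feature_map(x, w, b=0.0, h=2)
c_hat = max(c)  # max pooling keeps the single strongest response
print(c, c_hat)  # [1.5, 0.0, 0.0] 1.5
```

Because pooling keeps one value per filter, the output size does not depend on the sentence length n.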

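The model uses multiple filters (with different window sizes) to obtain multiple features. These features form the penultimate layer, which feeds a fully connected softmax layer whose output is the probability of each class.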
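A small sketch of the last step, assuming the pooled features have already been computed: the penultimate layer `z` holds one pooled value per filter, and a fully connected layer followed by softmax turns it into class probabilities. The weights `W`, biases `b`, and the three pooled values are hypothetical numbers chosen for illustration.

```python
import math
from typing import List

def softmax(scores: List[float]) -> List[float]:
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def class_probabilities(z: List[float], W: List[List[float]], b: List[float]) -> List[float]:
    """Fully connected layer (one weight row per class) followed by softmax."""
    scores = [sum(w_j * z_j for w_j, z_j in zip(row, z)) + b_i
              for row, b_i in zip(W, b)]
    return softmax(scores)

# One pooled value per filter; here three hypothetical filters with windows 3, 4, 5.
z = [1.5, 0.2, 0.7]
W = [[1.0, 0.0, 0.0],   # class 0
     [0.0, 1.0, 1.0]]   # class 1
b = [0.0, 0.0]
probs = class_probabilities(z, W, b)
```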

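The model has several variants:

A model whose input is a static channel of word vectors.
A model whose input is fine-tuned word vectors.
A multichannel model (both of the inputs above).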

2.1 Regularization

For regularization, dropout is applied to the penultimate layer, together with a constraint on the L2 norms of the weight vectors.

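Dropout: during training, each hidden unit is set to 0 with probability p, and gradients are backpropagated only through the units that were not set to 0. At prediction time all units are active. That is: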

$y = w \cdot (z \circ r) + b$

where $\circ$ is element-wise multiplication and $r \in R^m$ is an m-dimensional masking vector whose elements are each 1 with probability p.
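The masked forward pass can be sketched as follows. The function name `dropout_output` is hypothetical. The test-time branch scales the weights by p, which follows the paper's description rather than anything spelled out in these notes, so treat that detail as an assumption.

```python
import random
from typing import List, Optional, Tuple

def dropout_output(z: List[float], w: List[float], b: float, p: float,
                   rng: Optional[random.Random] = None
                   ) -> Tuple[float, Optional[List[float]]]:
    """Training (rng given): y = w . (z o r) + b with a fresh Bernoulli(p) mask r.
    Prediction (rng is None): no mask; the weights are scaled by p instead."""
    if rng is not None:
        r = [1.0 if rng.random() < p else 0.0 for _ in z]  # each r_j is 1 with probability p
        y = sum(w_j * z_j * r_j for w_j, z_j, r_j in zip(w, z, r)) + b
        return y, r
    y = sum(p * w_j * z_j for w_j, z_j in zip(w, z)) + b
    return y, None

z = [1.0, 2.0, 3.0, 4.0]
w = [1.0, 1.0, 1.0, 1.0]
y_train, mask = dropout_output(z, w, b=0.0, p=0.5, rng=random.Random(0))
y_test, _ = dropout_output(z, w, b=0.0, p=0.5)
```

Units whose mask entry is 0 contribute nothing to `y_train`, so no gradient flows through them for that training step.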

L2-norm constraint: for each weight vector $w$, if $||w||_2 > s$ after a gradient step, $w$ is rescaled so that $||w||_2 = s$.
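The rescaling rule is short enough to write out directly. This is a sketch with a hypothetical helper name, `clip_l2`; it would be applied to a weight vector after each gradient update.

```python
import math
from typing import List

def clip_l2(w: List[float], s: float) -> List[float]:
    """Rescale w so that its L2 norm is at most s; leave it unchanged otherwise."""
    norm = math.sqrt(sum(v * v for v in w))
    if norm > s:
        return [v * s / norm for v in w]
    return list(w)

clipped = clip_l2([3.0, 4.0], s=3.0)    # norm 5 is rescaled down to norm 3
unchanged = clip_l2([1.0, 1.0], s=3.0)  # norm is already below 3
```

The direction of the vector is preserved; only its length is capped.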

3 Datasets and Experimental Setup

MR: Movie reviews with one sentence per review. Classification involves detecting positive/negative reviews (Pang and Lee, 2005).
SST-1: Stanford Sentiment Treebank, an extension of MR but with train/dev/test splits provided and fine-grained labels (very positive, positive, neutral, negative, very negative), re-labeled by Socher et al. (2013).
SST-2: Same as SST-1 but with neutral reviews removed and binary labels.
Subj: Subjectivity dataset where the task is to classify a sentence as being subjective or objective (Pang and Lee, 2004).
TREC: TREC question dataset. The task involves classifying a question into 6 question types (whether the question is about person, location, numeric information, etc.) (Li and Roth, 2002).
CR: Customer reviews of various products (cameras, MP3s etc.). The task is to predict positive/negative reviews (Hu and Liu, 2004).

3.1 Hyperparameters and Training

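The model uses the ReLU nonlinearity, filter windows of 3, 4, and 5 with 100 filters each, a dropout rate of 0.5, an L2 constraint (s) of 3, and a mini-batch size of 50. These values were chosen by grid search on the SST-2 dataset.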
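To make the settings concrete, here is a sketch that collects them in a small config object and derives two sizes from them. The class `Hyperparams` and its method names are hypothetical; the 300-dimensional embedding comes from the pretrained vectors described in section 3.2, and the parameter count is my own arithmetic for a single-channel model, not a figure from the paper.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Hyperparams:
    filter_windows: Tuple[int, ...] = (3, 4, 5)
    feature_maps: int = 100      # filters per window size
    dropout_rate: float = 0.5
    l2_constraint: float = 3.0   # s
    batch_size: int = 50
    embedding_dim: int = 300     # k, from the word2vec vectors

    def penultimate_size(self) -> int:
        """One max-pooled value per filter."""
        return len(self.filter_windows) * self.feature_maps

    def conv_parameters(self) -> int:
        """Weights (h * k) plus one bias per filter, summed over all filters."""
        return sum(self.feature_maps * (h * self.embedding_dim + 1)
                   for h in self.filter_windows)

hp = Hyperparams()
print(hp.penultimate_size(), hp.conv_parameters())  # 300 360300
```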

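For datasets without a standard dev set, 10% of the training data is randomly selected as the dev set. Optimization uses Adadelta.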
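A random 10% hold-out can be sketched as below. The helper `train_dev_split` and the fixed seed are assumptions for illustration; the notes do not say how the split was seeded.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")

def train_dev_split(examples: Sequence[T], dev_fraction: float = 0.1,
                    seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle indices with a fixed seed and hold out dev_fraction of the data."""
    idx = list(range(len(examples)))
    random.Random(seed).shuffle(idx)
    n_dev = int(len(examples) * dev_fraction)
    dev = [examples[i] for i in idx[:n_dev]]
    train = [examples[i] for i in idx[n_dev:]]
    return train, dev

train, dev = train_dev_split(list(range(100)))
```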

3.2 Pre-trained Word Vectors

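The word vectors are Google's word2vec vectors, trained on 100 billion words of Google News. They have 300 dimensions and were trained with the CBOW architecture. Words not covered by the pretrained vectors are initialized randomly.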
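The lookup-or-initialize step might look like the sketch below. The uniform range is chosen so that random vectors have the same variance as the pretrained ones, which follows the paper's footnote rather than these notes; the function name `build_embeddings` and the toy 2-dimensional vectors are hypothetical.

```python
import math
import random
from typing import Dict, List, Tuple

def build_embeddings(vocab: List[str], pretrained: Dict[str, List[float]], k: int,
                     rng: random.Random) -> Tuple[Dict[str, List[float]], float]:
    """Copy pretrained vectors; sample missing words from U[-a, a].
    a is set so that Var(U[-a, a]) = a^2 / 3 equals the pretrained variance."""
    values = [v for vec in pretrained.values() for v in vec]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    a = math.sqrt(3.0 * var)
    table = {}
    for word in vocab:
        if word in pretrained:
            table[word] = list(pretrained[word])
        else:
            table[word] = [rng.uniform(-a, a) for _ in range(k)]
    return table, a

toy_pretrained = {"good": [0.5, -0.5], "bad": [-0.5, 0.5]}  # made-up stand-in for word2vec
table, a = build_embeddings(["good", "bad", "meh"], toy_pretrained, k=2,
                            rng=random.Random(0))
```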

3.3 Model Variations

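CNN-rand: the baseline model, in which all word vectors are randomly initialized.
CNN-static: uses the word2vec vectors; words not covered are randomly initialized, and the word vectors are kept fixed.
CNN-non-static: same as above, but the word vectors are fine-tuned for each task.
CNN-multichannel: the model has two sets of word vectors as input. One channel is fine-tuned and the other is kept static; both channels are initialized with the word2vec values.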
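A rough sketch of how one feature could be computed in the multichannel variant. The reading used here, in which the same filter is applied to both channels and the two responses are added before the nonlinearity, follows the paper's description of the multichannel architecture and is not stated in these notes, so treat it as an assumption. All names and numbers are hypothetical.

```python
from typing import Callable, List

Matrix = List[List[float]]

def window_dot(x: Matrix, w: List[float], i: int, h: int) -> float:
    """Dot product of the filter with the h-word window starting at row i."""
    window = [v for row in x[i:i + h] for v in row]
    return sum(w_j * v_j for w_j, v_j in zip(w, window))

def multichannel_feature(x_static: Matrix, x_tuned: Matrix, w: List[float], b: float,
                         i: int, h: int, f: Callable[[float], float]) -> float:
    # Assumption: one shared filter w, responses from both channels summed before f.
    return f(window_dot(x_static, w, i, h) + window_dot(x_tuned, w, i, h) + b)

def relu(a: float) -> float:
    return max(0.0, a)

x_static = [[1.0, 0.0], [0.0, 1.0]]
x_tuned = [row[:] for row in x_static]  # both channels start from the same vectors
c_0 = multichannel_feature(x_static, x_tuned, [1.0] * 4, 0.0, i=0, h=2, f=relu)
```

During training only `x_tuned` would receive gradient updates, so the two channels drift apart over time while `x_static` keeps the original values.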

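To separate the effect of these variations from other random factors, the other sources of randomness (CV-fold assignment, initialization of unknown word vectors, and initialization of CNN parameters) are kept uniform within each dataset.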
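One simple way to keep cross-validation comparable across model variants is to derive the fold assignment from a fixed seed, as sketched below. The helper `cv_folds`, the seed value, and the choice of 10 folds are assumptions for illustration.

```python
import random
from typing import List

def cv_folds(n_examples: int, n_folds: int = 10, seed: int = 1234) -> List[List[int]]:
    """Assign every example index to exactly one fold, reproducibly."""
    idx = list(range(n_examples))
    random.Random(seed).shuffle(idx)
    return [idx[f::n_folds] for f in range(n_folds)]

folds_a = cv_folds(95)
folds_b = cv_folds(95)  # same seed, so every model variant sees the same folds
```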

4 Results and Discussion

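The results show that the baseline (CNN-rand) does not perform well, which indicates that pretrained word vectors greatly improve model performance.
Fine-tuning the pretrained word vectors for each task gives a further improvement.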

4.1 Multichannel vs. Single Channel

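The paper had hoped that the multichannel architecture would prevent overfitting (by keeping the learned vectors from drifting too far from the original word vectors) and would therefore outperform the single-channel model. The results, however, are mixed, and how to regularize the fine-tuning process deserves further study.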

4.2 Static vs. Non-static

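Non-static word vectors become more task-specific. For example, in word2vec, bad and good are very similar because they are syntactically similar (word2vec is based on distributional representations), whereas in the word vectors fine-tuned on SST-2, good and nice are close because they are similar in sentiment. See Table 3 of the paper for details.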
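The comparison behind this observation is a nearest-neighbor lookup by cosine similarity. The sketch below uses made-up 2-dimensional vectors, not values from the paper's Table 3, purely to show how the nearest neighbor of a word can change once the vectors are fine-tuned.

```python
import math
from typing import Dict, List

def cosine(u: List[float], v: List[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def nearest(word: str, table: Dict[str, List[float]]) -> str:
    """Most similar other word in the table by cosine similarity."""
    return max((w for w in table if w != word),
               key=lambda w: cosine(table[word], table[w]))

# Hypothetical vectors: before fine-tuning "good" sits next to "bad" ...
static_toy = {"good": [1.0, 0.2], "bad": [0.9, 0.3], "nice": [0.2, 1.0]}
# ... and after fine-tuning on a sentiment task it sits next to "nice".
tuned_toy = {"good": [1.0, 0.8], "bad": [-0.9, 0.3], "nice": [0.8, 1.0]}

before = nearest("good", static_toy)
after = nearest("good", tuned_toy)
```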

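For randomly initialized word vectors, fine-tuning lets them learn more meaningful representations. (My own understanding: what is learned depends more on the task. For a sentiment task, for example, the word vectors lean toward expressing sentiment.)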

4.3 Further Observations

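Kalchbrenner et al.'s CNN has a similar architecture to the single-channel CNN in this paper but gives worse results, because the CNN here has more capacity (multiple filters with different window sizes and more features).
Dropout acts as a regularizer, so a larger network can be used; it gives an improvement of 2%-4%.
The paper also tried another set of word vectors (trained by Collobert et al. (2011) on Wikipedia), but they performed worse than word2vec. It is unclear whether this is because the word2vec model is better or because of the 100-billion-word Google News corpus.
Adadelta and Adagrad give similar results, but Adadelta needs fewer training epochs.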
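For reference, the two update rules mentioned in the last item can be written for a single scalar parameter as below. This is a sketch of the standard Adagrad and Adadelta updates on the toy objective f(x) = x^2; the learning rate, rho, epsilon, and step count are illustrative assumptions, and the snippet makes no claim about which optimizer converges faster.

```python
import math
from typing import Callable, Dict

State = Dict[str, float]

def adagrad_step(x: float, g: float, state: State,
                 lr: float = 0.1, eps: float = 1e-8) -> float:
    state["G"] += g * g                       # running sum of squared gradients
    return x - lr * g / (math.sqrt(state["G"]) + eps)

def adadelta_step(x: float, g: float, state: State,
                  rho: float = 0.95, eps: float = 1e-6) -> float:
    state["Eg2"] = rho * state["Eg2"] + (1 - rho) * g * g        # decayed average of g^2
    dx = -math.sqrt(state["Edx2"] + eps) / math.sqrt(state["Eg2"] + eps) * g
    state["Edx2"] = rho * state["Edx2"] + (1 - rho) * dx * dx    # decayed average of dx^2
    return x + dx

def minimize(step: Callable[[float, float, State], float], steps: int = 50) -> float:
    x, state = 1.0, {"G": 0.0, "Eg2": 0.0, "Edx2": 0.0}
    for _ in range(steps):
        x = step(x, 2.0 * x, state)           # gradient of f(x) = x^2 is 2x
    return x

x_adagrad = minimize(adagrad_step)
x_adadelta = minimize(adadelta_step)
```

Adagrad needs a hand-picked learning rate, while Adadelta replaces it with the ratio of two running averages.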

5 Conclusion

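Building on word2vec word vectors, the paper runs a series of experiments with a simple CNN for text classification. The model performs well, which shows that unsupervised pretraining of word vectors is an important ingredient for NLP.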

Reference

http://www.aclweb.org/anthology/D14-1181
Implementing a CNN for Text Classification in TensorFlow
TensorFlow implementation: https://github.com/dennybritz/cnn-text-classification-tf
Author's Theano implementation: https://github.com/yoonkim/CNN_sentence
Character-level CNN paper: Character-level Convolutional Networks for Text Classification

Paper reading notes: Convolutional Neural Networks for Sentence Classification
