Paper reading notes: Glyce: Glyph-vectors for Chinese Character Representations

Shannon.AI proposed a glyph-based character representation for Chinese, Glyce; Glyce-based models reach SOTA on 13 Chinese NLP tasks.

Summary:

   Intuitively, glyph information should be very helpful for NLP tasks on a logographic language like Chinese. However, because 1) rich pictographic evidence from historical scripts has not been exploited, and 2) existing CV models do not generalize well to character image data, effective ways of using this information have remained unexplored.

   This paper proposes Glyce: glyph-vectors for Chinese character representations, to address the problems above. The main innovations are three: 1) make full use of Chinese scripts from various historical periods (bronze script, seal script, traditional Chinese, etc.) as well as different writing styles (cursive script, clerical script); 2) propose a CNN architecture designed specifically for Chinese character images; 3) use a multi-task learning setup in which an image-classification task serves as an auxiliary objective to improve the generalization ability of the model.

   The paper reaches SOTA performance on 13 Chinese NLP tasks: (1) character-level language modeling, (2) word-level language modeling, (3) Chinese word segmentation, (4) named entity recognition, (5) POS tagging, (6) dependency parsing, (7) semantic role labeling, (8) semantic similarity, (9) intent identification, (10) sentiment analysis, (11) machine translation, (12) text classification, (13) discourse parsing.

1 Introduction:

  Chinese characters can be divided into ideographs (e.g., 日, which depicts the sun) and phono-semantic compounds (e.g., 睛 "eye", combining the semantic radical 目 "eye" with the phonetic 青 "blue"). As early as the Han dynasty, the Shuowen Jiezi indexed characters by their glyphs, an approach still in use today. Since many Chinese characters evolved from pictures, as shown below, the shapes of Chinese characters can provide a wealth of information.

   On Chinese NLP tasks, very few works take advantage of glyph information. Among studies of Chinese character structure, some progress has been made using the Wubi encoding, but because the structure of Wubi codes is essentially arbitrary, they cannot express deep glyph information.

Some works have used CNNs to learn from glyph information but did not obtain good results, probably for the following reasons: 1) they used simplified characters, which lost most of the glyph information over the course of their evolution (character evolution is shown below); 2) the CNN architectures used were inappropriate, because character images are small while existing CNN models are generally designed for much larger images; 3) the data are scarce: there are only about 10,000 Chinese characters.

 

  This paper treats Chinese characters as images and uses a CNN to extract features. To solve the problems above, the following solutions are adopted:

1. Combine characters from historical and contemporary scripts (such as bronze script, clerical script, seal script, traditional Chinese, etc.), whose character images are rich in pictographic information, together with different writing styles (e.g., cursive), to improve the generalization ability of the model.

2. Propose the Tianzige-CNN (田字格, the grid used for practicing Chinese handwriting) architecture, tailored to Chinese glyphs.

3. Use multi-task learning: add an image-classification loss as an auxiliary objective to improve the generalization ability of the model.


3 Glyce

1. Use of Data


   Chinese characters evolved from forms that were easy to draw towards forms that are easy to write, and in that process a large amount of the information contained in the glyphs was inevitably lost. Therefore, to enrich the glyph information, scripts from different historical periods are used; and to improve generalization, characters in different writing styles are used, a common data-augmentation method in computer vision. The specific scripts used are as follows:

2. Glyce Tianzige-CNN structure

  To address the problems mentioned above (character images are too small and the number of characters is limited), this paper proposes the Tianzige structure:

 The Tianzige structure turns out to be very effective at extracting character information; it is called Tianzige because, with suitably sized filters, the resulting feature map has size 2 × 2, like the 田-shaped grid used for handwriting practice. As shown below:

 First convolution: kernel f = 5, stride s = 1, no padding

Max pooling: kernel f = 4, stride s = 4, no padding

  One convolution followed by one max pooling turns the 12 × 12 character image into a 2 × 2 Tianzige grid (12 − 5 + 1 = 8, then 8 / 4 = 2).

  To prevent over-fitting, the final step does not use a conventional convolution but a group convolution instead. For details see: https://blog.csdn.net/hhy_csdn/article/details/80030468

  The figure below shows a normal, ungrouped convolutional layer, viewed along the channel (third) dimension: each filter corresponds to one output channel. As the network gets deeper, the number of channels grows sharply while the spatial size shrinks, because the convolutional layers use more and more kernels while convolution and pooling make the feature maps smaller and smaller. So in deep networks the channels become increasingly important.

   The next figure shows a CNN with group convolution. The filters are split into two groups, and each group only operates on half of the feature maps.
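To make the layer arithmetic above concrete, here is a minimal PyTorch sketch of the Tianzige-CNN pipeline (convolution, max pooling to the 2 × 2 田字格 grid, then a group convolution). The kernel and stride settings follow the description above; the channel count, number of groups, and the final group-convolution kernel are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class TianzigeCNN(nn.Module):
    """Minimal sketch of the Tianzige-CNN described above.

    Only the kernel/stride choices (conv 5x5 s=1, max-pool 4x4 s=4) and the use
    of group convolution follow the notes; channels and groups are assumptions.
    """

    def __init__(self, out_channels=1024, groups=16):
        super().__init__()
        # 12x12 grayscale glyph -> 8x8 feature map (12 - 5 + 1 = 8)
        self.conv = nn.Conv2d(1, out_channels, kernel_size=5, stride=1)
        # 8x8 -> 2x2 "tianzige" grid (8 / 4 = 2)
        self.pool = nn.MaxPool2d(kernel_size=4, stride=4)
        # grouped convolution over the 2x2 grid to limit parameters / over-fitting
        self.group_conv = nn.Conv2d(out_channels, out_channels,
                                    kernel_size=2, groups=groups)

    def forward(self, x):              # x: (batch, 1, 12, 12)
        x = torch.relu(self.conv(x))   # (batch, C, 8, 8)
        x = self.pool(x)               # (batch, C, 2, 2)
        x = self.group_conv(x)         # (batch, C, 1, 1)
        return x.flatten(1)            # glyph embedding, (batch, C)


glyph_emb = TianzigeCNN()(torch.randn(4, 1, 12, 12))
print(glyph_emb.shape)  # torch.Size([4, 1024])
```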

 3. Using image classification as an auxiliary objective

  To prevent over-fitting, an image-classification task is used as an auxiliary training objective: the character features extracted by the CNN are fed into an image classifier that predicts which Chinese character the image shows. The image-classification loss is the standard cross-entropy, L(cls) = -log p(z | x), where x is the character image and z the character label.

 Let L(task) denote the loss of the specific downstream NLP task the model is trained for (machine translation, word segmentation, etc.). The overall training objective is then

L = (1 - λ(t)) · L(task) + λ(t) · L(cls)

where λ(t) is the weight of the auxiliary objective and t is the number of training iterations (epochs); the paper lets λ(t) = λ0 · λ1^t decay exponentially with t.

We can see that at the beginning of training the image-classification objective has a large influence, and as the number of iterations grows its effect gradually shrinks. The intuition is that early in training we need to get more information from image classification.
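A minimal sketch of this multi-task objective, assuming the exponential decay schedule λ(t) = λ0 · λ1^t described above; the hyper-parameter values are illustrative, not the paper's:

```python
# Minimal sketch of the combined objective: the image-classification term
# matters most early in training and fades out as the epoch count grows.
# lambda0 and decay below are assumed values for illustration only.

def glyce_loss(task_loss, cls_loss, epoch, lambda0=0.1, decay=0.8):
    lam = lambda0 * (decay ** epoch)          # lambda(t) = lambda0 * lambda1**t
    return (1.0 - lam) * task_loss + lam * cls_loss


for epoch in (0, 5, 20):
    print(epoch, glyce_loss(task_loss=1.0, cls_loss=2.0, epoch=epoch))
```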

Depending on the downstream task, there are two kinds of embeddings:

4. Glyce character embeddings

Character embeddings: as shown in the figure below, Glyph Emb is the glyph model proposed in this paper, while char-ID Emb is the ordinary embedding of each Chinese character. Combining the two gives the complete embedding of a character; the combination can be concatenation, a fully connected network, a highway network, etc.
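A minimal sketch of this combination, using concatenation of the glyph vector (e.g., the output of the Tianzige-CNN) with the char-ID embedding; the vocabulary size and embedding dimensions are illustrative assumptions:

```python
import torch
import torch.nn as nn

class GlyceCharEmbedding(nn.Module):
    """Combine the glyph vector with the ordinary char-ID embedding.

    Concatenation is shown; the notes say a fully connected or highway
    network could be used instead. Dimensions are illustrative.
    """

    def __init__(self, vocab_size=10000, char_id_dim=128, glyph_dim=1024):
        super().__init__()
        self.char_id_emb = nn.Embedding(vocab_size, char_id_dim)

    def forward(self, char_ids, glyph_vecs):
        # char_ids: (batch,) integer ids; glyph_vecs: (batch, glyph_dim) from the CNN
        id_vec = self.char_id_emb(char_ids)
        return torch.cat([id_vec, glyph_vecs], dim=-1)  # (batch, char_id_dim + glyph_dim)


emb = GlyceCharEmbedding()
out = emb(torch.tensor([5, 42]), torch.randn(2, 1024))
print(out.shape)  # torch.Size([2, 1152])
```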

 5. Glyce word embeddings

Word embeddings: since a Chinese word can be seen as composed of Chinese characters, Glyce makes full use of the characters composing a word to obtain finer-grained semantic information about it. Glyce character embeddings are used to represent each character in the word. Because the number of characters in a word varies, Glyce applies a max-pooling layer over all the Glyce character embeddings to select features while keeping the output dimensionality fixed. The resulting vector is concatenated with the word-ID embedding to obtain the final Glyce word embedding, as in the sketch below.
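A minimal sketch of this word-level composition, assuming the Glyce character embeddings have already been computed; dimensions are illustrative:

```python
import torch

def glyce_word_embedding(char_embs, word_id_emb):
    """Max-pool over a variable number of character embeddings to get a
    fixed-size vector, then concatenate it with the word-ID embedding."""
    # char_embs: (num_chars, char_dim) -- one row per character in the word
    # word_id_emb: (word_dim,)
    pooled, _ = char_embs.max(dim=0)           # (char_dim,), length-invariant
    return torch.cat([word_id_emb, pooled])    # final Glyce word embedding


chars = torch.randn(3, 1152)   # a 3-character word
word = torch.randn(300)
print(glyce_word_embedding(chars, word).shape)  # torch.Size([1452])
```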
 

4. Experiments

  The method proposed in this paper is a new form of character and word embedding; for each downstream NLP task, the embeddings in the current best model are replaced with the Glyce embeddings proposed here.

1. Task 1: character-level language modeling

Predict the next character given the previous character. The task uses the Chinese Tree-Bank 6.0 (CTB6) dataset, which contains 4,401 distinct Chinese characters; the model is an LSTM and the evaluation metric is PPL.

Definition of perplexity

PPL (perplexity) is a metric used in natural language processing (NLP) to measure how good a language model is. It estimates the probability of a sentence from the probability of each word, normalized by the sentence length:

PPL(S) = p(w_1 w_2 … w_N)^(-1/N) = (∏_{i=1}^{N} p(w_i | w_1 … w_{i-1}))^(-1/N)

Here S denotes the sentence, N is the sentence length, and p(w_i) is the probability of the i-th word. The first factor is p(w_1 | w_0), where w_0 is START, a placeholder marking the beginning of the sentence.

Reading the formula this way: the smaller the PPL, the larger the p(w_i) values, i.e., the higher the probability the model assigns to the sentences we expect.
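A small worked example of this definition (plain Python, not from the paper):

```python
import math

def perplexity(word_probs):
    """PPL as the inverse geometric mean of the per-word probabilities
    p(w_i | w_1 ... w_{i-1}) assigned by the language model."""
    n = len(word_probs)
    log_prob = sum(math.log(p) for p in word_probs)
    return math.exp(-log_prob / n)


print(perplexity([0.2, 0.1, 0.25, 0.05]))  # ≈ 7.95; lower is better
```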

The experimental results are as follows:

2. Task 2: word-level language modeling

Using the Chinese Tree-Bank 6.0 (CTB6) dataset with jieba word segmentation, the word vectors produced by this method are fed into an LSTM to predict the next word given the previous word. In the ablation experiments, the combination of word-ID embeddings + Glyce word embeddings works best for word-level language modeling, reaching a PPL (perplexity) of 175.1. The results are shown below:

 3. Named entity recognition (character-level task)

Datasets: OntoNotes, MSRA and Resume. The Lattice-LSTM architecture is used, with char-ID embeddings replaced by Glyce-char embeddings. The results are as follows:

 4. Chinese word segmentation (character-level task)

Datasets: CTB6, PKU and Weibo. The current best model is the Lattice-LSTM; its char-ID embeddings are replaced with the Glyce-char embeddings proposed here. The results are shown below:

5. Part-of-speech tagging (character-level task)

The current best model is a character-level bidirectional RNN-CRF. The results are as follows:

6. Dependency parsing (word-level task)

Dependency parsing uses the Chinese Penn Treebank 5.1 dataset. Glyce word embeddings combined with the previously best Biaffine model improve over the previous best results by 0.9 UAS and 0.8 LAS. The results are as follows:

7. Task 7: semantic role labeling

The semantic role labeling experiments use the CoNLL-2009 dataset, with F1 as the final evaluation metric. The previously best model (k-order pruning) combined with Glyce word embeddings exceeds the previous best F1 by 0.9. The results are as follows:

8. Semantic similarity

The dataset is the BQ corpus, with 120,000 Chinese sentence pairs. The current best model is the bilateral multi-perspective matching model (BiMPM). The results are shown below:

9. Intent identification

Dataset: the Large-scale Chinese Question Matching Corpus (LCQMC). Current best model: the bilateral multi-perspective matching model (BiMPM). The results are shown below:

 10. Chinese-English machine translation

The training set for Chinese-English machine translation comes from the LDC corpora, the validation set from the NIST 2002 corpus, and the test sets are NIST 2003, 2004, 2005, 2006 and 2008, with BLEU as the final evaluation metric. Glyce word embeddings combined with a Seq2Seq + attention model reach a new best BLEU score on the test sets.

11. Sentiment analysis

Datasets: (1) Dianping: restaurant reviews, 2,000,000 training and 500,000 test examples; (2) JD Full: product reviews, 3M training and 250,000 test examples; (3) JD binary: product ratings, 4M training and 360,000 test examples. A bidirectional LSTM combined with the proposed embeddings is used. The results are as follows:


12. Text classification

Text classification uses three datasets, the Fudan corpus, IFeng and ChinaNews, with accuracy as the evaluation metric. Glyce character embeddings combined with a Bi-LSTM achieve the best results on all three datasets. The results are as follows:


 13. Discourse parsing

Dataset: the Chinese Discourse Treebank (CDTB). Current best model: the RvNN model. The results are as follows:
