PaddleNLP: Ten Open-Source Chinese NLP Tools in Detail

PaddleNLP is an industrial-grade open-source toolkit for Chinese NLP, developed on PaddlePaddle (飞桨), Baidu's deep learning platform. It implements a wide range of natural language processing models on top of a shared set of skeleton code, so developers can greatly reduce repetitive work during development. PaddleNLP also provides models pre-trained on Baidu's massive datasets, covers a rich set of NLP tasks, and lets developers flexibly plug in and try out a variety of network structures to reach industrial-grade results quickly. Below is a walkthrough of the ten NLP tasks and tools that PaddleNLP supports.


First, Text Classification


1. Sentiment Analysis


Emotion is an advanced form of intelligent human behavior, and identifying the emotional tendency of a text requires deep semantic modeling. In addition, different domains (such as dining or sports) differ in how emotion is expressed, so the model must be trained on large-scale data covering many domains. To address these two problems, we combine deep semantic models with large-scale data mining. Senta (Sentiment Classification), Baidu's independently developed sentiment analysis model for Chinese, targets Chinese text with subjective descriptions: it automatically determines the sentiment polarity of a text (positive or negative) together with a confidence score. Sentiment analysis can help companies understand consumer preferences, analyze hot topics, and monitor public opinion for crises, providing useful decision support for enterprises.


Evaluation results for sentiment classification on the open-source ChnSentiCorp dataset are shown in the table below. In addition, PaddleNLP also open-sources a model trained on Baidu's massive data; fine-tuning it on ChnSentiCorp (see GitHub for the Finetune method based on the open-source model and dataset) yields even better results.


  • BOW (Bag Of Words): a non-sequence model that represents a sentence with basic fully connected layers (a minimal sketch of this model appears after the list).

  • CNN (Convolutional Neural Network): a basic sequence model that handles variable-length input and extracts features within local windows.

  • GRU (Gated Recurrent Unit): a sequence model that can capture long-distance dependencies in text.

  • LSTM (Long Short-Term Memory): a sequence model that can capture long-distance dependencies in text.

  • BI-LSTM (Bidirectional Long Short-Term Memory): a sequence model with a bidirectional LSTM structure that better captures the semantic features of a sentence.

  • ERNIE (Enhanced Representation through kNowledge IntEgration): Baidu's self-developed general semantic representation model, trained on massive data to capture prior textual knowledge, then fine-tuned on the sentiment classification dataset.

  • ERNIE + BI-LSTM: a BI-LSTM stacked on top of the ERNIE representation, fine-tuned on the sentiment classification dataset.
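To make the BOW baseline concrete, here is a minimal sketch of such a classifier using PaddlePaddle's `paddle.nn` API. This is an illustration rather than the repository's actual code; the vocabulary size, embedding dimension, and two-class output are assumptions for the example.

```python
import paddle
import paddle.nn as nn
import paddle.nn.functional as F

class BOWClassifier(nn.Layer):
    """Bag-of-words sentiment classifier: embed, average, fully connect."""
    def __init__(self, vocab_size=10000, emb_dim=128, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.fc = nn.Linear(emb_dim, num_classes)

    def forward(self, token_ids):
        # token_ids: [batch_size, seq_len] word indices
        embedded = self.embedding(token_ids)    # [batch, seq_len, emb_dim]
        pooled = paddle.mean(embedded, axis=1)  # order-insensitive "bag" pooling
        return self.fc(pooled)                  # logits for positive/negative

model = BOWClassifier()
logits = model(paddle.randint(0, 10000, [4, 32]))  # dummy batch of 4 sentences
probs = F.softmax(logits, axis=-1)  # per-class confidence scores
```

Because the pooled representation ignores word order, BOW is cheap and fast; the sequence models in the list above trade speed for the ability to exploit ordering information.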


[Table: sentiment classification results on ChnSentiCorp]

Project address:

https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/sentiment_classification


2. Dialogue Emotion Recognition


Dialogue emotion recognition applies to many scenarios such as chat and customer service. It can help companies better understand the quality of their conversations and improve a product's interactive experience, and it can also be used to analyze customer-service quality, reducing quality-control labor costs. EmoTect (Emotion Detection) focuses on identifying the user's emotion in intelligent dialogue scenarios: given user-entered text, it automatically determines the emotion category together with a confidence score. Emotions are classified as positive, negative, or neutral.


Evaluation results on a Baidu-built test set (covering chat, customer service, and other scenarios) and the NLPCC2014 Weibo emotion dataset are shown in the table below. In addition, PaddleNLP also open-sources a model trained on Baidu's massive data; fine-tuning it on a chat-dialogue corpus yields even better results.


  • BOW (Bag Of Words): a non-sequence model that uses basic fully connected layers.

  • CNN: a shallow CNN model that handles variable-length input sequences and extracts features within local windows.

  • TextCNN: a CNN model with multiple convolution kernel widths that better captures local correlations in a sentence (see the sketch after this list).

  • LSTM: a single-layer LSTM model that can capture long-distance dependencies in text.

  • BI-LSTM: a single-layer bidirectional LSTM model that better captures the semantic features of a sentence.

  • ERNIE: Baidu's self-developed general semantic representation model trained on massive data with prior textual knowledge, fine-tuned on the dialogue emotion classification dataset.
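As an illustration of the multi-kernel idea behind TextCNN, here is a hedged sketch in `paddle.nn`. The layer sizes, kernel widths, and the three-way (positive/negative/neutral) output are assumptions for this example, not the repository's configuration.

```python
import paddle
import paddle.nn as nn
import paddle.nn.functional as F

class TextCNN(nn.Layer):
    """TextCNN: parallel convolutions of several kernel widths, max-pooled."""
    def __init__(self, vocab_size=10000, emb_dim=128, num_filters=64,
                 kernel_sizes=(2, 3, 4), num_classes=3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.convs = nn.LayerList([
            nn.Conv1D(emb_dim, num_filters, k) for k in kernel_sizes
        ])
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, token_ids):
        emb = self.embedding(token_ids).transpose([0, 2, 1])  # [batch, emb, seq]
        # Each kernel width captures n-gram correlations of a different span.
        pooled = [F.relu(conv(emb)).max(axis=2) for conv in self.convs]
        return self.fc(paddle.concat(pooled, axis=1))
```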


[Table: dialogue emotion recognition results]

https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/emotion_detection


Second, Text Matching


1. Short-Text Semantic Matching


SimNet (SimilarityNet) is Baidu's self-developed framework for short-text semantic matching: given two input texts, it computes a similarity score. SimNet keeps the implicit continuous-vector approach to semantic representation, but models the semantic matching problem end to end in a deep learning framework, unifying both point-wise and pair-wise supervised learning in a single framework. In practical applications, massive user click behavior can be turned into massive weakly labeled data; when first applied to web search, SimNet showed great power and brought a significant improvement in relevance. The SimNet framework is widely used across Baidu products, with core network structures including BOW, CNN, RNN, and MMDNN. It provides a training and prediction framework for semantic similarity computation in scenarios such as information retrieval, news recommendation, and intelligent customer service, helping enterprises solve their semantic matching problems.
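The pair-wise setting can be sketched as a margin loss over (query, positive, negative) triples scored by a shared encoder. The snippet below is a minimal illustration of that idea, assuming cosine similarity and a margin of 0.1; it is not SimNet's actual implementation.

```python
import paddle.nn.functional as F

def pairwise_margin_loss(encoder, query, pos_text, neg_text, margin=0.1):
    """Hinge loss on cosine similarity: sim(q, pos) should exceed sim(q, neg)."""
    q, p, n = encoder(query), encoder(pos_text), encoder(neg_text)  # shared weights
    sim_pos = F.cosine_similarity(q, p, axis=1)
    sim_neg = F.cosine_similarity(q, n, axis=1)
    # Loss drops to zero once the positive pair out-scores the negative by the margin.
    return F.relu(margin - sim_pos + sim_neg).mean()
```

In the weakly supervised setting described above, such triples can be mined from click logs, with a clicked result serving as the positive text for a query.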


On massive Baidu search data, PaddleNLP trained a SimNet-BOW pair-wise semantic matching model; in some real FAQ question-answering scenarios, it improves AUC by more than 5% over methods based on literal similarity. Evaluation results on a Baidu-built test set (covering chat, customer service, and other data) and the open LCQMC semantic matching dataset are shown in the table below.


On the LCQMC dataset, accuracy is the evaluation metric. Because the pair-wise model outputs a similarity score rather than a class, a classification threshold of 0.958 is applied. Compared with the CBOW baseline model of comparable network complexity (accuracy 0.737), BOW_Pairwise raises accuracy to 0.7532.


[Table: semantic matching results]

https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/similarity_net


Third, Sequence Labeling


1. Lexical Analysis


LAC (Lexical Analysis of Chinese) is Baidu's self-developed lexical analysis model with Chinese-specific features: its input is a string, and its output is the word boundaries, part-of-speech tags, and entity categories of the sentence. Sequence labeling is the classic way to model lexical analysis. LAC learns features with a GRU-based network and feeds them into a CRF decoding layer to complete the sequence labeling. The CRF decoding layer essentially replaces the linear model of a traditional CRF with a nonlinear neural network and works with sentence-level likelihood, so it better resolves the label bias problem. LAC completes Chinese word segmentation, part-of-speech tagging, and named entity recognition as one integrated task.
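A hedged sketch of the encoder side of such a tagger is shown below: a bidirectional GRU produces per-character emission scores, which LAC would pass to a CRF layer for sentence-level decoding. Here the CRF is omitted and replaced with a per-token argmax for brevity, and all sizes (including the tag count) are illustrative.

```python
import paddle
import paddle.nn as nn

class GRUTagger(nn.Layer):
    """Bi-GRU token tagger; LAC feeds scores like these into a CRF decoder."""
    def __init__(self, vocab_size=20000, emb_dim=128, hidden=128, num_tags=60):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hidden, direction="bidirect")
        self.emission = nn.Linear(hidden * 2, num_tags)  # per-token tag scores

    def forward(self, char_ids):
        emb = self.embedding(char_ids)         # [batch, seq, emb]
        feats, _ = self.gru(emb)               # [batch, seq, 2 * hidden]
        scores = self.emission(feats)          # a CRF would decode these jointly
        return paddle.argmax(scores, axis=-1)  # greedy decode as a simplification
```

Tag schemes for this kind of model typically combine a boundary marker with a category (for example B-PER / I-PER), which is how segmentation, POS tagging, and entity recognition come out of a single pass, as described above.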


Word segmentation, POS tagging, and named entity recognition were evaluated jointly on a self-built dataset, with results shown in the table below. In addition, fine-tuning on ERNIE, the semantic representation model released with PaddlePaddle, and comparing the baseline model against the BERT-finetuned and ERNIE-finetuned versions shows a significant improvement.


[Table: lexical analysis results]

https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/lexical_analysis


Fourth, Text Generation


1. Machine Translation


Machine translation (MT) is the process of using a computer to convert one natural language (the source language) into another (the target language): the input is a source-language sentence and the output is the corresponding target-language sentence. The Transformer, proposed in the paper "Attention Is All You Need", is a new network architecture for sequence-to-sequence (Seq2Seq) learning tasks such as machine translation.


Like earlier Seq2Seq models, it adopts the typical encoder-decoder framework, but unlike the previously dominant recurrent neural networks (RNNs), it relies entirely on the attention mechanism for sequence-to-sequence modeling. Transformer models in the Base and Big configurations were trained on the public WMT'16 EN-DE dataset and evaluated on the corresponding test sets, with results shown in the table below.
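The core operation is the paper's scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ/√d_k)V. A minimal sketch:

```python
import paddle
import paddle.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """q, k, v: [batch, heads, seq_len, d_k] tensors (shapes are illustrative)."""
    d_k = q.shape[-1]
    # Similarity of every query position to every key position.
    scores = paddle.matmul(q, k, transpose_y=True) / (d_k ** 0.5)
    weights = F.softmax(scores, axis=-1)  # attention distribution over keys
    return paddle.matmul(weights, v)      # weighted sum of value vectors
```

The 1/√d_k scaling keeps the dot products from growing with the key dimension, which would otherwise push the softmax into regions with vanishing gradients.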


[Table: Transformer results on WMT'16 EN-DE]

https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/neural_machine_translation/transformer


Fifth, Semantic Representation and Language Models


1. Language Representation Toolbox


BERT is a general-purpose semantic representation model with strong transferability. Using the Transformer as its basic network component, and bidirectional Masked Language Modeling and Next Sentence Prediction as training objectives, it obtains a general semantic representation through pre-training; combined with a simple output layer, it can be applied to downstream NLP tasks and has achieved SOTA results on many of them.


ELMo (Embeddings from Language Models) is another important general semantic representation model. It uses bidirectional LSTMs as its basic network component and language modeling as its training objective; the general semantic representation obtained by pre-training is transferred to downstream NLP tasks as features, significantly improving their performance. PaddleNLP has released an ELMo model pre-trained on encyclopedia data.


ERNIE, Baidu's self-developed semantic representation model, learns real-world semantic knowledge by modeling the words, entities, and entity relations in massive data. Whereas BERT learns from the raw language signal, ERNIE directly models units of prior semantic knowledge, strengthening the model's semantic representation ability.


Here is an example:


Learnt by BERT: 哈 [mask] 滨是 [mask] 龙江的省会,[mask] 际冰 [mask] 文化名城。

Learnt by ERNIE: [mask] [mask] [mask] 是黑龙江的省会,国际 [mask] [mask] 文化名城。


In the BERT model, the local co-occurrence of 『哈』 and 『滨』 is enough to predict the character 『尔』, so the model learns nothing about the entity 『哈尔滨』 (Harbin). By learning representations of words and entities, ERNIE can model the relation between 『哈尔滨』 and 『黑龙江』 (Heilongjiang), learning that Harbin is the capital of Heilongjiang and that Harbin is a city of ice and snow.
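The difference in masking strategy can be shown in a few lines of plain Python. This is a toy illustration only; real implementations operate on subword or character tokens with additional rules.

```python
import random

sentence = list("哈尔滨是黑龙江的省会")
entity_spans = [(0, 3), (4, 7)]  # character spans of 哈尔滨 and 黑龙江

def bert_style_mask(tokens):
    """BERT-style: mask individual characters independently."""
    i = random.randrange(len(tokens))
    return tokens[:i] + ["[mask]"] + tokens[i + 1:]

def ernie_style_mask(tokens, spans):
    """ERNIE-style: mask a whole entity span, forcing entity-level prediction."""
    start, end = random.choice(spans)
    return tokens[:start] + ["[mask]"] * (end - start) + tokens[end:]

print("".join(bert_style_mask(sentence)))                 # e.g. 哈[mask]滨是黑龙江的省会
print("".join(ernie_style_mask(sentence, entity_spans)))  # e.g. [mask][mask][mask]是黑龙江的省会
```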


On the training-data side, besides encyclopedia and news corpora in Chinese, ERNIE also introduces forum dialogue data. It uses a DLM (Dialogue Language Model) to model the Query-Response dialogue structure, taking dialogue pairs as input, introducing Dialogue Embeddings to identify the roles in a dialogue, and using a Dialogue Response Loss to learn the implicit relations in a conversation, which further improves the model's semantic representation ability.


ERNIE leads on many Chinese NLP tasks, including natural language inference, semantic similarity, named entity recognition, sentiment analysis, and question-answer matching.


[Table: ERNIE results on Chinese NLP tasks]


https://github.com/PaddlePaddle/LARK/tree/develop/ERNIE

https://github.com/PaddlePaddle/LARK/tree/develop/BERT

https://github.com/PaddlePaddle/LARK/tree/develop/ELMo


2. Language Model


The LSTM-based language model task is: given an input word sequence (word-segmented for Chinese, tokenized for English), compute its PPL (perplexity, a language-model metric that reflects the fluency of a sentence). For an introduction to recurrent neural network language models, see the paper "Recurrent Neural Network Regularization". Compared with traditional methods, RNN-based methods handle rare (sparse) words better. This task uses the RNN architecture common to sequence tasks, implementing a two-layer LSTM network whose output is used to predict the probability of the next word.
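A hedged sketch of this setup follows; sizes are illustrative, and PPL is computed as the exponential of the average per-token cross-entropy.

```python
import paddle
import paddle.nn as nn
import paddle.nn.functional as F

class LSTMLanguageModel(nn.Layer):
    def __init__(self, vocab_size=10000, emb_dim=200, hidden=200):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, num_layers=2)  # two stacked LSTM layers
        self.proj = nn.Linear(hidden, vocab_size)           # next-word logits

    def forward(self, token_ids):
        out, _ = self.lstm(self.embedding(token_ids))
        return self.proj(out)  # [batch, seq, vocab]

model = LSTMLanguageModel()
tokens = paddle.randint(0, 10000, [4, 20])     # dummy batch of word ids
logits = model(tokens[:, :-1])                 # predict each following token
loss = F.cross_entropy(logits.reshape([-1, 10000]),
                       tokens[:, 1:].reshape([-1]))
ppl = paddle.exp(loss)  # perplexity: lower means the model finds the text more fluent
```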


The PPL comparison across the small, medium, and large configurations is shown in the tables below.


[Tables: PPL under the small, medium, and large configurations]

https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/language_model


Sixth, Complex Tasks


1. Dialogue Model Toolbox


Auto Dialogue Evaluation


The automatic dialogue evaluation module is mainly used to evaluate the response quality of open-domain dialogue systems. It can help enterprises or individuals quickly assess response quality and cut the cost of human evaluation.


1) Without labeled data, a matching model trained via negative sampling serves as the evaluation tool, making it possible to rank the response quality of multiple dialogue systems (a sketch of the negative-sampling setup follows below);


2) With a small amount of labeled data (human scores for a specific dialogue system or scenario), fine-tuning on top of the matching model significantly improves evaluation quality for that system or scenario.
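The negative-sampling scheme in 1) can be sketched in plain Python: each real (context, response) pair is a positive example, and a response drawn from a different dialogue serves as a negative for training the matcher. A toy illustration, not the module's actual data pipeline:

```python
import random

def make_training_pairs(dialogues):
    """dialogues: list of (context, response) tuples from real conversations."""
    responses = [response for _, response in dialogues]
    pairs = []
    for context, response in dialogues:
        pairs.append((context, response, 1))  # real pair -> positive label
        negative = random.choice(responses)
        while negative == response:           # avoid sampling the true response
            negative = random.choice(responses)
        pairs.append((context, negative, 0))  # mismatched pair -> negative label
    return pairs
```

A matching model trained on such pairs can then score any (context, response) pair, and systems whose responses score higher are ranked as better.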


Taking four different dialogue systems (seq2seq_naive/seq2seq_att/keywords/human) as an example, the automatic evaluation tool is used to score them.


1) Without labeled data, the pre-trained evaluation tool is used directly. The Spearman correlation between automatic and human scores on the four dialogue systems is shown in the table below (a SciPy sketch for computing this correlation appears after these results).


[Table: Spearman correlation without labeled data]


2) Ranking the four systems by average score:


[Table: ranking of the four systems by average score]


3) After fine-tuning with a small amount of labeled data, the Spearman correlation between automatic and human scores is shown in the table below.


[Table: Spearman correlation after fine-tuning]
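For reference, the Spearman correlation reported in these tables can be computed with SciPy; the scores below are hypothetical stand-ins.

```python
from scipy.stats import spearmanr

auto_scores = [0.52, 0.61, 0.48, 0.90]   # hypothetical automatic scores
human_scores = [2.1, 2.8, 1.9, 4.5]      # hypothetical human ratings
rho, p_value = spearmanr(auto_scores, human_scores)
print(f"spearman rho = {rho:.3f}")       # rank correlation: 1.0 = identical ranking
```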

https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/dialogue_model_toolkit/auto_dialogue_evaluation


Deep Attention Matching Network


The Deep Attention Matching Network is a multi-turn response matching model for open-domain dialogue. Given the multi-turn dialogue history and a set of candidate responses, it ranks the candidates and selects the most suitable response.


The input of the multi-turn matching task is the dialogue history and the candidate responses; the output is a matching score for each response, by which the candidates are ranked. For more details, see the paper "Multi-Turn Response Selection for Chatbots with Deep Attention Matching Network".


Evaluation results on two public datasets are shown in the table below.

[Table: DAM results on two public datasets]

https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/dialogue_model_toolkit/deep_attention_matching


DGU: Dialogue General Understanding Model


In dialogue-related work, a dialogue system must solve a wide variety of tasks as scenarios change. The diversity of tasks (intent detection, slot filling, dialogue-act recognition, dialogue state tracking, and so on) and the scarcity of in-domain training data pose great difficulties and challenges for research and application, and moving dialogue systems forward calls for a general dialogue understanding model. Experiments show that DGU (Dialogue General Understanding), a BERT-based general dialogue understanding module, combines the base model (BERT) with common learning paradigms to match or even surpass the best in-domain models on nearly all dialogue understanding tasks, demonstrating the great potential of learning a single general dialogue understanding model.


DGU ships training pipelines for the relevant datasets, supporting classification, multi-label classification, sequence labeling, and other task types, so users can customize models for their own datasets. Evaluation results on public dialogue datasets are shown in the table below.


[Table: DGU results on public dialogue datasets]

https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/dialogue_model_toolkit/dialogue_general_understanding


2. Knowledge-Driven Dialogue


Human-machine conversation is one of the most important topics in artificial intelligence (AI) and has drawn wide attention from academia and industry in recent years. Today's dialogue systems are still in their infancy: they usually converse passively, uttering their words more as responses than as initiatives of their own, which is unlike human-to-human conversation. We therefore organized a competition around a new dialogue task called knowledge-driven dialogue, in which the machine converses with a human based on a constructed knowledge graph. It aims to test a machine's ability to conduct human-like conversation.


We provide retrieval-based and generation-based baseline systems, implemented in PaddlePaddle (Baidu's deep learning platform) and PyTorch (Facebook's deep learning framework) respectively. The performance of the two systems is shown in the table below.


[Table: baseline system performance]

https://github.com/baidu/knowledge-driven-dialogue/tree/master


3. Reading Comprehension


In a machine reading comprehension (MRC) task, given a question (Q) and one or more passages (P) or documents (D), the machine searches the given passages for the correct answer (A), i.e. Q + P or D => A. MRC is one of the key tasks in natural language processing (NLP), requiring a deep understanding of language to find the correct answer. The PaddlePaddle-based reading comprehension model upgrades the classic BiDAF model: it removes the char-level embedding, uses a pointer network in the prediction layer, and borrows some network structures from R-NET, which yields a large improvement (see the table below for results on the DuReader 2.0 validation and test sets).
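The pointer-style prediction layer mentioned above can be sketched as two heads that score every passage position as an answer start or end, with the best (start <= end) pair taken as the span. This is a simplified illustration, not the repository's R-NET-style implementation.

```python
import paddle
import paddle.nn as nn
import paddle.nn.functional as F

class SpanPredictor(nn.Layer):
    """Score each passage token as answer start/end, then pick the best span."""
    def __init__(self, hidden=256):
        super().__init__()
        self.start_head = nn.Linear(hidden, 1)
        self.end_head = nn.Linear(hidden, 1)

    def forward(self, passage_enc):
        # passage_enc: [batch, seq_len, hidden] question-aware passage encodings
        p_start = F.softmax(self.start_head(passage_enc).squeeze(-1), axis=-1)
        p_end = F.softmax(self.end_head(passage_enc).squeeze(-1), axis=-1)
        return p_start, p_end

def best_span(p_start, p_end):
    """Maximize p_start[i] * p_end[j] over i <= j for a single example."""
    best_prob, span = 0.0, (0, 0)
    for i in range(len(p_start)):
        for j in range(i, len(p_end)):
            prob = float(p_start[i]) * float(p_end[j])
            if prob > best_prob:
                best_prob, span = prob, (i, j)
    return span
```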


DuReader is a large-scale, real-world, human-generated Chinese reading comprehension dataset. DuReader focuses on open-domain question answering drawn from the real world. Compared with other reading comprehension datasets, DuReader's advantages include:

  • Real questions taken from search logs

  • Article content taken from real web pages

  • Answers generated by humans

  • Oriented toward real application scenarios

  • Richer and more detailed annotations

[Table: reading comprehension results on DuReader 2.0]

https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/reading_comprehension

