Improving Chemical Named Entity Recognition in Patents with Contextualized Word Embeddings

This paper introduces ELMo embeddings.

ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019) can be used to generate contextualized word representations by combining the internal states of different layers in a neural language model. Contextualized word representations can improve performance on various NLP tasks by incorporating contextual information, essentially allowing the same word to have distinct, context-dependent meanings. This could be particularly powerful for chemical NER, since generic chemical names (e.g. salt, acid) may have different meanings in other domains. We therefore explore the impact of using contextualized word representations for chemical patents.
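To make "combining internal states of different layers" concrete, here is a minimal sketch of the kind of weighted layer mix that Peters et al. (2018) describe for ELMo: softmax-normalized per-layer weights plus a global scale. The function name `scalar_mix`, its parameters, and the toy vectors are assumptions for illustration, not code from the paper or from any ELMo library.

```python
import math
from typing import List, Sequence


def scalar_mix(
    layer_states: Sequence[Sequence[float]],
    layer_logits: Sequence[float],
    gamma: float = 1.0,
) -> List[float]:
    """Mix the per-layer vectors of ONE token into a single vector.

    layer_states: one vector per language-model layer (all the same size).
    layer_logits: one unnormalized weight per layer (hypothetical learned values).
    gamma: global scale applied to the mixed vector.
    """
    if len(layer_states) != len(layer_logits):
        raise ValueError("need exactly one logit per layer")

    # Softmax over the layer logits (shifted by the max for numerical stability).
    shift = max(layer_logits)
    exps = [math.exp(x - shift) for x in layer_logits]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of the layer vectors, dimension by dimension.
    mixed = [0.0] * len(layer_states[0])
    for weight, state in zip(weights, layer_states):
        for i, value in enumerate(state):
            mixed[i] += weight * value

    return [gamma * v for v in mixed]


# Toy example: three layers, two dimensions, equal logits (so equal weights).
token_vector = scalar_mix(
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [0.0, 0.0, 0.0],
)
```

Because the layer states depend on the whole sentence, the same surface word (for example "salt") can end up with different mixed vectors in different contexts.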
We train ELMo on the same corpus of 84K patents (detailed in Table 1) that we use for training the ChemPatent embeddings (described in Section 3.4). We use the ELMo implementation provided by Peters et al. (2018) with default hyperparameters. Such neural language models require a large amount of computational resources to train. To keep training feasible, ELMo sets a maximum number of characters per token. However, systematic chemical names in chemical patents are often longer than the typical maximum length used by these neural language models. Because very long tokens tend to be systematic chemical names, we reduce the maximum token length from 50 to 25 characters and replace tokens longer than 25 characters with a special token, “Long Token”.
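The long-token replacement step can be sketched as a simple preprocessing pass. This is only an illustration under stated assumptions: the names `replace_long_tokens`, `MAX_TOKEN_CHARS`, and `LONG_TOKEN` are hypothetical, the input is assumed to be already tokenized, and the placeholder string is taken literally from the text above.

```python
from typing import Iterable, List

MAX_TOKEN_CHARS = 25       # threshold described in the text (reduced from 50)
LONG_TOKEN = "Long Token"  # special placeholder named in the text


def replace_long_tokens(
    tokens: Iterable[str],
    max_chars: int = MAX_TOKEN_CHARS,
    placeholder: str = LONG_TOKEN,
) -> List[str]:
    """Replace every token longer than max_chars with the placeholder.

    Tokens of exactly max_chars characters are kept unchanged.
    """
    return [placeholder if len(tok) > max_chars else tok for tok in tokens]


# Hypothetical tokenized fragment containing one long systematic name.
sentence = ["mixture", "of", "2-(4-chlorophenyl)-1,3-thiazole-4-carboxylic", "acid"]
cleaned = replace_long_tokens(sentence)
```

Running this pass before language-model training would keep every token within the character limit, at the cost of mapping all very long names to one shared placeholder.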



Reposted from blog.csdn.net/qq_28468707/article/details/103878292