Paper reading: 《Zero-shot Word Sense Disambiguation using Sense Definition Embeddings》

LawsonAbs' reading and thinking; please also read it critically.

To sum up

  • This post is divided into three parts: Part 1 covers background knowledge, Part 2 covers the paper's content, and Part 3 gives my personal thoughts
  • To summarize the paper in one sentence: it incorporates knowledge-graph (KG) knowledge into word sense disambiguation (WSD), turning discrete sense definitions and labels into continuous embedding representations, and trains the model with supervised learning to achieve better generalization
  • Original source: csdn+lawsonabs

Part 1: Background knowledge

Before discussing the paper, let me first go over a few simple but important concepts so that the content is easier to understand.

1 Word sense disambiguation

For this part, please refer to my earlier blog post.

2 Knowledge Graph

A knowledge graph typically comprises a set K of N triples (h, l, t), where the head h and tail t are entities, and l denotes a relation.

Simply put, a knowledge graph is a set of entities plus relations. So how do we represent the information in a knowledge graph?

The traditional way to represent a knowledge graph is to describe it with ontology languages such as OWL and RDF. With the development and application of deep learning, we hope to adopt a simpler representation, namely vectors, which make it convenient to perform various downstream tasks such as reasoning. Our goal now is therefore to encode each simple triple <subject, relation, object> as a low-dimensional distributed vector. (What is a distributed vector?)

3 Representation learning

Representation learning aims to represent the semantic information of the research object as dense, low-dimensional, real-valued vectors. Knowledge representation learning is mainly representation learning for the entities and relations in a knowledge graph: modeling methods place entities and relations in a low-dimensional dense vector space, where computation and inference can then be performed. Simply put, the process of expressing triples as vectors is called representation learning. The TransE model is a classic method in the Trans family.
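
As a concrete example (this is the standard TransE formulation from the representation-learning literature, not something introduced by this paper): TransE learns vectors so that a valid triple approximately satisfies a translation constraint, and scores a triple by the distance between the translated head and the tail.

```latex
\mathbf{h} + \mathbf{l} \approx \mathbf{t},
\qquad
f(h, l, t) = \lVert \mathbf{h} + \mathbf{l} - \mathbf{t} \rVert
```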

4 MFS

This is a commonly used baseline in WSD tasks; its full name is Most Frequent Sense (MFS).
The MFS strategy assigns to each word in the test set the sense that word most frequently takes in the training set. The advantage of this algorithm is that it is simple and direct, but its shortcoming is equally obvious: it cannot handle words that never appear in the training set. A minimal sketch follows.
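
A minimal sketch of this baseline (my own toy code, not from the paper):

```python
from collections import Counter, defaultdict

def train_mfs(labeled_instances):
    """labeled_instances: iterable of (lemma, sense) pairs from the training set.

    Returns a dict mapping each lemma to its most frequent sense.
    """
    counts = defaultdict(Counter)
    for lemma, sense in labeled_instances:
        counts[lemma][sense] += 1
    return {lemma: c.most_common(1)[0][0] for lemma, c in counts.items()}

def predict_mfs(mfs_table, lemma, fallback=None):
    # Unseen lemmas need a fallback, which is exactly the weakness noted above.
    return mfs_table.get(lemma, fallback)

# toy usage
table = train_mfs([("bank", "bank.financial"), ("bank", "bank.financial"),
                   ("bank", "bank.river")])
print(predict_mfs(table, "bank"))    # -> "bank.financial"
print(predict_mfs(table, "plant"))   # -> None (lemma never seen in training)
```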

5 lexical sample / all-words task

These two terms describe the scope of application of a WSD task.

5.1 lexical sample

Adam Kilgarriff's paper "English Lexical Sample Task Description" gives a description of this task:
[Figure: excerpt from the task description in Kilgarriff's paper]

5.2 all-words task

It means disambiguating all the words in the text that need disambiguation, usually nouns, adverbs, adjectives, and verbs.

Summarizing the two tasks, there are two main points:

  • Both concepts should come from the SemEval (Senseval) evaluations
  • The two should differ only in the range of words to be disambiguated

6 discrete label

A discrete label here is the one-hot sense label used as the training target by traditional supervised WSD; see also section 1.2 of the paper content below.

Part 2: Paper content

0 Summary

First, you need to understand the WSD task. It has been described above and will not be repeated here.
Why is the EWISE method proposed? Because current WSD algorithms are based on discrete labels, a word-sense-embedding-based approach is introduced to improve performance, solving the problem that senses appearing only in the test set cannot be accurately predicted.

1 Introduction

1.1 WSD

No further introduction~

1.2 Traditional methods

  • Traditional supervised algorithms
    Here, "discrete label" refers to the labels used by traditional supervised and semi-supervised algorithms in the WSD task. After consulting related material, I conclude that a discrete label is a one-hot vector over senses.

  • Semi-supervised algorithm
    From

But regardless of which previous algorithm is used, as long as it relies on discrete labels there are problems ("This leads to poor performance on rare and unseen senses."). To solve this problem, the EWISE method is introduced. Its innovation is to use the embedding of the word sense definition as the training target, achieving better generalization.

2. Related work

Related work is divided into supervised WSD and semi-supervised WSD; these algorithms rely on sense-annotated data and also make use of unlabeled corpora.

2.1 Lexical resources

Lexical resources provide important support for words and their meanings, so EWISE uses dictionary definitions to capture word meanings. But this is not the first time dictionary definitions have been used for WSD; as early as the Lesk algorithm, dictionary definitions were used for WSD. So what is the difference between EWISE and these methods? There are two main points:

  • EWISE uses definition embeddings as the target embeddings, in a supervised training process.
  • It does not rely on any overlap assumption (namely, that a word's definition and its context overlap heavily), but relies only on the definitions provided by WordNet

Of course, there are other methods for obtaining continuous representations of definitions; the paper also evaluates the effects of these methods (including ELMo, BERT, etc.).

2.2 Structural Knowledge

I don't know what kind of knowledge this refers to.

The paper first notes that there are several ways to use structural knowledge for WSD; graph-based techniques are used to match words to their most relevant sense. But EWISE differs from them:

  • use structural knowledge to learn better representations of definitions

2.3 Main contributions

  • predicting in an embedding space (key claim)
  • allowing generalized zero-shot learning capability [because the model no longer depends solely on labeled data, but directly uses the definition information in WordNet]
  • incorporating definitions and structural knowledge

3. Algorithm

3.1 Main framework

The most important part of any paper is its algorithmic framework, so let's take a look at it.
[Figure: the overall EWISE framework]
This framework is mainly divided into two parts:

  • Attentive Context Encoder
  • Definition Encoder

3.2 Attentive Context Encoder

That is, obtain a representation of the word that incorporates its context (i.e. integrates contextual information), using a BiLSTM + self-attention approach, as sketched below.
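
A minimal PyTorch sketch of this idea (my own illustration; the class name, dimensions, and the exact attention form are assumptions, not the paper's released code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveContextEncoder(nn.Module):
    """Sketch of a BiLSTM + self-attention context encoder (assumed dims)."""

    def __init__(self, emb_dim=300, hidden_dim=256, out_dim=200):
        super().__init__()
        self.bilstm = nn.LSTM(emb_dim, hidden_dim,
                              bidirectional=True, batch_first=True)
        self.attn = nn.Linear(2 * hidden_dim, 2 * hidden_dim)
        self.proj = nn.Linear(4 * hidden_dim, out_dim)

    def forward(self, word_embs):              # (batch, seq_len, emb_dim)
        h, _ = self.bilstm(word_embs)          # (batch, seq_len, 2*hidden_dim)
        # self-attention: every position attends over the whole sentence
        scores = torch.bmm(self.attn(h), h.transpose(1, 2))
        weights = F.softmax(scores, dim=-1)
        context = torch.bmm(weights, h)        # attended summary per position
        # concatenate each BiLSTM state with its attended context, then project
        return self.proj(torch.cat([h, context], dim=-1))   # (batch, seq_len, out_dim)
```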

3.3 Definition Encoder

Obtain the embedding representation of each word sense definition to serve as the target embedding; a rough sketch follows.
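
A matching sketch for the definition encoder, under the same assumptions as above (the paper also evaluates off-the-shelf definition encoders such as ELMo and BERT, as noted in the related-work section):

```python
import torch
import torch.nn as nn

class DefinitionEncoder(nn.Module):
    """Sketch: encode a WordNet gloss into one sense (target) embedding."""

    def __init__(self, emb_dim=300, hidden_dim=256, out_dim=200):
        super().__init__()
        self.bilstm = nn.LSTM(emb_dim, hidden_dim,
                              bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden_dim, out_dim)

    def forward(self, gloss_embs):             # (batch, gloss_len, emb_dim)
        h, _ = self.bilstm(gloss_embs)
        pooled = h.max(dim=1).values           # pool over the gloss tokens
        return self.proj(pooled)               # (batch, out_dim) target embedding
```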

3.4 Training

3.4.1 What does the KG part do?

With the above inputs, the whole model can be trained. So the question is: what does the Knowledge Graph Embedding part do? The target embedding obtained at first may not be satisfactory, so the KG is incorporated to improve it; this KG is WordNet. What are the specific interaction details? They are explained below.

The TransE and ConvE methods can be learned from the blog post listed in the references. Briefly, both define a scoring function over triples <h,l,t>, which represents the distance between some operation on h and l and the tail t. The two use different distance functions and loss functions, which are minimized during training to achieve a better result. In this way a knowledge graph can be reduced to low-dimensional vectors that preserve its relational structure (a small sketch follows).
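
A minimal PyTorch sketch of the TransE scoring function and a margin ranking loss (my own illustration based on the general TransE recipe; ConvE instead scores triples with a 2D convolution over reshaped embeddings and is not sketched here):

```python
import torch

def transe_score(h, l, t, p=1):
    """TransE plausibility score: negative distance between (h + l) and t.

    h, l, t: (batch, dim) embeddings of head, relation, tail.
    A higher score means the triple is more plausible.
    """
    return -torch.norm(h + l - t, p=p, dim=-1)

def margin_ranking_loss(pos_score, neg_score, margin=1.0):
    # Push true triples to score at least `margin` above corrupted ones.
    return torch.clamp(margin - pos_score + neg_score, min=0).mean()

# toy usage with random embeddings (illustration only)
dim = 50
h, l, t = (torch.randn(4, dim) for _ in range(3))
t_corrupted = torch.randn(4, dim)            # corrupted tails as negative samples
loss = margin_ranking_loss(transe_score(h, l, t),
                           transe_score(h, l, t_corrupted))
print(loss)
```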

3.4.2 Training steps

The entire training process can be described by the following steps:

  • Given a sentence, use a BiLSTM to obtain the embedding of the word that needs to be disambiguated (call this word A)
  • Apply a series of operations to this embedding (in order: self-attention over the other words => concatenation => linear projection) to obtain the final vector u_i, i.e. the sense embedding prediction in the figure; call it x
  • Obtain the word's sense inventory from WordNet and convert each sense into a sense embedding with the trained Definition Encoder. These sense embeddings are the target embeddings mentioned in the paper, i.e. the "labels" of the data; denote them y
  • Take the dot product of x and y, apply softmax to the result, and compute the cross-entropy loss to iteratively update the parameters (see the sketch below)
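
Putting the last two steps together, here is a minimal sketch of the dot-product + softmax + cross-entropy computation (the tensor shapes and names are my own assumptions):

```python
import torch
import torch.nn.functional as F

def wsd_loss(context_emb, candidate_sense_embs, gold_index):
    """Score a word's context embedding x against its candidate sense embeddings y.

    context_emb:          (dim,)            prediction from the context encoder (x)
    candidate_sense_embs: (num_senses, dim) definition embeddings from WordNet (y)
    gold_index:           int               position of the annotated sense
    """
    scores = candidate_sense_embs @ context_emb        # dot product per sense
    # cross_entropy applies log-softmax internally, matching the
    # softmax + cross-entropy step described above
    return F.cross_entropy(scores.unsqueeze(0), torch.tensor([gold_index]))

# toy usage
x = torch.randn(200)          # predicted sense embedding for the target word
y = torch.randn(4, 200)       # 4 candidate senses of that word from WordNet
print(wsd_loss(x, y, gold_index=2))
```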

4. Data & experiment

There is not much to say about this part; the experiments are designed mainly around the key claim of EWISE, the ability to disambiguate unseen and rare words.

  • WSD on Rare Words [what should be done when words in the test set do not appear in the training set?]
  • WSD on Rare Senses [what should be done when the senses of words in the test set do not appear in the training set?]
    Thanks to the characteristics of this algorithm, both situations can be handled directly, without falling back on other strategies (such as MFS) for special cases.

Part 3: Personal thoughts

5. Personal thinking

  • "The more data, the better the training effect" seems to be a golden rule. But must that data be annotated, or can other resources be relied on instead? For example: dictionaries, or the structural information of the sentence itself?
    It can be seen that a lot of work makes full use of such information to achieve better results.

7. Reference materials

  • https://blog.csdn.net/weixin_40871455/article/details/83341561 [introduces the TransE algorithm; good introductory material]
  • https://www.aclweb.org/anthology/P19-1568/ [the paper]
  • https://www.aclweb.org/anthology/S01-1004.pdf [introduction to the lexical sample task]
  • https://zhuanlan.zhihu.com/p/54657158 [introduces common concepts in the lexical sample task and WordNet]
