Papers read - Strategies for Structuring Story Generation

To make the vocabulary of the generated stories richer, the authors split the generation process into three stages and make a small change in each. Let's look at how they write a story.

First, an SRL tool converts the story into structured data. Mentions of the same entity are represented by one placeholder: in the figure above, ent0 is "me", ent1 is the claws, and ent2 is the head. (Honestly I don't fully understand this part, because ent2 is later filled in as "my head", which contains a word belonging to "me", and ent1 covers both "sharp claws" and "they", a pronoun for the claws. So is "I" its own entity, or is it part of the other entities?)

Next, the entities in the story are replaced with placeholders.

Finally, the blanks are filled back in.
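To make the three stages concrete, here is a tiny made-up example (not taken from the paper) written as Python data, showing roughly what each stage produces:

```python
# A made-up mini example of the three stages (not from the paper):
# story text -> SRL-style action plan -> entity-anonymised text -> filled story.

story = "The cat pounced on me. Its sharp claws scratched my head."

# Stage 1: an SRL-style plan capturing "who did what to whom".
srl_plan = [
    {"verb": "pounced", "ARG0": "The cat", "ARG1": "me"},
    {"verb": "scratched", "ARG0": "Its sharp claws", "ARG1": "my head"},
]

# Stage 2: every mention of the same entity becomes one abstract placeholder.
anonymised = "ent0 pounced on ent1. ent0's sharp claws scratched ent1's head."

# Stage 3: a seq2seq model rewrites each placeholder as a concrete surface form
# ("the cat", "it", "I", "my"), choosing between names and pronouns from context.
```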

Data

The prompts are human-written. The dataset used is WRITINGPROMPTS, where each story averages about 734 words. In the experiments, stories are limited to at most 1000 words; the prompt vocabulary has 19,025 words and the story vocabulary 104,960 words.

First, structuring the story data

This step is data preprocessing: given the input text, it outputs the predicates and their argument sequences, which essentially determines "who did what to whom", "when", and "where".

It feels like a sequence-labeling task: a pre-trained model labels the predicates and arguments in each sentence. The authors also make an improvement here: one head of the multi-head decoder attends specifically to the verbs generated so far. As a result, the model produces a wider range of verbs and avoids repeating the same words within a sentence.
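As I understand it, this "verb attention" means restricting one attention head to the positions where verbs were generated. Below is a minimal sketch of that idea; the function name, shapes, and details are my own simplification, not the authors' code:

```python
import torch
import torch.nn.functional as F

def verb_attention(query, decoder_states, verb_mask):
    """One attention head restricted to previously generated verbs.

    query:          (d,)   current decoder state
    decoder_states: (t, d) states of the t tokens generated so far
    verb_mask:      (t,)   bool tensor, True where the generated token was a verb
    """
    scores = decoder_states @ query                          # (t,) similarity scores
    scores = scores.masked_fill(~verb_mask, float("-inf"))   # hide non-verb positions
    if not verb_mask.any():                                  # no verbs generated yet
        return torch.zeros_like(query)
    weights = F.softmax(scores, dim=-1)                      # attention over verbs only
    return weights @ decoder_states                          # verb-aware context vector
```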

Second, entity modeling

Low-frequency words such as people's names and place names are hard to generate with a word-level language model, but tackling the problem at the entity level makes things much easier. You can think of it as cutting slots into the sentence and then finding ways to fill those slots, filling them both accurately and with rich variety.

The problem is split into two steps: creating the slots and filling the slots.

2.1 Creating the slots

A reading-comprehension-like approach is used: every mention of a given entity in the story is represented by the same placeholder.

Two specific cases are mentioned:

  1. Use an NER model to identify person names, place names, and organization names; identical names are represented by the same placeholder (see the sketch after this list).

  2. When several different strings refer to the same entity, the NER-based replacement above is not enough. In this case a coreference resolution model is used to cluster the mentions that refer to the same semantic entity, and the cluster is then replaced with a placeholder. One thing to note with this method: if an entity has only one mention (i.e. it appears once in the story), it has no coreference link, but it still needs its own unique placeholder.
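As an illustration of point 1 only, here is a rough sketch of NER-based placeholder replacement, using spaCy as a stand-in NER model (the paper's actual NER and coreference components may be different):

```python
import spacy

nlp = spacy.load("en_core_web_sm")   # any English NER model would do here

def anonymise_entities(text):
    """Replace each distinct named-entity string with a stable 'entK' placeholder."""
    doc = nlp(text)
    placeholders, pieces, last = {}, [], 0
    for ent in doc.ents:
        if ent.label_ not in {"PERSON", "GPE", "ORG"}:   # people, places, organisations
            continue
        slot = placeholders.setdefault(ent.text, f"ent{len(placeholders)}")
        pieces.append(text[last:ent.start_char])
        pieces.append(slot)
        last = ent.end_char
    pieces.append(text[last:])
    return "".join(pieces), placeholders

story, mapping = anonymise_entities("Alice met Bob in Paris. Alice smiled.")
# story   -> "ent0 met ent1 in ent2. ent0 smiled."
# mapping -> {"Alice": "ent0", "Bob": "ent1", "Paris": "ent2"}
```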

2.2 Filling the slots

How is the same entity expressed in different ways, e.g. as "I", "me", or "that girl"? Another model is used here, referred to as a sub-word seq2seq: a seq2seq model generates the surface text of each mention, and a pointer-copy mechanism is added to the decoder, so the model can either generate a new pronoun or name, or reuse an entity name it has already produced.

I roughly understand the pointer-copy mechanism here, but I did not fully understand the original description, which reads as follows:

To generate an entity reference, the decoder can either generate a new abstract entity token or choose to copy an already generated abstract entity token, which encourages the model to use consistent naming for the entities.
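My reading of this is the standard pointer-generator mixture: the final distribution is a weighted sum of a generate distribution over the vocabulary and a copy distribution over tokens already produced. A simplified sketch (my own, not the paper's implementation):

```python
import torch

def pointer_copy_distribution(vocab_logits, attn_weights, prev_token_ids, p_gen):
    """Pointer-generator style mixture of generating vs. copying.

    vocab_logits:   (V,) decoder scores over the output vocabulary
    attn_weights:   (S,) attention over previously generated tokens (sums to 1)
    prev_token_ids: (S,) vocabulary ids of those tokens, e.g. earlier 'ent0', 'ent1'
    p_gen:          scalar in (0, 1), probability of generating instead of copying
    """
    gen_dist = torch.softmax(vocab_logits, dim=-1) * p_gen
    copy_dist = torch.zeros_like(gen_dist)
    copy_dist.scatter_add_(0, prev_token_ids, attn_weights * (1.0 - p_gen))
    return gen_dist + copy_dist   # copying an earlier 'entK' keeps entity naming consistent
```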

The authors also feed some extra inputs into the model. The paper describes them as follows:

  • A bag-of-words context window around the specific entity mention, which allows local context to determine if an entity should be a name, pronoun or nominal reference.

  • Previously generated references for the same entity placeholder.
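A rough sketch of how these two extra inputs could be built; the function names and the window size are hypothetical, just to make the idea concrete:

```python
def bow_context_window(tokens, mention_idx, window=5):
    """Bag-of-words window around one placeholder mention (e.g. 'ent1')."""
    lo, hi = max(0, mention_idx - window), mention_idx + window + 1
    return [tok for i, tok in enumerate(tokens[lo:hi], start=lo) if i != mention_idx]

def previous_references(generated_mentions, entity_id):
    """Surface forms already produced for this placeholder, e.g. ['the cat', 'it']."""
    return generated_mentions.get(entity_id, [])

tokens = "ent0 pounced on ent1 and scratched ent1".split()
print(bow_context_window(tokens, 3))                             # context around 'ent1'
print(previous_references({"ent0": ["the cat", "it"]}, "ent0"))  # earlier mentions of ent0
```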

The diagram below shows the model structure as I understand it: the lower part of the figure is the additional input information described above, and the upper part is the original pointer-generator model.

[Figure: the original pointer-generator model (top) with the additional entity-mention inputs (bottom)]

2.2.1 word-level

We said above that a sub-word seq2seq model is used, so what is a sub-word? The idea originally comes from machine translation research, so let's first look at word-level translation models.

Here I referred to another blog post, which explains that word-level translation models often use a back-off dictionary to handle OOV words: source and target words are aligned pairwise, OOV words are marked with a special token, and when such a token appears in the translation it is replaced with the target word corresponding to the aligned source word.

But this relies on the assumption that source and target words always correspond one to one, which often fails because languages differ in their degree of morphological complexity. Secondly, a word-level translation model cannot generate words it has never seen (words outside the vocabulary); one paper handles this by directly copying unknown words into the target, but that strategy only works for name-like entity words. At the same time, to save computation time and resources, the vocabulary size is usually limited to around 30k-50k, so vocabulary slots are expensive; putting many forms of essentially the same word, such as like, liked, and liking, into the vocabulary intuitively feels wasteful.
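A toy illustration of the back-off dictionary trick described above; the alignment table and word pairs here are made up:

```python
# Toy back-off dictionary learned from word alignments (hand-written here).
backoff = {"Schmetterling": "butterfly", "Zug": "train"}

def replace_unk(target_tokens, source_tokens, alignment):
    """Replace each <unk> in the output with the dictionary entry of the aligned
    source word, falling back to copying the source word itself."""
    out = []
    for i, tok in enumerate(target_tokens):
        if tok == "<unk>":
            src_word = source_tokens[alignment[i]]
            out.append(backoff.get(src_word, src_word))
        else:
            out.append(tok)
    return out

print(replace_unk(["the", "<unk>", "flies"],
                  ["der", "Schmetterling", "fliegt"],
                  {1: 1}))
# -> ['the', 'butterfly', 'flies']
```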

2.2.2 sub-word

In actual translation, the word does not have to be the basic unit; translation can work on units smaller than a word, i.e. sub-words. For example, compound and morphologically related words (words sharing stems, prefixes, or suffixes, such as run, runner, running) can be translated by composition (combining run and er), and cognates and loanwords can be handled through sub-word-level correspondences of sound and spelling. The paper's analysis of 100 rare words (not among the 5,000 most frequent) sampled from a German dataset shows that most of them can be translated via smaller sub-word units.

So how do we split words into the right sub-words? The paper proposes using the Byte Pair Encoding (BPE) compression algorithm: first split words into characters, then merge, i.e. keep merging the most frequent bi-gram until the vocabulary reaches the target size. Merging by frequency matches common sense: suffixes such as 'ing', 'er', and 'ed' are meaningful, and 'e' and 'r' should co-occur frequently, with the same reasoning for 'ing' and 'ed'. (See the sketch below.)
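A compact version of the BPE learning loop, roughly following the published pseudo-code of the BPE paper; the toy vocabulary and the number of merges are arbitrary:

```python
import re
from collections import Counter

def get_stats(vocab):
    """Count how often each adjacent symbol pair occurs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_vocab(pair, vocab):
    """Merge every occurrence of the chosen symbol pair into a single new symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Words pre-split into characters, with </w> marking the end of a word.
vocab = {"l i k e </w>": 10, "l i k e d </w>": 6, "l i k i n g </w>": 5}
for _ in range(6):                      # the number of merges sets the subword vocab size
    stats = get_stats(vocab)
    best = max(stats, key=stats.get)    # most frequent adjacent pair
    vocab = merge_vocab(best, vocab)
print(vocab)
# Frequent pairs such as ('l','i') and ('li','k') get merged first, so units like
# 'like' emerge from raw character statistics; common suffixes behave the same way.
```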

Machine translation models usually use attention mechanisms. In a word-level model, attention can only be computed at the word level; we would like the model to learn, at every step, where to place its attention over sub-words, which is clearly more meaningful and more efficient.

2.2.3 character-level

At the character level, the seq2seq model's output is obtained from a probability distribution over all characters. A dictionary is usually used to map between the vocabulary or characters and their indices.
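A trivial sketch of that character-to-index dictionary:

```python
text = "the cat sat"

# Character-level vocabulary: every distinct character gets an integer index.
char2idx = {c: i for i, c in enumerate(sorted(set(text)))}
idx2char = {i: c for c, i in char2idx.items()}

encoded = [char2idx[c] for c in text]            # what a char-level seq2seq model consumes
decoded = "".join(idx2char[i] for i in encoded)  # mapping the indices back
assert decoded == text
```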

Baselines

The authors compare this model with the 2018 fusion model (which may in fact also be by the same authors), and run several other baselines:

  • Summarization: a new baseline in which a summary is generated from the prompt, and the story is then generated from the summary.
  • Keyword extraction: keywords are generated from the prompt, and the story is then built from the keywords.
  • Sentence compression: compressed sentences are generated from the prompt, and the story is then generated from the compressed sentences.

Third, the evaluation methods

3.1 automatic evaluation methods

Let's first look at how the problem is decomposed. The story $x$ is converted into a more abstract representation $z$, so the objective function becomes:

$$\mathcal{L} = -\log \sum_{z} p(x | z)\, p(z)$$

But marginalizing over $z$ is hard, especially once all the entities are represented by placeholders. So the authors construct a deterministic posterior and optimize an upper bound on the loss function.

$$\begin{aligned} z^{*} &= \arg\max_{z} p(z | x) \\ \mathcal{L} &\leq -\log p\left(x | z^{*}\right) - \log p\left(z^{*}\right) \end{aligned}$$

This approach lets $p(z^{*})$ and $p(x | z^{*})$ be modeled separately, which makes training easier.
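The bound holds simply because the marginal sum contains the $z^{*}$ term (my own restatement of the step):

$$p(x) = \sum_{z} p(x | z)\, p(z) \;\geq\; p(x | z^{*})\, p(z^{*}) \quad\Longrightarrow\quad \mathcal{L} = -\log p(x) \;\leq\; -\log p\left(x | z^{*}\right) - \log p\left(z^{*}\right)$$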

The comparison results are as follows:

It can be seen that the model generating the SRL structure has the lowest log loss, indicating that it is easier to learn than generating summaries, keywords, or compressed sentences, presumably because it takes advantage of the structured format of the data.

How do the generated stories compare with the original ones? The Longest Common Subsequence (LCS) is used as the main measure, specifically the maximum LCS and the average LCS. The higher the LCS value, the more the model copies sentences from the original and the less it actually generates. The figure below shows that this model has the stronger generation ability.
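For reference, a word-level LCS can be computed with the standard dynamic program; the exact granularity and comparison texts used in the paper may differ:

```python
def lcs_length(a, b):
    """Length of the longest common subsequence between two token lists (standard DP)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

generated = "the cat scratched my head".split()
original  = "the black cat scratched his head badly".split()
print(lcs_length(generated, original))   # 4 here; higher values suggest more copying
```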

3.2 manual evaluation

Here the 2018 fusion model (upper part of the figure) and this model (lower part of the figure) are compared on stories generated from the same prompt. Human annotators judge which story is better (they are shown only the stories, not the prompt).
