NeuralTemplateGen: an overview of the code's functionality

Let me walk through what this model does.

Brief introduction

Encoder-decoder based NLG models have become the mainstream for text generation, but they have drawbacks such as: (1) they are not interpretable, and (2) content selection is hard to control.

This paper improves the decoder by using a Hidden Semi-Markov Model (HSMM) in its place; the templates the model learns make generation both controllable and interpretable.

The model automatically performs content selection and ordering, template generation, and template slot filling, finally producing a complete utterance.

Below I introduce the code's functionality in four parts.

First, the data and its preparation

1.1 The open-source E2E dataset

E2E is one of the largest open-source datasets in the restaurant domain. It is commonly used for NLG; the task is to turn a meaning representation into a sentence.

  • mr: the (textual) meaning representation, i.e. the "meaning" to express, given as attribute names and attribute values
  • ref: a readable reference sentence to be generated

An mr example:

name[The Vaults], 
eatType[pub], 
priceRange[more than £30], 
customer rating[5 out of 5], 
near[Café Adriatic]

The corresponding reference sentences:

Near Café Adriatic is a five star rated, high priced pub called The Vaults.
The Vaults is a 5 stars pub with middle prices in Café Adriatic.
The Vaults Pub is close to Café Adriatic, it is five star rated and it has high prices
The Vaults is near Café Adriatic, it's a pub that ranges more than 30 and customers rate it 5 out of 5.
The Vaults is a five star, expensive public house situated close to Café Adriatic
There is an expensive, five-star pub called The Vaults located near Café Adriatic.
The Vaults is a local pub with a 5 star rating and prices starting at £30. You can find it near Café Adriatic.
The Vaults is a pub with menu items more than £30 and a customer rating of 5 out of 5. The Vaults is located near Café Adriatic.
The Vaults with a amazing 5 out of 5 customer rating, is a pub near the Café Adriatic.  Menu price are more than £30 per item.
Rated 5 star by diners, The Vaults offers Pub fair near Café Adriatic.
The Vaults in  Café Adriatic is a great 5 stars pub with middle prices.
The Vaults costs more than 30 pounds and has a 5 out of 5 rating.  It is a pub located near Café Adriatic.
The Vaults is a pub that costs more than 30 pounds and has a 5 out of 5 rating.  It is located near Café Adriatic.
The pub Café Adriatic ranges more than 30 and is rated 5 out of 5 its near The Vaults.
The Vaults is an expensive, five-star pub located near Café Adriatic.

Notes

1. The generated sentences only describe facts: for example, a family-friendly restaurant is described as such, while one that is not simply gets no such description.
2. Different references for the same sample may disagree on some dimensions, such as price: one annotator considered it moderate, another considered it expensive.
3. Each dimension takes only one value per sample: a restaurant may serve several cuisines, but the cuisine attribute has a single value in each sample (e.g. Japanese, Italian, or Chinese).

1.2 Use of Data

  1. Source and target
  • src_train: the data side, attributes and attribute values; 42,061 lines in total, 4,862 after deduplication
  • tgt_train: the sentence side, the sentences to be generated; 42,061 lines in total, 40,785 after deduplication
  2. Fields in the source
['customerrating', 'name', 'area', 'food', 'near', 'priceRange', 'eatType']
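As an illustration, the mr format shown earlier can be parsed into attribute-value pairs with a short regex. This is my own sketch (`parse_mr` is not a function from the repo):

```python
import re

def parse_mr(mr):
    """Parse an E2E meaning representation such as
    'name[The Vaults], eatType[pub]' into a dict of attribute -> value."""
    return {m.group(1).strip(): m.group(2)
            for m in re.finditer(r"([^,\[\]]+)\[([^\]]*)\]", mr)}

mr = "name[The Vaults], eatType[pub], priceRange[more than £30]"
print(parse_mr(mr))
# {'name': 'The Vaults', 'eatType': 'pub', 'priceRange': 'more than £30'}
```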

1.3 Data Preparation

The data path data/sub_path/ currently contains the files src_train.txt, tgt_train.txt, train.txt, src_test.txt, tgt_test.txt, test.txt, src_valid.txt, tgt_valid.txt, and valid.txt.

The src_*.txt files contain structured data; the tgt_*.txt files contain the readable text.

But training the model requires the train.txt and valid.txt files. How do we get them?

You can run:

python data/make_e2e_labedata.py train > train.txt
python data/make_e2e_labedata.py valid > valid.txt

This step tags the spans of the generated text whose tokens exactly match slot values in the original data.
For example:

raw data:

name: The Vaults
eatType: pub
priceRange: more than £ 30
customerrating: 5 out of 5
near: Café Adriatic 

Generated text:

The Vaults pub near Café Adriatic has a 5 star rating . Prices start at £ 30 .

You can see that only the attribute values of name, eatType, and near appear in the generated text, so we mark each occurrence's position together with the attribute's id; punctuation marks are also labeled (with the 'unknow' id in the example below).

The labeling result for this example is as follows:

[(0, 2, idx('name')), (2, 3, idx('eatType')), (4, 6, idx('near')), (11, 12, idx('unknow')), (17, 18, idx('unknow'))]

This helps the model learn the relationship between slot positions and slot types in the templates.
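The labeling above can be sketched as matching each attribute value verbatim against the token sequence. This is my own simplification of what make_e2e_labedata.py does (`label_spans` and the punctuation handling are assumptions inferred from the example, using attribute names in place of the idx() ids):

```python
def label_spans(tokens, attrs, punct_label="unknow"):
    """Return (start, end, label) spans: one for every attribute value that
    appears verbatim in the tokens, plus one per punctuation token."""
    spans = []
    for name, value in attrs.items():
        val_toks = value.split()
        n = len(val_toks)
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == val_toks:
                spans.append((i, i + n, name))
    for i, tok in enumerate(tokens):
        if tok in {".", ",", "!", "?"}:  # punctuation gets its own label
            spans.append((i, i + 1, punct_label))
    return sorted(spans)

tokens = "The Vaults pub near Café Adriatic has a 5 star rating . Prices start at £ 30 .".split()
attrs = {"name": "The Vaults", "eatType": "pub", "near": "Café Adriatic"}
print(label_spans(tokens, attrs))
# [(0, 2, 'name'), (2, 3, 'eatType'), (4, 6, 'near'), (11, 12, 'unknow'), (17, 18, 'unknow')]
```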

Second, the training process

python chsmm.py \
    -data data/labee2e/ \
    -emb_size 300 \
    -hid_size 300 \
    -layers 1 \
    -K 55 \
    -L 4 \
    -log_interval 200 \
    -thresh 9 \
    -emb_drop \
    -bsz 15 \
    -max_seqlen 55 \
    -lr 0.5 \
    -sep_attn \
    -max_pool \
    -unif_lenps \
    -one_rnn \
    -Kmul 5 \
    -mlpinp \
    -onmt_decay \
    -cuda \
    -seed 1818 \
    -save models/chsmm-e2e-300-55-5.pt

The main steps are as follows:

  • Shuffle the samples
  • make_combo_targs: merges the words and their copy information into a single Tensor
  • make_masks: the generated text may contain new (out-of-vocabulary) words, so a mask is built to handle them later; there are two variants, one that removes the word directly and one that averages
  • get_uniq_fields: pads each batch's fields to the maximum length
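To illustrate the last step, padding every sample's field list to a common batch length might look like this. This is a simplified sketch of the idea, not the repo's actual get_uniq_fields (`pad_fields` and the `<pad>` token are my own names):

```python
def pad_fields(batch_fields, pad="<pad>"):
    """Pad every sample's field list in a batch to the batch's maximum length."""
    max_len = max(len(fields) for fields in batch_fields)
    return [fields + [pad] * (max_len - len(fields)) for fields in batch_fields]

batch = [["name", "eatType"], ["name", "near", "priceRange"]]
print(pad_fields(batch))
# [['name', 'eatType', '<pad>'], ['name', 'near', 'priceRange']]
```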

Third, template extraction

This step extracts the templates: for the example above, it produces slotted templates of the kind discussed earlier.

The extracted templates are saved under the segs/ path. The template generation command is as follows:

Using the non-autoregressive model:

python chsmm.py -data data/sub_path/ -emb_size 300 -hid_size 300 -layers 1 -K 55 -L 4 -log_interval 200 -thresh 9 -emb_drop -bsz 16 -max_seqlen 55 -lr 0.5  -sep_attn -max_pool -unif_lenps -one_rnn -Kmul 5 -mlpinp -onmt_decay -cuda -load models/e2e-55-5.pt -label_train | tee segs/seg-e2e-300-55-5.txt

Fourth, text generation

Once you have the model and the templates, you can generate text. (This is really a matter of selecting a template and filling its slots.)

Generation with the autoregressive model:

python chsmm.py -data data/sub_path/ -emb_size 300 -hid_size 300 -layers 1 -dropout 0.3 -K 60 -L 4 -log_interval 100 -thresh 9 -lr 0.5 -sep_attn -unif_lenps -emb_drop -mlpinp -onmt_decay -one_rnn -max_pool -gen_from_fi data/labee2e/src_uniq_valid.txt -load models/e2e-60-1-far.pt -tagged_fi segs/seg-e2e-60-1-far.txt -beamsz 5 -ntemplates 100 -gen_wts '1,1' -cuda -min_gen_tokes 0 > gens/gen-e2e-60-1-far.txt

- gen_from_fi: the structured data file to generate from
- tagged_fi: the templates we extracted
- load: the model we trained
- gens/: the path where the generated text is saved

Note the format of the generated results: <text> ||| <template fragments>. If you only need the text, keep just the part before the ||| separator.
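Stripping the template annotation can be done by splitting each output line on the ||| separator. A small post-processing sketch (`keep_text` is my own helper name, and the template fragment in the example is a placeholder):

```python
def keep_text(line):
    """Keep only the generated text before the '|||' template annotation."""
    return line.split("|||", 1)[0].strip()

line = "The Vaults is a pub near Café Adriatic . ||| <template fragments>"
print(keep_text(line))
# The Vaults is a pub near Café Adriatic .
```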



Origin blog.csdn.net/u012328476/article/details/103820529