GitHub code with my personal annotations is attached.

Let me first talk about what this model does.
Brief introduction

Encoder-decoder based NLG (text generation) models have become mainstream, but they still have drawbacks: (1) they are not interpretable, and (2) it is hard to control which content gets selected. This paper improves the decoder by using a Hidden Semi-Markov Model (HSMM) as the decoder; the model learns templates, which makes it both controllable and interpretable.

The model automatically performs several steps: selecting and ordering the talking points, generating text templates, and filling the template slots, finally producing a complete utterance.

The code is introduced below from four aspects.
1. Data preparation

1.1 The open-source E2E dataset

E2E is one of the largest open-source datasets in the restaurant domain and is commonly used for NLG. The task is to turn a meaning representation into a sentence.
- mr: (textual meaning representation) the "meaning" to express, given as attribute names and attribute values
- ref: a human-readable reference sentence generated from the mr
An example mr:
name[The Vaults],
eatType[pub],
priceRange[more than £30],
customer rating[5 out of 5],
near[Café Adriatic]
The corresponding generated sentences:
Near Café Adriatic is a five star rated, high priced pub called The Vaults.
The Vaults is a 5 stars pub with middle prices in Café Adriatic.
The Vaults Pub is close to Café Adriatic, it is five star rated and it has high prices
The Vaults is near Café Adriatic, it's a pub that ranges more than 30 and customers rate it 5 out of 5.
The Vaults is a five star, expensive public house situated close to Café Adriatic
There is an expensive, five-star pub called The Vaults located near Café Adriatic.
The Vaults is a local pub with a 5 star rating and prices starting at £30. You can find it near Café Adriatic.
The Vaults is a pub with menu items more than £30 and a customer rating of 5 out of 5. The Vaults is located near Café Adriatic.
The Vaults with a amazing 5 out of 5 customer rating, is a pub near the Café Adriatic. Menu price are more than £30 per item.
Rated 5 star by diners, The Vaults offers Pub fair near Café Adriatic.
The Vaults in Café Adriatic is a great 5 stars pub with middle prices.
The Vaults costs more than 30 pounds and has a 5 out of 5 rating. It is a pub located near Café Adriatic.
The Vaults is a pub that costs more than 30 pounds and has a 5 out of 5 rating. It is located near Café Adriatic.
The pub Café Adriatic ranges more than 30 and is rated 5 out of 5 its near The Vaults.
The Vaults is an expensive, five-star pub located near Café Adriatic.
Notes
1. The generated sentences describe only the given facts; for example, a restaurant may be described as family-friendly, but attributes not present in the mr are not described.
2. Different references for the same sample may describe a dimension differently; for food prices, say, one reference considers them moderate while another considers them expensive.
3. Each sample has exactly one value per dimension; for example, a real restaurant may serve several cuisines, but the cuisine attribute of each sample holds a single value (Japanese, Italian, or Chinese).
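To make the mr format concrete, here is a minimal sketch (my own illustration, not part of the paper's code; `parse_mr` is a made-up name) that parses an mr string into an attribute dictionary:

```python
import re

def parse_mr(mr: str) -> dict:
    """Parse an E2E mr string like "name[The Vaults], eatType[pub]"
    into {attribute: value}, matching each "attr[value]" group."""
    return {m.group(1).strip(): m.group(2)
            for m in re.finditer(r"([\w ]+)\[([^\]]*)\]", mr)}

mr = ("name[The Vaults], eatType[pub], priceRange[more than £30], "
      "customer rating[5 out of 5], near[Café Adriatic]")
print(parse_mr(mr))
# → {'name': 'The Vaults', 'eatType': 'pub', 'priceRange': 'more than £30',
#    'customer rating': '5 out of 5', 'near': 'Café Adriatic'}
```

Note that attribute names may contain spaces ("customer rating"), which is why the pattern allows them.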
1.2 Data used
- source and target
  - src_train: structured data (attributes and attribute values); 42061 lines in total, 4862 after deduplication
  - tgt_train: the target sentences; 42061 lines in total, 40785 after deduplication
- fields in source:
['customerrating', 'name', 'area', 'food', 'near', 'priceRange', 'eatType']
1.3 Data Preparation
Currently data path data/sub_path
is included below src_train.txt
, tgt_train.txt
, train.txt
, src_test.txt
, tgt_test.txt
, test.txt
, src_valid.txt
, tgt_valid.txt
and valid.txt
documents,
src_*.txt
Structured data files, tgt_*.txt
the file is readable text.
But the model training is required train.txt
and valid.txt
files. It is how to get it?
You can perform
python data/make_e2e_labedata.py train > train.txt
python data/make_e2e_labedata.py valid > valid.txt
This process is a tag and original data exactly the same as those of the slot value generated from the text.
For example:
raw data:
name: The Vaults
eatType: pub
priceRange: more than £ 30
customerrating: 5 out of 5
near: Café Adriatic
Generated text:
The Vaults pub near Café Adriatic has a 5 star rating . Prices start at £ 30 .
As you can see, only the attribute values of name, eatType and near appear in the generated text, so we tag their positions with the corresponding attribute ids; punctuation is tagged with the shared "unknow" id.

The tagging result for this example is as follows:
[(0, 2, idx('name')), (2, 3, idx('eatType')), (4, 6, idx('near')), (11, 12, idx('unknow')), (17, 18, idx('unknow'))]
This helps the model learn the relationship between positions, types and templates.
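The tagging above can be sketched as follows; `tag_spans` and the `idx` mapping are illustrative names for this post, not the repo's actual functions:

```python
def tag_spans(fields, tokens, idx):
    """Record (start, end, attribute-id) for every attribute value that
    appears verbatim in the token sequence; punctuation tokens get the
    shared 'unknow' id."""
    spans = []
    for attr, value in fields.items():
        vtoks = value.split()
        for i in range(len(tokens) - len(vtoks) + 1):
            if tokens[i:i + len(vtoks)] == vtoks:
                spans.append((i, i + len(vtoks), idx[attr]))
                break  # tag only the first occurrence
    for i, tok in enumerate(tokens):
        if tok in {".", ","}:
            spans.append((i, i + 1, idx["unknow"]))
    return sorted(spans)

tokens = ("The Vaults pub near Café Adriatic has a 5 star rating . "
          "Prices start at £ 30 .").split()
fields = {"name": "The Vaults", "eatType": "pub", "near": "Café Adriatic"}
idx = {"name": 0, "eatType": 1, "near": 2, "unknow": 3}
print(tag_spans(fields, tokens, idx))
# → [(0, 2, 0), (2, 3, 1), (4, 6, 2), (11, 12, 3), (17, 18, 3)]
```

The printed spans match the tagging result shown above, with ids substituted for the idx(...) lookups.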
2. The training process
python chsmm.py \
-data data/labee2e/ \
-emb_size 300 \
-hid_size 300 \
-layers 1 \
-K 55 \
-L 4 \
-log_interval 200 \
-thresh 9 \
-emb_drop \
-bsz 15 \
-max_seqlen 55 \
-lr 0.5 \
-sep_attn \
-max_pool \
-unif_lenps \
-one_rnn \
-Kmul 5 \
-mlpinp \
-onmt_decay \
-cuda \
-seed 1818 \
-save models/chsmm-e2e-300-55-5.pt
The main process is as follows:
- shuffle the samples
- make_combo_targs: merge the words and the copy information into a single Tensor
- make_masks: the generated text may contain new (out-of-vocabulary) words, which are masked here and handled later; there are two operations, one removes the word directly and the other averages
- get_uniq_fields: pad the fields of each batch to the maximum length
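As a rough illustration of the last step (the repo's get_uniq_fields does more than this; `pad_fields` is a hypothetical name showing only the pad-to-max-length idea):

```python
def pad_fields(batch_fields, pad_token="<pad>"):
    """Pad each example's field list to the longest length in the batch,
    so the batch can be stacked into one rectangular tensor."""
    max_len = max(len(f) for f in batch_fields)
    return [f + [pad_token] * (max_len - len(f)) for f in batch_fields]

batch = [["name", "eatType"], ["name", "near", "priceRange"]]
print(pad_fields(batch))
# → [['name', 'eatType', '<pad>'], ['name', 'near', 'priceRange']]
```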
3. Template extraction

This step produces the templates, i.e. templates with slots as in the example above. The extracted templates are saved under the segs/ path. Templates are generated as follows, using a non-autoregressive method:
python chsmm.py -data data/sub_path/ -emb_size 300 -hid_size 300 -layers 1 -K 55 -L 4 -log_interval 200 -thresh 9 -emb_drop -bsz 16 -max_seqlen 55 -lr 0.5 -sep_attn -max_pool -unif_lenps -one_rnn -Kmul 5 -mlpinp -onmt_decay -cuda -load models/e2e-55-5.pt -label_train | tee segs/seg-e2e-300-55-5.txt
4. Text generation

Once you have the model and the templates, you can generate text. (This is really a process of selecting a template and filling its slots.)

Autoregressive generation:
python chsmm.py -data data/sub_path/ -emb_size 300 -hid_size 300 -layers 1 -dropout 0.3 -K 60 -L 4 -log_interval 100 -thresh 9 -lr 0.5 -sep_attn -unif_lenps -emb_drop -mlpinp -onmt_decay -one_rnn -max_pool -gen_from_fi data/labee2e/src_uniq_valid.txt -load models/e2e-60-1-far.pt -tagged_fi segs/seg-e2e-60-1-far.txt -beamsz 5 -ntemplates 100 -gen_wts '1,1' -cuda -min_gen_tokes 0 > gens/gen-e2e-60-1-far.txt
- gen_from_fi: followed by the structured-data file
- tagged_fi: followed by the templates we extracted
- load: the model we trained
- gens/: the path where the generated text is kept

Note the format of the generated results: <text> ||| <template segments>. If you only need the text, keep only the part before |||.
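A small post-processing sketch for this output format (the template-segment string in the example is made up; only the ||| separator comes from the format described above):

```python
def strip_templates(line: str) -> str:
    """Keep only the generated text, dropping the template segments
    after the '|||' separator."""
    return line.split("|||", 1)[0].strip()

line = "The Vaults is a pub near Café Adriatic ||| 12|name 3|eatType"
print(strip_templates(line))
# → The Vaults is a pub near Café Adriatic
```

Lines without a ||| separator are returned unchanged (apart from surrounding whitespace).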