Building Chinese Paraphrase Training Data for the Pointer-Generator Text Summarization Network

First, let me recommend a paper: https://arxiv.org/abs/1704.04368 ("Get To The Point: Summarization with Pointer-Generator Networks", from Stanford and Google Brain). It introduces a text summarization network; I won't describe the details here. For a Chinese-language introduction, see this Zhihu article: https://zhuanlan.zhihu.com/p/27272224; for everything else, please read the paper itself.

Here is the open-source pointer-generator project: https://github.com/abisee/pointer-generator. We want to use it for Chinese paraphrase generation, so let's first look at how it handles English text summarization.

The GitHub page provides the model's output on the test set; let's take the first example and see how it does.

This is the original article:

washington ( cnn ) president barack obama says he is `` absolutely committed to making sure '' israel maintains a military advantage over iran . his comments to the new york times , published on sunday , come amid criticism from israeli prime minister benjamin netanyahu of the deal that the united states and five other world powers struck with iran . tehran agreed to halt the country 's nuclear ambitions , and in exchange , western powers would drop sanctions that have hurt the iran 's economy . obama said he understands and respects netanyahu 's stance that israel is particularly vulnerable and does n't `` have the luxury of testing these propositions '' in the deal . `` but what i would say to them is that not only am i absolutely committed to making sure they maintain their qualitative military edge , and that they can deter any potential future attacks , but what i 'm willing to do is to make the kinds of commitments that would give everybody in the neighborhood , including iran , a clarity that if israel were to be attacked by any state , that we would stand by them , '' obama said . that , he said , should be `` sufficient to take advantage of this once-in-a-lifetime opportunity to see whether or not we can at least take the nuclear issue off the table , '' he said . the framework negotiators announced last week would see iran reduce its centrifuges from 19,000 to 5,060 , limit the extent to which uranium necessary for nuclear weapons can be enriched and increase inspections . the talks over a final draft are scheduled to continue until june 30 . but netanyahu and republican critics in congress have complained that iran wo n't have to shut down its nuclear facilities and that the country 's leadership is n't trustworthy enough for the inspections to be as valuable as obama says they are . obama said even if iran ca n't be trusted , there 's still a case to be made for the deal . `` in fact , you could argue that if they are implacably opposed to us , all the more reason for us to want to have a deal in which we know what they 're doing and that , for a long period of time , we can prevent them from having a nuclear weapon , '' obama said .
This is the reference summary provided for the article:

1. in an interview with the new york times , president obama says he understands israel feels particularly vulnerable .
2. obama calls the nuclear deal with iran a `` once-in-a-lifetime opportunity '' .
3. israeli prime minister benjamin netanyahu and many u.s. republicans warn that iran can not be trusted .

And here is the output of the pointer-generator model:

1. president barack obama says he is `` absolutely committed to making sure '' israel maintains a military advantage over iran .
2. obama said he understands and respects netanyahu 's stance that israel is particularly vulnerable and does n't `` have the luxury of testing these propositions '' .

As you can see, the results are quite good.
The GitHub repo also provides a pretrained model and the English training data, so you can train the model yourself. So next, let's look at how the training data is processed.

The code for preprocessing the English data lives at https://github.com/abisee/cnn-dailymail, which also provides the already-processed data. Download the CNN Stories dataset and open one of the files: the structure is simple. The article comes first, and the reference summary makes up the last part of the file, one sentence per @highlight marker. The reference summary looks like the following (the article itself is too long to reproduce here); a small parsing sketch follows the example.

@highlight

A new tour in Taipei, Taiwan, allows tourists to do four-hour ride-alongs in local taxis

@highlight

Tourists go wherever local fares hire the cabs to go

@highlight

The appeal is going to unexpected locations and meeting chatty locals

@highlight

One English tourist was invited to a Taiwanese family dinner by a passenger in his taxi
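
To make the format concrete, here is a minimal sketch of how a .story file splits into its article and highlight sentences. This is an illustration only, not the repo's actual make_datafiles.py (whose preprocessing also tokenizes and lowercases the text); the file name is hypothetical:

# Minimal sketch: split a CNN/DailyMail .story file into the article text
# and the list of @highlight summary sentences.
def split_story(path):
  article_lines, highlights = [], []
  next_is_highlight = False
  with open(path, "r") as f:
    for line in f:
      line = line.strip()
      if not line:
        continue  # skip blank separator lines
      if line == "@highlight":
        next_is_highlight = True  # the next non-empty line is a summary sentence
      elif next_is_highlight:
        highlights.append(line)
        next_is_highlight = False
      else:
        article_lines.append(line)
  return " ".join(article_lines), highlights

article, highlights = split_story("example.story")  # hypothetical file name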

Training, then, consists of feeding in the article and its reference summary and letting the model learn to produce that summary on its own. Given a Chinese parallel corpus, using the same setup for Chinese paraphrase generation is simple. My corpus looks like this, where the two lines form a paraphrase pair meaning roughly "this is not surprising" / "nothing surprising about it" (the text must be word-segmented first, e.g. with HIT's LTP; a segmentation sketch follows the example):

这 并 不 奇怪 
没什么 奇怪 的 
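
For the segmentation step, here is a minimal sketch. It uses the jieba library purely because its API is compact; the post suggests HIT's LTP, which works just as well, and the input file name here is hypothetical:

import jieba  # example segmenter; HIT's LTP is another option

# Minimal sketch: turn a raw corpus (one sentence per line) into the
# space-separated token format shown above.
with open("raw.txt", "r") as fin, open("./train/train.txt", "w") as fout:
  for line in fin:
    tokens = jieba.cut(line.strip())  # generator of word tokens
    fout.write(" ".join(tokens) + "\n")
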
All we need to do is feed the first sentence of each pair in as the article and the second sentence as the reference summary. However, we cannot process our corpus with the make_datafiles.py from the link above as-is. Why not? The difference is that the English corpus keeps one article and its reference summary per .story file, while our single text file holds many parallel pairs. Only minor changes to the code are needed; the key point is not to dump all the first sentences into one article and all the second sentences into one abstract, but to store each pair as its own example. Here is my code:

# -*- coding: utf-8 -*-
import os
import struct
import collections
from tensorflow.core.example import example_pb2


# We use these to separate the summary sentences in the .bin datafiles
SENTENCE_START = '<s>'
SENTENCE_END = '</s>'

train_file = "./train/train.txt"
val_file = "./val/val.txt"
test_file = "./test/test.txt"
finished_files_dir = "./finished_files"

VOCAB_SIZE = 200000


def read_text_file(text_file):
  lines = []
  with open(text_file, "r") as f:
    for line in f:
      lines.append(line.strip())
  return lines


def write_to_bin(input_file, out_file, makevocab=False):
  if makevocab:
    vocab_counter = collections.Counter()

  with open(out_file, 'wb') as writer:
    # Read the input text file; even-numbered lines become articles and
    # odd-numbered lines become abstracts (line numbers starting at 0).
    lines = read_text_file(input_file)
    for i, line in enumerate(lines):
      if i % 2 == 0:
        article = line
      else:
        abstract = "%s %s %s" % (SENTENCE_START, line, SENTENCE_END)

        # Write the pair to a tf.Example
        tf_example = example_pb2.Example()
        tf_example.features.feature['article'].bytes_list.value.extend([article.encode('utf-8')])
        tf_example.features.feature['abstract'].bytes_list.value.extend([abstract.encode('utf-8')])
        tf_example_str = tf_example.SerializeToString()
        str_len = len(tf_example_str)
        # Each record is an 8-byte length prefix followed by the serialized example
        writer.write(struct.pack('q', str_len))
        writer.write(struct.pack('%ds' % str_len, tf_example_str))

        # Count tokens for the vocab file, if applicable
        if makevocab:
          art_tokens = article.split(' ')
          abs_tokens = abstract.split(' ')
          abs_tokens = [t for t in abs_tokens if t not in [SENTENCE_START, SENTENCE_END]]  # remove these tags from vocab
          tokens = art_tokens + abs_tokens
          tokens = [t.strip() for t in tokens]  # strip whitespace
          tokens = [t for t in tokens if t != ""]  # remove empty tokens
          vocab_counter.update(tokens)

  print("Finished writing file %s\n" % out_file)

  # write vocab to file
  if makevocab:
    print("Writing vocab file...")
    with open(os.path.join(finished_files_dir, "vocab"), 'w') as writer:
      for word, count in vocab_counter.most_common(VOCAB_SIZE):
        writer.write(word + ' ' + str(count) + '\n')
    print("Finished writing vocab file")


if __name__ == '__main__':

  if not os.path.exists(finished_files_dir): os.makedirs(finished_files_dir)

  # Read the text files, do a little preprocessing, then write to bin files
  write_to_bin(test_file, os.path.join(finished_files_dir, "test.bin"))
  write_to_bin(val_file, os.path.join(finished_files_dir, "val.bin"))
  write_to_bin(train_file, os.path.join(finished_files_dir, "train.bin"), makevocab=True)
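
To sanity-check the output, here is a small sketch that reads the first few records back out of a .bin file; each record is an 8-byte length prefix followed by a serialized tf.Example, exactly as written above:

import struct
from tensorflow.core.example import example_pb2

# Read the first few length-prefixed tf.Example records back out of train.bin
with open("./finished_files/train.bin", "rb") as f:
  for _ in range(3):
    len_bytes = f.read(8)  # struct format 'q' is an 8-byte integer
    if not len_bytes:
      break  # end of file
    str_len = struct.unpack('q', len_bytes)[0]
    example = example_pb2.Example.FromString(f.read(str_len))
    print(example.features.feature['article'].bytes_list.value[0].decode('utf-8'))
    print(example.features.feature['abstract'].bytes_list.value[0].decode('utf-8'))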

Below is my project layout (the paths match the constants in the script above):

./train/train.txt
./val/val.txt
./test/test.txt
./finished_files/    (created by the script; train.bin, val.bin, test.bin and vocab are written here)

Put your files in the corresponding paths, and the generated .bin files will appear under finished_files; they can then be used to train pointer-generator. The model's output will be the Chinese paraphrases we want.
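
For training, the pointer-generator README gives a command along the following lines; the paths and experiment name below are placeholders you should adapt:

python run_summarization.py --mode=train --data_path=./finished_files/train.bin --vocab_path=./finished_files/vocab --log_root=./log --exp_name=chinese_paraphrase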

That's all for this post. There isn't much to it, really: a recommendation of pointer-generator and a small idea for applying a text summarization tool to Chinese paraphrase generation. I'm new to natural language processing and still have a lot to learn, so feel free to get in touch.


Reposted from blog.csdn.net/hfutdog/article/details/78447860