Automatic summary generation of Chinese news text based on BERT-PGN model - text summary generation (paper reading)

Automatic summary generation of Chinese news text based on BERT-PGN model (2020.07.08)

Summary:

  • To address the problems that abstractive summarization models in automatic text summarization tasks do not fully understand sentence context and generate repetitive content, a generative summarization model for Chinese news text based on BERT and the pointer generator network (PGN) is proposed: the BERT-Pointer Generator Network (BERT-PGN). First, the BERT pre-trained language model, combined with multi-dimensional semantic features, is used to obtain word vectors and a finer-grained representation of the text context. Then, the PGN model forms the summary by selecting words either from the vocabulary or from the original text. Finally, a coverage mechanism is applied to reduce repeated content and obtain the final summary. Experimental results on the single-document Chinese news summarization evaluation data set of the 2017 CCF International Conference on Natural Language Processing and Chinese Computing (NLPCC2017) show that, compared with models such as PGN and the long short-term memory network with attention mechanism (LSTM-attention), the BERT-PGN model combined with multi-dimensional semantic features understands the source text more fully, generates richer summary content, and effectively reduces repeated and redundant content; the Rouge-2 and Rouge-4 indicators improve by 1.5% and 1.2% respectively.

0 Preface

  • With the rapid development of the Internet industry in recent years, a large number of news websites and mobile news applications have appeared in daily life, and more and more users obtain the latest information through them. According to the 42nd Statistical Report on Internet Development in China published by the China Internet Network Information Center (CNNIC), by June 2018 the number of mobile Internet users in China had reached 788 million, and the proportion of netizens accessing the Internet via mobile phones had reached 98.3%. As the number of netizens increases, the usage of online news media platforms keeps rising, and netizens use news media such as Toutiao more and more frequently. To adapt to today's fast-paced life, netizens want to read as few words as possible while still obtaining the key content of news articles. Automatic text summarization technology can summarize the main content of news, save reading time and improve the efficiency of information use. Therefore, the news-oriented automatic text summarization model proposed in this article is of practical significance.
  • Scholars at home and abroad have done a great deal of research on automatic text summarization. Automatic text summarization is a computer-based summarization technology that emerged in the 1950s to help people free themselves from the ocean of information and improve the efficiency of information use [2]. Since the National Institute of Standards and Technology launched the Document Understanding Conference in 2001, research on automatic text summarization has received increasing attention [3].
  • Inspired by the literature [4], and to address the problem that netizens spend a great deal of time reading and understanding news, this paper proposes an automatic summarization model for Chinese news text based on BERT (Bidirectional Encoder Representations from Transformers) and the Pointer Generator Network (PGN): the BERT-Pointer Generator Network (BERT-PGN), which can effectively save time and improve the efficiency of information use. The model first uses the BERT pre-trained language model to obtain word vectors of the news text and combines multi-dimensional semantic features to score the sentences in which the words appear; the results are fed into the pointer generator network as the input sequence for training, yielding the news summary. The main contributions of this article are as follows.
  • 1) A model for automatic summarization of news texts, BERT-PGN, is proposed and implemented in two stages: a word vector acquisition stage based on the pre-trained model and multi-dimensional semantic features, and a sentence generation stage based on the pointer generator network model.
  • 2) Experimental results show that the model achieves good results on the single-document Chinese news summarization evaluation data set of the 2017 CCF International Conference on Natural Language Processing and Chinese Computing (NLPCC2017), with the Rouge-2 and Rouge-4 indicators increasing by 1.5% and 1.2% respectively.

1 Related research

  • There are two mainstream approaches to automatic text summarization: extractive summarization and generative (abstractive) summarization [5]. In research on semantic mining of text, many classic classification and clustering algorithms have been proposed [6]. The earliest summarization work mainly used statistical techniques based on word frequency and sentence position [7]. In 1958, Luhn [8] proposed the first automatic text summarization system. In the past decade or so, with the rapid development of machine learning (ML) and natural language processing (NLP), many accurate and efficient text summarization algorithms have been proposed [9]. The rapid growth of the Internet as a commercial medium has left users absorbing too much information; to relieve this information overload, automatic text summarization plays a key role. It can not only filter out a large amount of interfering text, but also help users obtain key information more quickly and adapt to today's fast-paced life [10].

  • The extractive summarization method divides an article into small units and then extracts some of them as the summary of the article. Liu et al. [11] proposed an adversarial process for extractive text summarization, using a Generative Adversarial Network (GAN) model to obtain a competitive Rouge score; this method can generate more abstract, readable and diverse summaries. AlSabahi et al. [12] used the Hierarchical Structured Self-Attentive Model (HSSAM) to reflect the hierarchical structure of the document, obtaining better feature representations and alleviating problems such as excessive memory usage and inadequate modeling. Slamet et al. [13] proposed a Vector Space Model (VSM) approach, using VSM for word-similarity tests to evaluate and compare the results of automatic text summarization. Alguliyev et al. [14] found that, compared with traditional automatic summarization methods, research based on clustering, optimization and evolutionary algorithms has recently shown good results. However, extractive summarization does not consider the discourse structure of the text, lacks understanding of the keywords and words in the text, and the generated summaries have poor readability and coherence.
  • The generative (abstractive) summarization method uses more advanced natural language processing algorithms to paraphrase and rewrite the sentences of the article, generating a summary without being limited to existing sentences or phrases. With the rapid development of deep learning in recent years, more and more deep learning methods are being applied to text summarization.
  • Cho et al. [15] and Sutskever et al. [16] first proposed the seq2seq (sequence-to-sequence) model consisting of an encoder and a decoder. Tan et al. [17] proposed a graph-based attention neural model that achieved good results on automatic text summarization. Siddiqui et al. [18] improved on the sequence-to-sequence model proposed by the Google Brain team, using a local attention mechanism instead of a global one to alleviate the problem of generating duplicate content, and achieved good results. Celikyilmaz et al. [19] proposed a deep communicating agents algorithm based on the encoder-decoder architecture to generate abstracts of long documents. Khan et al. [20] proposed a semantic role labeling framework that uses deep learning to perform multi-document summarization from the perspective of semantic role understanding. Jiang Yuehua et al. [21] proposed a generative summarization algorithm based on the seq2seq structure and the attention mechanism that integrates lexical features, using the lexical features to identify more key content during summary generation and further improve summary quality.
  • At present, most automatic text summarization methods mainly use machine learning or deep learning models to extract features automatically and to select and compress summary sentences. However, the automatically extracted features may be insufficient or inconsistent with the summary text and cannot characterize it well. The BERT-PGN model proposed in this article builds on the BERT pre-trained language model and multi-dimensional semantic features; for Chinese news texts, it extracts features along more dimensions and characterizes the text more deeply, so as to obtain summary content that is closer to the topic.

2 BERT-PGN model

  • The BERT-PGN model proposed in this article is implemented in two stages, namely a word vector acquisition stage based on the pre-trained model and multi-dimensional semantic features, and a sentence generation stage based on the pointer generation network model, as shown in Figure 1. In the first stage, the pre-trained language model BERT is used to obtain word vectors of the news article, multi-dimensional semantic features are used at the same time to score the sentences in the news, and the two are simply concatenated to form the input sequence. In the second stage, the input sequence is fed into the pointer generation network model, where the coverage mechanism reduces the generation of repeated text while retaining the ability to generate new text, yielding the news summary.

[Figure 1: Two-stage structure of the BERT-PGN model]

2.1 Word vector acquisition stage based on pre-training model and multi-dimensional semantic features

2.1.1 BERT pre-trained language model
  • The language model is an important concept in the field of natural language processing. Once a language model describes the facts of a language, we obtain a language representation that computers can process. A language model computes the probability p(a1, a2, ..., an) of any language sequence a1, a2, ..., an, that is:
    $$p(a_1, a_2, \ldots, a_n) = \prod_{i=1}^{n} p(a_i \mid a_1, a_2, \ldots, a_{i-1}) \qquad (1)$$
  • The word vectors obtained from a traditional neural network language model are single and fixed, and cannot represent the ambiguity of words. Pre-trained language models solve this problem well and can represent a word according to its context. BERT uses a bidirectional Transformer as the encoder for feature extraction, which captures more contextual information and greatly improves the feature extraction ability of the language model. A Transformer encoding unit consists of two parts: a self-attention mechanism and a feed-forward neural network. The input to the self-attention mechanism consists of three different vectors derived from the same word: the Query vector (Q), the Key vector (K) and the Value vector (V). The similarity between input word vectors is expressed by multiplying the Query and Key vectors, written QK^T, and is scaled by the square root of dk to keep the result a moderate size. Finally, softmax normalization yields a probability distribution, and a weighted sum over all word vectors in the sentence is taken. The word vector obtained this way incorporates contextual information and is more accurate (a minimal sketch of this attention computation is given at the end of this subsection). The calculation is as follows:
    $$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\mathrm T}}{\sqrt{d_k}}\right)V \qquad (2)$$
  • The BERT pre-training model uses the "Multi-Head" mode, in which multiple attention mechanisms run in parallel to capture the contextual semantics of a sentence; this is called the multi-head attention mechanism. The BERT pre-trained language model thus allows word vectors to absorb more contextual information and to better represent the original content.
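To make the attention computation of Eq. (2) concrete, here is a minimal NumPy sketch of scaled dot-product attention as used inside a Transformer encoder layer. The matrices, dimensions and the toy input are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # pairwise similarity of queries and keys
    weights = softmax(scores, axis=-1)   # attention distribution over positions
    return weights @ V                   # weighted sum of the value vectors

# Toy example: a "sentence" of 4 tokens with 8-dimensional Q/K/V vectors.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```

Multi-head attention repeats this computation with several independently projected Q/K/V sets and concatenates the results.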
2.1.2 Multi-dimensional semantic features
  • In view of the characteristics of Chinese news that the key content is concentrated at the beginning and that keywords appear frequently, this paper introduces traditional features and topic features to describe the sentences of Chinese news texts at a fine granularity and improve the contextual semantic representation of the sentences (a hedged sketch of the three feature scores is given at the end of this subsection).
  • 1) Traditional features.
  • The traditional features selected in this article are two sentence-level features: the word frequency within the sentence and the position of the sentence in the article. The word frequency feature is a statistical feature that reflects the most important information in a news article; it is also the simplest and most direct one. The frequency of a word appearing in a news article is calculated with equation (3):
    [Equation (3): frequency of the j-th word in the article]
  • Here wordj denotes the number of times the j-th word appears in the article. In this article, sentences are chosen as the basic unit of scoring. A sentence is a collection of words; if the words it contains include high-frequency words of the news article, the sentence is considered more important in the article. The word-frequency score of the i-th sentence in the news article is computed as follows:
    [Equation (4): word-frequency score TFi of the i-th sentence]
  • Here TFi denotes the sum of the word frequencies of the words contained in the i-th sentence, and seni denotes all the words contained in the i-th sentence. The position feature is also a statistical feature that reflects important information in a news article. A news article consists of multiple sentences, and sentences at different positions differ in importance; for example, the first sentence is usually the most important one in the article. The position score of the i-th sentence in the news article is computed as follows:
    [Equation (5): position score Posi of the i-th sentence]
  • Among them: Posi represents the position score of the i-th sentence, pi represents the position of the i-th sentence in the news article, and n represents the total number of sentences in the article.
  • 2) Topic features.
  • The topic feature selected in this article can also be described as a title feature. The title of a news article has high reference value and can largely represent the topic of the article. Therefore, if a sentence has high similarity to the title of the news article, it is more likely to be selected as a summary sentence. This article uses cosine similarity to compute the topic-feature score of the i-th sentence in the news article, with the following formula:

$$Sim_i = \cos(s, t) = \frac{s \cdot t}{\lVert s \rVert \, \lVert t \rVert} \qquad (6)$$

  • Among them: Simi represents the similarity between the i-th sentence and the title of the news article, s and t represent the vectorized representation of the title and the sentence in the news article respectively.
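Since the exact scoring formulas (3)-(5) are not reproduced in this note, the following Python sketch only illustrates one plausible reading of the three sentence-level features: a normalized word-frequency score, a position score that favors earlier sentences, and a title-similarity score based on cosine similarity. The function name, the tokenizer, and the normalizations are assumptions made for illustration, not the paper's definitions.

```python
from collections import Counter
import math

def sentence_feature_scores(sentences, title, tokenize):
    """Hypothetical sketch of TF, Pos and Main scores for each sentence.

    sentences: list of sentence strings from one news article
    title:     the article title
    tokenize:  callable turning a string into a list of words (e.g. jieba.lcut)
    """
    all_words = [w for s in sentences for w in tokenize(s)]
    counts = Counter(all_words)
    total = sum(counts.values()) or 1

    def cosine(a, b):
        # Cosine similarity between two bag-of-words Counters.
        common = set(a) & set(b)
        num = sum(a[w] * b[w] for w in common)
        den = (math.sqrt(sum(v * v for v in a.values()))
               * math.sqrt(sum(v * v for v in b.values())))
        return num / den if den else 0.0

    title_vec = Counter(tokenize(title))
    n = len(sentences)
    scores = []
    for i, s in enumerate(sentences):
        words = tokenize(s)
        tf = sum(counts[w] / total for w in words)   # assumed word-frequency score
        pos = (n - i) / n                            # assumed position score: earlier is higher
        main = cosine(Counter(words), title_vec)     # cosine similarity with the title
        scores.append({"TF": tf, "Pos": pos, "Main": main})
    return scores

# Toy usage with a character-level tokenizer (a real system might use jieba).
sents = ["北京今日发布暴雨预警", "市民请注意出行安全"]
print(sentence_feature_scores(sents, "北京发布暴雨预警", tokenize=list))
```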

2.2 Sentence generation stage based on pointer generation network model

  • The pointer generation network model combines the pointer network (PN) with an attention-based sequence-to-sequence model, allowing words to be copied directly from the source via pointers or generated from a fixed vocabulary. The words wi of the text are passed in sequence through the BERT-multidimensional-semantic-feature encoder and a Bidirectional Long Short-Term Memory (BiLSTM) encoder, producing the sequence of hidden states hi. At time step t, the Long Short-Term Memory (LSTM) decoder receives the word vector generated at the previous step and produces the decoding state st (a minimal end-to-end sketch of one decoding step is given at the end of this subsection).
  • The attention distribution at determines which parts of the input sequence need attention when outputting at time step t. It is calculated as follows:
    $$e_i^t = v^{\mathrm T}\tanh\left(W_h h_i + W_s s_t + b_{attn}\right), \qquad a^t = \mathrm{softmax}(e^t)$$
  • Here v, Wh, Ws and battn are parameters learned during training. The attention distribution is used to take a weighted average of the encoder hidden states, producing the context vector ht*:

$$h_t^* = \sum_i a_i^t h_i$$

  • The context vector ht* is concatenated with the decoding state st and passed through two linear mappings to produce Pvocab, the current prediction's distribution over the vocabulary. The calculation is as follows:
    $$P_{vocab} = \mathrm{softmax}\left(V'\left(V\,[s_t, h_t^*] + b\right) + b'\right)$$
  • Among them, V’, V, b, b’ are parameters obtained through training.
  • The model uses the generation probability Pgen to determine whether to copy a word or generate a word. The calculation formula is as follows:
    $$p_{gen} = \sigma\left(w_h^{\mathrm T} h_t^* + w_s^{\mathrm T} s_t + w_x^{\mathrm T} x_t + b_{ptr}\right)$$
  • Here wh, ws, wx and bptr are parameters learned during training, σ is the sigmoid function, and xt is the decoder input. Combining the attention distribution at with Pvocab through pgen, we obtain the probability distribution of generating word w:
    $$P(w) = p_{gen} P_{vocab}(w) + (1 - p_{gen}) \sum_{i:\, w_i = w} a_i^t$$
  • To address the problem of repeated words, this article introduces the coverage mechanism. Improving the pointer generation network with the coverage mechanism effectively reduces duplication in the generated summary. A coverage vector ct is introduced to track what has already been generated and impose a penalty on it, so as to minimize repeated generation. The coverage vector ct is computed as follows:
    $$c^t = \sum_{t'=0}^{t-1} a^{t'}$$
  • Intuitively, ct records how much coverage each source word has received from the attention mechanism so far. The coverage vector ct is then used to influence the attention distribution, and at is recomputed as follows:
    $$e_i^t = v^{\mathrm T}\tanh\left(W_h h_i + W_s s_t + W_c c_i^t + b_{attn}\right), \qquad a^t = \mathrm{softmax}(e^t)$$
  • where Wc is the parameter obtained through training.
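To tie the formulas above together, here is a minimal NumPy sketch of a single pointer-generator decoding step with coverage, following the structure of the equations in this subsection (attention energies with a coverage term, context vector, vocabulary distribution, generation probability, and the final copy/generate mixture). All shapes, parameter names and the toy values are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def pgn_decode_step(h, s_t, x_t, c_t, src_ids, params):
    """One decoding step of a pointer-generator network with coverage.

    h:       (src_len, hid)  encoder hidden states h_i
    s_t:     (hid,)          decoder state at time t
    x_t:     (emb,)          decoder input embedding at time t
    c_t:     (src_len,)      coverage vector (sum of previous attention distributions)
    src_ids: (src_len,)      vocabulary ids of the source words (for copying)
    params:  dict of "trained" parameters (here random toy values)
    """
    # Attention with coverage: e_i = v^T tanh(W_h h_i + W_s s_t + w_c c_i + b_attn)
    e = params["v"] @ np.tanh(h @ params["W_h"].T + params["W_s"] @ s_t
                              + np.outer(c_t, params["w_c"]) + params["b_attn"]).T
    a_t = softmax(e)                        # attention distribution over source positions
    h_star = a_t @ h                        # context vector h_t*

    # Vocabulary distribution from [s_t; h_t*] through two linear layers.
    feat = np.concatenate([s_t, h_star])
    P_vocab = softmax(params["V2"] @ (params["V1"] @ feat + params["b1"]) + params["b2"])

    # Generation probability p_gen decides between generating and copying.
    p_gen = 1.0 / (1.0 + np.exp(-(params["w_h"] @ h_star + params["w_s"] @ s_t
                                  + params["w_x"] @ x_t + params["b_ptr"])))

    # Final distribution: P(w) = p_gen * P_vocab(w) + (1 - p_gen) * attention mass on copies of w.
    P_final = p_gen * P_vocab
    for i, wid in enumerate(src_ids):
        P_final[wid] += (1.0 - p_gen) * a_t[i]

    c_next = c_t + a_t                      # coverage update: c^{t+1} = c^t + a^t
    return P_final, a_t, c_next

# Toy usage with random "trained" parameters (purely illustrative).
rng = np.random.default_rng(0)
L_src, H, E, D, V = 6, 8, 8, 16, 30
params = {
    "v": rng.normal(size=H), "W_h": rng.normal(size=(H, H)), "W_s": rng.normal(size=(H, H)),
    "w_c": rng.normal(size=H), "b_attn": np.zeros(H),
    "V1": rng.normal(size=(D, 2 * H)), "b1": np.zeros(D),
    "V2": rng.normal(size=(V, D)), "b2": np.zeros(V),
    "w_h": rng.normal(size=H), "w_s": rng.normal(size=H), "w_x": rng.normal(size=E), "b_ptr": 0.0,
}
P, a, c = pgn_decode_step(h=rng.normal(size=(L_src, H)), s_t=rng.normal(size=H),
                          x_t=rng.normal(size=E), c_t=np.zeros(L_src),
                          src_ids=rng.integers(0, V, size=L_src), params=params)
print(P.sum())  # ≈ 1: a valid probability distribution mixing generation and copying
```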

3 Experiments and Analysis

3.1 Experimental data

  • The data used in the experiments is provided by the 2017 CCF International Conference on Natural Language Processing and Chinese Computing (NLPCC2017) and comes from the NLPCC2017 Chinese single-document news summarization evaluation data set, which contains 49,500 news texts in the training set and 500 news texts in the test set. The summaries generated in this task are required to be no longer than 60 characters.

3.2 Evaluation indicators

  • Rouge is one of the common evaluation indicators for summarization in the field of automatic text summarization. It evaluates the quality of a model-generated summary by counting the basic units (such as n-grams) that overlap between the generated summary and a manually written reference summary. Following the NLPCC2017 Chinese single-document news summarization evaluation task, this article uses Rouge-2, Rouge-4 and Rouge-SU4 as the evaluation indicators (a simplified sketch of the Rouge-N computation follows).
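As a rough illustration of what Rouge-N measures, here is a simplified recall-oriented sketch over n-grams. The official evaluation uses the task's own toolkit, and Rouge-SU4 additionally counts skip-bigrams, so this is only a conceptual approximation; the character-level tokenization in the example is an assumption.

```python
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n):
    """Simplified Rouge-N: overlapping n-grams / n-grams in the reference."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum(min(c, ref[g]) for g, c in cand.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0

# For Chinese text a character-level split is a simple choice.
cand = list("北京今日发布暴雨预警")
ref = list("北京发布暴雨红色预警")
print(round(rouge_n_recall(cand, ref, 2), 3))
```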

3.3 Comparative experiment

  • The experimental part of this article selects eight baseline models: four models from teams that performed well in the NLPCC2017 single-document news summary evaluation task (ccnuSYS, LEAD, NLP@WUST, NLP_ONE) [22], PGN without the coverage mechanism [23], PGN [23], the topic keyword information fusion model [24], and BERT-PGN without semantic features. They are used to verify the validity of the manually extracted topic and traditional features and the effectiveness of the method proposed in this article.

  • 1) ccnuSYS [22]: uses an attention-based LSTM encoder-decoder model to generate summaries.

  • 2) LEAD[22]: Select the first 60 words from the original text as the text summary.

  • 3) NLP@WUST[22]: Use feature engineering methods to extract sentences, and use sentence compression algorithms to compress the extracted sentences.

  • 4) NLP_ONE [22]: the first-ranked algorithm in the NLPCC2017 single-document news summary evaluation task, which applies attention over both the input and output sequences.

  • 5) PGN (without coverage mechanism) [23]: a generative model proposed at ACL 2017 that combines a pointer network with an attention-based sequence-to-sequence model to generate summaries, without using the coverage mechanism.

  • 6) PGN (coverage mechanism) [23]: An improved pointer generation network model that uses the coverage mechanism to solve the problem of generating repeated words and unregistered words.

  • 7) Topic keyword fusion model[24]: A multi-attention mechanism model that combines topic keyword information.

  • 8) BERT-PGN (without semantic features): the model proposed in this article based on BERT and the pointer generation network, using the coverage mechanism to reduce duplicate content but without the multi-dimensional semantic features.

  • 9) BERT-PGN (semantic features): A model optimized on the BERT-PGN (without semantic features) model, which combines multi-dimensional semantic features to obtain fine-grained text context representation.

3.4 Experimental environment and parameter settings

  • The experiments in this article were trained on a single GTX-1080Ti GPU. The BERT-base pre-trained model is used to obtain text word vectors; it has 12 layers with 768-dimensional hidden states. The maximum sequence length is set to 128, train_batch_size to 16, and learning_rate to 5E-5. For the pointer generation network, batch_size is set to 8, the hidden layer has 256 dimensions, and the dictionary size is 50k. Training runs for 700k iterations in total and takes approximately 7 days and 5 hours (173 h). These settings are restated below for reference.
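The hyperparameters reported above, restated as a Python dictionary for quick reference; the key names and the split into two stages are mine, not from any released code.

```python
# Hyperparameters as reported in Section 3.4 (key names are illustrative).
bert_config = {
    "model": "BERT-base",         # 12 layers, 768-dimensional hidden states
    "num_layers": 12,
    "hidden_size": 768,
    "max_seq_length": 128,
    "train_batch_size": 16,
    "learning_rate": 5e-5,
}

pgn_config = {
    "batch_size": 8,
    "hidden_size": 256,
    "vocab_size": 50_000,
    "train_iterations": 700_000,  # ~173 h in total on a single GTX-1080Ti
}
```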

3.5 Experimental results and analysis

3.5.1 Overall summary results comparison experiment
  • This article re-ran some of the baseline models and compares the obtained results with those of the proposed model. The experimental results are shown in Table 1.
    [Table 1: Rouge scores of the different models]
  • As can be seen from Table 1, the model proposed in this article performs significantly better than models such as PGN and NLP_ONE, with clear advantages on the Rouge-2, Rouge-4 and Rouge-SU4 indicators; the Rouge indicators increase by 1.2 to 1.5 percentage points. Comparing the BERT-PGN (semantic features) model with the PGN and BERT-PGN (without semantic features) models shows that using the BERT pre-trained model combined with effective multi-dimensional artificial features significantly improves the model. The sentence context representation obtained from the BERT pre-trained model combined with manually extracted features provides a deeper and more accurate semantic understanding of the sentences in the text, which effectively improves performance on automatic text summarization tasks.
  • From the summaries generated by the different models in Table 2, it can be seen that, compared with the other models, the summaries generated by the proposed BERT-PGN model for Chinese news texts are richer, more comprehensive and closer to the reference summaries. This indicates that the model has a more complete understanding of the full text, can grasp the meaning of sentences and words from their context, and can describe the sentences and words in the text in more detail.
    [Table 2: Examples of summaries generated by the different models]
3.5.2 Multi-dimensional semantic feature comparison experiment
  • For the multi-dimensional feature selection, in view of the characteristic of news texts that the main content is concentrated at the beginning, this paper selects the word-frequency feature, the position feature and the title feature from among the traditional and topic features, denoted TF, Pos and Main respectively. As can be seen from Table 3, the model combined with the manually extracted word-frequency and position features performs best, with the Rouge-2 indicator increasing by up to 1.2 percentage points and the Rouge-4 indicator by up to 1.0 percentage points.

[Table 3: Rouge scores for the different multi-dimensional feature combinations]

  • The topic feature Main can improve the Rouge indicators of the model to a certain extent. Comparing the feature combinations Pos vs. Pos+Main and TF vs. TF+Main shows that adding the topic feature to the word-frequency feature brings a clear improvement, while adding it to the position feature brings essentially none. Sentences that appear earlier in a news article tend to be more similar to the title, which indicates that these two artificial features play a similar role in measuring the importance of a sentence in the news.
  • Comparing the TF+Main and TF+Pos feature combinations shows that word-frequency information combined with position information works better than combining it with topic information and can fully express the importance of sentences in news articles. Therefore, this paper uses the combination of word-frequency and position features as the multi-dimensional features. Keywords that appear multiple times in a news article are a statistical signal of its most important information, and the point of word-frequency statistics is to find the key points the article expresses; in addition, the position at which a sentence appears also reflects its importance: the earlier it appears, the greater its role in the article. Therefore, word-frequency and position features are key to improving the automatic summarization model.
3.5.3 Experimental analysis of coverage mechanism
  • The model in this article uses the coverage mechanism to address the problem of generating duplicate content. By computing the proportions of repeated 1-grams, 2-grams, 3-grams and 4-grams in the generated summaries, the effect of introducing the coverage mechanism on reducing duplicate content is analyzed quantitatively (a hedged sketch of such a duplication measure is given after Table 4). As can be seen from Table 4, compared with NLP_ONE, the proposed BERT-PGN model effectively reduces duplication in the generated content, and in the 3-gram and 4-gram analysis it is close to the level of the reference summaries.
    [Table 4: Proportions of duplicate n-grams in the generated summaries]
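A plausible way to compute duplication statistics of the kind reported in Table 4 is to measure, for each n, the fraction of n-gram occurrences in a generated summary that repeat an earlier n-gram. The following sketch is an assumption about such a measure, not the paper's exact evaluation script; the character-level example summary is invented.

```python
from collections import Counter

def duplicate_ngram_ratio(tokens, n):
    """Fraction of n-gram occurrences in `tokens` that repeat an earlier n-gram."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0
    counts = Counter(grams)
    repeated = sum(c - 1 for c in counts.values() if c > 1)
    return repeated / len(grams)

summary = list("大雨持续大雨持续影响交通")  # deliberately repetitive toy summary
for n in (1, 2, 3, 4):
    print(n, round(duplicate_ngram_ratio(summary, n), 3))
```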

4 Conclusion

  • This paper proposes the BERT-PGN model for Chinese news text. It combines the BERT pre-trained model with multi-dimensional semantic features to obtain word vectors, and uses the pointer generation network model with the coverage mechanism to reduce the generation of duplicate content. Experiments show that, in the Chinese news summarization task, the BERT-PGN model generates summaries that are closer to the reference summaries, contain more of the key information of the original text, and effectively alleviate the problem of duplicate content. Future work will try to mine more elements, such as effective artificial features for news texts, to improve the summaries; simplify the model and shorten training time; improve the completeness and fluency of the generated summaries; and build external data in the news domain to help the model fully understand sentence meaning from context.

I came across this paper at work and wanted to record what I learned from it. If you need the original paper, leave your email address in the comments and I will send it to you.

Origin blog.csdn.net/qq_38978225/article/details/129361343