PromptRank: Using Prompts for Unsupervised Keyphrase Extraction


  Paper title: PromptRank: Unsupervised Keyphrase Extraction Using Prompt
  Paper date: 2023/05/15 (ACL 2023)
  Paper address: https://arxiv.org/abs/2305.04490
  GitHub address: https://github.com/HLT-NLP/PromptRank

Abstract

  The keyphrase extraction (KPE) task refers to automatically selecting phrases from a given document to summarize its core content. State-of-the-art (SOTA) performance has recently been achieved by embedding-based algorithms, which rank candidates according to how similar their embeddings are to the document embedding. However, such solutions either struggle with the length difference between the document and its candidates, or fail to fully exploit a pre-trained language model (PLM) without further fine-tuning. To solve these problems, this paper proposes PromptRank, a simple and effective unsupervised method based on a PLM with an encoder-decoder architecture. Specifically, PromptRank feeds the document into the encoder and calculates the probability of the decoder generating each candidate given a designed prompt. PromptRank is extensively evaluated on six widely used benchmarks. Compared with the SOTA method MDERank, PromptRank improves the F1 score by 34.18%, 24.87%, and 17.57% for the top-5, top-10, and top-15 returned results, respectively, which shows the great potential of using prompts for unsupervised keyphrase extraction.

1. Introduction

  Keyphrase extraction aims to automatically select phrases from a given document that succinctly summarize its topic, helping readers quickly grasp the key information and facilitating downstream tasks such as information retrieval, text mining, and summarization. Existing keyphrase extraction work can be divided into two categories: supervised and unsupervised. With the development of deep learning, supervised methods have achieved great success by using advanced architectures such as LSTM and Transformer. However, supervised methods require large-scale labeled training data and may generalize poorly to new domains. Unsupervised methods, which mainly include statistics-based, graph-based, and embedding-based approaches, are therefore more popular in industrial scenarios.
  Embedding-based methods have recently achieved SOTA performance and can be further divided into two types. The first, such as EmbedRank and SIFRank, embeds the document and the candidates into a latent space, calculates the similarity between the document and candidate embeddings, and selects the top-k most similar candidates. Due to the length difference between the document and its candidates, the performance of these methods is suboptimal, especially on long documents. To alleviate this problem, a second type was proposed: using a pre-trained language model (PLM), MDERank replaces the embedding of each candidate with the embedding of the masked document, in which that candidate is masked out of the original document. Since the masked document and the original document have similar lengths, the distance between them can be measured reliably: the greater the distance, the more important the masked candidate is as a keyphrase. Although MDERank solves the length-difference problem, it faces another challenge: the PLM is not optimized for measuring this distance, so contrastive fine-tuning is required to further improve performance. This puts an additional burden on training and deploying keyphrase extraction systems, and hinders the rapid adoption of more powerful large language models as they emerge.
  Inspired by CLIP, the authors propose to extend the length of candidates by putting them into a carefully designed template (i.e., a prompt). To compare the document with the corresponding prompt, an encoder-decoder architecture is adopted to map the input (the original document) and the output (the prompt) into a shared latent space. The encoder-decoder architecture has been widely adopted and, by aligning the input and output spaces, has achieved great success in many fields, including machine translation and image captioning. The proposed prompt-based unsupervised method uses an encoder-decoder PLM (for example, T5) to measure similarity without any fine-tuning. After the candidates are selected, the given document is fed into the encoder, and the probability of the decoder generating each candidate given the designed prompt is calculated. The higher the probability, the more important the candidate.
  PromptRank is the first system to use prompts for unsupervised keyphrase extraction. It requires only the document itself, with no extra information. Sufficient experiments demonstrate PromptRank's effectiveness on both short and long texts.
  The main contributions of this paper are summarized as follows:
  (1) A simple and effective unsupervised keyphrase extraction method, PromptRank, is proposed, which uses an encoder-decoder PLM to rank candidates. It is the first method to use prompts for unsupervised keyphrase extraction.
  (2) The factors affecting ranking performance are further studied, including candidate position information, prompt length (prompt length), and prompt content (提示内容).
  (3) PromptRank is evaluated on six widely used benchmarks. Experimental results show that it outperforms the SOTA method MDERank by a large margin, demonstrating the great potential of using prompts for unsupervised keyphrase extraction.

2. Related Work

2.1 Unsupervised Keyphrase Extraction

  Mainstream unsupervised keyphrase extraction methods fall into three categories: statistics-based, graph-based, and embedding-based methods. Statistics-based methods rank candidates by comprehensively considering statistical characteristics such as frequency, position, case, and other captured contextual features. Graph-based methods were first introduced by TextRank, which uses candidates as vertices, constructs edges based on candidate co-occurrence relations, and determines vertex weights with PageRank. Subsequent work, such as SingleRank, TopicRank, PositionRank, and MultipartiteRank, all improve upon TextRank.
  In recent years, embedding-based methods have achieved good performance. EmbedRank ranks candidates by the embedding similarity between the document and each candidate. Following this idea, SIFRank combines the sentence embedding model SIF with the pre-trained language model ELMo to obtain better embedding representations. However, these algorithms perform poorly on long texts due to the length mismatch between the document and its candidates. MDERank solves this problem by replacing candidate embeddings with masked-document embeddings, but it cannot fully exploit the PLM without fine-tuning. To solve these problems, this paper proposes PromptRank, an unsupervised keyphrase extraction method based on prompt learning. Besides the statistics-based, graph-based, and embedding-based methods, AttentionRank uses a pre-trained language model to calculate self-attention and cross-attention to determine the importance and semantic relevance of candidates within the document.

2.2 Prompt Learning

  In the NLP field, prompt learning is considered a new paradigm that can replace fine-tuning pre-trained language models on downstream tasks. Compared with fine-tuning, the natural-language form of a prompt is better aligned with the model's pre-training tasks. Prompt learning has been widely used in many NLP tasks, such as text classification, relation extraction, named entity recognition, and text generation. This paper is the first to use prompt learning for unsupervised keyphrase extraction, leveraging the capabilities of encoder-decoder PLMs such as BART and T5. The authors' use of prompts to increase the length of candidates and alleviate the length mismatch problem is also inspired by CLIP.

3. PromptRank

[Figure: the overall architecture of PromptRank]

  POS stands for part-of-speech, as in part-of-speech tagging.

  The core architecture of PromptRank is shown in the figure above. PromptRank consists of the following four main steps:
  (1) Given a document $d$, generate a candidate set $C = \{c_1, c_2, ..., c_n\}$;
  (2) Feed the document into the encoder, and for each candidate $c \in C$, use the designed prompt to calculate the probability of the decoder generating the candidate, denoted $p_c$;
  (3) Use position information to calculate the position penalty of $c$, denoted $r_c$;
  (4) Calculate the final score $s_c$ based on the probability and the position penalty, then sort the candidates in descending order of $s_c$.

3.1 Candidates Generation

  Following common practice, the authors use a regular expression to extract noun phrases as keyphrase candidates after tokenization and part-of-speech tagging. The regular expression is `<NN.*|JJ>*<NN.*>`.
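  For concreteness, here is a minimal sketch of this candidate-generation step using NLTK's chunk parser. The grammar string is the regular expression quoted above; the function name and the choice of NLTK (rather than whatever toolkit the official repo uses) are illustrative assumptions.

```python
# Candidate generation sketch: tokenize, POS-tag, then chunk noun phrases
# with the paper's regex <NN.*|JJ>*<NN.*>. NLTK is an assumption here; the
# official repo may rely on a different toolkit.
import nltk

GRAMMAR = "NP: {<NN.*|JJ>*<NN.*>}"  # adjectives/nouns ending in a noun

def extract_candidates(document: str) -> list[str]:
    parser = nltk.RegexpParser(GRAMMAR)
    candidates = []
    for sentence in nltk.sent_tokenize(document):
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))  # Penn Treebank tags
        tree = parser.parse(tagged)
        for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
            candidates.append(" ".join(word for word, _ in subtree.leaves()))
    return candidates
```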

3.2 Probability Calculation

  To address the limitations of embedding-based methods, the authors adopt an encoder-decoder architecture that maps the original document and a template filled with a candidate into a shared latent space. The similarity between the document and the filled template is determined by the probability of the decoder generating the filled template: the higher the probability, the more closely the filled template aligns with the document, and the more important the candidate is considered. To simplify the calculation, the candidate is placed at the end of the template, so that only the probability of the candidate tokens needs to be computed to determine the ranking.
  Specifically, the encoder template is filled with the original document and the decoder template is filled with a candidate; the PLM then yields the sequence probabilities $p(y_i \mid y_{<i})$ for the decoder template. Length-normalized log-likelihood is widely used because of its superior performance, so the probability of a candidate is calculated as:

$$p_c = \frac{1}{(l_c)^{\alpha}} \sum_{i=j}^{j+l_c-1} \log p(y_i \mid y_{<i})$$

where $j$ is the starting index of candidate $c$, $l_c$ is the length of the candidate, and $\alpha$ is a PromptRank hyperparameter that adjusts the preference for candidate length. Candidates are then ranked in descending order of $p_c$.
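  A hedged sketch of this scoring step with Hugging Face Transformers follows. The template wording ("Book: ..." / "This book mainly talks about ...") echoes the style of templates discussed in Section 4.4.3 but should be treated as an assumption, as should the helper name `candidate_log_prob`.

```python
# Probability calculation sketch (Sec. 3.2): score a candidate by the
# length-normalized log-likelihood of its tokens under T5's decoder.
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base").eval()

def candidate_log_prob(document: str, candidate: str, alpha: float = 0.6) -> float:
    # Encoder input: document filled into the encoder template (assumed wording).
    enc = tokenizer("Book: " + document, return_tensors="pt",
                    truncation=True, max_length=512)
    # Decoder input: candidate placed at the END of the decoder template,
    # so only the candidate's tokens need to be scored.
    labels = tokenizer("This book mainly talks about " + candidate,
                       return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(input_ids=enc.input_ids,
                       attention_mask=enc.attention_mask,
                       labels=labels).logits
    log_probs = torch.log_softmax(logits, dim=-1)
    token_lp = log_probs[0].gather(1, labels[0].unsqueeze(1)).squeeze(1)  # log p(y_i | y_<i)
    # l_c: candidate token count; assumes boundary tokenization matches in context.
    l_c = len(tokenizer(candidate, add_special_tokens=False).input_ids)
    j = labels.shape[1] - 1 - l_c  # candidate tokens sit just before the final </s>
    return token_lp[j:j + l_c].sum().item() / (l_c ** alpha)
```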

3.3 Position Penalty Calculation

  When we write an article, we usually start with its main points. Research shows that the position of a candidate in a document can serve as an effective statistical feature for keyphrase extraction.

  "When writing an article, it is common practice to begin with the main points of the article." — that assumption doesn't feel entirely reliable, does it?

  In this paper, the authors use a position penalty to modulate the log probability of a candidate through multiplication. Since the log probability is negative, an unimportant position receives a larger position penalty, giving candidates in unimportant positions lower overall scores and reducing their likelihood of being selected as keyphrases. Specifically, for candidate $c$, PromptRank calculates the position penalty as:

$$r_c = \frac{pos}{len} + \beta$$

where $pos$ is the position where $c$ first appears, $len$ is the length of the document, and $\beta$ is a positive parameter that adjusts the influence of position information. The larger $\beta$, the smaller the role position information plays in the penalty; that is, when $\beta$ is large, the difference in $r_c$ between two positions shrinks. Different values of $\beta$ can therefore be used to control the sensitivity to candidate position.
  The authors also observe that the effectiveness of position information is related to document length: the longer the document, the more effective the position information. Therefore, for longer documents, $\beta$ is assigned a smaller value. Empirically, the authors express $\beta$ as:

$$\beta = \frac{\gamma}{len^3}$$

where $\gamma$ is a hyperparameter determined experimentally.
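  Putting the two formulas together, the position penalty reduces to a few lines. A sketch follows, where measuring `pos` and `doc_len` in the same unit (e.g., tokens) is an assumption, and `gamma` defaults to the value reported in Section 4.2.

```python
# Position penalty sketch (Sec. 3.3): r_c = pos/len + beta, beta = gamma/len^3.
def position_penalty(pos: int, doc_len: int, gamma: float = 1.2e8) -> float:
    beta = gamma / (doc_len ** 3)  # longer document -> smaller beta -> position matters more
    return pos / doc_len + beta
```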

3.4 Candidates Ranking

  After obtaining the position penalty $r_c$, PromptRank calculates the final score as:

$$s_c = r_c \times p_c$$

The position penalty adjusts the log probability of a candidate, reducing the chance that candidates far from the beginning of the document are selected as keyphrases. The candidates are sorted in descending order of final score, and the top-k are selected as keyphrases.
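  The four steps can then be chained into a single ranking routine. The sketch below reuses the hypothetical helpers from the previous snippets (`extract_candidates`, `candidate_log_prob`, `position_penalty`); locating `pos` by a lowercase substring search is a simplification of however the authors compute the first occurrence.

```python
# End-to-end ranking sketch (Sec. 3.4): s_c = r_c * p_c, sorted descending.
def prompt_rank(document: str, k: int = 15) -> list[str]:
    scored = []
    for cand in set(extract_candidates(document)):
        p_c = candidate_log_prob(document, cand)           # negative (log-likelihood)
        pos = max(document.lower().find(cand.lower()), 0)  # first occurrence
        r_c = position_penalty(pos, len(document))
        scored.append((r_c * p_c, cand))                   # larger penalty -> more negative score
    scored.sort(key=lambda t: t[0], reverse=True)          # least negative first
    return [cand for _, cand in scored[:k]]
```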

4. Experiments

4.1 Datasets and Evaluation Metrics

  For a comprehensive and accurate evaluation, the authors evaluate PromptRank on six widely used datasets, consistent with the current SOTA method MDERank. These datasets are Inspec, SemEval-2010, SemEval-2017, DUC2001, NUS, and Krapivin. Their statistics are shown in the following table:

[Table: dataset statistics]

  Following previous work, F1 on the top-5, top-10, and top-15 candidates is evaluated. When calculating F1, duplicate candidates are removed and stemming is applied.
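  As a reference for what this protocol looks like in code, here is a small sketch of F1@k with stemming and deduplication; PorterStemmer and exact string matching after stemming are assumptions about details the paper does not spell out.

```python
# Evaluation sketch: stem, deduplicate predictions, truncate to top-k, then F1.
from nltk.stem import PorterStemmer

_stemmer = PorterStemmer()

def _stem(phrase: str) -> str:
    return " ".join(_stemmer.stem(w) for w in phrase.lower().split())

def f1_at_k(predicted: list[str], gold: list[str], k: int) -> float:
    preds = list(dict.fromkeys(_stem(p) for p in predicted))[:k]  # dedup, keep rank order
    golds = {_stem(g) for g in gold}
    tp = sum(p in golds for p in preds)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(preds), tp / len(golds)
    return 2 * precision * recall / (precision + recall)
```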

4.2 Baselines and Implementation Details

  The authors chose the same baselines as MDERank. These include graph-based methods such as TextRank, SingleRank, TopicRank, and MultipartiteRank, statistics-based methods such as YAKE, and embedding-based methods such as EmbedRank, SIFRank, and MDERank itself; the baseline results reported by MDERank are used directly. For a fair comparison, PromptRank's pre- and post-processing are kept consistent with MDERank's. The authors use a T5-base model (220 million parameters), which is similar in parameter scale to the BERT-base used by MDERank. Additionally, to match BERT's settings, the maximum encoder input length is set to 512.
  PromptRank is an unsupervised algorithm that needs only two hyperparameters: $\alpha$ and $\gamma$. PromptRank is designed to generalize out of the box rather than to fit a single dataset, so the authors evaluate it on all six datasets with the same hyperparameters: $\alpha$ is set to $0.6$ and $\gamma$ is set to $1.2 \times 10^8$.

4.3 Overall Results

[Table: overall results of PromptRank and the baselines on the six datasets]

  The table above shows the F1@5, F1@10, and F1@15 scores of PromptRank and the baseline models on the six datasets. PromptRank achieves the best performance on almost all metrics across all datasets, proving the effectiveness of the proposed method. Specifically, PromptRank outperforms the SOTA method MDERank with average relative improvements of 34.18%, 24.87%, and 17.57% on F1@5, F1@10, and F1@15, respectively. It is worth noting that, compared with EmbedRank and SIFRank, MDERank's gains mainly come from the two very long datasets (Krapivin, NUS), while PromptRank achieves the best performance on almost all datasets. This highlights the generalization ability of the proposed method, which works well on datasets with very different document lengths.
  As text length increases, the length mismatch between the document and its candidates becomes more and more severe. To further investigate PromptRank's ability to handle this problem, the authors compare the average F1@5, F1@10, and F1@15 of PromptRank, EmbedRank, and MDERank across the six datasets. As document length increases, the number of candidates grows rapidly and keyphrase extraction performance decreases.

[Figure: average performance of EmbedRank, MDERank, and PromptRank as document length increases]
  As shown in the figure above, EmbedRank is particularly affected by the length mismatch, and its performance drops quickly. MDERank and PromptRank both mitigate this decline. However, the masked-document embeddings used by MDERank do not work as expected, because BERT is not trained to guarantee that masking a more important phrase causes a larger change in the embedding; BERT is only trained to recover masked tokens. By leveraging an encoder-decoder PLM and prompts, PromptRank not only alleviates the performance degradation on long texts more effectively than MDERank, but also outperforms both baselines on short texts.

4.4 Ablation Study

4.4.1 Effects of Position Penalty

  To evaluate the contribution of the position penalty to PromptRank's overall performance, the authors conducted experiments in which candidates are ranked based only on their prompt probabilities. The results are shown in the table below:

[Table: ablation results with and without the position penalty]
  Even without the position penalty, PromptRank performs significantly better than MDERank. When the position penalty is taken into account, performance improves further, especially on the long-text datasets. This suggests that the prompt-based probability is the core of PromptRank, while position information provides an additional benefit.

4.4.2 Effects of Template Length

  PromptRank accounts for the length mismatch of embedding-based methods by filling candidates into templates. To study how long a template must be to avoid the defects of embeddings, the authors conducted experiments with templates of different lengths, namely 0, 2, 5, 10, and 20. Except for the length-0 group, each length includes 4 manual templates (see Appendix A.2 of the paper), and position information is not used. To exclude the influence of template content, for each template the ratio of the performance on each dataset to the performance on Inspec (short text) is calculated, measuring the degradation caused by increasing text length.

[Figure: performance degradation with increasing text length for templates of different lengths]

  As shown in the figure above, the higher the curve, the smaller the degradation. Templates of length 0 and 2 degrade severely, facing the same problems as embeddings and failing to exploit the prompt. Templates of length greater than or equal to 5 handle the length mismatch much better, which provides guidance for template selection.

4.4.3 Effects of Template Content

[Table: typical templates and their results]

  The content of the template directly affects keyphrase extraction performance. Some typical templates and their results are shown in the table above (no position information is used). The empty template 1 gives the worst results. Templates 2-5 all have length 5 and all outperform template 1. Template 4 achieves the best performance on all metrics. The paper therefore concludes that well-designed prompts are beneficial. Note that all templates are designed manually, leaving the automation of template construction as future work.

4.4.4 Effects of Hyperparameter α

  PromptRank's preference for candidate length is controlled by $\alpha$: the higher $\alpha$, the more likely longer candidates are to be selected. To explore the influence of different $\alpha$ values, the authors conducted experiments without position information, varying $\alpha$ from $0.2$ to $1$ with a step size of $0.1$. The optimal value of $\alpha$ on each of the six datasets is shown in the following table:

[Table: optimal $\alpha$ per dataset]
  $L_{ak}$ is the average word count of the gold keyphrases. Intuitively, the smaller $L_{ak}$ is for a dataset, the smaller the optimal $\alpha$. The results show that most datasets are consistent with this conjecture. Note that SemEval2017, which has the highest $L_{ak}$, is insensitive to $\alpha$ because the length distribution of its gold keyphrases is relatively more balanced. To keep PromptRank performing well on every benchmark, rather than pursuing the best average F1 across all datasets, $\alpha$ is set to $0.6$.

4.4.5 Effects of Hyperparameter γ

  The influence of position information is controlled by $\beta$: the larger $\beta$, the smaller the impact of position information on ranking. Previous work has shown that position information improves performance on long texts while degrading it on short texts. To solve this problem, the authors use the hyperparameter $\gamma$ to dynamically adjust $\beta$, aiming to minimize the effect of a large $\beta$ on short texts while maximizing the benefit of a small $\beta$ on long texts. Experiments determine the optimal value of $\gamma$ to be $1.2 \times 10^8$; the resulting average $\beta$ values on the six datasets are shown in Table 5 of the paper. As Table 3 of the paper shows, PromptRank's performance on short texts is preserved while its performance on long texts improves significantly.

4.4.6 Effects of the PLM

  PromptRank uses T5-base as the default PLM. To explore whether the mechanism is restricted to one specific PLM, the authors conducted experiments with models of different sizes and types, e.g. BART. The results are shown in the table below: all models outperform the current SOTA method MDERank, even though the hyperparameters and prompts were tuned for T5-base. This shows that PromptRank is not limited to a specific PLM and generalizes well across different encoder-decoder PLMs, enabling the rapid adoption of new PLMs when more powerful ones become available.

[Table: results with PLMs of different sizes and types]

4.5 Case Study

  To demonstrate PromptRank's effectiveness, the authors randomly select a document from the Inspec dataset and compare the scores produced by MDERank and PromptRank. The raw scores are normalized and presented as a heat map, where warmer colors indicate higher scores and more important candidates; gold keyphrases are underlined in bold italics. Compared with MDERank, PromptRank gives the gold keyphrases higher ratings more accurately and better distinguishes irrelevant candidates. The experimental results again show that PromptRank performs better than the SOTA method MDERank.

5. Conclusion

  This paper proposes PromptRank, a prompt-based unsupervised keyphrase extraction method built on an encoder-decoder PLM. Candidates are ranked by the probability of the decoder generating each candidate given a designed prompt. Extensive experiments on six widely used benchmarks demonstrate the effectiveness of PromptRank, which significantly outperforms strong baselines. The various factors affecting its performance are thoroughly studied, yielding valuable insights. The method requires no modification of the PLM architecture and introduces no additional parameters, making it a simple yet powerful keyphrase extraction approach.


