ICTIR 2016 Analysis of the Paragraph Vector Model for Information Retrieval

Copyright notice: This is an original article by the blog author and may not be reproduced without the author's permission. https://blog.csdn.net/yangliuy/article/details/52970190
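Summary: This paper extends the earlier SIGIR '16 work. It analyzes in more depth three problems that arise when the PV model is applied to IR tasks, and proposes a corresponding improvement for each of the three.
Venue: ICTIR '16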

Abstract: Previous studies have shown that semantically meaningful representations of words and text can be acquired through neural embedding models. In particular, paragraph vector (PV) models have shown impressive performance in some natural language processing tasks by estimating a document (topic) level language model. Integrating the PV models with traditional language model approaches to retrieval, however, produces unstable performance and limited improvements. In this paper, we formally discuss three intrinsic problems of the original PV model that restrict its performance in retrieval tasks. We also describe modifications to the model that make it more suitable for the IR task, and show their impact through experiments and case studies. The three issues we address are (1) the unregulated training process of PV is vulnerable to short document overfitting that produces length bias in the final retrieval model; (2) the corpus-based negative sampling of PV leads to a weighting scheme for words that overly suppresses the importance of frequent words; and (3) the lack of word-context information makes PV unable to capture word substitution relationships.
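
To make issue (2) a little more concrete, here is a minimal sketch of a corpus-frequency noise distribution of the kind used for negative sampling in word2vec-style models. It is only an illustration, not the paper's implementation: the toy corpus, the function name `noise_distribution`, and the 0.75 exponent (the usual word2vec convention) are assumptions that do not come from the abstract. It only shows that words frequent in the corpus are drawn as negative samples more often, which is the starting point for the weighting problem the abstract describes.

```python
from collections import Counter


def noise_distribution(docs, power=0.75):
    """Return P_n(w) proportional to cf(w) ** power over the whole corpus.

    cf(w) is the corpus frequency of word w. The exponent 0.75 is the
    common word2vec choice and is an assumption here, not taken from the paper.
    """
    counts = Counter(word for doc in docs for word in doc)
    weights = {word: count ** power for word, count in counts.items()}
    total = sum(weights.values())
    return {word: weight / total for word, weight in weights.items()}


# Hypothetical toy corpus: "model" occurs in every document.
docs = [
    ["retrieval", "model", "language", "model"],
    ["paragraph", "vector", "model"],
    ["language", "model", "smoothing"],
]

noise = noise_distribution(docs)

# Print words from most to least likely to be drawn as a negative sample.
for word, prob in sorted(noise.items(), key=lambda kv: -kv[1]):
    print(f"{word:10s} {prob:.3f}")
```

In this toy corpus "model" is the most frequent word, so it receives the largest probability of being sampled as a negative example, followed by "language" and then the words that occur once.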

Download link: https://ciir-publications.cs.umass.edu/pub/web/getpdf.php?id=1242


Reposted from blog.csdn.net/yangliuy/article/details/52970190