Analysis of the Paragraph Vector Model for Information Retrieval
Q. Ai, L. Yang, J. Guo, and W. Croft. Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval, pp. 133--142. New York, NY, USA, ACM, (2016)
DOI: 10.1145/2970398.2970409
Abstract
Previous studies have shown that semantically meaningful representations of words and text can be acquired through neural embedding models. In particular, paragraph vector (PV) models have shown impressive performance in some natural language processing tasks by estimating a document (topic) level language model. Integrating the PV models with traditional language model approaches to retrieval, however, produces unstable performance and limited improvements. In this paper, we formally discuss three intrinsic problems of the original PV model that restrict its performance in retrieval tasks. We also describe modifications to the model that make it more suitable for the IR task, and show their impact through experiments and case studies. The three issues we address are (1) the unregulated training process of PV is vulnerable to short document over-fitting that produces length bias in the final retrieval model; (2) the corpus-based negative sampling of PV leads to a weighting scheme for words that overly suppresses the importance of frequent words; and (3) the lack of word-context information makes PV unable to capture word substitution relationships.
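For context, a minimal sketch of the setup the abstract describes: training a paragraph vector (PV / doc2vec) model and interpolating its document-query similarity with a traditional language-model retrieval score. This is an illustrative assumption only, using gensim's Doc2Vec (PV-DBOW) as a stand-in for the original PV model; the toy corpus, the interpolation weight lam, and the placeholder lm_score are hypothetical and not the authors' implementation.

# Rough illustration: train PV-DBOW on a toy corpus and interpolate its
# cosine similarity score with a placeholder language-model score.
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = {
    "d1": "neural embedding models learn word and text representations",
    "d2": "language model approaches to information retrieval",
}
corpus = [TaggedDocument(words=text.split(), tags=[doc_id])
          for doc_id, text in docs.items()]

# dm=0 selects PV-DBOW, the paragraph vector variant commonly used in retrieval experiments.
model = Doc2Vec(vector_size=50, min_count=1, epochs=40, dm=0)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

def pv_score(query, doc_id):
    # Cosine similarity between the inferred query vector and the stored document vector.
    q = model.infer_vector(query.split())
    d = model.dv[doc_id]
    return float(np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d)))

def lm_score(query, doc_id):
    # Placeholder for a traditional (e.g. query-likelihood) language-model score.
    return 0.0

lam = 0.5  # illustrative interpolation weight, not a value from the paper
query = "embedding models"
for doc_id in docs:
    score = lam * pv_score(query, doc_id) + (1 - lam) * lm_score(query, doc_id)
    print(doc_id, round(score, 3))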
Description
Analysis of the Paragraph Vector Model for Information Retrieval
%0 Conference Paper
%1 Ai:2016:APV:2970398.2970409
%A Ai, Qingyao
%A Yang, Liu
%A Guo, Jiafeng
%A Croft, W. Bruce
%B Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval
%C New York, NY, USA
%D 2016
%I ACM
%K doc2vec ma-zehe paragraphvectors
%P 133--142
%R 10.1145/2970398.2970409
%T Analysis of the Paragraph Vector Model for Information Retrieval
%U http://doi.acm.org/10.1145/2970398.2970409
%X Previous studies have shown that semantically meaningful representations of words and text can be acquired through neural embedding models. In particular, paragraph vector (PV) models have shown impressive performance in some natural language processing tasks by estimating a document (topic) level language model. Integrating the PV models with traditional language model approaches to retrieval, however, produces unstable performance and limited improvements. In this paper, we formally discuss three intrinsic problems of the original PV model that restrict its performance in retrieval tasks. We also describe modifications to the model that make it more suitable for the IR task, and show their impact through experiments and case studies. The three issues we address are (1) the unregulated training process of PV is vulnerable to short document over-fitting that produces length bias in the final retrieval model; (2) the corpus-based negative sampling of PV leads to a weighting scheme for words that overly suppresses the importance of frequent words; and (3) the lack of word-context information makes PV unable to capture word substitution relationships.
%@ 978-1-4503-4497-5
@inproceedings{Ai:2016:APV:2970398.2970409,
abstract = {Previous studies have shown that semantically meaningful representations of words and text can be acquired through neural embedding models. In particular, paragraph vector (PV) models have shown impressive performance in some natural language processing tasks by estimating a document (topic) level language model. Integrating the PV models with traditional language model approaches to retrieval, however, produces unstable performance and limited improvements. In this paper, we formally discuss three intrinsic problems of the original PV model that restrict its performance in retrieval tasks. We also describe modifications to the model that make it more suitable for the IR task, and show their impact through experiments and case studies. The three issues we address are (1) the unregulated training process of PV is vulnerable to short document over-fitting that produces length bias in the final retrieval model; (2) the corpus-based negative sampling of PV leads to a weighting scheme for words that overly suppresses the importance of frequent words; and (3) the lack of word-context information makes PV unable to capture word substitution relationships.},
acmid = {2970409},
added-at = {2016-12-18T14:45:47.000+0100},
address = {New York, NY, USA},
author = {Ai, Qingyao and Yang, Liu and Guo, Jiafeng and Croft, W. Bruce},
biburl = {https://www.bibsonomy.org/bibtex/285d9a1411ddba7ebca06c0ed2b8004d9/albinzehe},
booktitle = {Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval},
description = {Analysis of the Paragraph Vector Model for Information Retrieval},
doi = {10.1145/2970398.2970409},
interhash = {bb8a371ee918861e300d4935d65dd849},
intrahash = {85d9a1411ddba7ebca06c0ed2b8004d9},
isbn = {978-1-4503-4497-5},
keywords = {doc2vec ma-zehe paragraphvectors},
location = {Newark, Delaware, USA},
numpages = {10},
pages = {133--142},
publisher = {ACM},
series = {ICTIR '16},
timestamp = {2016-12-18T14:45:47.000+0100},
title = {Analysis of the Paragraph Vector Model for Information Retrieval},
url = {http://doi.acm.org/10.1145/2970398.2970409},
year = 2016
}