Analysis of the Paragraph Vector Model for Information Retrieval

Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval, pp. 133-142. New York, NY, USA, ACM, (2016)
DOI: 10.1145/2970398.2970409

Abstract

Previous studies have shown that semantically meaningful representations of words and text can be acquired through neural embedding models. In particular, paragraph vector (PV) models have shown impressive performance in some natural language processing tasks by estimating a document (topic) level language model. Integrating the PV models with traditional language model approaches to retrieval, however, produces unstable performance and limited improvements. In this paper, we formally discuss three intrinsic problems of the original PV model that restrict its performance in retrieval tasks. We also describe modifications to the model that make it more suitable for the IR task, and show their impact through experiments and case studies. The three issues we address are (1) the unregulated training process of PV is vulnerable to short document over-fitting that produces length bias in the final retrieval model; (2) the corpus-based negative sampling of PV leads to a weighting scheme for words that overly suppresses the importance of frequent words; and (3) the lack of word-context information makes PV unable to capture word substitution relationships.
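
The abstract mentions integrating PV models with traditional language model approaches to retrieval but does not give a formula. As a rough illustration only, the sketch below assumes a linear interpolation between a maximum-likelihood document language model and a softmax over dot products between a document vector and word vectors. The function names, the weight `lam`, and the toy vectors are all hypothetical and are not taken from the paper.

```python
import math

def softmax_word_probs(doc_vec, word_vecs):
    """Assumed PV-style P(w | d): softmax over dot products of the document and word vectors."""
    scores = {w: sum(a * b for a, b in zip(vec, doc_vec)) for w, vec in word_vecs.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {w: math.exp(s - top) for w, s in scores.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}

def interpolated_query_log_likelihood(query, doc_tokens, doc_vec, word_vecs, lam=0.5):
    """log P(q | d), with P(w | d) = (1 - lam) * P_ML(w | d) + lam * P_PV(w | d).

    Assumes every query word has a vector in word_vecs, doc_tokens is non-empty,
    and lam > 0 (otherwise words absent from the document get probability zero).
    """
    pv = softmax_word_probs(doc_vec, word_vecs)
    n = len(doc_tokens)
    score = 0.0
    for w in query:
        p_ml = doc_tokens.count(w) / n  # maximum-likelihood estimate from term counts
        score += math.log((1 - lam) * p_ml + lam * pv[w])
    return score

# Toy data (hypothetical): two-dimensional vectors chosen by hand, not trained.
word_vecs = {"car": [1.0, 0.0], "auto": [0.9, 0.1], "tree": [0.0, 1.0]}
doc_tokens = ["car", "car", "tree"]
doc_vec = [1.0, 0.2]

score_car = interpolated_query_log_likelihood(["car"], doc_tokens, doc_vec, word_vecs)
score_auto = interpolated_query_log_likelihood(["auto"], doc_tokens, doc_vec, word_vecs)
```

In this toy setup the word "auto" never occurs in the document, yet it still receives a finite score because its vector lies close to the document vector. That is the kind of contribution an embedding-based component is meant to add on top of exact term matching.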
