Abstract
Paragraph Vectors has recently been proposed as an unsupervised method for
learning distributed representations of pieces of text. In the original work, the
authors showed that the method can learn an embedding of movie review texts
which can be leveraged for sentiment analysis. That proof of concept, while
encouraging, was rather narrow. Here we consider tasks other than sentiment
analysis, provide a more thorough comparison of Paragraph Vectors to other
document modelling algorithms such as Latent Dirichlet Allocation, and evaluate
performance of the method as we vary the dimensionality of the learned
representation. We benchmark the models on two document similarity data sets,
one from Wikipedia and one from arXiv. We observe that the Paragraph Vector method
performs significantly better than other methods, and propose a simple
improvement to enhance embedding quality. Somewhat surprisingly, we also show
that, much like word embeddings, vector operations on Paragraph Vectors can
produce semantically useful results.