@misc{lau2016empirical,
abstract = {Recently, Le and Mikolov (2014) proposed doc2vec as an extension to word2vec
(Mikolov et al., 2013a) to learn document-level embeddings. Despite promising
results in the original paper, others have struggled to reproduce those
results. This paper presents a rigorous empirical evaluation of doc2vec over
two tasks. We compare doc2vec to two baselines and two state-of-the-art
document embedding methodologies. We found that doc2vec performs robustly when
using models trained on large external corpora, and can be further improved by
using pre-trained word embeddings. We also provide recommendations on
hyper-parameter settings for general purpose applications, and release source
code to induce document embeddings using our trained doc2vec models.},
author = {Lau, Jey Han and Baldwin, Timothy},
description = {[1607.05368] An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation
Pretrained doc2vec models available from https://github.com/jhlau/doc2vec},
keywords = {doc2vec fakenews},
note = {1st Workshop on Representation Learning for NLP; arXiv preprint arXiv:1607.05368},
title = {An Empirical Evaluation of doc2vec with Practical Insights into Document
Embedding Generation},
url = {http://arxiv.org/abs/1607.05368},
year = 2016
}