@inproceedings{yang2016using,
abstract = {Word embeddings and convolutional neural networks (CNN) have attracted
extensive attention in various classification tasks for Twitter, e.g. sentiment
classification. However, the effect of the configuration used to train and
generate the word embeddings on the classification performance has not been
studied in the existing literature. In this paper, using a Twitter election
classification task that aims to detect election-related tweets, we investigate
the impact of the background dataset used to train the embedding models, the
context window size and the dimensionality of word embeddings on the
classification performance. By comparing the classification results of two word
embedding models, which are trained using different background corpora (e.g.
Wikipedia articles and Twitter microposts), we show that the background data
type should align with the Twitter classification dataset to achieve better
performance. Moreover, by evaluating the results of word embedding models
trained using various context window sizes and dimensionalities, we find that
larger context window and dimension sizes are preferable for improving
classification performance. Our experimental results also show that using word embeddings and
CNN leads to statistically significant improvements over various baselines such
as random, SVM with TF-IDF and SVM with word embeddings.},
author = {Yang, Xiao and Macdonald, Craig and Ounis, Iadh},
biburl = {https://www.bibsonomy.org/bibtex/2bd1b364287eaddbe1e9e41740ce92b85/schwemmlein},
keywords = {cnn embeddings nlp svm twitter word},
note = {cite arXiv:1606.07006. Comment: NeuIR Workshop 2016},
title = {Using Word Embeddings in Twitter Election Classification},
url = {http://arxiv.org/abs/1606.07006},
year = 2016
}