
Extending the single words-based document model: a comparison of bigrams and 2-itemsets

DocEng '06: Proceedings of the 2006 ACM Symposium on Document Engineering, pages 138-146. New York, NY, USA, ACM (2006)
DOI: http://doi.acm.org/10.1145/1166160.1166197

Abstract

The basic approach in text categorization is to represent documents by single words. However, often other features are utilized to achieve better classification results. In this paper, our attention is focused on bigrams and 2-itemsets. We compare the performance improvement in terms of classification accuracy when these features are used to extend the single words-based document representation on two standard text corpora: Reuters-21578 and 20 Newsgroups. For this comparison we use the multinomial Naive Bayes classifier and five different feature selection approaches. Algorithms for bigrams and 2-itemsets discovery are presented as well. Our results show a statistically significant improvement when bigrams and also 2-itemsets are incorporated. However, in the case of 2-itemsets it is important to use an appropriate feature selection method. On the other hand, even when a simple feature selection approach is applied to discover bigrams the classification accuracy improves. The conclusion is that, in our case, it is not very effective to extend document representation with 2-itemsets because bigrams achieve better results and discovering them is less resource-consuming.
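To make the two feature types concrete, here is a minimal Python sketch. It is not the discovery algorithm from the paper, which this record does not reproduce. It assumes a bigram is an ordered pair of adjacent words and a 2-itemset is an unordered pair of distinct words occurring in the same document. The helper names (`bigrams`, `two_itemsets`, `frequent`), the toy documents, and the `min_docs` document-frequency threshold are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations


def bigrams(tokens):
    """Ordered pairs of adjacent tokens (assumed definition of a bigram)."""
    return list(zip(tokens, tokens[1:]))


def two_itemsets(tokens):
    """Unordered pairs of distinct words that co-occur in one document."""
    return {frozenset(pair) for pair in combinations(sorted(set(tokens)), 2)}


def frequent(features_per_doc, min_docs):
    """Keep features that appear in at least `min_docs` documents."""
    counts = Counter(f for feats in features_per_doc for f in set(feats))
    return {f for f, c in counts.items() if c >= min_docs}


# Toy corpus, invented for this example only.
docs = [
    "oil prices rise".split(),
    "crude oil prices fall".split(),
    "prices of oil rise".split(),
]

frequent_bigrams = frequent([bigrams(d) for d in docs], min_docs=2)
frequent_itemsets = frequent([two_itemsets(d) for d in docs], min_docs=2)
```

On this toy corpus only one bigram survives the threshold, while three 2-itemsets do, because 2-itemsets ignore word order and adjacency. The number of candidate 2-itemsets grows quadratically with the number of distinct words in a document, which is one plausible reason their discovery is more resource-consuming than scanning for adjacent pairs.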

