Extending the single words-based document model: a comparison of bigrams and 2-itemsets

R. Tesar, V. Strnad, K. Jezek, and M. Poesio. DocEng '06: Proceedings of the 2006 ACM symposium on Document engineering, pages 138--146. New York, NY, USA, ACM (2006)
DOI: http://doi.acm.org/10.1145/1166160.1166197

Abstract

The basic approach in text categorization is to represent documents by single words. However, often other features are utilized to achieve better classification results. In this paper, our attention is focused on bigrams and 2-itemsets. We compare the performance improvement in terms of classification accuracy when these features are used to extend the single words-based document representation on two standard text corpora: Reuters-21578 and 20 Newsgroups. For this comparison we use the multinomial Naive Bayes classifier and five different feature selection approaches. Algorithms for bigrams and 2-itemsets discovery are presented as well. Our results show a statistically significant improvement when bigrams and also 2-itemsets are incorporated. However, in the case of 2-itemsets it is important to use an appropriate feature selection method. On the other hand, even when a simple feature selection approach is applied to discover bigrams the classification accuracy improves. The conclusion is that, in our case, it is not very effective to extend document representation with 2-itemsets because bigrams achieve better results and discovering them is less resource-consuming.
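To make the two feature types concrete, here is a minimal Python sketch. It is not the discovery algorithm from the paper, which this record does not reproduce. It assumes the common definitions: a bigram is an ordered pair of adjacent words, and a 2-itemset is an unordered pair of distinct words that co-occur in the same document. The helper names, the whitespace tokenizer, and the `min_df` document-frequency threshold are all hypothetical choices made for illustration, and the threshold is not one of the five feature selection approaches the abstract mentions.

```python
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def bigrams(tokens):
    """Ordered pairs of adjacent tokens."""
    return list(zip(tokens, tokens[1:]))


def two_itemsets(tokens):
    """Unordered pairs of distinct words co-occurring in one document."""
    vocab = sorted(set(tokens))  # sorting gives each pair one canonical form
    return list(combinations(vocab, 2))


def frequent_features(docs, extract, min_df=2):
    """Keep features found in at least min_df documents (assumed threshold)."""
    doc_freq = Counter()
    for doc in docs:
        # set() so that each document counts a feature at most once
        doc_freq.update(set(extract(tokenize(doc))))
    return {feat for feat, count in doc_freq.items() if count >= min_df}


docs = [
    "new york stock exchange",
    "stock exchange in new york",
    "york is a city",
]

frequent_bigrams = frequent_features(docs, bigrams)
frequent_itemsets = frequent_features(docs, two_itemsets)
```

On this toy corpus the candidate 2-itemsets outnumber the candidate bigrams, since every pair of words in a document is a candidate rather than only adjacent pairs. That is consistent with the abstract's remark that discovering bigrams is less resource-consuming, though the sketch says nothing about classification accuracy.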

Description

Extending the single words-based document model

Links and resources

BibTeX key: Tesar2006
entry type: inproceedings
address: New York, NY, USA
booktitle: DocEng '06: Proceedings of the 2006 ACM symposium on Document engineering
year: 2006
pages: 138--146
publisher: ACM
location: Amsterdam, The Netherlands
isbn: 1-59593-515-0
DOI: http://doi.acm.org/10.1145/1166160.1166197
url: http://portal.acm.org/citation.cfm?id=1166160.1166197

Cite this publication

@inproceedings{Tesar2006,
  abstract = {The basic approach in text categorization is to represent documents by single words. However, often other features are utilized to achieve better classification results. In this paper, our attention is focused on bigrams and 2-itemsets. We compare the performance improvement in terms of classification accuracy when these features are used to extend the single words-based document representation on two standard text corpora: Reuters-21578 and 20 Newsgroups. For this comparison we use the multinomial Naive Bayes classifier and five different feature selection approaches. Algorithms for bigrams and 2-itemsets discovery are presented as well. Our results show a statistically significant improvement when bigrams and also 2-itemsets are incorporated. However, in the case of 2-itemsets it is important to use an appropriate feature selection method. On the other hand, even when a simple feature selection approach is applied to discover bigrams the classification accuracy improves. The conclusion is that, in our case, it is not very effective to extend document representation with 2-itemsets because bigrams achieve better results and discovering them is less resource-consuming.},
  added-at = {2009-05-14T07:56:06.000+0200},
  address = {New York, NY, USA},
  author = {Tesar, Roman and Strnad, Vaclav and Jezek, Karel and Poesio, Massimo},
  biburl = {https://www.bibsonomy.org/bibtex/248079e2741af01306bc91583f028be30/jamesh},
  booktitle = {DocEng '06: Proceedings of the 2006 ACM symposium on Document engineering},
  description = {Extending the single words-based document model},
  doi = {http://doi.acm.org/10.1145/1166160.1166197},
  interhash = {ee2cf973053b39bb099ecccdda0e1385},
  intrahash = {48079e2741af01306bc91583f028be30},
  isbn = {1-59593-515-0},
  keywords = {bigram textcateg},
  location = {Amsterdam, The Netherlands},
  pages = {138--146},
  publisher = {ACM},
  timestamp = {2009-05-14T07:56:06.000+0200},
  title = {Extending the single words-based document model: a comparison of bigrams and 2-itemsets},
  url = {http://portal.acm.org/citation.cfm?id=1166160.1166197},
  year = 2006
}
