A study using n-gram features for text categorization

Abstract

In this paper, we study the effect of using n-grams (sequences of words of length n) for text categorization. We use an efficient algorithm for generating such n-gram features in two benchmark domains, the 20 newsgroups data set and 21,578 REUTERS newswire articles. Our results with the rule learning algorithm RIPPER indicate that, after the removal of stop words, word sequences of length 2 or 3 are most useful. Using longer sequences reduces classification performance.
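
To make the notion of word n-gram features concrete, the following is a minimal sketch of extracting all word sequences up to a given length after stop-word removal. It is a naive enumeration, not the efficient generation algorithm the abstract refers to; the function name `word_ngrams`, its parameters, and the tiny stop-word set are assumptions made for illustration only.

```python
def word_ngrams(tokens, max_n, stop_words=frozenset()):
    """Return all word n-grams of length 1..max_n as space-joined strings.

    Hypothetical helper: stop words are dropped first, so an n-gram may
    span words that were not adjacent in the original text.
    """
    kept = [t.lower() for t in tokens if t.lower() not in stop_words]
    features = []
    for n in range(1, max_n + 1):
        # Slide a window of width n over the remaining tokens.
        for i in range(len(kept) - n + 1):
            features.append(" ".join(kept[i:i + n]))
    return features


# Illustrative input and stop-word list (not taken from the paper).
example = word_ngrams(
    "the stock market fell on Monday".split(),
    max_n=2,
    stop_words={"the", "on"},
)
print(example)
```

Setting `max_n` to 2 or 3 corresponds to the sequence lengths the abstract reports as most useful; larger values grow the feature set quickly.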

BibTeX key: citeulike:1952805
entry type: misc
year: 1998
posted-at: 2007-11-21 16:18:49
priority: 0
citeulike-article-id: 1952805
Document: http://citeseer.ist.psu.edu/176994.html
