A Comparative Study on Feature Selection in Text Categorization

Proceedings of the Fourteenth International Conference on Machine Learning, pages 412--420. San Francisco, CA, USA, Morgan Kaufmann Publishers Inc., (1997)

Abstract

This paper is a comparative study of feature selection methods in statistical learning of text categorization. The focus is on aggressive dimensionality reduction. Five methods were evaluated, including term selection based on document frequency (DF), information gain (IG), mutual information (MI), a χ²-test (CHI), and term strength (TS). We found IG and CHI most effective in our experiments. Using IG thresholding with a k-nearest neighbor classifier on the Reuters corpus, removal of up to 98% of unique terms actually yielded an improved classification accuracy (measured by average precision). DF thresholding performed similarly. Indeed we found strong correlations between the DF, IG and CHI values of a term. This suggests that DF thresholding, the simplest method with the lowest cost in computation, can be reliably used instead of IG or CHI when the computation of these measures is too expensive. TS compares favorably with the other methods with up to 50% vocabulary reduction but is not competitive at higher vocabulary reduction levels. In contrast, MI had relatively poor performance due to its bias towards favoring rare terms, and its sensitivity to probability estimation errors.
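As a minimal sketch of two of the measures named in the abstract: the χ² statistic scores a term against a category from a 2×2 contingency table of document counts, and DF thresholding simply keeps terms that appear in at least a minimum number of documents. The function names, cell labels, and the `min_df` parameter below are illustrative, not taken from the paper.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic for a term/category contingency table.

    a: docs in the category that contain the term
    b: docs outside the category that contain the term
    c: docs in the category that lack the term
    d: docs outside the category that lack the term
    """
    n = a + b + c + d
    num = n * (a * d - c * b) ** 2
    den = (a + c) * (b + d) * (a + b) * (c + d)
    # A zero denominator means the term or category is degenerate
    # (e.g. the term appears in every document); score it 0.
    return num / den if den else 0.0


def df_threshold(docs, min_df=2):
    """Return the set of terms whose document frequency is >= min_df.

    docs is a list of tokenized documents (lists of terms).
    """
    df = {}
    for doc in docs:
        for term in set(doc):  # count each term once per document
            df[term] = df.get(term, 0) + 1
    return {t for t, f in df.items() if f >= min_df}
```

For example, `chi_square(8, 2, 2, 8)` yields a high score because the term and category co-occur far more than chance predicts, while a table with identical rows (term independent of category) scores 0.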
