Intelligent Fusion of Evidence from Multiple Sources for Text Classification

B. Zhang. Doctor of Philosophy in Computer Science and Applications, Virginia Polytechnic Institute and State University, USA (September 2006)

Abstract

Automatic text classification using current approaches is known to perform poorly when documents are noisy or when only a limited amount of textual content is available. Yet many users need access to such documents, which are found in large numbers in digital libraries and on the WWW. If documents are not classified, they are difficult to find when browsing. Further, search precision suffers when categories cannot be checked, since many documents may be retrieved that fail to meet category constraints. In this work, we study how different types of evidence from multiple sources can be intelligently fused to improve classification of text documents into predefined categories. We present a classification framework based on an inductive learning method -- Genetic Programming (GP) -- to fuse evidence from multiple sources. We show that good classification is possible with documents that are noisy or that have small amounts of text (e.g., short metadata records) -- if multiple sources of evidence are fused in an intelligent way. The framework is validated through experiments performed on documents in two testbeds. One is the ACM Digital Library (using a subset available in connection with CITIDEL, part of NSF's National Science Digital Library). The other is Web data, in particular the portion associated with the Cadê Web directory. Our studies have shown that improvement can be achieved relative to other machine learning approaches if genetic programming methods are combined with classifiers such as kNN. Extensive analysis was performed to study the results generated through the GP-based fusion approach and to understand key factors that promote good classification.

Links and resources

BibTeX key: oai:VTETD:etd-07032006-152103
entry type: phdthesis
address: USA
year: 2006
month: September~06
school: Virginia Polytechnic Institute and State University
type: Doctor of Philosophy in Computer Science and Applications
bibsource: OAI-PMH server at scholar.lib.vt.edu
rights: unrestricted; I hereby certify that, if appropriate, I have obtained and attached hereto a written permission statement from the owner(s) of each third party copyrighted matter to be included in my thesis, dissertation, or project report, allowing distribution as specified below. I certify that the version I submitted is the same as that approved by my advisory committee. I hereby grant to Virginia Tech or its agents the non-exclusive license to archive and make accessible, under the conditions specified below, my thesis, dissertation, or project report in whole or in part in all forms of media, now or hereafter known. I retain all other ownership rights to the copyright of the thesis, dissertation or project report. I also retain the right to use in future works (such as articles or books) all or part of this thesis, dissertation, or project report.
size: 146 pages
contributor: Dan Spitzner and Chang-Tien Lu and Edward A. Fox and Weiguo Fan and Pável Calado
oai: oai:VTETD:etd-07032006-152103
language: en
notes: URN etd-07032006-152103
url: http://scholar.lib.vt.edu/theses/available/etd-07032006-152103/

Cite this publication

%0 Thesis
%1 oai:VTETD:etd-07032006-152103
%A Zhang, Baoping
%C USA
%D 2006
%K algorithms, genetic programming
%T Intelligent Fusion of Evidence from Multiple Sources for Text Classification
%U http://scholar.lib.vt.edu/theses/available/etd-07032006-152103/
%X Automatic text classification using current approaches is known to perform poorly when documents are noisy or when limited amounts of textual content is available. Yet, many users need access to such documents, which are found in large numbers in digital libraries and in the WWW. If documents are not classified, they are difficult to find when browsing. Further, searching precision suffers when categories cannot be checked, since many documents may be retrieved that would fail to meet category constraints. In this work, we study how different types of evidence from multiple sources can be intelligently fused to improve classification of text documents into predefined categories. We present a classification framework based on an inductive learning method -- Genetic Programming (GP) -- to fuse evidence from multiple sources. We show that good classification is possible with documents which are noisy or which have small amounts of text (e.g., short metadata records) -- if multiple sources of evidence are fused in an intelligent way. The framework is validated through experiments performed on documents in two testbeds. One is the ACM Digital Library (using a subset available in connection with CITIDEL, part of NSF's National Science Digital Library). The other is Web data, in particular that portion associated with the Cadê Web directory. Our studies have shown that improvement can be achieved relative to other machine learning approaches if genetic programming methods are combined with classifiers such as kNN. Extensive analysis was performed to study the results generated through the GP-based fusion approach and to understand key factors that promote good classification.

@phdthesis{oai:VTETD:etd-07032006-152103,
  abstract = {Automatic text classification using current approaches is known to perform poorly when documents are noisy or when limited amounts of textual content is available. Yet, many users need access to such documents, which are found in large numbers in digital libraries and in the WWW. If documents are not classified, they are difficult to find when browsing. Further, searching precision suffers when categories cannot be checked, since many documents may be retrieved that would fail to meet category constraints. In this work, we study how different types of evidence from multiple sources can be intelligently fused to improve classification of text documents into predefined categories. We present a classification framework based on an inductive learning method -- Genetic Programming (GP) -- to fuse evidence from multiple sources. We show that good classification is possible with documents which are noisy or which have small amounts of text (e.g., short metadata records) -- if multiple sources of evidence are fused in an intelligent way. The framework is validated through experiments performed on documents in two testbeds. One is the ACM Digital Library (using a subset available in connection with CITIDEL, part of NSF's National Science Digital Library). The other is Web data, in particular that portion associated with the Cad{\^e} Web directory. Our studies have shown that improvement can be achieved relative to other machine learning approaches if genetic programming methods are combined with classifiers such as kNN. Extensive analysis was performed to study the results generated through the GP-based fusion approach and to understand key factors that promote good classification.},
  added-at = {2008-06-19T17:35:00.000+0200},
  address = {USA},
  author = {Zhang, Baoping},
  bibsource = {OAI-PMH server at scholar.lib.vt.edu},
  biburl = {https://www.bibsonomy.org/bibtex/24b967e54357b8713749a76a0056a75cb/brazovayeye},
  contributor = {Dan Spitzner and Chang-Tien Lu and Edward A. Fox and Weiguo Fan and P{\'a}vel Calado},
  interhash = {7becab678fe051f4d439c10fca53d7b2},
  intrahash = {4b967e54357b8713749a76a0056a75cb},
  keywords = {algorithms, genetic programming},
  language = {en},
  month = {September~06},
  notes = {URN etd-07032006-152103},
  oai = {oai:VTETD:etd-07032006-152103},
  rights = {unrestricted; I hereby certify that, if appropriate, I have obtained and attached hereto a written permission statement from the owner(s) of each third party copyrighted matter to be included in my thesis, dissertation, or project report, allowing distribution as specified below. I certify that the version I submitted is the same as that approved by my advisory committee. I hereby grant to Virginia Tech or its agents the non-exclusive license to archive and make accessible, under the conditions specified below, my thesis, dissertation, or project report in whole or in part in all forms of media, now or hereafter known. I retain all other ownership rights to the copyright of the thesis, dissertation or project report. I also retain the right to use in future works (such as articles or books) all or part of this thesis, dissertation, or project report.},
  school = {Virginia Polytechnic Institute and State University},
  size = {146 pages},
  timestamp = {2008-06-19T17:55:13.000+0200},
  title = {Intelligent Fusion of Evidence from Multiple Sources for Text Classification},
  type = {Doctor of Philosophy in Computer Science and Applications},
  url = {http://scholar.lib.vt.edu/theses/available/etd-07032006-152103/},
  year = 2006
}
