Intelligent Fusion of Evidence from Multiple Sources
for Text Classification
B. Zhang. Virginia Polytechnic Institute and State University, USA, Doctor of Philosophy in Computer Science and
Applications, (September 2006)
Abstract
Automatic text classification using current approaches
is known to perform poorly when documents are noisy or
when limited amounts of textual content are available.
Yet, many users need access to such documents, which
are found in large numbers in digital libraries and in
the WWW. If documents are not classified, they are
difficult to find when browsing. Further, searching
precision suffers when categories cannot be checked,
since many documents may be retrieved that would fail
to meet category constraints. In this work, we study
how different types of evidence from multiple sources
can be intelligently fused to improve classification of
text documents into predefined categories. We present a
classification framework based on an inductive learning
method -- Genetic Programming (GP) -- to fuse evidence
from multiple sources. We show that good classification
is possible with documents which are noisy or which
have small amounts of text (e.g., short metadata
records) -- if multiple sources of evidence are fused
in an intelligent way. The framework is validated
through experiments performed on documents in two
testbeds. One is the ACM Digital Library (using a
subset available in connection with CITIDEL, part of
NSF's National Science Digital Library). The other is
Web data, in particular that portion associated with
the Cadê Web directory. Our studies have shown that
improvement can be achieved relative to other machine
learning approaches if genetic programming methods are
combined with classifiers such as kNN. Extensive
analysis was performed to study the results generated
through the GP-based fusion approach and to understand
key factors that promote good classification.
school
Virginia Polytechnic Institute and State University
type
Doctor of Philosophy in Computer Science and
Applications
bibsource
OAI-PMH server at scholar.lib.vt.edu
rights
unrestricted; I hereby certify that, if appropriate, I
have obtained and attached hereto a written permission
statement from the owner(s) of each third party
copyrighted matter to be included in my thesis,
dissertation, or project report, allowing distribution
as specified below. I certify that the version I
submitted is the same as that approved by my advisory
committee. I hereby grant to Virginia Tech or its
agents the non-exclusive license to archive and make
accessible, under the conditions specified below, my
thesis, dissertation, or project report in whole or in
part in all forms of media, now or hereafter known. I
retain all other ownership rights to the copyright of
the thesis, dissertation or project report. I also
retain the right to use in future works (such as
articles or books) all or part of this thesis,
dissertation, or project report.
size
146 pages
contributor
Dan Spitzner and Chang-Tien Lu and Edward A. Fox and
Weiguo Fan and Pável Calado
%0 Thesis
%1 oai:VTETD:etd-07032006-152103
%A Zhang, Baoping
%C USA
%D 2006
%K algorithms, genetic programming
%T Intelligent Fusion of Evidence from Multiple Sources
for Text Classification
%U http://scholar.lib.vt.edu/theses/available/etd-07032006-152103/
%X Automatic text classification using current approaches
is known to perform poorly when documents are noisy or
when limited amounts of textual content are available.
Yet, many users need access to such documents, which
are found in large numbers in digital libraries and in
the WWW. If documents are not classified, they are
difficult to find when browsing. Further, searching
precision suffers when categories cannot be checked,
since many documents may be retrieved that would fail
to meet category constraints. In this work, we study
how different types of evidence from multiple sources
can be intelligently fused to improve classification of
text documents into predefined categories. We present a
classification framework based on an inductive learning
method -- Genetic Programming (GP) -- to fuse evidence
from multiple sources. We show that good classification
is possible with documents which are noisy or which
have small amounts of text (e.g., short metadata
records) -- if multiple sources of evidence are fused
in an intelligent way. The framework is validated
through experiments performed on documents in two
testbeds. One is the ACM Digital Library (using a
subset available in connection with CITIDEL, part of
NSF's National Science Digital Library). The other is
Web data, in particular that portion associated with
the Cadê Web directory. Our studies have shown that
improvement can be achieved relative to other machine
learning approaches if genetic programming methods are
combined with classifiers such as kNN. Extensive
analysis was performed to study the results generated
through the GP-based fusion approach and to understand
key factors that promote good classification.
@phdthesis{oai:VTETD:etd-07032006-152103,
abstract = {Automatic text classification using current approaches
is known to perform poorly when documents are noisy or
when limited amounts of textual content are available.
Yet, many users need access to such documents, which
are found in large numbers in digital libraries and in
the WWW. If documents are not classified, they are
difficult to find when browsing. Further, searching
precision suffers when categories cannot be checked,
since many documents may be retrieved that would fail
to meet category constraints. In this work, we study
how different types of evidence from multiple sources
can be intelligently fused to improve classification of
text documents into predefined categories. We present a
classification framework based on an inductive learning
method -- Genetic Programming (GP) -- to fuse evidence
from multiple sources. We show that good classification
is possible with documents which are noisy or which
have small amounts of text (e.g., short metadata
records) -- if multiple sources of evidence are fused
in an intelligent way. The framework is validated
through experiments performed on documents in two
testbeds. One is the ACM Digital Library (using a
subset available in connection with CITIDEL, part of
NSF's National Science Digital Library). The other is
Web data, in particular that portion associated with
the Cad{\^e} Web directory. Our studies have shown that
improvement can be achieved relative to other machine
learning approaches if genetic programming methods are
combined with classifiers such as kNN. Extensive
analysis was performed to study the results generated
through the GP-based fusion approach and to understand
key factors that promote good classification.},
added-at = {2008-06-19T17:35:00.000+0200},
address = {USA},
author = {Zhang, Baoping},
bibsource = {OAI-PMH server at scholar.lib.vt.edu},
biburl = {https://www.bibsonomy.org/bibtex/24b967e54357b8713749a76a0056a75cb/brazovayeye},
contributor = {Dan Spitzner and Chang-Tien Lu and Edward A. Fox and
Weiguo Fan and P{\'a}vel Calado},
interhash = {7becab678fe051f4d439c10fca53d7b2},
intrahash = {4b967e54357b8713749a76a0056a75cb},
keywords = {algorithms, genetic programming},
language = {en},
month = {September~06},
notes = {URN etd-07032006-152103},
oai = {oai:VTETD:etd-07032006-152103},
rights = {unrestricted; I hereby certify that, if appropriate, I
have obtained and attached hereto a written permission
statement from the owner(s) of each third party
copyrighted matter to be included in my thesis,
dissertation, or project report, allowing distribution
as specified below. I certify that the version I
submitted is the same as that approved by my advisory
committee. I hereby grant to Virginia Tech or its
agents the non-exclusive license to archive and make
accessible, under the conditions specified below, my
thesis, dissertation, or project report in whole or in
part in all forms of media, now or hereafter known. I
retain all other ownership rights to the copyright of
the thesis, dissertation or project report. I also
retain the right to use in future works (such as
articles or books) all or part of this thesis,
dissertation, or project report.},
school = {Virginia Polytechnic Institute and State University},
size = {146 pages},
timestamp = {2008-06-19T17:55:13.000+0200},
title = {Intelligent Fusion of Evidence from Multiple Sources
for Text Classification},
type = {Doctor of Philosophy in Computer Science and
Applications},
url = {http://scholar.lib.vt.edu/theses/available/etd-07032006-152103/},
year = 2006
}