Semi-supervised Text Classification Using Partitioned EM

G. Cong, W. Lee, H. Wu, and B. Liu. 11th Int. Conference on Database Systems for Advanced Applications (DASFAA), pages 482-493. (2004)

Abstract

Text classification using a small labeled set and a large set of unlabeled data is seen as a promising technique to reduce the labor-intensive and time-consuming effort of labeling training data in order to build accurate classifiers, since unlabeled data is easy to get from the Web. In [16] it has been demonstrated that an unlabeled set improves classification accuracy significantly with only a small labeled training set. However, the Bayesian method used in [16] assumes that text documents are generated from a mixture model and that there is a one-to-one correspondence between the mixture components and the classes. This may not be the case in many real-life applications, where a class may cover documents from many different topics, which violates the one-to-one correspondence assumption. In such cases, the resulting classifiers can be quite poor. In this paper, we propose a clustering-based partitioning technique to solve the problem. This method first partitions the training documents in a hierarchical fashion using hard clustering. After running the expectation maximization (EM) algorithm in each partition, it prunes the tree using the labeled data. The remaining tree nodes or partitions are likely to satisfy the one-to-one correspondence condition. Extensive experiments demonstrate that this method is able to achieve a dramatic gain in classification performance.
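
The mixture-model assumption mentioned in the abstract can be illustrated with a minimal sketch of EM over a multinomial naive Bayes model in which each class is tied to exactly one mixture component. This is not the authors' implementation and does not include the partitioning or pruning steps; the function name `nb_em`, the Laplace smoothing parameter `alpha`, and the toy word-count data are assumptions made for illustration only.

```python
import numpy as np

def nb_em(X_lab, y_lab, X_unl, n_classes, n_iter=10, alpha=1.0):
    """Semi-supervised multinomial naive Bayes trained with EM (illustrative).

    X_lab, X_unl: word-count matrices (documents x vocabulary).
    y_lab: integer class labels for the rows of X_lab.
    Returns (log_prior, log_word_prob, posteriors for X_unl).
    """
    # Labeled documents keep fixed one-hot responsibilities throughout.
    R_lab = np.eye(n_classes)[y_lab]
    X = np.vstack([X_lab, X_unl])
    # First M-step uses labeled data only: unlabeled rows start with zero weight.
    R = np.vstack([R_lab, np.zeros((X_unl.shape[0], n_classes))])
    for _ in range(n_iter + 1):
        # M-step: smoothed class priors and per-class word distributions.
        prior = (R.sum(axis=0) + alpha) / (R.sum() + alpha * n_classes)
        counts = R.T @ X + alpha  # shape: classes x vocabulary
        word_prob = counts / counts.sum(axis=1, keepdims=True)
        log_prior, log_word = np.log(prior), np.log(word_prob)
        # E-step: posterior class probabilities for the unlabeled documents.
        log_post = X_unl @ log_word.T + log_prior
        log_post -= log_post.max(axis=1, keepdims=True)  # numerical stability
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)
        R = np.vstack([R_lab, post])
    return log_prior, log_word, post

# Toy data (hypothetical): 4-word vocabulary, 2 classes, 2 labeled documents.
X_lab = np.array([[3, 0, 1, 0], [0, 3, 0, 1]])
y_lab = np.array([0, 1])
X_unl = np.array([[2, 0, 2, 0], [0, 2, 0, 2], [4, 1, 0, 0]])
log_prior, log_word, post = nb_em(X_lab, y_lab, X_unl, n_classes=2)
print(post.argmax(axis=1))
```

Because every class has a single word distribution here, a class that spans several unrelated topics is modeled poorly, which is the situation the abstract says the partitioning technique is meant to address.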

Description

Semi-supervised Text Classification Using Partitioned EM

Links and resources

BibTeX key: Cong2004
entry type: inproceedings
booktitle: 11th Int. Conference on Database Systems for Advanced Applications (DASFAA)
year: 2004
pages: 482-493
url: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.108.2811


Cite this publication

@inproceedings{Cong2004,
  abstract = {Text classification using a small labeled set and a large set of unlabeled data is seen as a promising technique to reduce the labor-intensive and time-consuming effort of labeling training data in order to build accurate classifiers, since unlabeled data is easy to get from the Web. In [16] it has been demonstrated that an unlabeled set improves classification accuracy significantly with only a small labeled training set. However, the Bayesian method used in [16] assumes that text documents are generated from a mixture model and that there is a one-to-one correspondence between the mixture components and the classes. This may not be the case in many real-life applications, where a class may cover documents from many different topics, which violates the one-to-one correspondence assumption. In such cases, the resulting classifiers can be quite poor. In this paper, we propose a clustering-based partitioning technique to solve the problem. This method first partitions the training documents in a hierarchical fashion using hard clustering. After running the expectation maximization (EM) algorithm in each partition, it prunes the tree using the labeled data. The remaining tree nodes or partitions are likely to satisfy the one-to-one correspondence condition. Extensive experiments demonstrate that this method is able to achieve a dramatic gain in classification performance.},
  added-at = {2009-12-07T10:38:14.000+0100},
  author = {Cong, Gao and Lee, Wee Sun and Wu, Haoran and Liu, Bing},
  biburl = {https://www.bibsonomy.org/bibtex/26224b4f55ae5d9a63ee1e9855b0f33d3/georg.oettl},
  booktitle = {11th Int. Conference on Database Systems for Advanced Applications (DASFAA)},
  description = {Semi-supervised Text Classification Using Partitioned EM},
  interhash = {ec20d14c439b49d4f734ff0306d36676},
  intrahash = {6224b4f55ae5d9a63ee1e9855b0f33d3},
  keywords = {classification maths nlp},
  pages = {482--493},
  timestamp = {2009-12-07T10:38:14.000+0100},
  title = {Semi-supervised Text Classification Using Partitioned EM},
  url = {http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.108.2811},
  year = 2004
}
