Minimum entropy clustering and applications to gene expression analysis.
H. Li, K. Zhang, and T. Jiang. Proc. IEEE Comput. Syst. Bioinform. Conf., (January 2004)
Abstract
Clustering is a common methodology for analyzing gene expression data. In this paper, we present a new clustering algorithm from an information-theoretic point of view. First, we propose the minimum entropy (measured on a posteriori probabilities) criterion, which is the conditional entropy of clusters given the observations. Fano's inequality indicates that it could be a good criterion for clustering. We generalize the criterion by replacing Shannon's entropy with Havrda-Charvat's structural alpha-entropy. Interestingly, the minimum entropy criterion based on structural alpha-entropy is equal to the probability of error of the nearest neighbor method when alpha = 2. This is further evidence that the proposed criterion is good for clustering. With a non-parametric approach for estimating a posteriori probabilities, an efficient iterative algorithm is then established to minimize the entropy. The experimental results show that the clustering algorithm performs significantly better than k-means/medians, hierarchical clustering, SOM, and EM in terms of adjusted Rand index. In particular, our algorithm performs very well even when the correct number of clusters is unknown. In addition, most clustering algorithms produce poor partitions in the presence of outliers, while our method can correctly reveal the structure of data and effectively identify outliers simultaneously.
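The criterion described in the abstract can be illustrated with a minimal sketch. It assumes the standard Havrda-Charvat definition, H_alpha(p) = (sum_i p_i^alpha - 1) / (2^(1 - alpha) - 1), and assumes the a posteriori probabilities are already given. The function names are hypothetical, and the paper's non-parametric estimator and iterative minimization are not reproduced here. Under this normalization, alpha = 2 gives 2 * (1 - sum_i p_i^2), which matches the asymptotic nearest-neighbor error 1 - sum_i p_i^2 up to the constant factor 2.

```python
import math


def structural_alpha_entropy(probs, alpha):
    """Havrda-Charvat structural alpha-entropy of a discrete distribution.

    Assumes the standard normalization (2**(1 - alpha) - 1); the paper may
    use a different constant.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    if math.isclose(alpha, 1.0):
        # The limit alpha -> 1 recovers Shannon entropy in bits.
        return -sum(p * math.log2(p) for p in probs if p > 0)
    norm = 2 ** (1 - alpha) - 1
    return (sum(p ** alpha for p in probs) - 1) / norm


def minimum_entropy_criterion(posteriors, alpha=2.0):
    """Average entropy of the cluster posteriors over all observations.

    `posteriors` is a list of per-observation distributions over clusters
    (hypothetical input; how they are estimated is outside this sketch).
    """
    return sum(structural_alpha_entropy(p, alpha) for p in posteriors) / len(posteriors)


# Confident assignments give a lower criterion value than ambiguous ones.
confident = [[0.9, 0.1], [0.05, 0.95]]
ambiguous = [[0.5, 0.5], [0.6, 0.4]]
print(minimum_entropy_criterion(confident) < minimum_entropy_criterion(ambiguous))
```

A clustering that lowers this value assigns each observation more decisively to one cluster, which is the sense in which the criterion is minimized.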
%0 Journal Article
%1 Li2004a
%A Li, Haifeng
%A Zhang, Keshu
%A Jiang, Tao
%D 2004
%J Proc. IEEE Comput. Syst. Bioinform. Conf.
%K Algorithms Automated Automated:_methods Cluster_Analysis Entropy Gene_Expression Gene_Expression:_genetics Gene_Expression_Profiling Gene_Expression_Profiling:_methods Genetic Models Oligonucleotide_Array_Sequence_Analysis Oligonucleotide_Array_Sequence_Analysis:_methods Pattern_Recognition phd
%P 142--151
%T Minimum entropy clustering and applications to gene expression analysis.
%U http://www.ncbi.nlm.nih.gov/pubmed/16448008
%X Clustering is a common methodology for analyzing gene expression data. In this paper, we present a new clustering algorithm from an information-theoretic point of view. First, we propose the minimum entropy (measured on a posteriori probabilities) criterion, which is the conditional entropy of clusters given the observations. Fano's inequality indicates that it could be a good criterion for clustering. We generalize the criterion by replacing Shannon's entropy with Havrda-Charvat's structural alpha-entropy. Interestingly, the minimum entropy criterion based on structural alpha-entropy is equal to the probability of error of the nearest neighbor method when alpha = 2. This is further evidence that the proposed criterion is good for clustering. With a non-parametric approach for estimating a posteriori probabilities, an efficient iterative algorithm is then established to minimize the entropy. The experimental results show that the clustering algorithm performs significantly better than k-means/medians, hierarchical clustering, SOM, and EM in terms of adjusted Rand index. In particular, our algorithm performs very well even when the correct number of clusters is unknown. In addition, most clustering algorithms produce poor partitions in the presence of outliers, while our method can correctly reveal the structure of data and effectively identify outliers simultaneously.
@article{Li2004a,
abstract = {Clustering is a common methodology for analyzing gene expression data. In this paper, we present a new clustering algorithm from an information-theoretic point of view. First, we propose the minimum entropy (measured on a posteriori probabilities) criterion, which is the conditional entropy of clusters given the observations. Fano's inequality indicates that it could be a good criterion for clustering. We generalize the criterion by replacing Shannon's entropy with Havrda-Charvat's structural alpha-entropy. Interestingly, the minimum entropy criterion based on structural alpha-entropy is equal to the probability of error of the nearest neighbor method when alpha = 2. This is further evidence that the proposed criterion is good for clustering. With a non-parametric approach for estimating a posteriori probabilities, an efficient iterative algorithm is then established to minimize the entropy. The experimental results show that the clustering algorithm performs significantly better than k-means/medians, hierarchical clustering, SOM, and EM in terms of adjusted Rand index. In particular, our algorithm performs very well even when the correct number of clusters is unknown. In addition, most clustering algorithms produce poor partitions in the presence of outliers, while our method can correctly reveal the structure of data and effectively identify outliers simultaneously.},
added-at = {2013-12-17T10:10:31.000+0100},
author = {Li, Haifeng and Zhang, Keshu and Jiang, Tao},
biburl = {https://www.bibsonomy.org/bibtex/247af575ee13a624775290901763198d6/jullybobble},
interhash = {c358752d1dad6bba794ceafd0b7f9ec5},
intrahash = {47af575ee13a624775290901763198d6},
issn = {1551-7497},
journal = {Proc. IEEE Comput. Syst. Bioinform. Conf.},
keywords = {Algorithms Automated Automated:_methods Cluster_Analysis Entropy Gene_Expression Gene_Expression:_genetics Gene_Expression_Profiling Gene_Expression_Profiling:_methods Genetic Models Oligonucleotide_Array_Sequence_Analysis Oligonucleotide_Array_Sequence_Analysis:_methods Pattern_Recognition phd},
month = jan,
pages = {142--151},
pmid = {16448008},
timestamp = {2014-07-27T15:43:19.000+0200},
title = {{Minimum entropy clustering and applications to gene expression analysis}},
url = {http://www.ncbi.nlm.nih.gov/pubmed/16448008},
year = 2004
}