Abstract
Recent years have witnessed unprecedented growth in online social media, which has established short texts as a prevalent format for information on the Internet. Owing to their inherent sparsity, however, topic modeling for short texts remains a critical yet much-watched challenge in both academia and industry. Substantial research effort has been devoted to building different types of probabilistic topic models for short texts, among which self-aggregation methods that require no auxiliary information have emerged as a promising way to provide informative cross-text word co-occurrences. However, models along this line remain scarce, and the representative one, the Self-Aggregation Topic Model (SATM), is prone to overfitting and computationally expensive. In light of this, we propose a novel probabilistic model called the Pseudo-document-based Topic Model (PTM) for short text topic modeling. PTM introduces the concept of a pseudo document to implicitly aggregate short texts against data sparsity. By modeling the topic distributions of latent pseudo documents rather than of individual short texts, PTM is expected to achieve excellent performance in both accuracy and efficiency. We also propose a Sparsity-enhanced PTM (SPTM for short), which applies a Spike and Slab prior to eliminate undesired correlations between pseudo documents and latent topics. Extensive experiments on various real-world data sets against state-of-the-art baselines demonstrate the high quality of the topics learned by PTM and its robustness with reduced training samples. Interestingly, the results also show that i) SPTM gains a clear edge over PTM when the number of pseudo documents is relatively small, and ii) the constraint that each short text belongs to exactly one pseudo document is critically important to the success of PTM. Finally, we conduct an in-depth semantic analysis to directly unveil how pseudo documents discover cross-text word co-occurrences for topic modeling.
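As a concrete illustration, the pseudo-document mechanism described above can be sketched as an LDA-style generative process. This is our own minimal reading of the abstract, not the paper's formal specification: the symbols (psi, alpha, beta, theta, phi) and the exact choice of priors are illustrative assumptions.

% Sketch (assumed, LDA-style notation) of a pseudo-document generative process.
% Topic mixtures live at the pseudo-document level; each short text is
% assigned to exactly one pseudo document, as the abstract emphasizes.
\begin{align*}
\theta_l &\sim \mathrm{Dirichlet}(\alpha) && \text{topic mixture of pseudo document } l = 1,\dots,P \\
\phi_k   &\sim \mathrm{Dirichlet}(\beta)  && \text{word distribution of topic } k = 1,\dots,K \\
l_d      &\sim \mathrm{Multinomial}(\psi) && \text{short text } d \text{ joins exactly one pseudo document} \\
z_{d,n}  &\sim \mathrm{Multinomial}(\theta_{l_d}) && \text{topic of the } n\text{-th word of } d \\
w_{d,n}  &\sim \mathrm{Multinomial}(\phi_{z_{d,n}}) && \text{observed word}
\end{align*}

Under this reading, short texts assigned to the same pseudo document share one topic mixture, which is how cross-text word co-occurrences become available despite the sparsity of each individual text.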