Abstract
Recent years have witnessed unprecedented growth in online social media, which has established short texts as a prevalent format for information on the Internet. Owing to their inherent sparsity, however, topic modeling for short texts remains a critical yet much-watched challenge in both academia and industry. Substantial research effort has been devoted to building different types of probabilistic topic models for short texts, among which self-aggregation methods that require no auxiliary information have emerged as a promising way to provide informative cross-text word co-occurrences. However, models along this line remain scarce, and the representative one, the Self-Aggregation Topic Model (SATM), is prone to overfitting and computationally expensive. In light of this, we propose a novel probabilistic model called the Pseudo-document-based Topic Model (PTM) for short text topic modeling. PTM introduces the concept of a pseudo document to implicitly aggregate short texts against data sparsity. By modeling the topic distributions of latent pseudo documents rather than of individual short texts, PTM is expected to achieve excellent performance in both accuracy and efficiency. We also propose a Sparsity-enhanced PTM (SPTM for short), which applies a Spike and Slab prior to eliminate undesired correlations between pseudo documents and latent topics. Extensive experiments on various real-world data sets against state-of-the-art baselines demonstrate the high quality of the topics learned by PTM and its robustness with reduced training samples. Interestingly, the results also show that i) SPTM gains a clear edge over PTM when the number of pseudo documents is relatively small, and ii) the constraint that each short text belongs to exactly one pseudo document is critically important to the success of PTM. Finally, we conduct an in-depth semantic analysis to directly unveil how pseudo documents discover cross-text word co-occurrences for topic modeling.
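As a concrete illustration, the pseudo-document mechanism described above can be sketched as an LDA-style generative process. This is our own minimal reading of the abstract, not the paper's formal specification: the symbols (psi, alpha, beta, theta, phi) and the exact choice of priors are illustrative assumptions.

% Sketch (assumed, LDA-style notation) of a pseudo-document generative process.
% Topic mixtures live at the pseudo-document level; each short text is
% assigned to exactly one pseudo document, as the abstract emphasizes.
\begin{align*}
\theta_l &\sim \mathrm{Dirichlet}(\alpha) && \text{topic mixture of pseudo document } l = 1,\dots,P \\
\phi_k   &\sim \mathrm{Dirichlet}(\beta)  && \text{word distribution of topic } k = 1,\dots,K \\
l_d      &\sim \mathrm{Multinomial}(\psi) && \text{short text } d \text{ joins exactly one pseudo document} \\
z_{d,n}  &\sim \mathrm{Multinomial}(\theta_{l_d}) && \text{topic of the } n\text{-th word of } d \\
w_{d,n}  &\sim \mathrm{Multinomial}(\phi_{z_{d,n}}) && \text{observed word}
\end{align*}

Under this reading, short texts assigned to the same pseudo document share one topic mixture, which is how cross-text word co-occurrences become available despite the sparsity of each individual text.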