Abstract

Online social and news media generate rich and timely information about real-world events of all kinds. However, the huge amount of data available, along with the breadth of the user base, requires substantial information filtering to drill down to relevant topics and events. Trending topic detection is therefore a fundamental building block for monitoring and summarizing information originating from social sources. A wide variety of methods and variables exist, and the choice among them greatly affects the quality of results. We compare six topic detection methods on three Twitter datasets related to major events, which differ in their time scale and topic churn rate. We observe that the nature of the event considered, the volume of activity over time, the sampling procedure, and the pre-processing of the data all greatly affect the quality of detected topics, which also depends on the type of detection method used. We find that standard natural language processing techniques can perform well for social streams on very focused topics, but novel techniques designed to mine the temporal distribution of concepts are needed to handle more heterogeneous streams containing multiple stories evolving in parallel. One of the novel topic detection methods we propose, based on $n$-gram co-occurrence and $df$-$idf_t$ topic ranking, consistently achieves the best performance across all these conditions, making it more reliable than other state-of-the-art techniques.
