This data set consists of 20,000 messages taken from 20 Usenet newsgroups.
description of the data
20_newsgroups.tar.gz (17.3M; 61.6M uncompressed)
mini_newsgroups.tar.gz (1.9M; 6.2M uncompressed): a subset composed of 100 articles from each newsgroup.
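
As a quick sanity check after downloading, the archives can be inspected without unpacking them. The following is a minimal Python sketch, not part of the original distribution. The function name count_articles is hypothetical, and the code assumes each archive is laid out as <top-level directory>/<newsgroup>/<article file>, which this page does not state.

```python
import tarfile
from collections import Counter


def count_articles(archive_path):
    """Count regular files per newsgroup directory inside a .tar.gz archive.

    Assumption (not confirmed by the data set description): members are
    stored as <top-level dir>/<newsgroup>/<article file>.
    """
    counts = Counter()
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar:
            # Skip directories, links, and other non-file entries.
            if not member.isfile():
                continue
            parts = member.name.strip("/").split("/")
            # Need at least top-level dir, newsgroup dir, and file name.
            if len(parts) >= 3:
                # The parent directory of the file is taken as the newsgroup.
                counts[parts[-2]] += 1
    return counts
```

If the layout assumption holds, count_articles("mini_newsgroups.tar.gz") should report 100 articles for each of the 20 newsgroups, matching the description above.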
Congnan Luo, Yanjun Li, and Soon M. Chung. Data & Knowledge Engineering 68(11):1271-1288 (2009). Including Special Section: Conference on Privacy in Statistical Databases (PSD 2008) - Six selected and extended papers on Database Privacy.