This data set consists of 20000 messages taken from 20 Usenet newsgroups.
description of the data
20_newsgroups.tar.gz (17.3M; 61.6M uncompressed)
mini_newsgroups.tar.gz (1.9M; 6.2M uncompressed): a subset composed of 100 articles from each newsgroup.
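A quick way to sanity-check a downloaded archive is to count the messages in each newsgroup. The sketch below is only illustrative: it assumes, without confirmation from this page, that each archive unpacks to paths of the form `<root>/<newsgroup>/<message file>`, and the function name `count_messages_per_group` is a hypothetical helper, not part of any distributed tooling.

```python
import sys
import tarfile
from collections import Counter
from pathlib import PurePosixPath


def count_messages_per_group(archive_path):
    """Count regular files per parent directory in a gzip-compressed tar archive.

    Assumption (not stated in the dataset page): member paths look like
    <root>/<newsgroup>/<message file>, so the parent directory name is
    taken to be the newsgroup.
    """
    counts = Counter()
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive:
            # Skip directories, links, and other non-file entries.
            if not member.isfile():
                continue
            parts = PurePosixPath(member.name).parts
            if len(parts) >= 2:
                counts[parts[-2]] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python count_groups.py mini_newsgroups.tar.gz
    for group, n in sorted(count_messages_per_group(sys.argv[1]).items()):
        print(f"{group}\t{n}")
```

If the layout assumption holds, the mini archive should report 100 articles for each of the 20 newsgroups, matching the description above.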
D. Nguyen, N. Smith, and C. Rosé. Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, pages 115-123. Stroudsburg, PA, USA, Association for Computational Linguistics (2011).
X. Zhang and Y. LeCun. arXiv:1502.01710 (2015). Comment: This technical report is superseded by a paper entitled "Character-level Convolutional Networks for Text Classification", arXiv:1509.01626, which has considerably more experimental results and a rewritten introduction.