20 Newsgroups
Abstract
This data set consists of 20000 messages taken from 20 Usenet newsgroups.
Information files:
description of the data
Data files:
20_newsgroups.tar.gz (17.3M; 61.6M uncompressed)
mini_newsgroups.tar.gz A subset composed of 100 articles from each newsgroup. (1.9M; 6.2M uncompressed)
The nonsense which follows is a Markov Chain based upon patterns in some pieces of English text. Word-Unit Nonsense uses patterns about words that tend to follow one another. Character-Unit Nonsense uses letters.
Beautiful visualizations of how language differs among document types. - GitHub - JasonKessler/scattertext: Beautiful visualizations of how language differs among document types.