20 Newsgroups
Abstract
This data set consists of 20,000 messages taken from 20 Usenet newsgroups.
Information files:
description of the data
Data files:
20_newsgroups.tar.gz (17.3M; 61.6M uncompressed)
mini_newsgroups.tar.gz (1.9M; 6.2M uncompressed): a subset composed of 100 articles from each newsgroup.
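As a rough illustration of how the extracted data might be inspected, the sketch below counts messages per newsgroup. It assumes the archive unpacks into one subdirectory per newsgroup with one file per message; that layout, the function names `count_articles` and `read_article`, and the choice of latin-1 decoding are assumptions for illustration, not something stated in the description above.

```python
import os
from collections import Counter


def count_articles(root):
    """Count message files in each immediate subdirectory of `root`.

    Assumes (hypothetically) one subdirectory per newsgroup and
    one file per message under it.
    """
    counts = Counter()
    for group in sorted(os.listdir(root)):
        group_dir = os.path.join(root, group)
        if os.path.isdir(group_dir):
            counts[group] = sum(
                1
                for name in os.listdir(group_dir)
                if os.path.isfile(os.path.join(group_dir, name))
            )
    return counts


def read_article(path):
    """Return the raw text of one message, headers included."""
    # latin-1 maps every byte to a character, so decoding never raises
    # on messages whose original encoding is unknown.
    with open(path, encoding="latin-1") as f:
        return f.read()
```

If the layout assumption holds, calling `count_articles` on the directory extracted from mini_newsgroups.tar.gz should report 100 messages for each of the 20 groups, which gives a quick sanity check on the download.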
To help researchers investigate relation extraction, we’re releasing a human-judged dataset of two relations about public figures on Wikipedia: nearly 10,000 examples of “place of birth”, and over 40,000 examples of “attended or graduated from an institution”. Each of these was judged by at least 5 raters, and can be used to train or evaluate relation extraction systems. We also plan to release more relations of new types in the coming months.
We have released over a million images onto Flickr Commons for anyone to use, remix and repurpose. These images were taken from the pages of 17th, 18th and 19th century books digitised by Microsoft, who then generously gifted the scanned images to us, allowing us to release them back into...