Deutscher Wortschatz contains data generated from newspapers and web resources that are publicly available. The data were collected per language and encompass statistics about co-occurrences of words in randomly selected sentences.
Welcome to the Networked Digital Library of Theses and Dissertations (NDLTD), an international organization dedicated to promoting the adoption, creation, use, dissemination, and preservation of electronic theses and dissertations (ETDs). We support electronic publishing and open access to scholarship in order to enhance the sharing of knowledge worldwide. Our website includes resources for university administrators, librarians, faculty, students, and the general public.
The dataset genres.json contains (sub)genre classifications for novels published between 1770 and 1915. The genres covered are
gothic novels
"silver fork" novels
national tale novels
NYT10 is originally released by the paper "Sebastian Riedel, Limin Yao, and Andrew McCallum. Modeling relations and their mentions without labeled text."
Our flagship collection, under development since 1987, covers the history, literature and culture of the Greco-Roman world. We are applying what we have learned from Classics to other subjects within the humanities and beyond.
Snorkel is a system for programmatically building and managing training datasets without manual labeling. In Snorkel, users can develop large training datasets in hours or days rather than hand-labeling them over weeks or months.
The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers gathered from blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million words - or approximately 35 posts and 7250 words per person.