Scientext is a new online corpus of French and English scientific texts. It includes 4.8 million running tokens in French, 13 million words of English research articles (medicine and biology), and an English-language sub-corpus of texts written by French undergraduate students (1.1 million words). The corpus is organized to facilitate the linguistic study of authorial position and reasoning in scientific articles through phraseology and lexico-grammatical markers linked to causality.
The dataset genres.json contains (sub)genre classifications for novels published between 1770 and 1915. The genres covered are:
gothic novels
"silver fork" novels
national tale novels
The Net Data Directory collects and shares information on different sources of data about the Internet.