Scientext is a new, on-line French and English corpus of scientific texts. The corpus includes 4.8 million running tokens in French, 13 million words of research articles in English (medicine and biology), and an English-language sub-corpus of French undergraduate students’ texts (1,1 million words). The corpus is organized to facilitate the linguistic study of authorial position and reasoning in scientific articles through phraseology and lexico-grammatical markers linked to causality.
Tweets2011
As part of the TREC 2011 microblog track, Twitter provided identifiers for approximately 16 million tweets sampled between January 23rd and February 8th, 2011. The corpus is designed to be a reusable, representative sample of the twittersphere - i.e. both important and spam tweets are included.
X. Wang, Z. Wang, X. Han, W. Jiang, R. Han, Z. Liu, J. Li, P. Li, Y. Lin, und J. Zhou. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Seite 1652--1671. Online, Association for Computational Linguistics, (November 2020)
O. Kashefi, und R. Hwa. Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020), Seite 200--208. Online, Association for Computational Linguistics, (November 2020)