Scientext is a new online French and English corpus of scientific texts. The corpus includes 4.8 million running tokens in French, 13 million words of research articles in English (medicine and biology), and an English-language sub-corpus of French undergraduate students’ texts (1.1 million words). The corpus is organized to facilitate the linguistic study of authorial position and reasoning in scientific articles through phraseology and lexico-grammatical markers linked to causality.
Tweets2011
As part of the TREC 2011 microblog track, Twitter provided identifiers for approximately 16 million tweets sampled between January 23rd and February 8th, 2011. The corpus is designed to be a reusable, representative sample of the twittersphere; that is, both important and spam tweets are included.
X. Wang, Z. Wang, X. Han, W. Jiang, R. Han, Z. Liu, J. Li, P. Li, Y. Lin, and J. Zhou. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), page 1652--1671. Online, Association for Computational Linguistics, (November 2020)
O. Kashefi, and R. Hwa. Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020), page 200--208. Online, Association for Computational Linguistics, (November 2020)
R. Bommasani, and C. Cardie. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), page 8075--8096. Online, Association for Computational Linguistics, (November 2020)
T. McCoy, E. Pavlick, and T. Linzen. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, page 3428--3448. Florence, Italy, Association for Computational Linguistics, (July 2019)
S. Wunderlich, M. Ring, D. Landes, and A. Hotho. International Joint Conference: 12th International Conference on Computational Intelligence in Security for Information Systems (CISIS 2019) and 10th International Conference on EUropean Transnational Education (ICEUTE 2019) - Seville, Spain, May 13-15, 2019, Proceedings, volume 951 of Advances in Intelligent Systems and Computing, page 14--24. Springer, (2019)
K. Jiang, D. Wu, and H. Jiang. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), page 318--323. (2019)
N. Dehouche, and A. Wongkitrungrueng. Proceedings of ANZMAC 2018: The 20th Conference of the Australian and New Zealand Marketing Academy. Adelaide (Australia), 3--5 December. (2018)
Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. Cohen, R. Salakhutdinov, and C. Manning. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, page 2369--2380. Brussels, Belgium, Association for Computational Linguistics, (2018)
M. Braun, S. Krebs, F. Flohr, and D. Gavrila. (2018). arXiv:1805.07193. Comment: Submitted to IEEE Trans. on Pattern Analysis and Machine Intelligence.