Deutscher Wortschatz contains data generated from newspapers and web resources that are publicly available. The data were collected per language and encompass statistics about co-occurrences of words in randomly selected sentences.
The Penn Treebank Project annotates naturally occurring text for linguistic structure. Most notably, we produce skeletal parses showing rough syntactic and semantic information -- a bank of linguistic trees. We also annotate text with part-of-speech tags, and for the Switchboard corpus of telephone conversations, dysfluency annotation. We are located in the LINC Laboratory of the Computer and Information Science Department at the University of Pennsylvania. All data produced by the Treebank is released through the Linguistic Data Consortium.
The Google Books corpus of American English, 155 billion words in size. Use is limited to what you can do via the website at Brigham Young University. The easiest thing to do is type in a word or phrase and see its frequency by decade, going back to the 1810s. The interface lets you look for collocates (words that go with other words), view charts showing relative word frequency in the corpus by decade, work with parts of speech, and set various limits and display options. Other kinds of analysis that might be done with text corpora can’t be done through the interface.
Xaira is an open source application hosted on SourceForge. XAIRA (XML Aware Indexing and Retrieval Architecture) supports indexing and analysis of large XML textual resources such as natural language corpora.
Unitex is a corpus processing system based on automata-oriented technology. The concept of this software was born at LADL (Laboratoire d'Automatique Documentaire et Linguistique), under the direction of Maurice Gross. With this tool, you can handle electronic resources such as electronic dictionaries and grammars and apply them to texts. You can work at the levels of morphology, the lexicon and syntax.
S. Müller, M. Brunzel, D. Kaun, R. Biswas, M. Koutraki, T. Tietz, and H. Sack. The Semantic Web: ESWC 2019 Satellite Events, Portoroz, Slovenia, June 2-6, 2019, Revised Selected Papers, volume 11762 of Lecture Notes in Computer Science, pages 136--140. Springer, (2019)
B. Maia. I corpora nella didattica della traduzione: Corpus Use and Learning to Translate. Bologna: Cooperativa Libraria Universitaria Editrice Bologna, (2000)