As part of the TREC 2011 Microblog track, Twitter provided identifiers for approximately 16 million tweets sampled between January 23rd and February 8th, 2011. The corpus is designed to be a reusable, representative sample of the twittersphere; that is, it includes both important and spam tweets.
A number of resources have been compiled within the context of the MuchMore project. These include a bilingual, parallel medical corpus; corresponding queries and relevance assessments; evaluation sets of disambiguated terms for GermaNet and UMLS; and an evaluation list for morphological decomposition of medical terms.