Tweets2011
As part of the TREC 2011 microblog track, Twitter provided identifiers for approximately 16 million tweets sampled between January 23rd and February 8th, 2011. The corpus is designed to be a reusable, representative sample of the twittersphere - i.e. both important and spam tweets are included.
A social database about things you know and love, spanning millions of topics in thousands of categories. Explore Freebase, add to it, or build applications with it.
This collection consists of ~20M web queries collected from ~650k users over three months.
The data is sorted by anonymous user ID and sequentially arranged.
Here at Google Research we have been using word n-gram models for a variety of R&D projects, such as statistical machine translation, speech recognition, spelling correction, entity detection, information extraction, and others. While such models have usu
Expect to see an emphasis on the scholarly and research implications of the acquisition. I’m no Ph.D., but it boggles my mind to think what we might be able to learn about ourselves and the world around us from this wealth of data. And I’m certain we’ll learn things that none of us now can even possibly conceive.