The new TRC2 corpus comprises 1,800,370 news stories (2,871,075,221 bytes) covering the period from 2008-01-01 00:00:03 to 2009-02-28 23:54:14. It is initially made available to participants of the Blog Track at the Text Retrieval Conference (TREC) to supplement BLOG08, the main corpus used at the TREC Blog Track, which contains the results of a large blog crawl carried out at the University of Glasgow.
MemeTracker builds maps of the daily news cycle by analyzing around 900,000 news stories and blog posts per day from 1 million online sources, ranging from mass media to personal blogs. We track the quotes and phrases that appear most frequently over time across this entire online news spectrum. This makes it possible to see how different stories compete for news and blog coverage each day, and how certain stories persist while others fade quickly.
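As a rough illustration of what tracking frequent quotes over time can look like, the sketch below counts quoted phrases per day in a handful of made-up documents. It is a simplified stand-in, not MemeTracker's actual method: the function names, the regular expression, the exact-match normalization, and the sample data are all assumptions made for this example.

```python
import re
from collections import Counter, defaultdict

# Assumption: a "quote" is any text between straight double quotes.
QUOTE_RE = re.compile(r'"([^"]+)"')


def daily_quote_counts(docs):
    """Count quoted phrases per day.

    docs: iterable of (date_string, text) pairs.
    Returns a dict mapping each date to a Counter of normalized quotes.
    """
    counts = defaultdict(Counter)
    for day, text in docs:
        for quote in QUOTE_RE.findall(text):
            # Lowercase and collapse whitespace so trivial variants match.
            key = " ".join(quote.lower().split())
            if key:
                counts[day][key] += 1
    return counts


def top_quotes(counts, day, n=3):
    """Return the n most frequent quotes for one day as (quote, count) pairs."""
    return counts[day].most_common(n)


# Hypothetical sample data, invented for illustration only.
docs = [
    ("2008-09-01", 'She said "Yes we can" twice: "yes we  can".'),
    ("2008-09-01", 'A blog repeated "Yes We Can" and "lipstick on a pig".'),
    ("2008-09-02", 'Now everyone quotes "lipstick on a pig".'),
]
counts = daily_quote_counts(docs)
```

Calling `top_quotes(counts, "2008-09-01")` lists that day's most repeated quotes. Comparing the counts for the same phrase across consecutive days gives a crude view of which stories persist and which fade. Exact matching after normalization is the simplest possible choice here; grouping near-duplicate or mutated phrase variants would need a more involved approach.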