The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers, gathered from blogger.com in August 2004. The corpus contains a total of 681,288 posts and over 140 million words, or approximately 35 posts and 7,250 words per blogger.
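The per-blogger averages follow directly from the totals quoted above. A minimal sketch of that arithmetic, assuming exactly 140 million words (the description says only "over 140 million", so the word figure is a lower bound); the variable names are illustrative and not part of the corpus distribution:

```python
# Totals quoted in the corpus description.
bloggers = 19_320
posts = 681_288
words = 140_000_000  # assumption: "over 140 million" treated as a lower bound

# Simple averages across all bloggers.
posts_per_blogger = posts / bloggers
words_per_blogger = words / bloggers

print(f"about {posts_per_blogger:.1f} posts per blogger")
print(f"at least {words_per_blogger:,.0f} words per blogger")
```

Both results round to the approximate figures given in the description.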
The project combines two sources of information. The word counts themselves come from the HathiTrust Research Center (HTRC), which has tabulated them at the page level for 4.8 million public-domain volumes. Information about genre comes from a parallel project led by Ted Underwood and supported by the National Endowment for the Humanities and the American Council of Learned Societies.
T. Liu, J. Xu, T. Qin, W. Xiong, and H. Li. SIGIR '07: Proceedings of the Learning to Rank workshop in the 30th annual international ACM SIGIR conference on Research and development in information retrieval, (2007)
J. Kamps, S. Geva, A. Trotman, A. Woodley, and M. Koolen. Advances in Focused Retrieval: 7th International Workshop of the Initiative for the Evaluation of XML Retrieval (INEX 2008), volume 5631 of LNCS, pages 1-28. Springer Verlag, Berlin, Heidelberg, (2009)