Researchers at Google annotated English-language Web pages from the ClueWeb09 and ClueWeb12 corpora. The annotation process was automatic, and hence imperfect. However, the annotations are generally of high quality, because the researchers strove for high precision (and, by necessity, accepted lower recall). For each entity recognized with high confidence, they provide the beginning and end byte offsets of the entity mention in the input text, its Freebase identifier (mid), and two confidence levels (computed differently, see below).
You might consider using this data in conjunction with the recently released Freebase annotations of several TREC query sets.
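Because the offsets are given in bytes rather than characters, a mention has to be sliced out of the raw page bytes before decoding. The following is a minimal sketch of that idea, not the official file format: the tab-separated field order, the half-open end offset, the names `Annotation`, `parse_annotation`, `mention_text`, and `filter_confident`, and the placeholder mid are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Annotation:
    begin: int     # byte offset where the mention starts
    end: int       # byte offset where the mention ends (assumed exclusive)
    mid: str       # Freebase identifier
    conf_a: float  # first confidence level
    conf_b: float  # second confidence level (computed differently)


def parse_annotation(line: str) -> Annotation:
    """Parse one record; the field order here is hypothetical."""
    begin, end, mid, conf_a, conf_b = line.rstrip("\n").split("\t")
    return Annotation(int(begin), int(end), mid, float(conf_a), float(conf_b))


def mention_text(page: bytes, ann: Annotation, encoding: str = "utf-8") -> str:
    """Slice the raw bytes first, then decode, so multi-byte characters
    before the mention do not shift the span."""
    return page[ann.begin:ann.end].decode(encoding, errors="replace")


def filter_confident(anns: List[Annotation], threshold: float) -> List[Annotation]:
    """Keep annotations whose two confidence levels both reach the threshold."""
    return [a for a in anns if min(a.conf_a, a.conf_b) >= threshold]


# "ã" occupies two bytes in UTF-8, so byte and character offsets differ here.
page = "Visit São Paulo today".encode("utf-8")
ann = parse_annotation("6\t16\t/m/placeholder\t0.98\t0.91\n")  # placeholder mid
print(mention_text(page, ann))
```

Requiring both confidence levels to pass the threshold is only one possible policy; depending on the task, filtering on a single level may be more appropriate.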
J. Jeon, V. Lavrenko, and R. Manmatha. Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, page 119--126. New York, NY, USA, ACM, (2003)
Wei Wu, Bin Zhang, and Mari Ostendorf. Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, page 689--692. Stroudsburg, PA, USA, Association for Computational Linguistics, (2010)