Webstemmer is a web crawler and HTML layout analyzer that automatically extracts the main text of a news site, without banners, ads, or navigation links mixed in.
Y. Li, R. Krishnamurthy, S. Raghavan, S. Vaithyanathan, and H. Jagadish. Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 21-30. Honolulu, Hawaii, Association for Computational Linguistics (October 2008).