Text extraction from HTML pages - MetaOptimize Q+A

http://metaoptimize.com/qa/questions/3440/text-extraction-from-html-pages

Description

What would be a good way to extract headlines, dates, and authors from news articles? Writing a scraper with XPath or similar to pull this information from a single site seems easy, but I'm not sure what a more scalable solution would be when extracting from, say, 10,000 sites.
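
To make the single-site case concrete, here is a minimal sketch of the per-site XPath approach the question mentions. Everything in it is an assumption made for illustration: the sample markup, the class names, the `SITE_RULES` table, and the `extract_fields` helper are all hypothetical. It uses Python's standard-library `xml.etree.ElementTree`, which supports only a limited XPath subset and requires well-formed markup; real-world HTML would likely need a forgiving HTML parser instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for one news site's article page.
PAGE = """
<html>
  <body>
    <div class="article">
      <h1 class="headline">Example headline</h1>
      <span class="byline">Jane Doe</span>
      <span class="date">2010-11-02</span>
    </div>
  </body>
</html>
"""

# Hypothetical per-site rules: one XPath expression per field.
# These are tied to this site's markup and would differ for every other site.
SITE_RULES = {
    "headline": ".//h1[@class='headline']",
    "author": ".//span[@class='byline']",
    "date": ".//span[@class='date']",
}


def extract_fields(page, rules):
    """Apply each XPath rule to the page; return None for fields not found."""
    root = ET.fromstring(page.strip())
    fields = {}
    for name, xpath in rules.items():
        node = root.find(xpath)
        has_text = node is not None and node.text
        fields[name] = node.text.strip() if has_text else None
    return fields


if __name__ == "__main__":
    print(extract_fields(PAGE, SITE_RULES))
```

The sketch also shows where the scaling problem comes from: each site needs its own hand-written rule table, so 10,000 sites would mean 10,000 sets of expressions to write and maintain as layouts change.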
