Text extraction from HTML pages - MetaOptimize Q+A


Description

What would be a good way to extract headlines, dates, and authors from news articles? Writing a scraper with XPath or something similar to pull this information from a single site seems easy, but I'm not sure what a more scalable solution would look like if you're extracting from, say, 10,000 sites.
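
To make the single-site approach from the question concrete, here is a minimal sketch using only Python's standard library. The site names, the markup, and the SITE_RULES table are all hypothetical, and the sample page is assumed to be well-formed XML (real-world HTML usually is not and would need a tolerant parser). It is meant to illustrate why the approach gets harder to maintain as sites are added, not to propose a solution.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-site rules: every site needs its own hand-written paths.
# ElementTree supports only a limited subset of XPath.
SITE_RULES = {
    "example-news.test": {
        "headline": ".//h1[@class='headline']",
        "date": ".//span[@class='date']",
        "author": ".//span[@class='byline']",
    },
    "another-paper.test": {
        "headline": ".//div[@id='title']/h2",
        "date": ".//time",
        "author": ".//a[@rel='author']",
    },
}


def extract_fields(site, markup):
    """Apply one site's rules to well-formed markup; unmatched fields map to None."""
    root = ET.fromstring(markup)
    result = {}
    for field, path in SITE_RULES[site].items():
        node = root.find(path)
        # Join all nested text so inline tags inside the node are not lost.
        result[field] = "".join(node.itertext()).strip() if node is not None else None
    return result


# Hypothetical, well-formed sample page for the first site.
SAMPLE = """<html><body>
<h1 class="headline">Local team wins</h1>
<span class="byline">Jane Doe</span>
<span class="date">2011-03-04</span>
</body></html>"""

if __name__ == "__main__":
    print(extract_fields("example-news.test", SAMPLE))
```

Each new site adds another entry to SITE_RULES, and applying one site's rules to another site's page simply yields None for every field, which is the maintenance burden the question is pointing at when it mentions 10,000 sites.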

Users

  • @kasimiro