Lately I’ve been evaluating and comparing algorithms capable of extracting useful content from arbitrary HTML documents. I have made a feature-wise comparison of related software and APIs.
The boilerpipe library provides algorithms to detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.
The library already provides specific strategies for common tasks (for example: news article extraction) and may also be easily extended for individual problem settings.
Extraction is very fast (on the order of milliseconds), needs only the input document (no global or site-level information is required), and is usually quite accurate.
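To make the "only the input document" idea concrete, here is a minimal sketch of single-document boilerplate removal. It is not boilerpipe's actual algorithm or API; it is a simplified stand-in that scores each text block by word count and link density. The class name ContentSketch, the thresholds MIN_WORDS and MAX_LINK_DENSITY, and the regex-based block splitting are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative single-document extractor; NOT boilerpipe's real implementation. */
final class ContentSketch {
    // Hypothetical thresholds, chosen only for this example.
    static final int MIN_WORDS = 10;
    static final double MAX_LINK_DENSITY = 0.33;

    // Tags treated as block boundaries (a crude regex approximation, not an HTML parser).
    private static final Pattern BLOCK_BREAK = Pattern.compile(
        "(?is)</?(?:p|div|h[1-6]|li|ul|ol|td|tr|table|br|nav|header|footer)\\b[^>]*>");
    private static final Pattern ANCHOR = Pattern.compile("(?is)<a\\b[^>]*>(.*?)</a>");
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]+>");
    private static final Pattern SCRIPT_OR_STYLE = Pattern.compile(
        "(?is)<script\\b.*?</script>|<style\\b.*?</style>");

    /** Counts whitespace-separated words after stripping any remaining tags. */
    static int countWords(String markup) {
        String text = TAG.matcher(markup).replaceAll(" ").trim();
        return text.isEmpty() ? 0 : text.split("\\s+").length;
    }

    /** Keeps blocks that are long enough and contain mostly non-link text. */
    static String extract(String html) {
        String cleaned = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        List<String> kept = new ArrayList<>();
        for (String block : BLOCK_BREAK.split(cleaned)) {
            int words = countWords(block);
            if (words < MIN_WORDS) {
                continue; // short blocks are assumed to be clutter
            }
            int linkWords = 0;
            Matcher m = ANCHOR.matcher(block);
            while (m.find()) {
                linkWords += countWords(m.group(1));
            }
            if ((double) linkWords / words <= MAX_LINK_DENSITY) {
                // Strip tags and collapse whitespace; HTML entities are left undecoded.
                kept.add(TAG.matcher(block).replaceAll(" ").trim().replaceAll("\\s+", " "));
            }
        }
        return String.join("\n", kept);
    }

    public static void main(String[] args) {
        String html =
            "<html><body>"
            + "<div class=\"nav\"><a href=\"/\">Home</a> <a href=\"/w\">World news</a> "
            + "<a href=\"/s\">Sports</a> <a href=\"/b\">Business</a> <a href=\"/t\">Technology</a> "
            + "<a href=\"/a\">About us</a> <a href=\"/c\">Contact the editors</a></div>"
            + "<p>Boilerplate removal keeps the main article text and drops navigation "
            + "menus, footers and other page clutter.</p>"
            + "<div class=\"footer\">Copyright 2010 <a href=\"/i\">Imprint</a></div>"
            + "</body></html>";
        System.out.println(extract(html));
    }
}
```

On the sample page, the navigation block is dropped because all of its words sit inside links, the footer is dropped because it is too short, and only the paragraph survives. A real library will use richer features and a proper HTML parser; the point here is only that each decision is made from the document itself, with no site-level statistics.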
Boilerpipe is a Java library written by Christian Kohlschütter. It is released under the Apache License 2.0.
Most comprehensive free English dictionary, organising nouns, verbs, etc. into sets of cognitive synonyms. Can be navigated with a browser. Web services.