The boilerpipe library provides algorithms to detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.
The library already provides specific strategies for common tasks (for example, news article extraction) and can also be easily extended for individual problem settings.
Extraction is very fast (milliseconds), needs only the input document (no global or site-level information is required), and is usually quite accurate.
Boilerpipe is a Java library written by Christian Kohlschütter. It is released under the Apache License 2.0.
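To make the idea of separating main content from clutter using only the input document more concrete, here is a minimal, self-contained sketch. It is not boilerpipe's implementation or API: the class `BoilerplateSketch`, the `Block` type, and the thresholds `MIN_WORDS` and `MAX_LINK_DENSITY` are all hypothetical, chosen only to illustrate a simple per-block heuristic (keep blocks that are long enough and not dominated by link text).

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a toy per-block filter, not boilerpipe's real algorithm or API.
class BoilerplateSketch {

    /** One text block plus the number of its words that sit inside links. */
    static final class Block {
        final String text;
        final int linkedWords;

        Block(String text, int linkedWords) {
            this.text = text;
            this.linkedWords = linkedWords;
        }

        /** Number of whitespace-separated words in the block. */
        int wordCount() {
            String trimmed = text.trim();
            return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
        }

        /** Fraction of the block's words that are link text (0.0 for an empty block). */
        double linkDensity() {
            int n = wordCount();
            return n == 0 ? 0.0 : (double) linkedWords / n;
        }
    }

    // Hypothetical thresholds, picked for the example rather than tuned on data.
    static final int MIN_WORDS = 10;
    static final double MAX_LINK_DENSITY = 0.33;

    /** Keeps blocks that look like running text; drops short or link-heavy ones. */
    static List<String> keepContent(List<Block> blocks) {
        List<String> kept = new ArrayList<>();
        for (Block b : blocks) {
            if (b.wordCount() >= MIN_WORDS && b.linkDensity() <= MAX_LINK_DENSITY) {
                kept.add(b.text);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Block> page = new ArrayList<>();
        page.add(new Block("Home News Sports Contact", 4)); // navigation bar
        page.add(new Block("The city council voted on Tuesday to approve a new budget "
                + "for road repairs and public transit.", 0)); // article text
        page.add(new Block("Copyright 2010 Example Corp", 0)); // footer
        for (String s : keepContent(page)) {
            System.out.println(s);
        }
    }
}
```

The sketch looks only at the blocks of a single page, which mirrors the point above that no global or site-level information is needed. A real extractor would first have to split the HTML into blocks and count link words, which is omitted here.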
This project started from my frustration that I could not find any simple, portable XML parser to use inside my tools (see CONDOR for example). Let's look at the well-known Xerces C++ library: the complete Xerces project is 53 MB (11 MB compressed in a zip file)! I am currently developing many small tools. I use XML as the standard for all my input/output configuration and data files. The source code of my small tools is usually around 600 KB.
Get your Mac, a webcam, and Delicious Library and rediscover your home library. Just point any FireWire digital video camera at the barcode on the back of any book, movie, music, or video game. Delicious Library does the rest. The barcode is scanned, and within seconds the item's cover appears on your digital shelves, filled with tons of in-depth information downloaded from one of six different web sources from around the world.