ANTLR (ANother Tool for Language Recognition) is a parser and translator generator that lets one define language grammars in either ANTLR syntax (which resembles YACC and EBNF, the Extended Backus-Naur Form) or a special AST (Abstract Syntax Tree) syntax. ANTLR can create lexers, parsers, and ASTs. ANTLR is more than a grammar definition language, however: its tools implement an ANTLR-defined grammar by automatically generating lexers, parsers, and tree parsers in Java (http://java.sun.com/), C++ (http://anubis.dkuug.dk/jtc1/sc22/wg21/), or Sather (http://www.icsi.berkeley.edu/~sather/).
The boilerpipe library provides algorithms to detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.
The library already provides specific strategies for common tasks (for example, news article extraction) and can easily be extended for individual problem settings.
Extraction is very fast (on the order of milliseconds), needs only the input document (no global or site-level information is required), and is usually quite accurate.
Boilerpipe is a Java library written by Christian Kohlschütter. It is released under the Apache License 2.0.
The algorithms used by the library build on and extend concepts from the paper "Boilerplate Detection using Shallow Text Features" by Christian Kohlschütter et al., presented at WSDM 2010, the Third ACM International Conference on Web Search and Data Mining, New York City, NY, USA.
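To give a feel for what "shallow text features" can mean, here is a minimal sketch in plain Java. It is not boilerpipe's code or API and does not reproduce the paper's classifier: it assumes a simplified heuristic in which a text block counts as content when it has enough words and a low share of linked words. The class name `ShallowFeatureSketch`, the regex-based block splitting, and both thresholds are hypothetical choices made for illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only: NOT the boilerpipe implementation.
final class ShallowFeatureSketch {
    // Hypothetical thresholds, picked for this example rather than taken from the paper.
    static final int MIN_WORDS = 10;
    static final double MAX_LINK_DENSITY = 0.33;

    // Block-level tags are treated as boundaries between text blocks.
    private static final Pattern BLOCK_BREAK =
        Pattern.compile("(?is)</?(p|div|li|ul|ol|td|tr|table|h[1-6]|br)\\b[^>]*>");
    private static final Pattern ANCHOR = Pattern.compile("(?is)<a\\b[^>]*>(.*?)</a>");
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    // Replaces every remaining tag with a space.
    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll(" ");
    }

    // Counts whitespace-separated tokens.
    static int countWords(String text) {
        String t = text.trim();
        return t.isEmpty() ? 0 : t.split("\\s+").length;
    }

    // Fraction of a block's words that sit inside <a>...</a> elements.
    static double linkDensity(String blockHtml) {
        int total = countWords(stripTags(blockHtml));
        if (total == 0) {
            return 0.0;
        }
        int linked = 0;
        Matcher m = ANCHOR.matcher(blockHtml);
        while (m.find()) {
            linked += countWords(stripTags(m.group(1)));
        }
        return (double) linked / total;
    }

    // A block is kept when it is long enough and not dominated by links.
    static boolean isContent(String blockHtml) {
        return countWords(stripTags(blockHtml)) >= MIN_WORDS
            && linkDensity(blockHtml) <= MAX_LINK_DENSITY;
    }

    // Splits the page into blocks and joins the ones classified as content.
    static String extract(String html) {
        StringBuilder out = new StringBuilder();
        for (String block : BLOCK_BREAK.split(html)) {
            if (isContent(block)) {
                if (out.length() > 0) {
                    out.append('\n');
                }
                out.append(stripTags(block).trim().replaceAll("\\s+", " "));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String page = "<div><a href=\"/\">Home</a> <a href=\"/news\">News</a></div>"
            + "<p>The quick brown fox jumps over the lazy dog while the editor "
            + "reviews a long paragraph of article text.</p>"
            + "<div>Copyright 2010</div>";
        System.out.println(extract(page));
    }
}
```

In this sketch the navigation block is dropped because all of its words are linked, and the copyright line is dropped because it is too short. A regex-based splitter is a deliberate simplification; a real extractor would work on a properly parsed document.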
Software development needs professionals. But what are professionals? People who earn money with software development? No, we believe it takes more than that, and something different. Professionalism in software development has nothing to do with money. It also has only a limited connection to any particular educational path. We know professional software developers who earn little or no money with their software, and we know professional software developers who hold neither a diploma nor a doctorate.
M. D'Ambros. Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 2, page 529--530. New York, NY, USA, ACM, (2010)