BibSonomy now automatically detects whether you are on a site for which it has a screen scraper, and lets you choose between posting a bookmark and posting a publication.
Today's feature-of-the-week post points you to one of the hidden features of the system. As most of you certainly know, one way to acquire the metadata of a publication is to use BibSonomy's screen-scraping facility.
At the moment you can select a BibTeX entry on a web page and insert it into BibSonomy by pressing the postPublication button. The feature we will release next week allows you to extract references from ACM or Citeseer without selecting a BibTeX entry. What we can already provide today is an interface for scrapers and some helper classes that allow you to implement scrapers for other services. If you are interested in developing a BibSonomy-compliant scraper that we can include in the project, have a look at this JAR file, which contains the source code for the required classes: scraper-0.1.jar.
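To give a rough idea of what a pluggable scraper could look like, here is a minimal sketch in Java. It does not reproduce the classes in scraper-0.1.jar: the names Scraper, BibtexSelectionScraper, ScraperChain, supports, and scrape are all assumptions made up for illustration, and the BibTeX check is a deliberately crude heuristic rather than a parser.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical contract for a scraper; the real interface in scraper-0.1.jar may differ.
interface Scraper {
    // Whether this scraper knows how to handle the given page.
    boolean supports(URI pageUrl);

    // Returns a BibTeX entry for the page, or empty if none could be extracted.
    Optional<String> scrape(URI pageUrl, String selectedText);
}

// Handles the "user selected a BibTeX entry on the page" case for any site.
class BibtexSelectionScraper implements Scraper {
    @Override
    public boolean supports(URI pageUrl) {
        return true; // works on every page, since it only looks at the selection
    }

    @Override
    public Optional<String> scrape(URI pageUrl, String selectedText) {
        if (selectedText == null) {
            return Optional.empty();
        }
        String entry = selectedText.trim();
        // Crude shape check (assumption): "@type{ ... }". A real scraper would parse the entry.
        boolean looksLikeBibtex =
                entry.startsWith("@") && entry.contains("{") && entry.endsWith("}");
        return looksLikeBibtex ? Optional.of(entry) : Optional.empty();
    }
}

// Tries the registered scrapers in order and returns the first successful result.
class ScraperChain {
    private final List<Scraper> scrapers;

    ScraperChain(List<Scraper> scrapers) {
        this.scrapers = scrapers;
    }

    Optional<String> scrape(URI pageUrl, String selectedText) {
        for (Scraper scraper : scrapers) {
            if (!scraper.supports(pageUrl)) {
                continue;
            }
            Optional<String> result = scraper.scrape(pageUrl, selectedText);
            if (result.isPresent()) {
                return result;
            }
        }
        return Optional.empty();
    }
}

class ScraperSketch {
    public static void main(String[] args) {
        ScraperChain chain =
                new ScraperChain(Arrays.<Scraper>asList(new BibtexSelectionScraper()));
        URI page = URI.create("http://example.org/some-paper");
        String selection = "@article{doe2006, title={An Example}}";
        System.out.println(chain.scrape(page, selection).orElse("no entry found"));
    }
}
```

In a design like this, a site-specific scraper would be another implementation of the same interface whose supports method checks the host of the page URL; registering it ahead of the selection-based scraper would let it take precedence on the sites it knows.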
The update we released today includes scrapers for the ACM Digital Library and Citeseer. Some smaller scrapers are already included, and more will follow. If you have suggestions for scrapers, or already have implementations (see the last post), we would be pleased to hear about them.
Additionally, we improved tag editing through the edit link, which now appears on every page that shows bookmarks or publications. Since it now also appears on pages containing resources you do not own (whose tags you are, of course, not allowed to change), the tag editing page shows only the resources you own. A nice side effect is that the download page now also has an edit link.
The boilerpipe library provides algorithms to detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.
The library already provides specific strategies for common tasks (for example, news article extraction) and can easily be extended for individual problem settings.
Extraction is very fast (milliseconds), needs only the input document (no global or site-level information), and is usually quite accurate.
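To illustrate why per-document extraction can work without any site-level information, here is a toy sketch in Java that classifies text blocks using two shallow features, word count and link density. It is not boilerpipe's code or API: the classes TextBlock and ShallowFeatureSketch and both thresholds are assumptions chosen for illustration, and the sketch assumes the page has already been split into blocks with the number of linked words counted.

```java
import java.util.ArrayList;
import java.util.List;

// One block of page text plus the number of its words that sit inside links.
class TextBlock {
    final String text;
    final int linkedWords;

    TextBlock(String text, int linkedWords) {
        this.text = text;
        this.linkedWords = linkedWords;
    }

    int wordCount() {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    // Fraction of the block's words that are link text (0.0 for an empty block).
    double linkDensity() {
        int words = wordCount();
        return words == 0 ? 0.0 : (double) linkedWords / words;
    }
}

class ShallowFeatureSketch {
    // Illustrative thresholds (assumptions), not values taken from the library.
    static final int MIN_WORDS = 10;
    static final double MAX_LINK_DENSITY = 0.33;

    // A block counts as main content if it is long enough and not mostly links.
    static boolean isContent(TextBlock block) {
        return block.wordCount() >= MIN_WORDS
                && block.linkDensity() <= MAX_LINK_DENSITY;
    }

    // Keeps only the content blocks, joined by newlines.
    static String extract(List<TextBlock> blocks) {
        StringBuilder out = new StringBuilder();
        for (TextBlock block : blocks) {
            if (!isContent(block)) {
                continue;
            }
            if (out.length() > 0) {
                out.append('\n');
            }
            out.append(block.text.trim());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<TextBlock> page = new ArrayList<>();
        page.add(new TextBlock("Home News Sports Contact", 4)); // navigation bar
        page.add(new TextBlock(
                "The library removes navigation menus and footers so that only the article remains",
                0)); // article paragraph
        page.add(new TextBlock("Copyright 2010 Privacy Terms", 2)); // footer
        System.out.println(extract(page));
    }
}
```

Each decision here depends only on the block itself, which is why nothing beyond the input document is needed. A real extractor would also have to segment the HTML into blocks and would typically look at neighbouring blocks as well.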
Boilerpipe is a Java library written by Christian Kohlschütter. It is released under the Apache License 2.0.
The algorithms used by the library are based on, and extend, some concepts of the paper "Boilerplate Detection using Shallow Text Features" by Christian Kohlschütter et al., presented at WSDM 2010, the Third ACM International Conference on Web Search and Data Mining, New York City, NY, USA.
Mozenda provides web data extraction software: web screen-scraping tools that make it easy to capture nearly any content from the web. See how you can start getting data from the web in minutes.
M. Granitzer, M. Hristakeva, R. Knight, K. Jack, and R. Kern. Proceedings of the 2nd International Conference on Web Intelligence, Mining and Semantics, page 19:1--19:8. New York, NY, USA, ACM, (2012)
M. Neumann, P. Schaer, C. Michels, and R. Schenkel. Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries, page 45--48. New York, NY, USA, ACM, (2018)