A blue social bookmark and publication sharing system.

Inside BibSonomy

Hashes of Publication Posts

Motivation

Detecting duplicate posts is a particular problem for literature references, because users vary widely in how they enter fields such as the journal name or the author. On the one hand, it is desirable to let a user keep several posts that differ only slightly. On the other hand, one might want to find other users' posts that refer to the same paper or book even if they are not completely identical.

To fulfill both goals we implemented two hashes to compare publication posts: one for comparing the posts of a single user (intra user hash) and one for comparing the posts of different users (inter user hash). Comparison is accomplished by normalizing and concatenating BibTeX fields, hashing the result with the MD5 message digest algorithm, and comparing the resulting hashes. MD5 hashing is done for efficiency reasons only, since it allows fixed-length storage in the database. Storing the hashes along with the resources in the posts table enables fast comparison and search of posts.
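The hashing step described above can be sketched as follows. This is a minimal illustration, not BibSonomy's actual implementation: the class name Md5HexSketch, the method md5Hex, and the choice of UTF-8 as the string encoding are assumptions made for this example. It shows that any input, however long, yields a 32-character hexadecimal string, which is what makes fixed-length storage possible.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of the BibSonomy code base.
final class Md5HexSketch {

    // Returns the MD5 digest of the input as 32 lower-case hex digits.
    // Assumption: the string is encoded as UTF-8 before hashing.
    static String md5Hex(final String input) {
        try {
            final MessageDigest md = MessageDigest.getInstance("MD5");
            final byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
            // 16 digest bytes become 32 hex digits, left-padded with zeros.
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide MD5.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(final String[] args) {
        // Two posts are considered equal when their hashes are equal.
        final String a = md5Hex("an example title j.doe 2008");
        final String b = md5Hex("an example title j.doe 2008");
        System.out.println(a.length() + " characters, equal: " + a.equals(b));
    }
}
```

Because only the fixed-length hash strings are compared, the database can index and match them without looking at the underlying BibTeX fields.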

The intra user hash is relatively strict and takes into account the fields title, author, editor, year, entrytype, journal, booktitle, volume, and number. This allows one to have articles with the same title from the same authors in the same year but in different volumes (e.g., a technical report and the corresponding journal article).

In contrast, the inter user hash is less specific and includes only the title, the year, and the author, or the editor if no author has been entered.

In both hashes, all fields that are taken into account are normalized: certain special characters are removed, and whitespace and author/editor names are normalized. Names are normalized by joining the first letter of the first name and the last name with a dot, both in lower case. Persons are then sorted alphabetically by this string and joined with colons.
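The name normalization just described can be sketched like this. The class and method names are hypothetical, and the input conventions are assumptions for this example: persons are separated by the BibTeX keyword "and", and each name is written either as "First Last" or as "Last, First". The real code in the JAR may handle more cases.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch, not the BibSonomy implementation.
final class PersonNormalizerSketch {

    // Turns one name into "<first initial>.<last name>", all lower case.
    static String normalizePerson(final String name) {
        final String trimmed = name.trim();
        final String first;
        final String last;
        final int comma = trimmed.indexOf(',');
        if (comma >= 0) {
            // Assumed form "Last, First"
            last = trimmed.substring(0, comma).trim();
            first = trimmed.substring(comma + 1).trim();
        } else {
            // Assumed form "First [Middle] Last"; a single word counts as last name
            final int space = trimmed.lastIndexOf(' ');
            if (space < 0) {
                first = "";
                last = trimmed;
            } else {
                first = trimmed.substring(0, space).trim();
                last = trimmed.substring(space + 1);
            }
        }
        final String initial = first.isEmpty() ? "" : first.substring(0, 1) + ".";
        return (initial + last).toLowerCase(Locale.ROOT);
    }

    // Normalizes every person, sorts the results, and joins them with colons.
    static String normalizePersons(final String persons) {
        final List<String> result = new ArrayList<>();
        for (final String person : persons.split("\\s+and\\s+")) {
            if (!person.trim().isEmpty()) {
                result.add(normalizePerson(person));
            }
        }
        Collections.sort(result);
        return String.join(":", result);
    }

    public static void main(final String[] args) {
        System.out.println(normalizePersons("Leslie Lamport and Donald E. Knuth"));
    }
}
```

Sorting makes the result independent of the order in which a user lists the persons, so "Lamport, Leslie and Knuth, Donald E." and "Donald E. Knuth and Leslie Lamport" produce the same string.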

Demo

To demonstrate the generation of the inter- and intra-user hash, you can fill out and submit the following form. BibSonomy then will calculate both hashes.

Fields used in both interhash and intrahash: title, author, editor, year.
Fields used only in intrahash: entrytype, journal, booktitle, volume, number.

If you just want an example to play around with, follow this link.

Sourcecode

The complete source code to compute the hashes is included in org.bibsonomy.hashes_0.2.jar. It includes the class org.bibsonomy.util.BibTeXHashCalculator, which contains a main method you can call with java -jar org.bibsonomy.hashes_0.2.jar to see an example output and to use as a starting point for browsing and understanding the source.

The computation of the hashes is done in class org.bibsonomy.model.util.SimHash. It contains the following code to compute the intra hash:

public static String getIntraHash(final BibTeX bibtex) {
   return StringUtils.getMD5Hash(StringUtils.removeNonNumbersOrLettersOrDotsOrSpace(bibtex.getTitle())     + " " + 
      StringUtils.removeNonNumbersOrLettersOrDotsOrSpace(bibtex.getAuthor())    + " " + 
      StringUtils.removeNonNumbersOrLettersOrDotsOrSpace(bibtex.getEditor())    + " " + 
      StringUtils.removeNonNumbersOrLettersOrDotsOrSpace(bibtex.getYear())      + " " + 
      StringUtils.removeNonNumbersOrLettersOrDotsOrSpace(bibtex.getEntrytype()) + " " + 
      StringUtils.removeNonNumbersOrLettersOrDotsOrSpace(bibtex.getJournal())   + " " + 
      StringUtils.removeNonNumbersOrLettersOrDotsOrSpace(bibtex.getBooktitle()) + " " +
      StringUtils.removeNonNumbersOrLetters(bibtex.getVolume())                 + " " +
      StringUtils.removeNonNumbersOrLetters(bibtex.getNumber())
   );
}

The following code computes the inter hash:

public static String getInterHash(final BibTeX bibtex) {	
   if (StringUtils.removeNonNumbersOrLetters(bibtex.getAuthor()).equals("")) {
      // no author set --> take editor
      return StringUtils.getMD5Hash(getNormalizedTitle(bibtex.getTitle()) + " " +
         getNormalizedEditor(bibtex.getEditor())            + " " +
         getNormalizedYear(bibtex.getYear()));				
   }
   // author set
   return StringUtils.getMD5Hash(getNormalizedTitle(bibtex.getTitle()) + " " + 
      getNormalizedAuthor(bibtex.getAuthor())            + " " + 
      getNormalizedYear(bibtex.getYear()));
}

To see how the helper functions (e.g., removeNonNumbersOrLetters) work, have a look at the JAR file.
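As a rough guide to what the character-filtering helpers might do, here is a sketch inferred only from their names. The class CharFilterSketch, the regular expressions, and the treatment of null as an empty string are assumptions; the actual implementations in the JAR may differ.

```java
// Hypothetical re-implementation based on the method names alone.
final class CharFilterSketch {

    // Keeps only digits and letters. Assumption: null is treated as "".
    static String removeNonNumbersOrLetters(final String s) {
        return s == null ? "" : s.replaceAll("[^0-9\\p{L}]+", "");
    }

    // Keeps digits, letters, dots, and spaces. Assumption: null is treated as "".
    static String removeNonNumbersOrLettersOrDotsOrSpace(final String s) {
        return s == null ? "" : s.replaceAll("[^0-9\\p{L}. ]+", "");
    }

    public static void main(final String[] args) {
        System.out.println(removeNonNumbersOrLetters("Vol. 12-a"));
        System.out.println(removeNonNumbersOrLettersOrDotsOrSpace("J. Web-Sem.!"));
    }
}
```

Under these assumptions an empty or missing author field filters to the empty string, which is the condition getInterHash checks before falling back to the editor.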

Screen Scrapers for Digital Libraries

An overview of BibSonomy's screen scrapers is given on the scraperinfo page. We also provide a service which can be used for scraping without the need to post to BibSonomy. The source code of the scrapers can be found in the Maven repository in the bibsonomy-scraper module.