Adapting the tf idf vector-space model to domain specific information retrieval
C. Fautsch and J. Savoy. Proceedings of the 2010 ACM Symposium on Applied Computing, pages 1708-1712. New York, NY, USA, ACM, (2010)
The default implementation in Lucene, an open-source search engine, is the well-known vector-space model with <i>tf idf</i> weighting. The objective of this paper is to propose and evaluate additional techniques that can be adapted to this search model in order to meet the particular needs of domain-specific information retrieval (IR). We suggest certain specificity measures derived from either information theory or corpus-based linguistics. As an additional feature, we suggest accounting for the number of search terms that a query and the retrieved documents have in common. To integrate these methods, we design and implement four extensions to the classical <i>tf idf</i> model, then evaluate the new IR models by applying them to four different domain-specific collections and comparing them to the results obtained with a probabilistic retrieval model. The results tend to demonstrate that the adapted vector-space models clearly outperform the baseline approach (<i>tf idf</i>) and that the performance levels obtained even surpass those of the Okapi model.
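To make the two ideas named in the abstract concrete, the following is a minimal Python sketch of a tf idf score multiplied by a factor for the number of search terms a query and a document share. It is an illustration under stated assumptions, not the authors' four extensions and not Lucene's actual implementation: the whitespace tokenizer, the smoothed idf formula, the square-root tf damping, and the function names `tokenize`, `build_idf`, `score`, and `rank` are all hypothetical choices made for this example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization, no stemming or stopwords.
    return text.lower().split()


def build_idf(docs):
    # Document frequency: number of documents containing each term.
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    # Assumption: a smoothed idf, log(1 + N/df), chosen so it is always > 0.
    return {term: math.log(1 + n / count) for term, count in df.items()}


def score(query, doc, idf):
    q_terms = set(tokenize(query))
    tf = Counter(tokenize(doc))
    matched = [t for t in q_terms if t in tf]
    if not matched:
        return 0.0
    # tf idf contribution of the matched terms (square-root tf damping).
    base = sum(math.sqrt(tf[t]) * idf.get(t, 0.0) for t in matched)
    # Length normalization by the norm of the document's weight vector.
    norm = math.sqrt(
        sum((math.sqrt(c) * idf.get(t, 0.0)) ** 2 for t, c in tf.items())
    )
    # Fraction of query terms that the document shares with the query.
    coord = len(matched) / len(q_terms)
    return coord * base / norm


def rank(query, docs):
    # Return indices of matching documents, best score first.
    idf = build_idf(docs)
    scored = [(score(query, d, idf), i) for i, d in enumerate(docs)]
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [i for s, i in ordered if s > 0]
```

For example, with the documents `["gene expression in yeast cells", "yeast bread recipe", "stock market report"]` and the query `"gene expression yeast"`, `rank` returns `[0, 1]`: the first document matches all three query terms, the second matches only one and is scaled down accordingly, and the third matches none and is dropped.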