Abstract

The huge volume of digital information collected automatically through internet technology has created problems for information retrieval: finding the right information in a large collection is very difficult. In most search engines, the difficulty is caused by a string-matching algorithm that returns a match only when an exact occurrence of the search term is found. To address this problem, and considering that a document collection is not only a collection of words but also a collection of concepts, we propose a new information retrieval technique that is based on concepts. The word-based and concept-based techniques differ in indexing and retrieval. During indexing, the technique classifies documents into concepts extracted from the collection by clustering, so that a concept index is constructed alongside the term index. During retrieval, the technique ranks documents based on a combination of term and concept similarity, formulated as doc-score = β * conceptScore + (1 - β) * termScore, where β is the weight of the concept score. The clustering algorithm, Bisecting K-Means, was chosen from the partitional family because it is linear in complexity. Two kinds of test collections were used in the experiment: news text documents (1000 and 3000 news documents) and academic text documents (1000 academic abstracts in information technology). Performance was evaluated using average precision and R-precision. The results showed that setting β between 0.5 and 0.9 significantly improved the precision of the concept-based approach over the word-based-only approach (β = 0). The improvements are about 5.2% to 8.3% for average precision and 16.9% to 31.5% for R-precision.
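As an illustration of the scoring formula in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes that term and concept scores have already been computed per document and are on comparable scales; the function names `combined_score` and `rank_documents`, the dictionary inputs, and the tie-breaking by document id are all hypothetical choices made for this example.

```python
def combined_score(term_score, concept_score, beta):
    """doc-score = beta * conceptScore + (1 - beta) * termScore."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return beta * concept_score + (1.0 - beta) * term_score


def rank_documents(term_scores, concept_scores, beta):
    """Rank document ids by combined score, highest first.

    term_scores and concept_scores map document id -> score
    (hypothetical inputs). A document missing from one map is
    assumed to score 0.0 there. Ties are broken by document id.
    """
    doc_ids = set(term_scores) | set(concept_scores)
    scored = {
        doc_id: combined_score(
            term_scores.get(doc_id, 0.0),
            concept_scores.get(doc_id, 0.0),
            beta,
        )
        for doc_id in doc_ids
    }
    return sorted(scored, key=lambda doc_id: (-scored[doc_id], doc_id))


if __name__ == "__main__":
    # Made-up scores for two documents, for illustration only.
    term_scores = {"d1": 0.9, "d2": 0.2}
    concept_scores = {"d1": 0.1, "d2": 0.8}
    for beta in (0.0, 0.5, 0.9):
        print(beta, rank_documents(term_scores, concept_scores, beta))
```

With β = 0 the ranking reduces to the word-based ordering, and as β grows the concept score increasingly determines the order.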
