Abstract
In this paper, we improve the statistical ranking of multi-word terms by using known terms. We use linguistic knowledge to extract noun phrases as candidate terms in Dutch. After converting them into bigrams, we compare the performance of eight statistical methods (frequency, Dice, log-likelihood, pair-wise mutual information, true mutual information, t-score, chi-square, and C-value) in measuring bigram association, and select the best one (log-likelihood) as the baseline to improve upon. We propose a new scoring method that improves its term ranking by incorporating known terms. For evaluation, we use Elsevier's Medical Encyclopedia and the Merck Manual as corpora, and compare the extracted terms against those encoded in the encyclopedia and against a list of Dutch health terms collected from the Internet. We also evaluate new terms manually. The evaluation, using accuracy and figure of merit, indicates that our method improves the ranking and successfully assigns higher scores to new terms.
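As an illustration of the log-likelihood baseline named above, the following is a minimal sketch of how bigram association could be scored with the standard log-likelihood ratio over a 2x2 contingency table. It is not the authors' implementation and does not include the proposed known-term scoring, which the abstract does not specify. The function names `log_likelihood` and `rank_bigrams` and the toy Dutch bigrams are assumptions made for this example only.

```python
import math
from collections import Counter


def log_likelihood(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table.

    k11: count of the bigram (w1, w2)
    k12: count of bigrams (w1, other)
    k21: count of bigrams (other, w2)
    k22: count of bigrams (other, other)
    """
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = (
        (k11, rows[0], cols[0]),
        (k12, rows[0], cols[1]),
        (k21, rows[1], cols[0]),
        (k22, rows[1], cols[1]),
    )
    g2 = 0.0
    for observed, row_total, col_total in cells:
        if observed > 0:  # an empty cell contributes nothing to the sum
            expected = row_total * col_total / total
            g2 += observed * math.log(observed / expected)
    return 2.0 * g2


def rank_bigrams(bigrams):
    """Rank candidate bigrams by log-likelihood, highest score first."""
    pair_counts = Counter(bigrams)
    first_counts = Counter(w1 for w1, _ in bigrams)
    second_counts = Counter(w2 for _, w2 in bigrams)
    n = len(bigrams)
    scores = {}
    for (w1, w2), k11 in pair_counts.items():
        k12 = first_counts[w1] - k11
        k21 = second_counts[w2] - k11
        k22 = n - k11 - k12 - k21
        scores[(w1, w2)] = log_likelihood(k11, k12, k21, k22)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data, invented for illustration (not taken from the paper's corpora).
toy_bigrams = [("hoge", "bloeddruk")] * 5 + [
    ("de", "patient"),
    ("de", "arts"),
    ("een", "patient"),
    ("een", "arts"),
]
ranked = rank_bigrams(toy_bigrams)
for pair, score in ranked:
    print(pair, round(score, 2))
```

In this sketch, a bigram whose two words almost always occur together scores higher than bigrams built from words that combine freely. A known-term reweighting of the kind the paper proposes would be applied on top of such a ranking.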