Abstract

Identifying appropriate text tokens (words or sequences of words that represent concepts) is one of the most important tasks in text preprocessing and can strongly influence the final results of text analysis. In this paper, we introduce a new approach to discovering compound nouns, including proper compound nouns. Our approach combines data mining methods with shallow lexical analysis. We propose a simple pattern language for specifying the grammatical patterns that extracted compound nouns must satisfy. Our method requires the words to be annotated with part-of-speech tags, so to that extent it is language-dependent. Building on the GSP data mining algorithm, we propose T-GSP, a modification of it for extracting frequent text patterns, in particular frequent word sequences that satisfy given grammatical rules. The resulting sequences are regarded as candidates for compound nouns. Our experiments indicate that the method produces high-quality results.
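To make the idea of "frequent word sequences that satisfy given grammatical rules" concrete, here is a minimal Python sketch. It is not the T-GSP algorithm or the paper's pattern language: it is a simplified stand-in that counts contiguous word n-grams whose part-of-speech sequence matches a pattern. The tagset (`ADJ`, `NOUN`), the pattern "adjectives or nouns followed by a noun", and the names `tag_code` and `frequent_candidates` are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed grammatical pattern: one or more adjectives/nouns followed by a noun.
# Tags are encoded as single letters (A, N, O) so a plain regex can be used.
PATTERN = re.compile(r"[AN]+N")


def tag_code(tag):
    # Map an assumed coarse tagset to single letters; anything else is "O".
    return {"ADJ": "A", "NOUN": "N"}.get(tag, "O")


def frequent_candidates(sentences, min_support=2, max_len=4):
    """Return word n-grams whose tag sequence fully matches PATTERN.

    Each sentence is a list of (word, tag) pairs. Support is the number of
    sentences that contain the n-gram at least once.
    """
    support = Counter()
    for sent in sentences:
        codes = "".join(tag_code(tag) for _, tag in sent)
        seen = set()  # count each n-gram once per sentence
        for n in range(2, max_len + 1):
            for i in range(len(sent) - n + 1):
                if PATTERN.fullmatch(codes[i:i + n]):
                    seen.add(tuple(word.lower() for word, _ in sent[i:i + n]))
        support.update(seen)
    return {seq: count for seq, count in support.items() if count >= min_support}


if __name__ == "__main__":
    corpus = [
        [("Data", "NOUN"), ("mining", "NOUN"), ("is", "VERB"), ("useful", "ADJ")],
        [("We", "PRON"), ("use", "VERB"), ("data", "NOUN"), ("mining", "NOUN"),
         ("methods", "NOUN")],
        [("Shallow", "ADJ"), ("lexical", "ADJ"), ("analysis", "NOUN")],
    ]
    print(frequent_candidates(corpus))
```

A GSP-style miner would instead grow candidates level by level, extending only sequences that are already frequent; the brute-force enumeration above is used only to keep the sketch short.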
