Abstract
In this paper we present a novel approach to minimally supervised synonym extraction.
The approach is based on word embeddings and aims to provide a
method for synonym extraction that is extensible to various languages.
We report experiments with word vectors trained using both the continuous
bag-of-words model (CBoW) and the skip-gram model (SG), investigating the effects
of different settings for the contextual window size, the number of dimensions,
and the type of word vectors. We analyze the word categories that are (cosine)
similar in the vector space, showing that cosine similarity on its own is a poor indicator
of whether two words are synonymous. In this context, we propose a new measure,
relative cosine similarity, for calculating similarity relative to other cosine-similar
words in the corpus. We show that calculating similarity relative to other words boosts
the precision of the extraction. We also experiment with combining similarity scores
from differently-trained vectors and explore the advantages of using a part-of-speech
tagger as a way of introducing some light supervision, thus aiding extraction.
We perform both intrinsic and extrinsic evaluation of our final system: intrinsic
evaluation is carried out manually by two human evaluators, and for extrinsic
evaluation we use the output of our system in a machine translation task, showing
that the extracted synonyms improve the evaluation metric.
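
To illustrate the idea of scoring similarity relative to other cosine-similar words, the following is a minimal sketch and not the exact definition used in the paper. It assumes that the relative score of a candidate is its cosine similarity to the target word divided by the sum of the cosine similarities of the target's n most similar words. The function names, the parameter n, and the toy two-dimensional vectors are hypothetical and serve illustration only.

```python
import math


def cosine(u, v):
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def relative_cosine_similarity(vectors, word, candidate, n=10):
    """Cosine similarity of `candidate` to `word`, normalized by the summed
    cosine similarity of the n words most similar to `word`.

    Assumption: this normalization is one plausible reading of "similarity
    relative to other cosine-similar words"; it also assumes the summed
    top-n similarity is positive.
    """
    # Similarities of every other vocabulary word to the target, best first.
    sims = sorted(
        (cosine(vectors[word], vec)
         for other, vec in vectors.items() if other != word),
        reverse=True,
    )
    top_n_total = sum(sims[:n])
    return cosine(vectors[word], vectors[candidate]) / top_n_total


# Made-up toy vectors, not trained embeddings.
toy_vectors = {
    "big": [1.0, 0.0],
    "large": [0.9, 0.1],
    "small": [0.6, 0.8],
    "tiny": [0.0, 1.0],
}

rcs_large = relative_cosine_similarity(toy_vectors, "big", "large", n=2)
rcs_small = relative_cosine_similarity(toy_vectors, "big", "small", n=2)
print(rcs_large, rcs_small)
```

Under this reading, the scores of the top n neighbours of a word sum to one, so a candidate that stands out from the other neighbours receives a visibly larger share than it would in a neighbourhood where all words are about equally close.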