More is not always better: balancing sense distributions for all-words Word Sense Disambiguation
نویسندگان
چکیده
Current Word Sense Disambiguation systems show an extremely poor performance on low frequent senses, which is mainly caused by the difference in sense distributions between training and test data. The main focus in tackling this problem has been on acquiring more data or selecting a single predominant sense and not necessarily on the meta properties of the data itself. We demonstrate that these properties, such as the volume, provenance, and balancing, play an important role with respect to system performance. In this paper, we describe a set of experiments to analyze these meta properties in the framework of a state-of-the-art WSD system when evaluated on the SemEval-2013 English all-words dataset. We show that volume and provenance are indeed important, but that approximating the perfect balancing of the selected training data leads to an improvement of 21 points and exceeds state-of-the-art systems by 14 points while using only simple features. We therefore conclude that unsupervised acquisition of training data should be guided by strategies aimed at matching meta properties.
منابع مشابه
رفع ابهام معنایی واژگان مبهم فارسی با مدل موضوعی LDA
Word sense disambiguation is the task of identifying the correct sense for the word in a given context among a finite set of possible sense. In this paper a model for farsi word sense disambiguation is presented. The model use two group of features: first, all word and stop words around target word and topic models as second features. We extract topics from a farsi corpus with Latent Dirichlet ...
متن کاملWord sense disambiguation criteria: a systematic study
This article describes the results of a systematic indepth study of the criteria used for word sense disambiguation. Our study is based on 60 target words: 20 nouns, 20 adjectives and 20 verbs. Our results are not always in line with some practices in the field. For example, we show that omitting noncontent words decreases performance and that bigrams yield better results than unigrams.
متن کاملDomain-Specific Sense Distributions and Predominant Sense Acquisition
Distributions of the senses of words are often highly skewed. This fact is exploited by word sense disambiguation (WSD) systems which back off to the predominant sense of a word when contextual clues are not strong enough. The domain of a document has a strong influence on the sense distribution of words, but it is not feasible to produce large manually annotated corpora for every domain of int...
متن کاملEvaluating the results of a memory-based word-expert approach to unrestricted word sense disambiguation
In this paper, we evaluate the results of the Antwerp University word sense disambiguation system in the English all words task of SENSEVAL-2. In this approach, specialized memory-based wordexperts were trained per word-POS combination. Through optimization by crossvalidation of the individual component classifiers and the voting scheme for combining them, the best possible word-expert was dete...
متن کاملExploring the Effect of Bag-of-words and Bag-of-bigram Features on Turkish Word Sense Disambiguation
Feature selection in Word Sense Disambiguation (WSD) is as important as the selection of algorithm to remove sense ambiguity. Bag-of-word (BoW) features comprise the information of neighbors around the ambiguous target word without considering any relation between words. In this study, we investigate the effect of BoW features and Bag-of-bigrams (BoB) on Turkish WSD and compare the results with...
متن کامل