Automatic Idiom Identification in Wiktionary
Authors
Abstract
Online resources, such as Wiktionary, provide an accurate but incomplete source of idiomatic phrases. In this paper, we study the problem of automatically identifying idiomatic dictionary entries with such resources. We train an idiom classifier on a newly gathered corpus of over 60,000 Wiktionary multi-word definitions, incorporating features that model whether phrase meanings are constructed compositionally. Experiments demonstrate that the learned classifier can provide high-quality idiom labels, more than doubling the number of idiomatic entries from 7,764 to 18,155 at precision levels of over 65%. These gains also translate to idiom detection in sentences, by simply using known word sense disambiguation algorithms to match phrases to their definitions. In a set of Wiktionary definition example sentences, the more complete set of idioms boosts detection recall by over 28 percentage points.
Similar resources
Wiktionary as a source for automatic pronunciation extraction
In this paper, we analyze whether dictionaries from the World Wide Web that contain phonetic notations may support the rapid creation of pronunciation dictionaries within the speech recognition and speech synthesis system building process. As a representative dictionary, we selected Wiktionary [1] since it is at hand in multiple languages and, in addition to the definitions of the words, many...
Semi-automatic enrichment of crowdsourced synonymy networks: the WISIGOTH system applied to Wiktionary
Semantic lexical resources are a mainstay of various Natural Language Processing applications. However, comprehensive and reliable resources are rare and not often freely available. Handcrafted resources are too costly for being a general solution while automatically-built resources need to be validated by experts or at least thoroughly evaluated. We propose in this paper a picture of the curre...
A New Approach for Idiom Identification Using Meanings and the Web
There is a great deal of knowledge available on the Web, which represents a great opportunity for automatic, intelligent text processing and understanding, but the major problems are finding the legitimate sources of information and the fact that search engines provide page statistics not occurrences. This paper presents a new, domain independent, general-purpose idiom identification approach. ...
Construction of an Idiom Corpus and its Application to Idiom Identification based on WSD Incorporating Idiom-Specific Features
Some phrases can be interpreted either idiomatically (figuratively) or literally in context, and the precise identification of idioms is indispensable for full-fledged natural language processing (NLP). To this end, we have constructed an idiom corpus for Japanese. This paper reports on the corpus and the results of an idiom identification experiment using the corpus. The corpus targets 146 amb...
Automatic Error Recovery for Pronunciation Dictionaries
In this paper, we present our latest investigations on pronunciation modeling and its impact on ASR. We propose completely automatic methods to detect, remove, and substitute inconsistent or flawed entries in pronunciation dictionaries. The experiments were conducted on different tasks, namely (1) word-pronunciation pairs from the Czech, English, French, German, Polish, and Spanish Wiktionary [...