An N-gram Frequency Database Reference to Handle MWE Extraction in NLP Applications
Abstract
The identification and extraction of Multiword Expressions (MWEs) currently deliver satisfactory results. However, integrating these results into a wider application remains an issue, mainly because the association measures (AMs) used to detect MWEs require a critical amount of data and because MWE dictionaries cannot account for all the lexical and syntactic variations inherent in MWEs. In this study, we use an alternative technique to overcome these limitations. It consists of defining an n-gram frequency database that can be used to compute AMs on the fly, allowing the extraction procedure to efficiently process all the MWEs in a text, even if they have not been previously observed.
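The idea of computing AMs on the fly from stored n-gram frequencies can be illustrated with a minimal sketch. The abstract does not say which association measure or storage backend is used, so this example assumes pointwise mutual information (PMI) over bigrams and an in-memory counter standing in for the database; the function names `build_ngram_db` and `pmi` and the toy reference text are hypothetical.

```python
import math
from collections import Counter


def build_ngram_db(tokens, max_n=2):
    """Count every n-gram of length 1..max_n in a list of tokens.

    Assumption: a Counter keyed by token tuples stands in for the
    n-gram frequency database described in the abstract.
    """
    db = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            db[tuple(tokens[i:i + n])] += 1
    return db


def pmi(db, w1, w2, total):
    """PMI of the bigram (w1, w2), or None if any required count is zero.

    Assumption: PMI is used here only as an example of an AM, with
    probabilities estimated as count / total tokens.
    """
    c12, c1, c2 = db[(w1, w2)], db[(w1,)], db[(w2,)]
    if not (c12 and c1 and c2):
        return None
    return math.log2(c12 * total / (c1 * c2))


# Toy reference data used to fill the frequency database.
reference = "the prime minister met the prime minister of the other country".split()
db = build_ngram_db(reference)
total = len(reference)

# Score the bigrams of a new text using only the stored frequencies.
text = "the prime minister spoke".split()
scores = {(a, b): pmi(db, a, b, total) for a, b in zip(text, text[1:])}
for bigram, score in scores.items():
    print(bigram, score)
```

In this sketch the new text is never counted itself: each candidate is scored from the reference frequencies alone, so a short input can still be scored. Bigrams absent from the reference data yield `None` rather than a score.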
Similar resources
Multiword Expression Identification with Tree Substitution Grammars: A Parsing tour de force with French
Multiword expressions (MWE), a known nuisance for both linguistics and NLP, blur the lines between syntax and semantics. Previous work on MWE identification has relied primarily on surface statistics, which perform poorly for longer MWEs and cannot model discontinuous expressions. To address these problems, we show that even the simplest parsing models can effectively identify MWEs of arbitrary ...
A Bio-Inspired Approach for Multi-Word Expression Extraction
This paper proposes a new approach for Multi-word Expression (MWE) extraction motivated by gene sequence alignment, because textual sequences are similar to gene sequences in pattern analysis. The theory of Longest Common Subsequence (LCS) originates from computer science and has been established as the affine gap model in Bioinformatics. We perform this developed LCS technique combined with linguis...
An evaluation of the role of statistical measures and frequency for MWE identification
We report on an experiment to evaluate the role of statistical association measures and frequency for the identification of MWE. We base our evaluation on a lexicon of 14,000 MWE comprising different types of word combinations: collocations, nominal compounds, light verbs + predicate, idioms, etc. These MWE were manually validated from a list of n-grams extracted from a 50 million word corpus o...
Fast and Flexible MWE Candidate Generation with the mwetoolkit
We present an experimental environment for computer-assisted extraction of Multiword Expressions (MWEs) from corpora. Candidate extraction works in two steps: generation and filtering. We focus on recent improvements in the former, for which we increased speed and flexibility. We present examples that show the potential gains for users and applications. 1 Project Description The mwetoolkit was p...
Extracting Multiword Expressions With A Semantic Tagger
Automatic extraction of multiword expressions (MWE) presents a tough challenge for the NLP community and corpus linguistics. Although various statistically driven or knowledge-based approaches have been proposed and tested, efficient MWE extraction still remains an unsolved issue. In this paper, we present our research work in which we tested approaching the MWE issue using a semantic field ann...