Extraction and Recognition of Polish Multiword Expressions using Wikipedia and Finite-State Automata
Abstract
Linguistic resources for Polish are often missing multiword expressions (MWEs): idioms, compound nouns and other expressions which have their own distinct meaning as a whole. This paper describes an effort to extract and recognize nominal MWEs in Polish text using Wikipedia, inflection dictionaries and finite-state automata. Wikipedia is used as a lexicon of MWEs and as a corpus annotated with links to articles. Incoming links for each article are used to determine the inflection pattern of the headword; this approach helps eliminate invalid inflected forms. The goal is to recognize known MWEs as well as to find more expressions sharing a similar grammatical structure and occurring in similar contexts.
Similar resources
Terminology Finite-State Preprocessing for Computational LFG
This paper presents a technique to deal with multiword nominal terminology in a computational Lexical Functional Grammar. This method treats multiword terms as single tokens by modifying the preprocessing stage of the grammar (tokenization and morphological analysis), which consists of a cascade of two-level finite-state automata (transducers). We present here how we build the transducers to ta...
Discriminative Strategies to Integrate Multiword Expression Recognition and Parsing
The integration of multiword expressions in a parsing procedure has been shown to improve accuracy in an artificial context where such expressions have been perfectly pre-identified. This paper evaluates two empirical strategies to integrate multiword units in a real constituency parsing context and shows that the results are not as promising as has sometimes been suggested. Firstly, we show th...
Recognition of Polish Temporal Expressions
In this article we present the results of recent research on the recognition of Polish temporal expressions. The temporal information extracted from text plays a major role in many information extraction systems, such as question answering, event recognition or discourse analysis. We prepared a broad description of Polish temporal expressions, called PLIMEX. It is based on the state-of-the-ar...
Reduction of Computational Complexity in Finite State Automata Explosion of Networked System Diagnosis (RESEARCH NOTE)
This research proposes rough finite-state automata represented by two variants of BDD, ROBDD and ZBDD. The proposed structures have been used in networked system diagnosis and can overcome combinatorial explosion. The implementation uses the CUDD (Colorado University Decision Diagrams) package. A mathematical proof of the claimed complexity is provided, which shows ZBDD ...
Multiword Expressions and Named Entities in the Wiki50 Corpus
Multiword expressions (MWEs) and named entities (NEs) exhibit unique and idiosyncratic features; thus, they often pose a problem for NLP systems. In order to facilitate their identification, we developed the first corpus of Wikipedia articles in which several types of multiword expressions and named entities are manually annotated at the same time. The corpus can be used for training or testing M...
Publication date: 2016