Harmonizing Tagsets for Multilingual Corpora via Concept Lattice

Abstract

Multilingual corpora can be annotated with morphosyntactic tags by monolingual tools. However, each tool is typically bundled with its own tagset. This variety of tagging schemes can be a problem for the user: InterCorp, a parallel corpus, currently offers online concordances in 22 languages, 11 of them tagged with 11 different tagsets. Fig. 1 illustrates the tagset variety using comparable examples of prepositional phrases in all 11 of the currently tagged languages.

We aim for a solution that delegates the task of dealing with multiple tagsets to the system, allowing the user to interact with an abstract interlingual hierarchy of linguistic categories. To reflect the differences between the various tagsets, the common “tagset” takes three different perspectives on word class: lexical, inflectional and syntactic. Thus, the tag for the Czech relative pronoun který ‘which’ is decoded as a category with the properties of a lexical pronoun, an inflectional adjective and a syntactic noun, each with its appropriate morphological characteristics. Tags in all tagsets can be described as objects with properties, so the methods of Formal Concept Analysis [2] can be used to construct the hierarchy automatically as a concept lattice. The same methods can (partially) resolve tag queries that do not quite match the tags used for a specific language, in a way similar to that employed by Janssen [3] for dealing with lexical gaps in a multilingual lexical database.

This is certainly not the first attempt to design an interlingual representation of linguistic categories in the context of multilingual corpora. We wish to mention at least MULTEXT-East [4], whose tagging scheme became a de facto standard for inflectional languages, and Interset,
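The idea of treating tags as objects with properties and deriving a concept lattice from them can be illustrated with a small sketch. This is not the authors' implementation: the tag names, the property names (`lex:`, `infl:`, `syn:` prefixes) and the `resolve` heuristic are hypothetical, chosen only to mirror the three perspectives on word class mentioned in the abstract. The sketch enumerates formal concepts by brute force, which is adequate only for toy contexts.

```python
from itertools import combinations

# Hypothetical toy context: each tag (object) maps to its set of properties.
# Tag and property names are illustrative assumptions, not taken from InterCorp.
CONTEXT = {
    "cs:relative_pronoun": frozenset({"lex:pronoun", "infl:adjective", "syn:noun"}),
    "cs:noun": frozenset({"lex:noun", "infl:noun", "syn:noun"}),
    "en:noun": frozenset({"lex:noun", "syn:noun"}),
    "en:adjective": frozenset({"lex:adjective", "syn:adjective"}),
}


def extent(attrs, context):
    """Return the tags that have every property in attrs."""
    return frozenset(tag for tag, props in context.items() if attrs <= props)


def intent(tags, context):
    """Return the properties shared by every tag in tags."""
    if not tags:
        # By convention, the empty set of objects shares all properties.
        return frozenset().union(*context.values())
    return frozenset.intersection(*(context[tag] for tag in tags))


def concepts(context):
    """Enumerate all formal concepts as (extent, intent) pairs.

    Brute force over every subset of tags: exponential, toy-sized input only.
    """
    found = set()
    tags = list(context)
    for size in range(len(tags) + 1):
        for subset in combinations(tags, size):
            shared = intent(frozenset(subset), context)
            found.add((extent(shared, context), shared))
    return found


def resolve(query, lang, context):
    """Relax a property query until it matches some tag of the given language.

    Assumed heuristic: pick the concept with the largest intent that is
    contained in the query and whose extent includes a tag of `lang`.
    Ties between equally specific concepts are not handled here.
    """
    query = frozenset(query)
    prefix = lang + ":"
    best = None
    for ext, shared in concepts(context):
        tags = sorted(tag for tag in ext if tag.startswith(prefix))
        if shared <= query and tags:
            if best is None or len(shared) > len(best[0]):
                best = (shared, tags)
    return best[1] if best else []


if __name__ == "__main__":
    # A query phrased in Czech-style terms, resolved against English tags.
    print(resolve({"lex:noun", "infl:noun", "syn:noun"}, "en", CONTEXT))
```

In this toy context no English tag carries `infl:noun`, so the query is relaxed to the concept with intent `{lex:noun, syn:noun}`, whose extent contains `en:noun`. Real tagsets would need an efficient lattice construction algorithm rather than subset enumeration.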


Similar resources

Morphological Tags in Parallel Corpora

Multilingual parallel corpora can be annotated with morphosyntactic tags by monolingual tools, freely available for a number of different languages. However, each of the tools is typically bundled with a specific tagset and assumes a specific way of tokenization. The variety of tagging schemes and tag formats may be a problem for the user: a relatively simple tag query in a multilingual setting...


Determination of Similarity of Information Entities Based on Implicit User Preferences in Life-Support Recommender Systems

The aim of this work is to describe a method for determining the similarity of information entities by analysing data on user preferences. The method implements the Item-Item CF approach (collaborative filtering based on the similarity of information entities), which in turn is one of the most popular approaches to building modern recommender systems. The input...


On the Art of Taming and Exploiting Parallel Tags in a Multilingual Corpus

Multilingual parallel corpora can be annotated with monolingual tools, such as morphosyntactic taggers. However, even taggers for typologically similar languages use incompatible tagsets, which results in a conceptual and formal variety of tags. Retraining taggers on data annotated with a common tagset is not a realistic option. However, differences between tagsets are often rooted in different...


Search for Hierarchical Stellar Systems of Maximal Multiplicity

In the astrophysics of hierarchical multiple stellar systems, there is a contradiction between their maximal observed multiplicity (6-7) and the theoretical limit on this value (up to five hundred). To search for hierarchical systems of high multiplicity, modern catalogues of both wide and close pairs were analysed. The result of the work is a list of objects that are candidates for stellar systems of maxim...



Publication date: 2010