Creation of a Tagged Corpus for Less-Processed Languages with CLaRK System

Authors

Kiril Simov

Petya Osenova

Alexander Simov

Krasimira Ivanova

Ilko Grigorov

Hristo Ganev

Abstract

This paper addresses the problem of compiling resources efficiently for less-processed languages. It presents a strategy for creating a morpho-syntactically tagged corpus for such languages. Because human languages are morphologically non-homogeneous, we focus mainly on inflecting ones; with certain modifications, the model can be applied to the other types as well. The strategy is described within a particular implementation environment, the CLaRK System. First, the general architecture of the software is described. Then the usual steps towards the creation of the language resource are outlined. Finally, the concrete implementation properties of the processing steps within CLaRK are discussed: text archive compilation, tokenization, frequency word list creation, morphological lexicon creation, morphological analyzer, and semi-automatic disambiguation.
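As a rough illustration of two of the processing steps named in the abstract, tokenization and frequency word list creation, the following Python sketch may help. It does not use CLaRK or reproduce the authors' implementation; the function names `tokenize` and `frequency_list` are hypothetical, and the sketch assumes that a token is simply a maximal run of letters.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens.

    Assumption: a token is a maximal run of Unicode letters;
    digits, punctuation and underscores act as separators.
    """
    return [t.lower() for t in re.findall(r"[^\W\d_]+", text)]


def frequency_list(texts):
    """Build a word frequency list from an iterable of raw texts.

    Returns (word, count) pairs sorted by descending count,
    with ties broken alphabetically.
    """
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    archive = ["The cat saw the dog.", "The dog ran."]
    for word, count in frequency_list(archive):
        print(word, count)
```

A list of this kind shows which word forms occur most often in the text archive, which is one plausible way to decide which forms to add to a morphological lexicon first.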

Similar resources

Corpus based coreference resolution for Farsi text

"Coreference resolution", or "finding all expressions that refer to the same entity" in a text, is one of the important requirements in natural language processing. Two words are coreferent when both refer to a single entity in the text or the real world. So the main task of coreference resolution systems is to identify terms that refer to a unique entity. A coreference resolution tool could be...

A'lam Corpus: A Standard Named Entity Corpus for the Persian Language

Named entity recognition (NER) is a natural language processing (NLP) problem that is mainly used in text summarization, data mining, data retrieval, question answering, machine translation, and document classification systems. A NER system is tasked with determining the boundaries of each named entity, recognizing its type, and classifying it into predefined categories. The categories of named...

PAYMA: A Tagged Corpus of Persian Named Entities

The goal in the named entity recognition task is to classify proper nouns of a piece of text into classes such as person, location, and organization. Named entity recognition is an important preprocessing step in many natural language processing tasks such as question-answering and summarization. Although many research studies have been conducted in this area in English and the state-of-the-art...

Development of Bengali Named Entity Tagged Corpus and its Use in NER Systems

The rapid development of language tools using machine learning techniques for less computerized languages requires an appropriately tagged corpus. A Bengali news corpus has been developed from the web archive of a widely read Bengali newspaper. A web crawler retrieves the web pages in Hyper Text Markup Language (HTML) format from the news archive. At present, the corpus contains approximately 34 m...

CLaRK - an XML-based System for Corpora Development

In this paper we describe the architecture and the intended applications of the CLaRK System. The development of the CLaRK System started under the Tübingen-Sofia International Graduate Programme in Computational Linguistics and Represented Knowledge (CLaRK). The main aim behind the design of the system is the minimization of human intervention during the creation of corpora. Creation of corpo...

Publication date: 2004
