Persian in MULTEXT-East Framework

Authors

  • Behrang Q. Zadeh
  • Saeed Rahimi
Abstract

Farsi, also known as Persian, is the official language of Iran and Tajikistan and one of the two main languages spoken in Afghanistan. It is an Indo-European agglutinating language written in Arabic script. This paper presents the first step in creating a basic language resources kit for Farsi. This step comprises the specifications for morphosyntactic encoding, which are based on the EAGLES/MULTEXT model and the specific resources of MULTEXT-East. The paper introduces the language, with an emphasis on its writing system and morphological properties, and presents its specifications. It also introduces two other important issues: a novel part-of-speech (PoS) categorization and a unified orthography for Farsi in the digital environment. A lexicon and an annotated corpus are under preparation.

Similar resources

The First Parallel Multilingual Corpus of Persian: Toward a Persian BLARK

In this article, we introduce the first parallel corpus of Persian with more than 10 other European languages. This article describes primary steps toward preparing a Basic Language Resources Kit (BLARK) for Persian. Up to now, we have proposed a morphosyntactic specification of Persian based on the EAGLES/MULTEXT guidelines and specific resources of MULTEXT-East. The article introduces Persian ...

MULTEXT-East Version 4: Multilingual Morphosyntactic Specifications, Lexicons and Corpora

The paper presents the fourth, “Mondilex” edition of the MULTEXT-East language resources, a multilingual dataset for language engineering research and development, focused on the morphosyntactic level of linguistic description. This standardised and linked set of resources covers a large number of mainly Central and Eastern European languages and includes the EAGLES-based morphosyntactic specif...

MULTEXT-East Resources for Serbian

The paper presents the MULTEXT-East language resources for the Serbian language. MULTEXT-East is a multilingual dataset for language engineering research and development. This standardised and linked set of resources covers a large number of mainly Central and Eastern European languages and includes the EAGLES-based morphosyntactic specifications, defining the features that describe word-level s...

Multext-East: Parallel and Comparable Corpora and Lexicons for Six Central and Eastern European Languages

The EU Copernicus project Multext-East has created a multi-lingual corpus of text and speech data, covering the six languages of the project: Bulgarian, Czech, Estonian, Hungarian, Romanian, and Slovene. In addition, wordform lexicons for each of the languages were developed. The corpus includes a parallel component consisting of Orwell’s Nineteen Eighty-Four, with versions in all six languages...

OWL/DL formalization of the MULTEXT-East morphosyntactic specifications

This paper describes the modeling of the morphosyntactic annotations of the MULTEXT-East corpora and lexicons as an OWL/DL ontology. Formalizing annotation schemes in OWL/DL has the advantages of enabling formally specifying interrelationships between the various features and making logical inferences based on the relationships between them. We show that this approach provides us with a top-dow...

Publication date: 2006