The Penn Arabic Treebank: Building a Large-Scale Annotated Arabic Corpus

نویسندگان

Mohamed Maamouri

Ann Bies

Tim Buckwalter

Wigdan Mekki

چکیده

From our three year experience of developing a large-scale corpus of annotated Arabic text, our paper will address the following: (a) review pertinent Arabic language issues as they relate to methodology choices, (b) explain our choice to use the Penn English Treebank style of guidelines, (requiring the Arabic-speaking annotators to deal with a new grammatical system) rather than doing the annotation in a more traditional Arabic grammar style (requiring NLP researchers to deal with a new system); (c) show several ways in which human annotation is important and automatic analysis difficult, including the handling of orthographic ambiguity by both the morphological analyzer and human annotators; (d) give an illustrative example of the Arabic Treebank methodology, focusing on a particular construction in both morphological analysis and tagging and syntactic analysis and following it in detail through the entire annotation process, and finally, (e) conclude with what has been achieved so far and what remains to be done.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Developing An Arabic Treebank: Methods, Guidelines, Procedures, And Tools

In this paper we address the following questions from our experience of the last two and a half years in developing a large-scale corpus of Arabic text annotated for morphological information, part-of-speech, English gloss, and syntactic structure: (a) How did we ‘leapfrog’ through the stumbling blocks of both methodology and training in setting up the Penn Arabic Treebank (ATB) annotation? (b)...

متن کامل

Extracting a Tree Adjoining Grammar from the Penn Arabic Treebank

Much progress in natural language processing (NLP) over the last decade has come from the combination of using corpora of annotated naturally occurring text along with machine learning algorithms. Following this trend, corpora have been created for other languages, such as the Penn Arabic Treebank (PATB) (Maamouri et al.2003). However, the corpora almost invariably need to reinterpreted for the...

متن کامل

Penn Discourse Treebank: Building a Large Scale Annotated Corpus Encoding DLTAG-based Discourse Structure and Discourse Relations

Large scale annotated corpora have played a critical role in speech and natural language research. However, while existing annotated corpora such as the Penn Treebank have been highly successful at the sentence-level, we also need large-scale annotated resources that reliably encode key aspects of discourse. In this paper, we detail (1) our plans for building the Penn Discourse Treebank (PDTB),...

متن کامل

Arabic Probabilistic Context Free Grammar Induction from a Treebank

Linguistic resources are very important to any natural language processing task. Unfortunately, the manual construction of these resources is laborious and time-consuming. The use of annotated corpora as a knowledge database might be a solution to a fast construction of a grammar for a given language. In this paper, we present our method to automatically induce a syntactic grammar from an Arabi...

متن کامل

A Corpus for Modeling Morpho-Syntactic Agreement in Arabic: Gender, Number and Rationality

We present an enriched version of the Penn Arabic Treebank (Maamouri et al., 2004), where latent features necessary for modeling morpho-syntactic agreement in Arabic are manually annotated. We describe our process for efficient annotation, and present the first quantitative analysis of Arabic morphosyntactic phenomena.

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره شماره

صفحات -

تاریخ انتشار 2004

The Penn Arabic Treebank: Building a Large-Scale Annotated Arabic Corpus

نویسندگان

چکیده

منابع مشابه

Developing An Arabic Treebank: Methods, Guidelines, Procedures, And Tools

Extracting a Tree Adjoining Grammar from the Penn Arabic Treebank

Penn Discourse Treebank: Building a Large Scale Annotated Corpus Encoding DLTAG-based Discourse Structure and Discourse Relations

Arabic Probabilistic Context Free Grammar Induction from a Treebank

A Corpus for Modeling Morpho-Syntactic Agreement in Arabic: Gender, Number and Rationality

عنوان ژورنال:

اشتراک گذاری