The Penn Arabic Treebank: Building a Large-Scale Annotated Arabic Corpus
نویسندگان
چکیده
From our three year experience of developing a large-scale corpus of annotated Arabic text, our paper will address the following: (a) review pertinent Arabic language issues as they relate to methodology choices, (b) explain our choice to use the Penn English Treebank style of guidelines, (requiring the Arabic-speaking annotators to deal with a new grammatical system) rather than doing the annotation in a more traditional Arabic grammar style (requiring NLP researchers to deal with a new system); (c) show several ways in which human annotation is important and automatic analysis difficult, including the handling of orthographic ambiguity by both the morphological analyzer and human annotators; (d) give an illustrative example of the Arabic Treebank methodology, focusing on a particular construction in both morphological analysis and tagging and syntactic analysis and following it in detail through the entire annotation process, and finally, (e) conclude with what has been achieved so far and what remains to be done.
منابع مشابه
Developing An Arabic Treebank: Methods, Guidelines, Procedures, And Tools
In this paper we address the following questions from our experience of the last two and a half years in developing a large-scale corpus of Arabic text annotated for morphological information, part-of-speech, English gloss, and syntactic structure: (a) How did we ‘leapfrog’ through the stumbling blocks of both methodology and training in setting up the Penn Arabic Treebank (ATB) annotation? (b)...
متن کاملExtracting a Tree Adjoining Grammar from the Penn Arabic Treebank
Much progress in natural language processing (NLP) over the last decade has come from the combination of using corpora of annotated naturally occurring text along with machine learning algorithms. Following this trend, corpora have been created for other languages, such as the Penn Arabic Treebank (PATB) (Maamouri et al.2003). However, the corpora almost invariably need to reinterpreted for the...
متن کاملPenn Discourse Treebank: Building a Large Scale Annotated Corpus Encoding DLTAG-based Discourse Structure and Discourse Relations
Large scale annotated corpora have played a critical role in speech and natural language research. However, while existing annotated corpora such as the Penn Treebank have been highly successful at the sentence-level, we also need large-scale annotated resources that reliably encode key aspects of discourse. In this paper, we detail (1) our plans for building the Penn Discourse Treebank (PDTB),...
متن کاملArabic Probabilistic Context Free Grammar Induction from a Treebank
Linguistic resources are very important to any natural language processing task. Unfortunately, the manual construction of these resources is laborious and time-consuming. The use of annotated corpora as a knowledge database might be a solution to a fast construction of a grammar for a given language. In this paper, we present our method to automatically induce a syntactic grammar from an Arabi...
متن کاملA Corpus for Modeling Morpho-Syntactic Agreement in Arabic: Gender, Number and Rationality
We present an enriched version of the Penn Arabic Treebank (Maamouri et al., 2004), where latent features necessary for modeling morpho-syntactic agreement in Arabic are manually annotated. We describe our process for efficient annotation, and present the first quantitative analysis of Arabic morphosyntactic phenomena.
متن کامل