Comparing Lexicalized Treebank Grammars Extracted From Chinese, Korean, And English Corpora
Abstract
In this paper, we present a method for comparing Lexicalized Tree Adjoining Grammars extracted from annotated corpora for three languages: English, Chinese, and Korean. This method makes it possible to compare the syntactic structures of each language quantitatively, thereby providing a way of testing the Universal Grammar Hypothesis, the foundation of modern linguistic theories.
1 Introduction
The comparison of the grammars extracted from annotated corpora (i.e., Treebanks) is important on both theoretical and engineering grounds. Theoretically, it allows us to test the Universal Grammar Hypothesis quantitatively. One of the major concerns in modern linguistics is to establish an explanatory basis for the similarities and variations among languages. The working assumption is that languages of the world share a set of universal linguistic principles, and that the apparent structural differences attested among languages can be explained as variation in the way the universal principles are instantiated. Comparison of the extracted syntactic trees allows us to quantitatively evaluate how similar the syntactic structures of different languages are. From an engineering perspective, the extracted grammars and the links between the syntactic structures in the grammars are valuable resources for NLP applications, such as parsing, computational lexicon development, and machine translation (MT), to name a few.
In this paper we first briefly discuss some linguistic characteristics of English, Chinese, and Korean, and introduce the Treebanks for the three languages. We then describe a tool that extracts Lexicalized Tree Adjoining Grammars (LTAGs) from Treebanks and the results of its application to these three Treebanks. Next, we describe our methodology for automatic comparison of the extracted Treebank grammars. This consists primarily of matching syntactic structures (namely, templates and sub-templates) in each pair of Treebank grammars.
The ability to perform this type of comparison for different languages has a definite positive impact on the possibility of sorting out the universal versus language-dependent features of languages. Therefore, our grammar extraction tool is not only an engineering tool of great value in improving the efficiency and accuracy of grammar development, but it is also very useful for investigating theoretical linguistics.
2 Three Languages and Three Treebanks
In this section, we briefly discuss some linguistic characteristics of English, Chinese, and Korean, and introduce the Treebanks for these languages.
2.1 Three Languages
These three languages belong to different language families: English is Germanic, Chinese is Sino-Tibetan, and Korean is Altaic (Comrie, 1987). There are several major differences between these languages. First, both English
Similar resources
Automatically Extracting and Comparing Lexicalized Grammars for Different Languages
In this paper, we present a quantitative comparison between the syntactic structures of three languages: English, Chinese and Korean. This is made possible by first extracting Lexicalized Tree Adjoining Grammars from annotated corpora for each language and then performing the comparison on the extracted grammars. We found that the majority of the core grammar structures for these three language...
Extraction of Tree Adjoining Grammars from a Treebank for Korean
We present the implementation of a system which extracts not only lexicalized grammars but also feature-based lexicalized grammars from the Korean Sejong Treebank. We report on some practical experiments where we extract TAG grammars and tree schemata. Above all, full-scale syntactic tags and well-formed morphological analysis in the Sejong Treebank allow us to extract syntactic features. In addition, ...
Comparing and integrating Tree Adjoining Grammars
Grammars are core elements of many NLP applications. Grammars can be developed in two ways: built by hand or extracted from corpora. In this paper, we compare a handcrafted grammar with a Treebank grammar. We contend that recognizing substructures of the grammars' basic units is necessary tures and semantic information which are rarely represented in the corpora. It would be ideal if we could c...
Automated Extraction of Tags from the Penn Treebank
The accuracy of statistical parsing models can be improved with the use of lexical information. Statistical parsing using Lexicalized Tree Adjoining Grammar (LTAG), a kind of lexicalized grammar, has remained relatively unexplored. We believe that this is in large part due to the absence of large corpora accurately bracketed in terms of a perspicuous yet broad coverage LTAG. Our work attempts to a...
A Comparative Analysis of Extracted Grammars
The development of wide-coverage grammars is at the core of robust NLP systems. This paper addresses the problem of grammar extraction from treebanks with respect to the issue of broad coverage along three dimensions: the grammar formalism (context-free grammar, dependency grammar, lexicalized tree adjoining grammar), the domain of the annotated corpus (press reports, civil law) and the language...