Human Judgements in Parallel Treebank Alignment
Abstract
We have built a parallel treebank that includes word and phrase alignment. The alignment information was manually checked using a graphical tool that allows the annotator to view a pair of trees from parallel sentences. We found the compilation of clear alignment guidelines to be a difficult task. However, experiments with a group of students have shown that we are on the right track, with up to 89% overlap between the student annotation and our own. At the same time, these experiments have helped us to pinpoint the weaknesses in the guidelines, many of which concerned unclear rules related to differences in grammatical forms between the languages.
Similar resources
Multilingual Aligned Parallel Treebank Corpus Reflecting Contextual Information And Its Applications
This paper describes Japanese-English-Chinese aligned parallel treebank corpora of newspaper articles. They have been constructed by translating each sentence in the Penn Treebank and the Kyoto University text corpus into a corresponding natural sentence in a target language. Each sentence is translated so as to reflect its contextual information and is annotated with morphological and syntacti...
Creating Arabic-English Parallel Word-Aligned Treebank Corpora at LDC
This contribution describes an Arabic-English parallel word aligned treebank corpus from the Linguistic Data Consortium that is currently under production. Herein we primarily focus on efforts required to assemble the package and instructions for using it. It was crucial that word alignment be performed on tokens produced during treebanking to ensure cohesion and greater utility of the corpus. ...
Building the multilingual TUT parallel treebank
The paper introduces an ongoing project for the development of a parallel treebank for Italian, English and French annotated in the pure dependency format of the Turin University Treebank, i.e. Parallel–TUT. We hypothesize that the major features of this annotation format can be of some help in addressing the typical issues related to parallel corpora, e.g. alignment at various levels. Therefor...
The Parallel-TUT: a multilingual and multiformat treebank
The paper introduces an ongoing project for the development of a parallel treebank for Italian, English and French, i.e. Parallel–TUT, or simply ParTUT. For the development of this resource, both the dependency and constituency-based formats of the Italian Turin University Treebank (TUT) have been applied to a preliminary dataset, which includes the whole text of the Universal Declaration of Hu...
Alignment Tools for Parallel Treebanks
This paper reports on our efforts in creating a trilingual parallel treebank. The focal points are consistency checking and all aspects of sub-sentential alignment. We discuss the alignment guidelines, the importance of quality checks, and special alignment problems. Then we look at alignment algorithms and alignment visualization tools and we compare our own TreeAligner with other alignmen...