FicTree: a Manually Annotated Treebank of Czech Fiction
Abstract
We present a manually annotated treebank of Czech fiction, intended to serve as an addendum to the Prague Dependency Treebank (PDT). The treebank has only 166,000 tokens, so it is not a good basis for training NLP tools on its own, but when added to the PDT training data, it can help improve the annotation of fiction texts. We describe the composition of the corpus and the annotation process, including inter-annotator agreement. On the newly created data and the PDT data, we performed a number of experiments with parsers (TurboParser, Parsito, MSTParser and MaltParser). We observe that extending the PDT training data with a part of the new treebank does improve the parsing of literary texts. We also investigate cases where the parsers agree on an annotation that differs from the manual one.
Similar resources
Studying Properties of Czech Complex Sentences from an Annotated Corpus
The paper deals with the problem of an analysis of complex sentences in Czech on the basis of manually annotated data. The availability of a specialized corpus explicitly describing mutual relationships between segments and clauses in Czech complex sentences, together with the availability of a thoroughly syntactically annotated corpus, the Prague Dependency Treebank, provide a solid background...
Two Tectogrammatical Realizers Side by Side: Case of English and Czech
We present a work in progress on a pair of morphosyntactic realizers sharing the same architecture. We provide a description of the input tree structures, describe our procedural approach for two typologically different languages, and finally present preliminary evaluation results obtained on a manually annotated treebank.
Designing CzeDLex - A Lexicon of Czech Discourse Connectives
We present a design for a new electronic lexicon of Czech discourse connectives. The data format and the annotation scheme are based on a study of similar existing resources, and we discuss arguments for choosing the data structure and selecting features of the lexicon entries. Special attention is paid to a consistent encoding of both primary and secondary connectives. The data itself comes ...
Anaphora in Czech: Large Data and Experiments with Automatic Anaphora Resolution
The aim of this paper is two-fold. First, we want to present a part of the annotation scheme of the Prague Dependency Treebank 2.0 related to the annotation of coreference on the tectogrammatical layer of sentence representation (more than 45,000 textual and grammatical coreference links in almost 50,000 manually annotated Czech sentences). Second, we report a new pronoun resolution system deve...
An exploitation of the Prague Dependency Treebank: a valency case
The Prague Dependency Treebank (PDT) is a manually annotated part of the Czech National Corpus (Čermák 1997). Its size is approx. 90,000 sentences, i.e. 1.5 million words (tokens). Three layers of annotation (Hajič 2002) are used: the morphological layer, where lemmas and tags are annotated, the analytical layer, which roughly corresponds to the surface (shallow) syntactic structure of the sent...