Prague Dependency Treebank: Restoration of Deletions
Abstract
The use of the treebank as a resource for linguistic research has led us to look for an annotation scheme representing not only surface syntactic information (in ‘analytic trees’, ATSs) but also the underlying syntactic structure of sentences and at least some aspects of intersentential links (in ‘tectogrammatical tree structures’, TGTSs). In this paper we focus on some of the issues in the transduction of ATSs into TGTSs.

1 Two steps of syntactic tagging in PDT

In the Prague Dependency Treebank (PDT) project, the structure of sentences is made explicit by means of two steps of syntactic tagging, resulting in: (i) ‘analytic’ tree structures (ATSs), in which every word form and punctuation mark is represented as a node of the tree, and the edges of the tree correspond to (surface) syntactic dependency relations; and (ii) tectogrammatical tree structures (TGTSs), which correspond to underlying sentence representations and have the shape of dependency trees with the verb as the root of the tree (see footnote 1).

In TGTSs the functional (synsemantic) words (such as prepositions, auxiliaries, subordinating conjunctions) as well as punctuation marks are in principle not represented by nodes of their own; their functions are captured as parts of complex tags of the nodes standing for autosemantic (content) words.

Surface deletions are ‘restored’ in TGTSs. The syntactic information which is absent in the surface (morphemic) shape of the sentence is introduced, at least for the time being, in the manual phase of the transduction procedure ([Hajičová et al. 1998]), which translates ATSs into TGTSs in a ‘user-friendly’ environment. Every added (restored) node gets the index ELEX (if its antecedent is an expanded head node) or ELID (if this is not so). The added nodes always depend on their governors from the left-hand side, except for certain cases in coordinated constructions (cf. (2) below).
The work reported on in this paper has been supported by the grant of the Czech Ministry of Education VS 96/151 and by the Czech Grant Agency GAČR 405/96/K214.
1 With the exception of TGTSs for coordinated constructions, see below.
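The two operations described in the abstract, folding function words into the tags of content-word nodes and adding restored nodes marked ELEX or ELID, can be illustrated with a minimal Python sketch. It is not the PDT tooling: the `Node` class, the `FUNCTION_POS` label set, and the functions `collapse` and `restore` are hypothetical names introduced here, the restoration step is manual in the procedure described above, and the exception for coordinated constructions is not modelled. Inserting a restored node at the front of the list of dependents merely stands in for its left-hand placement.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical labels for function (synsemantic) words and punctuation.
FUNCTION_POS = {"PREP", "AUX", "SCONJ", "PUNCT"}


@dataclass
class Node:
    form: str
    pos: str
    children: List["Node"] = field(default_factory=list)
    tags: List[str] = field(default_factory=list)  # absorbed function words
    restored: Optional[str] = None                 # "ELEX", "ELID", or None


def collapse(node: Node) -> Node:
    """Copy a subtree, dropping function-word nodes and keeping them as tags.

    Assumes the root of the input tree is a content word (the verb).
    """
    out = Node(node.form, node.pos, tags=list(node.tags), restored=node.restored)
    for child in node.children:
        sub = collapse(child)
        if child.pos not in FUNCTION_POS:
            out.children.append(sub)
        elif sub.children:
            # A function word with dependents (e.g. a preposition): record it
            # on each dependent and attach the dependents to the governor.
            for grandchild in sub.children:
                grandchild.tags = [child.form, *sub.tags, *grandchild.tags]
                out.children.append(grandchild)
        else:
            # A leaf function word or punctuation mark: record it on the governor.
            out.tags.extend([child.form, *sub.tags])
    return out


def restore(governor: Node, form: str, pos: str,
            antecedent_is_expanded_head: bool) -> Node:
    """Add a restored node as the leftmost dependent of its governor."""
    label = "ELEX" if antecedent_is_expanded_head else "ELID"
    added = Node(form, pos, restored=label)
    governor.children.insert(0, added)
    return added


# Illustrative input: Czech "Pracuje v Praze." ("(He) works in Prague.")
ats = Node("Pracuje", "VERB", [
    Node("v", "PREP", [Node("Praze", "NOUN")]),
    Node(".", "PUNCT"),
])
tgts = collapse(ats)
# The choice of ELID for the unexpressed subject is only for illustration.
restore(tgts, "on", "PRON", antecedent_is_expanded_head=False)
print(tgts)
```

In this sketch the preposition "v" ends up as a tag on "Praze" and the full stop as a tag on the verb, so the resulting tree contains only content-word nodes plus the one restored node.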
Similar resources
Spanish Phoneme Classification by Means of a Hierarchy of Kohonen Self-Organizing Maps
Research Issues for the Next Generation Spoken Dialogue Systems; Data-Driven Analysis of Speech; Towards a Road Map for Machine Translation Research; The Prague Dependency Treebank: Crossing the Sentence Boundary; Text Tiered Tagging and Combined Language Models Classifiers; Syntactic Tagging; Information, Language, Corpus and Linguistics; Prague Dependency Tre...
Difference between Written and Spoken Czech: The Case of Verbal Nouns Denoting an Action
The present paper extends the understanding of differences in expressing actions by verbal nouns in corpora of written vs. spoken Czech, namely in the Czech part of the Prague Czech-English Dependency Treebank and in the Prague Dependency Treebank of Spoken Czech. We show that while the written corpus includes more complex noun phrases with more explicit expression of adnominal participants, noun ph...
Complex Corpus Annotation: The Prague Dependency Treebank
The Prague Dependency Treebank (Hajič et al., 2001) is approaching the publication of its second version, in which the tectogrammatical annotation is being added to the morphological and analytical (surface-syntactic) ones. This article describes the Prague Dependency Treebank as a whole, including its brief history. In this volume, there are three more papers with a detailed account...
Learning to Search in Prague Dependency Treebank
We present Netgraph, an easy-to-use tool for searching in linguistically annotated treebanks. Using several examples from the Prague Dependency Treebank, we introduce the features of the search language and show how to search for some frequent linguistic phenomena.
The Theory of Control Applied to the Prague Dependency Treebank (PDT)
One of the most difficult issues in corpus annotation on an underlying syntactic level is the restoration of nodes that are omitted in the surface shape of the sentence but present on the underlying or deep syntactic level. In the present paper we concentrate on the type of nodes that are omitted due to the phenomenon usually called grammatical control with regard to their respective anaph...