Which XML Standards for Multilevel Corpus Annotation?
Abstract
The paper attempts to answer the question: Which XML standard(s) should be used for multilevel corpus annotation? Various more or less specific standards and best practices are reviewed: TEI P5, XCES, work within ISO TC 37 / SC 4, TIGER-XML and PAULA. The conclusion of the paper is that the approach with the best claim to following text encoding standards consists in 1) using TEI-conformant schemata that are 2) designed in a way compatible with other standards and data models.
Similar resources
Discontinuous Constituents: a Problematic Case for Parallel Corpora Annotation and Querying
In this paper, we discuss some linguistic phenomena that pose potential problems for multilevel linguistic annotation of parallel corpora in general, and specifically for data encoding with state-of-the-art multilevel corpus querying tools such as CQP. We describe the strategy we use for integrating the standard hierarchical XML representation used to annotate such phenomena in our aligned bilingual...
A Framework for Multilevel Linguistic Annotations
This article presents a 3-step model for multilayer annotations of corpora. Each kind of annotation for a textual corpus corresponds to a different view on the same document. This principle can be expressed first with a general relational model dedicated to the organisation of LR. This abstract model is then implemented as an application of the XML formalism for the encoding of large corpora. The ...
Representing and Accessing Multilevel Linguistic Annotation using the MEANING Format
We present an XML annotation format (MEANING Annotation Format, MAF) specifically designed to represent and integrate different levels of linguistic annotations and a tool that provides flexible access to them (MEANING Browser). We describe our experience in integrating linguistic annotations coming from different sources, and the solutions we adopted to implement efficient access to corpora an...
Query Language for Access to Speech Corpora
With more and more speech corpora at hand, the unit selection technique is a promising approach in concatenative speech synthesis. What is missing are models of optimal parameters that sufficiently describe utterances to be produced and their corresponding counterparts in collections of speech data. Prior to this, existing corpora have to be annotated on possibly relevant linguistic and signal l...
A Framework For Annotating Information Structure In Discourse
We present a framework for the integrated analysis of the textual and prosodic characteristics of information structure in the Switchboard corpus of conversational English. Information structure describes the availability, organisation and salience of entities in a discourse model. We present standards for the annotation of information status (old, mediated and new), and give guidelines for ann...