A Registry of Standard Data Categories for Linguistic Annotation
Abstract
In this paper we describe the most recent work within ISO TC37/SC 4, and in particular the development of a Data Category Registry (DCR) component of the Linguistic Annotation Framework. The DCR will contain a formally defined set of linguistic categories in common use within the language engineering community for reference and use in linguistically annotated resources. We outline the first proposals for creation and management of the DCR, as a solicitation for input from the community.
Similar resources
An annotation scheme for Persian based on Autonomous Phrases Theory and Universal Dependencies
A treebank is a corpus with linguistic annotations above the level of the parts of speech. During the first half of the present decade, three treebanks have been developed for Persian either originally or subsequently based on dependency grammar: Persian Treebank (PerTreeBank), Persian Syntactic Dependency Treebank, and Uppsala Persian Dependency Treebank (UPDT). The syntactic analysis of a sen...
Cross-linguistic annotation of modality: a data-driven hierarchical model
We present an annotation model of modality which is (i) cross-linguistic, relying on a wide, strongly typologically motivated approach, and (ii) hierarchical and layered, accounting for both factuality and speaker’s attitude, while modelling these two aspects through separate annotation schemes. Modality is defined through cross-linguistic categories, but the classification of actual linguistic...
Automatic Annotation and Information Retrieval
Textual information extraction, and particularly the extraction of information from web-based text, requires the annotation of a great number of documents very quickly, using standard categories: syntactical, grammatical (identifying tenses and aspects), lexical (the identification of "transfer" verbs, "donation" verbs, "localization" verbs …) and communicative categories (identifying rel...
Methodological Aspects of Semantic Annotation
This paper constitutes a preliminary report on the work carried out on semantic content annotation in the LIRICS project, in close collaboration with the activities of ISO TC 37/SC 4/TDG 3. This consists primarily of: (1) identifying commonalities in alternative approaches to the annotation and representation of various types of semantic information; and (2) developing methodological principles...
ISOcat: Corralling Data Categories in the Wild
To achieve true interoperability for valuable linguistic resources, different levels of variation need to be addressed. ISO Technical Committee 37, Terminology and other language and content resources, is developing a Data Category Registry. This registry will provide a reusable set of data categories. A new implementation, dubbed ISOcat, of the registry is currently under construction. This pap...