IGT-XML: An XML Format for Interlinearized Glossed Text
نویسندگان
چکیده
We propose a new XML format for representing interlinearized glossed text (IGT), particularly in the context of the documentation and description of endangered languages. The proposed representation, which we call IGT-XML, builds on previous models but provides a more loosely coupled and flexible representation of different annotation layers. Designed to accommodate both selective manual reannotation of individual layers and semi-automatic extension of annotation, IGT-XML is a first step toward partial automation of the production of IGT.
منابع مشابه
Xigt: extensible interlinear glossed text for natural language processing
This paper presents Xigt, an extensible storage format for interlinear glossed text (IGT). We review design desiderata for such a format based on our own use cases as well as general best practices, and then explore existing representations of IGT through the lens of those desiderata. We give an overview of the data model and XML serialization of Xigt, and then describe its application to the u...
متن کاملExtracting Interlinear Glossed Text from LaTeX Documents
We present texigt, a command-line tool for the extraction of structured linguistic data from LTEX source documents, and a language resource that has been generated using this tool: a corpus of interlinear glossed text (IGT) extracted from open access books published by Language Science Press. Extracted examples are represented in a simple XML format that is easy to process and can be used to va...
متن کاملA Model for Interoperability: XML Documents as an RDF Database
We propose a model for a Resource Description Format (RDF) database for interlinear glossed text (IGT) created from documents encoded in the Extensible Markup Language (XML) using markup metaschemas. A metaschema, constructed using the Semantic Interpretation Language (SIL) (Simons 2004) maps XML-encoded documents to a common semantically rich RDF database. The RDF database in turn can be searc...
متن کاملEnriching ODIN
In this paper, we describe the expansion of the ODIN resource, a database containing many thousands of instances of Interlinear Glossed Text (IGT) for over a thousand languages. A database containing a large number of instances of IGT, which are effectively richly annotated and heuristically aligned bitexts, provides a unique resource for bootstrapping NLP tools for resource-poor languages. To ...
متن کاملخوشهبندی فراابتکاری اسناد فارسی اِکساِماِل مبتنی بر شباهت ساختاری و محتوایی
Due to the increasing number of documents, XML, effectively organize these documents in order to retrieve useful information from them is essential. A possible solution is performed on the clustering of XML documents in order to discover knowledge. Clustering XML documents is a key issue of how to measure the similarity between XML documents. Conventional clustering of text documents using a do...
متن کامل