corpora creation

Standardisation Efforts On The Level Of Dialogue Act In The MATE Project

1999

Marion Klein

This paper describes the state of the art of coding schemes for dialogue acts and the efforts to establish a standard in this field. We present a review and comparison of currently available schemes and outline the comparison problems we had due to domain, task, and language dependencies of schemes. We discuss solution strategies aimed at the reusability of corpora. Reusability is a ...

Coding for Demographic Categories in the Creation of Legacy Corpora: Asian American Ethnic Identities

Journal: Language and Linguistics Compass 2014

Specialized vocabulary across languages: The case of traditional Chinese medicine

Journal: Studies in Second Language Learning and Teaching 2023

This paper reports on the creation of specialized word lists in traditional Chinese medicine (TCM), which is a discipline using vocabulary across languages (i.e., Chinese and English) and involves learners with different L1 backgrounds. First, a TCM Word List of 2,778 words was established from corpora of textbooks and journal articles. Selection criteria included meaning, keyness corpus general written English compar...

Grammar Extraction and Refinement from an HPSG Corpus

2002

Kiril Simov

Grammar learning and refinement on the basis of language resources is very appealing in comparison with manual development of formal grammar. But in order to learn a complex grammar a complex resource is needed. Thus the creation of language resources and learning of grammars from them have to be aware of each other. In this paper we define a formal basis for annotation of corpora with respect ...

A 3-Steps Algorithm for Morphological Disambiguation Using Untagged Corpora

2003

Anna Pappa

This article presents a three-step algorithm for morphological disambiguation between the definite article and the personal pronoun in French. Tested accuracy on a large untagged corpus exceeds 98% with less than 1% of error. Our method has also been tested on unlabeled Greek corpora and the results prove the system’s portability to other languages with similar structure. Not a...

Introduction to the special issue on processing under-resourced languages

Journal: Speech Communication 2014

Laurent Besacier Etienne Barnard Alexey Karpov Tanja Schultz

The creation of language and acoustic resources, for any given spoken language, is typically a costly task. For example, a large amount of time and money is required to properly create annotated speech corpora for automatic speech recognition (ASR), domain-specific text corpora for language modeling (LM), etc. The development of speech technologies (ASR, Text-to-Speech) for the already highreso...

Etiquetario morfosintáctico del SLI para corpus de lengua gallega: aplicación al corpus paralelo TECTRA

Journal: Procesamiento del Lenguaje Natural 2002

José Luis Aguirre Moreno Alberto Álvarez Lugrís Xavier Gómez Guinovart

In this article we present a complete and normalized morphosyntactic tagset for the annotation of linguistic corpora in Galician. The elaboration of this tagset, designed by the Computational Linguistics Group (SLI) of the University of Vigo, following strictly the EAGLES recommendations (Leech and Wilson, 1996), includes the creation of an intermediate tagset that allows us to establish a corr...

Fast Syntactic Searching in Very Large Corpora for Many Languages

2010

Milos Jakubícek Adam Kilgarriff Diana McCarthy Pavel Rychlý

For many linguistic investigations, the first step is to find examples. In the 21st century, they should all be found, not invented. Thus linguists need flexible tools for finding even quite rare phenomena. To support linguists well, these tools need to be fast even where corpora are very large and queries are complex. We present extensions to the CQL (Corpus Query Language) for intuitive creation of ...

Annotating Events, Temporal Expressions and Relations in Italian: the It-Timeml Experience for the Ita-TimeBank

2011

Tommaso Caselli Valentina Bartalesi Lenzi Rachele Sprugnoli Emanuele Pianta Irina Prodanof

This paper presents the annotation guidelines and specifications which have been developed for the creation of the Italian TimeBank, a language resource composed of two corpora manually annotated with temporal and event information. In particular, the adaptation of the TimeML scheme to Italian is described, and special attention is given to the methodology used for the realization of the anno...

Making Monolingual Corpora Comparable: a Case Study of Bulgarian and Croatian

2004

Bozo Bekavac Petya Osenova Kiril Ivanov Simov Marko Tadic

This paper describes the first steps towards the creation of a Bulgarian-Croatian comparable corpus. It is based on two newspaper subcorpora from larger reference corpora of Bulgarian and Croatian. Initially we rely on more extralinguistically oriented, but methodologically cleaner, parameters of similarity such as specific topics, pre-defined time span and data size. The idea of ‘light’ and ...
