Balanced corpus of informal spoken Czech: compilation, design and findings
Abstract
The paper presents ORAL2008, a new 1-million-word corpus of spoken Czech compiled within the framework of the Czech National Corpus project. ORAL2008 is designed to represent authentic spoken language used in informal situations, and it is balanced across the main sociolinguistic categories of speakers. The paper also covers the data collection, its broad coverage, and the transcription system, which registers the variability of spoken Czech. Finally, it outlines possible findings based on the provided data.
Similar resources
Oral2008: New Balanced Corpus of Spoken Czech
Attention paid to spoken language has increased in the last decades, as well as its importance for linguistic research and natural language processing in general. However, compilation of spoken corpora as an indispensable source of data is very laborious and thus expensive. Nevertheless, more and more spoken corpora are being created currently. There are various approaches to their design, dept...
Balanced data repository of spontaneous spoken Czech
The paper presents a data repository that will be used as a source of data for ORAL2013, a new corpus of spontaneous spoken Czech. The corpus is planned to be published in 2013 within the framework of the Czech National Corpus and it will contain both the audio recordings and their transcriptions manually aligned with time stamps. The corpus will be designed as a representation of contemporary sp...
Mapping Diatopic and Diachronic Variation in Spoken Czech: the Ortofon and Dialekt Corpora
ORTOFON and DIALEKT are two corpora of spoken Czech (recordings + transcripts) which are currently being built at the Institute of the Czech National Corpus. The first one (ORTOFON) continues the tradition of the CNC’s ORAL series of spoken corpora by focusing on collecting recordings of unscripted informal spoken interactions (“prototypically spoken texts”), but also provides new features, mos...
Adapting Lexical and Language Models for Transcription of Highly Spontaneous Spoken Czech
The paper deals with the problem of automatic transcription of spontaneous conversations in Czech. That type of speech is informal with many colloquial words. It is difficult to create an appropriate lexicon and language model when linguistic resources representing colloquial Czech are limited to several small corpora collected by the Institute of Czech National Corpus. To overcome this, we int...
The Spoken Dutch Corpus. Overview and First Evaluation
In this paper the Spoken Dutch Corpus project is presented, a joint Flemish-Dutch undertaking aimed at the compilation and annotation of a 10-million-word corpus of spoken Dutch. Upon completion, the corpus will constitute a valuable resource for research in the fields of computational linguistics and language and speech technology. The paper first gives an overall description of the project, i...