Regional Bias in the Broad Phonetic Transcriptions of the Spoken Dutch Corpus
Authors
Abstract
In this paper, we assess one aspect of the quality of the broad phonetic transcriptions in the Spoken Dutch Corpus (CGN). The corpus contains speech from native speakers of Dutch originating from the Netherlands and the Dutch-speaking part of Belgium. The phonetic transcriptions were made by transcribers from both regions. In previous research, we identified regional differences in the transcribers' behaviour. Here we explore the precise sources of the regional bias in the CGN transcriptions and evaluate its impact on the phonetic transcriptions. More specifically, (1) we critically analyse the regional bias in the canonical transcriptions that served as the basis for the transcribers' verification task, and (2) we verify in an experiment the regional bias introduced by the transcribers themselves. The possible effects of this inherent regional bias in the CGN transcriptions on subsequent linguistic analyses are briefly discussed.
Similar resources
Automatic Phonetic Transcription of Large Speech Corpora
Most large speech corpora are delivered with a lexicon that contains a canonical transcription of every word in the orthographic transcription. Such a lexicon can be used for generating a hypothetical ‘canonical’ phonetic transcription from the orthography. In addition, time and money permitting, some speech corpora are provided with a manually verified broad phonetic transcription of at least ...
The Influence of the Labeller's Regional Background on Phonetic Transcriptions: Implications for the Evaluation of Spoken Language Resources
Phonetic transcriptions of spoken language corpora are not an exact written reproduction of the speech signal. They are influenced by a variety of factors such as the transcriber's native categorical perception. What remains unexplored is to what extent variation of perception within the same language exerts any influence on phonetic transcriptions. We report a case study of the labelling of vo...
Assessing Manually Corrected Broad Phonetic Transcriptions in the Spoken Dutch Corpus
For research and development purposes in the areas of phonetics and speech technology, phonetically transcribed speech may be of great value. In the near future, the Spoken Dutch Corpus (CGN) is going to offer such transcriptions for about one thousand hours of spoken Dutch, of which 90% will consist of automatic transcriptions and 10% of manually produced transcriptions. An advantage of automa...
Automatic phonetic transcription of large speech corpora
This study is aimed at investigating whether automatic phonetic transcription procedures can approximate manual transcriptions typically delivered with contemporary large speech corpora. To this end, ten automatic procedures were used to generate a broad phonetic transcription of well-prepared speech (read-aloud texts) and spontaneous speech (telephone dialogues) from the Spoken Dutch Corpus. T...
German Today: an Areally Extensive Corpus of Spoken Standard German
The research project “German Today” aims to determine the amount of regional variation in (near-)standard German spoken by young and older educated adults and to identify and locate regional features. To this end, we compile an areally extensive corpus of read and spontaneous German speech. Secondary school students and 50-to-60-year-old locals are recorded in 160 cities throughout the German s...