Assessing Manually Corrected Broad Phonetic Transcriptions in the Spoken Dutch Corpus
Abstract
For research and development purposes in the areas of phonetics and speech technology, phonetically transcribed speech may be of great value. In the near future, the Spoken Dutch Corpus (CGN) is going to offer such transcriptions for about one thousand hours of spoken Dutch, of which 90% will consist of automatic transcriptions and 10% of manually produced transcriptions. An advantage of automatically produced transcriptions is that they are maximally reliable; they are, however, not necessarily maximally accurate. One way of making them more accurate is to have them checked and modified manually, but it is widely accepted that human transcriptions tend to be subjective and unreliable. The goal of this paper is to establish whether human CGN transcribers succeeded in making accurate transcriptions by correcting automatic transcriptions, while maintaining a high level of reliability.
Similar resources
Automatic Phonetic Transcription of Large Speech Corpora
Most large speech corpora are delivered with a lexicon that contains a canonical transcription of every word in the orthographic transcription. Such a lexicon can be used for generating a hypothetical ‘canonical’ phonetic transcription from the orthography. In addition, time and money permitting, some speech corpora are provided with a manually verified broad phonetic transcription of at least ...
Automatic phonetic transcription of large speech corpora
This study is aimed at investigating whether automatic phonetic transcription procedures can approximate manual transcriptions typically delivered with contemporary large speech corpora. To this end, ten automatic procedures were used to generate a broad phonetic transcription of well-prepared speech (read-aloud texts) and spontaneous speech (telephone dialogues) from the Spoken Dutch Corpus. T...
Regional Bias in the Broad Phonetic Transcriptions of the Spoken Dutch Corpus
In this paper, we assess an aspect of the quality of the broad phonetic transcriptions in the Spoken Dutch Corpus (CGN). The corpus contains speech from native speakers of Dutch originating from The Netherlands and the Dutch speaking part of Belgium. The phonetic transcriptions were made by transcribers from both regions. In previous research, we have identified regional differences in the tran...
Validation of phonetic transcriptions in the context of automatic speech recognition
Some of the speech databases and large spoken language corpora that have been collected during the last fifteen years have been (at least partly) annotated with a broad phonetic transcription. Such phonetic transcriptions are often validated in terms of their resemblance to a handcrafted reference transcription. However, there are at least two methodological issues questioning this validation m...
How to Improve Human and Machine Transcriptions of Spontaneous Speech
This paper reports on an experiment aimed at measuring the quality of automatic and human phonetic transcriptions of different speech styles that were produced within the framework of a large speech corpus project for Dutch, the Spoken Dutch Corpus (Corpus Gesproken Nederlands, CGN). The results indicate that the procedure adopted in the CGN to improve the quality of phonetic transcriptions...