Automatic Phonemic Labeling and Segmentation of Spoken Dutch
Abstract
The CGN corpus (Corpus Gesproken Nederlands, the Spoken Dutch Corpus; Oostdijk, 2000) is a large speech corpus of contemporary Dutch as spoken in Belgium (3.3 million words) and in the Netherlands (5.6 million words). Because of its size, manual phonemic annotation was limited to 10% of the data, and automatic systems were used to annotate the remainder. This paper describes the automatic generation of the phonemic annotations and the corresponding segmentations. First, we detail the processes used to generate possible pronunciations for each sentence and to select the most likely one. Next, we identify the remaining difficulties in handling the CGN data and explain how we solved them. We conclude with an evaluation of the quality of the resulting transcriptions and segmentations.
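The first step named in the abstract, generating possible pronunciations for a sentence and selecting the most likely one, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy lexicon, the phoneme strings, and the scoring function `toy_score` are hypothetical and are not taken from the paper. A real system would more likely score candidates with an acoustic model over a compact pronunciation network than enumerate them one by one.

```python
from itertools import product
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical toy lexicon: each word maps to its pronunciation variants.
# The phoneme strings are illustrative only, not CGN transcriptions.
LEXICON: Dict[str, List[str]] = {
    "een": ["e n", "@ n"],
    "boek": ["b u k"],
}

Candidate = Tuple[str, ...]


def sentence_candidates(words: Sequence[str],
                        lexicon: Dict[str, List[str]]) -> List[Candidate]:
    """Combine per-word variants into all sentence-level pronunciations."""
    per_word = [lexicon[w] for w in words]
    return [tuple(choice) for choice in product(*per_word)]


def select_best(candidates: Sequence[Candidate],
                score: Callable[[Candidate], float]) -> Candidate:
    """Return the candidate with the highest score."""
    return max(candidates, key=score)


def toy_score(candidate: Candidate) -> float:
    # Stand-in for an acoustic likelihood (assumption): it simply
    # prefers candidates containing the reduced vowel "@".
    return float(sum(variant.count("@") for variant in candidate))


words = ["een", "boek"]
candidates = sentence_candidates(words, LEXICON)
best = select_best(candidates, toy_score)
```

Exhaustive enumeration grows exponentially with sentence length, so this form is only practical for short utterances or small variant sets; it is meant to show the generate-then-select structure, not an efficient search.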
Similar resources
A statistical phonemic segment model for speech recognition based on automatic phonemic segmentation
This paper presents a method of constructing a statistical phonemic segment model (SPSM) for a speech recognition system based on speaker-independent context-independent automatic phonemic segmentation. In our recent research, we proposed the phoneme recognition system using the template matching method with the same segmentation, and confirmed that 5-frame-fixed time sequence of feature vector...
BALDEY: A database of auditory lexical decisions.
In an auditory lexical decision experiment, 5541 spoken content words and pseudowords were presented to 20 native speakers of Dutch. The words vary in phonological make-up and in number of syllables and stress pattern, and are further representative of the native Dutch vocabulary in that most are morphologically complex, comprising two stems or one stem plus derivational and inflectional suffix...
Assessing Segmentations: Two Methods for Confidence Scoring Automatic HMM-Based Word Segmentations
The Dutch-Flemish project Spoken Dutch Corpus (1998-2003) aims at the development of an annotated corpus of 10 million spoken words. In order to make the speech data easily accessible, a word segmentation couples the orthographic transcription to the speech signal by means of time stamps. Generally, such segmentations are produced manually. Since this manual procedure is a time-consuming effort...
Word Segmentation in the Spoken Dutch Corpus
ELIS, University of Ghent, Sint-Pietersnieuwstraat 41, B-9000 Ghent, Belgium ({martens,odul,rvparijs}@elis.rug.ac.be); Dept. of Language & Speech, University of Nijmegen, P.O. Box 9103, 6500 HD Nijmegen, The Netherlands; ESAT, K.U.Leuven, Kasteelpark Arenberg 10, B-3001 Heverlee, Belgium ({kris.demuynck,tom.laureys,jacques.duchateau}@esat.kuleuven.ac.be). Abstract: This paper describe...
The AUTONOMATA Spoken Names Corpus
In the Autonomata project we have collected a corpus of spoken name utterances with manually corrected phonemic transcriptions of these utterances. The corpus was designed with the intention to become a major resource for the development of automatic speech recognition engines that can achieve a high accuracy on the recognition of person and geographical names spoken in Dutch. The recorded name...