Articulatory features for conversational speech recognition
Abstract
While the overall performance of speech recognition systems continues to improve, they still show a dramatic increase in word error rate when tested on unfamiliar speaking styles, for example when a speaker wants to make an important point during a meeting and switches from sloppy to clear speech. Today’s speech recognizers are therefore not robust with respect to speaking style, even though “conversational” speech, as present in the “Meeting” task, contains several distinctly different speaking styles. Methods therefore have to be developed that allow adapting systems to an individual speaker and his or her speaking styles. The approach presented in this thesis models important phonetic distinctions in speech better than phone-based systems do; it is based on detectors for phonologically distinctive “articulatory features” such as Rounded or Voiced. These properties can be identified robustly in speech and can be used to discriminate between words even when those words have become confusable because the phone-based models are mismatched due to differing speaking styles. This thesis revisits how human speakers contrast these broad phonological classes when making distinctions in clear speech, shows how these classes can be detected in the acoustic signal, and presents an algorithm for combining articulatory features with an existing state-of-the-art recognizer in a multi-stream set-up. The required feature-stream weights are learned automatically and discriminatively on adaptation data, which is more versatile and can be handled more efficiently than previous approaches. This thesis therefore presents a new acoustic model for automatic speech recognition in which phone and feature models are combined with a discriminative approach, so that an existing baseline system is improved. This multi-stream model captures phonetic knowledge about speech production and perception differently than a purely phone-based system does.
We evaluated this approach on the multi-lingual “GlobalPhone” task and on conversational speech, namely the English Spontaneous Scheduling Task (ESST) and the RT-04S “Meeting” data, one of the most difficult tasks in automatic speech recognition today. The algorithm is applied to generate both context-independent and context-dependent combination weights. Improvements of up to 20% in the speaker-specific adaptation case outperform conventional adaptation methods.
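The multi-stream combination described above can be illustrated with a small sketch. This is not the thesis’s actual implementation; it is a minimal, self-contained example of the general technique: per-stream log-likelihoods are combined log-linearly with stream weights, and the weights can be pushed in a discriminative direction (here, the gradient of a softmax log-posterior of the correct class, an MMI-style criterion) on adaptation data. All scores, class counts, and function names are illustrative assumptions.

```python
import numpy as np

# Illustrative per-frame log-likelihoods over 3 classes from two streams:
# a phone-model stream and an articulatory-feature stream (made-up values).
phone_scores = np.log(np.array([0.6, 0.3, 0.1]))
feature_scores = np.log(np.array([0.4, 0.5, 0.1]))

def combine_streams(score_lists, weights):
    """Log-linear multi-stream combination: a weighted sum of the
    per-stream log-likelihoods for each class."""
    return sum(w * s for w, s in zip(weights, score_lists))

def weight_gradient(score_lists, weights, correct):
    """One discriminative update direction for the stream weights:
    the gradient of log p(correct | frame) under a softmax over the
    combined scores (a simple MMI-style objective, for illustration)."""
    combined = combine_streams(score_lists, weights)
    post = np.exp(combined - combined.max())
    post /= post.sum()
    # d/dw_j log p(correct) = s_j[correct] - sum_k post[k] * s_j[k]
    return [s[correct] - float(post @ s) for s in score_lists]

# Hand-picked weights; in the thesis these are learned automatically
# on adaptation data rather than fixed like this.
weights = [0.7, 0.3]
combined = combine_streams([phone_scores, feature_scores], weights)
best_class = int(np.argmax(combined))
grad = weight_gradient([phone_scores, feature_scores], weights, correct=0)
```

With these toy scores the phone stream prefers class 0 and the feature stream class 1; the combined scores pick class 0, and the gradient for class 0 as the correct label is positive for the phone stream’s weight, i.e. a discriminative update would increase the weight of the stream that favors the correct class.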
Similar resources
Conversational speech recognition using acoustic and articulatory input
The combination of multiple speech recognizers based on different signal representations is increasingly attracting interest in the speech community. In previous work we presented a hybrid speech recognition system based on the combination of acoustic and articulatory information which achieved significant word error rate reductions under highly noisy conditions on a small-vocabulary numbers re...
Combining acoustic and articulatory feature information for robust speech recognition
The idea of using articulatory representations for automatic speech recognition (ASR) continues to attract much attention in the speech community. Representations which are grouped under the label ‘‘articulatory’’ include articulatory parameters derived by means of acoustic-articulatory transformations (inverse filtering), direct physical measurements or classification scores for pseudo-articul...
Imitating Conversational Laughter with an Articulatory Speech Synthesizer
In this study we present initial efforts to model laughter with an articulatory speech synthesizer. We aimed at imitating a real laugh taken from a spontaneous speech database and created several synthetic versions of it using articulatory synthesis and diphone synthesis. In modeling laughter with articulatory synthesis, we also approximated features like breathing noises that do not normally o...
Tonal articulatory feature for Mandarin and its application to conversational LVCSR
This paper presents our recent work on the development of a tonal Articulatory Feature (AF) for Mandarin and its application to conversational LVCSR. Motivated by the theory of Mandarin phonology, eight features for classifying the acoustic units and one feature for classifying the tone are investigated and constructed in the paper, and the AF-based tandem approach is used to improve speech rec...
Articulatory information and Multiview Features for Large Vocabulary Continuous Speech Recognition
This paper explores the use of multi-view features and their discriminative transforms in a convolutional deep neural network (CNN) architecture for a continuous large vocabulary speech recognition task. Mel-filterbank energies and perceptually motivated forced damped oscillator coefficient (DOC) features are used after feature-space maximum-likelihood linear regression (fMLLR) transforms, whic...
Articulatory feature-based pronunciation modeling
Spoken language, especially conversational speech, is characterized by great variability in word pronunciation, including many variants that differ grossly from dictionary prototypes. This is one factor in the poor performance of automatic speech recognizers on conversational speech, and it has been very difficult to mitigate in traditional phonebased approaches to speech recognition. An altern...
Publication date: 2005