Articulatory features for conversational speech recognition

Author

  • Florian Metze
Abstract

While the overall performance of speech recognition systems continues to improve, they still show a dramatic increase in word error rate when tested on different speaking styles, e.g. when speakers want to make an important point during a meeting and switch from sloppy to clear speech. Today’s speech recognizers are therefore not robust with respect to speaking style, even though “conversational” speech, as present in the “Meeting” task, contains several distinctly different speaking styles. Methods therefore have to be developed that allow adapting systems to an individual speaker and his or her speaking styles. The approach presented in this thesis models important phonetic distinctions in speech better than phone-based systems and is based on detectors for phonologically distinctive “articulatory features” such as Rounded or Voiced. These properties can be identified robustly in speech and can be used to discriminate between words even when these have become confusable, because the phone-based models are generally mismatched due to differing speaking styles. This thesis revisits how human speakers contrast these broad phonological classes when making distinctions in clear speech, shows how these classes can be detected in the acoustic signal, and presents an algorithm for combining articulatory features with an existing state-of-the-art recognizer in a multi-stream set-up. The required feature-stream weights are learned automatically and discriminatively on adaptation data, which is more versatile and can be handled more efficiently than previous approaches. This thesis therefore presents a new acoustic model for automatic speech recognition in which phone and feature models are combined with a discriminative approach, improving an existing baseline system. This multi-stream model approach captures phonetic knowledge about speech production and perception differently than a purely phone-based system. We evaluated this approach on the multi-lingual “GlobalPhone” task and on conversational speech, i.e. the English Spontaneous Scheduling Task (ESST) and RT-04S “Meeting” data, which is one of the most difficult tasks in automatic speech recognition today. The algorithm is applied to generate both context-independent and context-dependent combination weights. Improvements of up to 20% in the case of speaker-specific adaptation outperform conventional adaptation methods.
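The abstract describes combining phone-model scores with articulatory-feature detector scores in a multi-stream set-up, with stream weights learned discriminatively on adaptation data. The following is a minimal, hypothetical sketch of how such a weighted score combination and a single discriminative weight update could look; the stream names, scores, and the simple reference-versus-competitor update rule are illustrative assumptions, not the implementation used in the thesis.

import numpy as np

# Hypothetical per-state log scores for one frame, one row per stream:
# a phone-based acoustic model plus two articulatory-feature detectors.
# Shape: (num_streams, num_states).
stream_log_scores = np.array([
    [-4.2, -6.1, -5.0],   # phone-based acoustic model (assumed values)
    [-1.1, -3.5, -2.2],   # "Voiced" feature detector (assumed values)
    [-2.0, -1.8, -2.9],   # "Rounded" feature detector (assumed values)
])

def combined_score(log_scores: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Multi-stream combination: weighted sum of per-stream log scores
    for each state; `weights` holds one weight per stream."""
    return weights @ log_scores

# Stream weights; in the thesis these are learned discriminatively on
# adaptation data. Here we illustrate one gradient-style step that raises
# the combined score of the reference state relative to a competitor.
weights = np.array([1.0, 0.1, 0.1])
reference_state, competitor_state = 0, 1
learning_rate = 0.01

gradient = (stream_log_scores[:, reference_state]
            - stream_log_scores[:, competitor_state])
weights = weights + learning_rate * gradient

print("combined scores per state:", combined_score(stream_log_scores, weights))
print("updated stream weights:", weights)

In the actual system the weights would be estimated over adaptation utterances against the baseline recognizer, either as one context-independent weight set or as separate weights per context class, as the abstract indicates.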


Similar Resources

Conversational speech recognition using acoustic and articulatory input

The combination of multiple speech recognizers based on different signal representations is increasingly attracting interest in the speech community. In previous work we presented a hybrid speech recognition system based on the combination of acoustic and articulatory information which achieved significant word error rate reductions under highly noisy conditions on a small-vocabulary numbers re...


Combining acoustic and articulatory feature information for robust speech recognition

The idea of using articulatory representations for automatic speech recognition (ASR) continues to attract much attention in the speech community. Representations which are grouped under the label “articulatory” include articulatory parameters derived by means of acoustic-articulatory transformations (inverse filtering), direct physical measurements or classification scores for pseudo-articul...


Imitating Conversational Laughter with an Articulatory Speech Synthesizer

In this study we present initial efforts to model laughter with an articulatory speech synthesizer. We aimed at imitating a real laugh taken from a spontaneous speech database and created several synthetic versions of it using articulatory synthesis and diphone synthesis. In modeling laughter with articulatory synthesis, we also approximated features like breathing noises that do not normally o...


Tonal articulatory feature for Mandarin and its application to conversational LVCSR

This paper presents our recent work on the development of a tonal Articulatory Feature (AF) for Mandarin and its application to conversational LVCSR. Motivated by the theory of Mandarin phonology, eight features for classifying the acoustic units and one feature for classifying the tone are investigated and constructed in the paper, and the AF-based tandem approach is used to improve speech rec...


Articulatory information and Multiview Features for Large Vocabulary Continuous Speech Recognition

This paper explores the use of multi-view features and their discriminative transforms in a convolutional deep neural network (CNN) architecture for a continuous large vocabulary speech recognition task. Mel-filterbank energies and perceptually motivated forced damped oscillator coefficient (DOC) features are used after feature-space maximum-likelihood linear regression (fMLLR) transforms, whic...


Articulatory feature-based pronunciation modeling

Spoken language, especially conversational speech, is characterized by great variability in word pronunciation, including many variants that differ grossly from dictionary prototypes. This is one factor in the poor performance of automatic speech recognizers on conversational speech, and it has been very difficult to mitigate in traditional phone-based approaches to speech recognition. An altern...



Journal:

Volume:   Issue:

Pages:  -

Publication date: 2005