Audiovisual Speech Recognition with Articulator Positions as Hidden Variables
Abstract
Speech recognition, by both humans and machines, benefits from visual observation of the face, especially at low signal-to-noise ratios (SNRs). It has often been noticed, however, that the audible and visible correlates of a phoneme may be asynchronous; perhaps for this reason, automatic speech recognition structures that allow asynchrony between the audible phoneme and the visible viseme outperform recognizers that allow no such asynchrony. This paper proposes, and tests using experimental speech recognition systems, a new explanation for audio-visual asynchrony. Specifically, we propose that audio-visual asynchrony may be the result of asynchrony between the gestures implemented by different articulators, such that the most visibly salient articulator (e.g., the lips) and the most audibly salient articulator (e.g., the glottis) may, at any given time, be dominated by gestures associated with different phonemes. The proposed model of audio-visual asynchrony is tested by implementing an “articulatory-feature model” audiovisual speech recognizer: a system with multiple hidden state variables, each representing the gestures of one articulator. The proposed system performs as well as a standard audiovisual recognizer on a digit recognition task; the best results are achieved by combining the outputs of the two systems.
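The core modeling idea in the abstract lends itself to a small illustration. The sketch below is a hedged reconstruction, not the authors' code: each articulator keeps its own index into the phone sequence, and a bound on how far any two indices may diverge defines the legal joint hidden states of the recognizer. The articulator names and the bound of one phone are assumptions made for illustration.

```python
from itertools import product

ARTICULATORS = ["lips", "tongue", "glottis"]
MAX_ASYNC = 1  # hypothetical bound; the paper's actual constraint may differ

def valid_joint_states(num_phones, max_async=MAX_ASYNC):
    """Enumerate joint states (one phone index per articulator) in which
    no articulator leads another by more than max_async phones."""
    states = []
    for idxs in product(range(num_phones), repeat=len(ARTICULATORS)):
        if max(idxs) - min(idxs) <= max_async:
            states.append(dict(zip(ARTICULATORS, idxs)))
    return states

# For a 3-phone word the unconstrained joint space has 3**3 = 27 states;
# the asynchrony bound prunes it to 15.
print(len(valid_joint_states(3)))
```

A standard audiovisual recognizer corresponds to the degenerate case max_async = 0, where every articulator is forced to express the same phoneme at the same time.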
Similar Papers
An articulation model for audiovisual speech synthesis - Determination, adjustment, evaluation
The authors present a visual articulation model for speech synthesis and a method to obtain it from measured data. This visual articulation model is integrated into MASSY, the Modular Audiovisual Speech SYnthesizer, and used to control visible articulator movements described by six motion parameters: one for the up-down movement of the lower jaw, three for the lips and two for the tongue. The v...
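The six motion parameters described above (one for the jaw, three for the lips, two for the tongue) suggest a simple data layout. The sketch below is illustrative only; the field names are guesses, not MASSY's actual parameter names.

```python
from dataclasses import dataclass

@dataclass
class VisualArticulationFrame:
    """One frame of the six motion parameters; field names are assumed."""
    jaw_height: float          # 1 parameter: up-down movement of the lower jaw
    lip_opening: float         # 3 lip parameters
    lip_rounding: float
    lip_protrusion: float
    tongue_tip_height: float   # 2 tongue parameters
    tongue_body_height: float
```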
Visual information and redundancy conveyed by internal articulator dynamics in synthetic audiovisual speech
This paper reports results of a study investigating the visual information conveyed by the dynamics of internal articulators. Intelligibility of synthetic audiovisual speech with and without visualization of the internal articulator movements was compared. Additionally, speech recognition scores were contrasted before and after a short learning lesson in which articulator trajectories were expla...
A Stochastic Articulatory-to-acoustic Mapping as a Basis for Speech Recognition
Hidden Markov models (HMMs) of speech acoustics are the current state-of-the-art in speech recognition, but these models bear little resemblance to the processes underlying speech production (Lee, 1989). In this respect, using an HMM to model speech acoustics is like using a Gaussian distribution to model data generated by a Poisson process – to the extent that the model is not an accurate repr...
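The Gaussian-for-Poisson analogy in this snippet can be made concrete in a few lines. The demo below illustrates the analogy only; it is not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
counts = rng.poisson(lam=3.0, size=10_000)  # data from a Poisson process

# Maximum-likelihood Gaussian fit to the Poisson samples.
mu, sigma = counts.mean(), counts.std()
print(f"Gaussian fit: mu={mu:.2f}, sigma={sigma:.2f}")

# The fit is usable (for a Poisson the variance equals the mean, and the
# Gaussian recovers that), but it assigns probability to negative and
# non-integer counts that the true process can never generate. In the same
# sense, an HMM can fit speech acoustics well without resembling the
# production process that generated them.
```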
Improving on Hidden Markov Models: An articulatorily constrained, maximum likelihood approach to speech recognition and speech coding
The goal of the proposed research is to test a statistical model of speech recognition that incorporates the knowledge that speech is produced by relatively slow motions of the tongue, lips, and other speech articulators. This model is called Maximum Likelihood Continuity Mapping (Malcom). Many speech researchers believe that by using constraints imposed by articulator motions, we can improve o...
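The constraint that articulators move slowly can be expressed as a continuity penalty on latent trajectories. The sketch below is an illustrative objective under that assumption, not the published Malcom algorithm.

```python
import numpy as np

def continuity_penalty(path, weight=1.0):
    """Sum of squared frame-to-frame displacements of a latent articulator
    path with shape (frames, dims); fast motion costs more."""
    return weight * float(np.sum(np.diff(path, axis=0) ** 2))

rng = np.random.default_rng(1)
smooth = np.linspace(0.0, 1.0, 50).reshape(-1, 1)  # slow, steady movement
jumpy = rng.uniform(0.0, 1.0, size=(50, 1))        # physically implausible

print(continuity_penalty(smooth))  # small penalty
print(continuity_penalty(jumpy))   # large penalty
```

Subtracting such a penalty from the acoustic log-likelihood biases the maximum-likelihood path toward slowly varying trajectories, which is the intuition the snippet describes.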
A Hybrid HMM/BN Acoustic Model for Automatic Speech Recognition
In current HMM-based speech recognition systems, it is difficult to supplement acoustic spectrum features with additional information such as pitch, gender, articulator positions, etc. On the other hand, Bayesian Networks (BN) allow for easy combination of different continuous as well as discrete features by exploring conditional dependencies between them. However, the lack of efficient algorit...
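The BN idea of conditioning the observation model on an auxiliary discrete variable can be sketched compactly. Everything below is an illustrative assumption (the auxiliary "gender" variable, the parameter values, and the 1-D feature), not the paper's model.

```python
import numpy as np

GENDER_PRIOR = {"f": 0.5, "m": 0.5}  # hypothetical auxiliary variable

# Per-(HMM state, gender) mean and variance for a 1-D acoustic feature;
# the numbers are placeholders.
PARAMS = {
    (0, "f"): (1.0, 0.5), (0, "m"): (0.2, 0.5),
    (1, "f"): (2.5, 0.7), (1, "m"): (1.8, 0.7),
}

def gaussian_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def obs_likelihood(x, state):
    """BN-style observation model: condition on the discrete auxiliary
    variable, then marginalize it out:
    p(x | state) = sum_g p(g) * p(x | state, g)."""
    return sum(p * gaussian_pdf(x, *PARAMS[(state, g)])
               for g, p in GENDER_PRIOR.items())

print(obs_likelihood(1.1, state=0))
```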