Multi-Angle Lipreading with Angle Classification-Based Feature Extraction and Its Application to Audio-Visual Speech Recognition

Authors

Abstract

Recently, automatic speech recognition (ASR) and visual speech recognition (VSR) have been widely researched owing to the development in deep learning. Most VSR research works focus only on frontal face images. However, assuming real scenes, it is obvious that a VSR system should correctly recognize spoken contents from not only frontal but also diagonal or profile faces. In this paper, we propose a novel VSR method applicable to faces taken at any angle. Firstly, view classification is carried out to estimate face angles. Based on the results, feature extraction is then conducted using the best combination of pre-trained feature extraction models. Next, lipreading is carried out using the features. We also developed audio-visual speech recognition (AVSR) using the VSR in addition to conventional ASR. Audio results were obtained from ASR, followed by incorporating audio and visual results in a decision fusion manner. We evaluated our methods using OuluVS2, a multi-angle audio-visual database. We confirmed that our approach achieved the best performance among conventional VSR schemes in a phrase classification task. In addition, we found that our AVSR results are better than the ASR and VSR results.
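
The flow described in the abstract (estimate the view, pick feature extractors for that view, then combine audio and visual decisions) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The angle set, the `EXTRACTORS_FOR_VIEW` table, the model names in it, and the log-linear weighting in `fuse_decisions` are all assumptions made for illustration; the abstract only says that a "best combination" of pre-trained models is used and that results are merged by decision fusion.

```python
import math
from typing import Dict, List, Sequence

# Assumed camera views in degrees (illustrative; not stated in the abstract).
VIEW_ANGLES = (0, 30, 45, 60, 90)

# Hypothetical lookup: which pre-trained extractors to combine for each view.
# The names are placeholders, not models from the paper.
EXTRACTORS_FOR_VIEW: Dict[int, List[str]] = {
    0: ["frontal"],
    30: ["frontal", "diagonal"],
    45: ["diagonal"],
    60: ["diagonal", "profile"],
    90: ["profile"],
}


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_view(view_scores: Sequence[float]) -> int:
    """Return the angle whose view-classifier score is highest."""
    best = max(range(len(view_scores)), key=lambda i: view_scores[i])
    return VIEW_ANGLES[best]


def fuse_decisions(
    audio_probs: Sequence[float],
    visual_probs: Sequence[float],
    audio_weight: float = 0.5,
) -> List[float]:
    """Assumed fusion rule: weighted sum of per-class log-probabilities,
    renormalized with softmax. The paper's actual rule may differ."""
    eps = 1e-12  # avoids log(0)
    fused = [
        audio_weight * math.log(a + eps)
        + (1.0 - audio_weight) * math.log(v + eps)
        for a, v in zip(audio_probs, visual_probs)
    ]
    return softmax(fused)


if __name__ == "__main__":
    angle = classify_view([0.1, 0.2, 0.6, 0.05, 0.05])
    print(angle, EXTRACTORS_FOR_VIEW[angle])
    print(fuse_decisions([0.6, 0.4], [0.2, 0.8]))
```

In a sketch like this, `audio_weight` would typically be tuned on held-out data: a value near 1 trusts the audio recognizer, while a smaller value leans on lipreading, which is the usual motivation for fusing the two when audio is noisy.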

Related resources

Multi-pose lipreading and audio-visual speech recognition

In this article, we study the adaptation of visual and audio-visual speech recognition systems to non-ideal visual conditions. We focus on overcoming the effects of a changing pose of the speaker, a problem encountered in natural situations where the speaker moves freely and does not keep a frontal pose with relation to the camera. To handle these situations, we introduce a pose normalization b...

Auxiliary Multimodal LSTM for Audio-visual Speech Recognition and Lipreading

Audio-visual Speech Recognition (AVSR), which employs both video and audio information to perform Automatic Speech Recognition (ASR), is one of the applications of multimodal learning, making ASR systems more robust and accurate. Traditional models usually treated AVSR as inference or projection, but strict priors limit their ability. With the revival of deep learning, Deep Neural Networks (DNN) be...

Application of affine-invariant Fourier descriptors to lipreading for audio-visual speech recognition

This work focuses on a novel affine-invariant lipreading method, and its optimal combination with an audio subsystem to implement an audio-visual automatic speech recognition (AV-ASR) system. The lipreading method is based on outer lip contour description which is transformed to the Fourier domain and normalized there to eliminate dependencies on the affine transformation (translation, rotation...

An extension 2DPCA based visual feature extraction method for audio-visual speech recognition

Two dimensional principal component analysis (2DPCA) has been proposed for face recognition as an alternative to traditional PCA transform [1]. In this paper, we extend this approach to the visual feature extraction for audio-visual speech recognition (AVSR). First, a two-stage 2DPCA transform is conducted to extract the visual features. Then, the visemic linear discriminant analysis (LDA) is a...

Journal

Journal title: Future Internet

Year: 2021

ISSN: 1999-5903

DOI: https://doi.org/10.3390/fi13070182