Using audio and visual information for single channel speaker separation

نویسندگان

Faheem Khan

Ben P. Milner

چکیده

This work proposes a method to exploit both audio and visual speech information to extract a target speaker from a mixture of competing speakers. The work begins by taking an effective audio-only method of speaker separation, namely the soft mask method, and modifying its operation to allow visual speech information to improve the separation process. The audio input is taken from a single channel and includes the mixture of speakers, where as a separate set of visual features are extracted from each speaker. This allows modification of the separation process to include not only the audio speech but also visual speech from each speaker in the mixture. Experimental results are presented that compare the proposed audio-visual speaker separation with audio-only and visual-only methods using both speech quality and speech intelligibility metrics.

متن کامل

منابع مشابه

Speaker separation using visual speech features and single-channel audio

This work proposes a method of single-channel speaker separation that uses visual speech information to extract a target speaker’s speech from a mixture of speakers. The method requires a single audio input and visual features extracted from the mouth region of each speaker in the mixture. The visual information from speakers is used to create a visually-derived Wiener filter. The Wiener filter...

متن کامل

Speaker separation using visually-derived binary masks

This paper is concerned with the problem of single-channel speaker separation and exploits visual speech information to aid the separation process. Audio from a mixture of speakers is received from a single microphone and to supplement this, video from each speaker in the mixture is also captured. The visual features are used to create a time-frequency binary mask that identifies regions where ...

متن کامل

Speaker-Targeted Audio-Visual Models for Speech Recognition in Cocktail-Party Environments

Speech recognition in cocktail-party environments remains a significant challenge for state-of-the-art speech recognition systems, as it is extremely difficult to extract an acoustic signal of an individual speaker from a background of overlapping speech with similar frequency and temporal characteristics. We propose the use of speaker-targeted acoustic and audio-visual models for this task. We...

متن کامل

Robust audio-visual speech synchrony detection by generalized bimodal linear prediction

We study the problem of detecting audio-visual synchrony in video segments containing a speaker in frontal head pose. The problem holds a number of important applications, for example speech source localization, speech activity detection, speaker diarization, speech source separation, and biometric spoofing detection. In particular, we build on earlier work, extending our previously proposed ti...

متن کامل

Classification vs. Regression in Supervised Learning for Single Channel Speaker Count Estimation

The task of estimating the maximum number of concurrent speakers from single channel mixtures is important for various audio-based applications, such as blind source separation, speaker diarisation, audio surveillance or auditory scene classification. Building upon powerful machine learning methodology, we develop a Deep Neural Network (DNN) that estimates a speaker count. While DNNs efficientl...

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

متن کامل

عنوان ژورنال:

دوره شماره

صفحات -

تاریخ انتشار 2015

Using audio and visual information for single channel speaker separation

نویسندگان

چکیده

منابع مشابه

Speaker separation using visual speech features and single-channel audio

Speaker separation using visually-derived binary masks

Speaker-Targeted Audio-Visual Models for Speech Recognition in Cocktail-Party Environments

Robust audio-visual speech synchrony detection by generalized bimodal linear prediction

Classification vs. Regression in Supervised Learning for Single Channel Speaker Count Estimation

عنوان ژورنال:

اشتراک گذاری