Modulation spectral features for speech emotion recognition using deep neural networks
Abstract
This work explores the use of constant-Q transform based modulation spectral features (CQT-MSF) for speech emotion recognition (SER). Human perception and analysis of sound comprise two important cognitive parts: early auditory processing and cortex-based processing. Early auditory processing considers a spectrogram-based representation, whereas cortex-based processing includes the extraction of temporal modulations from the spectrogram. This temporal-modulation representation of the spectrogram is called a modulation spectral feature (MSF). As the constant-Q transform (CQT) provides higher resolution at the salient low-frequency regions of speech, we find that a CQT-based spectrogram, together with its temporal modulations, yields a representation enriched with emotion-specific information. We argue that CQT-MSF, when used with a 2-dimensional convolutional network, can provide a time-shift invariant and deformation insensitive feature for SER. Our results show that CQT-MSF outperforms standard mel-scale based features on two popular SER databases, Berlin EmoDB and RAVDESS. We also show that our proposed feature outperforms shift and deformation invariant scattering coefficients, hence showing the importance of joint hand-crafted and self-learned feature extraction instead of reliance on completely hand-crafted or self-learned features. Finally, we perform Grad-CAM analysis to visually inspect the contribution of the constant-Q modulation features over SER.
Related articles
Automatic speech emotion recognition using modulation spectral features
In this study, modulation spectral features (MSFs) are proposed for the automatic recognition of human affective information from speech. The features are extracted from an auditory-inspired long-term spectro-temporal representation. Obtained using an auditory filterbank and a modulation filterbank for speech analysis, the representation captures both acoustic frequency and temporal modulation ...
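The auditory-filterbank-plus-modulation-filterbank scheme this snippet describes can be sketched with standard signal-processing tools. The band edges, filter orders, and envelope sample rate below are illustrative assumptions, not the study's actual configuration.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert, resample_poly

def msf(y, sr, env_sr=400,
        acoustic_bands=((100, 500), (500, 1500), (1500, 4000)),
        mod_bands=((2, 4), (4, 8), (8, 16))):
    """Modulation spectral features: acoustic filterbank -> envelope ->
    modulation filterbank -> per-band modulation energy (illustrative)."""
    feats = np.empty((len(acoustic_bands), len(mod_bands)))
    for i, (lo, hi) in enumerate(acoustic_bands):
        sos = butter(4, [lo, hi], btype='bandpass', fs=sr, output='sos')
        env = np.abs(hilbert(sosfiltfilt(sos, y)))  # temporal envelope of the band
        env = resample_poly(env, 1, sr // env_sr)   # decimate envelope to env_sr
        for j, (mlo, mhi) in enumerate(mod_bands):
            msos = butter(2, [mlo, mhi], btype='bandpass',
                          fs=env_sr, output='sos')
            # Mean energy of the envelope within this modulation band.
            feats[i, j] = np.mean(sosfiltfilt(msos, env) ** 2)
    return feats

sr = 16000
t = np.arange(2 * sr) / sr
# 440 Hz carrier with 4 Hz amplitude modulation, a toy stand-in for speech.
y = np.sin(2 * np.pi * 440 * t) * (1 + 0.5 * np.sin(2 * np.pi * 4 * t))
F = msf(y, sr)
print(F.shape)  # (3, 3)
```

Each entry of the resulting matrix is the energy of one acoustic band at one modulation rate, which is the joint acoustic/modulation-frequency representation the snippet refers to.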
Speech Emotion Recognition Using Scalogram Based Deep Structure
Speech Emotion Recognition (SER) is an important part of speech-based Human-Computer Interface (HCI) applications. Previous SER methods rely on the extraction of features and training an appropriate classifier. However, most of those features can be affected by emotionally irrelevant factors such as gender, speaking styles and environment. Here, an SER method has been proposed based on a concat...
Multimodal Emotion Recognition Using Deep Neural Networks
The change of emotions is a temporal dependent process. In this paper, a Bimodal-LSTM model is introduced to take temporal information into account for emotion recognition with multimodal signals. We extend the implementation of denoising autoencoders and adopt the Bimodal Deep Denoising AutoEncoder modal. Both models are evaluated on a public dataset, SEED, using EEG features and eye movement ...
Recognition of Human Emotion in Speech Using Modulation Spectral Features and Support Vector Machines
Automatic recognition of human emotion in speech aims at recognizing the underlying emotional state of a speaker from the speech signal. The area has received rapidly increasing research interest over the past few years. However, designing powerful spectral features for high-performance speech emotion recognition (SER) remains an open challenge. Most spectral features employed in current SER te...
Binary Deep Neural Networks for Speech Recognition
Deep neural networks (DNNs) are widely used in most current automatic speech recognition (ASR) systems. To guarantee good recognition performance, DNNs usually require significant computational resources, which limits their application to low-power devices. Thus, it is appealing to reduce the computational cost while keeping the accuracy. In this work, in light of the success in image recogniti...
Journal
Journal title: Speech Communication
Year: 2023
ISSN: 1872-7182, 0167-6393
DOI: https://doi.org/10.1016/j.specom.2022.11.005