Multitask Learning From Augmented Auxiliary Data for Improving Speech Emotion Recognition
Abstract
Despite the recent progress in speech emotion recognition (SER), state-of-the-art systems lack generalisation across different conditions. A key underlying reason for poor generalisation is the scarcity of emotion datasets, which is a significant roadblock to designing robust machine learning (ML) models. Recent works in SER focus on utilising multitask learning (MTL) methods to improve generalisation by learning shared representations. However, most of these studies propose MTL solutions with the requirement of meta labels for auxiliary tasks, which limits the training of SER systems. This paper proposes an MTL framework (MTL-AUG) that learns generalised representations from augmented data. We utilise augmentation-type classification and unsupervised reconstruction as auxiliary tasks, which allow training SER systems on augmented data without requiring any meta labels for auxiliary tasks. The semi-supervised nature of MTL-AUG allows the exploitation of abundant unlabelled data to further boost the performance of SER. We comprehensively evaluate the proposed framework in the following settings: (1) within corpus, (2) cross-corpus and cross-language, (3) noisy speech, and (4) adversarial attacks. Our evaluations using the widely used IEMOCAP, MSP-IMPROV, and EMODB datasets show improved results compared with existing methods.
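The training objective implied by the abstract can be illustrated with a small sketch: a supervised emotion loss that applies only to labelled utterances, plus two auxiliary losses (augmentation-type classification and reconstruction) that need no emotion labels. This is not the authors' implementation. The function name `mtl_aug_loss`, the weights `alpha` and `beta`, the use of cross-entropy and mean squared error, and the simple weighted sum are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability of the target class index."""
    return -math.log(softmax(logits)[target])


def mse(x, x_hat):
    """Mean squared error between an input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def mtl_aug_loss(emotion_logits, emotion_label, aug_logits, aug_label,
                 features, reconstruction, alpha=1.0, beta=1.0):
    """Hypothetical combined loss for one utterance.

    emotion_label may be None for unlabelled data; in that case only
    the two auxiliary terms contribute (the semi-supervised case).
    alpha and beta are assumed weighting hyperparameters.
    """
    loss = alpha * cross_entropy(aug_logits, aug_label)
    loss += beta * mse(features, reconstruction)
    if emotion_label is not None:
        loss += cross_entropy(emotion_logits, emotion_label)
    return loss


# Labelled utterance: all three terms contribute.
labelled = mtl_aug_loss([2.0, 0.1, -1.0, 0.3], 0, [0.5, 1.5, 0.0], 1,
                        [0.2, 0.4, 0.6], [0.25, 0.35, 0.6])
# Unlabelled utterance: only the auxiliary terms contribute.
unlabelled = mtl_aug_loss([2.0, 0.1, -1.0, 0.3], None, [0.5, 1.5, 0.0], 1,
                          [0.2, 0.4, 0.6], [0.25, 0.35, 0.6])
print(labelled, unlabelled)
```

Because the augmentation type is known at the moment the augmentation is applied, and the reconstruction target is the input itself, both auxiliary terms can be computed for unlabelled speech, which is what makes the semi-supervised setting possible.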
Similar resources
Multitask Learning with Low-Level Auxiliary Tasks for Encoder-Decoder Based Speech Recognition
End-to-end training of deep learning-based models allows for implicit learning of intermediate representations based on the final task loss. However, the end-to-end approach ignores the useful domain knowledge encoded in explicit intermediate-level supervision. We hypothesize that using intermediate representations as auxiliary supervision at lower levels of deep networks may be a good way of c...
Improving automatic emotion recognition from speech signals
We present a speech signal driven emotion recognition system. Our system is trained and tested with the INTERSPEECH 2009 Emotion Challenge corpus, which includes spontaneous and emotionally rich recordings. The challenge includes classifier and feature sub-challenges with five-class and two-class classification problems. We investigate prosody related, spectral and HMM-based features for the ev...
Multitask Learning with CTC and Segmental CRF for Speech Recognition
Segmental conditional random fields (SCRFs) and connectionist temporal classification (CTC) are two sequence labeling methods used for end-to-end training of speech recognition models. Both models define a transcription probability by marginalizing decisions about latent segmentation alternatives to derive a sequence probability: the former uses a globally normalized joint model of segment labe...
Active learning for dimensional speech emotion recognition
State-of-the-art dimensional speech emotion recognition systems are trained using continuously labelled instances. The data labelling process is labour intensive and time-consuming. In this paper, we propose to apply active learning to reduce the corresponding effort: the unlabelled instances are evaluated automatically, and only the most informative ones are intelligently picked by an informativeness...
Feature Transfer Learning for Speech Emotion Recognition
Speech Emotion Recognition (SER) has achieved some substantial progress in the past few decades since the dawn of emotion and speech research. In many aspects, various research efforts have been made in an attempt to achieve human-like emotion recognition performance in real-life settings. However, with the availability of speech data obtained from different devices and varied acquisition condi...
Journal
Journal title: IEEE Transactions on Affective Computing
Year: 2022
ISSN: 1949-3045, 2371-9850
DOI: https://doi.org/10.1109/taffc.2022.3221749