WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing

Authors

Abstract

Self-supervised learning (SSL) achieves great success in speech recognition, while limited exploration has been attempted for other speech processing tasks. As the speech signal contains multi-faceted information including speaker identity, paralinguistics, spoken content, etc., learning universal representations for all speech tasks is challenging. To tackle the problem, we propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks. WavLM jointly learns masked speech prediction and denoising in pre-training. By this means, WavLM does not only keep the speech content modeling capability by the masked speech prediction, but also improves the potential to non-ASR tasks by the speech denoising. In addition, WavLM employs a gated relative position bias for the Transformer structure to better capture the sequence ordering of input speech. We also scale up the training dataset from 60k hours to 94k hours. WavLM Large achieves state-of-the-art performance on the SUPERB benchmark, and brings significant improvements for various speech processing tasks on their representative benchmarks. The code and pre-trained models are available at https://aka.ms/wavlm.
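
To make the pre-training idea described above concrete, the sketch below illustrates a joint masked-prediction-and-denoising setup: a clean utterance is mixed with interfering speech or noise to form the model input, while the prediction targets are discrete pseudo-labels of the original clean speech at masked frames. This is an illustrative reconstruction from the abstract only, not the authors' implementation; all function names, parameters (SNR, overlap ratio, codebook size), and the HuBERT-style cross-entropy target are assumptions.

```python
# Minimal sketch of WavLM-style joint masked prediction and denoising pre-training.
# Assumed recipe, not the released code: names and hyperparameters are placeholders.
import torch
import torch.nn.functional as F

def mix_utterances(clean: torch.Tensor, interference: torch.Tensor,
                   snr_db: float = 5.0, overlap_ratio: float = 0.5) -> torch.Tensor:
    """Simulate a noisy/overlapped input by mixing an interfering signal
    into a random segment of the clean utterance at a target SNR."""
    seg_len = int(clean.numel() * overlap_ratio)
    start = torch.randint(0, clean.numel() - seg_len + 1, (1,)).item()
    noise = interference[:seg_len]
    clean_pow = clean[start:start + seg_len].pow(2).mean().clamp_min(1e-8)
    noise_pow = noise.pow(2).mean().clamp_min(1e-8)
    scale = torch.sqrt(clean_pow / (noise_pow * 10 ** (snr_db / 10)))
    mixed = clean.clone()
    mixed[start:start + seg_len] += scale * noise
    return mixed

def masked_prediction_loss(logits: torch.Tensor, targets: torch.Tensor,
                           mask: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over discrete pseudo-labels of the *clean* speech,
    computed only at masked frames."""
    return F.cross_entropy(logits[mask], targets[mask])

# Usage example with random tensors standing in for real audio and encoder outputs:
clean_wave = torch.randn(16000)                 # 1 s of clean speech at 16 kHz
noise_wave = torch.randn(16000)                 # interfering speech or noise
noisy_input = mix_utterances(clean_wave, noise_wave)

frames, vocab = 50, 504                         # frame count and codebook size are placeholders
logits = torch.randn(frames, vocab)             # encoder predictions for the noisy input
targets = torch.randint(0, vocab, (frames,))    # pseudo-labels clustered from clean speech
mask = torch.rand(frames) < 0.5                 # which frames were masked
loss = masked_prediction_loss(logits, targets, mask)
print(loss.item())
```

Because the targets come from the clean signal while the input is corrupted, the encoder is pushed to model spoken content and to suppress interference at the same time, which is the intuition behind the improved performance on non-ASR tasks.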


Similar Resources

Large Scale Discriminative Training for Speech Recognition

This paper describes, and evaluates on a large scale, the lattice-based framework for discriminative training of large vocabulary speech recognition systems based on Gaussian mixture hidden Markov models (HMMs). The paper concentrates on the maximum mutual information estimation (MMIE) criterion which has been used to train HMM systems for conversational telephone speech transcription using up ...


Large-Scale Semi-Supervised Learning for Natural Language Processing

Natural Language Processing (NLP) develops computational approaches to processing language data. Supervised machine learning has become the dominant methodology of modern NLP. The performance of a supervised NLP system crucially depends on the amount of data available for training. In the standard supervised framework, if a sequence of words was not encountered in the training set, the system c...


Deep neural network based supervised speech segregation generalizes to novel noises through large-scale training

Deep neural network (DNN) based supervised speech segregation has been successful in improving human speech intelligibility in noise, especially when DNN is trained and tested on the same noise type. A simple and effective way for improving generalization is to train with multiple noises. This letter demonstrates that by training with a large number of different noises, the objective intelligib...


Large Scale MMIE Training for Conversational Telephone Speech Recognition

This paper describes a lattice-based framework for maximum mutual information estimation (MMIE) of HMM parameters which has been used to train HMM systems for conversational telephone speech transcription using up to 265 hours of training data. These experiments represent the largest-scale application of discriminative training techniques for speech recognition of which the authors are aware, a...


Supervised Speech Separation and Processing

In real-world environments, speech often occurs simultaneously with acoustic interference, such as background noise or reverberation. The interference usually leads to adverse effects on speech perception, and results in performance degradation in many speech applications, including automatic speech recognition and speaker identification. Monaural speech separation and processing aim to separat...



Journal

Journal title: IEEE Journal of Selected Topics in Signal Processing

Year: 2022

ISSN: 1941-0484, 1932-4553

DOI: https://doi.org/10.1109/jstsp.2022.3188113