SoundNet: Learning Sound Representations from Unlabeled Video

Authors

  • Yusuf Aytar
  • Carl Vondrick
  • Antonio Torralba
Abstract

We learn rich natural sound representations by capitalizing on large amounts of unlabeled sound data collected in the wild. We leverage the natural synchronization between vision and sound to learn an acoustic representation using two-million unlabeled videos. Unlabeled video has the advantage that it can be economically acquired at massive scales, yet contains useful signals about natural sound. We propose a student-teacher training procedure which transfers discriminative visual knowledge from well established visual recognition models into the sound modality using unlabeled video as a bridge. Our sound representation yields significant performance improvements over the state-of-the-art results on standard benchmarks for acoustic scene/object classification. Visualizations suggest some high-level semantics automatically emerge in the sound network, even though it is trained without ground truth labels.
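The student-teacher transfer described above can be sketched as a distillation objective: the sound (student) network is trained so that its category posterior matches the posterior predicted by a pretrained vision (teacher) network on the synchronized video frame. The sketch below is a minimal NumPy illustration of such a KL-divergence loss, not the paper's actual implementation; the function names are assumptions for this example.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_probs, eps=1e-12):
    """KL(teacher || student), averaged over the batch.

    teacher_probs: posteriors from the visual recognition model
    (the teacher) on video frames; student_logits: raw outputs of
    the sound network (the student) on the corresponding audio.
    """
    p = np.clip(teacher_probs, eps, 1.0)
    q = np.clip(softmax(student_logits), eps, 1.0)
    return float((p * (np.log(p) - np.log(q))).sum(axis=-1).mean())
```

Minimizing this loss drives the sound network toward the teacher's predictions without any ground-truth sound labels; the unlabeled video supplies the pairing between the two modalities.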


Similar resources

Unsupervised feature learning on monaural DOA estimation using convolutional deep belief networks

In recent years, deep learning approaches have gained significant interest as a way of building hierarchical representations from unlabeled data. Additionally, in the field of sound direction-of-arrival (DOA) estimation, the binaural features like interaural time or phase difference and interaural level difference, or monaural cues like spectral peaks and notches are often used to estimate soun...


A Transfer Learning Based Feature Extractor for Polyphonic Sound Event Detection Using Connectionist Temporal Classification

Sound event detection is the task of detecting the type, onset time, and offset time of sound events in audio streams. The mainstream solution is recurrent neural networks (RNNs), which usually predict the probability of each sound event at every time step. Connectionist temporal classification (CTC) has been applied in order to relax the need for exact annotations of onset and offset times; th...


Unsupervised Learning of Behavioural Patterns for Video-Surveillance

Unsupervised learning is a way to extract knowledge from noisy and complex sets of unlabeled data. The video-surveillance setting provides a potentially huge amount of unlabeled information on a given scene. In this paper we explore the use of spectral clustering to learn common behaviours from sets of dynamic events from a video-surveillance system. In particular we discuss how temporal data, ...


Unsupervised learning from videos using temporal coherency deep networks

In this work we address the challenging problem of unsupervised learning from videos. Existing methods utilize the spatio-temporal continuity in contiguous video frames as regularization for the learning process. Typically, this temporal coherence of close frames is used as a free form of annotation, encouraging the learned representations to exhibit small differences between these frames. But ...


Object-Centric Representation Learning from Unlabeled Videos

Supervised (pre-)training currently yields state-of-the-art performance for representation learning for visual recognition, yet it comes at the cost of (1) intensive manual annotations and (2) an inherent restriction in the scope of data relevant for learning. In this work, we explore unsupervised feature learning from unlabeled video. We introduce a novel object-centric approach to temporal co...




Publication date: 2016