Unsupervised Audio Analysis for Categorizing Heterogeneous Consumer Domain Videos

Authors

  • Pradeep Natarajan
  • Stavros Tsakalidis
  • Vasant Manohar
  • Rohit Prasad
  • Premkumar Natarajan
Abstract

The ever-increasing volume of consumer domain videos on the Internet has led to a surge of interest in automatically analyzing such content. The audio signal in these videos contains salient information, but applying current automatic speech recognition (ASR) techniques is not viable due to high variability, noise, and multilingual content. We present two unsupervised techniques that do not rely on ASR to address these challenges. The first method involves learning an unsupervised codebook by clustering audio features, and the second involves directly matching low-level features using the pyramid match kernel (PMK). Experimental results on a ≈200-hour audio corpus downloaded from YouTube show that both our approaches significantly outperform the traditional approach of first segmenting the audio stream into a set of mid-level classes (e.g. speech, non-speech, music, silence) and using the duration statistics of these classes to train high-level classifiers.
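The two ideas named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the feature type, codebook size, clustering algorithm, or pyramid construction, so the random 13-dimensional "features", plain k-means, the uniform grid with bin side 2^i, and all function names below are assumptions made for illustration only.

```python
from collections import Counter

import numpy as np


def learn_codebook(features, k, iters=20, seed=0):
    """Plain k-means (Lloyd's algorithm); returns a (k, d) array of codewords."""
    rng = np.random.default_rng(seed)
    start = rng.choice(len(features), size=k, replace=False)
    centers = features[start].astype(float)
    for _ in range(iters):
        # Squared Euclidean distance from every frame to every codeword.
        dists = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = features[labels == j]
            if len(members):  # an empty cluster keeps its previous center
                centers[j] = members.mean(axis=0)
    return centers


def codeword_histogram(features, centers):
    """Assign each frame to its nearest codeword; return a normalized histogram."""
    dists = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    counts = np.bincount(dists.argmin(axis=1), minlength=len(centers))
    return counts / counts.sum()


def pyramid_match(x, y, levels=4):
    """Unnormalized pyramid match score between two feature sets.

    Level i uses a uniform grid with bin side 2**i. Matches that first
    appear at level i are weighted by 1 / 2**i.
    """
    score, prev = 0.0, 0
    for i in range(levels):
        side = 2 ** i
        hx = Counter(map(tuple, np.floor(x / side).astype(int)))
        hy = Counter(map(tuple, np.floor(y / side).astype(int)))
        # Histogram intersection: matched points at this resolution.
        inter = sum(min(c, hy[b]) for b, c in hx.items())
        score += (inter - prev) / side
        prev = inter
    return score


# Synthetic stand-ins for frame-level audio features (hypothetical data).
rng = np.random.default_rng(0)
train = rng.random((500, 13)) * 8
clip_a = rng.random((40, 13)) * 8
clip_b = rng.random((50, 13)) * 8

centers = learn_codebook(train, k=16)
hist_a = codeword_histogram(clip_a, centers)    # fixed-length clip descriptor
similarity = pyramid_match(clip_a, clip_b)      # set-to-set similarity
```

In such a setup, the histogram gives each clip a fixed-length vector that any standard classifier can consume, while the pyramid match score compares two variable-length feature sets directly and could serve as a kernel value.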


Similar resources

Audio-Video Speaker Diarization for Unsupervised Speaker and Face Model Creation

Our goal is to create speaker models in the audio domain and face models in the video domain from a set of videos in an unsupervised manner. Such models can be used later for speaker identification in the audio domain (answering the question “Who was speaking and when”) and/or for face recognition (“Who was seen and when”) for given videos that contain speaking persons. The proposed system is based on an a...


Robust Event Detection From Spoken Content In Consumer Domain Videos

In this paper, we propose an innovative integrated approach to leverage available spoken content while detecting events in consumer-generated multimedia data (i.e., YouTube videos). Spoken content in consumer videos exhibits several challenges. For example, unlike Broadcast News, the spoken audio is typically not labeled. Also, the audio track in consumer videos tends to be noisy and the spoken...


Unsupervised Mining of Statistical Temporal Structures in Video

In this paper, we present algorithms for unsupervised mining of structures in video using multi-scale statistical models. Video structures are repetitive segments in a video stream with consistent statistical characteristics. Such structures can often be interpreted in relation to distinctive semantics, particularly in structured domains like sports. While much work in the literature explores th...


Event Recognition in Videos by Learning from Heterogeneous Web Sources

In this work, we propose to leverage a large number of loosely labeled web videos (e.g., from YouTube) and web images (e.g., from Google/Bing image search) for visual event recognition in consumer videos without requiring any labeled consumer videos. We formulate this task as a new multi-domain adaptation problem with heterogeneous sources, in which the samples from different source domains can...


Chapter 10 UNSUPERVISED MINING OF STATISTICAL TEMPORAL STRUCTURES IN VIDEO

In this chapter, we present algorithms for unsupervised mining of structures in video using multi-scale statistical models. Video structures are repetitive segments in a video stream with consistent statistical characteristics. Such structures can often be interpreted in relation to distinctive semantics, particularly in structured domains like sports. While much work in the literature explores t...



Publication date: 2011