Multi-modal Summarization for Asynchronous Collection of Text, Image, Audio and Video
Authors
Abstract
The rapid increase in multimedia data transmitted over the Internet necessitates multi-modal summarization (MMS) of collections of text, image, audio and video. In this work, we propose an extractive multi-modal summarization method that can automatically generate a textual summary given a set of documents, images, audio recordings and videos related to a specific topic. The key idea is to bridge the semantic gaps between multi-modal content. For audio information, we design an approach to selectively use its transcription. For visual information, we learn joint representations of text and images using a neural network. Finally, all of the multi-modal aspects are considered to generate the textual summary by maximizing salience, non-redundancy, readability and coverage through budgeted optimization of submodular functions. We further introduce an MMS corpus in English and Chinese, which is released to the public. The experimental results obtained on this dataset demonstrate that our method outperforms other competitive baseline methods.
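The budgeted submodular selection mentioned in the abstract is typically solved with a greedy algorithm that repeatedly adds the sentence with the best coverage gain per unit cost until the length budget is exhausted. The sketch below illustrates this general scheme; the `summarize` function, its coverage objective and the 0.3 saturation constant are illustrative assumptions, not the authors' exact formulation.

```python
# Hedged sketch: greedy budgeted submodular maximization for extractive
# summarization. A saturated-coverage objective rewards sentences similar
# to the rest of the collection while discouraging redundancy; candidates
# are ranked by marginal gain divided by cost, subject to a total budget.

def summarize(sentences, similarity, budget, cost, r=1.0):
    """Greedily select sentence ids under a cost budget.

    sentences  : list of sentence ids
    similarity : dict (i, j) -> similarity in [0, 1], symmetric
    budget     : maximum total cost (e.g. word count) of the summary
    cost       : dict id -> cost of that sentence
    r          : scaling exponent on cost in the gain ratio
    """
    def coverage(S):
        # f(S) = sum_i min(sim(i, S), 0.3 * sim(i, all)): coverage
        # saturates, so near-duplicate sentences add little gain.
        total = 0.0
        for i in sentences:
            to_s = sum(similarity[(i, j)] for j in S)
            to_all = sum(similarity[(i, j)] for j in sentences)
            total += min(to_s, 0.3 * to_all)
        return total

    selected, spent = [], 0.0
    remaining = list(sentences)
    while True:
        feasible = [i for i in remaining if spent + cost[i] <= budget]
        if not feasible:
            break
        base = coverage(selected)
        best = max(feasible,
                   key=lambda i: (coverage(selected + [i]) - base)
                   / cost[i] ** r)
        if coverage(selected + [best]) - base <= 0:
            break  # no candidate improves the objective
        selected.append(best)
        spent += cost[best]
        remaining.remove(best)
    return selected
```

On a toy collection where two sentences are near-duplicates, the saturation term makes the greedy step skip the duplicate in favor of a less similar but more informative sentence.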
Similar Resources
Semantic Multi-modal Analysis, Structuring, and Visualization for Candid Personal Interaction Videos
Videos are rich in multimedia content and semantics, which should be used by video browsers to better present the audio-visual information to the viewer. Ubiquitous video players allow for content to be scanned linearly, rarely providing summaries or methods for searching. Through analysis of audio and video tracks, it is possible to extract text transcripts from audio, displayed text from vide...
Full text
Discovery and Fusion of Salient Multi-modal Features towards News Story Segmentation
In this paper, we present our new results in news video story segmentation and classification in the context of TRECVID video retrieval benchmarking event 2003. We applied and extended the Maximum Entropy statistical model to effectively fuse diverse features from multiple levels and modalities, including visual, audio, and text. We have included various features such as motion, face, music/spe...
Full text
News Video Story Segmentation Using Fusion of Multi-level Multi-modal Features in Trecvid
In this paper, we present our new results in news video story segmentation and classification in the context of TRECVID video retrieval benchmarking event 2003. We applied and extended the Maximum Entropy statistical model to effectively fuse diverse features from multiple levels and modalities, including visual, audio, and text. We have included various features such as motion, face, music/spe...
Full text
Layered Dynamic Mixture Model for Pattern Discovery in Asynchronous Multi-modal Streams
We propose a layered dynamic mixture model for asynchronous multi-modal fusion for unsupervised pattern discovery in video. The lower layer of the model uses generative temporal structures such as a hierarchical hidden Markov model to convert the audio-visual streams into mid-level labels; it also models the correlations in text with probabilistic latent semantic analysis. The upper layer fuses ...
Full text
Name-It: Naming and Detecting Faces in News Video
We have developed Name-It, a system that associates faces and names in news videos. The system is given news videos, which include image sequences and transcripts obtained from audio tracks or closed caption texts. The system can then either infer possible name candidates for a given face, or locate a face in news videos by name. To accomplish this task, the system takes a multi-modal video ana...
Full text