Diverse video captioning through latent variable expansion
Abstract
• A diverse captioning model of fully convolutional design is proposed.
• We develop a new evaluation metric to assess sentence diversity.
• Our method achieves superior performance compared with state-of-the-art benchmarks.

Automatically describing video content with a textual description is a challenging but important task that has been attracting a lot of attention in the computer vision community. Previous works mainly strive for the accuracy of the generated sentences while ignoring their diversity, which is inconsistent with human behavior. In this paper, we aim to caption each video with multiple descriptions and propose a novel framework. Concretely, given a video, the intermediate latent variables of the conventional encode-decode process are utilized as input to a conditional generative adversarial network (CGAN) for the purpose of generating diverse sentences. We adopt different Convolutional Neural Networks (CNNs) as our generator, which produces descriptions conditioned on the latent variables, and as our discriminator, which assesses the quality of the generated sentences. Simultaneously, a new metric, DCE, is designed to evaluate the diversity of the captions. We evaluate our method on benchmark datasets, where it demonstrates its ability to generate diverse descriptions and achieves superior results against other methods.
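The abstract describes the architecture only in outline. Below is a minimal PyTorch sketch of that idea, not the paper's implementation: a CNN generator turns an encoder-decoder latent variable plus a noise draw into a word sequence, a CNN discriminator scores captions conditioned on the same latent, and a distinct-n ratio stands in for the DCE diversity metric (whose definition is not given in this abstract). Every dimension, module name, and the distinct_n helper are illustrative assumptions.

```python
# A minimal sketch of the latent-variable CGAN idea, assuming illustrative
# sizes and modules; this is NOT the paper's published model.
import torch
import torch.nn as nn

VOCAB, MAX_LEN, LATENT, NOISE = 1000, 12, 256, 64

class CNNGenerator(nn.Module):
    """Maps (latent variable, noise) to per-step word logits."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(LATENT + NOISE, 128 * MAX_LEN)
        # 1-D convolutions over time stand in for a fully convolutional decoder.
        self.conv = nn.Sequential(
            nn.Conv1d(128, 128, 3, padding=1), nn.ReLU(),
            nn.Conv1d(128, VOCAB, 1),
        )

    def forward(self, z_latent, noise):
        h = self.fc(torch.cat([z_latent, noise], dim=1)).view(-1, 128, MAX_LEN)
        return self.conv(h).transpose(1, 2)  # (batch, MAX_LEN, VOCAB)

class CNNDiscriminator(nn.Module):
    """Scores caption realism conditioned on the latent variable."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(VOCAB, 128)  # soft one-hot words -> embeddings
        self.conv = nn.Sequential(
            nn.Conv1d(128 + LATENT, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),
        )
        self.out = nn.Linear(128, 1)

    def forward(self, word_probs, z_latent):
        e = self.embed(word_probs).transpose(1, 2)           # (B, 128, T)
        z = z_latent.unsqueeze(2).expand(-1, -1, e.size(2))  # tile latent over time
        return self.out(self.conv(torch.cat([e, z], dim=1)).squeeze(2))

def distinct_n(captions, n=2):
    """Proxy diversity score (the paper's DCE metric is not specified here):
    unique n-grams divided by total n-grams across the sampled captions."""
    grams = [tuple(c[i:i + n]) for c in captions for i in range(len(c) - n + 1)]
    return len(set(grams)) / max(len(grams), 1)

# Diversity comes from resampling the noise while reusing one video's latent.
gen = CNNGenerator()
z = torch.randn(1, LATENT)  # stand-in for the encoder-decoder latent variable
caps = [gen(z, torch.randn(1, NOISE)).argmax(-1).squeeze(0).tolist()
        for _ in range(3)]
print(distinct_n(caps))
```

In the paper's setting the generator and discriminator would be trained adversarially; the snippet only fixes the interfaces between the latent variable, the noise, and the two networks.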
Similar resources
Nonlinear Latent Variable Models for Video Sequences
Many high-dimensional time-varying signals can be modeled as a sequence of noisy nonlinear observations of a low-dimensional dynamical process. Given high-dimensional observations and a distribution describing the dynamical process, we present a computationally inexpensive approximate algorithm for estimating the inverse of this mapping. Once this mapping is learned, we can invert it to constru...
Diverse Image Captioning via GroupTalk
Generally speaking, different persons tend to describe images from various aspects due to their individually subjective perception. As a result, generating the appropriate descriptions of images with both diversity and high quality is of great importance. In this paper, we propose a framework called GroupTalk to learn multiple image caption distributions simultaneously and effectively mimic the...
Reconstruction Network for Video Captioning
In this paper, the problem of describing visual contents of a video sequence with natural language is addressed. Unlike previous video captioning work mainly exploiting the cues of video contents to make a language description, we propose a reconstruction network (RecNet) with a novel encoder-decoder-reconstructor architecture, which leverages both the forward (video to sentence) and backward (...
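The snippet above names the key idea: a reconstructor maps decoder states back to the video feature space, so training combines a forward (video to sentence) and a backward (sentence to video) term. Below is a rough, self-contained sketch of that coupling; all sizes, module choices, and the equal video/caption lengths are assumptions, not RecNet's published code.

```python
# Hedged sketch of an encoder-decoder-reconstructor training step; the
# dimensions and module choices here are simplifying assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

FEAT, HID, VOCAB, T = 512, 256, 1000, 8

class CaptionerWithReconstructor(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.GRU(FEAT, HID, batch_first=True)        # video -> state
        self.decoder = nn.GRU(VOCAB, HID, batch_first=True)       # words -> states
        self.word_out = nn.Linear(HID, VOCAB)
        self.reconstructor = nn.GRU(HID, FEAT, batch_first=True)  # states -> video

    def forward(self, video_feats, word_onehots):
        _, h = self.encoder(video_feats)             # summarize the frames
        dec_states, _ = self.decoder(word_onehots, h)
        logits = self.word_out(dec_states)           # forward: video -> sentence
        recon, _ = self.reconstructor(dec_states)    # backward: sentence -> video
        return logits, recon

model = CaptionerWithReconstructor()
video = torch.randn(2, T, FEAT)                      # T frame features per clip
words = F.one_hot(torch.randint(VOCAB, (2, T)), VOCAB).float()  # teacher forcing
targets = torch.randint(VOCAB, (2, T))               # next-word training targets
logits, recon = model(video, words)
# Joint objective: the usual captioning cross-entropy plus a reconstruction
# term that forces decoder states to retain video information.
loss = F.cross_entropy(logits.reshape(-1, VOCAB), targets.reshape(-1)) \
       + F.mse_loss(recon, video)
```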
Captioning Images with Diverse Objects Supplementary Material
We present additional examples of the NOC model’s descriptions on Imagenet images. We first present some examples where the model is able to generate descriptions of an object in different contexts. Then we present several examples to demonstrate the diversity of objects that NOC can describe. We then present examples where the model generates erroneous descriptions and categorize these errors.
Consensus-based Sequence Training for Video Captioning
Captioning models are typically trained using the cross-entropy loss. However, their performance is evaluated on other metrics designed to better correlate with human assessments. Recently, it has been shown that reinforcement learning (RL) can directly optimize these metrics in tasks such as captioning. However, this is computationally costly and requires specifying a baseline reward at each st...
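As a concrete picture of the RL setup this snippet refers to, here is a tiny REINFORCE-with-baseline loss in PyTorch. The reward and baseline values are made up, and the paper's consensus-based baseline itself is not reproduced here; this only illustrates the general recipe of weighting log-probabilities by an advantage.

```python
# Toy REINFORCE-with-baseline step, assuming rewards precomputed by some
# captioning metric; illustrative only, not the paper's specific method.
import torch

def reinforce_loss(log_probs, reward, baseline):
    """log_probs: (batch,) summed log-probability of each sampled caption.
    The advantage (reward - baseline) weights the policy gradient."""
    return -((reward - baseline).detach() * log_probs).mean()

# Captions whose metric reward beats the baseline get their log-probability
# increased; the rest are pushed down. The numbers below are made up.
log_probs = torch.tensor([-4.2, -3.1], requires_grad=True)
loss = reinforce_loss(log_probs, torch.tensor([0.6, 0.2]), torch.tensor([0.5, 0.5]))
loss.backward()
```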
Journal
Journal title: Pattern Recognition Letters
Year: 2022
ISSN: 1872-7344, 0167-8655
DOI: https://doi.org/10.1016/j.patrec.2022.05.021