Supplementary Material: Multi-Task Video Captioning with Video and Entailment Generation
Abstract
1.1.1 Video Captioning Datasets

YouTube2Text or MSVD. The Microsoft Research Video Description Corpus (MSVD), also known as YouTube2Text (Chen and Dolan, 2011), is used for our primary video captioning experiments. It contains 1970 YouTube videos in the wild, each with many diverse captions in multiple languages, collected via Amazon Mechanical Turk (AMT). All our experiments use only the English captions. On average, each video has 40 captions, and the overall dataset has about 80,000 unique video-caption pairs. The average clip duration is roughly 10 seconds. We use the standard split of Venugopalan et al. (2015): 1200 videos for training, 100 for validation, and 670 for testing.
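For concreteness, the following is a minimal sketch of how such a standard split could be applied in code; the caption file name, its TSV layout, and the "vidN" ID scheme are illustrative assumptions, not part of the official release.

```python
# Minimal sketch: partition MSVD videos into the standard 1200/100/670 split.
# The caption file name and its (video_id, caption) TSV layout are assumptions
# for illustration; the actual release ships captions in a different format.
from collections import defaultdict

def load_captions(path="msvd_captions.tsv"):
    """Map each video ID to its list of English captions (~40 per video)."""
    captions = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            video_id, caption = line.rstrip("\n").split("\t", 1)
            captions[video_id].append(caption)
    return captions

def standard_split(video_ids):
    """1200 train / 100 val / 670 test (Venugopalan et al., 2015), assuming
    IDs like 'vid1' ... 'vid1970' that carry the standard ordering."""
    ids = sorted(video_ids, key=lambda v: int("".join(filter(str.isdigit, v))))
    assert len(ids) == 1970, "MSVD has 1970 clips"
    return ids[:1200], ids[1200:1300], ids[1300:]

if __name__ == "__main__":
    captions = load_captions()  # requires the (assumed) TSV file
    train_ids, val_ids, test_ids = standard_split(captions)
    print(len(train_ids), len(val_ids), len(test_ids))  # 1200 100 670
```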
Similar Resources
Multi-Task Video Captioning with Video and Entailment Generation
Video captioning, the task of describing the content of a video, has seen promising improvements in recent years with sequence-to-sequence models, but accurately learning the temporal and logical dynamics involved in the task remains a challenge, especially given the lack of sufficient annotated data. We improve video captioning by sharing knowledge with two related directed-generati...
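As a rough illustration of the multi-task setup described above, the sketch below alternates mini-batches between captioning and the two auxiliary generation tasks named in the title (video and entailment generation); the 2:1:1 mixing ratio and the scheduling scheme are illustrative assumptions, not the paper's exact procedure.

```python
# Sketch of a many-to-many multi-task schedule over alternating mini-batches.
# The 2:1:1 mixing ratio and fixed seed are illustrative assumptions.
import random

TASKS = ("caption", "video_prediction", "entailment")

def training_schedule(mixing_ratio=(2, 1, 1), seed=0):
    """Yield an endless task sequence; under the assumed 2:1:1 ratio,
    captioning batches appear twice as often as each auxiliary task."""
    rng = random.Random(seed)
    pool = [t for t, k in zip(TASKS, mixing_ratio) for _ in range(k)]
    while True:
        rng.shuffle(pool)
        yield from pool

schedule = training_schedule()
print([next(schedule) for _ in range(8)])  # e.g. ['caption', 'entailment', ...]
```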
Supplementary Material: Reinforced Video Captioning with Entailment Rewards
Our attention baseline model is similar to the Bahdanau et al. (2015) architecture: we encode the frame-level video features with a bi-directional LSTM-RNN and then generate the caption using a single-layer LSTM-RNN with an attention mechanism. Let {f1, f2, ..., fn} be the frame-level features of a video clip and {w1, w2, ..., wm} be the sequence of words forming a caption. The distribut...
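The attention step can be made concrete with a small NumPy sketch of Bahdanau-style scoring over the bi-directional encoder states; the dimensions and weight names here are illustrative assumptions, not the paper's exact parameterization.

```python
# Toy sketch of Bahdanau-style attention over encoder states.
import numpy as np

def bahdanau_attention(H, s, W_h, W_s, v):
    """H: (n, d) bi-LSTM states for n frames; s: (d,) decoder LSTM state.
    Returns the context vector used to predict the next caption word."""
    scores = np.tanh(H @ W_h.T + s @ W_s.T) @ v   # (n,) alignment scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                       # softmax over frames
    return weights @ H                             # (d,) context vector

rng = np.random.default_rng(0)
n, d = 20, 8                                       # toy sizes
H = rng.normal(size=(n, d))
s = rng.normal(size=d)
W_h = rng.normal(size=(d, d))
W_s = rng.normal(size=(d, d))
v = rng.normal(size=d)
context = bahdanau_attention(H, s, W_h, W_s, v)
print(context.shape)                               # (8,)
```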
Reinforced Video Captioning with Entailment Rewards
Sequence-to-sequence models have shown promising improvements on the temporal task of video captioning, but they optimize a word-level cross-entropy loss during training. First, using policy gradient and mixed-loss methods for reinforcement learning, we directly optimize sentence-level task-based metrics (as rewards), achieving significant improvements over the baseline, based on both automatic m...
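A minimal sketch of the mixed loss this snippet describes, assuming a self-critical baseline (the reward of a greedily decoded caption) and an illustrative gamma; the exact weighting and reward function in the paper may differ.

```python
# Sketch of a mixed loss combining cross-entropy and a self-critical
# policy-gradient term; gamma=0.98 and the toy numbers are assumptions.
import numpy as np

def mixed_loss(logp_xe, logp_sample, r_sample, r_greedy, gamma=0.98):
    """logp_xe: per-token log-probs of the reference caption (teacher forcing);
    logp_sample: per-token log-probs of a sampled caption;
    r_sample / r_greedy: sentence-level reward (e.g. CIDEr or an entailment
    score) for the sampled vs. greedily decoded caption."""
    loss_xe = -np.sum(logp_xe)                              # word-level XE
    loss_rl = -(r_sample - r_greedy) * np.sum(logp_sample)  # self-critical PG
    return gamma * loss_rl + (1.0 - gamma) * loss_xe

# Toy usage: a sample that beats the greedy baseline has its log-prob pushed up.
print(mixed_loss(np.log([0.5, 0.4]), np.log([0.3, 0.6]),
                 r_sample=0.8, r_greedy=0.6))
```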
Joint Event Detection and Description in Continuous Video Streams
As a fine-grained video understanding task, dense video captioning involves first localizing events in a video and then generating captions for the identified events. We present the Joint Event Detection and Description Network (JEDDi-Net) that solves the dense captioning task in an end-to-end fashion. Our model continuously encodes the input video stream with three-dimensional convolutional la...
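To illustrate the three-dimensional convolutional encoding mentioned here, below is a toy single-kernel sketch; real models stack many learned multi-channel kernels (C3D-style), so this only shows the spatio-temporal sliding-window idea.

```python
# Toy sketch: one 3-D convolution (valid padding) over a grayscale clip.
import numpy as np

def conv3d(video, kernel):
    """video: (T, H, W); kernel: (t, h, w). Returns the valid 3-D correlation."""
    T, H, W = video.shape
    t, h, w = kernel.shape
    out = np.empty((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(video[i:i+t, j:j+h, k:k+w] * kernel)
    return out

clip = np.random.rand(16, 32, 32)            # 16 grayscale frames
feat = conv3d(clip, np.ones((3, 3, 3)) / 27)  # averaging kernel
print(feat.shape)                             # (14, 30, 30)
```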
The Feedback Based Mechanism for Video Streaming Over Multipath Ad Hoc Networks
Ad hoc networks are multi-hop wireless networks without a pre-installed infrastructure. Such networks are widely used in military applications and in emergency situations, as they permit the establishment of a communication network at very short notice and at very low cost. Video is very sensitive to packet loss, and wireless ad hoc networks are error-prone due to node mobility and weak links. H...