captioning order

Effect of Speech Recognition Errors on Text Understandability for People who are Deaf or Hard of Hearing

2017

Sushant Kafle, Matt Huenerfauth

Recent advancements in the accuracy of Automated Speech Recognition (ASR) technologies have made them a potential candidate for the task of captioning. However, the presence of errors in the output may present challenges in their use in a fully automatic system. In this research, we are looking more closely into the impact of different inaccurate transcriptions from the ASR system on the unders...

Unpaired Image Captioning by Language Pivoting

2018

Jiuxiang Gu, Shafiq Joty, Jianfei Cai, Gang Wang

Image captioning is a multimodal task involving computer vision and natural language processing, where the goal is to learn a mapping from the image to its natural language description. In general, the mapping function is learned from a training set of image-caption pairs. However, for some languages, a large-scale image-caption paired corpus might not be available. We present an approach to this ...

Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

2017

Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, Lei Zhang

Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning. In this work, we propose a combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image region...

Teaching Machines to Describe Images via Natural Language Feedback

Journal: CoRR, 2017

Huan Ling, Sanja Fidler

Robots will eventually be part of every household. It is thus critical to enable algorithms to learn from and be guided by non-expert users. In this paper, we bring a human in the loop, and enable a human teacher to give feedback to a learning agent in the form of natural language. We argue that a descriptive sentence can provide a much stronger learning signal than a numeric reward in that it ...

Excitation Backprop for RNNs

Journal: CoRR, 2017

Sarah Adel Bargal, Andrea Zunino, Donghyun Kim, Jianming Zhang, Vittorio Murino, Stan Sclaroff

Deep models are state-of-the-art for many vision tasks including video action recognition and video captioning. Models are trained to caption or classify activity in videos, but little is known about the evidence used to make such decisions. Grounding decisions made by deep networks has been studied in spatial visual content, giving more insight into model predictions for images. However, such ...

Review Networks for Caption Generation

2016

Zhilin Yang, Ye Yuan, Yuexin Wu, William W. Cohen, Ruslan Salakhutdinov

We propose a novel extension of the encoder-decoder framework, called a review network. The review network is generic and can enhance any existing encoder-decoder model: in this paper, we consider RNN decoders with both CNN and RNN encoders. The review network performs a number of review steps with an attention mechanism on the encoder hidden states, and outputs a thought vector after each review s...

Text-Guided Attention Model for Image Captioning

2017

Jonghwan Mun, Minsu Cho, Bohyung Han

Visual attention plays an important role in understanding images and has proven effective in generating natural language descriptions of images. On the other hand, recent studies show that language associated with an image can steer visual attention in the scene during our cognitive process. Inspired by this, we introduce a text-guided attention model for image captioning, which learns t...

Adaptive Feature Abstraction for Translating Video to Language

Journal: CoRR, 2016

Yunchen Pu, Martin Renqiang Min, Zhe Gan, Lawrence Carin

Previous models for video captioning often use the output from a specific layer of a Convolutional Neural Network (CNN) as video representations, preventing them from modeling rich, varying context-dependent semantics in video descriptions. In this paper, we propose a new approach to generating adaptive spatiotemporal representations of videos for a captioning task. For this purpose, novel atte...

Step by Step: A Gradual Approach for Dense Video Captioning

Journal: IEEE Access, 2023

Dense video captioning aims to localize and describe events for storytelling in untrimmed videos. It is a conceptually very challenging task that requires concise, relevant, and coherent captioning based on high-quality event localization. Unlike simple temporal action localization tasks without overlapping events, dense video captioning requires detecting multiple/overlapping regions in order to branch out the story. Most existing methods ...

New improvements in decoding speed and latency for automatic captioning

2006

Jian Xue, Rusheng Hu, Yunxin Zhao

In this paper, we present new improvements in decoding speed and latency for automatic captioning in telehealth. Complementary local word confidence scores are used to prune uncompetitive search paths. Subspace distribution clustering hidden Markov modeling (SDCHMM) is used for fast generation of acoustic and local confidence scores, where overlap accumulative probability (OAP) is used to measu...
