Learning to Guide Decoding for Image Captioning
نویسندگان
چکیده
Recently, much advance has been made in image captioning, and an encoder-decoder framework has achieved outstanding performance for this task. In this paper, we propose an extension of the encoder-decoder framework by adding a component called guiding network. The guiding network models the attribute properties of input images, and its output is leveraged to compose the input of the decoder at each time step. The guiding network can be plugged into the current encoder-decoder framework and trained in an end-to-end manner. Hence, the guiding vector can be adaptively learned according to the signal from the decoder, making itself to embed information from both image and language. Additionally, discriminative supervision can be employed to further improve the quality of guidance. The advantages of our proposed approach are verified by experiments carried out on the MS COCO dataset.
منابع مشابه
Changes on the Horizon for the Multimedia Community
The Impact of Deep Learning The development of AI algorithms, represented by deep learning, has bolstered multimedia research. In particular, deep learning has led to a multimodality-based algorithm framework, enabling the effective fusion and use of cross-domain data. Take image and video captioning, for example. A couple of years ago, tagging was the only way to describe images and videos. Bu...
متن کاملAutomated Image Captioning for Rapid Prototyping and Resource Constrained Environments
Significant performance gains in deep learning coupled with the exponential growth of image and video data on the Internet have resulted in the recent emergence of automated image captioning systems. Ensuring scalability of automated image captioning systems with respect to the ever increasing volume of image and video data is a significant challenge. This paper provides a valuable insight in t...
متن کاملStack-Captioning: Coarse-to-Fine Learning for Image Captioning
The existing image captioning approaches typically train a one-stage sentence decoder, which is difficult to generate rich fine-grained descriptions. On the other hand, multi-stage image caption model is hard to train due to the vanishing gradient problem. In this paper, we propose a coarse-to-fine multistage prediction framework for image captioning, composed of multiple decoders each of which...
متن کاملAutomated Image Captioning Using Nearest-Neighbors Approach Driven by Top-Object Detections
The significant performance gains in deep learning coupled with the exponential growth of image and video data on the Internet have resulted in the recent emergence of automated image captioning systems. Two broad paradigms have emerged in automated image captioning, i.e., generative model-based approaches and retrieval-based approaches. Although generative model-based approaches that use the r...
متن کاملDeepDiary: Automatic Caption Generation for Lifelogging Image Streams
Lifelogging cameras capture everyday life from a firstperson perspective, but generate so much data that it is hard for users to browse and organize their image collections effectively. In this paper, we propose to use automatic image captioning algorithms to generate textual representations of these collections. We develop and explore novel techniques based on deep learning to generate caption...
متن کامل