Deep Learning for Visual Question Answering
Author
Abstract
This project deals with the problem of Visual Question Answering (VQA). We develop neural network-based models to answer open-ended questions that are grounded in images. We used the newly released VQA dataset (with about 750K questions) to carry out our experiments. Our model makes use of two popular neural network architectures: Convolutional Neural Networks (CNNs) and Long Short-Term Memory networks (LSTMs). We use state-of-the-art CNN features for encoding images, and word embeddings to encode the words. Our Bag-of-Words + CNN model obtained an accuracy of 44.47%, while our CNN+LSTM model obtained an accuracy of 47.80% on the validation set of the VQA dataset. The code has been open sourced under the MIT License, and is the first open-source project to work with the VQA dataset.
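The CNN+LSTM model described above can be sketched as follows: precomputed CNN image features and an LSTM-encoded question are fused and classified over a fixed answer vocabulary. This is a minimal illustrative sketch, not the authors' implementation; the dimensions (4096-d image features, 300-d embeddings, 512-d hidden state, 1000 answers) and the elementwise fusion are assumptions for the example.

```python
import torch
import torch.nn as nn

class CNNLSTMVQA(nn.Module):
    """Minimal sketch of a CNN+LSTM VQA model (hypothetical dimensions)."""
    def __init__(self, vocab_size=1000, embed_dim=300,
                 img_feat_dim=4096, hidden_dim=512, num_answers=1000):
        super().__init__()
        # Word embeddings encode question tokens
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # LSTM encodes the question into a fixed-size vector
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Project precomputed CNN image features to the same size
        self.img_proj = nn.Linear(img_feat_dim, hidden_dim)
        # Classifier over a fixed answer vocabulary
        self.classifier = nn.Linear(hidden_dim, num_answers)

    def forward(self, question_ids, img_feats):
        emb = self.embed(question_ids)      # (B, T, E)
        _, (h, _) = self.lstm(emb)          # h: (1, B, H)
        q_vec = h[-1]                       # final hidden state, (B, H)
        i_vec = self.img_proj(img_feats)    # (B, H)
        fused = q_vec * i_vec               # elementwise fusion (an assumption)
        return self.classifier(fused)       # (B, num_answers)

# Usage with dummy inputs
model = CNNLSTMVQA()
q = torch.randint(0, 1000, (2, 10))   # batch of 2 questions, 10 tokens each
v = torch.randn(2, 4096)              # precomputed CNN image features
logits = model(q, v)
print(logits.shape)                   # torch.Size([2, 1000])
```

In practice the answer logits would be trained with a cross-entropy loss against the most common answers in the dataset, treating VQA as classification over a fixed answer set.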
Similar References
Visual Question Answering Using Various Methods
This project applies deep learning tools to enable a computer to answer questions by looking at images. The visual question answering dataset [1] is introduced: it consists of 204,721 real images with 614,164 questions, and 50,000 abstract scenes with 150,000 questions. Various methods are reproduced, and analyses of the different models are presented.
Learning Convolutional Text Representations for Visual Question Answering
Visual question answering is a recently proposed artificial intelligence task that requires a deep understanding of both images and texts. In deep learning, images are typically modeled through convolutional neural networks, and texts are typically modeled through recurrent neural networks. While the requirement for modeling images is similar to traditional computer vision tasks, such as object ...
Visual Question Answering using Deep Learning
Multimodal learning between images and language has gained the attention of researchers over the past few years. Using recent deep learning techniques, specifically end-to-end trainable artificial neural networks, performance in tasks like automatic image captioning and bidirectional sentence and image retrieval has been significantly improved. Recently, as a further exploration of present artificial...
Survey of Visual Question Answering: Datasets and Techniques
Visual question answering (or VQA) is a new and exciting problem that combines natural language processing and computer vision techniques. We present a survey of the various datasets and models that have been used to tackle this task. The first part of this survey details the various datasets for VQA and compares them along some common factors. The second part of this survey details the differe...
Deep learning evaluation using deep linguistic processing
We discuss problems with the standard approaches to evaluation for tasks like visual question answering, and argue that artificial data can be used to address these as a complement to current practice. We demonstrate that with the help of existing ‘deep’ linguistic processing technology we are able to create challenging abstract datasets, which enable us to investigate the language understandin...
Visual Madlibs: Fill in the blank Image Generation and Question Answering
In this paper, we introduce a new dataset consisting of 360,001 focused natural language descriptions for 10,738 images. This dataset, the Visual Madlibs dataset, is collected using automatically produced fill-in-the-blank templates designed to gather targeted descriptions about: people and objects, their appearances, activities, and interactions, as well as inferences about the general scene o...