Visual Madlibs: Fill in the blank Image Generation and Question Answering
Abstract
In this paper, we introduce a new dataset consisting of 360,001 focused natural language descriptions for 10,738 images. This dataset, the Visual Madlibs dataset, is collected using automatically produced fill-in-the-blank templates designed to gather targeted descriptions about: people and objects, their appearances, activities, and interactions, as well as inferences about the general scene or its broader context. We provide several analyses of the Visual Madlibs dataset and demonstrate its applicability to two new description generation tasks: focused description generation, and multiple-choice question-answering for images. Experiments using joint-embedding and deep learning methods show promising results on these tasks.
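As a rough illustration of the multiple-choice task, the sketch below picks the candidate answer whose embedding lies closest to the image embedding. It assumes the image and each candidate phrase have already been mapped into a shared vector space. The prompt, the candidate phrases, the vectors, and the function names are all hypothetical, and scoring by cosine similarity is an assumption made for illustration, not necessarily the scoring used in the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pick_answer(image_vec, choice_vecs):
    """Return the index of the candidate whose embedding is most similar to the image."""
    scores = [cosine(image_vec, c) for c in choice_vecs]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical fill-in-the-blank prompt with made-up candidates and embeddings.
prompt = "The person is ____."
candidates = ["playing frisbee", "cooking dinner", "reading a book"]
candidate_vecs = [[0.8, 0.2, 0.1], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
image_vec = [0.9, 0.1, 0.0]  # stand-in for a learned image embedding

best = pick_answer(image_vec, candidate_vecs)
print(prompt.replace("____", candidates[best]))
```

In a real system the vectors would come from trained image and text encoders. The toy values here only show the selection step.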
Related papers
Solving Visual Madlibs with Multiple Cues
This paper presents an approach for answering fill-in-the-blank multiple choice questions from the Visual Madlibs dataset. Instead of generic and commonly used representations trained on the ImageNet classification task, our approach employs a combination of networks trained for specialized tasks such as scene recognition, person activity classification, and attribute prediction. We also presen...
Mean Box Pooling: A Rich Image Representation and Output Embedding for the Visual Madlibs Task
Question answering about real-world images is a relatively new research direction that requires a chain of machine visual perception, natural language understanding, and deductive capabilities to successfully come up with an answer on a question about visual content. In contrast to many classical Computer Vision problems such as recognition or detection, this task does not evaluate any internal...
Learning Models for Actions and Person-Object Interactions with Transfer to Question Answering
In this paper, we propose a convolutional deep network model which utilizes local and global context through feature fusion to make human activity label predictions and achieve state-of-the-art performance on two different activity recognition datasets, the HICO and MPII Human Pose Dataset. We use Multiple Instance Learning to handle the lack of full person instance-label supervision and weight...
What is in that picture ? Visual Question Answering System
Visual Question Answering is a complex task which aims at answering a question about an image. This task requires multimodal representation learning for both images and text. To solve this problem, an image and text representation is required and high level interactions between the two must be carefully encoded into the model in order to provide the correct answer. The answers can be a word, a ...
Boosting Passage Retrieval through Reuse in Question Answering
Question Answering (QA) is an emerging important field in Information Retrieval. In a QA system the archive of previous questions asked from the system makes a collection full of useful factual nuggets. This paper makes an initial attempt to investigate the reuse of facts contained in the archive of previous questions to help and gain performance in answering future related factoid questions. I...
Journal: CoRR
Volume: abs/1506.00278
Publication year: 2015