Connecting Language and Vision Using Crowdsourced Dense Image Annotations
Authors
Ranjay Krishna, Stanford University, Stanford, CA, USA
Yuke Zhu, Stanford University, Stanford, CA, USA
Oliver Groth, Dresden University of Technology, Dresden, Germany
Justin Johnson, Stanford University, Stanford, CA, USA
Kenji Hata, Stanford University, Stanford, CA, USA
Joshua Kravitz, Stanford University, Stanford, CA, USA
Stephanie Chen, Stanford University, Stanford, CA, USA
Yannis Kalantidis, Yahoo Inc., San Francisco, CA, USA
Li-Jia Li, Snapchat Inc., Los Angeles, CA, USA
David A. Shamma, Centrum Wiskunde & Informatica (CWI), Amsterdam
Michael S. Bernstein, Stanford University, Stanford, CA, USA
Li Fei-Fei, Stanford University, Stanford, CA, USA
Abstract
Despite progress in perceptual tasks such as image classification, computers still perform poorly on cognitive tasks such as image description and question answering. Cognition is core to tasks that involve not just recognizing, but reasoning about our visual world. However, models used to tackle the rich content in images for cognitive tasks are still being trained using the same datasets designed for perceptual tasks. To achieve success at cognitive tasks, models need to understand the interactions and relationships between objects in an image. When asked “What vehicle is the person riding?”, computers will need to identify the objects in an image as well as the relationships riding(man, carriage) and pulling(horse, carriage) to answer correctly that “the person is riding a horse-drawn carriage.” In this paper, we present the Visual Genome dataset to enable the modeling of such relationships. We collect dense annotations of objects, attributes, and relationships within each image to learn these models. Specifically, our dataset contains over 108K images where each image has an average of 35 objects, 26 attributes, and 21 pairwise relationships between objects.
We canonicalize the objects, attributes, relationships, and noun phrases in region descriptions and question answer pairs to WordNet synsets. Together, these annotations represent the densest and largest dataset of image descriptions, objects, attributes, relationships, and question answer pairs.
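As a rough illustration of the carriage example in the abstract, the relationships riding(man, carriage) and pulling(horse, carriage) can be held in a small graph structure and queried. This is only a minimal sketch: the `Relationship` and `SceneGraph` classes and their methods are hypothetical names chosen for this example and do not reflect the Visual Genome data format or any released tooling.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relationship:
    """A directed predicate(subject, object) triple (hypothetical type)."""
    predicate: str
    subject: str
    obj: str


@dataclass
class SceneGraph:
    """Toy container for the objects and pairwise relationships of one image."""
    objects: set[str] = field(default_factory=set)
    relationships: list[Relationship] = field(default_factory=list)

    def add_relationship(self, predicate: str, subject: str, obj: str) -> None:
        # Register both endpoints as objects, then store the triple.
        self.objects.update((subject, obj))
        self.relationships.append(Relationship(predicate, subject, obj))

    def objects_of(self, predicate: str, subject: str) -> list[str]:
        # All objects o such that predicate(subject, o) holds.
        return [r.obj for r in self.relationships
                if r.predicate == predicate and r.subject == subject]

    def subjects_of(self, predicate: str, obj: str) -> list[str]:
        # All subjects s such that predicate(s, obj) holds.
        return [r.subject for r in self.relationships
                if r.predicate == predicate and r.obj == obj]


# The two relationships named in the abstract.
graph = SceneGraph()
graph.add_relationship("riding", "man", "carriage")
graph.add_relationship("pulling", "horse", "carriage")

# "What vehicle is the person riding?" needs both triples:
# first what the man rides, then what pulls that vehicle.
vehicles = graph.objects_of("riding", "man")
pullers = graph.subjects_of("pulling", vehicles[0])
```

Here `vehicles` holds the ridden object and `pullers` holds whatever pulls it, which together are enough to phrase an answer such as "a horse-drawn carriage".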
Similar resources
Quality Assessment for Crowdsourced Object Annotations
As computer vision datasets grow larger the community is increasingly relying on crowdsourced annotations to train and test our algorithms. Due to the heterogeneous and unpredictable capability of online annotators, various strategies have been proposed to “clean” crowdsourced annotations. However, these strategies typically involve getting more annotations, perhaps different types of annotatio...
The Effects of Multimedia Annotations on Iranian EFL Learners’ L2 Vocabulary Learning
In our modern technological world, Computer-Assisted Language learning (CALL) is a new realm towards learning a language in general, and learning L2 vocabulary in particular. It is assumed that the use of multimedia annotations promotes language learners’ vocabulary acquisition. Therefore, this study set out to investigate the effects of different multimedia annotations (still picture annotatio...
CLAD: A Complex and Long Activities Dataset with Rich Crowdsourced Annotations
This paper introduces a novel activity dataset which exhibits real-life and diverse scenarios of complex, temporallyextended human activities and actions. The dataset presents a set of videos of actors performing everyday activities in a natural and unscripted manner. The dataset was recorded using a static Kinect 2 sensor which is commonly used on many robotic platforms. The dataset comprises ...
Apples to Oranges: Evaluating Image Annotations from Natural Language Processing Systems
We examine evaluation methods for systems that automatically annotate images using cooccurring text. We compare previous datasets for this task using a series of baseline measures inspired by those used in information retrieval, computer vision, and extractive summarization. Some of our baselines match or exceed the best published scores for those datasets. These results illuminate incorrect as...
Crowdsourcing image annotation for nucleus detection and segmentation in computational pathology: evaluating experts, automated methods, and the crowd.
The development of tools in computational pathology to assist physicians and biomedical scientists in the diagnosis of disease requires access to high-quality annotated images for algorithm learning and evaluation. Generating high-quality expert-derived annotations is time-consuming and expensive. We explore the use of crowdsourcing for rapidly obtaining annotations for two core tasks in com- p...