Describing Common Human Visual Actions in Images
Authors
Ronchi and Perona
Abstract
Which common human actions and interactions are recognizable in monocular still images? Which involve objects and/or other people? How many actions does a person perform at a time? We address these questions by exploring the actions and interactions that are detectable in the images of the MS COCO dataset. We make two main contributions. The first is a list of 140 common ‘visual actions’, obtained by analyzing the largest online verb lexicon currently available for English (VerbNet) and the human-written sentences used to describe images in MS COCO. The second is a complete set of annotations for those ‘visual actions’, each composed of a subject-object pair and the associated verb, which we call COCO-a (a for ‘actions’). COCO-a is larger than existing action datasets in both the number of actions and the number of instances of these actions, and it is unique in being data-driven rather than experimenter-biased. Its other distinguishing features are that it is exhaustive and that all subjects and objects are localized. We provide a statistical analysis of the accuracy of our annotations and of each action, interaction, and subject-object combination.
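The annotation structure described in the abstract (a localized subject, an object or other person, and the associated verbs) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the `Interaction` record, its field names, and the sample data are hypothetical and are not the actual COCO-a file format or API. The sketch only shows how such records could answer the question of how many actions one person performs at a time.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Interaction:
    """Hypothetical annotation record: one subject, an optional object, and its verbs."""
    image_id: int
    subject_id: int                   # the localized person performing the actions
    object_id: Optional[int]          # localized object or other person; None for solo actions
    visual_actions: Tuple[str, ...]   # verbs assumed to come from the visual-action list


def actions_per_subject(records):
    """Map (image_id, subject_id) to the set of distinct actions that subject performs."""
    by_subject = defaultdict(set)
    for rec in records:
        by_subject[(rec.image_id, rec.subject_id)].update(rec.visual_actions)
    return dict(by_subject)


# Made-up sample data, not taken from the dataset.
records = [
    Interaction(1, 10, 20, ("hold", "look")),
    Interaction(1, 10, None, ("stand",)),
    Interaction(1, 11, 10, ("talk", "look")),
]

# Number of distinct simultaneous actions per person in each image.
counts = {key: len(actions) for key, actions in actions_per_subject(records).items()}
```

Grouping by `(image_id, subject_id)` rather than by record reflects the abstract's framing, in which a single person may take part in several interactions within the same image.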
Related resources
Describing Common Human Visual Actions in Images (Ronchi and Perona)
Which common human actions and interactions are recognizable in monocular still images? Which involve objects and/or other people? How many is a person performing at a time? We address these questions by exploring the actions and interactions that are detectable in the images of the MS COCO dataset. We make two main contributions. First, a list of 140 common ‘visual actions’, obtained by analyz...
The Rhetorical - Aesthetic Approach to Constructing the Relation between Images and Visual Inventions with Global Politics
Images and photos play an important role in our understanding of domestic and international events. Today we are living in the age of the visualization of politics. The images are vague, rhetorical, and aesthetic components of political and social phenomena and can give them a beautiful or detestable structure. In the digital age, images in and of themselves can define our structure and vision ...
Describing Images using Inferred Visual Dependency Representations
The Visual Dependency Representation (VDR) is an explicit model of the spatial relationships between objects in an image. In this paper we present an approach to training a VDR Parsing Model without the extensive human supervision used in previous work. Our approach is to find the objects mentioned in a given description using a state-of-the-art object detector, and to use successful detections...
Grounding Action Descriptions in Videos
Recent work has shown that the integration of visual information into text-based models can substantially improve model predictions, but so far only visual information extracted from static images has been used. In this paper, we consider the problem of grounding sentences describing actions in visual information extracted from videos. We present a general purpose corpus that aligns high qualit...
Reduced-Reference Image Quality Assessment based on saliency region extraction
In this paper, a novel saliency theory based RR-IQA metric is introduced. As the human visual system is sensitive to the salient region, evaluating the image quality based on the salient region could increase the accuracy of the algorithm. In order to extract the salient regions, we use blob decomposition (BD) tool as a texture component descriptor. A new method for blob decomposition is propos...