Hierarchical Scene Annotation
Abstract
Supervised datasets play a central role as standards against which to benchmark long-term progress in computer vision. Over the past decade, the PASCAL [2] and Berkeley segmentation (BSDS) [3] datasets have filled these roles for the object detection and image segmentation tasks, respectively. The type of annotation available for each dataset determines the particular visual subtasks to which it is applicable. Object bounding boxes can benchmark detection algorithms, but are of limited use for training or evaluating segmentation. Segmented objects are more widely useful, but more time-consuming to annotate. What visual tasks are most important, and what level of annotation detail is appropriate?

We present an alternative to thinking about dataset annotation in terms of a restricted set of visual tasks. Our key observation is that a hierarchical groundtruth representation, in the form of a doubly ordered region tree, allows one to subsume disparate aspects of image labeling into a single framework. Specifically, we capture a nearly complete description of any scene in terms of objects, parts, object-part containment, segmentation, and figure-ground or occlusion ordering. Figure 1 illustrates the type of detail our annotation model encompasses for a typical scene.

Our unifying abstraction regards a scene as a set S = {R1, R2, ..., Rn}, where each Ri ⊆ I is a region in the image I. In general, Ri ∩ Rj may be nonempty. It then organizes the regions {Ri} into a tree T. Let N(Ri) denote the node of T corresponding to region Ri. N(Rj) is set to be the parent node of N(Ri) iff: (1) Rj ⊃ Ri, (2) Rj and Ri have an object-part relationship, and (3) ∄Rk such that Rj ⊃ Rk ⊃ Ri and Rj, Rk, Ri have an object-part-subpart relationship, respectively. If no region Rj satisfies all three conditions for Ri, then we set N(Ri) to be a child of the root node. Simply stated, T decomposes the scene into a multilevel object-part hierarchy.

We exploit one additional degree of freedom within T: the order O(·) in which nodes appear beneath a common parent encodes local occlusion relationships. Given sibling regions Ri and Rj such that Ri ∩ Rj ≠ ∅, Ri occludes Rj if O(Ri) < O(Rj), and Rj occludes Ri if O(Ri) > O(Rj). If Ri ∩ Rj = ∅, then they do not occlude one another and we disregard their relative ordering.

Given T and O, preorder tree traversal uses the object-part hierarchy to translate local occlusions into a global figure-ground ordering. It also recovers a groundtruth ultrametric contour map (UCM) [1] weighting visible boundaries by the structural importance (object, part, subpart) of the regions they enclose. Figure 2 shows a partial scene tree for a self-occluding object. Our full paper describes extensions for representing loops (e.g., a shirt both behind and in front of an arm).

To mitigate the cost of creating rich groundtruth, we introduce a web-based annotation tool with a graphical interface for managing the region hierarchy. Our software eliminates tedious tracing of region boundaries through a dynamic paintbrush that snaps to the shape of underlying superpixels in a precomputed oversegmentation. Combined with a touch-up mode, it guides fast creation of pixel-perfect regions.

Figure 2: Our model of a scene or object groups pixels into regions and maps regions to nodes in a tree. Parent-child links denote region containment and semantic object-part relationships. Relative ordering of sibling nodes resolves occlusion ambiguities.
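To make the doubly ordered region tree concrete, the Python sketch below shows one way the representation and the preorder figure-ground readout could be implemented. It is a minimal illustration under stated assumptions, not the paper's software: the names RegionNode, occludes, and figure_ground_order are hypothetical, and regions are assumed to be stored as sets of pixel coordinates.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

# Illustrative sketch (not the authors' code) of the doubly ordered region
# tree: each node's children are stored in occlusion order O, so an earlier
# sibling occludes a later one wherever their regions overlap.

@dataclass(eq=False)  # identity-based equality, so list.index finds the exact node
class RegionNode:
    name: str
    pixels: Set[Tuple[int, int]]            # region R_i as a set of (row, col) pixels
    children: List["RegionNode"] = field(default_factory=list)

    def add_child(self, child: "RegionNode") -> None:
        """Append a child; siblings are added front-to-back (occluder first)."""
        self.children.append(child)


def occludes(parent: RegionNode, a: RegionNode, b: RegionNode) -> bool:
    """Siblings a, b under `parent`: a occludes b iff R_a ∩ R_b ≠ ∅ and O(a) < O(b)."""
    if not (a.pixels & b.pixels):
        return False                         # disjoint siblings carry no occlusion relation
    return parent.children.index(a) < parent.children.index(b)


def figure_ground_order(root: RegionNode) -> List[RegionNode]:
    """Preorder traversal visiting children in occlusion order, yielding a
    global front-to-back (figure-to-ground) listing of the scene's regions."""
    ordering: List[RegionNode] = []

    def visit(node: RegionNode) -> None:
        ordering.append(node)
        for child in node.children:          # children are already sorted by O(·)
            visit(child)

    for top_level in root.children:          # skip the synthetic scene root itself
        visit(top_level)
    return ordering
```

In this sketch, annotating a self-occluding object (e.g., an arm in front of a torso) amounts to adding the arm region before the torso region under their common parent; figure_ground_order then lists the arm ahead of the torso, matching the local occlusion rule applied through the object-part hierarchy.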
Similar Resources
Traffic Scene Analysis using Hierarchical Sparse Topical Coding
Analyzing motion patterns in traffic videos can directly yield high-level descriptions of the video contents. Such descriptions may further be employed in different traffic applications such as traffic phase detection and abnormal event detection. One of the most recent and successful unsupervised methods for complex traffic scene analysis is based on topic models. In this pa...
Video Scene Retrieval Using Online Video Annotation
In this paper, we propose an efficient method for extracting scene tags from online video annotation (e.g., comments about video scenes). To evaluate this method by applying extracted information to video scene retrieval, we have developed a video scene retrieval system based on scene tags (i.e., tags associated with video scenes). We have also developed a tag selection system that enables onli...
Fuzzy Emotional Semantic Analysis and Automated Annotation of Scene Images
With the advances in electronic and imaging techniques, the production of digital images has rapidly increased, and the extraction and automated annotation of emotional semantics implied by images have become issues that must be urgently addressed. To better simulate human subjectivity and ambiguity for understanding scene images, the current study proposes an emotional semantic annotation meth...
Sampling Table Configurations for the Hierarchical Poisson-Dirichlet Process
• Discrete hierarchies are ubiquitous in intelligent systems.
• The Poisson-Dirichlet process (PDP) [1] allows statistical inference and learning on discrete hierarchies, e.g., a hierarchy of Dirichlet distributions.
• Applications of the PDP/HPDP include but are not limited to:
– Topic modeling: finding meaningful topics discussed in a large set of documents. Beneficial to automatic document analysis a...
Deep Learning of Hierarchical Structure
Hierarchical and recursive structure is commonly found in inputs from the richest sensory modalities, including natural language sentences and scene images. But such hierarchical structure has traditionally been a strong point of both structured and supervised models (whether symbolic or probabilistic) and a weak point of both neural networks and unsupervised learning. I will present some of ou...