Knowledge-Based Spatial Reasoning for Scene Generation from Text Descriptions
Abstract
This system translates basic English descriptions of a wide range of objects in a simplistic zoo environment into plausible, three-dimensional, interactive visualizations of their positions, orientations, and dimensions. It combines a semantic network and a contextually sensitive knowledge base as representations for explicit and implicit spatial knowledge, respectively. Its linguistic aspects address underspecification, vagueness, uncertainty, and context with respect to intrinsic, extrinsic, and deictic frames of spatial reference. The underlying commonsense reasoning formalism is probability-based geometric fields that are solved through constraint satisfaction. The architecture serves as an extensible test-and-evaluation framework for a multitude of linguistic and artificial-intelligence investigations.

Introduction and Background

A simple description like a large dog is in front of a cat and near a small tree explicitly specifies only a tiny fraction of the details that a corresponding image contains. Most of the content comes from an implicit, commonsense, contextual understanding of the words. Such spatial reasoning, like most intelligent processes, is a difficult computational task to emulate despite its apparent, intuitive simplicity for humans (Herskovits 1986, Tversky 2000). What makes the problem especially troublesome is that computers lack our intangible knowledge of the world and our powerful abilities to reason intelligently over it. This work addresses the primary aspects of these issues in terms of what to represent and how to represent it. It uses a simple representation of a description in conjunction with a knowledge base of relevant spatial details to define the declarative form of a valid solution. A constraint satisfaction algorithm then generates any number of corresponding interpretations with plausible positions, orientations, and dimensions for the objects.
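The generate-and-test constraint satisfaction described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `satisfies` predicate is a hypothetical stand-in for the field-based tests the paper uses, and the discrete grid of candidate positions is an assumption for brevity.

```python
import itertools

def satisfies(relation, pos_a, pos_b):
    """Toy stand-in for the paper's field-based tests; here 'north-of'
    simply means a strictly larger y coordinate."""
    if relation == "north-of":
        return pos_a[1] > pos_b[1]
    return True

def place(objects, constraints, candidates, placed=None):
    """Greedy backtracking: assign each object, in order, a candidate
    position that satisfies every constraint against the objects
    already placed; fall through (backtrack) when none fits."""
    placed = placed or {}
    if len(placed) == len(objects):
        return placed
    obj = objects[len(placed)]
    for pos in candidates:
        trial = {**placed, obj: pos}
        ok = all(
            satisfies(rel, trial[a], trial[b])
            for a, rel, b in constraints
            if a in trial and b in trial
        )
        if ok:
            result = place(objects, constraints, candidates, trial)
            if result:
                return result
    return None  # signals the caller to try its next candidate

grid = list(itertools.product(range(5), range(5)))
solution = place(["tree", "Loki"], [("tree", "north-of", "Loki")], grid)
```

Any returned assignment places the tree strictly north of Loki; rerunning with a shuffled candidate list would yield alternative interpretations, which matches the paper's claim of generating any number of solutions.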
Four knowledge-based spatial issues are the focus: underspecification, or the lack of complete details in a description, requires background knowledge to supply implicit information; vagueness, or the imprecise nature of descriptions, requires knowledge that defines a range of plausible interpretations; uncertainty, or the lack of commitment to a particular interpretation, requires knowledge of preferences over this range; and context, or the different interpretation of objects in certain combinations, requires knowledge to identify and interpret such patterns. These issues are considered for three frames of spatial reference (Olivier and Tsujii 1994). The intrinsic (object-centered) frame generally applies to objects that have an accepted front, like dog. The extrinsic (environment-centered) and deictic (viewer-centered) frames are generally the opposite case, for objects without such a front, like tree. They correspond to the viewer's position being explicitly stated or loosely implied, respectively.

Copyright © 2008, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Knowledge Representation

A description consists of nouns, adjectives, prepositions, and various support words. The nouns refer primarily to animals and plants within a zoo scenario because they exhibit a variety of interesting and visually appealing spatial characteristics. The adjectives play a role in the contextually appropriate determination of size. The prepositions are 58 spatial relations for position (e.g., in front, left, north, between), distance (e.g., inside, near, far), and orientation (e.g., facing toward, away from, north).

Explicit Representation

The explicit knowledge in a description is represented with a semantic network of object nodes, attribute nodes, and directed relation arcs, which map closely to nouns, adjectives, and prepositions, respectively.
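The node-and-arc structure can be sketched as a small data model. This is a minimal sketch under assumed names (`Node`, `Relation`, the relation labels), not the paper's actual data structures; it encodes the Loki example that the paper illustrates in Figure 1a.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """An object node (noun) with attached attribute nodes (adjectives)."""
    name: str                 # instance name, e.g. "Loki"
    concept: str              # concept used to look up knowledge-base rules
    attributes: list = field(default_factory=list)  # e.g. ["small"]

@dataclass
class Relation:
    """A directed relation arc (preposition) between two object nodes."""
    source: str
    label: str                # e.g. "north-of", "facing"
    target: str

# Semantic network for: "Loki is a small retriever; the tree is
# north of Loki; Loki is facing the tree."
nodes = {
    "Loki": Node("Loki", "retriever", ["small"]),
    "tree": Node("tree", "tree"),
}
arcs = [
    Relation("tree", "north-of", "Loki"),
    Relation("Loki", "facing", "tree"),
]
```

The `concept` field is the hook into the implicit knowledge described next: it links each node to the inheritance hierarchy that supplies its default spatial rules.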
For example, Figure 1a depicts the semantic network for Loki is a small retriever; the tree is north of Loki; Loki is facing the tree.

Implicit Representation

To understand the meaning of the description even superficially requires deeper analysis into what the objects are and how their spatial rules apply to them (Davis 1990). This implicit, commonsense background knowledge is represented in a knowledge base that is similar to an inheritance hierarchy in object-oriented programming. It currently contains 108 concepts that either inherit their contents from their ancestors or define/override them. A simplified example appears in Figure 1b. Linking the semantic network to the knowledge base provides the objects with the appropriate rules for interpreting their position and orientation relations and dimension attributes.

Spatial Relations

Each spatial relation is associated with one or more circular, two-dimensional fields of 100 rings and 32 sectors that have two complementary parts (Yamada 1993; Gapp 1994; Olivier and Tsujii 1994; Freksa 1992). The geometry specifies where another object can and cannot appear with respect to the object in the center. Most relations use variants of the wedge and ring fields in Figures 2a-b. The topography overlays a probability distribution on the geometry to specify preferences in placement, as Figures 2c-d show. Fields may also be combined with the standard logical operators and, or, xor, and not to represent compositional linguistic expressions like in front of and far from.

Spatial Reasoning

The intelligent, commonsense aspects of the spatial reasoning are performed earlier, when the contextually appropriate, qualitative constraints are established. Generating a solution from them is then a straightforward, mechanical process of quantitative constraint satisfaction that uses a greedy, backtracking strategy to generate and test positions and orientations for every pair of objects in a relationship.
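The field geometry and topography can be sketched as weighted cells over the quoted 100-ring, 32-sector grid. The linear fade-out topography and the pointwise-minimum "and" below are illustrative assumptions, not the paper's actual distributions or operator semantics.

```python
RINGS, SECTORS = 100, 32   # field resolution quoted in the paper

def wedge(center_sector, half_width):
    """Wedge field: full weight at a heading, fading linearly to zero
    at the edges of the angular span (a toy topography)."""
    weights = {}
    for r in range(RINGS):
        for s in range(SECTORS):
            # angular distance in sectors, wrapping around the circle
            d = min(abs(s - center_sector), SECTORS - abs(s - center_sector))
            weights[(r, s)] = max(0.0, 1.0 - d / half_width)
    return weights

def ring_band(inner, outer):
    """Ring field: uniform weight within a radial band (e.g. 'near')."""
    return {(r, s): 1.0 if inner <= r < outer else 0.0
            for r in range(RINGS) for s in range(SECTORS)}

def conjoin(f, g):
    """'and' two fields by taking the pointwise minimum weight."""
    return {cell: min(f[cell], g[cell]) for cell in f}

# "in front of and near": a wedge toward sector 0 combined with a band
in_front = wedge(center_sector=0, half_width=4)
near = ring_band(inner=0, outer=20)
combined = conjoin(in_front, near)
```

Cells with nonzero weight in `combined` are the placements the geometry permits, and the weights rank them; sampling proportionally to weight would realize the preference aspect of the topography.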
Interactive Visualization

The graphical output is a three-dimensional, interactive world in which the viewer can move to any vantage point and perspective. It is also possible to query the objects on their underlying representations, constraints, and so on. Various display modes depict supporting details like the geometry and topography of the fields, as well as alternative solutions. Figure 3 renders the dog is south of the tree and near the panther; the panther is to the right of the dog; and the elk is near the maple tree and midrange from and facing away from the pond.

References

Davis, E. 1990. Representations of Commonsense Knowledge. San Mateo, CA: Morgan Kaufmann.

Freksa, C. 1992. Using Orientation Information for Qualitative Spatial Reasoning. In Frank, A.; Campari, I.; and Formentini, U., eds., Theories and Methods of Spatio-Temporal Reasoning in Geographic Space, LNCS 639. Berlin: Springer-Verlag.

Gapp, K. 1994. Basic Meanings of Spatial Relations: Computation and Evaluation in 3D Space. In Proceedings of AAAI-94, 1393-1398. Seattle, WA.

Herskovits, A. 1986. Language and Spatial Cognition: An Interdisciplinary Study of the Prepositions in English. Cambridge: Cambridge University Press.

Olivier, P., and Tsujii, J. 1994. A Computational View of the Cognitive Semantics of Spatial Prepositions. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics (ACL-94), Las Cruces, NM.

Tversky, B. 2000. Levels and Structure of Spatial Knowledge. In Kitchin, R., and Freundschuh, S., eds., Cognitive Mapping: Past, Present and Future. London and New York: Routledge.

Yamada, A. 1993. Studies on Spatial Description Understanding Based on Geometric Constraints Satisfaction. Ph.D. diss., University of Kyoto.

Figure 1: Semantic Network and Knowledge Base (the knowledge-base fragment shows the inheritance chain DOG, CANINE, ANIMAL, LIVINGTHING, THING)
Figure 2: Geometry and Topography of Wedge and Ring
Figure 3: Sample Visualization