Enabling Domain-Awareness for a Generic Natural Language Interface
Abstract
In this paper, we present a learning-based approach for enabling domain-awareness for a generic natural language interface. Our approach automatically acquires domain knowledge from user interactions and incorporates the knowledge learned to improve the generic system. We have embedded our approach in a generic natural language interface and evaluated the extended system against two benchmark datasets. We found that the performance of the original generic system can be substantially improved through automatic domain knowledge extraction and incorporation. We also show that the generic system with domain-awareness enabled by our approach can achieve performance similar to that of previous learning-based domain-specific systems.
Introduction
This work is intended to achieve the promise shown by NaLIX (Natural Language Interface to XML), a generic natural language interface for an XML database (Li 2005; 2006a). NaLIX can accept an arbitrary English language sentence as a query input. This query, which can include complex query semantics such as aggregation, nesting, and value joins, is then translated into an XQuery expression. The core system has limited linguistic capabilities. Whenever the system cannot properly translate a given query, it sends meaningful feedback to the user and asks her to reformulate the query into one that the system can understand. A previous study (Li 2006a) has shown that such a system is already usable in practice: users can issue complex database queries in plain English and obtain precise results back.
While generic systems like NaLIX are widely praised for their portability (they can be universally deployed without expensive efforts for building domain knowledge bases), their portability also comes at a price. For instance, in NaLIX, reformulations are often required for queries containing domain-specific terms; additionally, terms with important domain semantics may simply be ignored by the generic system, resulting in loss of accuracy.
In this paper, we describe our approach to enable domain-awareness for such a generic natural language interface to improve its translation accuracy and reduce the need for reformulation without losing its portability. Whereas much research has been done on the use of learning to build a domain-specific natural language interface for databases (NLIDB), we believe ours is the first attempt to extend a generic NLIDB by automatically extracting domain knowledge from user interaction without explicitly burdening the users. Below, we briefly summarize related research, and then describe a motivational example, followed by our approach for extracting domain knowledge from user interactions and incorporating domain knowledge into a generic system. Finally, we report on experiments that evaluate our approach against two benchmark datasets.
∗ Supported in part by NSF grant IIS 0438909 and NIH grants R01 LM008106 and U54 DA021519.
† Supported in part by the NSF grant CCF-0432027.
Copyright © 2007, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Related Work
Mapping a given sentence into a structured data query or a general logical form is a central problem in designing natural language interfaces to databases (NLIDBs). Extensive research has been done on NLIDBs. A comprehensive summary of earlier efforts is provided by Androutsopoulos (1995). Recent research in this area has focused on leveraging advances in parsing techniques to design generic systems with easy portability (Popescu 2003; Li 2006a). Our approach can easily be adapted by such systems to enable domain-awareness. Previous learning-based NLIDBs (Androutsopoulos 1995), including recent efforts by Tang & Mooney (2001) and Zettlemoyer & Collins (2005), study how to learn the mapping from sentences to logical forms. Expensive training is required to learn such mapping rules, making it difficult to adapt these systems to a new domain.
In contrast, the system with domain-awareness enabled by our approach retains the portability of the original generic system and thus requires no domain knowledge to be used in a new domain. Furthermore, previous systems require expensive manual creation of training data and demand expertise in both a specific domain and a formal query language. By comparison, building a knowledge base using our approach is much easier: no expertise in any formal query language is required, and query pairs can easily be obtained as training data from actual user interaction with the system. Finally, unlike previous systems, where new training data typically cannot be exploited without re-training of the entire system, our approach allows incremental accumulation of the knowledge base.
With the rise of the Semantic Web, growing attention has been paid to domain-aware systems, with the focus on the learning of ontology (Maedche & Staab 2001; Xu 2002; Gómez-Pérez & Manzano-Macho 2004; Buitelaar 2005). We view these learning approaches and our method as complementary to each other: while a generic system can take advantage of such learning methods to improve domain-awareness through the use of ontology-based term expansion, our learning approach can provide valuable training data for these methods.
Table 1: Sample Types of Tokens and Markers
Type | Semantic Contribution | Description
Command Token (CMT) | Return Clause | Top main verb or wh-phrase (Quirk 1985) of parse tree, from an enum set of words and phrases
Operator Token (OT) | Operator | A phrase from an enum set of preposition phrases
Value Token (VT) | Value | A noun or noun phrase in quotation marks, a proper noun or noun phrase, or a number
Name Token (NT) | Basic Variable | A non-VT noun or noun phrase
Connection Marker (CM) | Connect two related tokens | A preposition from an enumerated set, or non-token main verb
Modifier Marker (MM) | Distinguish two NTs | An adjective as determiner, or a numeral as predeterminer or postdeterminer
Approach
Our approach to enable domain-awareness is implemented by extending the original NaLIX system with automatic domain knowledge extraction and exploitation based on a generic domain knowledge representation.
Background and Motivation
NaLIX consists of three main components for query translation: (a) given the dependency parse tree of a given English sentence, the classifier classifies words/phrases that can be mapped into XQuery components as tokens and those that cannot as markers (sample types are listed in Table 1); (b) the validator then determines whether the parse tree is one that the system knows how to map into XQuery; if it is, (c) the translator translates the parse tree into an XQuery expression; else, (d) the message generator sends feedback to the user and asks for reformulation. For a more detailed description of NaLIX, the reader is referred to (Li 2006a).
NaLIX is purely generic and thus may not be able to correctly interpret terms with domain-specific semantics. Consider a geographic database with information about states and the rivers that run through their borders. Given the query “What are the states that share a watershed with California?”, the system cannot properly interpret “watershed,” as it is neither a generic term nor a term that can be found in the database. The system then sends feedback to the user and asks the user to rephrase the query.¹ The user might then respond by submitting “What are the states where a river of each state is the same as a river of California?”, which contains no domain-specific term. Such an iteration results in a pair of queries with equivalent semantics, where the first requires domain knowledge and the second does not, providing a valuable opportunity for learning new domain knowledge.
Figure 1: Sample Iteration for Domain Knowledge Learning. (a) Invalid Query with a Domain-Specific Term; (b) Valid Query after Reformulation.
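The reject-and-reformulate iteration described above, and the query pair it leaves behind, can be sketched roughly as follows. This is only an illustration under strong assumptions: the word sets, `unknown_terms`, and `Session` are hypothetical stand-ins, a flat vocabulary check replaces NaLIX's dependency parsing, classification, and validation, and keeping the first rejected query as the left half of the pair is a choice made here, not something the paper specifies.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Toy vocabularies (assumptions): words a generic system might understand
# and words that occur in the example geographic database.
GENERIC_TERMS = {"what", "are", "the", "that", "share", "a", "with",
                 "where", "of", "each", "is", "same", "as"}
DATABASE_TERMS = {"states", "state", "river", "rivers", "california"}


def unknown_terms(query: str) -> List[str]:
    """Return words that are neither generic nor found in the database."""
    words = [w.strip("?.,\"'").lower() for w in query.split()]
    return [w for w in words
            if w and w not in GENERIC_TERMS and w not in DATABASE_TERMS]


@dataclass
class Session:
    """Tracks one user's interaction and collects (invalid, valid) pairs."""
    rejected: Optional[str] = None
    pairs: List[Tuple[str, str]] = field(default_factory=list)

    def submit(self, query: str) -> str:
        unknown = unknown_terms(query)
        if unknown:
            # Remember the first rejected query of this iteration.
            if self.rejected is None:
                self.rejected = query
            return "Please rephrase; cannot interpret: " + ", ".join(unknown)
        if self.rejected is not None:
            # A valid reformulation closes the iteration and yields a pair.
            self.pairs.append((self.rejected, query))
            self.rejected = None
        return "accepted for translation"
```

Submitting the "watershed" query and then its reformulation to one `Session` would leave a single pair in `pairs`, which is the kind of training example the approach relies on.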
Such pairs of queries can be obtained automatically as a side effect of query reformulation by users. The goal of our approach is to take advantage of such query pairs and allow the system to learn from user actions while they are doing their normal work and thus become more responsive to users over time. To enable domain-awareness for NaLIX, we added the knowledge extractor and knowledge base to extract domain knowledge from user interactions and store the knowledge respectively. We also added the domain adapter to determine and apply applicable domain knowledge on a classified parse tree before sending it to the validator. The resulting system is Domain-aware NaLIX (DaNaLIX).
Knowledge Representation
The knowledge base must capture useful domain-specific semantics that can be used to improve query translation in a generic NLIDB. The model for knowledge representation needs to be generic enough to be able to capture such domain knowledge for any given domain. It should also be able to exploit what the generic NLIDB can already do: mapping domain-independent knowledge to query semantics. To do so, we choose a simple term mapping form, which expresses domain-specific knowledge in generic terms, over complex semantic logical forms such as lambda-calculus (Barendregt 1984). Specifically, we represent domain knowledge as a set of rules that can be used to transform the parse tree of a sentence that contains terms with domain-specific semantics into one that does not. The validator and translator ...
¹ Details of such feedback in NaLIX can be found in (Li 2006a).
Algorithm 1: ApplyRule(Node n, Rule r)
// check whether the left-hand side of the rule matches the tree rooted at n
if treeMatches(r.leftSide(), n) then
    // all matching conditions are
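As a rough illustration of the rule-based representation described above, the sketch below rewrites a toy tree with one left-to-right mapping rule. It is not a reconstruction of Algorithm 1: the `Node` and `Rule` classes, the `$`-prefixed variable convention, exact-label matching, and the helper names `tree_matches`, `instantiate`, and `apply_rule` are all assumptions introduced for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)


@dataclass
class Rule:
    left: Node   # pattern containing a domain-specific term
    right: Node  # equivalent pattern expressed in generic terms


def tree_matches(pattern: Node, node: Node, bindings: Dict[str, Node]) -> bool:
    """Match pattern against node; '$name' labels bind whole subtrees."""
    if pattern.label.startswith("$"):
        bindings[pattern.label] = node
        return True
    if pattern.label != node.label:
        return False
    if len(pattern.children) != len(node.children):
        return False
    return all(tree_matches(p, c, bindings)
               for p, c in zip(pattern.children, node.children))


def instantiate(template: Node, bindings: Dict[str, Node]) -> Node:
    """Copy the template, substituting bound subtrees for variables."""
    if template.label.startswith("$"):
        return bindings[template.label]
    return Node(template.label,
                [instantiate(c, bindings) for c in template.children])


def apply_rule(node: Node, rule: Rule) -> Node:
    """Rewrite the topmost matching subtrees; copy everything else."""
    bindings: Dict[str, Node] = {}
    if tree_matches(rule.left, node, bindings):
        return instantiate(rule.right, bindings)
    return Node(node.label, [apply_rule(c, rule) for c in node.children])
```

Under these assumptions, a rule whose left side is `share(watershed, $place)` and whose right side is a generic "same river" pattern would turn the tree for the "watershed" query into one without the domain-specific term, leaving non-matching trees unchanged.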
Similar resources
An Approach to Team Programming with Markup for Operator Interaction (Demonstration)
This paper presents a team plan specification language that combines work in the creation of generic team plans and design of intelligent interfaces. Two key motivations for developing the language are (1) to combine inter-agent cooperation and operator interaction of complex behaviors into a single plan, and (2) to separate plan design and UI design such that they are created by application do...
An approach to team programming with markup for operator interaction
This paper presents a team plan specification language that combines work in the creation of generic team plans and design of intelligent interfaces. Two motivations for the language are (1) to combine inter-agent cooperation and operator interaction of complex behaviors into a single plan, and (2) to separate plan design and UI design such that they are created by application domain experts and hu...
Constructing a Generic Natural Language Interface for an XML Database
We describe the construction of a generic natural language query interface to an XML database. Our interface can accept an arbitrary English sentence as a query, which can be quite complex and include aggregation, nesting, and value joins, among other things. This query is translated, potentially after reformulation, into an XQuery expression. The translation is based on mapping grammatical pro...
Enabling Context-aware Services for Mobile Users
We present an architecture that offers a set of generic services for building context-aware mobile applications. We present two realized prototypes that recognize user’s context, learn his routines, use context information to constrain service discovery as well as to find relevant information, and manage connectivity on the basis of the given policy. Further, the user interfaces and application...
McDonnell Douglas Electronic Systems company: description of the TEXUS system as used for MUC-4
Unlike most natural language processing (NLP) systems, TexUS (Text Understanding System) is being developed as a domain-independent shell, to facilitate the application of language analysis to a variety of tasks and domains of discourse. Our work not only develops robust and generic language analysis capabilities, but also elaborates knowledge representations, knowledge engineering methods, a...