Inducing Constraint-Based Grammars using a Domain Ontology
Author
Abstract
In many knowledge-intensive applications, there is a critical need to populate knowledge bases rapidly and to keep them up to date. Since the World Wide Web is a large source of information that is continuously being updated, one solution is to acquire knowledge automatically from text, which requires language understanding to a greater or lesser degree. The need for "rapid" text-to-knowledge acquisition imposes two critical conditions on the methods used: scalability and adaptability. Thus, there is a need to move from handcrafted grammars and hand-built systems to learning methods. However, most statistical and learning techniques have been applied only to restricted domains (e.g., the air travel domain) or tasks (e.g., information extraction, where the knowledge is limited to a priori relations and entities), reducing the variety of the acquired knowledge. This thesis presents a framework for domain-specific text-to-knowledge acquisition, with a focus on the medical domain. The main challenge of this domain is the abundance of linguistic phenomena that require both syntactic and semantic information in order to "understand" the meaning of the text, and thus to acquire knowledge. Examples include prepositional phrases, coordinations, noun-noun compounds, and nominalizations, phenomena that are not well covered by existing syntactic or semantic parsers. In my thesis, I propose a relational learning framework for the induction of a constraint-based grammar able to capture both syntax and aspects of meaning in an interleaved manner from a small number of semantically annotated examples. The novelty of this framework is the learning method based on an ordered set of examples. This approach to learning follows the argument that language acquisition is an incremental process, in which simpler rules are acquired prior to complex ones.
Several new theoretical concepts need to be tied together in order to make the approach feasible and theoretically sound: 1) a type of constraint-based grammar, called lexicalized well-founded grammar, which is learnable and able to capture large fragments of natural language; 2) a semantic representation, which we call the semantic molecule, that can be linked to the grammar and is simple enough to allow relational learning of the grammar; 3) a small ordered set of semantically annotated examples, called representative examples, which is used as our training data; and 4) an ontology-based semantic interpretation encoded as a constraint at the grammar-rule level (Φ_onto), which refrains from full logical analysis of meaning, known to be intractable. On the application side, the grammar learning is used for rapid acquisition of medical terminological knowledge from text.

Semantic molecule. Given a natural language expression w, we denote by w′ = (h ⋈ b) the semantic molecule of w, where h is the head, acting as a valence for semantic composition, and b is the body, acting as the semantic representation of w. The head is represented as a one-level feature structure (i.e., feature values are atomic), while the body is a Canonical Logical Form given as a flat semantic representation, similar to Minimal Recursion Semantics (MRS). Unlike MRS, it uses as semantic primitives a set of frame-based atomic predicates of the form concept.attr = concept, suitable for interpretation on the ontology: concept corresponds to a frame in the ontology, and attr is a slot of the frame, encoding either a property or a relation. For example, for the adjective "chronic" we have the following semantic molecule:

[cat = a, head = X, mod = Y] ⋈ [X.isa = chronic, Y.Has_prop = X]
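The head/body pairing described above can be sketched as a small data structure: a flat feature structure for the head and a list of frame-based atomic predicates for the body. This is an illustrative sketch, not code from the thesis; the class and field names are my own.

```python
from dataclasses import dataclass

@dataclass
class SemanticMolecule:
    """Hypothetical sketch of a semantic molecule w' = (h >< b).

    head: one-level feature structure (feature values are atomic)
    body: flat list of frame-based predicates concept.attr = concept
    """
    head: dict
    body: list  # list of (concept, attr, value) triples

    def __str__(self):
        h = ", ".join(f"{k}={v}" for k, v in self.head.items())
        b = ", ".join(f"{c}.{a}={v}" for c, a, v in self.body)
        return f"[{h}] >< [{b}]"

# The semantic molecule for the adjective "chronic" from the text:
# head: category a (adjective), head variable X, modified variable Y
# body: X.isa = chronic, Y.Has_prop = X
chronic = SemanticMolecule(
    head={"cat": "a", "head": "X", "mod": "Y"},
    body=[("X", "isa", "chronic"), ("Y", "Has_prop", "X")],
)
print(chronic)
# → [cat=a, head=X, mod=Y] >< [X.isa=chronic, Y.Has_prop=X]
```

Keeping the body flat (a conjunction of atomic predicates rather than a nested logical form) is what makes the representation simple enough for relational learning while still supporting interpretation against the ontology's frames and slots.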
Similar resources
Inducing Constraint-based Grammars from a Small Semantic Treebank
We present a relational learning framework for grammar induction that is able to learn meaning as well as syntax. We introduce a type of constraint-based grammar, lexicalized well-founded grammar (lwfg), and we prove that it can always be learned from a small set of semantically annotated examples, given a set of assumptions. The semantic representation chosen allows us to learn the constraints...
Query Architecture Expansion in Web Using Fuzzy Multi Domain Ontology
Due to the growth of the Web, there are many challenges in establishing a general framework for mining and retrieving structured data from the Web. Creating an ontology is a step towards solving this problem. The ontology raises the main entity and the concept of any data in data mining. In this paper, we tried to propose a method for applying the "meaning" of the search system, but the problem ...
Lexicalized Well-Founded Grammars: Learnability and Merging
This paper presents the theoretical foundation of a new type of constraint-based grammars, Lexicalized Well-Founded Grammars, which are adequate for modeling human language and are learnable. These features make the grammars suitable for developing robust and scalable natural language understanding systems. Our grammars capture both syntax and semantics and have two types of constraints at the ...
Generating LTAG grammars from a lexicon/ontology interface
This paper shows how domain-specific grammars can be automatically generated from a declarative model of the lexicon-ontology interface and how those grammars can be used for question answering. We show a specific implementation of the approach using Lexicalized Tree Adjoining Grammars. The main characteristic of the generated elementary trees is that they constitute domains of locality that sp...
A Logic Programming Based Approach to QA@CLEF05 Track
In this paper the methodology followed to build a question-answering system for the Portuguese language is described. The system modules are built using computational linguistic tools such as: a Portuguese parser based on constraint grammars for the syntactic analysis of the document sentences and the user questions; a semantic interpreter that rewrites the sentences' syntactic analysis into discour...