Information Extraction across Linguistic Barriers
Author
Abstract
Information extraction (IE) systems have been tailored to extract fixed target information from documents in a fixed language. To be truly useful to information analysts, the target information must be user-definable and the source documents should cover multiple languages. We map out the path toward such open-target multilingual IE systems and identify the technological breakthroughs needed along it. We also discuss a Japanese-English named entity extraction system under development, which represents the next step along the path.
Introduction: Toward Multilingual Information Extraction Systems
The natural language processing field has witnessed rapid development of information extraction (IE) technology since the early 1990s, driven by the series of Message Understanding Conferences (MUCs) and the government-sponsored TIPSTER program. [1] This technology enables rapid, robust, and automatic extraction of certain predefined target information from real-world on-line texts or speech transcripts accessible through computer networks. Information analysts, whose task is to keep track of changing states of affairs on particular topics such as microelectronic products and international terrorist activities, can use IE technology to accomplish their tasks more efficiently and effectively.
IE systems, however, have so far been tailored to extract fixed target information from documents in a fixed language. For IE technology to be truly useful to information analysts, the target information must be user-definable, or 'open,' and the system should also obtain information from documents in multiple languages.
To appear in the Working Notes for the Workshop on Cross-Language Text and Speech Retrieval, AAAI Spring Symposium Series, Stanford, CA, 1997.
[1] The TIPSTER web page is at http://www.tipster.org
In this paper, we map out the path toward such open-target multilingual IE systems and identify the technological breakthroughs needed along it. We also discuss a Japanese-English named entity extraction system under development, which represents an IE system that lies in the immediate future along the path.
Information Extraction Technology
Given a set of source documents, the input to an IE system is a description of the target information type, and the output is a set of target information instances found in the source documents. The target information, typically of the form "who did what to whom where when," is extracted from natural language sentences or formatted tables and fills slots in predefined template data structures. Partially filled template data objects about the same entity or event instance are then merged to create a network of related data objects. These template data objects, which depict instances of the target information, are the raw output of information extraction, ready for a wide range of applications such as database updating or summary generation. IE output can also take the form of SGML or other types of markup on the source documents, which need not go through template data objects.
IE systems, however, have so far been tailored to extract fixed target information from documents in a fixed language, through short but intense customization work by IE system experts. We can characterize these first-generation IE systems as systems for monolingual information extraction with closed target information. See Figure 1.
Analysis of MUC evaluation results has led to a clearer understanding of the strengths and weaknesses of the technology. The MUC-6 evaluation results (Sundheim 1995) showed that name recognition is largely a solved problem, with performance comparable to that of humans on the same task (higher F-measure around 95%).
This IE subtask is thus ready for real-world applications. Extracting template entities (higher F-measure around 75%) and recognition ...
From: AAAI Technical Report SS-97-05. Compilation copyright © 1997, AAAI (www.aaai.org). All rights reserved.
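The template filling and merging step described under Information Extraction Technology can be illustrated with a minimal sketch. It is not the authors' system: the `Template` class, the `merge_templates` function, the slot names, and the sample values are all hypothetical, and the sketch assumes that some earlier step has already decided which partial templates refer to the same entity or event by assigning them a shared `key`.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Template:
    """A partially filled template: an entity/event key plus slot values."""
    key: str  # hypothetical identifier, assumed to be assigned upstream
    slots: Dict[str, Optional[str]] = field(default_factory=dict)


def merge_templates(partials: List[Template]) -> Dict[str, Template]:
    """Merge partial templates sharing a key; the first non-empty value wins."""
    merged: Dict[str, Template] = {}
    for partial in partials:
        # Create an empty template for this key the first time it is seen.
        target = merged.setdefault(partial.key, Template(partial.key))
        for slot, value in partial.slots.items():
            # Fill a slot only if it is still empty in the merged template.
            if value is not None and target.slots.get(slot) is None:
                target.slots[slot] = value
    return merged


# Made-up partial templates, as if extracted from separate sentences.
partials = [
    Template("event-1", {"who": "Company A", "what": "acquired", "whom": None}),
    Template("event-1", {"whom": "Company B", "when": "1996"}),
    Template("event-2", {"who": "Company C", "what": "announced"}),
]
merged = merge_templates(partials)
print(merged["event-1"].slots)
```

Here the two "event-1" fragments combine into one template with the who, what, whom, and when slots filled, while "event-2" stays separate. Deciding that two fragments describe the same instance is the hard part in practice, and this sketch simply assumes it away.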
Similar resources
Automatic Discovery of Linguistic Patterns for Information Extraction
Information Extraction (IE) systems typically rely on extraction patterns encoding domain-specific knowledge. When matched against natural language texts, these patterns recognize with high accuracy information relevant to the extraction task. Adapting an IE system to a new extraction scenario entails devising a new collection of extraction patterns, a time-consuming and expensive process. To ov...
A Universal Feature Schema for Rich Morphological Annotation and Fine-Grained Cross-Lingual Part-of-Speech Tagging
Semantically detailed and typologically-informed morphological analysis that is broadly applicable cross-linguistically has the potential to improve many NLP applications, including machine translation, n-gram language models, information extraction, and co-reference resolution. In this paper, we present a universal morphological feature schema, which is a set of features that represent the fin...
Inhibitive Factors on the Development of Critical Thinking in University Libraries: Students' Attitudes in Bushehr University of Medical Sciences and Persian Gulf University
Introduction: Critical thinking training is one of the most important missions of educational institutes. Hence, academic libraries as an inseparable operational unit of higher education must help their users to benefit from information resources and facilities providing them with the chance to develop and improve critical thinking capability. This study aimed to investigate significant barrier...
Open Information Extraction
Open Information Extraction (Open IE) systems aim to obtain relation tuples through highly scalable extraction that is portable across domains, by identifying a variety of relation phrases and their arguments in arbitrary sentences. The first generation of Open IE learns linear chain models based on unlexicalized features such as Part-of-Speech (POS) or shallow tags to label the intermediate words betwee...
Linguistic Processing of Texts Using Geppetto
We describe the linguistic analyzer of a prototype for Information Extraction from texts. The analyzer uses information derived from a shallow processor to limit the computational cost of the analysis. At the same time, shallow techniques are used to collapse parse fragments when a complete parse is not possible. The linguistic analyzer has been built using GePpeTto, an environment that allows...
Textual Enhancement across Linguistic Structures: EFL Learners' Acquisition of English Forms
Studies of the benefits of textual input enhancement in the acquisition of linguistic forms have produced mixed results in the SLA literature. The present study investigates the effects of textual enhancement on adult foreign language intake of two English linguistic forms (subjunctive mood and inversion structures) to explore the role of the type of linguistic items in input enhancement studies. It also invest...