On "deep" knowledge extraction from documents

Authors

  • Udo Hahn
  • Martin Romacker
Abstract

SYNDIKATE comprises a family of natural language understanding systems for automatically acquiring knowledge from real-world texts (e.g., information technology test reports, medical finding reports), and for transferring their content to formal representation structures which constitute a corresponding text knowledge base. We present a general system architecture which integrates requirements from the analysis of single sentences, as well as those of referentially linked sentences forming cohesive texts. Properly accounting for text cohesion phenomena is a prerequisite for the soundness and validity of the generated text representation structures. It is also crucial for any information system application making use of automatically generated text knowledge bases in a reliable way.


Similar resources

Extraction of Informative Expressions from Domain-specific Documents

What kinds of lexical resources are helpful for extracting useful information from domain-specific documents? Although domain-specific documents contain much useful knowledge, it is not obvious how to extract such knowledge efficiently from the documents. We need to develop techniques for extracting hidden information from such domain-specific documents. These techniques do not necessarily use ...


Feasibility Study for Procedural Knowledge Extraction in Biomedical Documents

We propose how to extract procedural knowledge rather than declarative knowledge utilizing machine learning method with deep language processing features in scientific documents, as well as how to model it. We show the representation of procedural knowledge in PubMed abstracts and provide experiments that are quite promising in that it shows 82%, 63%, 73%, and 70% performances of purpose/soluti...


Knowledge Extraction from Web Documents Using Self-Organizing Neural Networks

Knowledge discovery is defined as non-trivial extraction of implicit, previously unknown and potentially useful information from given data [1]. Knowledge extraction from web documents deals with unstructured, free-format documents whose number is enormous and rapidly growing.


A Framework for Extracting Biological Relations from Different Resources

The World Wide Web provides a vast source of information of almost all types. Biological data specifically have increased dramatically in the past years because of the exponential growth of knowledge in biological domain. It is very difficult to search for the required data in unstructured documents. Text documents often hide valuable structured data. This data can be exploited if available as ...


Sampling, information extraction and summarisation of Hidden Web databases

Hidden Web databases maintain a collection of specialised documents, which are dynamically generated in response to users’ queries. The majority of these documents are generated through Web page templates, which contain information that is often irrelevant to queries. In this paper, we present a system designed to detect and extract query-related information from documents sampled from database...


Automatic Extraction of Knowledge from Web Documents

A large amount of digital information available is written as text documents in the form of web pages, reports, papers, emails, etc. Extracting the knowledge of interest from such documents from multiple sources in a timely fashion is therefore crucial. This paper provides an update on the Artequakt system which uses natural language tools to automatically extract knowledge about artists from m...




Publication date: 2000