Exploring a Few Good Tuples From a Text Database

Authors

  • Alpa Jain
  • Divesh Srivastava
Abstract

Information extraction from text databases is a useful paradigm to populate relational tables and unlock the considerable value hidden in plain-text documents. However, information extraction can be expensive, due to various complex text processing steps necessary in uncovering the hidden data. There are a large number of text databases available, and not every text database is necessarily relevant to every relation. Hence, it is important to be able to quickly explore the utility of running an extractor for a specific relation over a given text database before carrying out the expensive extraction task. In this paper, we present a novel exploration methodology of finding a few good tuples for a relation that can be extracted from a database which allows for judging the relevance of the database for the relation. Specifically, we propose the notion of a good(k, ℓ) query as one that can return any k tuples for a relation among the top-ℓ fraction of tuples ranked by their aggregated confidence scores, provided by the extractor; if these tuples have high scores, the database can be determined as relevant to the relation. We formalize the access model for information extraction, and investigate efficient query processing algorithms for good(k, ℓ) queries, which do not rely on any prior knowledge about the extraction task or the database. We demonstrate the viability of our algorithms using a detailed experimental study with real text databases.
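To make the query semantics concrete, the following is a minimal sketch of what a good(k, ℓ) query is asked to return, written as a naive reference evaluator. It is not the paper's algorithm: it assumes the extractor has already been run over the whole database, which is exactly the cost the paper aims to avoid. The function name `good_k_ell`, the input format of (tuple, confidence) pairs, and the choice of `sum` as the aggregation function are all assumptions made for illustration.

```python
from collections import defaultdict
from math import ceil


def good_k_ell(occurrences, k, ell, aggregate=sum):
    """Naive reference semantics for a good(k, ell) query (illustrative only).

    occurrences: iterable of (tuple_value, confidence) pairs, one pair per
                 extracted occurrence of a tuple (hypothetical input format).
    k:           number of tuples requested.
    ell:         fraction in (0, 1] defining the "top" part of the ranking.
    aggregate:   how per-occurrence confidences are combined into one score
                 per tuple; `sum` is an assumption, not taken from the paper.
    """
    if not 0 < ell <= 1:
        raise ValueError("ell must be a fraction in (0, 1]")

    # Collect every confidence score observed for each distinct tuple.
    scores = defaultdict(list)
    for tup, conf in occurrences:
        scores[tup].append(conf)

    # Rank distinct tuples by aggregated confidence, best first.
    ranked = sorted(scores, key=lambda t: aggregate(scores[t]), reverse=True)

    # Keep the top-ell fraction of the ranking.
    top = ranked[:ceil(ell * len(ranked))]

    # Any k tuples from `top` are a valid answer; this sketch returns the
    # first k, and fewer if the top fraction holds fewer than k tuples.
    return top[:k]


if __name__ == "__main__":
    occ = [
        (("a", "x"), 0.9), (("a", "x"), 0.8),
        (("b", "y"), 0.5), (("c", "z"), 0.4), (("d", "w"), 0.1),
    ]
    print(good_k_ell(occ, k=2, ell=0.5))
```

Because any k tuples from the top-ℓ fraction are acceptable, a practical algorithm is free to stop as soon as it can certify k such tuples, without producing the full ranking that this sketch computes.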

Similar resources

Text database systems

A text database system, often called an information retrieval system, is designed to process a text model of the data, viewed as an ordered sequence of documents, paragraphs, sentences, words (i.e., as a list structure). Although relations are sets of tuples, and therefore unordered, the relational model can still be used successfully for text, but surprisingly it is shown that at the physical ...

INDUCING VALUABLE RULES FROM IMBALANCED DATA: THE CASE OF AN IRANIAN BANK EXPORT LOANS

Using Semantic Similarity To Acquire Cooccurrence Restrictions From Corpora

We describe a method for acquiring semantic cooccurrence restrictions for tuples of syntactically related words (e.g. verb-object pairs) from text corpora automatically. This method uses the notion of semantic similarity to assign a sense from a dictionary database (e.g. WordNet) to ambiguous words occurring in a syntactic dependency. Semantic similarity is also used to merge disambiguated word...

Keyword Search in Relational Databases: A Survey

The integration of DB and IR provides flexible ways for users to query information in the same platform [6, 2, 3, 7, 5, 28]. On one hand, the sophisticated DB facilities provided by RDBMSs assist users to query well-structured information using SQL. On the other hand, IR techniques allow users to search unstructured information using keywords based on scoring and ranking, and do not need users ...

Publication date: 2008