Learning Extractors from Unlabeled Text using Relevant Databases
Abstract
Supervised machine learning algorithms for information extraction generally require large amounts of training data. In many cases where labeling training data is burdensome, there may, however, already exist an incomplete database relevant to the task at hand. Records from this database can be used to label text strings that express the same information. For tasks where text strings do not follow the same format or layout, and additionally may contain extra information, labeling the strings completely may be problematic. This paper presents a method for training extractors which fill in missing labels of a text sequence that is partially labeled using simple high-precision heuristics. Furthermore, we improve the algorithm by utilizing labeled fields from the database. In experiments with BibTeX records and research paper citation strings, we show a significant improvement in extraction accuracy over a baseline that only relies on the database for training data.
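The partial-labeling step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify which heuristics are used, so the rule below (label a span only when a database field value matches the text exactly, contiguously, and exactly once) is an assumption chosen for illustration, and the function names, the record, and the citation string are all hypothetical. Tokens left with a `None` label are the ones a trained extractor would later have to fill in.

```python
import re

def tokenize(text):
    # Split into word tokens and single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

def partially_label(citation, record):
    """Label citation tokens using a database record (hypothetical heuristic).

    A field label is assigned only when the field value matches a contiguous
    token span exactly (case-insensitive) and that match is unique. All other
    tokens keep the label None, meaning "unknown".
    """
    tokens = tokenize(citation)
    lowered = [t.lower() for t in tokens]
    labels = [None] * len(tokens)
    for field, value in record.items():
        value_tokens = [t.lower() for t in tokenize(value)]
        n = len(value_tokens)
        if n == 0:
            continue
        starts = [i for i in range(len(tokens) - n + 1)
                  if lowered[i:i + n] == value_tokens]
        # Skip ambiguous matches and spans that are already labeled.
        if len(starts) == 1:
            i = starts[0]
            if all(label is None for label in labels[i:i + n]):
                labels[i:i + n] = [field] * n
    return list(zip(tokens, labels))

# Hypothetical BibTeX-like record and citation string, for illustration only.
record = {"author": "J. Smith", "title": "Learning Extractors", "year": "2006"}
citation = "J. Smith. Learning Extractors from Text. In Proc. of KDD, 2006."

for token, label in partially_label(citation, record):
    print(f"{token}\t{label}")
```

In this sketch the database title is shorter than the title in the citation, so "from Text" stays unlabeled. That is the kind of gap the abstract refers to when it says the strings may contain extra information and cannot be labeled completely from the database alone.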
Similar resources
A Multitask Learning Approach to Document Representation using Unlabeled Data
Text categorization is intrinsically a supervised learning task, which aims at relating a given text document to one or more predefined categories. Unfortunately, labeling such databases of documents is a painful task. We present in this paper a method that takes advantage of huge amounts of unlabeled text documents available in digital format, to counterbalance the relatively smaller availabl...
A Model for Extracting Information from Text Documents Based on Text Mining in the E-Learning Domain
As computer networks become the backbones of science and the economy, enormous quantities of documents become available. Text mining techniques have therefore been used to extract useful information from textual data. Text mining has become an important research area that discovers unknown information, facts, or new hypotheses by automatically extracting information from different written documents. T...
Improving Named Entity Extraction Accuracy using Unlabeled Data and Several Extractors (pp. 29-38)
This paper proposes feature augmentation methods using unlabeled data and several Named Entity (NE) extractors. We collect NE-related information of each word (which we call NE-related labels) from unlabeled data by using NE extractors. NE-related labels which we collect include candidate NE class labels of each word and NE class labels of co-occurring words. To accurately collect the NE-relate...
Learning to Rank Biomedical Documents with only Positive and Unlabeled Examples: A Case Study
In the text mining field, obtaining training data requires human experts' labeling efforts, which is often time consuming and expensive. Supervised learning with only a small number of positive examples and a large amount of unlabeled data, which is easy to get, has attracted growing interest in the field. A recently proposed relabeling method, which assumes unlabeled data as negative data for...
Generalized Expectation Criteria for Bootstrapping Extractors using Record-Text Alignment
Traditionally, machine learning approaches for information extraction require human annotated data that can be costly and time-consuming to produce. However, in many cases, there already exists a database (DB) with schema related to the desired output, and records related to the expected input text. We present a conditional random field (CRF) that aligns tokens of a given DB record and its real...