Character-level Analysis of Semi-Structured Documents for Set Expansion

Authors

  • Richard C. Wang
  • William W. Cohen
Abstract

Set expansion refers to expanding a partial set of “seed” objects into a more complete set. One system that does set expansion is SEAL (Set Expander for Any Language), which expands entities automatically by utilizing resources from the Web in a language-independent fashion. In this paper, we illustrate in detail the construction of character-level wrappers for set expansion as implemented in SEAL. We also evaluate several kinds of wrappers for set expansion and show that character-based wrappers perform better than HTML-based wrappers. In addition, we demonstrate a technique that extends SEAL to learn binary relational concepts (e.g., “x is the mayor of the city y”) from only two seeds. We also show that the extended SEAL performs well on our evaluation datasets, which include English and Chinese, thus demonstrating language independence.
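To make the idea of a character-level wrapper concrete, the following is a minimal Python sketch, not SEAL's actual implementation. It assumes a wrapper is simply a pair of strings: the longest left context and the longest right context shared by the first occurrence of each seed in a single document. The names `learn_wrapper` and `extract`, and the sample page, are hypothetical and chosen only for illustration. Because the contexts are compared character by character, they need not align with HTML tag boundaries.

```python
from os.path import commonprefix  # character-wise longest common prefix


def learn_wrapper(doc, seeds):
    """Return (left, right): the longest contexts shared by all seeds.

    Simplifying assumption: only the first occurrence of each seed is used.
    """
    lefts, rights = [], []
    for seed in seeds:
        pos = doc.find(seed)
        if pos == -1:
            raise ValueError(f"seed not found in document: {seed!r}")
        # Reverse the left side so a common suffix becomes a common prefix.
        lefts.append(doc[:pos][::-1])
        rights.append(doc[pos + len(seed):])
    left = commonprefix(lefts)[::-1]
    right = commonprefix(rights)
    return left, right


def extract(doc, left, right):
    """Return every string bracketed by the left and right contexts."""
    if not left or not right:
        return []  # an empty context would match everywhere
    found = []
    start = doc.find(left)
    while start != -1:
        begin = start + len(left)
        end = doc.find(right, begin)
        if end == -1:
            break
        found.append(doc[begin:end])
        # Restart just past this match so overlapping contexts are still seen.
        start = doc.find(left, start + 1)
    return found


# Hypothetical page fragment used only for this illustration.
doc = (
    "<ul><li><b>Ford</b> cars</li><li><b>Toyota</b> cars</li>"
    "<li><b>Nissan</b> cars</li><li><b>Honda</b> cars</li></ul>"
)
left, right = learn_wrapper(doc, ["Ford", "Honda"])
print(repr(left), repr(right))   # '><li><b>' '</b> cars</li><'
print(extract(doc, left, right))  # ['Ford', 'Toyota', 'Nissan', 'Honda']
```

In this example the learned left context begins with the closing `>` of the preceding tag and the right context ends with a bare `<`, so neither is a whole HTML element. That is the kind of boundary a purely tag-based wrapper could not express.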


Similar resources

Semi-structured document image matching and recognition

This article presents a method to recognize and to localize semi-structured documents such as ID cards, tickets, invoices, etc. Object recognition methods based on interest points work well on natural images but fail on document images because of repetitive patterns like text. In this article, we propose an adaptation of object recognition for image documents. The advantages of our method is th...

OLERA: On-Line Extraction Rule Analysis for Semi-structured Documents

The vast amount of online information available has led to renewed interest in information extraction (IE) systems that analyze input documents to produce a structured representation of selected information from the documents. Information extraction from semi-structured documents has been studied extensively in recent years. Most research focuses on supervised learning approaches where targets must be ...

Designing coaching character pattern for managers in gas industry

To survive in a competitive environment, today’s organizations require more modern approaches to maintaining and developing their human resources, the most important capital of the organization. Therefore, the purpose of this study is to design a coaching character pattern for managers in the gas industry. The research method in terms of data nature is qualitative and in terms of purpose fund...

An Algorithm for Constrained Association Rule Mining in Semi-structured Data

The need for sophisticated analysis of textual documents is becoming more apparent as data is being placed on the Web and digital libraries are surfacing. This paper presents an algorithm for generating constrained association rules from textual documents. The user specifies a set of constraints, concepts and/or structured values. Our algorithm creates matrices and lists based on these prespecifi...

A Conceptual Model for Multidimensional Analysis of Documents

Data warehousing and OLAP are mainly used for the analysis of transactional data. Nowadays, with the evolution of the Internet and the development of semi-structured data exchange formats (such as XML), it is possible to consider entire fragments of data such as documents as analysis sources. As a consequence, an adapted multidimensional analysis framework needs to be provided. In this paper, we in...

Publication date: 2009