Eliminating Useless Parts in Semi-structured Documents Using Alternation Counts

Authors

  • Daisuke Ikeda
  • Yasuhiro Yamada
  • Sachio Hirokawa
Abstract

We propose a preprocessing method for Web mining that, given semi-structured documents sharing the same structure and style, distinguishes the useless parts from the non-useless parts of each document without any prior knowledge of the documents. It is based on the simple idea that an n-gram is useless if it appears frequently. To choose an appropriate pair of length n and frequency a, we introduce a new statistical measure, the alternation count: the number of alternations between useless parts and non-useless parts. Given news articles written in English or Japanese, mixed with some non-articles, the algorithm eliminates the frequent n-grams used for the structure and style of the articles and extracts the news contents and headlines with more than 97% accuracy when the articles are collected from the same site. Even when the input articles are collected from different sites, the algorithm extracts the contents of articles from these sites with at least 95% accuracy. Thus, the algorithm does not depend on the language, is robust to noise, and is applicable to multiple formats.
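The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how n-grams are tokenized, how frequency is counted, or how the alternation count is used to pick n and a. The sketch assumes character n-grams, counts every occurrence across all input documents, and treats a character as useless when it is covered by at least one n-gram whose count is at least a. All function names are hypothetical.

```python
from collections import Counter


def ngram_counts(docs, n):
    """Count every character n-gram occurrence across all documents (assumed counting scheme)."""
    counts = Counter()
    for doc in docs:
        for i in range(len(doc) - n + 1):
            counts[doc[i:i + n]] += 1
    return counts


def useless_mask(doc, counts, n, a):
    """Mark a character as useless if some n-gram covering it occurs at least `a` times."""
    mask = [False] * len(doc)
    for i in range(len(doc) - n + 1):
        if counts[doc[i:i + n]] >= a:
            for j in range(i, i + n):
                mask[j] = True
    return mask


def alternation_count(mask):
    """Number of switches between useless and non-useless runs along the document."""
    return sum(1 for prev, cur in zip(mask, mask[1:]) if prev != cur)


def extract_content(doc, mask):
    """Keep only the characters that were not marked useless."""
    return "".join(ch for ch, useless in zip(doc, mask) if not useless)


if __name__ == "__main__":
    # Toy documents that share the same markup but differ in content.
    docs = ["<h1>cats</h1>", "<h1>dogs</h1>", "<h1>bird</h1>"]
    n, a = 4, 3
    counts = ngram_counts(docs, n)
    for doc in docs:
        mask = useless_mask(doc, counts, n, a)
        print(extract_content(doc, mask), alternation_count(mask))
```

On these toy inputs the shared markup is marked useless and the differing words survive. Choosing n and a on real pages would require the selection criterion from the full paper, which this sketch does not attempt to reproduce.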


Similar resources

Mining Association Rules from Semi-Structured Data

Despite the growing popularity of semi-structured data such as Web documents, most knowledge discovery research has focused on databases containing well structured data. In this paper, we try to find useful information from semistructured data. In our approach, we begin by representing semi-structured data in a prototype-based approach. We then detect the most typical common structure of semist...


ReQueSS: Relational Querying of Semi-Structured Data

We present a prototype of a Web querying interface which is capable of searching and querying unified Web sources of data that have sufficient hidden relational structure. The system converts query-related parts of Web pages into relational data and provides for SQL-like or QBE-like querying capability. The relational query is parsed for relevant information such as selection conditions and tab...


5 Semi-structured Document Classification

Document classification developed over the last 10 years, using techniques originating from the pattern recognition and machine-learning communities. All these methods operate on flat text representations, where word occurrences are considered independent. The recent paper by Sebastiani (2002) gives a very good survey on textual document classification. With the development of structured textu...


Information Extraction with and without Parsing Semi-structured Documents

Information extraction from semi-structured documents comprises contents detection, wrapper generation and schema extraction. The contents detection step corresponds to making training examples in wrapper induction based on machine learning and the schema extraction identifies extracted data types. We formulate the contents detection using the repetitive pattern introduced in this paper. That i...


On the Midpoint of a Set of XML Documents

The WWW contains a huge number of documents. Some of them share a subject but are generated by different people or even organizations. To guarantee the interchange of such documents, we can use XML, which allows sharing documents that do not have the same structure. However, this makes it difficult to understand the core of such heterogeneous documents (in general, a schema is not available). In t...




Publication date: 2001