Web Categorisation Using Distance-Based Decision Trees

Authors

  • Vicent Estruch
  • César Ferri
  • José Hernández-Orallo
  • M. José Ramírez-Quintana
Abstract

In Web classification, web pages are assigned to pre-defined categories mainly according to their content (content mining). However, the structure of the web site might provide extra information about their category (structure mining). Traditionally, the two approaches have either been applied separately or been handled with techniques that do not generate a comprehensible model, such as Bayesian techniques. Unfortunately, in some classification contexts a comprehensible model is crucial. It would therefore be interesting to apply rule-based techniques (rule learning, decision tree learning) to web categorisation. In this paper we propose an integrated view of web mining that covers both content and structure, using a distance-based decision tree learning algorithm. This algorithm differs from traditional ones in that its splitting criterion is defined by metric conditions (“is nearer than”). This change allows decision trees to handle structured attributes (lists, graphs, sets, etc.) along with the familiar nominal and numerical attributes. In general, these structured attributes are used to represent the content and the structure of the web site.
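To make the “is nearer than” splitting condition concrete, the following is a minimal sketch and not the authors' algorithm. It assumes page content is represented as a set of words, uses the Jaccard distance as one possible metric on sets, and picks the two prototypes by hand; the names `jaccard_distance` and `nearer_than_split` and the toy data are hypothetical. The abstract does not say how prototypes or metrics are chosen.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def jaccard_distance(a: frozenset, b: frozenset) -> float:
    """Jaccard distance between two sets; defined as 0.0 when both are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def nearer_than_split(
    values: Sequence[T],
    proto_left: T,
    proto_right: T,
    distance: Callable[[T, T], float],
) -> Tuple[List[int], List[int]]:
    """Partition example indices by the metric condition
    "value is nearer to proto_left than to proto_right" (ties go left)."""
    left: List[int] = []
    right: List[int] = []
    for i, value in enumerate(values):
        if distance(value, proto_left) <= distance(value, proto_right):
            left.append(i)
        else:
            right.append(i)
    return left, right


# Hypothetical toy data: the content of each page as a set of words.
pages = [
    frozenset({"football", "league", "goal"}),
    frozenset({"goal", "match", "league"}),
    frozenset({"stocks", "market", "shares"}),
    frozenset({"market", "bank", "shares"}),
]

# Assumption: the first and third pages are used as the two prototypes.
left, right = nearer_than_split(pages, pages[0], pages[2], jaccard_distance)
print(left, right)
```

Because the split depends only on a distance function, the same routine would work for lists or graphs once a suitable metric for that data type is supplied, which is the property the abstract attributes to metric splitting conditions.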

Similar resources

Various Approaches to Web Information Processing

The paper focuses on the field of automatic extraction of information from texts and text document categorisation including pre-processing of text documents, which can be found on the Internet. In the frame of the presented work, we have devoted our attention to the following issues related to text categorisation: increasing the precision of categorisation algorithm results with the aid of a bo...

A procedure for Web Service Selection Using WS-Policy Semantic Matching

In general, policy-based approaches play an important role in the management of web services, for instance in the choice of semantic web services and, in particular, quality of service (QoS). The present research work illustrates a procedure for web service selection among functionally similar web services based on WS-Policy semantic matching. In this study, the procedure of WS-Policy publi...

Web Corpus Cleaning using Content and Structure

This paper describes experiments on cleaning web corpora. While previously described approaches focus mainly on the visual representation of web pages, we evaluate approaches that rely on content and structure. We have evaluated a heuristics-based approach, as well as approaches based on decision trees, a genetic algorithm and language models. The best performance was achieved using the heurist...

Towards Integrating Decision Tree with Xml Technologies

The paper proposes a method for efficiently storing collections of multi-purpose decision trees within a native distributed XML database. The predictive information for building the XML decision trees is gathered through Web mining techniques and methodologies. In order to share data from heterogeneous sources, the model employs semantic Web languages to describe and represent data sources. The u...

Pinktoe: putting S trees on the web

Tree based methods in S or R are extremely useful and popular. For simple trees and memorable variables it is easy to predict the outcome for a new case using only a standard decision tree diagram. However, for large trees or trees where the variable description is complex the decision tree diagram is often not enough. This article describes pinktoe: a tool that converts S tree objects into web...

Journal title:
  • Electr. Notes Theor. Comput. Sci.

Volume 157

Publication date: 2005