Web Categorisation Using Distance-Based Decision Trees
Abstract
In Web classification, web pages are assigned to pre-defined categories mainly according to their content (content mining). However, the structure of the web site may provide extra information about a page's category (structure mining). Traditionally, the two approaches have been applied separately, or have been handled with techniques that do not generate a model, such as Bayesian techniques. In some classification contexts, however, a comprehensible model is crucial, so it would be interesting to apply rule-based techniques (rule learning, decision tree learning) to web categorisation. In this paper we propose an integrated view of web mining that covers both content and structure, using a distance-based decision tree learning algorithm. This algorithm differs from traditional ones in that the splitting criterion is defined by means of metric conditions (“is nearer than”). This change allows decision trees to handle structured attributes (lists, graphs, sets, etc.) along with the well-known nominal and numerical attributes. In general, these structured attributes are used to represent the content and the structure of the web site.
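The idea of a metric splitting condition can be illustrated with a minimal sketch. It is not the algorithm from the paper. It assumes that each page is represented by a set of words, that Jaccard distance is the metric, and that Gini impurity is used to score candidate splits. All function names and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def jaccard_distance(a: frozenset, b: frozenset) -> float:
    """Jaccard distance between two sets (0 = identical, 1 = disjoint)."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def nearer_than(x, p, q, distance):
    """Metric condition: is x at least as near to prototype p as to prototype q?"""
    return distance(x, p) <= distance(x, q)


def best_metric_split(examples, labels, distance):
    """Try every pair of examples as prototypes (p, q) and keep the pair whose
    'nearer to p than to q' test gives the lowest weighted Gini impurity."""
    best = None
    for i, j in combinations(range(len(examples)), 2):
        p, q = examples[i], examples[j]
        left = [y for x, y in zip(examples, labels) if nearer_than(x, p, q, distance)]
        right = [y for x, y in zip(examples, labels) if not nearer_than(x, p, q, distance)]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[0]:
            best = (score, p, q)
    return best


# Toy data (invented for illustration): each page is a set of words.
pages = [
    frozenset({"football", "league", "goal"}),
    frozenset({"goal", "match", "league"}),
    frozenset({"election", "vote", "party"}),
    frozenset({"party", "vote", "poll"}),
]
labels = ["sport", "sport", "politics", "politics"]

score, p, q = best_metric_split(pages, labels, jaccard_distance)

# Route an unseen page through the learned split.
new_page = frozenset({"league", "match"})
goes_left = nearer_than(new_page, p, q, jaccard_distance)
```

Because the split depends only on a distance function, the same code would work for lists or graphs if `jaccard_distance` were replaced by a suitable metric such as an edit distance. The exhaustive search over prototype pairs is quadratic and is meant for illustration only.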
Similar resources
Various Approaches to Web Information Processing
The paper focuses on the field of automatic extraction of information from texts and text document categorisation including pre-processing of text documents, which can be found on the Internet. In the frame of the presented work, we have devoted our attention to the following issues related to text categorisation: increasing the precision of categorisation algorithm results with the aid of a bo...
A procedure for Web Service Selection Using WS-Policy Semantic Matching
In general, policy-based approaches play an important role in the management of web services, for instance in the choice of semantic web services and, in particular, quality of service (QoS). The present research work illustrates a procedure for selecting among functionally similar web services based on WS-Policy semantic matching. In this study, the procedure of WS-Policy publi...
Web Corpus Cleaning using Content and Structure
This paper describes experiments on cleaning web corpora. While previously described approaches focus mainly on the visual representation of web pages, we evaluate approaches that rely on content and structure. We have evaluated a heuristics-based approach, as well as approaches based on decision trees, a genetic algorithm and language models. The best performance was achieved using the heurist...
Towards Integrating Decision Tree with Xml Technologies
The paper proposes a method for efficiently storing collections of multi-purpose decision trees within a native distributed XML database. The predictive information for building the XML decision trees is gathered through Web mining techniques and methodologies. In order to share data from heterogeneous sources, the model employs semantic Web languages to describe and represent data sources. The u...
Pinktoe: putting S trees on the web
Tree based methods in S or R are extremely useful and popular. For simple trees and memorable variables it is easy to predict the outcome for a new case using only a standard decision tree diagram. However, for large trees or trees where the variable description is complex the decision tree diagram is often not enough. This article describes pinktoe: a tool that converts S tree objects into web...
Journal title:
- Electr. Notes Theor. Comput. Sci.
Volume 157, Issue
Pages -
Publication date 2005