Towards Structure-sensitive Hypertext Categorization
Abstract
Hypertext categorization is the task of automatically assigning category labels to hypertext units. Like text categorization, it falls within the area of function learning based on the bag-of-features approach. This scenario faces the problem of a many-to-many relation between websites and their hidden logical document structure. The paper argues that this relation is a prevalent characteristic which interferes with any effort to apply the classical apparatus of categorization to web genres. This is confirmed by a threefold experiment in hypertext categorization. To outline a solution to this problem, the paper sketches an alternative method of unsupervised learning which aims at bridging the gap between statistical and structural pattern recognition (Bunke et al. 2001) in the area of web mining.
Similar resources
Towards Logical Hypertext Structure: A Graph-Theoretic Perspective
Given the retrieval problem posed by the overwhelming number of documents online, the adaptation of text categorization to web units has recently been pushed. The aim is to utilize categories of web sites and pages as an additional retrieval criterion. In this context, the bag-of-words model has been utilized, as have HTML tags and link structures. In spite of promising results this adaptation s...
Refined and Incremental Centroid-based approach for Genre Categorization of Web pages
In this paper, I propose a refined and incremental centroid-based approach for genre categorization of web pages. My approach is based on the construction of genre centroids using a set of training web pages. These centroids will be used to classify new web pages. The originality of my approach is the implementation of two new aspects, which are refining and incrementing. My approach is based o...
Classification Techniques for Categorization of Hypertext Documents
In this paper we investigate techniques for categorization of hypertext documents. Recent years have witnessed a growing interest in applying text categorization techniques to the Web. However, the semi-structured nature of the Web along with diverse subject matter present in it pose interesting challenges for conventional classification techniques. In this paper, we review some of the techniqu...
Hypertext Classification Using Tensor Space Model and Rough Set Based Ensemble Classifier
As the WWW grows at an increasing speed, a classifier targeted at hypertext is in high demand. While document categorization is quite mature, the issue of utilizing hypertext structure and hyperlinks has been relatively unexplored. In this paper, we introduce a tensor space model for representing hypertext documents. We exploit the local-structure and neighborhood recommendation encapsulate...
DHCS: A Case of Knowledge Share in Cooperative Computing Environment
Large-scale hypertext categorization has become one of the key techniques in web-based information acquisition. How to implement efficient hypertext categorization is still an ongoing research issue. This paper introduces the Distributed Hypertext Categorization System (DHCS), in which the Directed Acyclic Graph Support Vector Machines (DAGSVM) for learning multiclass hypertext classifiers is i...
Publication date: 2005