Two-Phase Web Site Classification Based on Hidden Markov Tree Models
نویسندگان
چکیده
With the exponential growth of both the amount and diversity of the information that the web encompasses, automatic classification of topic-specific web sites is highly desirable. In this paper we propose a novel approach for web site classification based on the content, structure and context information of web sites. In our approach, the site structure is represented as a twolayered tree in which each page is modeled as a DOM (Document Object Model) tree and a site tree is used to hierarchically link all pages within the site. Two context models are presented to capture the topic dependences in the site. Then the Hidden Markov Tree (HMT) model is utilized as the statistical model of the site tree and the DOM tree, and an HMT-based classifier is presented for their classification. Moreover, for reducing the download size of web sites but still keeping high classification accuracy, an entropy-based approach is introduced to dynamically prune the site trees. On these bases, we employ the two-phase classification system for classifying web sites through a fine-to-coarse recursion. The experiments show our approach is able to offer high accuracy and efficient process performance.
منابع مشابه
A Web Site Mining Algorithm Using the Multiscale Tree Representation Model
Web site mining, which aims at automatically discovering and classifying topic-specific web sites from the World Wide Web, has attracted increasing attention as indicated by the exponential growth of both the amount and the diversity of the web information. This paper describes a novel multiscale approach for web site mining, which represents a web site as a multiscale site tree, extending the ...
متن کاملIntroducing Busy Customer Portfolio Using Hidden Markov Model
Due to the effective role of Markov models in customer relationship management (CRM), there is a lack of comprehensive literature review which contains all related literatures. In this paper the focus is on academic databases to find all the articles that had been published in 2011 and earlier. One hundred articles were identified and reviewed to find direct relevance for applying Markov models...
متن کاملImproving Phoneme Sequence Recognition using Phoneme Duration Information in DNN-HSMM
Improving phoneme recognition has attracted the attention of many researchers due to its applications in various fields of speech processing. Recent research achievements show that using deep neural network (DNN) in speech recognition systems significantly improves the performance of these systems. There are two phases in DNN-based phoneme recognition systems including training and testing. Mos...
متن کاملSUPERFAMILY—sophisticated comparative genomics, data mining, visualization and phylogeny
SUPERFAMILY provides structural, functional and evolutionary information for proteins from all completely sequenced genomes, and large sequence collections such as UniProt. Protein domain assignments for over 900 genomes are included in the database, which can be accessed at http://supfam.org/. Hidden Markov models based on Structural Classification of Proteins (SCOP) domain definitions at the ...
متن کاملMarcov Models for Web Access Prediction
The problem of predicting user’s behavior on a Web site has fundamental significance due to the rapid growth of the World Wide Web. Although traditional Markov models have been found to be suited for addressing this problem, they have serious limitations. Thus, good predictions require new Markov models. Hybrid-order tree-like Markov models predict Web access precisely while providing high cove...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2003