Two-Phase Web Site Classification Based on Hidden Markov Tree Models

Authors

  • Yonghong Tian
  • Tiejun Huang
  • Wen Gao
  • Jun Cheng
  • PingBo Kang
Abstract

With the exponential growth in both the amount and the diversity of information on the web, automatic classification of topic-specific web sites is highly desirable. In this paper we propose a novel approach to web site classification based on the content, structure, and context information of web sites. In our approach, the site structure is represented as a two-layered tree in which each page is modeled as a DOM (Document Object Model) tree and a site tree hierarchically links all pages within the site. Two context models are presented to capture the topic dependencies in the site. The Hidden Markov Tree (HMT) model is then used as the statistical model of both the site tree and the DOM tree, and an HMT-based classifier is presented for their classification. Moreover, to reduce the download size of web sites while maintaining high classification accuracy, an entropy-based approach is introduced to dynamically prune the site trees. On this basis, we employ a two-phase classification system that classifies web sites through a fine-to-coarse recursion. Experiments show that our approach offers high accuracy and efficient processing.
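The abstract does not spell out the pruning criterion, so the following is only a minimal sketch of what entropy-based pruning of a site tree could look like, not the authors' method. It assumes each page already carries a per-topic probability estimate and that a page whose Shannon entropy is at or below a chosen threshold is confident enough that its subtree need not be downloaded. The names `SiteNode`, `topic_probs`, `prune`, and `threshold` are hypothetical, and the DOM layer of the two-layered tree is omitted.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SiteNode:
    """One page in a hypothetical site tree (the DOM layer is omitted)."""
    url: str
    topic_probs: Dict[str, float]  # assumed per-page topic distribution
    children: List["SiteNode"] = field(default_factory=list)


def entropy(probs: Dict[str, float]) -> float:
    """Shannon entropy in bits; zero-probability topics contribute nothing."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)


def prune(node: SiteNode, threshold: float) -> SiteNode:
    """Return a copy of the tree that stops descending below confident pages.

    Assumption (not from the paper): a page whose topic entropy is at or
    below `threshold` is treated as decided, so its subtree is dropped.
    """
    if entropy(node.topic_probs) <= threshold:
        return SiteNode(node.url, node.topic_probs, [])
    kept = [prune(child, threshold) for child in node.children]
    return SiteNode(node.url, node.topic_probs, kept)


def count_pages(node: SiteNode) -> int:
    """Number of pages in the (sub)tree rooted at `node`."""
    return 1 + sum(count_pages(child) for child in node.children)
```

A lower threshold keeps more of the site (higher download cost, more evidence), while a higher threshold prunes more aggressively. In a real system the probabilities would come from the classifier itself rather than being supplied by hand.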

Similar resources

A Web Site Mining Algorithm Using the Multiscale Tree Representation Model

Web site mining, which aims at automatically discovering and classifying topic-specific web sites from the World Wide Web, has attracted increasing attention as indicated by the exponential growth of both the amount and the diversity of the web information. This paper describes a novel multiscale approach for web site mining, which represents a web site as a multiscale site tree, extending the ...


Introducing Busy Customer Portfolio Using Hidden Markov Model

Due to the effective role of Markov models in customer relationship management (CRM), there is a lack of comprehensive literature review which contains all related literatures. In this paper the focus is on academic databases to find all the articles that had been published in 2011 and earlier. One hundred articles were identified and reviewed to find direct relevance for applying Markov models...


Improving Phoneme Sequence Recognition using Phoneme Duration Information in DNN-HSMM

Improving phoneme recognition has attracted the attention of many researchers due to its applications in various fields of speech processing. Recent research achievements show that using deep neural network (DNN) in speech recognition systems significantly improves the performance of these systems. There are two phases in DNN-based phoneme recognition systems including training and testing. Mos...


SUPERFAMILY—sophisticated comparative genomics, data mining, visualization and phylogeny

SUPERFAMILY provides structural, functional and evolutionary information for proteins from all completely sequenced genomes, and large sequence collections such as UniProt. Protein domain assignments for over 900 genomes are included in the database, which can be accessed at http://supfam.org/. Hidden Markov models based on Structural Classification of Proteins (SCOP) domain definitions at the ...


Markov Models for Web Access Prediction

The problem of predicting user’s behavior on a Web site has fundamental significance due to the rapid growth of the World Wide Web. Although traditional Markov models have been found to be suited for addressing this problem, they have serious limitations. Thus, good predictions require new Markov models. Hybrid-order tree-like Markov models predict Web access precisely while providing high cove...


Publication date: 2003