An n-gram Based Approach to the Classification of Web Pages by Genre
Abstract
The extraordinary growth in both the size and popularity of the World Wide Web has created a growing interest not only in identifying Web page genres, but also in using these genres to classify Web pages. The hypothesis of this research is that an n-gram representation of a Web page can be used effectively to automatically classify that Web page by genre. This research involves the development and testing of a new model for the automatic identification of Web page genre; classification results using this model compare very favorably with those of other researchers.
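The idea of representing a page by its n-grams and assigning it to the closest genre can be illustrated with a minimal sketch. The abstract does not specify the n-gram type, the profile size, or the distance measure, so everything below is an assumption made for illustration: character trigrams, a profile of the 50 most frequent n-grams, an L1 distance between normalised frequencies, and the hypothetical helper names `char_ngrams`, `profile`, `distance`, and `classify`. It is not the model developed in this research.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count the overlapping character n-grams in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def profile(text, n=3, top_k=50):
    """Map the top_k most frequent n-grams to their relative frequencies.

    top_k=50 is an arbitrary illustrative choice, not a value from the source.
    """
    counts = char_ngrams(text, n)
    total = sum(counts.values())
    if total == 0:
        return {}  # text shorter than n: no n-grams to count
    return {gram: c / total for gram, c in counts.most_common(top_k)}


def distance(p, q):
    """L1 distance between two profiles over the union of their n-grams.

    The L1 measure is an assumption; the source does not name a distance.
    """
    return sum(abs(p.get(g, 0.0) - q.get(g, 0.0)) for g in set(p) | set(q))


def classify(page_text, genre_profiles, n=3, top_k=50):
    """Return the genre whose profile is closest to the page's profile."""
    page = profile(page_text, n, top_k)
    return min(genre_profiles, key=lambda genre: distance(page, genre_profiles[genre]))


# Toy, made-up training text per genre (hypothetical data, for illustration only).
toy_texts = {
    "faq": "Q: How do I reset my password? A: Click the reset link. Q: How do I log in?",
    "shop": "Add to cart. Price: $10. Buy now. Add to cart. Free shipping on all orders.",
}
toy_profiles = {genre: profile(text) for genre, text in toy_texts.items()}

if __name__ == "__main__":
    print(classify("Q: How do I change my email? A: Open the settings page.", toy_profiles))
```

In practice a genre profile would be built from many labelled pages rather than one string, and choices such as n, the profile size, and the distance function would be tuned on held-out data.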
Similar resources
Web Page Genre Classification: Impact of n-Gram Lengths
Web pages can be discriminated by both topic and genre. Web page genres can help modern search engines focus on the user's information need. In this paper, web pages are represented using character n-grams. Character n-gram representation is language independent and allows automatic extraction of features from a web page. Character n-gram representation of a web pa...
Classifying Web Pages by Genre - A Distance Function Approach
The research reported in this paper is part of a larger project on the automatic classification of Web pages by their genres, using a distance function classification model. In this paper, we investigate the effect of several commonly used data preprocessing steps, explore the use of byte and word n-grams, and test our classification model on three Web page data sets. Our approach is to represe...
A Combination based on OWA Operators for Multi-label Genre Classification of web pages / Una combinación basada en operadores OWA para la Clasificación de Género Multi-etiqueta de páginas web
This paper presents a new method for genre identification that combines homogeneous classifiers using OWA (Ordered Weighted Averaging) operators. Our method uses character n-grams extracted from different information sources such as URL, title, headings and anchors. To deal with the complexity of web pages, we applied MLKNN as a multi-label classifier, in which a web page can be affected by mor...
Semi-supervised Graph-based Genre Classification for Web Pages
It is still unclear which set of features produces the best result in automatic genre classification on the web. Therefore, in the first set of experiments, we compared a wide range of content-based features which are extracted from the data appearing within the web pages. The results show that lexical features such as word unigrams and character n-grams have more discriminative power...