Web Page Genre Classification: Impact of n-Gram Lengths
Abstract
Web pages can be distinguished by both topic and genre. Genre information can help modern search engines focus on the user's information need. In this paper, web pages are represented using character n-grams. This representation is language independent, allows features to be extracted automatically from a web page, and can be used efficiently to classify the page by genre. A Support Vector Machine (SVM) is used as the classification model, and experiments were carried out on the 7-Genre corpus while varying the n-gram length. Performance, measured by F-measure, improves as the n-gram length increases from 3 to 5 and degrades as the length is increased further.
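As a rough illustration of the character n-gram representation described in the abstract, the following minimal sketch counts overlapping character n-grams and prints the most frequent ones for lengths 3 to 5. The function names `char_ngrams` and `ngram_profile`, the `top_k` parameter, and the sample text are assumptions made for this example; the paper's actual preprocessing, feature weighting, and SVM setup are not given in the abstract.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count every overlapping character n-gram of length n in text."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # A text shorter than n yields an empty range, hence an empty Counter.
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def ngram_profile(text: str, n: int, top_k: int = 5) -> list[tuple[str, float]]:
    """Return the top_k most frequent n-grams with their relative frequencies."""
    counts = char_ngrams(text, n)
    total = sum(counts.values())
    if total == 0:
        return []
    return [(gram, count / total) for gram, count in counts.most_common(top_k)]


if __name__ == "__main__":
    # Hypothetical page text; a real pipeline would first extract text from HTML.
    page_text = "frequently asked questions about web page genres"
    for n in range(3, 6):  # n-gram lengths 3, 4 and 5
        print(n, ngram_profile(page_text, n, top_k=3))
```

Frequency vectors of this kind could then be passed to any SVM implementation; which n-gram lengths work best would have to be determined experimentally, as the abstract reports for the 7-Genre corpus.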
Similar Resources
An n-gram Based Approach to the Classification of Web Pages by Genre
The extraordinary growth in both the size and popularity of the World Wide Web has created a growing interest not only in identifying Web page genres, but also in using these genres to classify Web pages. The hypothesis of this research is that an n-gram representation of a Web page can be used effectively to automatically classify that Web page by genre. This research involves the development ...
Classifying Web Pages by Genre - A Distance Function Approach
The research reported in this paper is part of a larger project on the automatic classification of Web pages by their genres, using a distance function classification model. In this paper, we investigate the effect of several commonly used data preprocessing steps, explore the use of byte and word n-grams, and test our classification model on three Web page data sets. Our approach is to represe...
A Combination based on OWA Operators for Multi-label Genre Classification of web pages
This paper presents a new method for genre identification that combines homogeneous classifiers using OWA (Ordered Weighted Averaging) operators. Our method uses character n-grams extracted from different information sources such as URL, title, headings and anchors. To deal with the complexity of web pages, we applied MLKNN as a multi-label classifier, in which a web page can be affected by mor...
Don't Use a Lot When Little Will Do: Genre Identification Using URLs
The ever-increasing volume of data on the World Wide Web calls for the use of vertical search engines. Sandhan is one such search engine that offers search in tourism and health genres in more than 10 different Indian languages. In this work we build a URL based genre identification module for Sandhan. A direct impact of this work is on building focused crawlers to gather Indian language content. We conduct...