Web Page Categorization using Multilayer Perceptron with Reduced Features

Author

  • Kavitha S
Abstract

The web is a huge repository of knowledge connected by numerous hyperlinks. It also serves a broad diversity of user communities and global information service centers. The amount of knowledge in web pages grows rapidly every day, and web pages are used to convey that knowledge to web users. The voluminous size of the web complicates web information retrieval, web content filtering, and web structure mining. Hence, proper categorization of web pages is essential. This paper formulates web page categorization as a multi-class classification task and provides a suitable solution using a supervised learning technique, namely the multilayer perceptron. The classification model is generated by learning from features extracted from the HTML structure and the URL of the web page. Feature reduction techniques are applied to select optimum features, and a model is learned from them. The multilayer perceptron models before and after feature reduction are evaluated experimentally, and the model with reduced features is observed to perform well.
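As a rough illustration of the pipeline the abstract describes, the sketch below extracts a few features from the HTML structure and URL of a page and then drops features that never vary. The abstract does not list the actual features or name the feature reduction techniques, so the feature names, the `extract_features` and `drop_constant_features` helpers, the example pages, and the constant-feature filter are all assumptions chosen for illustration. In a full system the reduced vectors would then be passed to a multilayer perceptron classifier.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class TagCounter(HTMLParser):
    """Counts opening tags and the total length of text nodes."""

    def __init__(self):
        super().__init__()
        self.tag_counts = {}
        self.text_length = 0

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] = self.tag_counts.get(tag, 0) + 1

    def handle_data(self, data):
        self.text_length += len(data.strip())


def extract_features(url, html):
    """Hypothetical feature set: a few HTML-structure and URL statistics."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    path_segments = [s for s in urlparse(url).path.split("/") if s]
    return {
        "num_links": parser.tag_counts.get("a", 0),
        "num_images": parser.tag_counts.get("img", 0),
        "num_headings": sum(parser.tag_counts.get(f"h{i}", 0) for i in range(1, 7)),
        "text_length": parser.text_length,
        "url_depth": len(path_segments),
    }


def drop_constant_features(rows):
    """Stand-in for feature reduction: keep only features that vary across pages."""
    keep = [name for name in rows[0] if len({row[name] for row in rows}) > 1]
    return [{name: row[name] for name in keep} for row in rows]


# Made-up example pages (not from the paper's data set).
pages = [
    ("http://example.com/news/sports/index.html",
     "<html><body><h1>Scores</h1><a href='/a'>A</a><a href='/b'>B</a>"
     "<img src='x.png'></body></html>"),
    ("http://example.com/shop",
     "<html><body><h1>Shop</h1><p>Buy now</p></body></html>"),
]

rows = [extract_features(url, html) for url, html in pages]
reduced = drop_constant_features(rows)
```

Here `num_headings` takes the same value on both pages, so the filter removes it. A real feature reduction step would more likely rank features by their relevance to the class label instead of by variance alone.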

Similar resources

Web Documents Categorization Using Neural Networks

This paper shows, through experimental results, that artificial neural networks are good classifiers for the text categorization task. The paper compares the results of text categorization experiments using Multilayer Perceptron, Self-organizing Maps, the C4.5 decision tree and PART decision rules. The experiments were carried out with the K1 collection of web documents.

Progress Report: Predicting Which Recommended Content Users Click

Machine learning models can be used to predict which recommended content users will click on a given website. The given dataset contains millions of samples that map some feature about an ad or web page to a number. We reduced this dataset to a more manageable size to minimize computation time, and we extracted features based on this reduced set. The features we extracted are based on the adverti...

Subject Categorization for Web Educational Resources using MLP

The purpose of this study is to develop subject categorization methods for educational resources using a multilayer perceptron (MLP) and to examine their performance on test documents as an application system. Two methods are examined: the Latent Semantic Indexing (LSI) method and a three-layer feedforward network as a simple MLP. The document vectors were estimated by th...

Keywords, k-NN and Neural Networks: a Support for Hierarchical Categorization of Texts in Brazilian Portuguese

A frequent problem in automatic categorization applications involving the Portuguese language is the absence of large corpora of previously classified documents, which would permit the validation of experiments. Generally, the available corpora are not classified or, when they are, they contain very few documents. The general goal of this study is to contribute to the developm...

Accurate and efficient general-purpose boilerplate detection for crawled web corpora

Removal of boilerplate is one of the essential tasks in web corpus construction and web indexing. Boilerplate (redundant and automatically inserted material like menus, copyright notices, navigational elements, etc.) is usually considered to be linguistically unattractive for inclusion in a web corpus. Also, search engines should not index such material because it can lead to spurious results f...

Publication date: 2013