On the Automated Classification of Web Sites
نویسنده
چکیده
In this paper we discuss several issues related to automated text classification of web sites. We analyze the nature of web content and metadata in relation to requirements for text features. We find that HTML metatags are a good source of text features, but are not in wide use despite their role in search engine rankings. We present an approach for targeted spidering including metadata extraction and opportunistic crawling of specific semantic hyperlinks. We describe a system for automatically classifying web sites into industry categories and present performance results based on different combinations of text features and training data. This system can serve as the basis for a generalized framework for automated metadata creation.
منابع مشابه
On Social Network Web Sites: Definition, Features, Architectures and Analysis Tools
Development and usage of online social networking web sites are growing rapidly. Millions members of these web sites publicly articulate mutual "friendship" relations and share user-created contents, such as photos, videos, files, and blogs. The advances in web designing technology and fast growing usage of online resources prompted web designers to improve features and architectures of social ...
متن کاملOn Social Network Web Sites: Definition, Features, Architectures and Analysis Tools
Development and usage of online social networking web sites are growing rapidly. Millions members of these web sites publicly articulate mutual "friendship" relations and share user-created contents, such as photos, videos, files, and blogs. The advances in web designing technology and fast growing usage of online resources prompted web designers to improve features and architectures of social ...
متن کاملPractical Issues for Automated Categorization of Web Sites
In this paper we discuss several issues related to automated text classification of web sites. We analyze the nature of web content and metadata and requirements for text features. We present an approach for targeted spidering including metadata extraction and opportunistic crawling of specific semantic hyperlinks. We describe a system for automatically classifying web sites into industry categ...
متن کاملImage flip CAPTCHA
The massive and automated access to Web resources through robots has made it essential for Web service providers to make some conclusion about whether the "user" is a human or a robot. A Human Interaction Proof (HIP) like Completely Automated Public Turing test to tell Computers and Humans Apart (CAPTCHA) offers a way to make such a distinction. CAPTCHA is a reverse Turing test used by Web serv...
متن کاملAutomated classification of pulmonary nodules through a retrospective analysis of conventional CT and two-phase PET images in patients undergoing biopsy
Objective(s): Positron emission tomography/computed tomography (PET/CT) examination is commonly used for the evaluation of pulmonary nodules since it provides both anatomical and functional information. However, given the dependence of this evaluation on physician’s subjective judgment, the results could be variable. The purpose of this study was to develop an automated scheme for the classific...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید
ثبت ناماگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید
ورودعنوان ژورنال:
- CoRR
دوره cs.IR/0102002 شماره
صفحات -
تاریخ انتشار 2001