Extracting Information from Web Content and Structure
Abstract
The Web is a vast data repository. By mining this data efficiently, we can gain valuable knowledge. Unfortunately, in addition to useful content, there are also many Web documents considered harmful (e.g. pornography, terrorism, illegal drugs). Web mining, which comprises three main areas (content, structure, and usage mining), may help us detect and eliminate such sites. In this paper, we concentrate on applications of Web content and Web structure mining. First, we introduce a system for detecting pornographic textual Web pages; we discuss its classification methods and describe its architecture. Second, we present an analysis of the relations among Czech academic computer science Web sites; we give an overview of ranking algorithms and determine the importance of the sites we analyzed.
Similar resources
Presenting a method for extracting structured domain-dependent information from Farsi Web pages
Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...
Data Extraction using Content-Based Handles
In this paper, we present an approach and a visual tool, called HWrap (Handle Based Wrapper), for creating web wrappers to extract data records from web pages. In our approach, we mainly rely on the visible page content to identify data regions on a web page. Our extraction algorithm is inspired by the way a human user scans the page content for specific data. In particular, we use text fea...
Optimized Content Extraction from web pages using Composite Approaches
The information available on the web today is tremendous and comes with greater challenges. Content extraction identifies the main content and removes the clutter from web pages. The main problem in extracting the content from a web page is the newer architecture of web pages and the diversity in their structure. Optimized content extraction from HTML documents using collective approac...
Extracting Content Structure for Web Pages Based on Visual Representation
A new web content structure based on visual representation is proposed in this paper. Many web applications such as information retrieval, information extraction and automatic page adaptation can benefit from this structure. This paper presents an automatic top-down, tag-tree independent approach to detect web content structure. It simulates how a user understands web layout structure based on ...
Identifying the technical requirements for designing health portals
Aim: Considering technical requirements in the design of health portals increases the validity of information. This study identified the technical and content structure required to create these portals. Methods: This was a qualitative study which was conducted in 2020. A combination of comprehensive review and interview was used. The search was performed in Elsevier, EBSCO, Scopus, Web of Scie...
The Content and Structure of Electronic Personal Health Records: A Systematic Review
Introduction: The electronic Personal Health Record (ePHR) improves people’s awareness and care management and leads to health promotion. One of the most important factors that contributes to the development of ePHR is identifying and understanding its content and structure. No comprehensive studies have so far been performed on the content and structure of ePHRs. Therefore, the purpose of this...