Web Corpus Mining By Instance Of Wikipedia
Abstract
In this paper we present an approach to structure learning for web documents, as a step toward web genre tagging in web corpus linguistics. A central outcome of the paper is that purely structure-oriented approaches to web document classification provide an information gain that may be utilized in approaches combining web content and structure analysis.
Similar resources
Extracting Structured Knowledge for Semantic Web by Mining Wikipedia
Since Wikipedia has become a huge-scale database storing a wide range of human knowledge, it is a promising corpus for knowledge extraction. A considerable number of studies on Wikipedia mining have been conducted, and they have confirmed that Wikipedia is an invaluable corpus. Wikipedia’s impressive characteristics are not limited to the scale, but also include the dense link structure...
Aisles through the Category Forest - Utilising the Wikipedia Category System for Corpus Building in Machine Learning
The World Wide Web is a continuous challenge to machine learning. Established approaches have to be enhanced and new methods developed in order to tackle the problem of finding and organising relevant information. It has often been argued that semantic classifications of input documents help to solve this task. But while approaches of supervised text categorisation perform quite well on gen...
Wikipedia Link Structure and Text Mining for Semantic Relation Extraction
Wikipedia, a collaborative Wiki-based encyclopedia, has become a huge phenomenon among Internet users. It covers a huge number of concepts of various fields such as Arts, Geography, History, Science, Sports and Games. Since it is becoming a database storing all human knowledge, Wikipedia mining is a promising approach that bridges the Semantic Web and the Social Web (a.k.a. Web 2.0). In fact, i...
An Integrated Approach for Relation Extraction from Wikipedia Texts
Linguistic-based methods and web mining-based methods are the two leading types of methods for the semantic relation extraction task. By integrating linguistic analysis with frequent Web information, this paper presents an unsupervised relation extraction approach for discovering and enhancing relations in which a specified concept participates. We focus on concepts described in Wikipedia articles. By...
Wikipedia Mining Wikipedia as a Corpus for Knowledge Extraction
Wikipedia, a collaborative Wiki-based encyclopedia, has become a huge phenomenon among Internet users. It covers a huge number of concepts of various fields such as Arts, Geography, History, Science, Sports and Games. As a corpus for knowledge extraction, Wikipedia’s impressive characteristics are not limited to the scale, but also include the dense link structure, word sense disambiguation bas...