Web Corpus Cleaning using Content and Structure
Abstract
This paper describes experiments on cleaning web corpora. While previously described approaches focus mainly on the visual representation of web pages, we evaluate approaches that rely on content and structure. We have evaluated a heuristics-based approach, as well as approaches based on decision trees, a genetic algorithm and language models. The best performance was achieved using the heuristics-based approach.
Similar resources
Victor: the Web-Page Cleaning Tool
In this paper we present a complete solution for automatic cleaning of arbitrary HTML pages, with the goal of using web data as a corpus in natural language processing and computational linguistics. We employ a sequence-labeling approach based on Conditional Random Fields (CRF). Every block of text in the analyzed web page is assigned a set of features extracted from the textual content an...
Web Page Cleaning with Conditional Random Fields
This paper describes the participation of Charles University in Cleaneval 2007, the shared task and competitive evaluation of automatic systems for cleaning arbitrary web pages with the goal of preparing web data for use as a corpus in the area of computational linguistics and natural language processing. We try to solve this task as a sequence-labeling problem and our experimental system i...
Cleaneval: a Competition for Cleaning Web Pages
Cleaneval is a shared task and competitive evaluation on the topic of cleaning arbitrary web pages, with the goal of preparing web data for use as a corpus for linguistic and language technology research and development. The first exercise took place in 2007. We describe how it was set up, results, and lessons learnt.
Building a Web Corpus of Czech
Large corpora are essential to modern methods of computational linguistics and natural language processing. In this paper, we describe an ongoing project whose aim is to build the largest corpus of Czech texts. We are building the corpus from Czech Internet web pages, using (and, if needed, developing) advanced downloading, cleaning and automatic linguistic processing tools. Our concern is to kee...
Data Preparation for Web Mining – A survey
An accepted trend is to categorize web mining into three main areas: web content mining, web structure mining and web usage mining. Web content mining involves extracting information from the contents of web pages and performing tasks such as knowledge synthesis. Web structure mining involves the use of graph theory to understand website structure and hierarchy. Web usage mining involves the...