Text Correction Using Domain Dependent Bigram Models from Web Crawls
Authors
Abstract
The quality of text correction systems can be improved when using complex language models and by taking peculiarities of the garbled input text into account. We report on a series of experiments where we crawl domain dependent web corpora for a given garbled input text. From crawled corpora we derive dictionaries and language models, which are used to correct the input text. We show that correction accuracy is improved when integrating word bigram frequency values from the crawls as a new score into a baseline correction strategy based on word similarity and word (unigram) frequencies. In a second series of experiments we compare the quality of distinct language models, measuring how closely these models reflect the frequencies observed in a given input text. It is shown that crawled language models are superior to language models obtained from standard corpora.
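As an illustration only (the abstract does not give the exact scoring formula), the following Python sketch ranks correction candidates by a weighted combination of string similarity, unigram frequency, and left-context bigram frequency; the weights, helper names, and toy counts are all invented for the example:

# Hypothetical sketch: combine word similarity, unigram frequency, and
# bigram frequency (as from a crawled corpus) into one ranking score.
# Weights and the combination scheme are illustrative, not the paper's.
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity between the garbled token and a candidate."""
    return SequenceMatcher(None, a, b).ratio()


def score(token, candidate, prev_word, unigrams, bigrams,
          w_sim=0.5, w_uni=0.3, w_bi=0.2):
    """Weighted combination of the three evidence sources."""
    total = sum(unigrams.values()) or 1
    uni = unigrams.get(candidate, 0) / total
    bi = bigrams.get((prev_word, candidate), 0) / max(unigrams.get(prev_word, 1), 1)
    return w_sim * similarity(token, candidate) + w_uni * uni + w_bi * bi


def correct(token, prev_word, lexicon, unigrams, bigrams):
    """Pick the highest-scoring candidate from the crawled lexicon."""
    return max(lexicon, key=lambda c: score(token, c, prev_word, unigrams, bigrams))


# Toy frequencies as they might be derived from a domain-dependent crawl.
unigrams = {"language": 120, "models": 90, "medals": 40}
bigrams = {("language", "models"): 60}
print(correct("modls", "language", ["models", "medals"], unigrams, bigrams))
# -> "models": similar surface form, frequent unigram, strong bigram with "language"

The interpolation weights are placeholders; the point is only that the bigram term lets the left context ("language") pull the decision toward "models" even when the similarity scores of two candidates are close.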
Similar resources
Comparing Topic Coverage in Breadth-First and Depth-First Crawls Using Anchor Texts
Web archives preserve the fast changing Web by repeatedly crawling its content. The crawling strategy has an influence on the data that is archived. We use link anchor text of two Web crawls created with different crawling strategies in order to compare their coverage of past popular topics. One of our crawls was collected by the National Library of the Netherlands (KB) using a depth-first strat...
Presenting a method for extracting structured domain-dependent information from Farsi Web pages
Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...
Domain-Specific Corpus Expansion with Focused Webcrawling
This work presents a straightforward method for extending or creating in-domain web corpora by focused webcrawling. The focused webcrawler uses statistical N-gram language models to estimate the relatedness of documents and weblinks and needs as input only N-grams or plain texts of a predefined domain and seed URLs as starting points. Two experiments demonstrate that our focused crawler is able...
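A minimal sketch of the kind of relatedness test such a focused crawler might apply, assuming word n-gram overlap with a domain model and an invented acceptance threshold (not the authors' actual scoring):

# Keep a page (and enqueue its links) only if enough of its word n-grams
# also occur in the domain model. Order n=3 and threshold 0.2 are assumptions.
from collections import Counter


def word_ngrams(text: str, n: int = 3) -> Counter:
    """Count word n-grams after simple whitespace tokenization."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def relatedness(doc: str, domain_model: Counter, n: int = 3) -> float:
    """Fraction of the document's n-grams that also occur in the domain model."""
    doc_ngrams = word_ngrams(doc, n)
    if not doc_ngrams:
        return 0.0
    hits = sum(c for g, c in doc_ngrams.items() if g in domain_model)
    return hits / sum(doc_ngrams.values())


def should_follow(doc: str, domain_model: Counter, threshold: float = 0.2) -> bool:
    """Decide whether to archive the page and follow its outgoing links."""
    return relatedness(doc, domain_model) >= threshold


domain = word_ngrams("bigram language models derived from domain dependent web crawls")
print(should_follow("we derive bigram language models from domain dependent web crawls", domain))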
Spelling Correction Based on User Search Contextual Analysis and Domain Knowledge
We propose a spelling correction algorithm that combines trusted domain knowledge and query log information for query spelling correction. This algorithm uses query reformulations in the query log and bigram language models built from queries for efficiently and effectively generating correction suggestions and ranking them to find valid corrections. Experimental results show that for both simp...
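Purely to illustrate one ingredient mentioned above, the sketch below mines consecutive query pairs within a session as candidate (misspelling, correction) pairs; the session format and similarity threshold are assumptions, not the paper's procedure:

# Treat a pair of consecutive, near-identical queries in one session as a
# likely self-correction and hence a candidate (misspelling, correction) pair.
from difflib import SequenceMatcher


def reformulation_pairs(sessions, min_similarity=0.85):
    """Yield consecutive query pairs within a session that look like respellings."""
    for queries in sessions:
        for prev, curr in zip(queries, queries[1:]):
            if prev != curr and SequenceMatcher(None, prev, curr).ratio() >= min_similarity:
                yield prev, curr


sessions = [["britny spears", "britney spears", "britney spears albums"]]
print(list(reformulation_pairs(sessions)))
# [('britny spears', 'britney spears')]  -- the topical reformulation is filtered out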
Summarizing Disasters Over Time
We have developed a text summarization system that can generate summaries over time from web crawls on disasters. We show that our method of identifying exemplar sentences for a summary using affinity propagation clustering produces better summaries than clustering based on K-medoids as measured using Rouge on a small set of examples. A key component of our approach is the prediction of salient...
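A minimal sketch of exemplar-sentence selection with affinity propagation, here using scikit-learn on a simple word-overlap similarity matrix; the toy sentences and the similarity measure are assumptions, and the salience prediction step is not reproduced:

# Cluster sentences with affinity propagation over a precomputed similarity
# matrix and report the exemplar (cluster center) of each cluster.
import numpy as np
from sklearn.cluster import AffinityPropagation

sentences = [
    "The storm flooded the coastal town on Monday",
    "Flooding hit the coastal town after the storm",
    "Rescue teams evacuated hundreds of residents",
    "Hundreds of residents were evacuated by rescue teams",
    "Officials estimate the damage at ten million dollars",
]


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity between two sentences."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)


# Pairwise similarity matrix used as the precomputed affinity.
S = np.array([[jaccard(a, b) for b in sentences] for a in sentences])

ap = AffinityPropagation(affinity="precomputed", random_state=0).fit(S)
for i in ap.cluster_centers_indices_:
    print("exemplar:", sentences[i])

Unlike K-medoids, affinity propagation does not require fixing the number of clusters in advance, which is one reason it is attractive for summaries whose length is not known beforehand.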