The impact of running headers and footers on proximity searching
Abstract
Hundreds of experiments over the last decade on the retrieval of OCR documents performed by the Information Science Research Institute have shown that OCR errors do not significantly affect retrievability. We extend those results to show that in the case of proximity searching, the removal of running headers and footers from OCR text will not improve retrievability for such searches.
Similar resources
Header and footer extraction by page association
This paper introduces a robust algorithm to extract headers and footers from a variety of electronic documents such as image files, Adobe PDF files, and files generated by OCR. Compared with the conventional methods based on the page-level layout and format, the proposed strategy considers a page in the context of neighboring pages. Through such page-association, the headers and footers on a varie...
Fast in-Place File Carving for Digital Forensics
Scalpel, a popular open source file recovery tool, performs file carving using the Boyer-Moore string search algorithm to locate headers and footers in a disk image. We show that the time required for file carving may be reduced significantly by employing multi-pattern search algorithms such as the multipattern Boyer-Moore and Aho-Corasick algorithms as well as asynchronous disk reads and multi...
Chapter 5: Context-Based File Block Classification
Because files are typically stored as sequences of data blocks, the file carving process in digital forensics involves the identification and collocation of the original blocks of files. Current file carving techniques that use the signatures of file headers and footers could be improved by first classifying each data block in the storage media as belonging to a given file type. Unfortunately, ...
FIASCO: Filtering the Internet by Automatic Subtree Classification, Osnabrück
The FIASCO system implements a machine-learning approach for the automatic removal of boilerplate (navigation bars, link lists, page headers and footers, etc.) from Web pages in order to make them available as a clean and useful corpus for linguistic purposes. The system parses an HTML document into a DOM tree representation and identifies a set of disjoint subtrees that correspond to text bloc...
Extracting the Main Content from HTML Documents
A modern web document typically consists of many kinds of information. Besides the main content which conveys the primary information, a web document also contains noisy contents such as advertisements, headers, footers, decorations, copyright information, navigation menus etc. The presence of noisy contents may affect the performance of applications such as commercial search engines, web crawl...