Knight, Ian A. and Brailsford, David F. (2016) Enhancing the searchability of page-image PDF documents using an aligned hidden layer from a truth text. In: DocEng '16 Proceedings of the 2016 ACM Symposium on Document
Abstract
The search accuracy achieved in a PDF image-plus-hidden-text (PDF-IT) document depends upon the accuracy of the optical character recognition (OCR) process that produced the searchable hidden text layer. In many cases recognising words in a blurred area of a PDF page image may exceed the capabilities of an OCR engine. This paper describes a project to replace an inadequate hidden textual layer of a PDF-IT file with a more accurate hidden layer produced from a ‘truth text’. The alignment of the truth text with the image is guided by using OCR-provided page-image co-ordinates, for those glyphs that are correctly recognised, as a set of fixed location points between which other truth-text words can be inserted and aligned with blurred glyphs in the image. Results are presented to show the much-enhanced searchability of this new file when compared to that of the original file, which had an OCR-produced hidden layer with no truth-text enhancement.
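The anchoring idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `align_truth_to_ocr`, the simplified one-dimensional word boxes `(text, x0, x1)`, the use of `difflib.SequenceMatcher` to find correctly recognised words, and the character-count interpolation between anchors are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def align_truth_to_ocr(ocr_words, truth_words):
    """Place truth-text words on one text line of a page image.

    ocr_words   -- non-empty list of (text, x0, x1) tuples from an OCR
                   engine (hypothetical, simplified 1-D word boxes).
    truth_words -- list of non-empty strings from the truth text.
    Returns a list of (truth_word, x0, x1) tuples.
    """
    ocr_texts = [w[0] for w in ocr_words]
    matcher = SequenceMatcher(a=ocr_texts, b=truth_words, autojunk=False)
    placed = [None] * len(truth_words)

    # Words the OCR engine got right become fixed anchor points.
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            _, x0, x1 = ocr_words[block.a + k]
            placed[block.b + k] = (truth_words[block.b + k], x0, x1)

    line_left = min(w[1] for w in ocr_words)
    line_right = max(w[2] for w in ocr_words)

    # Fill each run of unanchored truth words between neighbouring anchors.
    i = 0
    while i < len(truth_words):
        if placed[i] is not None:
            i += 1
            continue
        j = i
        while j < len(truth_words) and placed[j] is None:
            j += 1
        left = placed[i - 1][2] if i > 0 else line_left
        right = placed[j][1] if j < len(truth_words) else line_right
        gap_words = truth_words[i:j]
        total_chars = sum(len(w) for w in gap_words)
        x = left
        # Assumption: share the gap in proportion to character count.
        for k, word in enumerate(gap_words):
            width = (right - left) * len(word) / total_chars
            placed[i + k] = (word, x, x + width)
            x += width
        i = j
    return placed
```

For example, calling `align_truth_to_ocr([("The", 0, 30), ("qu1ck", 40, 90), ("br0wn", 100, 150), ("fox", 160, 190)], ["The", "quick", "brown", "fox"])` keeps the OCR boxes for "The" and "fox" and spreads "quick" and "brown" across the span between them. A real system would work with two-dimensional glyph boxes and tolerate partial matches.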
Similar resources
Macdonald, Alexander J. and Brailsford, David F. and Bagley, Steven R. (2005) Encapsulating and Manipulating Component Object Graphics (COGs) using SVG. In: ACM Symposium on Document Engineering
Scalable Vector Graphics (SVG) has an imaging model similar to that of PostScript and PDF but the XML basis of SVG allows it to participate fully, via namespaces, in generalised XML documents. There is increasing interest in using SVG as a Page Description Language and we examine ways in which SVG document components can be encapsulated in contexts where SVG will be used as a rendering technolo...
Hughes, Jacob and Brailsford, David F. and Bagley, Steven R. and Adams, Clive E. (2014) Generating summary documents for a variable-quality PDF document collection. In: ACM Symposium on Document
The Cochrane Schizophrenia Group’s Register of studies details all aspects of the effects of treating people with schizophrenia. It has been gathered over the last 20 years and consists of around 20,000 documents, overwhelmingly in PDF. Document collections of this sort – on a given theme but gathered from a wide range of sources – will generally have huge variability in the quality of the PDF,...
Probets, Steve and Mong, Julius and Evans, David and Brailsford, David F. (2001) Vector Graphics: From PostScript and Flash to SVG. In: ACM Symposium on Document Engineering (DocEng
The XML-based specification for Scalable Vector Graphics (SVG), sponsored by the World Wide Web Consortium, allows for compact and descriptive vector graphics for the Web. SVG's domain of discourse is that of graphic primitives whose optional attributes express line thickness, fill patterns, text size and so on. These primitives have very different properties from those of the traditional docume...
Document Analysis And Classification Based On Passing Window
In this paper we present a document analysis and classification system to segment and classify the contents of Arabic document images. This system includes preprocessing, document segmentation, feature extraction and document classification. A document image is enhanced in preprocessing by noise removal, binarization, and skew detection and correction. In document segmentation, an algorith...
Directional Stroke Width Transform to Separate Text and Graphics in City Maps
City maps are among the more complex documents found in the real world. In these maps, text labels overlap with graphics and appear in a variety of fonts and styles in different orientations. Usually, text and graphic colours are not predefined, because they vary between map publishers. In most city maps, text and graphic lines form a single connected component. Moreover, the common regions of text and graphic lin...