OCR Alternatives for Electronic Publishing of Digitised Documents
Abstract
This paper describes a general approach to automatically preparing digitised documents for storage and processing on various digital platforms. The focus is on documents that are unsuitable for optical character recognition (OCR) but exhibit regular structures in the form of text-like blocks. The approach extracts a document-immanent alphabet, preserves the graphical representation of its symbols by vectorisation, and encodes the original document on that basis. This yields the benefits of encoded text without the effort and the potential errors of recognition methods. Using the Extensible Markup Language (XML) for structural descriptions and Scalable Vector Graphics (SVG) for graphical representations enables seamless integration into style-sheet-based output workflows that produce system-specific layouts.
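The three steps named in the abstract (alphabet extraction, vectorisation, encoding) can be illustrated with a minimal Python sketch. It is not the paper's implementation: the glyph bitmaps, the exact-match comparison, the run-length vectorisation, and the names `extract_alphabet`, `bitmap_to_path` and `encode_document` are all assumptions chosen for illustration. A real system would need tolerant shape matching and proper contour tracing.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def extract_alphabet(glyphs):
    """Return (alphabet, codes): the distinct glyph shapes and one index per glyph."""
    alphabet, index, codes = [], {}, []
    for glyph in glyphs:
        # Assumption: shapes match exactly; scanned data needs tolerant matching.
        if glyph not in index:
            index[glyph] = len(alphabet)
            alphabet.append(glyph)
        codes.append(index[glyph])
    return alphabet, codes


def bitmap_to_path(bitmap):
    """Vectorise a bitmap as SVG path data, one unit-height rectangle per run of '#'."""
    parts = []
    for y, row in enumerate(bitmap):
        x = 0
        while x < len(row):
            if row[x] == "#":
                start = x
                while x < len(row) and row[x] == "#":
                    x += 1
                width = x - start
                parts.append(f"M{start} {y}h{width}v1h-{width}z")
            else:
                x += 1
    return "".join(parts)


def encode_document(glyphs, advance=4):
    """Encode glyphs as an SVG string: shapes defined once, then referenced by index."""
    alphabet, codes = extract_alphabet(glyphs)
    ET.register_namespace("", SVG_NS)
    svg = ET.Element(f"{{{SVG_NS}}}svg")
    defs = ET.SubElement(svg, f"{{{SVG_NS}}}defs")
    for i, bitmap in enumerate(alphabet):
        ET.SubElement(defs, f"{{{SVG_NS}}}path",
                      {"id": f"g{i}", "d": bitmap_to_path(bitmap)})
    for position, code in enumerate(codes):
        # Plain href follows SVG 2; SVG 1.1 viewers expect xlink:href instead.
        ET.SubElement(svg, f"{{{SVG_NS}}}use",
                      {"href": f"#g{code}", "x": str(position * advance)})
    return codes, ET.tostring(svg, encoding="unicode")


# Hypothetical 3x3 glyph bitmaps; tuples of strings so they can be dictionary keys.
BOX = ("###", "#.#", "###")
ELL = ("#..", "#..", "###")
codes, svg_text = encode_document([BOX, ELL, BOX])
```

In this sketch the index sequence `codes` plays the role of encoded text: it can be searched and compared without anyone deciding which character each shape represents, while the SVG definitions keep the original appearance.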
Similar resources
MathDoc and the Electronic Publishing of Mathematics
France has a long tradition in the publication of mathematics. The very first “mathematics only” journal in the world, the Annales de Gergonne was published from 1810 to 1831 and several of the foremost current mathematical journals are published in France today. This paper will develop the original work of the “MathDoc” team to make accessible these and other journals, and more generally to pr...
Electronic Publishing of Digitised Works
This paper describes the automated process to create structured master and access copies for the digitised works at the BND – National Digital Library. During 2004 and 2005 the BND created nearly half a million digitised images from more than 25,000 titles of printed works, manuscripts, drawings and maps. The result of the digitisation process is a group of TIFF image files representing th...
Optical Font Recognition from Projection Profiles
• Recognition of logical document structures [1], where knowledge of the font used in a word, line, or text block may be useful for defining its logical label (chapter title, section title or paragraph). • Document reproduction, where knowledge of the font is necessary in order to reproduce (reprint) the document. • Document indexing and information retrieval, where word indexes are generally p...
Extracting anchorable information units from PDF files
Document processing and understanding is important for a variety of applications such as office automation, creation of electronic manuals, online documentation and annotation etc. The first step towards this process often involves the extraction of relevant keywords and phrases from the documents so that they can be automatically hyperlinked within and outside the document so that we can creat...
Electronic Documents: What Lies Ahead?
Progressive research and development in computing and information processing technologies has fuelled the emergence of a new range of electronic documents that are expected to fully exploit the electronic medium’s basic properties of added interactivity and flexibility. Coupled with metadata and additional layers of information, e-documents will engage a new...