OCR Alternatives for Electronic Publishing of Digitised Documents

Author

  • Stefan Pletschacher
Abstract

This paper describes a general approach to automatically preparing digitised documents for storage and processing on various digital platforms. The focus is on documents that are not suitable for optical character recognition (OCR) but exhibit regular structures in the form of text-like blocks. By extracting a document-immanent alphabet, preserving the graphical representations through vectorisation, and encoding the original document on the basis of these steps, the benefits of encoded text can be obtained without the effort and potential errors of recognition methods. Using the Extensible Markup Language (XML) for structural descriptions and Scalable Vector Graphics (SVG) for graphical representations enables seamless integration into style-sheet-based output workflows for producing system-specific layouts.
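The three steps named in the abstract (alphabet extraction, vectorisation, encoding) can be illustrated with a toy sketch. This is not the paper's method: the glyph bitmaps, the Hamming-distance matching, the one-square-per-pixel "vectorisation", and every function and element name (`build_alphabet`, `glyph_to_path`, `encode_document`, `<alphabet>`, `<text>`) are assumptions chosen for illustration. The output borrows SVG-style `symbol`, `path`, and `use` elements but omits the namespaces a real SVG document would need.

```python
import xml.etree.ElementTree as ET

# A toy glyph: rows of '#' (ink) and '.' (background). Hypothetical format.
Glyph = tuple[str, ...]


def hamming(a: Glyph, b: Glyph) -> int:
    """Count differing pixels between two glyphs of equal size."""
    return sum(ca != cb for ra, rb in zip(a, b) for ca, cb in zip(ra, rb))


def same_shape(a: Glyph, b: Glyph) -> bool:
    return len(a) == len(b) and all(len(r) == len(s) for r, s in zip(a, b))


def build_alphabet(glyphs: list[Glyph], max_diff: int = 1):
    """Group near-identical glyphs without recognising which character they are.

    Returns (alphabet, codes): one prototype per distinct shape, plus the
    index of the matching prototype for every input glyph.
    """
    alphabet: list[Glyph] = []
    codes: list[int] = []
    for g in glyphs:
        for idx, proto in enumerate(alphabet):
            if same_shape(proto, g) and hamming(proto, g) <= max_diff:
                codes.append(idx)
                break
        else:  # no prototype matched, so this shape becomes a new symbol
            alphabet.append(g)
            codes.append(len(alphabet) - 1)
    return alphabet, codes


def glyph_to_path(glyph: Glyph) -> str:
    """Crude stand-in for vectorisation: one unit square per ink pixel."""
    return " ".join(
        f"M{x},{y}h1v1h-1z"
        for y, row in enumerate(glyph)
        for x, ch in enumerate(row)
        if ch == "#"
    )


def encode_document(glyphs: list[Glyph], max_diff: int = 1) -> str:
    """Emit XML holding each shape once, with the text as references to it."""
    alphabet, codes = build_alphabet(glyphs, max_diff)
    root = ET.Element("document")
    alpha_el = ET.SubElement(root, "alphabet")
    for idx, proto in enumerate(alphabet):
        sym = ET.SubElement(alpha_el, "symbol", id=f"g{idx}")
        ET.SubElement(sym, "path", d=glyph_to_path(proto))
    text_el = ET.SubElement(root, "text")
    for c in codes:
        ET.SubElement(text_el, "use", href=f"#g{c}")
    return ET.tostring(root, encoding="unicode")
```

Because each shape is stored once and the running text only refers to it, the encoded document behaves like text (repeated symbols can be searched or counted) while no symbol ever has to be mapped to a known character.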

Similar resources

MathDoc and the Electronic Publishing of Mathematics

France has a long tradition in the publication of mathematics. The very first “mathematics only” journal in the world, the Annales de Gergonne, was published from 1810 to 1831, and several of the foremost current mathematical journals are published in France today. This paper will develop the original work of the “MathDoc” team to make accessible these and other journals, and more generally to pr...


Electronic Publishing of Digitised Works

This paper describes the automated process to create structured master and access copies for the digitised works at the BND – National Digital Library. During 2004 and 2005 the BND created nearly half a million digitised images from more than 25,000 titles of printed works, manuscripts, drawings and maps. The result of the digitisation process is a group of TIFF image files representing th...


Optical Font Recognition from Projection Profiles

• Recognition of logical document structures [1], where knowledge of the font used in a word, line, or text block may be useful for defining its logical label (chapter title, section title or paragraph). • Document reproduction, where knowledge of the font is necessary in order to reproduce (reprint) the document. • Document indexing and information retrieval, where word indexes are generally p...


Extracting anchorable information units from PDF files

Document processing and understanding is important for a variety of applications such as office automation, creation of electronic manuals, online documentation and annotation etc. The first step towards this process often involves the extraction of relevant keywords and phrases from the documents so that they can be automatically hyperlinked within and outside the document so that we can creat...


Electronic Documents: What Lies Ahead?

Progressive research and development in computing and information processing technologies have fuelled the emergence of a new range of electronic documents that are expected to fully exploit the electronic medium’s basic properties of added interactivity and flexibility. Coupled with the affiliation with metadata and additional layers of information, e-documents will engage a new...



Publication date: 2005