Improved Typesetting Models for Historical OCR
Abstract
We present richer typesetting models that extend the unsupervised historical document recognition system of Berg-Kirkpatrick et al. (2013). The first model breaks the independence assumption between vertical offsets of neighboring glyphs and, in experiments, substantially decreases transcription error rates. The second model simultaneously learns multiple font styles and, as a result, is able to accurately track italic and non-italic portions of documents. Richer models complicate inference, so we present a new, streamlined procedure that is over 25x faster than the method used by Berg-Kirkpatrick et al. (2013). Our final system achieves a relative word error reduction of 22% compared to state-of-the-art results on a dataset of historical newspapers.
Related resources
Font group identification using reconstructed fonts
Ideally, digital versions of scanned documents should be represented in a format that is searchable, compressed, highly readable, and faithful to the original. These goals can theoretically be achieved through OCR and font recognition, re-typesetting the document text with original fonts. However, OCR and font recognition remain hard problems, and many historical documents use fonts that are no...
OCR of historical printings with an application to building diachronic corpora: A case study using the RIDGES herbal corpus
This article describes the results of a case study that applies neural network-based Optical Character Recognition (OCR) to scanned images of books printed between 1487 and 1870 by training the OCR engine OCRopus (Breuel et al. 2013) on the RIDGES herbal text corpus (Odebrecht et al. 2017, in press). Training specific OCR models was possible because the necessary ground truth is available as err...
OCR and post-correction of historical Finnish texts
This paper presents experiments on optical character recognition (OCR) that combine the Ocropy software with data-driven spelling correction based on weighted finite-state methods. Both model training and testing were done on Finnish corpora of historical newspaper text, and the best combination of OCR and post-processing models gives 95.21% character recognition accuracy.
Segmentation of Handwritten Characters for Digitalizing Korean Historical Documents
Historical documents are valuable cultural heritage and sources for studying the history, social aspects, and life of their time. Digitizing historical documents aims to provide instant access to the archives for researchers and the public, who previously had limited access for maintenance reasons. However, most of these documents are not only written by hand in ancie...
Painfree LaTeX with Optical Character Recognition and Machine Learning
Recent years have seen an increasing interest in harnessing advancements in machine learning (ML) and optical character recognition (OCR) to convert physical and handwritten documents into digital versions. The increasing adoption of digital documents in academia, however, has provided a new layer of complexity to automatic digitization of physical documents. Compared to typical texts written in n...