historical picture

Building a historical corpus for Classical Portuguese: some technological aspects

2006

Maria Clara Paixão de Sousa, Thorsten Trippel

This paper describes the restructuring process of a large corpus of historical documents and the system architecture that is used for accessing it. The initial challenge of this process was to get the most out of existing material, normalizing the legacy markup and harvesting the inherent information using widely available standards. This resulted in a conceptual and technical restructuring of ...


Input sensitive thresholding for ancient Hebrew manuscript

Journal: Pattern Recognition Letters 2005

Itay Bar Yosef

In this paper, we describe an input-sensitive thresholding algorithm for ancient Hebrew calligraphy documents. Historical document images are usually of poor quality, since the documents have degraded over time due to storage conditions. However, the distribution of noise within a document is not uniform, and character quality may vary. We develop tools to identify noisy characters and apply ...


An Unsupervised Model of Orthographic Variation for Historical Document Transcription

2016

Dan Garrette, Hannah Alpert-Abrams

Historical documents frequently exhibit extensive orthographic variation, including archaic spellings and obsolete shorthand. OCR tools typically seek to produce so-called diplomatic transcriptions that preserve these variants, but many end tasks require transcriptions with normalized orthography. In this paper, we present a novel joint transcription model that learns, unsupervised, a probabili...


Parsing the Past - Identification of Verb Constructions in Historical Text

2012

Eva Pettersson, Beáta Megyesi, Joakim Nivre

Even though NLP tools are widely used for contemporary text today, there is a lack of tools that can handle historical documents. Such tools could greatly facilitate the work of researchers dealing with large volumes of historical texts. In this paper we propose a method for extracting verbs and their complements from historical Swedish text, using NLP tools and dictionaries developed for conte...


The Gamera framework for building custom recognition systems

2003

Michael Droettboom, Karl MacMillan, Ichiro Fujinaga

This paper describes the Gamera framework for building custom document recognition systems. This open-source system is designed to support the test-and-refine development cycle: an important style for developing recognition systems that work with difficult historical documents, since the solutions are often non-obvious. This paper explains the overall architecture of the system, in addition to d...


Linguistically-Enhanced Search over an Open Diachronic Corpus

2015

Rafael C. Carrasco, Isabel Martínez-Sempere, Enrique Mollá-Gandía, Felipe Sánchez-Martínez, Gustavo Candela Romero, Maria Pilar Escobar Esteban

The BVC section of the impact-es diachronic corpus of historical Spanish compiles 86 books containing approximately 2 million words. About 27% of the words, providing a representative coverage of the most frequent word forms, have been annotated with their lemma, part of speech, and modern equivalent following the Text Encoding Initiative guidelines. We describe how this type of annotation can...


Value of Learning in Sponsored Search Auctions

2010

Sai-Ming Li, Mohammad Mahdian, R. Preston McAfee

The standard business model in the sponsored search marketplace is to sell click-throughs to the advertisers. This involves running an auction that allocates advertisement opportunities based on the value the advertiser is willing to pay per click, times the click-through rate of the advertiser. The click-through rate of an advertiser is the probability that if their ad is shown, it would be cl...
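The allocation rule named in this abstract (value per click times click-through rate) can be illustrated with a minimal sketch. The `Ad` class, the `rank_ads` function, and the sample figures below are hypothetical and are not taken from the paper; pricing and the learning of click-through rates are left out.

```python
from dataclasses import dataclass


@dataclass
class Ad:
    """Hypothetical record for one advertiser's offer."""
    name: str
    bid_per_click: float  # value the advertiser is willing to pay per click
    ctr: float            # estimated probability that the ad is clicked when shown


def expected_value(ad: Ad) -> float:
    """Expected payment per impression: bid per click times click-through rate."""
    return ad.bid_per_click * ad.ctr


def rank_ads(ads: list[Ad]) -> list[Ad]:
    """Order ads by bid times click-through rate, highest first."""
    return sorted(ads, key=expected_value, reverse=True)


if __name__ == "__main__":
    ads = [
        Ad("A", bid_per_click=2.0, ctr=0.05),
        Ad("B", bid_per_click=1.0, ctr=0.20),
        Ad("C", bid_per_click=3.0, ctr=0.01),
    ]
    for ad in rank_ads(ads):
        print(ad.name, expected_value(ad))
```

Under this rule a lower bid can still win the top slot when its click-through rate is high enough, which is why the accuracy of the click-through estimate matters to the auctioneer.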


A Hybrid Binarization Technique for Document Images

2011

Sokratis Vavilis, Ergina Kavallieratou, Roberto Paredes, Kostas Sotiropoulos

In this chapter, a binarization technique specifically designed for historical document images is presented. Existing binarization techniques focus either on finding an appropriate global threshold or on adapting a local threshold for each area in order to remove smear, stains, uneven illumination, etc. Here, a hybrid approach is presented that first applies a global thresholding technique and, th...
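The two families of techniques contrasted in this abstract can be sketched as follows. This is not the chapter's hybrid method, whose second stage is cut off in the abstract; it is a plain-Python illustration, and the function names, the mean-based local rule, and the `radius` and `offset` parameters are assumptions made for the example.

```python
def global_threshold(image, t):
    """Binarize with a single threshold for the whole image (1 = ink, 0 = background)."""
    return [[1 if px < t else 0 for px in row] for row in image]


def local_mean_threshold(image, radius=1, offset=0):
    """Binarize each pixel against the mean of its surrounding window.

    The window spans up to `radius` pixels in every direction and is clipped
    at the image border. A pixel counts as ink when it is darker than the
    window mean by more than `offset`.
    """
    height, width = len(image), len(image[0])
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [
                image[j][i]
                for j in range(max(0, y - radius), min(height, y + radius + 1))
                for i in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            mean = sum(window) / len(window)
            row.append(1 if image[y][x] < mean - offset else 0)
        result.append(row)
    return result


if __name__ == "__main__":
    # One scan line with uneven illumination: bright on the left, dark on the right.
    # The strokes are the pixels with values 120 and 40.
    line = [[200, 120, 200, 150, 90, 40, 90]]
    print(global_threshold(line, t=100))
    print(local_mean_threshold(line, radius=1, offset=10))
```

On the sample scan line, the single global threshold misses the faint stroke in the bright region and marks dark background as ink, while the local rule adapts to each neighbourhood. A hybrid scheme would combine the two stages in some way.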


Modeling the Hebrew Bible: Potential of Topic Modeling Techniques for Semantic Annotation and Historical Analysis

2016

Mathias Coeckelbergs, Seth van Hooland

Providing useful and efficient semantic annotations is a major challenge for knowledge design of any body of text, especially historical documents. In this article, we propose Topic Modeling as an important first step to gather semantic information beyond the lexicon which can be added as annotations in the SHEBANQ. By laying out a case study, we discuss both noise and structure found in compar...


Text line extraction for historical document images

Journal: Pattern Recognition Letters 2014

Raid Saabni, Abedelkadir Asi, Jihad El-Sana

© 2013 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.patrec.2013.07.007 Corresponding author at: Department of Computer Science, Triangle Research & Development Center, Kafr Qarea, Israel.
