Creating Digital Resources from Legacy Documents: An Experience Report from the Biosystematics Domain

Authors

  • Guido Sautter
  • Klemens Böhm
  • Donat Agosti
  • Christiana Klingenberg
Abstract

Digitized legacy documents marked up with XML can be used in many ways, e.g., to generate RDF statements about the world they describe. A prerequisite for doing so is that the document markup is of sufficient quality. Since fully automated markup-generation methods cannot ensure this, manual correction and cleaning are indispensable. In this paper, we report on our experiences from a digitization and markup project for a large corpus of legacy documents from the biosystematics domain, with a focus on the use of modern tools. The markup created covers both document structure and semantic details. In contrast to previous markup projects reported in the literature, our corpus consists of large publications that comprise many different semantic units, and the documents contain OCR noise and layout artifacts. A core insight is that digitization and automated markup on the one hand, and manual cleaning and correction on the other, should be tightly interleaved, and that tools supporting this integration yield a significant improvement.

Similar resources

Analyzing registry, log files, and prefetch files in finding digital evidence in graphic design applications

The products of graphic design applications leave behind traces of digital information which can be used during a digital forensic investigation in cases where counterfeit documents have been created. This paper analyzes the digital forensics involved in the creation of counterfeit documents. This is achieved by first recognizing the digital forensic artifacts left behind from the use of graphi...

A combining approach to Find All taxon names (FAT) in legacy biosystematics literature

Most of the literature on natural history is hidden in millions of pages stacked up in our libraries. Various initiatives now aim at making these publications digitally accessible and searchable by applying XML markup technologies. The unique biological names play a crucial role in linking content related to a particular taxon. Thus discovering and marking them up is extremely important. Since the...

Biosystematics and phylogeny of Tanacetum fisherae, a new record from Iran

The chromosome number (2n=5x=44+1B) of Tanacetum fisherae, as a new record from high mountains of southern Iran (Kerman province: Hazar mountain) is reported. A new ploidy level (pentaploidy) for the genus is presented for the first time. The studied population was aneuploid, having lost one chromosome out of the 45 expected in an x=9-based pentaploid. The distribution map, description and micr...

Semi-Automated XML Markup of Biosystematic Legacy Literature with the Goldengate Editor

Today, digitization of legacy literature is a big issue. This also applies to the domain of biosystematics, where this process has just started. Digitized biosystematics literature requires very precise and fine-grained markup in order to be useful for detailed search, data linkage and mining. However, manual markup at the sentence level and below is cumbersome and time-consuming. In this paper, ...

Publication date: 2009