Creating Digital Resources from Legacy Documents: An Experience Report from the Biosystematics Domain
Abstract
Digitized legacy documents marked up with XML can be used in many ways, e.g., to generate RDF statements about the world described. A prerequisite for doing so is that the document markup is of sufficient quality. Since fully automated markup-generation methods cannot ensure this, manual correction and cleaning are indispensable. In this paper, we report on our experiences from a digitization and markup project for a large corpus of legacy documents from the biosystematics domain, with a focus on the use of modern tools. The markup created covers both document structure and semantic details. In contrast to previous markup projects reported in the literature, our corpus consists of large publications that comprise many different semantic units, and the documents contain OCR noise and layout artifacts. A core insight is that digitization and automated markup on the one hand and manual cleaning and correction on the other should be tightly interleaved, and that tools supporting this integration yield a significant improvement.
Similar resources
Efficient Conversion of Scientific Legacy Documents into Semantic Web Resources: using biosystematics as a working example
Analyzing registry, log files, and prefetch files in finding digital evidence in graphic design applications
The products of graphic design applications leave behind traces of digital information which can be used during a digital forensic investigation in cases where counterfeit documents have been created. This paper analyzes the digital forensics involved in the creation of counterfeit documents. This is achieved by first recognizing the digital forensic artifacts left behind from the use of graphi...
A combining approach to Find All taxon names (FAT) in legacy biosystematics literature
Most of the literature on natural history is hidden in millions of pages stacked up in our libraries. Various initiatives now aim at making these publications digitally accessible and searchable by applying XML markup technologies. The unique biological names play a crucial role in linking content related to a particular taxon. Thus discovering and marking them up is extremely important. Since the...
Biosystematics and phylogeny of Tanacetum fisherae, a new record from Iran
The chromosome number (2n=5x=44+1B) of Tanacetum fisherae, as a new record from high mountains of southern Iran (Kerman province: Hazar mountain) is reported. A new ploidy level (pentaploidy) for the genus is presented for the first time. The studied population was aneuploid, having lost one chromosome out of the 45 expected in an x=9-based pentaploid. The distribution map, description and micr...
Semi-Automated XML Markup of Biosystematic Legacy Literature with the Goldengate Editor
Today, digitization of legacy literature is a big issue. This also applies to the domain of biosystematics, where this process has just started. Digitized biosystematics literature requires very precise and fine-grained markup in order to be useful for detailed search, data linkage and mining. However, manual markup at the sentence level and below is cumbersome and time-consuming. In this paper, ...