How to do lexical quality estimation of a large OCRed historical Finnish newspaper collection with scarce resources

نویسندگان

  • Kimmo Kettunen
  • Tuula Pääkkönen
چکیده

Digitization of both hand-written and printed historical material during the last 10–15 years has been an ongoing academic and non-academic industry. Most probably this activity will only increase in the current Digital Humanities era. As a result of past and current work we have lots of digital historical document collections available and will have more of them in the future. The National Library of Finland has digitized a large proportion of the historical newspapers published in Finland between 1771 and 1910 (Bremer-Laamanen 2001, 2005, 2014; Kettunen et al. 2014). This collection contains approximately 1.95 million pages in Finnish and Swedish. Finnish part of the collection consists of about 2.40 billion words. The National Library’s Digital Collections are offered via the digi.kansalliskirjasto.fi web service, also known as Digi. Part of the newspaper material (years 1771–1874) is also available freely downloadable in The Language Bank of Finland provided by the FIN-CLARIN consortium 1 . The collection can also be accessed through the Korp 2 environment that has been developed by Språkbanken at the University of Gothenburg and extended by FIN-CLARIN team at the University of Helsinki to provide concordances of text resources. A Cranfield style information retrieval test collection has also been produced out of a small part of the Digi newspaper material at the University of Tampere (Järvelin et al. 2015). An open data package of the whole collection will be released during the year 2016 (Pääkkönen et al., 2016) The web service digi.kansalliskirjasto.fi contains different material besides newspapers, including journals, and ephemera (different small prints). Recently a new service was created: it enables marking of clips and storing of them to a personal scrapbook. The user can also save links to his search keys and results in an Excel file. The web service is used, for example, by genealogists, heritage societies, researchers, and history enthusiast laymen (Hölttä, 2016). There is also an 1 https://kitwiki.csc.fi/twiki/bin/view/FinCLARIN/KielipankkiAineistotDigilibPub 2 https://korp.csc.fi/

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Measuring Lexical Quality of a Historical Finnish Newspaper Collection ― Analysis of Garbled OCR Data with Basic Language Technology Tools and Means

The National Library of Finland has digitized a large proportion of the historical newspapers published in Finland between 1771 and 1910 (Bremer-Laamanen 2001). This collection contains approximately 1.95 million pages in Finnish and Swedish. Finnish part of the collection consists of about 2.39 billion words. The National Library’s Digital Collections are offered via the digi.kansalliskirjasto...

متن کامل

Modern Tools for Old Content - in Search of Named Entities in a Finnish OCRed Historical Newspaper Collection 1771-1910

Named entity recognition (NER), search, classification and tagging of names and name like frequent informational elements in texts, has become a standard information extraction procedure for textual data. NER has been applied to many types of texts and different types of entities: newspapers, fiction, historical records, persons, locations, chemical compounds, protein families, animals etc. In ...

متن کامل

Tagging Named Entities in 19th Century and Modern Finnish Newspaper Material with a Finnish Semantic Tagger

Named Entity Recognition (NER), search, classification and tagging of names and name like informational elements in texts, has become a standard information extraction procedure for textual data during the last two decades. NER has been applied to many types of texts and different types of entities: newspapers, fiction, historical records, persons, locations, chemical compounds, protein familie...

متن کامل

Old Content and Modern Tools - Searching Named Entities in a Finnish OCRed Historical Newspaper Collection 1771-1910

Kimmo Kettunen, Eetu Mäkelä, Teemu Ruokolainen, Juha Kuokkala and Laura Löfberg 1 National Library of Finland, Centre for Preservation and Digitization, Mikkeli, Finland [email protected] 2 Aalto University, Semantic Computing Research Group, Espoo, Finland [email protected] 3 National Library of Finland, Centre for Preservation and Digitization, Mikkeli, Finland teemu.ruokolainen@h...

متن کامل

A semantic tagger for the Finnish language

This paper reports on the current status and evaluation of a Finnish semantic tagger (hereafter FST), which was developed in the EU-funded Benedict Project. In this project, we have ported the Lancaster English semantic tagger (USAS) to the Finnish language. We have re-used the existing software architecture of USAS, and applied the same semantic field taxonomy developed for English to Finnish....

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • CoRR

دوره abs/1611.05239  شماره 

صفحات  -

تاریخ انتشار 2016