Parsing the Past - Identification of Verb Constructions in Historical Text

Authors

  • Eva Pettersson
  • Beáta Megyesi
  • Joakim Nivre
Abstract

Even though NLP tools are widely used for contemporary text today, there is a lack of tools that can handle historical documents. Such tools could greatly facilitate the work of researchers dealing with large volumes of historical texts. In this paper we propose a method for extracting verbs and their complements from historical Swedish text, using NLP tools and dictionaries developed for contemporary Swedish and a set of normalisation rules that are applied before tagging and parsing the text. When evaluated on a sample of texts from the period 1550–1880, this method identifies verbs with an F-score of 77.2% and finds a partially or completely correct set of complements for 55.6% of the verbs. Although these results are in general lower than for contemporary Swedish, they are strong enough to make the approach useful for information extraction in historical research. Moreover, the exact match rate for complete verb constructions is in fact higher for historical texts than for contemporary texts (38.7% vs. 30.8%).
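The normalisation step described in the abstract can be pictured as a small set of rewrite rules applied to each token before the text is handed to a tagger and parser built for contemporary Swedish. The following is only a minimal sketch under stated assumptions: the rule list, the function names `normalise_token` and `normalise`, and the example tokens are hypothetical illustrations, not the authors' actual rule set or pipeline.

```python
import re

# Hypothetical spelling-rewrite rules as (compiled pattern, replacement) pairs.
# They are illustrative assumptions, not the rule set used in the paper.
RULES = [
    (re.compile(r"dh"), "d"),    # e.g. "medh"    -> "med"
    (re.compile(r"gh"), "g"),    # e.g. "jagh"    -> "jag"
    (re.compile(r"^hv"), "v"),   # e.g. "hvad"    -> "vad"
    (re.compile(r"fv"), "v"),    # e.g. "skrifva" -> "skriva"
]


def normalise_token(token):
    """Lowercase a token and apply every rewrite rule in order."""
    result = token.lower()  # lowercasing is a simplification made for this sketch
    for pattern, replacement in RULES:
        result = pattern.sub(replacement, result)
    return result


def normalise(tokens):
    """Normalise a list of tokens; the output would then go to the tagger and parser."""
    return [normalise_token(t) for t in tokens]


if __name__ == "__main__":
    print(normalise(["Hvad", "medh", "skrifva"]))
```

Keeping the rules in an ordered list makes it easy to add, remove, or reorder them and to check their effect on tagging accuracy one rule at a time.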

Similar resources

Light Verb Constructions in the SzegedParalellFX English-Hungarian Parallel Corpus

In this paper, we describe the first English–Hungarian parallel corpus annotated for light verb constructions, which contains 14,261 sentence alignment units. Annotation principles and statistical data on the corpus are also provided, and English and Hungarian data are contrasted. On the basis of corpus data, a database containing pairs of English–Hungarian light verb constructions has been cre...

Auxiliary Verbs and Verbal Chains in European Portuguese

This paper describes auxiliary verb constructions in European Portuguese in view of their correct parsing in a fully integrated NLP chain. The paper provides data on these constructions over a large-sized corpus and evaluates the parsing system performance.

The Headedness of Mandarin Chinese Serial Verb Constructions: A Corpus-Based Study

Existing treebanks of Mandarin Chinese such as the Sinica Treebank, the Harbin Institute of Technology Treebank, and the Penn Chinese Treebank, parse Chinese serial verb constructions incorrectly or inconsistently in terms of headedness, i.e. which verb to be assigned with the label of syntactic and/or semantic “head”. Aspectual markers in serial verb constructions can help determine the head o...

Discarding Noise in an Automatically Acquired Lexicon of Support Verb Constructions

We applied data-driven methods to carry out automatic acquisition of Dutch prepositional support verb constructions (SVCs) in corpora (e.g., iets in de gaten houden (“keep an eye on something”)). This paper addresses the question whether linguistic diagnostics help to discard noise from the n-best lists and how to (semi-)automatically apply such linguistic diagnostics to parsed corpora. We show t...

Full-coverage Identification of English Light Verb Constructions

The identification of light verb constructions (LVC) is an important task for several applications. Previous studies focused on some limited set of light verb constructions. Here, we address the full coverage of LVCs. We investigate the performance of different candidate extraction methods on two English full-coverage LVC annotated corpora, where we found that less severe candidate extraction m...

Publication date: 2012