Exploring Portability of Syntactic Information from English to Basque

Authors

  • Eneko Agirre
  • Aitziber Atutxa
  • Koldo Gojenola
  • Kepa Sarasola
Abstract

This paper explores a cross-lingual approach to the PP attachment problem. We built a large dependency database for English based on an automatic parse of the BNC and Reuters (sports and finance sections). The Basque attachment decisions are made based on the occurrence frequency of the translations of the Basque (verb-noun) pairs in the English syntactic database. The results show that with this simple technique it is possible to transfer syntactic information from a language like English in order to make PP attachment decisions in another language, in this case Basque.

Authors listed in alphabetical order.

Introduction & Motivation

This work is part of a broader endeavor within the MEANING project (Rigau et al., 2002), whose goal is to explore the possibility of porting linguistic knowledge acquired in one language to another. This portability issue could be especially relevant for minority languages with few resources, like Basque. Hence the main motivation underlying this experiment is to explore ways to overcome the limitations caused by the lack of resources. If we were able to transfer some of the linguistic knowledge available for English to other languages, we would effectively ease some of the restrictions in these languages (small corpora, lack of hand-annotated corpora, etc.).

Cross-language information transfer is not new; however, most existing work relies on parallel corpora (Hwa et al., 2002), which are difficult to find, especially for less-studied languages. This is one of the reasons that led us to consider comparable corpora, which are easier to obtain.

Another noteworthy aspect is the pair of languages selected for the experiment: English and Basque. Hypothetically, these two languages are linguistically distant enough to make this work extensible to any other language pair.
The following is a short characterization of the most relevant differences between the two languages:

  • English is a head-initial language with SVO word order, while Basque is a head-final, free word order language.
  • English does not show strong morphology, while Basque does.
  • English is not a pro-drop language, while Basque is a three-way pro-drop language.
  • English and Basque do not belong to the same typological family.

We chose the PP attachment problem in order to explore the portability issue. This problem is especially hard for free word order languages like Basque. Our current partial parser makes attachment decisions based on certain rules and heuristics. Our experiment was designed to transfer attachment information from parsed English data and to make the attachment decisions for Basque based on this transferred information.

The basic idea behind the system presented here is that verbs show certain preferences for the nouns they appear with. Therefore, if we have a sentence with two verbs and some noun phrases, one of the verbs will show a higher preference for some of the noun phrases, while the other verb will show a higher preference for the others. We make one assumption beyond this basic idea: that these preferences exist and can, to some extent, be transferred cross-linguistically (Agirre et al., 2003). Note that this is preliminary work, so at this point we aim to keep the system as simple as possible. Thus, a higher co-occurrence of a verb and a noun is taken to indicate a higher preference of that verb for that noun.

The results obtained suggest that cross-language transfer of knowledge acquired from comparable corpora is worth pursuing. Even with very simple machinery, the results seem very promising.

Outline of the method

Our starting point was the Basque parser described in (Aldezabal et al., 2000). This parser uses a unification grammar to build syntactic structures.
Given a sentence, it chunks it into phrases, finds the head of each phrase and then, applying certain rules and heuristics, tries to link those heads to the different verbs in the sentence. To test our attachment system, we selected sentences with two verbs and used the Basque parser to obtain information about the chunks in the sentences. The attachment information provided by the parser is discarded, and only the chunking information is kept. The heads of the noun groups are extracted, and the set of all possible syntactically dependent (verb-noun) pairs is constructed. The goal was to select, for each noun, which of the two verbs it should be attached to.
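
The selection step outlined above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy Basque-English lexicon and the toy counts are all hypothetical, and two details the text does not specify are assumed here, namely that counts are summed over every combination of English translations and that ties are left undecided.

```python
from collections import Counter
from itertools import product


def candidate_pairs(verbs, nouns):
    """Build every possible (verb, noun) pair for one sentence."""
    return list(product(verbs, nouns))


def translated_frequency(pair, lexicon, english_counts):
    """Sum English (verb, noun) counts over all translations of a Basque pair.

    Summing over translation combinations is an assumption of this sketch.
    """
    verb, noun = pair
    return sum(
        english_counts[(en_verb, en_noun)]
        for en_verb in lexicon.get(verb, ())
        for en_noun in lexicon.get(noun, ())
    )


def attach(verbs, nouns, lexicon, english_counts):
    """For each noun, pick the verb whose translated pairs are most frequent."""
    scores = {
        pair: translated_frequency(pair, lexicon, english_counts)
        for pair in candidate_pairs(verbs, nouns)
    }
    decisions = {}
    for noun in nouns:
        best = max(scores[(verb, noun)] for verb in verbs)
        winners = [verb for verb in verbs if scores[(verb, noun)] == best]
        # Ties (including all-zero counts) are left undecided: an assumption.
        decisions[noun] = winners[0] if len(winners) == 1 else None
    return decisions


# Toy data invented for illustration only (not taken from the paper).
lexicon = {
    "erosi": ["buy", "purchase"],
    "ikusi": ["see", "watch"],
    "liburu": ["book"],
    "filma": ["film", "movie"],
}
english_counts = Counter({
    ("buy", "book"): 40,
    ("purchase", "book"): 5,
    ("see", "book"): 8,
    ("watch", "film"): 30,
    ("see", "movie"): 25,
    ("buy", "film"): 3,
})

decisions = attach(["erosi", "ikusi"], ["liburu", "filma"], lexicon, english_counts)
print(decisions)
```

With this toy data, "liburu" is attached to "erosi" and "filma" to "ikusi", because the translated pairs for those combinations have the highest summed counts. A `Counter` is used so that pairs absent from the English database simply count as zero.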

Similar resources

The University of Maryland Senseval-3 system descriptions

For SENSEVAL-3, the University of Maryland (UMD) team focused on two primary issues: the portability of sense disambiguation across languages, and the exploitation of real-world bilingual text as a resource for unsupervised sense tagging. We validated the portability of our supervised disambiguation approach by applying it in seven tasks (English, Basque, Catalan, Chinese, Romanian, Spanish, and...

Syntactic Structures and Rhetorical Functions of Electrical Engineering, Psychiatry, and Linguistics Research Article Titles in English and Persian: A Cross-linguistic and Cross-disciplinary Study

A research article (RA) title is the first and foremost feature that attracts the reader's attention, the feature from which she/he may decide whether the whole article is worth reading. The present study attempted to investigate syntactic structures and rhetorical functions of RA titles written in English and Persian and published in journals in three disciplines of Electrical Engineering, Psy...

The BEST Dataset of Language Proficiency

Researchers investigating processes stemming from or related to multilingualism often face the challenge of correctly characterizing their multilingual samples in terms of language use, proficiency, dominance, and exposure. This is typically done by using a variety of objective (e.g., normed tests) and/or subjective (e.g., self-reports) measures which tend to vary largely across laboratories an...

Chunk and Clause Identification for Basque by Filtering and Ranking with Perceptrons

This paper presents systems for syntactic chunking and clause identification for Basque, combining rule-based grammars with machine-learning techniques. Precisely, we used Filtering-Ranking with Perceptrons (Carreras, Màrquez and Castro, 2005): a learning model that recognizes partial syntactic structures in sentences, obtaining state-of-the-art performance for these tasks in English. This mode...

Contribution of Complex Lexical Information to Solve Syntactic Ambiguity in Basque

In this study, we explore the impact of complex lexical information to solve syntactic ambiguity, including verbal subcategorization in the form of verbal transitivity and verb-noun-case or verb-noun-case-auxiliary relations. The information was obtained from different sources, including a subcategorization dictionary extracted from a Basque corpus, the web as a corpus, an English corpus and a ...

Publication date: 2004