Syntactic Analysis Of Natural Language Using Linguistic Rules And Corpus-Based Patterns

Authors

  • Pasi Tapanainen
  • Timo Järvinen
Abstract

We are concerned with the syntactic annotation of unrestricted text. We combine a rule-based analysis with subsequent exploitation of empirical data. The rule-based surface-syntactic analyser leaves some amount of ambiguity in the output, which is resolved using empirical patterns. We have implemented a system for generating and applying corpus-based patterns. Some patterns describe the main constituents of the sentence and some the local context of each syntactic function. There are several (partly) redundant patterns, and the "pattern" parser selects the analysis of the sentence that matches the strictest possible pattern(s). The system is applied to an experimental corpus. We present the results and discuss possible refinements of the method from a linguistic point of view.

1 INTRODUCTION

We discuss surface-syntactic analysis of running text. Our purpose is to mark each word with a syntactic tag. The tags denote subjects, objects, main verbs, adverbials, etc. They are listed in Appendix A.

Our method is roughly as follows:

  • Assign to each word all the possible syntactic tags.
  • Disambiguate words as much as possible using linguistic information (hand-coded rules). Here we avoid risks; we would rather leave words ambiguous than guess wrong.
  • Use global patterns to form alternative sentence-level readings. Those alternative analyses are selected that match the strictest global pattern. If it does not accept any of the remaining readings, the second strictest pattern is used, and so on.
  • Use local patterns to rank the remaining readings. The local patterns contain possible contexts for syntactic functions. The ranking of the readings depends on the length of the contexts associated with the syntactic functions of the sentence.

We use both linguistic knowledge, represented as rules, and empirical data collected from tagged corpora. We describe a new way to collect information from a tagged corpus and a way to apply it.
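The global-pattern step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the simplified tag names (DET, SUBJ, MAINV, OBJ), the MAIN_TAGS set, the function names, and the representation of a pattern level as a set of tag tuples are all assumptions made for illustration. The caller is assumed to supply the pattern levels already ordered from strictest to loosest.

```python
from itertools import product

# Hypothetical, simplified tag inventory: only these tags count as
# main constituents when a reading is reduced to its skeleton.
MAIN_TAGS = {"SUBJ", "MAINV", "OBJ", "ADVL"}


def readings(ambiguous_sentence):
    """Expand per-word lists of candidate tags into all sentence-level readings."""
    return [tuple(r) for r in product(*ambiguous_sentence)]


def skeleton(reading):
    """Keep only the main-constituent tags of a reading, in order."""
    return tuple(tag for tag in reading if tag in MAIN_TAGS)


def select_by_global_patterns(candidates, pattern_levels):
    """Return the readings accepted by the first (strictest) level that accepts any.

    pattern_levels is a list of sets of skeletons, strictest first.
    If no level accepts a reading, the candidates are returned unchanged,
    so the remaining ambiguity is left in place rather than guessed away.
    """
    for level in pattern_levels:
        accepted = [r for r in candidates if skeleton(r) in level]
        if accepted:
            return accepted
    return list(candidates)


# Example: the second and fourth words are still ambiguous after the rule-based step.
sentence = [["DET"], ["SUBJ", "OBJ"], ["MAINV"], ["SUBJ", "OBJ"]]
strict = {("SUBJ", "MAINV", "OBJ")}
loose = {("OBJ", "MAINV", "SUBJ"), ("SUBJ", "MAINV", "SUBJ")}
selected = select_by_global_patterns(readings(sentence), [strict, loose])
```

In this sketch the looser level is consulted only when the stricter one rejects every remaining reading, which mirrors the fallback order described in the list above.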
In this paper, we are mainly concerned with exploiting the empirical data and combining two different kinds of parsers.

* This work was done when the author worked in the Research Unit for Computational Linguistics at the University of Helsinki.

Our work is based on work done with ENGCG, the Constraint Grammar Parser of English [Karlsson, 1990; Karlsson, 1994; Karlsson et al., 1994; Voutilainen, 1994]. It is a rule-based tagger and surface-syntactic parser that makes a very small number of errors but leaves some words ambiguous, i.e. it prefers ambiguity to guessing wrong. The morphological part-of-speech analyser leaves [Voutilainen et al., 1992] only 0.3 % of all words in running text without the correct analysis, while 3-6 % of words still have two or more¹ analyses. Voutilainen, Heikkilä and Anttila [1992] reported that the syntactic analyser leaves 3-3.5 % of words without the correct syntactic tag, and 15-20 % of words remain ambiguous. Currently, the error rate has been decreased to 2-2.5 % and the ambiguity rate to 15 % by Timo Järvinen [1994], who is responsible for tagging a 200 million word corpus using ENGCG in the Bank of English project.

Although the ENGCG parser works very well in part-of-speech tagging, the syntactic descriptions are still problematic. In the constraint grammar framework, it is quite hard to make linguistic generalisations that can be applied reliably. To resolve the remaining ambiguity we generate, by using a tagged corpus, a knowledge base that contains information about both the general structure of the sentences and the local contexts of the syntactic tags. The general structure contains information about where, for example, subjects, objects and main verbs appear and how they follow one another. It does not pay any attention to their potential modifiers. The modifier-head relations are resolved by using the local context, i.e. by looking at what kinds of words there are in the neighbourhood.
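The idea of resolving readings by local context can be sketched as follows. This is only an illustration under stated assumptions: the excerpt does not define the exact form of a local context, so here a context is assumed to be a symmetric window of neighbouring tags, and a reading is scored by summing, over its words, the widest window attested in the tagged corpus. The function names, the padding symbols and the max_width parameter are hypothetical.

```python
from collections import defaultdict


def collect_contexts(tagged_sentences, max_width=2):
    """Map each tag to the (left, right) neighbour-tag windows seen in the corpus."""
    contexts = defaultdict(set)
    for sent in tagged_sentences:
        # Pad with hypothetical boundary symbols so every word has full windows.
        padded = ["<s>"] * max_width + list(sent) + ["</s>"] * max_width
        for i in range(max_width, len(padded) - max_width):
            for w in range(1, max_width + 1):
                left = tuple(padded[i - w:i])
                right = tuple(padded[i + 1:i + 1 + w])
                contexts[padded[i]].add((left, right))
    return contexts


def score(reading, contexts, max_width=2):
    """Sum, over all words, the widest window attested for the word's tag."""
    padded = ["<s>"] * max_width + list(reading) + ["</s>"] * max_width
    total = 0
    for i in range(max_width, len(padded) - max_width):
        for w in range(max_width, 0, -1):  # try the widest window first
            key = (tuple(padded[i - w:i]), tuple(padded[i + 1:i + 1 + w]))
            if key in contexts.get(padded[i], ()):
                total += w
                break
    return total


def rank(candidates, contexts, max_width=2):
    """Order readings so that those supported by longer contexts come first."""
    return sorted(candidates, key=lambda r: score(r, contexts, max_width), reverse=True)


# Example with a one-sentence "corpus" of hypothetical, simplified tags.
corpus = [("DET", "SUBJ", "MAINV", "DET", "OBJ")]
known = collect_contexts(corpus)
reading_a = ("DET", "SUBJ", "MAINV", "DET", "OBJ")
reading_b = ("DET", "OBJ", "MAINV", "DET", "SUBJ")
ranked = rank([reading_b, reading_a], known)
```

Preferring wider attested windows is one simple way to let longer contexts dominate the ranking; other weightings are equally possible.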
The method is robust in the sense that it is able to handle very large corpora. Although rule-based parsers usually perform slowly, that is not the case with ENGCG. With the English grammar, the Constraint Grammar Parser implementation by Pasi Tapanainen analyses 400 words² per second on a SparcStation 10/30. That is, one million words are processed in about 40 minutes. The pattern parser for empirical patterns runs somewhat slower, about 100 words per second.

¹ But even then some of the original alternative analyses are removed.

² Including all steps of preprocessing, morphological analysis, disambiguation and syntactic analysis. The speed of morphological disambiguation alone exceeds 1000 words per second.

Similar resources

Cultural Influence on the Expression of Cathartic Conceptualization in English and Spanish: A Corpus-Based Analysis

This paper investigates the conceptualization of emotional release from a cognitive linguistics perspective (Cognitive Metaphor Theory). The metaphor "weeping is a means of liberating contained emotions" is grounded in universal embodied cognition and is reflected in linguistic expressions in English and Spanish. Lexicalization patterns which encapsulate this conceptualization i...

Developing a hybrid NP parser

We describe the use of energy function optimisation in very shallow syntactic parsing. The approach can use linguistic rules and corpus-based statistics, so the strengths of both linguistic and statistical approaches to NLP can be combined in a single framework. The rules are contextual constraints for resolving syntactic ambiguities expressed as alternative tags, and the statistical language m...

English and Persian Sport Newspaper Headlines: A comparative study of linguistic means

Using rhetorical figures in specialized languages like the language of newspaper headlines is common. The present study attempted to conduct a contrastive analysis of the English and Persian sport newspaper headlines related to the 2014 FIFA World Cup. Toward this end, a corpus consisting of 400 English and 400 Persian headlines published from 12 June to 13 July 2014 was c...

Generating a Linguistic Model for Requirement Quality Analysis

In this work, we aim at identifying potential problems of ambiguity, completeness, conformity, singularity and readability in system and software requirements specifications. Those problems arise particularly when the specifications are written in natural language. We describe them from a linguistic point of view, but the business impacts of each potential error will be considered in system engineering context ...

A Corpus-Driven Study of the Variation of Co-Occurrence Patterns in Written and Spoken Registers

This paper will focus on the study of the variation of co-occurrence patterns encountered in written and spoken registers, through the analysis of a large lexical database of corpus-extracted multiword expressions (MWEs) of European Portuguese. Those MWEs were automatically extracted from a balanced 50 million word written corpus and a 1 million word spoken corpus, furthermore statistically int...

Journal title:

Volume   Issue

Pages  -

Publication date: 1994