Formal Representation of Language Structures

Authors

Jan Hajič

Eva Hajičová

Alexandr Rosen

Abstract

Building treebanks is a prerequisite for various experiments and research tasks in the area of NLP. Under a recently awarded grant, we are developing (i) a formal definition of a (dependency-based) tree, and (ii) a midsize treebank based on this definition. The annotated corpus is designed to have three layers: morphosyntactic (linear) tagging, syntactic dependency annotation, and tectogrammatical annotation. The project is being carried out jointly at the authors' institutes.

Footnote 1: Grant of the Grant Agency of the Czech Republic No. 405/96/0198, which has now become an integral part of a newly awarded long-term grant of the same agency, No. 405/96/K214.

1 The Current State and Motivation

Recent decades have seen a shift towards expressing linguistic knowledge in ways which allow its verification and processing by formal means. Tools originating in mathematics, logic and computer science have been applied to human language to model its structure and functioning. Various aspects of different languages are being described within formally defined frameworks proposed by a number of interacting linguistic theories. The proposals deal with various levels of linguistic description, from the level of sounds (phonetics) up to the level of meaning. Partial grammars and lexicons now exist for many languages within various formal frameworks, and collections of linguistic analyses of text and speech are being accumulated for use both in theoretical research and in applications.

Besides approaches relying on symbolic means and 'rationalist' efforts, which result in language models consisting of grammar rules and lexical entries, alternative methods employ statistics computed from input text or its analysis to produce a stochastic model. However, a common and crucial issue cutting across all types of enterprise in this domain is the need to adopt or design an adequate formal representation of language structures in order to accommodate relevant linguistic knowledge in its relation to the actual language data.

There are a number of tasks which typically require a soundly defined formal representation of language structures:

1. analysis (parsing) of input text or speech into a representation, tagging of text or speech collections;
2. synthesis (generation) of output text or speech from a representation;
3. mapping of one representation onto another, i.e. transfer (typically in machine translation systems).

These are the elementary tasks which form part of many natural language processing applications, some of which are listed below:

• machine translation systems;
• natural language interfaces to knowledge bases, question answering systems;
• automatic abstracting and knowledge acquisition systems;
• automatic acquisition of linguistic data and its integration into a language model.

When a linguistic description is implemented on computers, the usual goal is to parse sentences and produce representations of their analyses, thereby verifying the framework, the linguistic theory and the description itself. Another way to obtain a (morphological and syntactic) analysis of sentences is to apply statistical methods to large samples of (already analyzed) texts in order to process a new text afterwards, performing some degree of linguistic analysis on the basis of the data acquired in the 'learning' phase. Both kinds of effort converge, and their increasing potential is reflected in the growing amount of text and speech data analyzed to varying degrees for various purposes.

Formal representations of language structures which have been proposed by different linguistic theories and/or used in natural language processing applications reflect their context in many respects, and suitable candidates for an intended more general use are difficult to find.
This is due to various aspects of their design, such as (i) specific theoretical commitment, (ii) limited expressive power in partial coverage of language phenomena and restriction to certain levels of linguistic analysis, (iii) difficulties in expressing relationships between different levels of analysis, (iv) hard-wired reliance on some characteristics of a certain language or language group and the resulting difficulty in adapting the framework to a typologically different language, and, finally, (v) application-specificity. Thus, it is difficult to express a full-fledged syntactic analysis of a 'free word-order' language by means of the word-class labels and constituent brackets used for tagging (mostly English) texts.

Although it is not likely that a single framework could become a universally accepted vehicle of linguistic knowledge, we believe that a higher degree of generality and flexibility can be achieved for the benefit of both theoretical studies and application-oriented projects.

2 Characteristics of a Satisfactory Solution

From the conceptual point of view, an adequate design of formal representation should be able to express linguistic facts related to the following levels of description:

1. level of phonetics, phonology, graphemics: specification of phonemes, stress and prosodic patterns, etc.;
2. level of morphology: morphemes, morphological categories;
3. level of syntax: syntactic categories, syntactic structure (trees);
4. level of (linguistic) meaning: disambiguation of lexical meaning, specification of underlying structure and function, communicative dynamism and topic-focus articulation, anaphora resolution.

There are several important features that should be reflected in the design to make it really useful:

• It should be possible to describe a language structure in all its aspects simultaneously, i.e., to relate facts from all levels of linguistic analysis in a straightforward fashion. At the same time, the design should permit access to specific aspects of the description without other aspects intervening. Thus, a user interested only in syntactic structure should be able to filter out any other information.
• If a certain aspect of linguistic description can be structured and viewed differently depending on theoretical commitments, the design should provide an option to derive the desired way of presenting the linguistic facts from a common representation. Thus, both phrase-structure and dependency trees could be derived from the description.
• The design should be capable of accommodating typologically different languages without substantial modifications. In particular, it should provide space for stating the relation between word-order variations and higher levels, and for the interplay between morphology and syntax in the case of complex expressions.
• A related requirement concerns the possibility of expressing links between parallel structures and their analyses in different languages. This feature is important if parallel bi- or multilingual data are to be analyzed and studied as contrastive language structures.
• The design should provide space for as few or as many linguistic facts concerning a language structure as it is possible or practical to collect or express. This feature would permit the integration of text or speech samples with their analyses in a stepwise fashion, possibly starting with a bare text/speech string and leaving some levels unspecified.
• It should be possible to represent at least some linguistic facts in an underspecified form. Wherever possible, an option to use a quantitative measure should accompany such cases. Disjunctions restricted to local domains, underspecified descriptions and weights could be the means to meet this requirement.
• The formal representation should be convertible to another format, as required by an application or desired by another specification covering compatible conceptual issues.
• The design should be flexible in the sense that it should contain as few inherent restrictions on its extension and modification as possible.

3 Background, Methods and Problems

Without attempting to preview the results, the following points can be made to sketch the starting point, the outlines of the goal, and the path towards its achievement:

1. The project will be able to profit from theoretical results and practical experience gained in the field of formal description of natural language at our sites. The fruitful results concerning word-order variations and their relation to meaning, as well as the richness of syntactic studies based on a dependency-oriented model, both widely acknowledged and faithful to the high standards of the Prague School linguistic tradition, provide a wealth of stimulating material. At both sites, a number of application-oriented research projects have, at least in some respects, been tackling the problems of an adequate representation of language structures. The projects include machine translation, natural language interfaces to knowledge bases, automatic abstracting, automatic knowledge acquisition from texts and grammar checking.
2. The smallest piece of information (typically, a linguistic category) is expressed as an attribute and its value (i.e., a 'feature'). A collection (conjunction) of such pairs is used to describe a linguistic object (typically an aspect of the linguistic description of a word or a collocation), allowing for partial information (underspecification) and entering into more complex structures, where some attribute values are not atoms but structures. Through the recursive nature of such a representation, linguistic structures of arbitrary complexity can be described. Two or more attributes can share a single value, which is a possible way to implement relations between linguistic facts at different levels of description. As structures of this type have become a kind of standard in modern linguistic research, the issues of compatibility with other approaches will be substantially simplified on many levels.
3. The design will be tested by its application to language data in at least two typologically different languages. A sample of bilingual parallel text data will be provided to test the parallel link option between analyses of linguistic structures.

There are a few challenging issues which call for an inventive solution:

• The relation between the surface string of graphemes/phonemes, the hierarchical syntactic structure and the ordering of meaning-bearing elements according to their degrees of communicative dynamism is far from straightforward. This concerns especially cases of crossing dependencies (non-projective structures). If the representation is to accommodate descriptions on all levels in an integral form, a non-trivial solution has to be found.
• Complex expressions like idioms, compound words and morphological categories realized by discontinuous sequences of auxiliary words present another problem of a similar kind.
• The integration of all kinds of linguistic knowledge in a single formal framework capable of application to the widest range of language structures is a unique enterprise. Disregarding the undoubtedly immense practical profit for a moment, the project will probably bring the most precious theoretical fruit precisely in this domain.
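
The attribute-value representation described in point 2 of Section 3 can be illustrated with a minimal sketch. It is not the project's actual format: the class `FeatureStructure`, its methods `get_path` and `project`, and all attribute names and values (`morph`, `syntax`, `agr`, and so on) are hypothetical names chosen for illustration. The sketch shows nested (recursive) values, partial information, one value shared by two attributes, and filtering a description down to a single level.

```python
from typing import Dict, Union

# A value is either an atom (a string) or a nested feature structure.
FeatureValue = Union[str, "FeatureStructure"]


class FeatureStructure:
    """A conjunction of attribute-value pairs (hypothetical illustration)."""

    def __init__(self, **features: FeatureValue) -> None:
        self.features: Dict[str, FeatureValue] = dict(features)

    def get_path(self, *path: str):
        """Follow a sequence of attributes; return None if unspecified."""
        node = self
        for attr in path:
            if not isinstance(node, FeatureStructure) or attr not in node.features:
                return None  # partial information: the path is not specified
            node = node.features[attr]
        return node

    def project(self, *attrs: str) -> "FeatureStructure":
        """Keep only the named top-level attributes (e.g. one level of description)."""
        kept = {a: v for a, v in self.features.items() if a in attrs}
        return FeatureStructure(**kept)


# One agreement structure, with gender left unspecified for now.
agreement = FeatureStructure(number="sg", person="3")

# Subject and verb both point to the same agreement object (a shared value).
subject = FeatureStructure(
    morph=FeatureStructure(pos="noun", case="nom", agr=agreement),
    syntax=FeatureStructure(function="subject"),
)
verb = FeatureStructure(
    morph=FeatureStructure(pos="verb", tense="pres", agr=agreement),
    syntax=FeatureStructure(function="predicate"),
)

# Adding information to the shared value is visible through both attributes.
agreement.features["gender"] = "fem"
print(subject.get_path("morph", "agr", "gender"))  # fem
print(verb.get_path("morph", "agr", "gender"))     # fem

# A level that was never annotated simply yields None.
print(subject.get_path("meaning"))                 # None

# A user interested only in syntax can filter out everything else.
syntax_only = subject.project("syntax")
print(list(syntax_only.features))                  # ['syntax']
```

Sharing is modelled here by object identity, which is only one of several possible design choices; an explicit index or pointer notation would serve equally well when the structures have to be serialized or converted to another format.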

Publication date: 2000
