Disambiguation Tools for NooJ

Author

  • Max Silberztein
Abstract

When NooJ performs an automatic lexical analysis of corpora, it recognizes five types of atomic linguistic units (ALUs) and represents them as annotations stored inside each text's annotation structure (TAS). Unfortunately, the massive level of ambiguity generated by each of the five corresponding parsers produces a TAS far too heavy for most corpus-linguistics applications; as a consequence, most users' queries return too many incorrect results. To provide a working solution to this behavior of NooJ's lexical parser, we have implemented a new set of tools specifically designed to deal with unwanted ambiguities in corpora and texts: automatic and semi-automatic tools, as well as manual access to edit the TAS.

Introduction

With NooJ, linguists can represent five types of Atomic Linguistic Units (ALUs)¹:

  • affixes, morphemes and components of contracted words, e.g. dis-, -ization, cannot
  • simple words and their morphological variants, e.g. a laugh, to laugh, laughed, laughable
  • multi-word units, semi-frozen terms and their variants, e.g. as a matter of fact, a nuclear submarine
  • local syntactic units, e.g. complex determiners and dates, such as Most of my groups of, Monday June 5th in the early afternoon
  • discontinuous expressions such as collocations, support-verb constructions and phrasal verbs, e.g. to take ... into account, to give ... in

Accordingly, NooJ provides tools to describe these five types of ALUs and to recognize them in corpora automatically: dictionaries, morphological Finite-State Transducers (FSTs), Recursive Transition Networks (RTNs), as well as the ability to link dictionaries and syntactic grammars to formalize lexicon-grammar tables². Each level of analysis constitutes an autonomous module, and NooJ processes the modules one after the other, in cascade. Therefore, each module must produce a result with 100% recall, so that the subsequent module receives an input that lacks no potential linguistic hypothesis.
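The cascade constraint described above can be sketched in a few lines. This is a toy illustration under assumptions, not NooJ's actual code: the module names, the hypothesis format and the example analyses are invented for the sketch. The key point it demonstrates is that each module may only add hypotheses to the shared pool, never remove any, which is what preserves 100% recall across the cascade.

```python
# Toy cascade: each module returns new hypotheses; the pool only grows.
# (Invented module names and formats; not NooJ's implementation.)

def morphology(text, hypotheses):
    # Toy morphological parser: proposes a prefix analysis of "recollect".
    return [("recollect", "re- + collect,V")] if "recollect" in text else []

def dictionary(text, hypotheses):
    # Toy dictionary lookup: proposes a simple-verb analysis of the same form.
    return [("recollect", "recollect,V")] if "recollect" in text else []

def cascade(text, modules):
    hypotheses = []
    for module in modules:
        # Add only, never delete: improbable hypotheses survive so that
        # later (syntactic, semantic) parsers can still choose them.
        hypotheses.extend(module(text, hypotheses))
    return hypotheses

print(cascade("John recollects the old coins", [morphology, dictionary]))
```

Both analyses of recollect reach the output, which is exactly the behavior the paper argues for: the cost in accuracy is paid deliberately in exchange for full recall.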
The cost of this approach is that the accuracy of each module is very low: all potential linguistic hypotheses must be produced and transmitted to the next module, however improbable they are. Moreover, the accuracy of a lexical parser drops as the precision of its linguistic data increases: for instance, a simple parser might process the word form will as two-way ambiguous (noun or verb), whereas a more sophisticated parser will also distinguish the different meanings of the noun will (a mental faculty, a legal declaration) as well as of the verb (used to introduce the future, or synonymous with to wish). Clearly, the more sophisticated lexical parser produces a higher level of ambiguity than the simple one, and it takes a more sophisticated linguistic analysis to resolve the extra ambiguities.

An ideally tagged text

Because any linguistic analysis must process all types of ALUs (not only simple words), an ideal tagger should produce a result that looks like the following:

  Battle-tested/A Japanese/A industrial managers/N here/ADV always/ADV buck up/V nervous/A newcomers/N with/PREP the/DET tale/N of/PREP the first of their/N countrymen/N to/PREP visit/V Mexico/LOC, a boatload of/DET samurai warriors/N blown ashore/VPP 375 years ago/DATE. From the beginning/DATE, it took/EXP1 a/DET man/N with/PREP extraordinary/A qualities/N to/EXP1 succeed/V in/PREP Mexico/LOC, says/V Kimihide Takimura/NPR, president/N of/PREP Mitsui/NPR group's/N Kensetsu Engineering Inc./ORG unit/N.

For instance, it is crucial to tag the multi-word noun industrial managers as one unit, as opposed to producing the two tags "industrial/A" and "managers/N".

¹ See the NooJ manual, which is updated regularly (Silberztein 2002).
² See (Vietri 2008) for examples of lexicon-grammar tables and their formalization in NooJ.

(hal-00498045, version 1, 6 Jul 2010)
In most cases, the semantic analysis of a sequence "industrial N" is:

  industrial N = N produced by industrial methods

This productive analysis can successfully be applied to a large number of nouns, as in the following examples:

  industrial cheese = cheese produced by industrial methods
  industrial food = food produced by industrial methods
  industrial cloth = cloth produced by industrial methods
  etc.

But this productive analysis does not apply to industrial manager. To translate this sequence correctly into French, one has to translate it as a whole, and not word by word: a correct French translation would be "patron de PME", whereas "gestionnaire industriel" sounds odd at the very least, and is never used.

If it were possible to tag texts correctly automatically, i.e. to annotate all types of linguistic units (and not only simple words) and to produce a 100% correct result, then the example above would constitute a good way to represent the result of the lexical analysis of texts. Unfortunately, any lexical parser that aims to represent all types of linguistic units while producing a result with 100% recall will produce a high degree of ambiguity, because it is not always possible to remove all ambiguities at the lexical level. For instance, consider the following text:

  ... There is a round table in room A32 ...

The only way an automatic parser could (maybe!) reliably choose between the analysis "round table = meeting" and the analysis "round table = round piece of furniture" is to perform some complex discourse analysis that takes a larger context into account. It is impossible to choose between these two solutions at the lexical level, i.e. at the level at which taggers and lexical parsers operate.
Taggers designed to remove ambiguities at any cost, even when it is impossible to do so reliably, simply ignore multi-word units and use probabilistic or other heuristics to "flip coins"; they therefore produce results that are useless for many precise NLP applications. Ambiguities are generated at each of the five levels of lexical analysis. For instance:

  • the morphological parser analyzes the word form recollect as the prefix re- followed by the verb collect (meaning: to collect again), whereas the dictionary lookup analyzes it as a simple verb (meaning: to control oneself). It would take a sophisticated semantic analysis (at least) to choose between the two solutions in the following text:

      John recollects the old coins

    Does he remember them, or did he decide to take his collecting hobby back up again?

  • the multi-word recognizer analyzes the sequence acid rock as a noun (a style of rock music), whereas the simple-word recognizer analyzes it as a sequence of an adjective followed by a noun.

  • there are several phrasal-verb entries for to back up, depending on the distributional properties of their object complements. How could a lexical parser choose between these entries without a distributional analysis of the complement, and without a reference analysis when the complement is implicit? For instance:

      John backed his car up → John backed it up (he drove in reverse)
      John backed Mary's statement up → John backed it up (he supported her statement)
      John backed Mary's idea up → John backed it up (he proved her idea right)
      John backed his computer up → John backed it up (he saved its files)

In conclusion: it is impossible to build an automatic lexical parser that disambiguates all the linguistic units that occur in texts. Automatic taggers, which aim at this very goal, simply cannot be relied upon.
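The back up examples above can be made concrete with a small sketch of what a distributional analysis would have to do. This is a hypothetical illustration, not NooJ's lexicon format: the entry table, the semantic classes and the function name are all invented. It selects a phrasal-verb entry from the semantic class of the complement's head noun, and returns nothing when the complement is a pronoun, the very case the paper argues requires a reference analysis beyond the lexical level.

```python
from typing import Optional

# Hypothetical entries for "to back up", keyed by the semantic class
# required of the object complement (invented classes and glosses).
BACK_UP_ENTRIES = {
    "vehicle":   "to drive in reverse",
    "statement": "to support",
    "idea":      "to prove right",
    "computer":  "to save the files of",
}

# Toy distributional classes for a few head nouns; a real parser would
# need wide-coverage distributional data here.
SEMANTIC_CLASS = {
    "car": "vehicle",
    "statement": "statement",
    "idea": "idea",
    "computer": "computer",
}

def disambiguate_back_up(complement_head: str) -> Optional[str]:
    """Return the reading selected by the complement's semantic class,
    or None when the complement is implicit or pronominal (e.g. "it"),
    which a purely lexical parser cannot resolve."""
    cls = SEMANTIC_CLASS.get(complement_head)
    return BACK_UP_ENTRIES.get(cls)

print(disambiguate_back_up("car"))  # selects the "drive in reverse" entry
print(disambiguate_back_up("it"))   # None: a pronoun needs reference analysis
```

The failure on "it" is the point: once the complement is implicit or pronominal, no lookup table can decide, and the ambiguity must be kept and passed on.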
The only way to build a reliable lexical parser is to allow it to represent unsolvable lexical ambiguities. These ambiguities then have to be passed to subsequent parsers, which use syntactic, semantic and/or discourse-analysis techniques to resolve them (though not in all cases, because texts do contain genuinely ambiguous sentences!).

NooJ's Text Annotation Structure (TAS)

NooJ's lexical parser uses parallel annotations, rather than linear tags, to represent the lexical analyses of texts³. Annotations have two advantages over tags:

  • they can represent all types of linguistic units, such as affixes (inside word forms), multi-word units and discontinuous linguistic units⁴;
  • they can be stacked together and can therefore represent lexical ambiguities.

For instance, consider the following sentence:

  He cannot take the round table into account

In this sentence, the word form cannot corresponds to a sequence of two linguistic units; the sequence round table is ambiguous; and the linguistic unit take into account is discontinuous. The TAS's main benefit is that no potential linguistic unit is left out: for instance, if the next sentence in the text is "we will have to postpone it", a semantic parser can infer that round table refers to a meeting, not to a piece of furniture. Conversely, if the next sentence is "we will have to move it closer to the window", the semantic parser can infer that the round table is a piece of furniture. Both solutions remain alive and ready to be chosen. Another advantage of the TAS is that it unifies all types of linguistic units.

³ See (Silberztein 2006) for a description of NooJ's annotation engine, and (Silberztein 2007) for linguistic applications of the Text Annotation Structure.
⁴ See in particular how discontinuous expressions are represented in the Text Annotation Structure in (Silberztein 2008).
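The three properties of annotations discussed above (units inside word forms, stacked ambiguous readings, discontinuous units) can be sketched with a minimal data structure. This is an illustrative sketch only, assuming a simplified format; it is not NooJ's internal representation, and the `Annotation` class and lexical codes are invented for the example.

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    spans: list[tuple[int, int]]  # one (start, length) pair per text piece
    info: str                     # lexical information, e.g. "table,N"

text = "He cannot take the round table into account"

tas = [
    Annotation([(3, 3)], "can,V"),        # two units inside the single
    Annotation([(6, 3)], "not,ADV"),      # word form "cannot"
    Annotation([(19, 11)], "round table,N+Meeting"),  # stacked readings:
    Annotation([(19, 5)], "round,A"),                 # the ambiguity is
    Annotation([(25, 5)], "table,N+Furniture"),       # kept, not resolved
    Annotation([(10, 4), (31, 12)], "take into account,V"),  # discontinuous
]

# All readings starting at a given position stay available to later parsers:
readings_at_19 = [a.info for a in tas if a.spans[0][0] == 19]
print(readings_at_19)  # both the meeting and the furniture analyses survive
```

Because annotations carry positions rather than replacing the text, the "meeting" and "furniture" readings coexist until a semantic or discourse parser has enough context to discard one of them.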
From now on, any subsequent parser will process annotations rather than affixes, simple or multi-word units and discontinuous expressions. For instance, the following NooJ query:




Publication date: 2010