Inducing the Morphological Lexicon of a Natural Language from Unannotated Text
Abstract
This work presents an algorithm for the unsupervised learning, or induction, of a simple morphology of a natural language. It uses a probabilistic maximum a posteriori model that builds hierarchical representations for a set of morphs, morpheme-like units discovered from unannotated text corpora. The induced morph lexicon stores parameters related to both the “meaning” and the “form” of the morphs it contains, and these parameters affect the role of the morphs in words. The model is applied to unsupervised morpheme segmentation of Finnish and English words. Very good results are obtained for Finnish, and almost as good results for English.
Similar resources
Acquisition of Large Scale Categorial Grammar Lexicons
A system is presented for inducing Categorial Grammar (CG) lexicons for natural language from either unannotated or minimally annotated corpora extracted from the Penn Treebank. A combination of symbolic and stochastic methods has been used to build a computationally effective and psychologically plausible system, which learns linguistically useful lexicons. There are a variety of parameters in...
Poor Man’s Word-Segmentation: Unsupervised Morphological Analysis for Indonesian
We present a partially new, fully unsupervised algorithm for morphological segmentation of an arbitrary natural language with only one-slot concatenative morphology. The behaviour of the algorithm is examined in detail for Indonesian as it is a good approximation of such a language. The underlying theory makes no assumptions on whether the language is prefixing or suffixing, or whether affixes ar...
An Empirical Approach to Conceptual Case Frame Acquisition
Conceptual natural language processing systems usually rely on case frame instantiation to recognize events and role objects in text. But generating a good set of case frames for a domain is time-consuming, tedious, and prone to errors of omission. We have developed a corpus-based algorithm for acquiring conceptual case frames empirically from unannotated text. Our algorithm builds on previous r...
Expanding lexicons by inducing paradigms and validating attested forms
One of the bottlenecks in Natural Language Processing for a given language is creating a lexicon that covers the language. The morphological lexicon provides two important pieces of information for NLP applications: 1) the normalization of a word, its lemmatization, which allows the application to recognize two variants of the same word; and 2) the part-of-speech roles that the word can play, w...
Automatically Extending the Lexicon for Parsing
This paper describes a method for automatically extending the lexicon of wide-coverage parsers. The method is an extension to the automatic detection of coverage problems of natural language parsers, based on large amounts of raw text (van Noord 2004). The goal is to extend grammar coverage, focusing in particular on the acquisition of lexical information for missing and incomplete lexicon entr...