Simple Unsupervised Morphology Analysis Algorithm (SUMAA)
Abstract
SUMAA is a hybrid algorithm based on letter successor varieties for entirely unsupervised morphological analysis. Using language pattern and structural recognition, it works well on both isolating and agglutinative languages. This paper gives a detailed account of how we developed SUMAA. The F-measures achieved by SUMAA (Morpho Challenge, 2005) for the English, Finnish and Turkish datasets were 51.83%, 60.18% and 55.94%, respectively.
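The abstract names letter successor varieties as the basis of the algorithm but does not spell out the procedure. As a rough illustration only, the following Python sketch shows the classic successor-variety idea: count how many distinct letters follow each word prefix in a corpus, and treat a prefix with many possible continuations as a likely morpheme boundary. This is not SUMAA itself; the function names, the toy corpus, the end-of-word marker `#` and the cut rule with `threshold=2` are all assumptions made for the example.

```python
from collections import defaultdict


def successor_varieties(words):
    """Map each prefix to the set of distinct symbols that follow it in the corpus."""
    followers = defaultdict(set)
    for word in words:
        for i in range(len(word)):
            followers[word[:i]].add(word[i])
        # Assumption: the end of a word counts as one more possible successor.
        followers[word].add("#")
    return followers


def segment(word, followers, threshold=2):
    """Cut after every proper prefix whose successor variety is >= threshold.

    The fixed threshold is a hypothetical cut rule chosen for illustration.
    """
    pieces, start = [], 0
    for i in range(1, len(word)):
        if len(followers.get(word[:i], ())) >= threshold:
            pieces.append(word[start:i])
            start = i
    pieces.append(word[start:])
    return pieces


# Toy corpus, invented for this example.
corpus = ["walk", "walks", "walked", "walking", "talk", "talks", "talked"]
followers = successor_varieties(corpus)

print(segment("walking", followers))  # ['walk', 'ing']
print(segment("talked", followers))   # ['talk', 'ed']
```

On this toy corpus the prefix "walk" can be followed by "s", "e", "i" or the end of the word, so its successor variety is 4 and a boundary is proposed there, while prefixes such as "wal" have only one continuation and are left intact.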
Related resources
Unsupervised Concept Discovery In Hebrew Using Simple Unsupervised Word Prefix Segmentation for Hebrew and Arabic
Fully unsupervised pattern-based methods for discovery of word categories have been proven to be useful in several languages. The majority of these methods rely on the existence of function words as separate text units. However, in morphology-rich languages, in particular Semitic languages such as Hebrew and Arabic, the equivalents of such function words are usually written as morphemes attache...
Statistical Stemming for Kannada
Stemming is a process that groups morphologically related words into the same class and is widely used in information retrieval for improving recall rate. Here we study a set of statistical stemmers for Kannada, a resource-poor language with highly inflectional and agglutinative morphology. We compare stemming using simple truncation, clustering and an unsupervised morpheme segmentation algorit...
Unsupervised Learning of Morphology by using Syntactic Categories
This paper presents a method for unsupervised learning of morphology that exploits the syntactic categories of words. Previous research [4][12] on learning of morphology and syntax has shown that both kinds of knowledge affect each other making it possible to use one type of knowledge to help the other. In this work, we make use of syntactic information i.e. Part-of-Speech (PoS) tags of words t...
Unsupervised Learning of Naïve Morphology with Genetic Algorithms
The morphological lexicon is an important part of NLP systems, which is typically hand-written with the help of linguist experts. Even a partial automation of this process could decrease the cost of the lexicon, being of theoretical importance for languages and dialects which have not been well analysed yet. In this work we describe an attempt to use the minimal description length (MDL) as the one ...
Inducing the Morphological Lexicon of a Natural Language from Unannotated Text
This work presents an algorithm for the unsupervised learning, or induction, of a simple morphology of a natural language. A probabilistic maximum a posteriori model is utilized, which builds hierarchical representations for a set of morphs, which are morpheme-like units discovered from unannotated text corpora. The induced morph lexicon stores parameters related to both the “meaning” and “form...