Rule-based Approach to Korean Morphological Disambiguation Supported by Statistical Method
Abstract
Korean, as an agglutinative language, poses its own kinds of difficulty for morphological disambiguation: a large share of its ambiguities arise from stemming, whereas most ambiguities in French or English concern the categorization of a morpheme. Current Korean morphological disambiguation systems mainly adopt statistical methods, and some of them apply rules in post-processing. In our approach, the morphological analyzer reduces the number of candidate morpheme strings by using adjacency conditions when it analyzes a word into morpheme strings. Disambiguation then relies on rules and statistics in succession. For the rules, partial parsing with finite-state automata decides the compatibility of each pair of words: a negative value is assigned if a word cannot co-occur with another word, and a positive value is given if they are compatible. After applying all the rules related to the word, our system keeps only the positively valued strings. When several strings still have the same value, their priority in the context is decided by statistics in the next stage. The accuracy of our approach as a Korean tagging system is about 97.1%, and it may yield better results than existing Korean morphological disambiguation systems.
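The rule-then-statistics selection described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: the names `disambiguate` and `adnominal_needs_noun`, the romanized morphemes and tags, the toy frequency counts, the fallback used when no candidate is positively valued, and the choice to let statistics break ties only among the highest-scoring candidates. A plain Python function also stands in for the finite-state partial parser.

```python
from typing import Callable, Dict, Sequence, Tuple

# One candidate morpheme string for a word, e.g. ("nal/VV", "neun/ETM").
Analysis = Tuple[str, ...]
# A rule compares a candidate with a neighboring word's analysis and
# returns +1 (compatible), -1 (cannot co-occur) or 0 (rule does not apply).
Rule = Callable[[Analysis, Analysis], int]


def adnominal_needs_noun(candidate: Analysis, neighbor: Analysis) -> int:
    """Hypothetical rule: an adnominal ending must be followed by a noun."""
    if not candidate[-1].endswith("/ETM"):
        return 0
    next_tag = neighbor[0].split("/")[1]
    return 1 if next_tag.startswith("N") else -1


def disambiguate(
    candidates: Sequence[Analysis],
    neighbor: Analysis,
    rules: Sequence[Rule],
    freq: Dict[Analysis, int],
) -> Analysis:
    """Keep positively valued candidates, then break ties by frequency."""
    scored = [(sum(r(c, neighbor) for r in rules), c) for c in candidates]
    positive = [(s, c) for s, c in scored if s > 0]
    # Assumption: if no candidate is positive, fall back to all candidates.
    pool = positive or scored
    best = max(s for s, _ in pool)
    tied = [c for s, c in pool if s == best]
    # Statistical stage: prefer the analysis seen most often (toy counts).
    return max(tied, key=lambda c: freq.get(c, 0))


# Toy data: three analyses of one ambiguous word, followed by a noun.
candidates = [
    ("na/NP", "neun/JX"),
    ("nal/VV", "neun/ETM"),
    ("na/VV", "neun/ETM"),
]
toy_freq = {candidates[0]: 20, candidates[1]: 12, candidates[2]: 3}
chosen = disambiguate(candidates, ("sae/NNG",), [adnominal_needs_noun], toy_freq)
```

In this toy run the rule stage leaves the two verb analyses tied with a positive value, and the frequency table decides between them, so the pronoun analysis is not selected even though it has the highest count.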
Similar resources
TAKTAG: Two-phase learning method for hybrid statistical/rule-based part-of-speech disambiguation
Both statistical and rule-based approaches to part-of-speech (POS) disambiguation have their own advantages and limitations. Especially for Korean, the narrow windows provided by a hidden Markov model (HMM) cannot cover the lexical and long-distance dependencies necessary for POS disambiguation. On the other hand, the rule-based approaches are not accurate and flexible to new tag-sets and language...
Generalized unknown morpheme guessing for hybrid POS tagging of Korean
Most errors in Korean morphological analysis and POS (part-of-speech) tagging are caused by unknown morphemes. This paper presents a generalized unknown morpheme handling method with POSTAG (POStech TAGger), which is a statistical/rule-based hybrid POS tagging system. The generalized unknown morpheme guessing is based on a combination of a morpheme pattern dictionary which encodes general le...
SHAKKIL: An Automatic Diacritization System for Modern Standard Arabic Texts
This paper sheds light on a system that would be able to diacritize Arabic texts automatically (SHAKKIL). In this system, the diacritization problem will be handled through two levels; morphological and syntactic processing levels. The adopted morphological disambiguation algorithm depends on four layers; Uni-morphological form layer, rule-based morphological disambiguation layer, statistical-b...
Combining Hand-crafted Rules and Unsupervised Learning in Constraint-based Morphological Disambiguation
This paper presents a constraint-based morphological disambiguation approach that is applicable to languages with complex morphology, specifically agglutinative languages with productive inflectional and derivational morphological phenomena. In certain respects, our approach has been motivated by Brill's recent work (Brill, 1995b), but with the observation that his transformational approach is not ...
Pii: S0306-4573(01)00044-9
Most work in NLP requires that texts have been previously segmented into sentences and words. Segmenting a text into sentences and words, however, is a complex task, due to the ambiguity of many punctuation marks and spaces. Furthermore, Web texts such as HTML documents are more difficult to turn into well-refined, segmented texts because they are written in a freer style, with many se...