Compound Noun Segmentation Based on Lexical Data Extracted from Corpus

Author

  • Juntae Yoon
Abstract

Compound noun analysis is one of the crucial problems in Korean language processing because a series of nouns in Korean may appear without white space in real texts, which makes it difficult to identify the morphological constituents. This paper presents an effective method of Korean compound noun segmentation based on lexical data extracted from a corpus. Segmentation is done in two steps. First, it relies on a manually constructed built-in dictionary for segmentation, whose data were extracted from a 30-million-word corpus. Second, a segmentation algorithm using statistical data is proposed, where simple nouns and their frequencies are also extracted from the corpus. The analysis is executed based on CYK tabular parsing and a min-max operation. In experiments, its accuracy is about 97.29%, which shows the method to be very effective.

1 Introduction

Morphological analysis is crucial for processing agglutinative languages like Korean, since words in such languages have many morphological variants. A sentence is represented by a sequence of eojeols, which are the syntactic units delimited by spacing characters in Korean. Unlike in English, an eojeol is not one word but is composed of a series of words (content words and functional words). In particular, since an eojeol can often contain more than one noun, we cannot get a proper interpretation of the sentence or phrase without accurate segmentation.

The problem in compound noun segmentation is that it is not possible to register all compound nouns in the dictionary, since nouns form an open class of words and their number is very large. Thus, without a segmentation process, they must be treated as unseen words. Furthermore, accurate compound noun segmentation plays an important role in application systems. Compound noun segmentation is required for improving recall and precision in Korean information retrieval, and for obtaining better translations in machine translation.

For example, suppose that a compound noun 'seol'agsan-gugrib-gongwon (Seol'ag Mountain National Park)' appears in documents. A user might want to retrieve documents about 'seol'agsan (Seol'ag Mountain)', and it is likely that the documents containing 'seol'agsan-gugrib-gongwon' are also of interest to that user. Therefore, the compound should be segmented correctly before indexing so that the documents can be retrieved with the query 'seol'agsan'. Also, to translate 'seol'agsan-gugrib-gongwon' as Seol'ag Mountain National Park, the constituents should first be identified through segmentation. This paper presents two methods for segmentation of compound nouns. First, we extract …
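The abstract names CYK tabular parsing and a min-max operation but does not spell out how they are combined. The sketch below shows one plausible reading, not the paper's actual algorithm: a CYK-style table is filled over all character spans, each candidate segmentation is scored by the minimum frequency of its constituent nouns, and the segmentation with the maximum such score is kept. The function name `segment`, the tie-breaking rule (prefer fewer parts), the romanized input, and all frequencies are assumptions made up for illustration.

```python
def segment(compound, noun_freq):
    """Return (score, parts) for the best segmentation, or None if none exists.

    Assumed scoring (not confirmed by the source): a segmentation's score is
    the minimum frequency among its nouns, and the highest score wins.
    """
    n = len(compound)
    # table[i][j] holds the best (score, parts) found for compound[i:j].
    table = [[None] * (n + 1) for _ in range(n + 1)]
    for length in range(1, n + 1):            # shorter spans first, as in CYK
        for i in range(0, n - length + 1):
            j = i + length
            best = None
            word = compound[i:j]
            if word in noun_freq:             # the span itself is a known noun
                best = (noun_freq[word], [word])
            for k in range(i + 1, j):         # try every split point
                left, right = table[i][k], table[k][j]
                if left is None or right is None:
                    continue
                cand = (min(left[0], right[0]), left[1] + right[1])
                if (best is None or cand[0] > best[0]
                        or (cand[0] == best[0] and len(cand[1]) < len(best[1]))):
                    best = cand
            table[i][j] = best
    return table[0][n]


# Hypothetical frequencies; a real system would take them from the corpus.
freq = {"seol'agsan": 50, "seol'ag": 2, "san": 400, "gugrib": 120, "gongwon": 300}
result = segment("seol'agsangugribgongwon", freq)
print(result)
```

With these made-up counts, the rare noun 'seol'ag' drags down any segmentation that uses it, so the table prefers the split into 'seol'agsan', 'gugrib', and 'gongwon'.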


Similar resources

Segmentation of Compound Nouns using Composite Mutual Information

In Korean, a compound noun may be freely formed with or without spaces between simple nouns. The flexible word formation rule of Korean raises a serious problem in processing compound nouns with computers, in particular, in searching a dictionary with the compound noun as a search key. This paper describes a corpus-based method for segmenting a compound noun into simple nouns. Segmentation is per...


A System for Compound Noun Multiword Expression Extraction for Hindi

Compound noun multiword expressions are important for many NLP applications like machine translation and information retrieval. This paper describes a system for Hindi compound noun multiword expressions (MWE) extraction from a given corpus. We identify major categories of compound noun MWEs, based on linguistic and psycholinguistic principles. Our extraction methods use various statistical co-...


Interpreting compound nouns with kernel methods

This paper presents a classification-based approach to noun-noun compound interpretation within the statistical learning framework of kernel methods. In this framework, the primary modelling task is to define measures of similarity between data items, formalised as kernel functions. We consider the different sources of information that are useful for understanding compounds and proceed to defin...


Corpus-Based Learning Of Compound Noun Indexing

In this paper, we present a corpus-based learning method that can index diverse types of compound nouns using rules automatically extracted from a large tagged corpus. We develop an efficient way of extracting the compound noun indexing rules automatically and perform extensive experiments to evaluate our indexing rules. The automatic learning method shows about the same performance compared with ...


Korean Compound Noun Indexing Based on Lexical Association and Conceptual Association

Conventional methods have dealt with compound nouns with the goal of enhancing recall. That is, they have extracted only unit nouns within a full compound noun but have not extracted head modifiers, which can improve precision. This paper presents a new method of Korean compound noun indexing which can improve both recall and precision. Our method extracts head modifiers from a compound...




Publication date: 2000