Word Segmentation Standard in Chinese, Japanese and Korean

نویسندگان

  • Key-Sun Choi
  • Hitoshi Isahara
  • Kyoko Kanzaki
  • Hansaem Kim
  • Seok Mun Pak
  • Maosong Sun
چکیده

Word segmentation is a process to divide a sentence into meaningful units called “word unit” [ISO/DIS 24614-1]. What is a word unit is judged by principles for its internal integrity and external use constraints. A word unit’s internal structure is bound by principles of lexical integrity, unpredictability and so on in order to represent one syntactically meaningful unit. Principles for external use include language economy and frequency such that word units could be registered in a lexicon or any other storage for practical reduction of processing complexity for the further syntactic processing after word segmentation. Such principles for word segmentation are applied for Chinese, Japanese and Korean, and impacts of the standard are discussed.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Comparison of the Impact of Word Segmentation on Name Tagging for Chinese and Japanese

Word Segmentation is usually considered an essential step for many Chinese and Japanese Natural Language Processing tasks, such as name tagging. This paper presents several new observations and analysis on the impact of word segmentation on name tagging; (1). Due to the limitation of current state-of-the-art Chinese word segmentation performance, a character-based name tagger can outperform its...

متن کامل

Exploiting Shared Chinese Characters in Chinese Word Segmentation Optimization for Chinese-Japanese Machine Translation

Unknown words and word segmentation granularity are two main problems in Chinese word segmentation for ChineseJapanese Machine Translation (MT). In this paper, we propose an approach of exploiting common Chinese characters shared between Chinese and Japanese in Chinese word segmentation optimization for MT aiming to solve these problems. We augment the system dictionary of a Chinese segmenter b...

متن کامل

Can Word Segmentation be Considered Harmful for Statistical Machine Translation Tasks between Japanese and Chinese?

Unlike most Western languages, there are no typographic boundaries between words in written Japanese and Chinese. Word segmentation is thus normally adopted as an initial step in most natural language processing tasks for these Asian languages. Although word segmentation techniques have improved greatly both theoretically and practically, there still remains some problems to be tackled. In this...

متن کامل

Adapting Chinese Word Segmentation for Machine Translation Based on Short Units

In Chinese texts, words composed of single or multiple characters are not separated by spaces, unlike most western languages. Therefore Chinese word segmentation is considered an important first step in machine translation (MT) and its performance impacts MT results. Many factors affect Chinese word segmentations, including the segmentation standards and segmentation strategies. The performance...

متن کامل

Chinese and Korean Topic Search of Japanese News Collections

UC Berkeley participated in the pivot bilingual task of the CLIR track at NTCIR Workshop 4. Our focus was on Chinese and Korean searches against the Japanese News document collection, using English as a pivot language. For comparison of our pivot techniques, we submitted Japanese monolingual and English Japanese bilingual search rankings as well. Two different commercial translation software pa...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2009