How does Dictionary Size Influence Performance of Vietnamese Word Segmentation?

نویسندگان

  • Wuying Liu
  • Lin Wang
چکیده

Vietnamese word segmentation (VWS) is a challenging basic issue for natural language processing. This paper addresses the problem of how does dictionary size influence VWS performance, proposes two novel measures: square overlap ratio (SOR) and relaxed square overlap ratio (RSOR), and validates their effectiveness. The SOR measure is the product of dictionary overlap ratio and corpus overlap ratio, and the RSOR measure is the relaxed version of SOR measure under an unsupervised condition. The two measures both indicate the suitable degree between segmentation dictionary and object corpus waiting for segmentation. The experimental results show that the more suitable, neither smaller nor larger, dictionary size is better to achieve the state-of-the-art performance for dictionary-based Vietnamese word segmenters.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Comparing Different Criteria for Vietnamese Word Segmentation

Syntactically annotated corpora have become important resources for natural language processing due in part to the success of corpus-based methods. Since words are often considered as primitive units of language structures, the annotation of word segmentation forms the basis of these corpora. This is also an issue for the Vietnamese Treebank (VTB), which is the first and only publicly available...

متن کامل

An Unsupervised Learning and Statistical Approach for Vietnamese Word Recognition and Segmentation

There are two main topics in this paper: (i) Vietnamese words are recognized and sentences are segmented into words by using probabilistic models; (ii) the optimum probabilistic model is constructed by an unsupervised learning processing. For each probabilistic model, new words are recognized and their syllables are linked together. The syllable-linking process improves the accuracy of statisti...

متن کامل

Vietnamese Word Segmentation

Word segmentation is the first and obligatory task for every NLP. For inflectional languages like English, French, Dutch,.. their word boundaries are simply assumed to be whitespaces or punctuations. Whilst in various Asian languages, including Chinese and Vietnamese, whitespaces are never used to determine the word boundaries, so one must resort to such higher levels of information as: informa...

متن کامل

Vietnamese Word Segmentation with CRFs and SVMs: An Investigation

Word segmentation for Vietnamese, like for most Asian languages, is an important task which has a significant impact on higher language processing levels. However, it has received little attention of the community due to the lack of a common annotated corpus for evaluation and comparison. Also, most previous studies focused on unsupervised-statistical approaches or combined too many techniques....

متن کامل

An Empirical Study on Word Segmentation for Chinese Machine Translation

Word segmentation has been shown helpful for Chinese-toEnglish machine translation (MT), yet the way different segmentation strategies affect MT is poorly understood. In this paper, we focus on comparing different segmentation strategies in terms of machine translation quality. Our empirical study covers both English-to-Chinese and Chinese-to-English translation for the first time. Our results ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2016