Towards Building a Corpus-based Dictionary for Non-word-boundary Languages

Authors

  • Tanapong Potipiti
  • Virach Sornlertlamvanich
  • Thatsanee Charoenporn
Abstract

Corpus-based lexicography is an effective approach to building a dictionary for languages that exhibit explicit word boundaries. However, for non-word-boundary languages such as Japanese, Chinese and Thai, it is an arduous job. Because these languages have no clear criteria for what constitutes a word, the most difficult task in building a corpus-based dictionary for them is selecting the word list, that is, the lexicon entries. We propose a practical solution for this task by applying the C4.5 learning algorithm to build the lexicon list. Applied to Thai corpora, our algorithm yields promising results of about 85% on both the training and test corpora.
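As a rough illustration of the kind of decision C4.5 makes when it learns to separate words from non-words, the sketch below computes the gain ratio, the criterion C4.5 uses to choose a split, for a binary threshold test on toy data. The feature names (`freq`, `length`), the threshold values and the four candidate strings are hypothetical and are not taken from the paper; the full algorithm also involves recursive tree growth and pruning, which are not shown.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def gain_ratio(rows, labels, feature, threshold):
    """Gain ratio of the binary split rows[feature] <= threshold."""
    left = [y for x, y in zip(rows, labels) if x[feature] <= threshold]
    right = [y for x, y in zip(rows, labels) if x[feature] > threshold]
    if not left or not right:
        return 0.0  # a split that separates nothing carries no information
    total = len(labels)
    p_left, p_right = len(left) / total, len(right) / total
    gain = entropy(labels) - p_left * entropy(left) - p_right * entropy(right)
    split_info = -(p_left * math.log2(p_left) + p_right * math.log2(p_right))
    return gain / split_info


# Hypothetical candidate strings described by assumed features.
rows = [
    {"freq": 120, "length": 4},
    {"freq": 95, "length": 3},
    {"freq": 8, "length": 5},
    {"freq": 5, "length": 4},
]
labels = [True, True, False, False]  # True = accepted as a lexicon entry

by_freq = gain_ratio(rows, labels, "freq", 50)
by_length = gain_ratio(rows, labels, "length", 4)
print(f"gain ratio, freq <= 50:  {by_freq:.3f}")
print(f"gain ratio, length <= 4: {by_length:.3f}")
```

On this toy data the frequency test separates the two classes perfectly, so a C4.5-style learner would prefer it over the length test as the first split.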


Similar resources

Automated Building of Sentence-Level Parallel Corpus and Chinese-Hungarian Dictionary

Decades of work have been conducted on automated building of parallel corpus and bilingual dictionary in the field of natural language processing. However, rarely have any studies been done between high-density character-based languages and medium-density word-based languages due to the lack of resources and fundamental linguistic differences. In this paper, we describe a methodology for creati...


Building Bilingual Corpus based on Hybrid Approach for Myanmar-English Machine Translation

Word alignment in bilingual corpora has been an active research topic in the Machine Translation research groups. In this paper, we describe an alignment system that aligns English-Myanmar texts at word level in parallel sentences. Essential for building parallel corpora is the alignment of translated segments with source segments. Since word alignment research on Myanmar and English languages ...


Non-Dictionary-Based Thai Word Segmentation Using Decision Trees

For languages without word boundary delimiters, dictionaries are needed for segmenting running texts. This figure makes segmentation accuracy depend significantly on the quality of the dictionary used for analysis. If the dictionary is not sufficiently good, it will lead to a great number of unknown or unrecognized words. These unrecognized words certainly reduce segmentation accuracy. To solve...


Phrase-Boundary Translation Model Using Shallow Syntactic Labels

Phrase-boundary model for statistical machine translation labels the rules with classes of boundary words on the target side phrases of training corpus. In this paper, we extend the phrase-boundary model using shallow syntactic labels including POS tags and chunk labels. With the priority of chunk labels, the proposed model names non-terminals with shallow syntactic labels on the boundaries of ...


Synchronizing Translated Movie Subtitles

This paper addresses the problem of synchronizing movie subtitles, which is necessary to improve alignment quality when building a parallel corpus out of translated subtitles. In particular, synchronization is done on the basis of aligned anchor points. Previous studies have shown that cognate filters are useful for the identification of such points. However, this restricts the approach to rela...



Publication date: 2000