Automatic Corpus-based Thai Word Extraction
Authors
Abstract
The Thai language is notorious for its ambiguity. One important source of ambiguity is that written Thai has no explicit word boundaries; in other words, there is no explicit definition of what a word is. Traditional methods of defining words depend on human judgement, are based on unclear criteria or procedures, and have several limitations. This paper describes an automatic statistical method for Thai word extraction from plain Thai text that employs suffix-array, mutual-information and entropy techniques. In experiments, the algorithm extracts 428 acceptable words from a 1 MB plain Thai text corpus, with an extraction accuracy of about 85 per cent on both the training and test corpora.
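The abstract names three ingredients (a suffix array, mutual information and entropy) without giving formulas. The following is only a minimal sketch of how such statistics might be computed for a candidate string, assuming pointwise mutual information between two adjacent substrings and Shannon entropy over the character that follows each occurrence. All function names are hypothetical, the suffix-array construction is deliberately naive, and a Latin toy string stands in for Thai text; none of this is taken from the paper itself.

```python
import math
from bisect import bisect_left, bisect_right
from collections import Counter

def build_suffix_array(text):
    # Naive construction by sorting all suffixes; adequate for a toy example only.
    return sorted(range(len(text)), key=lambda i: text[i:])

def occurrences(text, sa, s):
    # Start positions of s. Truncating sorted suffixes to len(s) keeps them
    # sorted, so the matches form one contiguous range found by binary search.
    prefixes = [text[i:i + len(s)] for i in sa]
    lo = bisect_left(prefixes, s)
    hi = bisect_right(prefixes, s)
    return [sa[k] for k in range(lo, hi)]

def right_entropy(text, sa, s):
    # Shannon entropy (in bits) of the character following each occurrence of s.
    # Assumption: high entropy hints that s ends at a plausible boundary.
    following = Counter(
        text[p + len(s)]
        for p in occurrences(text, sa, s)
        if p + len(s) < len(text)
    )
    total = sum(following.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in following.values())

def pmi(text, sa, left, right):
    # Pointwise mutual information (in bits) between two adjacent substrings,
    # with probabilities estimated as occurrence counts over the text length.
    n = len(text)
    p_xy = len(occurrences(text, sa, left + right)) / n
    if p_xy == 0:
        return float("-inf")
    p_x = len(occurrences(text, sa, left)) / n
    p_y = len(occurrences(text, sa, right)) / n
    return math.log2(p_xy / (p_x * p_y))

toy = "abxabyabz"  # toy stand-in for a plain-text corpus
sa = build_suffix_array(toy)
ab_positions = sorted(occurrences(toy, sa, "ab"))
ab_entropy = right_entropy(toy, sa, "ab")
ab_pmi = pmi(toy, sa, "a", "b")
```

In this toy string, "ab" is always followed by a different character, so its right entropy is high, while "a" is always followed by "b", so its right entropy is zero. A real extractor would need thresholds or a learned decision rule on top of such features, which this sketch does not attempt.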
Similar resources
Automatic Corpus-Based Thai Word Extraction with the C4.5 Learning Algorithm, by Virach Sornlertlamvanich, Tanapong Potipiti and Thatsanee Charoenporn
"Word" is difficult to define in languages that do not exhibit explicit word boundaries, such as Thai. Traditional methods of defining words for such languages depend on human judgement, which is based on unclear criteria or procedures, and have several limitations. This paper proposes an algorithm for word extraction from Thai texts without relying on word segmentation....
Automatic Thai Keyword Extraction from Categorized Text Corpus
Information Extraction (IE) is the process of discovering implicit and potentially important keywords in an unstructured natural-language text corpus. Most previously proposed solutions to IE construct a set of words from the given text corpus during the preprocessing step. Due to the inherent characteristic of the Thai written language, which does not explicitly use any word d...
Multi-stage Annotation using Pattern-based and Statistical-based Techniques for Automatic Thai Annotated Corpus Construction
Automated or semi-automated annotation is a practical solution for large-scale corpus construction. However, special characteristics of the Thai language, such as the lack of word-boundary and sentence-boundary markers, trigger several issues in automatic corpus annotation. This paper presents a multi-stage annotation framework containing two stages of chunking and three stages of tagging. Two chu...
A Unified Model of Thai Romanization and Word Segmentation
Thai romanization is the writing of the Thai language using the Roman alphabet. It can be performed on the basis of orthographic form (transliteration), pronunciation (transcription), or both. As a result, many systems of romanization are in use. The Royal Institute has established a standard by proposing a principle of romanization based on transcription. To ensure the standard, a ful...