The Automatic Thai Sentence Extraction
Abstract
Unlike English, Thai has no explicit sentence marker. Conventionally, a space is placed at the end of a sentence in Thai writing, but a space does not always indicate a sentence boundary; it is also used for other purposes [Danvivathana 1987]. This paper presents an algorithm that extracts sentences from a paragraph by detecting the true sentence-breaking spaces, applying the statistical part-of-speech (POS) tagging technique to the space classification problem. At each step, the algorithm considers two consecutive strings with a space between them and determines whether that space is a true sentence-breaking space. We divided the ORCHID Thai POS-tagged corpus into 10 portions for a cross-validation test. The evaluation shows that the average accuracies of space classification and break-space detection are 85.26% and 79.82%, respectively, and that the average false-break rate is 8.75%. Our approach also shows a significant improvement over the traditional statistical POS tagging technique: the average reduction in POS tagging error rate is as high as 11.3%.
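The idea of treating a space as a token to be tagged can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes the POS tags on either side of each space are already known, uses a bigram model rather than whatever n-gram order the authors used, and all probabilities, tag names (SB, NSB, NOUN, VERB, CONJ), and function names below are hypothetical values invented for illustration, not estimates from ORCHID.

```python
import math

# Hypothetical toy probabilities (NOT estimated from the ORCHID corpus).
# P(space label | tag of the last word before the space)
P_SPACE_GIVEN_PREV = {
    "VERB": {"SB": 0.6, "NSB": 0.4},
    "NOUN": {"SB": 0.3, "NSB": 0.7},
}
# P(tag of the first word after the space | space label)
P_NEXT_GIVEN_SPACE = {
    "SB": {"NOUN": 0.5, "VERB": 0.4, "CONJ": 0.1},
    "NSB": {"NOUN": 0.3, "VERB": 0.2, "CONJ": 0.5},
}
FLOOR = 1e-6  # small probability assumed for unseen events


def classify_space(prev_tag, next_tag):
    """Label a space as sentence-breaking (SB) or not (NSB).

    Scores each label by log P(label | prev_tag) + log P(next_tag | label)
    and returns the higher-scoring label.
    """
    scores = {}
    for label in ("SB", "NSB"):
        p_space = P_SPACE_GIVEN_PREV.get(prev_tag, {}).get(label, FLOOR)
        p_next = P_NEXT_GIVEN_SPACE[label].get(next_tag, FLOOR)
        scores[label] = math.log(p_space) + math.log(p_next)
    return max(scores, key=scores.get)


def extract_sentences(chunks):
    """Group space-separated chunks into sentences.

    Each chunk is (text, first_tag, last_tag): the string between two
    spaces plus the POS tags of its first and last words.
    """
    if not chunks:
        return []
    sentences = []
    current = [chunks[0][0]]
    # Examine two consecutive strings and the space between them each time.
    for prev, nxt in zip(chunks, chunks[1:]):
        if classify_space(prev[2], nxt[1]) == "SB":
            sentences.append(" ".join(current))
            current = [nxt[0]]
        else:
            current.append(nxt[0])
    sentences.append(" ".join(current))
    return sentences
```

A call such as `extract_sentences([("A", "NOUN", "VERB"), ("B", "NOUN", "NOUN"), ("C", "NOUN", "NOUN")])` walks the two spaces in turn and keeps the chunks on either side of a non-breaking space in the same sentence. In a fuller system the neighbouring tags would themselves come from the tagger, so word tags and space labels would be decided jointly.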
Similar Resources
A Lexicalized Tree Adjoining Grammar for Thai
This paper describes an alternative formalism for Thai syntax parsing based on a lexicalized tree adjoining grammar (LTAG). We first briefly present some formal background concerning LTAG, which is necessary for an understanding of LTAG and its application to Thai. Specifically, we address several issues regarding difficulties in parsing Thai sentences and how to resolve these issues using LTAG...
Automatic Corpus-based Thai Word Extraction
The Thai language is infamous for its ambiguity. One of its important ambiguities is that there is no explicit word boundary, or in other words, there is no explicit definition of what words are. Traditional methods of defining words, which depend on human judgement, are based on unclear criteria or procedures and have several limitations. This paper describes an automatic statistical method Thai word e...
Issues in Thai Text-to-Speech Synthesis: The NECTEC Approach
This paper presents all the essential issues in developing the text-to-speech synthesis for Thai text analysis, prosody generation and speech synthesis. In the text analysis, problems in Thai text processing can be decomposed into the models of sentence extraction, phrase boundary determination and grapheme-to-phoneme conversion. The syllable duration and F0 contour generation rules are include...
Evaluation Measures Considering Sentence Concatenation For Automatic Summarization By Sentence Or Word Extraction
Automatic summaries of text generated through sentence or word extraction have been evaluated by comparing them with manual summaries generated by humans, using numerical evaluation measures based on precision or accuracy. Although sentence extraction has previously been evaluated based only on the precision of a single sentence, sentence concatenations in the summaries should be evaluated as well...