Pause and Stop Labeling for Chinese Sentence Boundary Detection
Authors
Abstract
The fuzziness of Chinese sentence boundaries makes discourse analysis more challenging. Moreover, many articles posted on the Internet lack punctuation marks altogether. In this paper, we collect documents written by masters as a reference corpus and propose a model that labels the punctuation marks for a given text. Conditional random field (CRF) models trained on the corpus determine the correct delimiter (a comma or a full stop) between each pair of successive clauses. Different tagging schemes and various features from different linguistic levels are explored. The results show that our segmenter achieves an accuracy of 77.48% on plain text, which is close to the human performance of 81.18%. On richly formatted text, our segmenter achieves an even better accuracy of 82.93%.
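The labeling task described in the abstract can be illustrated with a minimal data-preparation sketch. It is not the authors' implementation: the label names `PAUSE` and `STOP`, the helper names, and the sample sentence are assumptions made for illustration, and no CRF is trained here. The sketch only shows how punctuated reference text could be turned into a sequence of clauses with one delimiter label each, and how per-delimiter accuracy could be computed.

```python
import re

# Hypothetical label names for the two delimiters named in the abstract:
# a comma marks a pause, a full stop marks a stop.
DELIMS = {"，": "PAUSE", "。": "STOP"}


def clauses_and_labels(text):
    """Split punctuated text into clauses and the label of the delimiter
    that follows each clause. Trailing text with no delimiter is dropped."""
    # The capture group keeps the delimiters in the split result, so the
    # list alternates: clause, delimiter, clause, delimiter, ...
    parts = re.split("([，。])", text)
    clauses, labels = [], []
    for i in range(0, len(parts) - 1, 2):
        clause, delim = parts[i].strip(), parts[i + 1]
        if clause:  # skip empty clauses produced by consecutive delimiters
            clauses.append(clause)
            labels.append(DELIMS[delim])
    return clauses, labels


def accuracy(gold, predicted):
    """Fraction of delimiter positions where the predicted label matches."""
    if len(gold) != len(predicted):
        raise ValueError("label sequences must have the same length")
    if not gold:
        raise ValueError("label sequences must not be empty")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


if __name__ == "__main__":
    # Illustrative sentence, not taken from the paper's corpus.
    clauses, gold = clauses_and_labels("今天下雨，我没出门。明天再说。")
    print(list(zip(clauses, gold)))
    print(accuracy(gold, ["PAUSE", "PAUSE", "STOP"]))
```

In a setup like the one the abstract describes, the clause sequence would be the input to a sequence labeler and the label sequence its training target; the feature design and tagging schemes are the paper's contribution and are not reproduced here.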
Related resources
Semantic Role Labeling of Persian Sentences Using a Memory-Based Learning Approach
Extracting semantic roles is one of the major steps in representing text meaning. It refers to finding the semantic relations between a predicate and syntactic constituents in a sentence. In this paper we present a semantic role labeling system for Persian, using a memory-based learning model and standard features. Our proposed system implements a two-phase architecture to first identify...
Effect of topic structure and sentence length on pause in Mandarin Chinese: Comparing female with male speakers
This paper studied the effects of topic structure and sentence length on acoustic parameters at intonational phrase boundaries, comparing female and male speakers. Twenty native speakers of Mandarin Chinese read 12 short discourses, each of which contained two sentences. The second sentence was either short or long. And, the transition between the two sentences was either topic continuation, topic elaborat...
Sentence boundary detection of spontaneous Japanese using statistical language model and support vector machines
This paper presents two different approaches utilizing statistical language model (SLM) and support vector machines (SVM) for sentence boundary detection of spontaneous Japanese. In the SLM-based approach, linguistic likelihoods and occurrence of pause are used to determine sentence boundaries. To suppress false alarms, heuristic patterns of end-of-sentence expressions are also incorporated. On...
Sentence boundaries in text and pauses in speech: Correlation or confrontation?
The paper explores the interaction between sentence boundaries marked by annotators in transcriptions of Russian spontaneous speech and actual prosodic boundaries in the signal. The aim of the research is to investigate whether annotators’ prosodic competence allows them to correctly detect sentence boundaries in speech based on textual information only. We found that inter-annotator agreement ...
Sentence Boundary Detection in Broadcast Speech Transcripts
This paper presents an approach to identifying sentence boundaries in broadcast speech transcripts. We describe finite state models that extract sentence boundary information statistically from text and audio sources. An n-gram language model is constructed from a collection of British English news broadcasts and scripts. An alternative model is estimated from pause duration information in spee...