A Subword Normalized Cut Approach to Automatic Story Segmentation of Chinese Broadcast News
Abstract
This paper presents a subword normalized cut (N-cut) approach to automatic story segmentation of Chinese broadcast news (BN). We represent a speech recognition transcript as a weighted undirected graph in which the nodes correspond to sentences and the edge weights describe inter-sentence similarities. Story segmentation is formalized as a graph-partitioning problem under the N-cut criterion, which simultaneously minimizes the similarity across different partitions and maximizes the similarity within each partition. We measure inter-sentence similarities and perform N-cut segmentation on overlapping n-gram sequences of subword units, i.e., characters or syllables. Our method works at the subword level because subword matching is robust to speech recognition errors and out-of-vocabulary words. Experiments on the TDT2 Mandarin BN corpus show that syllable-bigram-based N-cut achieves the best F1-measure of 0.6911, a relative improvement of 11.52% over the previous word-based N-cut, which has an F1-measure of 0.6197. N-cut at the subword level is therefore more effective than at the word level for story segmentation of noisy Chinese BN transcripts.
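The graph formulation in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes cosine similarity over character-bigram counts as the edge weight, considers only a single two-way split at a sentence boundary, and finds that split by exhaustive search. The function names (`char_ngrams`, `similarity_matrix`, `ncut_value`, `best_boundary`) and the four example sentences are hypothetical.

```python
from collections import Counter
from math import sqrt


def char_ngrams(sentence, n=2):
    """Count the overlapping character n-grams of a sentence (whitespace removed)."""
    chars = "".join(sentence.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors (assumed edge weight)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def similarity_matrix(sentences, n=2):
    """Weight matrix of the sentence graph: W[i][j] is the similarity of sentences i and j."""
    vecs = [char_ngrams(s, n) for s in sentences]
    return [[cosine(u, v) for v in vecs] for u in vecs]


def ncut_value(W, boundary):
    """Normalized cut of splitting the sentences into [0, boundary) and [boundary, len(W))."""
    nodes = range(len(W))
    left = range(boundary)
    right = range(boundary, len(W))
    cut = sum(W[i][j] for i in left for j in right)
    assoc_left = sum(W[i][j] for i in left for j in nodes)
    assoc_right = sum(W[i][j] for i in right for j in nodes)
    if assoc_left == 0 or assoc_right == 0:
        return float("inf")
    # Small when cross-partition similarity is low relative to each side's total association.
    return cut / assoc_left + cut / assoc_right


def best_boundary(W):
    """Exhaustively pick the sentence boundary with the smallest normalized cut."""
    return min(range(1, len(W)), key=lambda b: ncut_value(W, b))


# Hypothetical transcript: two sentences about the stock market, two about a typhoon.
sentences = [
    "股市今天大幅上涨",
    "股市上涨带动银行股",
    "台风明天登陆沿海",
    "台风登陆带来暴雨",
]
W = similarity_matrix(sentences)
print(best_boundary(W))  # prints 2: the story boundary falls before the third sentence
```

Because the weights are computed on character bigrams, no word segmentation is needed, which is the property the abstract credits for robustness to recognition errors and out-of-vocabulary words. Syllable n-grams could be substituted by mapping each character to its syllable before calling `char_ngrams`.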
Similar resources
Multi-Scale TextTiling for Automatic Story Segmentation in Chinese Broadcast News
This paper applies Chinese subword representations, namely character and syllable n-grams, to TextTiling-based automatic story segmentation of Chinese broadcast news. We show the robustness of Chinese subwords against speech recognition errors, out-of-vocabulary (OOV) words and versatility in word segmentation in lexical matching on errorful Chinese speech recognition transcripts. We prop...
Combined Use of Speaker- and Tone-Normalized Pitch Reset with Pause Duration for Automatic Story Segmentation in Mandarin Broadcast News
This paper investigates the combined use of pause duration and pitch reset for automatic story segmentation in Mandarin broadcast news. Analysis shows that story boundaries cannot be clearly discriminated from utterance boundaries by speaker-normalized pitch reset due to its large variations across different syllable tone pairs. Instead, speaker- and tone-normalized pitch reset can provide a clear...
The SoVideo Mandarin Chinese Broadcast News Retrieval System
This paper describes the SoVideo broadcast news retrieval system for Mandarin Chinese. The system is based on technologies such as large-vocabulary continuous speech recognition for Mandarin Chinese, automatic story segmentation, and information retrieval. Currently, the database consists of 177 hours of broadcast news, which yields 3264 stories by automatic story segmentation. We discuss the d...
Mandarin Chinese Broadcast News Retrieval and Summarization Using Probabilistic Generative Models
This paper presents our recent research work on applying probabilistic generative models to Mandarin Chinese broadcast news retrieval and summarization. Most models can be trained in either a supervised or unsupervised manner. In addition, both literal term matching and concept matching strategies have been intensively investigated. This paper also presents a prototype web-based Mandarin Chines...
Speaker role based structural classification of broadcast news stories
This paper is concerned with automatic classification of broadcast news stories based on speaker roles such as anchor, reporter and others. The story classification is the first step for many related tasks such as browsing, indexing, and summarising the news broadcast. We use broadcast news audio and its automatic speech recogniser transcripts to implement the classification system. It builds o...
Publication date: 2009