text segmentation

Unsupervised Stylistic Segmentation of Poetry with Change Curves and Extrinsic Features

2012

Julian Brooke Adam Hammond Graeme Hirst

The identification of stylistic inconsistency is a challenging task relevant to a number of genres, including literature. In this work, we carry out stylistic segmentation of a well-known poem, The Waste Land by T.S. Eliot, which is traditionally analyzed in terms of numerous voices which appear throughout the text. Our method, adapted from work in topic segmentation and plagiarism detection, p...

CIPS-SIGHAN Joint Conference on Chinese Language Processing, Beijing, China, August 28-29, 2010

2010

Le Sun Keh-Jiann Chen Qun Liu

The authors propose that we need some change for the current technology in Chinese word segmentation. We should have separate and different phases in the so-called segmentation. First of all, we need to limit segmentation only to the segmentation of Chinese characters instead of the so-called Chinese words. In character segmentation, we will extract all the informat...

Acoustic indicators of topic segmentation

1998

Julia Hirschberg Christine H. Nakatani

The segmentation of text and speech into topics and subtopics is an important step in document interpretation. For text, formatting information, such as headings and paragraphing, is available to aid in this endeavor, although this information is by no means sufficient. For speech, the task is even more difficult. We present results of the application of machine learning techniques to the automatic...

The Kyutech corpus and topic segmentation using a combined method

2016

Takashi Yamamura Kazutaka Shimada Shintaro Kawahara

Summarization of multi-party conversation is one of the important tasks in natural language processing. In this paper, we explain a Japanese corpus and a topic segmentation task. To the best of our knowledge, the corpus is the first Japanese corpus annotated for summarization tasks and freely available to anyone. We call it “the Kyutech corpus.” The task of the corpus is a decision-making task ...

Maximum Entropy Word Segmentation of Chinese Text

2006

Aaron J. Jacobs Yuk Wah Wong

We extended the work of Low, Ng, and Guo (2005) to create a Chinese word segmentation system based upon a maximum entropy statistical model. This system was entered into the Third International Chinese Language Processing Bakeoff and evaluated on all four corpora in their respective open tracks. Our system achieved the highest F-score for the UPUC corpus, and the second, third, and seventh high...

Using Collocations for Topic Segmentation and Link Detection

2002

Olivier Ferret

We present in this paper a method for achieving in an integrated way two tasks of topic analysis: segmentation and link detection. This method combines word repetition and the lexical cohesion stated by a collocation network to compensate for the respective weaknesses of the two approaches. We report an evaluation of our method for segmentation on two corpora, one in French and one in English, ...

Automatic Acquisition of Basic Katakana Lexicon from a Given Corpus

2005

Toshiaki Nakazawa Daisuke Kawahara Sadao Kurohashi

Katakana, Japanese phonogram mainly used for loan words, is a troublemaker in Japanese word segmentation. Since Katakana words are heavily domain-dependent and there are many Katakana neologisms, it is almost impossible to construct and maintain Katakana word dictionary by hand. This paper proposes an automatic segmentation method of Japanese Katakana compounds, which makes it possible to cons...

Cohesion and Collocation: Using Context Vectors in Text Segmentation

1999

Stefan Kaufmann

Collocational word similarity is considered a source of text cohesion that is hard to measure and quantify. The work presented here explores the use of information from a training corpus in measuring word similarity and evaluates the method in the text segmentation task. An implementation, the VecTile system, produces similarity curves over texts using pre-compiled vector representations of the...

Taiwan Child Language Corpus: Data Collection and Annotation

2005

Jane S. Tsay

Taiwan Child Language Corpus contains scripts transcribed from about 330 hours of recordings of fourteen young children from Southern Min Chinese speaking families in Taiwan. The format of the corpus adopts the Child Language Data Exchange System (CHILDES). The size of the corpus is about 1.6 million words. In this paper, we describe data collection, transcription, word segmentation, and part-o...

Punctuation as Implicit Annotations for Chinese Word Segmentation

Journal: Computational Linguistics 2009

Zhongguo Li Maosong Sun

Paragraphs are composed of sentences. Hence when a paragraph begins, a sentence must begin, and as a paragraph closes, some sentence must finish. This observation is the basis of the sentence boundary detection method proposed by Riley (1989). Similarly, sentences consist of words. As a sentence begins or ends there must be word boundaries. Inspired by this notion, we invent a method to learn a...
