Cyberpunc: a lightweight punctuation annotation system for speech
Authors
Doug Beeferman, Adam Berger, John Lafferty
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213
dougb,aberger,la [email protected]
Abstract
This paper describes a lightweight method for the automatic insertion of intra-sentence punctuation into text. Despite the intuition that pauses in an acoustic stream are a positive indicator for some types of punctuation, this work will demonstrate the feasibility of a system which relies solely on lexical information. Besides its potential role in a speech recognition system, such a system could serve equally well in non-speech applications such as automatic grammar correction in a word processor and parsing of spoken text. After describing the design of a punctuation-restoration system, which relies on a trigram language model and a straightforward application of the Viterbi algorithm, we summarize results, both quantitative and subjective, of the performance and behavior of a prototype system.
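The combination of a trigram language model and Viterbi search mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy corpus, the add-k smoothing, the restriction to commas, and the names `train_trigram` and `restore_commas` are all assumptions made for illustration. Commas are treated as ordinary tokens in the language model, and the search keeps the best-scoring hypothesis for each trigram history (the last two output tokens).

```python
import math
from collections import defaultdict

COMMA = ","   # the only punctuation mark handled in this sketch (assumption)
BOS = "<s>"   # sentence-start padding token

def train_trigram(sentences, k=0.1):
    """Build an add-k smoothed trigram model; commas are ordinary tokens."""
    tri, bi, vocab = defaultdict(int), defaultdict(int), {COMMA}
    for sent in sentences:
        toks = [BOS, BOS] + sent
        vocab.update(sent)
        for i in range(2, len(toks)):
            tri[tuple(toks[i - 2:i + 1])] += 1
            bi[tuple(toks[i - 2:i])] += 1
    size = len(vocab)

    def logp(w, u, v):
        # log P(w | u, v) with add-k smoothing (hypothetical choice)
        return math.log((tri[(u, v, w)] + k) / (bi[(u, v)] + k * size))

    return logp

def restore_commas(words, logp):
    """Viterbi search: before each word, either insert a comma or not."""
    # Map trigram history (last two output tokens) -> (best score, best output).
    best = {(BOS, BOS): (0.0, [])}
    for idx, w in enumerate(words):
        nxt = {}
        for (u, v), (score, out) in best.items():
            # Option 1: emit the word with no preceding comma.
            cands = [((v, w), score + logp(w, u, v), out + [w])]
            # Option 2: emit a comma, then the word (never sentence-initially).
            if idx > 0:
                s = score + logp(COMMA, u, v) + logp(w, v, COMMA)
                cands.append(((COMMA, w), s, out + [COMMA, w]))
            for state, s, seq in cands:
                # Recombine hypotheses that share a history; keep the best one.
                if state not in nxt or s > nxt[state][0]:
                    nxt[state] = (s, seq)
        best = nxt
    return max(best.values(), key=lambda t: t[0])[1]

# Toy training data (invented for this example).
corpus = [
    ["however", ",", "we", "agree"],
    ["however", ",", "they", "agree"],
    ["we", "agree"],
    ["they", "agree"],
]
model = train_trigram(corpus)
print(restore_commas(["however", "we", "agree"], model))
# prints ['however', ',', 'we', 'agree']
```

Because future trigram scores depend only on the last two emitted tokens, keeping one hypothesis per history is exact rather than a pruning heuristic, and at most two histories are live after each word.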
Similar resources
On Development of Consistently Punctuated Speech Corpora
Punctuation of automatically recognized speech is important to enhance readability of transcripts and to aid downstream NLP processing. This paper is concerned with issues involved in developing training and test corpora for automatic punctuation systems. Punctuation annotation in speech transcripts is difficult since there are numerous cases for which no standard punctuation rules exist. Speci...
Revising the annotation of a Broadcast News corpus: a linguistic approach
This paper presents a linguistic revision process of a speech corpus of Portuguese broadcast news focusing on metadata annotation for rich transcription, and reports on the impact of the new data on the performance of several modules. The main focus of the revision process consisted of annotating and revising structural metadata events, such as disfluencies and punctuation marks. The resultant...
Automatic Punctuation Annotation in Czech Broadcast News Speech
This paper reports our initial experiments with automatic punctuation annotation from speech. We have focused on Czech broadcast news speech. The task can be defined as a classification of each inter-word boundary into one of target classes. We considered comma, sentence boundary and “no punctuation” as the target classes. We employed two statistical models – prosodic model and language model. ...
Recovering Capitalization and Punctuation Marks on Speech Transcriptions
This work addresses two metadata annotation tasks, involved in the production of rich transcripts: automatic capitalization, and punctuation marks recovery. The main focus concerns broadcast news, using both manual and automatic speech transcripts. Different capitalization models were analysed and compared, and results support the idea that generative approaches capture the structure of writte...
The annotation of the C-ORAL-BRASIL spoken corpus using an adaptation of the Palavras Parser
This article describes the morphosyntactic annotation of the C-ORAL-BRASIL speech corpus, using an adapted version of the Palavras parser. In order to achieve compatibility with annotation rules designed for standard written Portuguese, transcribed words were orthographically normalized, and the parsing lexicon augmented with speech-specific material, phonetically spelled abbreviations etc. Usi...