Optimizing sentence segmentation for spoken language translation
Authors
Abstract
The conventional approach in text-based machine translation (MT) is to translate complete sentences, which are conveniently indicated by sentence boundary markers. However, since such boundary markers are not available for speech, new methods are required that define an optimal unit for translation. Our experimental results show that with a segment length optimized for a particular MT system, intra-sentence segmentation can improve translation performance (measured in BLEU) by up to 11% for Arabic Broadcast Conversation (BC) and 6% for Arabic Broadcast News (BN). We show that acoustic segmentation that minimizes Word Error Rate (WER) may not give the best translation performance. We improve upon it by automatically resegmenting the ASR output in a way that is optimized for translation and argue that it might be necessary for different stages of a Spoken Language Translation (SLT) system to define their own optimal units.
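The idea of tuning segment length for a particular MT system can be illustrated with a minimal sketch. The abstract does not say how the resegmentation is performed, so the fixed-length chunking below is only a stand-in assumption, and the names `resegment`, `pick_segment_length`, and `score_fn` are hypothetical. In practice `score_fn` would translate the segments and return something like BLEU on a development set.

```python
from typing import Callable, Iterable, List, Sequence


def resegment(tokens: Sequence[str], max_len: int) -> List[List[str]]:
    """Cut a token stream into consecutive segments of at most max_len tokens.

    Assumption: plain fixed-length chunking, not the method used in the paper.
    """
    if max_len < 1:
        raise ValueError("max_len must be at least 1")
    return [list(tokens[i:i + max_len]) for i in range(0, len(tokens), max_len)]


def pick_segment_length(
    tokens: Sequence[str],
    candidates: Iterable[int],
    score_fn: Callable[[List[List[str]]], float],
) -> int:
    """Return the candidate length whose segmentation scores highest.

    score_fn is a hypothetical placeholder for a translation-quality
    measure computed on the segmented output.
    """
    return max(candidates, key=lambda n: score_fn(resegment(tokens, n)))
```

Sweeping a handful of candidate lengths and keeping the best-scoring one mirrors the abstract's point that the unit that is best for recognition need not be the unit that is best for translation.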
Similar resources
Enhancements in Statistical Spoken Language Translation by De-normalization of ASR Results
Spoken language translation (SLT) has become very important in an increasingly globalized world. Machine translation (MT) for automatic speech recognition (ASR) systems is a major challenge of great interest. This research investigates automatic sentence segmentation of speech, which is important for enriching speech recognition output and for aiding downstream language processing. This arti...
Automatic sentence segmentation and punctuation prediction for spoken language translation
This paper studies the impact of automatic sentence segmentation and punctuation prediction on the quality of machine translation of automatically recognized speech. We present a novel sentence segmentation method which is specifically tailored to the requirements of machine translation algorithms and is competitive with state-of-the-art approaches for detecting sentence-like units. We also des...
Evaluating machine translation output with automatic sentence segmentation
This paper presents a novel automatic sentence segmentation method for evaluating machine translation output with possibly erroneous sentence boundaries. The algorithm can process translation hypotheses with segment boundaries which do not correspond to the reference segment boundaries, or a completely unsegmented text stream. Thus, the method is especially useful for evaluating translations of...
The ISL statistical translation system for spoken language translation
In this paper we describe the components of our statistical machine translation system used for the spoken language translation evaluation campaign. This system is based on phrase-to-phrase translations extracted from a bilingual corpus. A new phrase alignment approach will be introduced, which finds the target phrase by optimizing the overall word-to-word alignment for the sentence pair unde...
Speech Segmentation and its Impact on Spoken Document Processing
Progress in both speech and language processing has spurred efforts to support applications that rely on spoken—rather than written—language input. A key challenge in moving from text-based documents to such “spoken documents” is that spoken language lacks explicit punctuation and formatting, which can be crucial for good performance. This paper describes different levels of speech segmentation...