Building a Paraphrase Corpus for Speech Translation

نویسندگان

  • Mitsuo Shimohata
  • Eiichiro Sumita
  • Yuji Matsumoto
چکیده

When a machine translation (MT) system receives input sentences of spoken language, the following two types of sentences are difficult to translate: (1) long sentences and (2) sentences having redundant expressions often seen in spoken language. To reduce these difficulties, we are developing methods to paraphrase input sentences into more translatable ones. In this paper, we report a preliminary Japanese paraphrase corpus. The corpus consists of original sentences derived from travel conversation and versions of them paraphrased by humans. We use three paraphrasing methods: plain, segment, and summary paraphrasing. Plain paraphrasing is applied to short sentences, where redundant expressions are replaced with plain ones. Segment and summary paraphrasing is applied to long sentences, where long sentences are converted into one or several short sentences. We also report a comparison of machine translation quality between the original sentences and the paraphrased sentences. We use two corpus-based machine translation systems in the experiment.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Using Multiple Metrics in Automatically Building Turkish Paraphrase Corpus

Paraphrasing is expressing similar meanings with different words in different order. In this sense it is viewed as translation in the same language. It is an important issue in natural language processing for automatic machine translation, question answering, text summarization and language generation. Studies in paraphrasing can be classified as paraphrase extraction, paraphrase generation, pa...

متن کامل

Building a Non-Trivial Paraphrase Corpus Using Multiple Machine Translation Systems

We propose a novel sentential paraphrase acquisition method. To build a wellbalanced corpus for Paraphrase Identification, we especially focus on acquiring both non-trivial positive and negative instances. We use multiple machine translation systems to generate positive candidates and a monolingual corpus to extract negative candidates. To collect nontrivial instances, the candidates are unifor...

متن کامل

Extract Domain-specific Paraphrase from Monolingual Corpus for Automatic Evaluation of Machine Translation

Paraphrase can help match synonyms or match phrases with the same or similar meaning, thus it plays an important role in automatic evaluation of machine translation. The traditional approaches extract paraphrase in general domain from bilingual corpus. Because the WMT16 metrics task consists of three subtasks, namely news domain, medical domain, and IT domain, we propose to extract domainspecif...

متن کامل

A Class-oriented Approach to Building a Paraphrase Corpus

Towards deep analysis of compositional classes of paraphrases, we have examined a class-oriented framework for collecting paraphrase examples, in which sentential paraphrases are collected for each paraphrase class separately by means of automatic candidate generation and manual judgement. Our preliminary experiments on building a paraphrase corpus have so far been producing promising results, ...

متن کامل

Turkish Paraphrase Corpus

Paraphrases are alternative syntactic forms in the same language expressing the same semantic content. Speakers of all languages are inherently familiar with paraphrases at different levels of granularity (lexical, phrasal, and sentential). For quite some time, the concept of paraphrasing is getting a growing attention by the research community and its potential use in several natural language ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2004