Automatically Improved Category Labels for Syntax-Based Statistical Machine Translation

Author

  • Greg Hanneman
Abstract

A common modeling choice in syntax-based statistical machine translation is the use of synchronous context-free grammars, or SCFGs. When training a translation model in a supervised setting, an SCFG is extracted from parallel text that has been statistically word-aligned and parsed by monolingual statistical parsers. However, the set of syntactic category labels used in a monolingual statistical parser is decided upon quite independently of the machine translation task, and there is no guarantee that it is optimal for a bilingual SCFG or for machine translation at all. In this thesis, we first demonstrate that the set of category labels used in a machine translation system’s grammar strongly affects three inter-related characteristics of the system: spurious ambiguity, rule sparsity, and reordering precision. We propose using these characteristics as the basis for evaluating the properties of an SCFG both outside of and within an actual translation task. Finally, as our main work, we propose three automatic relabeling methods that will create a better set of category labels for a given language pair and choice of automatic parsers. These methods involve clustering and collapsing unnecessary labels, splitting existing labels into multiple subtypes, and swapping specific instances of existing labels to correct for local errors. Improved properties of the grammar and improved translation results will be demonstrated for at least two language pairs.
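
To make the "clustering and collapsing unnecessary labels" idea concrete, here is a minimal sketch of what collapsing does to a grammar's rule counts. It is not the thesis's actual method: the toy rules, the `collapse_labels` helper, and the `label_map` are hypothetical, and the sketch assumes the mapping of labels to merge is already given instead of being learned by clustering.

```python
from collections import Counter


def collapse_labels(rule_counts, label_map):
    """Relabel every symbol via label_map and merge rules that become identical.

    Each rule is (lhs, source_rhs, target_rhs). In this toy setup terminals and
    nonterminals share one namespace, so label_map must only contain labels.
    """
    def relabel(symbol):
        return label_map.get(symbol, symbol)

    merged = Counter()
    for (lhs, src, tgt), count in rule_counts.items():
        new_rule = (relabel(lhs), tuple(map(relabel, src)), tuple(map(relabel, tgt)))
        merged[new_rule] += count  # identical rules pool their counts
    return merged


# Invented toy grammar: two rule pairs differ only in the NN / NNP distinction.
rule_counts = Counter({
    ("NN", ("maison",), ("house",)): 3,
    ("NNP", ("maison",), ("house",)): 1,
    ("NP", ("la", "NN"), ("the", "NN")): 2,
    ("NP", ("la", "NNP"), ("the", "NNP")): 1,
})

# Assumed output of some clustering step: treat NNP as NN.
label_map = {"NNP": "NN"}

collapsed = collapse_labels(rule_counts, label_map)
print(len(rule_counts), "->", len(collapsed))
```

In this toy case the four rules reduce to two, each with pooled counts, which is the sense in which collapsing can reduce rule sparsity. Whether a given merge also harms reordering precision is the kind of trade-off the abstract proposes to measure.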


Similar resources

A Phrase-Boundary Translation Model Using Shallow Syntactic Labels

The phrase-boundary model for statistical machine translation labels rules with classes of the boundary words of target-side phrases in the training corpus. In this paper, we extend the phrase-boundary model with shallow syntactic labels, including POS tags and chunk labels. Giving priority to chunk labels, the proposed model names non-terminals with shallow syntactic labels on the boundaries of ...


Syntax Based Reordering with Automatically Derived Rules for Improved Statistical Machine Translation

Syntax-based reordering has been shown to be an effective way of handling word order differences between source and target languages in Statistical Machine Translation (SMT) systems. We present a simple, automatic method to learn rules that reorder source sentences to more closely match the target language word order, using only a source-side parse tree and automatically generated alignments. Th...


A new model for Persian multi-part words edition based on statistical machine translation

Multi-part words in English are hyphenated, with the hyphen separating the different parts. Persian has multi-part words as well. Persian morphology requires a half-space character to separate the parts of a multi-part word, but in many cases people incorrectly use a space character instead. This common incorrect use of the space leads to some s...


Improving Syntax-Augmented Machine Translation by Coarsening the Label Set

We present a new variant of the Syntax-Augmented Machine Translation (SAMT) formalism with a category-coarsening algorithm originally developed for tree-to-tree grammars. We induce bilingual labels into the SAMT grammar, use them for category coarsening, then project back to monolingual labeling as in standard SAMT. The result is a “collapsed” grammar with the same expressive power and format as...


Syntax-Augmented Machine Translation using Syntax-Label Clustering

Recently, syntactic information has helped significantly to improve statistical machine translation. However, the use of syntactic information may have a negative impact on the speed of translation because of the large number of rules, especially when syntax labels are projected from a parser in syntax-augmented machine translation. In this paper, we propose a syntax-label clustering method tha...




Publication date: 2011