Transferring Markup Tags in Statistical Machine Translation: A Two-Stream Approach
نویسندگان
چکیده
Translation agencies are introducing statistical machine translation (SMT) into the work flow of human translators. Typically, SMT produces a first-draft translation, which is then post-edited by a person. SMT has met much resistance from translators, partly because of professional conservatism, but partly because the SMT community has often neglected some practical aspects of translation. Our paper discusses one of these: transferring formatting tags such as bold or italic from the source to the target document with a low error rate, thus freeing the post-editor from having to reformat SMT-generated text. In our “two-stream” approach, tags are stripped from the input to the decoder, then reinserted into the resulting target-language text. Tag transfer has been tackled by other SMT teams, but only a few have published descriptions of their work. This paper contributes to understanding tag transfer by explaining our approach in detail.
منابع مشابه
Clustered Word Classes for Preordering in Statistical Machine Translation
Clustered word classes have been used in connection with statistical machine translation, for instance for improving word alignments. In this work we investigate if clustered word classes can be used in a preordering strategy, where the source language is reordered prior to training and translation. Part-of-speech tagging has previously been successfully used for learning reordering rules that ...
متن کاملTreatment of Markup in Statistical Machine Translation
Treatment of Markup in Statistical Machine Translation
متن کاملDomain Adaptation in Statistical Machine Translation using Factored Translation Models
In recent years the performance of SMT increased in domains with enough training data. But under real-world conditions, it is often not possible to collect enough parallel data. We propose an approach to adapt an SMT system using small amounts of parallel in-domain data by introducing the corpus identifier (corpus id) as an additional target factor. Then we added features to model the generatio...
متن کاملCommunication-Aware Traffic Stream Optimization for Virtual Machine Placement in Cloud Datacenters with VL2 Topology
By pervasiveness of cloud computing, a colossal amount of applications from gigantic organizations increasingly tend to rely on cloud services. These demands caused a great number of applications in form of couple of virtual machines (VMs) requests to be executed on data centers’ servers. Some of applications are as big as not possible to be processed upon a single VM. Also, there exists severa...
متن کاملTreatment of Markup in Statistical Machine Translation
We present work on handling XML markup in Statistical Machine Translation (SMT). The methods we propose can be used to effectively preserve markup (for instance inline formatting or structure) and to place markup correctly in a machinetranslated segment. We evaluate our approaches with parallel data that naturally contains markup or where markup was inserted to create synthetic examples. In our...
متن کامل