SMT-CAT integration in a Technical Domain: Handling XML Markup Using Pre & Post-processing Methods
Abstract
The increasing use of eXtensible Markup Language (XML) is bringing additional challenges to the integration of statistical machine translation (SMT) and computer-assisted translation (CAT) workflows in the translation industry. This paper analyzes the need to handle XML markup as part of the translation material in a technical domain. It explores different ways of handling such markup by applying transducers in pre- and post-processing steps. A series of experiments indicates that XML markup needs specific treatment in certain scenarios. One of the proposed methods not only satisfies the SMT-CAT integration need, but also provides slightly improved translation results on English-to-Spanish and English-to-French translations, compared to having no additional pre- or post-processing steps.
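As a rough illustration of what a pre- and post-processing pair around an SMT system can look like, the sketch below masks inline XML tags with placeholder tokens before translation and restores them afterwards. It is not the paper's method: the abstract does not describe its transducers in detail, so the placeholder format `__TAGn__`, the function names `mask_tags` and `unmask_tags`, and the simple tag regex are all assumptions made for this example.

```python
import re

# Assumption: tags are simple "<...>" spans with no nested angle brackets.
TAG_RE = re.compile(r"<[^<>]+>")
PLACEHOLDER_RE = re.compile(r"__TAG(\d+)__")


def mask_tags(segment):
    """Pre-processing: replace each XML tag with a numbered placeholder.

    Returns the masked segment and the list of original tags, in order.
    """
    tags = []

    def _to_placeholder(match):
        tags.append(match.group(0))
        return f"__TAG{len(tags) - 1}__"

    return TAG_RE.sub(_to_placeholder, segment), tags


def unmask_tags(translated, tags):
    """Post-processing: put the original tags back in place of placeholders.

    Placeholders with an unknown index are left untouched.
    """

    def _to_tag(match):
        index = int(match.group(1))
        return tags[index] if index < len(tags) else match.group(0)

    return PLACEHOLDER_RE.sub(_to_tag, translated)


if __name__ == "__main__":
    source = "Click <b>Save</b> to continue."
    masked, tags = mask_tags(source)
    print(masked)
    # Stand-in for the SMT system's output on the masked segment (hypothetical).
    translated = "Haga clic en __TAG0__Guardar__TAG1__ para continuar."
    print(unmask_tags(translated, tags))
```

A scheme like this only works if the translation system passes the placeholder tokens through unchanged, which is itself an assumption that would need to be checked for any real SMT setup.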
Similar resources
Enhancing Statistical Machine Translation with Bilingual Terminology in a CAT Environment
In this paper, we address the problem of extracting and integrating bilingual terminology into a Statistical Machine Translation (SMT) system for a Computer Aided Translation (CAT) tool scenario. We develop a framework that, taking as input a small amount of parallel in-domain data, gathers domain-specific bilingual terms and injects them in an SMT system to enhance the translation productivity...
Identification of Bilingual Terms from Monolingual Documents for Statistical Machine Translation
The automatic translation of domain-specific documents is often a hard task for generic Statistical Machine Translation (SMT) systems, which are not able to correctly translate the large number of terms encountered in the text. In this paper, we address the problems of automatic identification of bilingual terminology using Wikipedia as a lexical resource, and its integration into an SMT system...
Treatment of Markup in Statistical Machine Translation
We present work on handling XML markup in Statistical Machine Translation (SMT). The methods we propose can be used to effectively preserve markup (for instance inline formatting or structure) and to place markup correctly in a machine-translated segment. We evaluate our approaches with parallel data that naturally contains markup or where markup was inserted to create synthetic examples. In our...
Application of Web Mining with XML Data using XQuery
In recent years XML has become very popular for representing semi-structured data and a standard for data exchange over the web. Mining XML data from the web is becoming increasingly important. Several encouraging attempts at developing methods for mining XML data have been proposed. However, efficiency and simplicity are still a barrier for further development. Normally, preprocessing or post-...
On the Sequencing of Tree Structures for XML Indexing
Sequence-based XML indexing aims at avoiding expensive join operations in query processing. It transforms structured XML data into sequences so that a structured query can be answered holistically through subsequence matching. In this paper, we address the problem of query equivalence with respect to this transformation, and we introduce a performance-oriented principle for sequencing tree stru...