Identification and Translation of Significant Patterns for Cross-Domain SMT Applications

Authors

  • Han-Bin Chen
  • Hen-Hsen Huang
  • Jengwei Tjiu
  • Ching-Ting Tan
  • Hsin-Hsi Chen
Abstract

Adapting statistical machine translation (SMT) systems from generic to specific domains is challenging due to a lack of training data. In this paper we propose a framework for domain adaptation that exploits a large monolingual in-domain corpus. We identify significant patterns that capture domain-specific writing styles. The patterns are then translated with the involvement of domain experts. The central concern of our framework is to reduce the cost of expert involvement and to allocate the experts' effort more effectively. Experimental results show that the proposed methods are effective in terms of the significance and diversity of the patterns. Approaches to integrating the mined patterns into a background SMT system are also discussed.

Similar resources

A Simplification-Translation-Restoration Framework for Cross-Domain SMT Applications

Integration of domain specific knowledge into a general purpose statistical machine translation (SMT) system poses challenges due to insufficient bilingual corpora. In this paper we propose a simplification-translation-restoration (STR) framework for domain adaptation in SMT by simplifying domain specific segments of a text. For an in-domain text, we identify the critical segments and modify th...

A Hybrid Machine Translation System Based on a Monotone Decoder

In this paper, a hybrid Machine Translation (MT) system is proposed by combining the result of a rule-based machine translation (RBMT) system with a statistical approach. The RBMT uses a set of linguistic rules for translation, which leads to better translation results in terms of word ordering and syntactic structure. On the other hand, SMT works better in lexical choice. Therefore, in our sys...

Dynamically Integrating Cross-Domain Translation Memory into Phrase-Based Machine Translation during Decoding

Our previous work focuses on combining translation memory (TM) and statistical machine translation (SMT) when the TM database and the SMT training set are the same. However, the TM database will deviate from the SMT training set in the real task when time goes by. In this work, we concentrate on the task when the TM database and the SMT training set are different and even from different domains...

Cross-Domain and Cross-Language Porting of Shallow Parsing

English was the main focus of attention of the Natural Language Processing (NLP) community for years. As a result, there are significantly more annotated linguistic resources in English than in any other language. Consequently, data-driven tools for automatic text or speech processing are developed mainly for English. Developing similar corpora and tools for other languages is an important issu...

Building Compact Lexicons for Cross-Domain SMT by Mining Near-Optimal Pattern Sets

Statistical machine translation models are known to benefit from the availability of a domain bilingual lexicon. Bilingual lexicons are traditionally comprised of multiword expressions, either extracted from parallel corpora or manually curated. We claim that “patterns”, comprised of words and higher order categories, generalize better in capturing the syntax and semantics of the domain. In thi...

Publication date: 2011