Handling Technical OOVs in SMT
Abstract
We present a project on machine translation of software help desk tickets, a highly technical text domain. The main source of translation errors was out-of-vocabulary tokens (OOVs), most of which were either in-domain German compounds or technical token sequences that must be preserved verbatim in the output. We describe our efforts on compound splitting and treatment of non-translatable tokens, which lead to a significant translation quality gain.
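The two OOV treatments named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the regular expression, the `__NTn__` placeholder format, the toy vocabulary, and the function names `mask`, `unmask`, and `split_compound` are all assumptions made for illustration. The splitter is a plain vocabulary lookup and ignores German linking elements such as the "s" in "Arbeitsspeicher".

```python
import re

# Hypothetical pattern for tokens to copy verbatim: URLs, Windows paths, and
# identifiers that mix letters with digits or underscores (e.g. error codes).
NON_TRANSLATABLE = re.compile(
    r"https?://\S+|[A-Za-z]:\\\S+|\b(?=\w*[A-Za-z])(?=\w*[\d_])\w+\b"
)

def mask(sentence):
    """Replace non-translatable tokens with indexed placeholders."""
    saved = []

    def repl(match):
        saved.append(match.group(0))
        return f"__NT{len(saved) - 1}__"

    return NON_TRANSLATABLE.sub(repl, sentence), saved

def unmask(translation, saved):
    """Restore the original tokens in the translated sentence."""
    return re.sub(r"__NT(\d+)__", lambda m: saved[int(m.group(1))], translation)

def split_compound(word, vocab, min_len=3):
    """Split `word` into parts found in `vocab`, trying the longest head first.

    Returns the lowercased parts, or [word] unchanged if no full split exists.
    """
    lower = word.lower()
    if lower in vocab:
        return [lower]
    for i in range(len(lower) - min_len, min_len - 1, -1):
        head, tail = lower[:i], lower[i:]
        if head in vocab:
            rest = split_compound(tail, vocab, min_len)
            if all(part in vocab for part in rest):
                return [head] + rest
    return [word]

if __name__ == "__main__":
    src = "Fehler 0x80070005 beim Öffnen von C:\\Users\\max\\config.ini"
    masked, saved = mask(src)
    print(masked)
    print(unmask(masked, saved))
    print(split_compound("Druckertreiberinstallation",
                         {"drucker", "treiber", "installation"}))
```

In a setup like this, masking would run before the sentence is passed to the translation system and unmasking afterwards, so the protected tokens never reach the decoder as unknown words.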
Similar resources
Using BabelNet to Improve OOV Coverage in SMT
Out-of-vocabulary words (OOVs) are a ubiquitous and difficult problem in statistical machine translation (SMT). This paper studies different strategies of using BabelNet to alleviate the negative impact brought about by OOVs. BabelNet is a multilingual encyclopedic dictionary and a semantic network, which not only includes lexicographic and encyclopedic terms, but connects concepts and named en...
Translation of Unknown Words in Low Resource Languages
We address the problem of unknown words, also known as out of vocabulary (OOV) words, in machine translation of low resource languages. Our technique comprises a combination of methods, inspired by the common OOV types observed. We also design evaluation techniques for measuring coverage of OOVs achieved and integrate the new translation candidates in a Statistical Machine Translation (SMT) sys...
Domain Adaptation in SMT of User-Generated Forum Content Guided by OOV Word Reduction: Normalization and/or Supplementary Data?
This paper reports a set of domain adaptation techniques for improving Statistical Machine Translation (SMT) for user-generated web forum content. We investigate both normalization and supplementary training data acquisition techniques, all guided by the aim of reducing the number of Out-Of-Vocabulary (OOV) items in the target language with respect to the training data. We classify OOVs into a s...
SMT-CAT integration in a Technical Domain: Handling XML Markup Using Pre & Post-processing Methods
The increasing use of eXtensible Markup Language (XML) is bringing additional challenges to statistical machine translation (SMT) and computer assisted translation (CAT) workflow integration in the translation industry. This paper analyzes the need to handle XML markup as a part of the translation material in a technical domain. It explores different ways of handling such markup by applying tra...
A Method to Determine How Much Power a SOT23 Can Dissipate in an Application
With the introduction of smaller surface mount (SMT) packages, it is becoming increasingly important to know their maximum power handling capability in specific applications. The power dissipation capability is directly proportional to size. As the size decreases, the amount of power that the package can dissipate decreases. Also, with the development of new high performance packages such as MS...