TectoMT – a Deep-Linguistic Core of the Combined Chimera MT system

Authors

  • Martin POPEL
  • Roman SUDARIKOV
  • Ondřej BOJAR
  • Rudolf ROSA
  • Jan HAJIČ
Abstract

Chimera is a machine translation system that combines the TectoMT deep-linguistic core with the phrase-based MT system Moses. For the English–Czech pair it also uses the Depfix post-correction system. All components run on Unix/Linux platforms and are open source (available from the Perl repository CPAN and the LINDAT/CLARIN repository). The main website is https://ufal.mff.cuni.cz/tectomt. Development is currently supported by the QTLeap FP7 project (http://qtleap.eu).

TectoMT and Chimera

TectoMT (the deep-linguistic core of Chimera) is an open-source MT system based on the Treex platform for general natural-language processing. TectoMT uses a combination of rule-based and statistical (trained) modules ("blocks" in Treex terminology), with a statistical transfer based on HMTM (Hidden Markov Tree Model) at the level of a deep, so-called tectogrammatical representation of sentence structure.

In the Chimera combination, TectoMT is complemented by a Moses PB-SMT system (a factored setup with additional language models over morphological tags) and optionally also by an automatic post-processing (correction) component called Depfix. Chimera can thus be characterized as a hybrid system that combines statistical MT with deep linguistic analysis and an automatic post-correction system, which is especially useful for translation into inflectionally rich languages.

The three systems are combined serially. TectoMT runs first, and an additional Moses phrase table is then extracted from TectoMT's input and output. This additional table is used in a weighted combination with a large Moses translation table to produce the pre-final output. Depfix then re-parses the output (as well as the input) and generates the final output based on rules reflecting morphosyntactic properties of the target language.

Chimera was transferred from English–Czech to three additional language pairs (English to Dutch, Portuguese and Spanish) within the QTLeap FP7 EU project.
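The serial combination described above can be illustrated with a minimal Python sketch. It is not the actual Chimera code: all function names (`extract_phrase_table`, `combine_tables`, `chimera_pipeline`) and the `extra_weight` parameter are hypothetical, the three systems are passed in as plain callables, and the linear mixing of scores is only a toy stand-in for the weighted combination of the two tables, which the abstract does not specify in detail.

```python
from typing import Callable, Dict, List, Tuple

# A toy phrase table: (source phrase, target phrase) -> score.
PhraseTable = Dict[Tuple[str, str], float]


def extract_phrase_table(sources: List[str], translations: List[str]) -> PhraseTable:
    """Toy extraction: treat each (input, TectoMT output) sentence pair as one entry.

    Assumption: a real system would extract sub-sentential phrase pairs.
    """
    return {(src, tgt): 1.0 for src, tgt in zip(sources, translations)}


def combine_tables(large: PhraseTable, extra: PhraseTable, extra_weight: float) -> PhraseTable:
    """Toy weighted combination: linear mix of the scores from the two tables."""
    combined: PhraseTable = {}
    for key in set(large) | set(extra):
        combined[key] = ((1.0 - extra_weight) * large.get(key, 0.0)
                         + extra_weight * extra.get(key, 0.0))
    return combined


def chimera_pipeline(
    sources: List[str],
    tectomt: Callable[[str], str],
    large_table: PhraseTable,
    decode: Callable[[str, PhraseTable], str],
    depfix: Callable[[str, str], str],
    extra_weight: float = 0.5,
) -> List[str]:
    """Run the three stages serially: deep transfer, phrase-based decoding, post-correction."""
    # 1. The deep-linguistic system translates the input first.
    tecto_out = [tectomt(src) for src in sources]
    # 2. An additional table is built from its input and output ...
    extra_table = extract_phrase_table(sources, tecto_out)
    # ... and combined with the large table for the phrase-based decoder.
    table = combine_tables(large_table, extra_table, extra_weight)
    prefinal = [decode(src, table) for src in sources]
    # 3. The post-correction step sees both the input and the pre-final output.
    return [depfix(src, hyp) for src, hyp in zip(sources, prefinal)]
```

Passing the systems in as callables keeps the sketch independent of any real TectoMT, Moses or Depfix interface; only the order of the stages and the data flowing between them follow the description in the abstract.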
References

Dušek, O., Gomes, L., Novák, M., Popel, M., Rosa, R. (2015). New Language Pairs in TectoMT. Proceedings of the 10th Workshop on Machine Translation, ISBN 978-1-941643-32-7, ACL, Stroudsburg, PA, USA, 98–104.

Rosa, R., Dušek, O., Novák, M., Popel, M. (2015). Translation Model Interpolation for Domain Adaptation in TectoMT. Proceedings of the 1st Deep Machine Translation Workshop, ISBN 978-80-904571-7-1, Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics, Prague, Czech Republic, 89–96.

Bojar, O., Tamchyna, A. (2015). CUNI in WMT15: Chimera Strikes Again. Proceedings of the 10th Workshop on Machine Translation, ISBN 978-1-941643-32-7, ACL, Stroudsburg, PA, USA, 79–83.

Proceedings of the 19th Annual Conference of the EAMT: Projects/Products

Related resources

Moses & Treex Hybrid MT Systems Bestiary

Moses is a well-known representative of the phrase-based statistical machine translation systems family, which are known to be extremely poor in explicit linguistic knowledge, operating on flat language representations, consisting only of tokens and phrases. Treex, on the other hand, is a highly linguistically motivated NLP toolkit, operating on several layers of language representation, rich i...

Translation of "It" in a Deep Syntax Framework

We present a novel approach to the translation of the English personal pronoun it to Czech. We conduct a linguistic analysis on how the distinct categories of it are usually mapped to their Czech counterparts. Armed with these observations, we design a discriminative translation model of it, which is then integrated into the TectoMT deep syntax MT framework. Features in the model take advantage...

Dictionary-based Domain Adaptation of MT Systems without Retraining

We describe our submission to the IT-domain translation task of WMT 2016. We perform domain adaptation with dictionary data on already trained MT systems with no further retraining. We apply our approach to two conceptually different systems developed within the QTLeap project: TectoMT and Moses, as well as Chimera, their combination. In all settings, our method improves the translation quality....

PhraseFix: Statistical Post-Editing of TectoMT

We present two English-to-Czech systems that took part in the WMT 2013 shared task: TECTOMT and PHRASEFIX. The former is a deep-syntactic transfer-based system, the latter is a more-or-less standard statistical post-editing (SPE) applied on top of TECTOMT. In a brief survey, we put SPE in context with other system combination techniques and evaluate SPE vs. another simple system combination tec...

Adding syntactic structure to bilingual terminology for improved domain adaptation

Deep-syntax approaches to machine translation have emerged as an alternative to phrase-based statistical systems. TectoMT is an open source framework for transfer-based MT which works at the deep tectogrammatical level and combines linguistic knowledge and statistical techniques. When adapting to a domain, terminological resources improve results with simple techniques, e.g. force-translating d...

Publication date: 2016