PersianSMT: A first attempt to English-Persian Statistical Machine Translation
Authors
Abstract
This paper describes an attempt to build a phrase-based statistical machine translation system between English and Persian (PersianSMT). Part of this work is the creation, from movie subtitles, of the largest English-Persian parallel corpus presented to date. The work pursues two goals: first, to identify the main problems observed in the output of the PersianSMT system and set a baseline for further experiments; second, to check whether movie subtitles can provide a corpus of sufficient quality for developing a general-purpose translator. Finally, translations produced by the PersianSMT system with different language models are evaluated on test sets from different domains, and the results are compared with those of the Google statistical machine translator. According to the BLEU scores obtained, the proposed SMT system strongly outperforms the Google translator on both in-domain (movie subtitle) and out-of-domain sentences.
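The BLEU comparison mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' evaluation script: the function names `ngrams` and `corpus_bleu` are hypothetical, and the sketch assumes pre-tokenized input, a single reference per sentence, n-gram orders up to 4, and no smoothing.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU, one reference per hypothesis, no smoothing.

    Illustrative helper (an assumption, not the paper's tooling).
    Both arguments are lists of token lists. Returns a value in [0, 1].
    """
    matches = [0] * max_n  # clipped n-gram matches, per n-gram order
    totals = [0] * max_n   # hypothesis n-gram counts, per n-gram order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            ref_counts = ngrams(ref, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())

    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # unsmoothed BLEU is zero when any order has no match

    # Geometric mean of the modified n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize hypotheses shorter than the references.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


if __name__ == "__main__":
    hyp = ["the cat sat on the mat".split()]
    ref = ["the cat sat on a mat".split()]
    print(round(corpus_bleu(hyp, ref), 4))
```

Comparing two systems in this way means scoring each system's output against the same reference translations, separately for each test domain; published comparisons usually rely on an established BLEU implementation so that tokenization and smoothing are consistent.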
Related resources
Persian-Spanish Low-Resource Statistical Machine Translation Through English as Pivot Language
This paper is an attempt to exclusively focus on investigating the pivot language technique in which a bridging language is utilized to increase the quality of the Persian–Spanish low-resource Statistical Machine Translation (SMT). In this case, English is used as the bridging language, and the Persian–English SMT is combined with the English–Spanish one, where the relatively large corpora of e...
A Comparative Study of English-Persian Translation of Neural Google Translation
Many studies abroad have focused on neural machine translation and almost all concluded that this method was much closer to humanistic translation than machine translation. Therefore, this paper aimed at investigating whether neural machine translation was more acceptable in English-Persian translation in comparison with machine translation. Hence, two types of text were chosen to be translated...
A new model for persian multi-part words edition based on statistical machine translation
Multi-part words in English are hyphenated, with the hyphen separating their parts. Persian has multi-part words as well. In Persian morphology, a half-space character is needed to separate the parts of a multi-part word, but in many cases people incorrectly use a space character instead. This common incorrect use of the space leads to some s...
Performance evaluation of various training data in English-Persian Statistical Machine Translation
Globalization and the continued increase in international travel and commerce have made automatic translation systems an attractive area of research and development. Even as technology opens up e-commerce opportunities, companies must overcome language barriers to reach new potential customers and business partners. With the advent of Web2.0 technologies, machine translation and tools like Goog...
The Correlation of Machine Translation Evaluation Metrics with Human Judgement on Persian Language
Machine Translation Evaluation Metrics (MTEMs) are the central core of Machine Translation (MT) engines as they are developed based on frequent evaluation. Although MTEMs are widespread today, their validity and quality for many languages are still in question. The aim of this research study was to examine the validity and assess the quality of MTEMs from Lexical Similarity set on machine tra...