Automatic Alignment of Japanese and English Newspaper Articles using an MT System and a Bilingual Company Name Dictionary
Abstract
One of the crucial parts of any corpus-based machine translation system is a large-scale bilingual corpus that is aligned at various levels, such as the sentence and phrase levels. This kind of corpus, however, is not easy to obtain, and accordingly there is a great need for an efficient construction method. We approach this problem by integrating two large monolingual corpora in two different languages that share the same source of information. Such a situation is common in journalistic texts, where the same events are reported in many languages. Unfortunately, these corpora often lack article-level alignment information, and recovering it is the first problem to solve. In this paper, we report a method of automatically aligning Japanese and English newspaper articles in the financial and economic news domain. Although conventional methods require some manual work, the proposed method works fully automatically. We show that our method can align such newspaper articles with an accuracy of 97%.
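The abstract does not spell out the alignment procedure, so the following is only an illustrative sketch of one way a bilingual company name dictionary could support article-level alignment. It assumes each Japanese article is paired with the English article whose set of mentioned companies overlaps most (Jaccard similarity). The dictionary entries, function names, and the 0.5 threshold are all hypothetical. The MT component named in the title is not modeled here.

```python
from typing import Dict, List, Optional, Set

# Hypothetical toy dictionary: Japanese company name -> English company name.
COMPANY_DICT: Dict[str, str] = {
    "トヨタ自動車": "Toyota Motor",
    "ソニー": "Sony",
    "日立製作所": "Hitachi",
}


def companies_in_japanese(text: str) -> Set[str]:
    """Return English names of dictionary companies mentioned in a Japanese text."""
    return {en for ja, en in COMPANY_DICT.items() if ja in text}


def companies_in_english(text: str) -> Set[str]:
    """Return dictionary companies mentioned in an English text (case-insensitive)."""
    lowered = text.lower()
    return {en for en in COMPANY_DICT.values() if en.lower() in lowered}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two sets; 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def align_articles(
    ja_articles: List[str],
    en_articles: List[str],
    threshold: float = 0.5,  # assumed cutoff, not taken from the paper
) -> Dict[int, Optional[int]]:
    """Map each Japanese article index to its best English match, or None."""
    alignment: Dict[int, Optional[int]] = {}
    for i, ja in enumerate(ja_articles):
        ja_names = companies_in_japanese(ja)
        best_j: Optional[int] = None
        best_score = 0.0
        for j, en in enumerate(en_articles):
            score = jaccard(ja_names, companies_in_english(en))
            if score > best_score:
                best_j, best_score = j, score
        alignment[i] = best_j if best_score >= threshold else None
    return alignment


ja_docs = [
    "トヨタ自動車は新型車を発表した。",
    "ソニーと日立製作所が提携した。",
    "天気は晴れです。",
]
en_docs = [
    "Sony and Hitachi announced a partnership.",
    "Toyota Motor unveiled a new model.",
]
result = align_articles(ja_docs, en_docs)
```

In this toy setting, an article that mentions no dictionary company is left unaligned rather than forced into a match. A real system would need further evidence, such as dates or translated content, to separate articles that mention the same companies.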
Similar resources
An Experiment in Hybrid Dictionary and Statistical Sentence Alignment
The task of aligning sentences in parallel corpora of two languages has been well studied using pure statistical or linguistic models. We developed a linguistic method based on lexical matching with a bilingual dictionary and two statistical methods based on sentence length ratios and sentence offset probabilities. This paper seeks to further our knowledge of the alignment task by comparing the...
Building Japanese-English Dictionary based on Ontology for Machine Translation
1. Introduction. This paper describes a semi-automatic method for associating a Japanese lexicon with a semantic concept taxonomy using a Japanese-English bilingual dictionary as a "bridge", in order to support semantic processing in a knowledge-based machine translation (MT) system. To enhance the semantic processing in MT systems, many systems include conceptual networks called ontol...
Extracting Bilingual Collocations from Non-Aligned Parallel Corpora
This paper proposes a new method to find correspondences of uninterrupted collocations from Japanese-English bilingual corpora without sentence-to-sentence alignment. Uninterrupted collocations in English such as “once again”, “give up”, or “gross national product” handled as a single word or a compound word in Japanese, can be automatically extracted with corresponding Japanese words using wor...
An Experiment in Word Alignment with a Parallel Corpus
This report documents an experiment done on word alignment using a parallel, sentence-aligned corpus. The languages are English and Japanese and the corpus is derived from the Asahi Shinbun daily newspaper editorials. The aims of the experiment are: to find out how accurate word alignment is with simple pattern matching; to find out how useful a conventional English-Japanese lexicon is; to observe ...
An Investigation into Bilingual Dictionary Use: Do the Frequency of Use and Type of Dictionary Make a Difference in L2 Writing Performance?
Bilingual dictionary use in L2 writing test performance has recently been the subject of debate. Opinions differ according to how the trait is understood and whether the system favors the process-oriented or product-oriented views towards the assessment and writing skill. Given the need for more empirical support, this study is aimed at investigating the availability of bilingual dictionary use...