Bilingual Corpus Cleaning Focusing on Translation Literality
Abstract
When we automatically acquire translation knowledge from a bilingual corpus, redundant rules are generated due to translation variety. To overcome this problem, we propose bilingual corpus cleaning based on translation literality. Word-level correspondence and phrase-level correspondence are applied as the criteria of literality. Using these criteria, a bilingual corpus was cleaned, and translation knowledge for a pattern-based MT system was acquired from the cleaned corpus. As a result, the translation quality of the MT system improved even though cleaning reduced the corpus to about 81% of its original size with the word-level literality score and about 87% with the phrase-level literality score.
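The abstract does not define the literality scores, so the following is only a minimal sketch of the general idea of word-level cleaning, not the authors' method. It assumes a hypothetical bilingual lexicon (`lexicon`), a hypothetical score (`word_literality`, the fraction of tokens on both sides that have a lexicon-supported counterpart), and a hypothetical cutoff (`threshold`).

```python
from typing import Dict, List, Set, Tuple

SentencePair = Tuple[List[str], List[str]]


def word_literality(src: List[str], tgt: List[str],
                    lexicon: Dict[str, Set[str]]) -> float:
    """Hypothetical word-level literality score in [0, 1].

    Counts the tokens on both sides that have a counterpart in the
    other sentence according to the lexicon, divided by the total
    number of tokens. This is an assumed definition for illustration.
    """
    total = len(src) + len(tgt)
    if total == 0:
        return 0.0
    tgt_vocab = set(tgt)
    linked_src = 0
    linked_tgt_words: Set[str] = set()
    for word in src:
        matches = lexicon.get(word, set()) & tgt_vocab
        if matches:
            linked_src += 1
            linked_tgt_words |= matches
    linked_tgt = sum(1 for word in tgt if word in linked_tgt_words)
    return (linked_src + linked_tgt) / total


def clean_corpus(pairs: List[SentencePair],
                 lexicon: Dict[str, Set[str]],
                 threshold: float) -> List[SentencePair]:
    """Keep only sentence pairs whose score reaches the (assumed) threshold."""
    return [(s, t) for s, t in pairs
            if word_literality(s, t, lexicon) >= threshold]


if __name__ == "__main__":
    # Toy lexicon and corpus, invented purely for illustration.
    lexicon = {"i": {"watashi"}, "eat": {"taberu"}, "apples": {"ringo"}}
    corpus = [
        (["i", "eat", "apples"], ["watashi", "ringo", "taberu"]),  # literal
        (["help", "yourself"], ["douzo"]),                         # free
    ]
    kept = clean_corpus(corpus, lexicon, threshold=0.5)
    print(len(kept), "of", len(corpus), "pairs kept")
```

In this toy setup the literal pair is kept and the free translation is discarded. A phrase-level variant would score correspondences between aligned phrases rather than single words; its details are likewise not given in the abstract.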
Related resources
Bilingual corpus cleaning focusing on translation literality
When we automatically acquire translation knowledge from a bilingual corpus, redundant rules are generated due to translation variety. To overcome this problem, we propose bilingual corpus cleaning based on translation literality. Word-level correspondence and phrase-level correspondence are applied as the criteria of literality. Using these criteria, a bilingual corpus was cleaned, and transla...
Automatic Thesaurus Generation through Multiple Filtering
In this paper, we propose a method of generating bilingual keyword clusters or thesauri from parallel or comparable bilingual corpora. The method combines morphological and lexical processing, bilingual word alignment, and graph-theoretic cluster generation. An experiment shows that the method is promising. 1 Introduction In this paper, we propose a method of automatic bilingua...
Lattice Score Based Data Cleaning For Phrase-Based Statistical Machine Translation
Statistical machine translation relies heavily on parallel corpora to train its models for translation tasks. While more and more bilingual corpora are readily available, the quality of the sentence pairs should be taken into consideration. This paper presents a novel lattice score-based data cleaning method to select proper sentence pairs from the ones extracted from a bilingual corpus by the ...
Application of Translation Knowledge Acquired by Hierarchical Phrase Alignment for Pattern-based MT
Hierarchical phrase alignment is a method for extracting equivalent phrases from bilingual sentences, even though they belong to different language families. The method automatically extracts transfer knowledge from about 125K English and Japanese bilingual sentences and then applies it to a pattern-based MT system. The translation quality is then evaluated. The knowledge needs to be cleaned, s...
Primary Data Encoding of a Bilingual Corpus
This paper discusses the building of a bilingual corpus of legal and administrative texts, focusing on the encoding of documentation and structural information according to the Corpus Encoding Standard. The corpus is one module in an ongoing research project about (semi-)automatic terminology acquisition at the European Academy Bolzano and will serve as a basis for applying term extraction prog...