Co-occurrence Degree Based Word Alignment in Statistical Machine Translation
Abstract
To alleviate data sparseness during word alignment, we propose a word alignment method based on word co-occurrence degree, a new way of extracting statistical information from word co-occurrence. The co-occurrence degree combines co-occurrence counts with fuzzy co-occurrence weights. The fuzzy co-occurrence weights are obtained by searching for fuzzy co-occurrence word pairs and computing the length differences between the current word and the other words in those pairs. Experiments show that both word alignment quality and translation performance improve.
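The abstract does not give the formulas, so the following is only a minimal sketch of how a co-occurrence degree of this kind might be computed, not the authors' method. It assumes that a "fuzzy" pair is two distinct source words sharing a four-character prefix, and that a fuzzy pair's contribution decays as 1 / (1 + length difference). The names `is_fuzzy_match`, `cooccurrence_degree`, and `min_prefix` are hypothetical.

```python
from collections import Counter


def cooccurrence_counts(sentence_pairs):
    """Count the sentence pairs in which source word s and target word t both occur."""
    counts = Counter()
    for src, tgt in sentence_pairs:
        for s in set(src):
            for t in set(tgt):
                counts[(s, t)] += 1
    return counts


def is_fuzzy_match(word, other, min_prefix=4):
    """Assumed similarity test: distinct words sharing a prefix of min_prefix characters."""
    return (
        word != other
        and len(word) >= min_prefix
        and word[:min_prefix] == other[:min_prefix]
    )


def cooccurrence_degree(s, t, counts, src_vocab):
    """Exact co-occurrence count plus length-discounted counts of fuzzy matches (assumed form)."""
    degree = float(counts[(s, t)])
    for other in src_vocab:
        if is_fuzzy_match(s, other) and counts[(other, t)] > 0:
            # Assumed weighting: a smaller length difference gives a larger weight.
            degree += counts[(other, t)] / (1 + abs(len(s) - len(other)))
    return degree


# Toy parallel corpus, invented purely for illustration.
pairs = [
    (["translate", "text"], ["traduire", "texte"]),
    (["translated", "text"], ["traduit", "texte"]),
    (["translation"], ["traduction"]),
]
counts = cooccurrence_counts(pairs)
vocab = {w for src, _ in pairs for w in src}
degree = cooccurrence_degree("translate", "texte", counts, vocab)
```

In this toy corpus, "translate" co-occurs with "texte" only once, but the similar word "translated" also co-occurs with it, so the degree rises above the raw count. This is the kind of smoothing that could help rare word forms under data sparseness.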
Similar resources
Topic Models + Word Alignment = A Flexible Framework for Extracting Bilingual Dictionary from Comparable Corpus
We propose a flexible and effective framework for extracting a bilingual dictionary from comparable corpora. Our approach is based on a novel combination of topic modeling and word alignment techniques. Intuitively, our approach works by converting a comparable document-aligned corpus into a parallel topic-aligned corpus, then learning word alignments using co-occurrence statistics. This topica...
Enhancing Statistical Machine Translation with Character Alignment
The dominant practice of statistical machine translation (SMT) uses the same Chinese word segmentation specification in both the alignment and translation rule induction steps when building a Chinese-English SMT system. This may be suboptimal, because the word segmentation that is better for alignment is not necessarily better for translation. To tackle this, we propose a framework that uses two di...
Using Word-Dependent Transition Models in HMM-Based Word Alignment for Statistical Machine Translation
In this paper, we present a Bayesian Learning based method to train word dependent transition models for HMM based word alignment. We present word alignment results on the Canadian Hansards corpus as compared to the conventional HMM and IBM model 4. We show that this method gives consistent and significant alignment error rate (AER) reduction. We also conducted machine translation (MT) experime...
Character-Cluster-Based Segmentation using Monolingual and Bilingual Information for Statistical Machine Translation
We present a novel segmentation approach for Phrase-Based Statistical Machine Translation (PB-SMT) for languages where word boundaries are not obviously marked, using both monolingual and bilingual information, and demonstrate that (1) an unsegmented corpus is able to provide a nearly identical result compared to a manually segmented corpus in a PB-SMT task when a good heuristic character clustering...
Improving Statistical Word Alignment with a Rule-Based Machine Translation System
The main problems of statistical word alignment are that source words can only be aligned to one target word, and that an inappropriate target word is selected because of the data sparseness problem. This paper proposes an approach to improve statistical word alignment with a rule-based translation system. This approach first uses IBM statistical translation model to perform alignment...