Co-occurrence Degree Based Word Alignment in Statistical Machine Translation

Authors

Chenggang Mi

Yating Yang

Lei Wang

Xiao Li

Abstract

To alleviate the data sparseness problem in word alignment, we propose a word alignment method based on word co-occurrence degree, a new way of extracting statistical information from word co-occurrence. The co-occurrence degree combines co-occurrence counts with fuzzy co-occurrence weights. Fuzzy co-occurrence weights are obtained by searching for fuzzy co-occurrence word pairs and computing the length differences between the current word and the other words in those pairs. Experiments show that both word alignment quality and translation performance improve.
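
The abstract does not give the formula for the co-occurrence degree or the criterion for a fuzzy co-occurrence pair, so the following Python sketch is only one possible reading, not the authors' implementation. It assumes that two source words form a fuzzy pair when they share a common prefix (`min_prefix` is a hypothetical parameter), that the fuzzy weight decays as `1 / (1 + length difference)`, and that the degree is the raw count plus the weighted counts of fuzzy partners. All function names are illustrative.

```python
from collections import Counter
from itertools import product


def cooccurrence_counts(bitext):
    """Count how many sentence pairs contain each (source, target) word pair."""
    counts = Counter()
    for src_words, tgt_words in bitext:
        for pair in product(set(src_words), set(tgt_words)):
            counts[pair] += 1
    return counts


def is_fuzzy_pair(a, b, min_prefix=4):
    """Assumed criterion: distinct words that share their first min_prefix characters."""
    if a == b or len(a) < min_prefix or len(b) < min_prefix:
        return False
    return a[:min_prefix] == b[:min_prefix]


def fuzzy_weight(a, b):
    """Assumed weight: shrinks as the length difference between the words grows."""
    return 1.0 / (1.0 + abs(len(a) - len(b)))


def cooccurrence_degree(bitext, min_prefix=4):
    """Combine exact counts with weighted counts borrowed from fuzzy partners."""
    counts = cooccurrence_counts(bitext)
    src_vocab = {src for src, _ in counts}
    degree = {}
    for (src, tgt), count in counts.items():
        borrowed = sum(
            fuzzy_weight(src, other) * counts.get((other, tgt), 0)
            for other in src_vocab
            if is_fuzzy_pair(src, other, min_prefix)
        )
        degree[(src, tgt)] = count + borrowed
    return degree


if __name__ == "__main__":
    # Toy data: "kitablar" is treated as a fuzzy partner of "kitab".
    toy_bitext = [
        (["kitab"], ["book"]),
        (["kitablar"], ["book"]),
        (["kitab"], ["book"]),
    ]
    for pair, value in sorted(cooccurrence_degree(toy_bitext).items()):
        print(pair, value)
```

Under these assumptions, a rare inflected form borrows evidence from its more frequent variants, which is one way such a score could ease data sparseness. How the degree enters the alignment model is not stated in the abstract and is left out here.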

Similar resources

Topic Models + Word Alignment = A Flexible Framework for Extracting Bilingual Dictionary from Comparable Corpus

We propose a flexible and effective framework for extracting a bilingual dictionary from comparable corpora. Our approach is based on a novel combination of topic modeling and word alignment techniques. Intuitively, our approach works by converting a comparable document-aligned corpus into a parallel topic-aligned corpus, then learning word alignments using co-occurrence statistics. This topica...

Enhancing Statistical Machine Translation with Character Alignment

The dominant practice in statistical machine translation (SMT) uses the same Chinese word segmentation specification in both the alignment and translation rule induction steps when building a Chinese-English SMT system. This may be suboptimal, because the word segmentation that is better for alignment is not necessarily better for translation. To tackle this, we propose a framework that uses two di...

Using Word-Dependent Transition Models in HMM-Based Word Alignment for Statistical Machine Translation

In this paper, we present a Bayesian Learning based method to train word dependent transition models for HMM based word alignment. We present word alignment results on the Canadian Hansards corpus as compared to the conventional HMM and IBM model 4. We show that this method gives consistent and significant alignment error rate (AER) reduction. We also conducted machine translation (MT) experime...

Character-Cluster-Based Segmentation using Monolingual and Bilingual Information for Statistical Machine Translation

We present a novel segmentation approach for Phrase-Based Statistical Machine Translation (PB-SMT) for languages where word boundaries are not obviously marked, using both monolingual and bilingual information, and demonstrate that (1) an unsegmented corpus can provide nearly identical results compared to a manually segmented corpus in a PB-SMT task when a good heuristic character clustering...

Improving Statistical Word Alignment with a Rule-Based Machine Translation System

The main problems of statistical word alignment are that source words can only be aligned to one target word, and that an inappropriate target word is selected because of the data sparseness problem. This paper proposes an approach to improve statistical word alignment with a rule-based translation system. This approach first uses IBM statistical translation model to perform alignment...

Publication date: 2015
