Pivot-based word alignment

Author

  • Tom McCoy
Abstract

Word alignment is the task of determining, given two sentences that are translations of each other, which words in one sentence correspond to which words in the other. It is an important step in building a statistical machine translation system, but its success depends heavily on the quantity of training data available. The traditional methods for computational word alignment, proposed by Brown et al. (1993), require large quantities of training data and fall short when such data are not available. To combat this problem, I propose a framework that fills the data gap with data from languages related to the one for which data are lacking. This technique is shown to improve significantly upon the baseline alignment error rate. In my proposed framework, aligning a sentence in a low-resource language with a sentence in a high-resource language follows a three-step procedure. First, a pivot language is chosen that is both high-resource and as closely related to the low-resource language as possible. Second, edit distance is used to create a correspondence between the low-resource language and the pivot language. Third, probabilities trained using the IBM Models are used to create a correspondence between the pivot language and the high-resource language. I test several settings for these three components on the task of aligning Spanish and English, and I find that the most successful overall alignment system uses Portuguese as the pivot language, a cognate-based algorithm for calculating edit distance, and translation, distortion, and alignment probabilities from the IBM Models. With this framework settled on, I then conduct some sample translations from Spanish to English to demonstrate the utility of this approach for machine translation. Despite being trained only on Portuguese, the translator still manages to yield rough translations of Spanish text.
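The pivot procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation: it substitutes plain Levenshtein distance for the cognate-based edit distance, uses only lexical translation probabilities (no distortion or alignment probabilities), and the names `nearest_pivot_word`, `align`, `PIVOT_VOCAB`, and `T_PROB`, along with all probability values, are hypothetical and invented for illustration.

```python
from typing import Dict, List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance: insert, delete, substitute each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def nearest_pivot_word(word: str, pivot_vocab: List[str]) -> str:
    """Map a low-resource word to the closest pivot-language word."""
    return min(pivot_vocab, key=lambda p: edit_distance(word, p))


def align(source: List[str], target: List[str], pivot_vocab: List[str],
          t_prob: Dict[Tuple[str, str], float]) -> List[Tuple[int, int]]:
    """Link each source word to one target word through the pivot language.

    t_prob maps (target_word, pivot_word) to a translation probability,
    standing in for probabilities trained with the IBM Models.
    """
    links = []
    for i, word in enumerate(source):
        pivot = nearest_pivot_word(word, pivot_vocab)
        j = max(range(len(target)),
                key=lambda k: t_prob.get((target[k], pivot), 0.0))
        links.append((i, j))
    return links


# Toy data (assumed, not from the paper): Portuguese pivot vocabulary and
# made-up English-given-Portuguese translation probabilities.
PIVOT_VOCAB = ["o", "gato", "branco"]
T_PROB = {("the", "o"): 0.9, ("cat", "gato"): 0.9, ("white", "branco"): 0.9}

spanish = ["el", "gato", "blanco"]
english = ["the", "white", "cat"]
links = align(spanish, english, PIVOT_VOCAB, T_PROB)
print(links)
```

Each Spanish word is first mapped to its nearest Portuguese word, and the link to English is then chosen from the Portuguese-English table alone, so no Spanish-English training data is needed. A cognate-aware distance would presumably handle pairs that plain Levenshtein distance separates poorly.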


Similar resources

Pivot Alignment

Word alignment of parallel texts is typically carried out using many kinds of knowledge, or information sources, in concert, i.e., it is profitably viewed as a kind of cooperative process, where e.g. distribution, string similarity, cooccurrence statistics, and other information sources are used together. We investigate a novel such information source in this paper, namely the use of a third ...


You'll Take the High Road and I'll Take the Low Road: Using a Third Language to Improve Bilingual Word Alignment

While language-independent sentence alignment programs typically achieve a recall in the 90 percent range, the same cannot be said about word alignment systems, where normal recall figures tend to fall somewhere between 20 and 40 percent, in the language-independent case. As words (and phrases) for various reasons are more interesting to align than sentences, we need methods to increase word al...


Evaluating a Pivot-Based Approach for Bilingual Lexicon Extraction

A pivot-based approach for bilingual lexicon extraction is based on the similarity of context vectors represented by words in a pivot language like English. In this paper, in order to show validity and usability of the pivot-based approach, we evaluate the approach in company with two different methods for estimating context vectors: one estimates them from two parallel corpora based on word as...


Multi-Task Word Alignment Triangulation for Low-Resource Languages

We present a multi-task learning approach that jointly trains three word alignment models over disjoint bitexts of three languages: source, target and pivot. Our approach builds upon model triangulation, following Wang et al., which approximates a source-target model by combining source-pivot and pivot-target models. We develop a MAP-EM algorithm that uses triangulation as a prior, and show how...


Building Bilingual Lexicons using Lexical Translation Probabilities via Pivot Languages

This paper proposes a method of increasing the size of a bilingual lexicon obtained from two other bilingual lexicons via a pivot language. When we apply this approach, there are two main challenges, ambiguity and mismatch of terms; we target the latter problem by improving the utilization ratio of the bilingual lexicons. Given two bilingual lexicons between language pairs Lf –Lp and Lp–Le, we ...




Publication date: 2017