An Experiment in Hybrid Dictionary and Statistical Sentence Alignment

نویسندگان

  • Nigel Collier
  • Kenji Ono
  • Hideki Hirakawa
چکیده

The task of aligning sentences in parallel corpora of two languages has been well studied using pure statistical or linguistic models. We developed a linguistic method based on lexical matching with a bilingual dictionary and two statistical methods based on sentence length ratios and sentence offset probabilities. This paper seeks to further our knowledge of the alignment task by comparing the performance of the alignment models when used separately and together, i.e. as a hybrid system. Our results show that for our English-Japanese corpus of newspaper articles, the hybrid system using lexical matching and sentence length ratios outperforms the pure methods.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A Hybrid Approach to Align Sentences and Words in English-Hindi Parallel Corpora

In this paper we describe an alignment system that aligns English-Hindi texts at the sentence and word level in parallel corpora. We describe a simple sentence length approach to sentence alignment and a hybrid, multi-feature approach to perform word alignment. We use regression techniques in order to learn parameters which characterise the relationship between the lengths of two sentences in p...

متن کامل

Sentence Alignment of Historical Classics based on Mode Prediction and Term Translation Pairs

Parallel corpora are essential resources for the construction of bilingual term dictionary of historical classics. To obtain large-scale parallel corpora, this paper proposes a sentence alignment method based on mode prediction and term translation pairs. On one hand, the method rebuilds the sentence alignment process according to characteristics of the translation of historical classics, and a...

متن کامل

MathLingBudapest: Concept Networks for Semantic Similarity

We present our approach to measuring semantic similarity of sentence pairs used in Semeval 2015 tasks 1 and 2. We adopt the sentence alignment framework of (Han et al., 2013) and experiment with several measures of word similarity. We hybridize the common vector-based models with definition graphs from the 4lang concept dictionary and devise a measure of graph similarity that yields good result...

متن کامل

Log-Linear Models for Word Alignment

We present a framework for word alignment based on log-linear models. All knowledge sources are treated as feature functions, which depend on the source langauge sentence, the target language sentence and possible additional variables. Log-linear models allow statistical alignment models to be easily extended by incorporating syntactic information. In this paper, we use IBM Model 3 alignment pr...

متن کامل

A New Combined Lexical and Statistical based Sentence Level Alignment Algorithm for Parallel Texts

Parallel texts alignment is an active research area in Natural Language Processing field. In this paper, we propose a method for sentence alignment of parallel texts that is based both on lexical and statistical information. The alignment procedure uses dynamic programming technique. We made our experiments for Spanish and English texts. We use lexical information from bilingual Spanish-English...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 1998