Full-text story alignment models for Chinese-English bilingual news corpora

Authors

  • Bing Zhao
  • Stephan Vogel
Abstract

In this paper, we describe full-text story alignment on Chinese-English bilingual corpora of news data to mine potential parallel data for machine translation. Several standard information retrieval methods are tested, and two translation-model-based alignment models are proposed and studied. Modeling the process of generating the parallel English story from the Chinese story gives significant improvements over the standard information retrieval techniques. Refinements of the alignment model are also proposed and tested in detail. On one day's bilingual news collection, our methods improved the mean reciprocal rank from 0.31 to 0.68.
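The abstract reports results as mean reciprocal rank (MRR). As a minimal sketch of how that metric is commonly computed, the Python below assumes each Chinese story has a ranked list of candidate English stories and exactly one correct match. The function name, the story identifiers, and the toy data are hypothetical and are not taken from the paper.

```python
from typing import Hashable, Mapping, Sequence


def mean_reciprocal_rank(
    rankings: Mapping[Hashable, Sequence[Hashable]],
    gold: Mapping[Hashable, Hashable],
) -> float:
    """Average of 1/rank of the correct candidate over all queries.

    rankings: query id -> candidates ordered from best to worst (assumed input shape).
    gold:     query id -> the single correct candidate (assumed input shape).
    """
    if not rankings:
        raise ValueError("need at least one query")
    total = 0.0
    for query, candidates in rankings.items():
        target = gold[query]
        for rank, candidate in enumerate(candidates, start=1):
            if candidate == target:
                total += 1.0 / rank
                break
        # A query whose correct story never appears in the list contributes 0.
    return total / len(rankings)


# Hypothetical toy data: three Chinese stories, each with ranked English candidates.
example_rankings = {
    "zh_story_1": ["en_a", "en_b", "en_c"],  # correct story at rank 1
    "zh_story_2": ["en_b", "en_a", "en_c"],  # correct story at rank 2
    "zh_story_3": ["en_a", "en_b", "en_c"],  # correct story not retrieved
}
example_gold = {"zh_story_1": "en_a", "zh_story_2": "en_a", "zh_story_3": "en_d"}

print(mean_reciprocal_rank(example_rankings, example_gold))  # (1 + 1/2 + 0) / 3 = 0.5
```

Under this definition, a score of 1.0 would mean that every correct story is ranked first, so the values reported in the abstract are averages of such per-story reciprocal ranks.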


Similar resources

Feature-Based Method for Document Alignment in Comparable News Corpora

In this paper, we present a feature-based method to align documents with similar content across two sets of bilingual comparable corpora from daily news texts. We evaluate the contribution of each individual feature and investigate the incorporation of these diverse statistical and heuristic features for the task of bilingual document alignment. Experimental results on the English-Chinese and E...


Developing Parallel Sense-tagged Corpora with Wordnets

Semantically annotated corpora play an important role in natural language processing. This paper presents the results of a pilot study on building a sense-tagged parallel corpus, part of ongoing construction of aligned corpora for four languages (English, Chinese, Japanese, and Indonesian) in four domains (story, essay, news, and tourism) from the NTU-Multilingual Corpus. Each subcorpus is firs...


Sentence Segmentation Using IBM Word Alignment Model 1

In statistical machine translation, word alignment models are trained on bilingual corpora. Long sentences pose two severe problems: high computational requirements and poor quality of the resulting word alignment. We present a sentence-segmentation method that solves these problems by splitting long sentence pairs. Our approach uses the lexicon information to locate the optimal split po...


Creating a Reusable English-Chinese Parallel Corpus for Bilingual Dictionary Construction

This paper first describes an experiment to construct an English-Chinese parallel corpus, then applies the Uplug word alignment tool to the corpus, and finally produces and evaluates an English-Chinese word list. The Stockholm English-Chinese Parallel Corpus (SEC) was created by downloading English-Chinese parallel corpora from a Chinese web site containing law texts that have been manually trans...


Aligning Parallel Bilingual Corpora Statistically with Punctuation Criteria

We present a new approach to aligning sentences in bilingual parallel corpora based on punctuation, especially for English and Chinese. Although the length-based approach produces high accuracy rates of sentence alignment for clean parallel corpora written in two Western languages, such as French-English or German-English, it does not work as well for parallel corpora that are noisy or written ...




Publication date: 2002