Multilingual BERT Based Word Alignment By Incorporating Common Chinese Characters
Authors
Abstract
Word alignment is an important task of detecting translation equivalents between a sentence pair. Although word alignment is no longer necessarily needed for neural machine translation (NMT), it is still useful in a wealth of applications, e.g., bilingual lexicon induction, constraint decoding, and so on. However, the most well-known word aligners are Giza++ and fastAlign, both of which are implementations of traditional IBM models. To keep pace with the advance of NMT, there has been a surge of interest in replacing the IBM models with neural models. We follow this trend but aim to boost the performance of word alignment between Japanese and Chinese, which share a large portion of Chinese characters. Our key idea is to leverage these common Chinese characters between the two languages as an indicator for inferring alignment; i.e., source and target words that share common Chinese characters should be likely aligned. Following this idea, we propose three methods that incorporate common Chinese characters into mBERT-based word alignment, including reward factor, representation alignment, and contrastive training. Furthermore, we annotate and release a golden dataset for Japanese-Chinese word alignment. Experiments on the dataset show that our methods outperform several strong baselines in terms of AER score and verify the effectiveness of exploiting common Chinese characters.
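The key idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the form of the reward factor, so the additive bonus, the `reward` value, the argmax-intersection extraction, the toy similarity matrix, and all function names below are assumptions made for illustration. The sketch also treats two characters as "common" only when their Unicode code points match exactly, whereas a real system would likely need a mapping between Japanese kanji and simplified or traditional Chinese variants. The `aer` helper follows the standard alignment error rate definition over sure and possible gold links.

```python
from typing import List, Set, Tuple

Link = Tuple[int, int]  # (source word index, target word index)

def is_han(ch: str) -> bool:
    """True for characters in the CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"

def shares_han(src_word: str, tgt_word: str) -> bool:
    """Assumption: 'common' means an identical Han code point in both words."""
    src_han = {c for c in src_word if is_han(c)}
    return any(c in src_han for c in tgt_word)

def rewarded_scores(src: List[str], tgt: List[str],
                    sim: List[List[float]], reward: float = 0.5) -> List[List[float]]:
    """Add a fixed bonus (hypothetical reward factor) to pairs sharing a Han character."""
    return [[sim[i][j] + (reward if shares_han(s, t) else 0.0)
             for j, t in enumerate(tgt)]
            for i, s in enumerate(src)]

def argmax_intersection(scores: List[List[float]]) -> Set[Link]:
    """Keep links that are the best choice in both directions."""
    n_src, n_tgt = len(scores), len(scores[0])
    fwd = {(i, max(range(n_tgt), key=lambda j: scores[i][j])) for i in range(n_src)}
    bwd = {(max(range(n_src), key=lambda i: scores[i][j]), j) for j in range(n_tgt)}
    return fwd & bwd

def aer(pred: Set[Link], sure: Set[Link], possible: Set[Link]) -> float:
    """Alignment error rate: 1 - (|A∩S| + |A∩P|) / (|A| + |S|), with S ⊆ P."""
    possible = possible | sure
    return 1.0 - (len(pred & sure) + len(pred & possible)) / (len(pred) + len(sure))

# Toy example; the similarity values stand in for mBERT embedding similarities.
src = ["私", "は", "学生", "です"]   # Japanese
tgt = ["我", "是", "学生"]           # Chinese
sim = [
    [0.90, 0.20, 0.10],
    [0.30, 0.60, 0.20],
    [0.10, 0.50, 0.40],   # "学生" is wrongly closest to "是" before the reward
    [0.20, 0.55, 0.10],
]
gold_sure = {(0, 0), (1, 1), (2, 2)}
gold_possible = {(3, 1)}

baseline = argmax_intersection(sim)
with_reward = argmax_intersection(rewarded_scores(src, tgt, sim))
print("AER without reward:", aer(baseline, gold_sure, gold_possible))
print("AER with reward:   ", aer(with_reward, gold_sure, gold_possible))
```

In this toy case the bonus moves the link for "学生" onto its identical Chinese counterpart, which lowers the AER. The numbers are illustrative only and say nothing about the paper's reported results.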
Similar resources
Japanese-Chinese Phrase Alignment Using Common Chinese Characters Information
We describe a method to detect common Chinese characters between Japanese and Chinese automatically by means of freely available resources and verify the effectiveness of the detecting method. We use a joint phrase alignment model on dependency trees and report results of experiments aimed at improving the alignment quality between Japanese and Chinese by incorporating the common Chinese charac...
Word Order Typology through Multilingual Word Alignment
With massively parallel corpora of hundreds or thousands of translations of the same text, it is possible to automatically perform typological studies of language structure using very large language samples. We investigate the domain of word order using multilingual word alignment and high-precision annotation transfer in a corpus with 1144 translations in 986 languages of the New Testament. Re...
Chinese Word Segmentation by Classification of Characters
During the process of Chinese word segmentation, two main problems occur: segmentation ambiguities and unknown word occurrences. This paper describes a method to solve the segmentation problem. First, we use a dictionary-based approach to segment the text. We apply the Maximum Matching algorithm to segment the text forwards (FMM) and backwards (BMM). Based on the difference between FMM and BMM,...
Improving Word Alignment by Adjusting Chinese Word Segmentation
Most of the current Chinese word alignment tasks often adopt word segmentation systems firstly to identify words. However, word-mismatching problems exist between languages and will degrade the performance of word alignment. In this paper, we propose two unsupervised methods to adjust word segmentation to make the tokens 1-to-1 mapping as many as possible between the corresponding sentences. Th...
Language comparison through sparse multilingual word alignment
In this paper, we propose a novel approach to compare languages on the basis of parallel texts. Instead of using word lists or abstract grammatical characteristics to infer (phylogenetic) relationships, we use multilingual alignments of words in sentences to establish measures of language similarity. To this end, we introduce a new method to quickly infer a multilingual alignment of words, usin...
Journal
Journal title: ACM Transactions on Asian and Low-Resource Language Information Processing
Year: 2023
ISSN: 2375-4699, 2375-4702
DOI: https://doi.org/10.1145/3594634