Aligning and Using an English-Inuktitut Parallel Corpus
Abstract
A parallel corpus of texts in English and in Inuktitut, an Inuit language, is presented. These texts are from the Nunavut Hansards. The parallel texts are processed in two phases, the sentence alignment phase and the word correspondence phase. Our sentence alignment technique achieves a precision of 91.4% and a recall of 92.3%. Our word correspondence technique is aimed at providing the broadest coverage collection of reliable pairs of Inuktitut and English morphemes for dictionary expansion. For an agglutinative language like Inuktitut, this entails considering substrings, not simply whole words. We employ a Pointwise Mutual Information method (PMI) and attain a coverage of 72.3% of English words and a precision of 87%.
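The substring-based PMI idea can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `min_len` cutoff, the whitespace tokenisation, and the use of sentence-level co-occurrence counts are all assumptions made for illustration. For each English word e and each substring s of an Inuktitut word, it computes PMI(e, s) = log2(p(e, s) / (p(e) p(s))), with probabilities estimated as the fraction of aligned sentence pairs in which the items occur.

```python
import math
from collections import Counter


def substrings(word, min_len=3):
    """All contiguous substrings of `word` with at least `min_len` characters.

    `min_len` is an illustrative cutoff, not a value taken from the paper.
    """
    return {
        word[i:j]
        for i in range(len(word))
        for j in range(i + min_len, len(word) + 1)
    }


def pmi_table(aligned_pairs, min_len=3):
    """Return {(english_word, inuktitut_substring): PMI in bits}.

    `aligned_pairs` is a list of (english_sentence, inuktitut_sentence)
    strings, assumed to be already sentence-aligned. Counts are per
    sentence pair: an item is counted once per pair it appears in.
    """
    n = len(aligned_pairs)
    en_count, iu_count, joint = Counter(), Counter(), Counter()
    for en_sent, iu_sent in aligned_pairs:
        en_items = set(en_sent.lower().split())
        iu_items = set()
        for word in iu_sent.lower().split():
            iu_items |= substrings(word, min_len)
        en_count.update(en_items)
        iu_count.update(iu_items)
        joint.update((e, s) for e in en_items for s in iu_items)
    # PMI = log2( (c/n) / ((c_e/n) * (c_s/n)) ) = log2( c*n / (c_e*c_s) )
    return {
        (e, s): math.log2(c * n / (en_count[e] * iu_count[s]))
        for (e, s), c in joint.items()
    }


if __name__ == "__main__":
    # Synthetic placeholder strings, not real corpus data.
    toy_pairs = [
        ("land claim", "nunaxx abc"),
        ("land", "nunayy"),
        ("house", "igluzz"),
    ]
    table = pmi_table(toy_pairs)
    for pair in [("land", "nuna"), ("house", "iglu")]:
        print(pair, round(table[pair], 3))
```

PMI tends to overrate pairs that occur only once or twice, so a practical version would likely add a minimum co-occurrence count before accepting a pair; that threshold is left out here to keep the sketch short.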
Similar resources
Comparing k-means clusters on parallel Persian-English corpus
This paper compares clusters of aligned Persian and English texts obtained with the k-means method. Text clustering has many applications in various fields of natural language processing. So far, much research on clustering English documents has been carried out. The question now arises: are its results extendable to other languages? Since the goal of document clustering is grouping of docum...
A Look at English-Inuktitut Word Alignment
Statistical Machine Translation (SMT) as well as other bilingual applications strongly rely on multilingual corpora aligned at the word level. Efficient alignment techniques have been proposed but are mainly evaluated on pairs of languages where the notion of word is mostly clear. We concentrated our efforts on the English-Inuktitut word alignment task and present two approaches we implemented ...
Supporting Large English-Hindi Parallel Corpus using Word Alignment
This paper describes a methodology for understanding parallel English-Hindi sentences using word alignment. This methodology is the foundation for developing a parallel English-Hindi word dictionary after syntactic and semantic analysis of the English-Hindi source text. The proposed methodology is applied to English and Hindi sentences; it can also be used for othe...
Sentence Alignment in Parallel Corpora :
This report has two aims: to give information about the issues behind corpus alignment and the techniques currently used, and to describe a particular corpus which members of CCL were involved in constructing, the Asahi corpus. The subject of aligning parallel corpora is expanding rapidly, particularly because the bottom-up machine translation (MT) paradigms such as Example-based MT and Statistics-ba...
Aligning a Parallel English-Chinese Corpus Statistically with Lexical Criteria
We describe our experience with automatic alignment of sentences in parallel English-Chinese texts. Our report concerns three related topics: (1) progress on the HKUST English-Chinese Parallel Bilingual Corpus; (2) experiments addressing the applicability of Gale & Church's (1991) length-based statistical method to the task of alignment involving a non-Indo-European language; and (3) an improve...