Lithuanian-Latvian-Lithuanian Parallel Corpus

نویسندگان

  • Andrius Utka
  • Kristine Levane-Petrova
  • Agne Bielinskiene
  • Jolanta Kovalevskaite
  • Erika Rimkute
  • Daira Vevere
چکیده

The goal of the paper is to present different problems related to the building of Parallel Corpus for two small languages, namely, Latvian and Lithuanian. The Lithuanian-Latvian-Lithuania Parallel Corpus (LILA) will contain 8 million running words; will be bidirectional, aligned on the sentence level. The problems include identifying, acquiring, preparing, and aligning parallel texts.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Identification of Multiword Expressions for Latvian and Lithuanian: Hybrid Approach

We discuss an experiment on automatic identification of bi-gram multiword expressions in parallel Latvian and Lithuanian corpora. Raw corpora, lexical association measures (LAMs) and supervised machine learning (ML) are used due to deficit and quality of lexical resources (e.g., POS-tagger, parser) and tools. While combining LAMs with ML is rather effective for other languages, it has shown som...

متن کامل

Improving SMT with Morphology Knowledge for Baltic Languages

In the recent years, several machine translation systems have been built for the Baltic languages. Besides Google and Microsoft machine translation engines and research experiments with statistical MT for Latvian [1] and Lithuanian, there are both English-Latvian [2] and English-Lithuanian [3] rulebased MT systems available. Both Latvian and Lithuanian are morphologically rich languages with qu...

متن کامل

Latvian and Lithuanian Named Entity Recognition with TildeNER

In this paper the author presents TildeNER – an open source freely available named entity recognition toolkit and the first multi-class named entity recognition system for Latvian and Lithuanian languages. The system is built upon a supervised conditional random field classifier and features heuristic and statistical refinement methods that improve supervised classification, thus boosting the o...

متن کامل

SMT of Latvian, Lithuanian and Estonian Languages: a Comparative Study

This paper is an attempt to discover the main challenges in working with Baltic and Estonian languages, and to identify the most significant sources of errors generated by a SMT system trained on large-vocabulary parallel corpora from legislative domain. An immense distinction between Latvian/Lithuanian and Estonian languages causes a set of non-equivalent difficulties which we classify and com...

متن کامل

English-Lithuanian Word Alignment with Bilingwis: Evaluation of the Alignment

This paper presents the first qualitative evaluation of the word-alignment system Bilingwis for the English-Lithuanian language pair. The evaluation was performed by scoring alignments for the most frequent autosemantic words from the English-Lithuanian parallel corpus. The main tendencies revealed by the evaluation are presented and some problematic issues as well as future improvements are di...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2012