Lithuanian-Latvian-Lithuanian Parallel Corpus
Abstract
The goal of the paper is to present the problems involved in building a parallel corpus for two small languages, Latvian and Lithuanian. The Lithuanian-Latvian-Lithuanian Parallel Corpus (LILA) will contain 8 million running words, will be bidirectional, and will be aligned at the sentence level. The problems include identifying, acquiring, preparing, and aligning parallel texts.
Similar resources
Identification of Multiword Expressions for Latvian and Lithuanian: Hybrid Approach
We discuss an experiment on automatic identification of bi-gram multiword expressions in parallel Latvian and Lithuanian corpora. Raw corpora, lexical association measures (LAMs) and supervised machine learning (ML) are used due to deficit and quality of lexical resources (e.g., POS-tagger, parser) and tools. While combining LAMs with ML is rather effective for other languages, it has shown som...
Improving SMT with Morphology Knowledge for Baltic Languages
In recent years, several machine translation systems have been built for the Baltic languages. Besides Google and Microsoft machine translation engines and research experiments with statistical MT for Latvian [1] and Lithuanian, there are both English-Latvian [2] and English-Lithuanian [3] rule-based MT systems available. Both Latvian and Lithuanian are morphologically rich languages with qu...
Latvian and Lithuanian Named Entity Recognition with TildeNER
In this paper the author presents TildeNER – an open source freely available named entity recognition toolkit and the first multi-class named entity recognition system for Latvian and Lithuanian languages. The system is built upon a supervised conditional random field classifier and features heuristic and statistical refinement methods that improve supervised classification, thus boosting the o...
SMT of Latvian, Lithuanian and Estonian Languages: a Comparative Study
This paper is an attempt to discover the main challenges in working with Baltic and Estonian languages, and to identify the most significant sources of errors generated by an SMT system trained on large-vocabulary parallel corpora from the legislative domain. An immense distinction between Latvian/Lithuanian and Estonian languages causes a set of non-equivalent difficulties which we classify and com...
English-Lithuanian Word Alignment with Bilingwis: Evaluation of the Alignment
This paper presents the first qualitative evaluation of the word-alignment system Bilingwis for the English-Lithuanian language pair. The evaluation was performed by scoring alignments for the most frequent autosemantic words from the English-Lithuanian parallel corpus. The main tendencies revealed by the evaluation are presented and some problematic issues as well as future improvements are di...