Compilation and Exploitation of the IJS-ELAN Parallel Corpus
Abstract
With more and more text available in electronic form, it is becoming relatively easy to obtain digital texts together with their translations. The paper presents the processing steps necessary to compile such texts into parallel corpora, an extremely useful language resource. Parallel corpora can be used as a translation aid for second-language learners, for translators and lexicographers, or as a data source for various language technology tools. We present our work in this direction, which is characterised by the use of open standards for text annotation, the use of publicly available third-party tools, and wide availability of the produced resources. We explain the corpus annotation chain, which involves normalisation, tokenisation, segmentation, alignment, word-class syntactic tagging, and lemmatisation. Two exploitation results over our annotated corpora are also presented, namely a Web concordancer and the extraction of bilingual lexica.
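As a rough illustration of the first steps of such an annotation chain, the following Python sketch normalises, segments, tokenises and aligns a toy Slovene-English text pair, then runs a simple concordance query over the result. It is not the toolchain used for the corpus: all function names are hypothetical, the sentence splitter is a naive regular expression, and the alignment simply pairs sentences by position, which assumes both texts have the same number of sentences. Tagging and lemmatisation are left out.

```python
import re


def normalise(text):
    """Collapse runs of whitespace and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


def segment(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def tokenise(sentence):
    """Split a sentence into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def align(source_sentences, target_sentences):
    """Pair sentences 1:1 by position (assumption: equal sentence counts)."""
    if len(source_sentences) != len(target_sentences):
        raise ValueError("naive 1:1 alignment needs equal sentence counts")
    return list(zip(source_sentences, target_sentences))


def concordance(aligned, word):
    """Return aligned pairs whose source side contains the word, ignoring case."""
    word = word.lower()
    return [
        (src, tgt)
        for src, tgt in aligned
        if word in (tok.lower() for tok in tokenise(src))
    ]


# Toy input, invented for illustration only.
slovene = normalise("To je  korpus. Korpus je poravnan.")
english = normalise("This is a corpus.  The corpus is aligned.")

pairs = align(segment(slovene), segment(english))
hits = concordance(pairs, "korpus")

for src, tgt in hits:
    print(src, "|", tgt)
```

Real parallel texts rarely align sentence for sentence, so a practical aligner would also have to handle 1:2, 2:1 and unmatched segments; the positional pairing here only keeps the sketch short.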
Similar resources
Normalising the IJS-ELAN Slovene-English Parallel Corpus for the Extraction of Multilingual Terminology
Various efforts have been made to develop tools and methods dedicated to the automatic processing of multilingual terminology databases. For that purpose, multilingual parallel corpora have been used as a basic resource. However, most of the neologisms in technical and scientific domains are realised by multiword terms that are rarely identified in parallel corpora. In this paper, w...
Slovene-English Datasets for MT
Advances in machine translation are becoming increasingly dependent on the availability of large scale language resources, in particular parallel corpora. The talk presents Slovene-English language resources that were developed as datasets for translation studies and machine learning programs. Three parallel datasets are introduced: the MULTEXT-East multilingual word-annotated corpus, the IJS-E...
The ELAN Slovene-English Aligned Corpus
Multilingual parallel corpora are a basic resource for research and development of MT. Such corpora are still scarce, especially for lower-diffusion languages. The paper presents a sentence-aligned tokenised Slovene-English corpus, developed in the scope of the EU ELAN project. The corpus contains 1 million words from fifteen recent terminology-rich texts and is encoded according to the Guideli...
Statistical machine translation from Slovenian to English
In this paper, we analyse three statistical models for the machine translation of Slovenian into English. All of them are based on the IBM Model 4, but differ in the type of linguistic knowledge they use. Model 4a uses only basic linguistic units of the text, i.e., words and sentences. In Model 4b, lemmatisation is used as a preprocessing step of the translation task. Lemmatisation also makes i...