About the creation of a parallel bilingual corpora of web-publications

Authors

  • D. V. Lande
  • V. V. Zhygalo
Abstract

An algorithm for creating parallel text corpora is presented. The algorithm is based on the use of "key words" in text documents and on the means of their automated translation. Key words were singled out using Russian and Ukrainian morphological dictionaries, as well as noun translation dictionaries for the Russian and Ukrainian languages. In addition, empirical-statistical rules were used to calculate the weights of the terms in the documents. The algorithm was implemented as a software suite integrated into the InfoStream content-monitoring system. As a result, a parallel bilingual corpus of web publications containing about 30 thousand documents was created.

1. Introduction

Algorithms for singling out so-called "key words" play an important role in both theory and practice. Many of these algorithms are based on a vector representation and use statistical properties of the texts. Frequency word lists in one or several languages are mostly used in singling out key words (or word bases). This paper describes the creation of a frequency word list based on a morphological dictionary (MD) using a text corpus of documents, as well as the development of an algorithm for singling out key words using the frequency MD and the well-known TF-IDF approach [1]. Based on the analysis of the automatically extracted key words and their translation into another language, a procedure for identifying duplicate documents presented in different languages was implemented. The task of creating multilingual parallel text corpora is currently very relevant [2-4]. The suggested approach made it possible to create a bilingual Ukrainian-Russian parallel corpus of texts from web publications in Russian and Ukrainian. According to experts, the estimated accuracy of the suggested algorithm is 98%.

2. Description of the algorithm

The following procedures were used to create the parallel text corpus:

  • development of MDs;
  • creation of frequency dictionaries on the basis of existing MDs;
  • creation of the translation dictionaries;
  • implementation of the algorithm for singling out key words in a document;
  • translation of the key words of the document into another language;
  • implementation of the algorithm for identifying duplicates based on the analysis of key words and their translations.

2.1. Making morphological dictionaries

Available electronic dictionaries were taken for the Russian and Ukrainian languages (ispell with over 1.102 thousand word forms in the Ukrainian language and Zalizniak's dictionary which …
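
The last three procedures listed in Section 2 (key word extraction, key word translation, and duplicate identification) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the empirical-statistical weighting rules or the matching criterion, so the sketch assumes a smoothed TF-IDF weight, a Jaccard overlap between key word sets, and a hypothetical threshold of 0.5. The function names and the toy Russian-Ukrainian dictionary are likewise illustrative assumptions, and the input is assumed to be already lemmatized with a morphological dictionary.

```python
import math
from collections import Counter


def tf_idf_keywords(doc_tokens, doc_freq, n_docs, top_k=5):
    """Rank the lemmas of one document by TF-IDF and return the top_k.

    doc_freq maps a lemma to the number of corpus documents containing it
    (a stand-in for the frequency dictionary; an assumption of this sketch).
    """
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    scores = {}
    for term, count in counts.items():
        tf = count / total
        # Smoothed IDF: terms absent from doc_freq do not cause division by zero.
        idf = math.log((1 + n_docs) / (1 + doc_freq.get(term, 0))) + 1.0
        scores[term] = tf * idf
    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:top_k]


def translate_keywords(keywords, dictionary):
    """Map key words through a translation dictionary, dropping unknown words."""
    return {dictionary[w] for w in keywords if w in dictionary}


def keyword_overlap(a, b):
    """Jaccard overlap of two key word collections (assumed similarity measure)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def is_parallel_pair(src_keywords, dst_keywords, dictionary, threshold=0.5):
    """Treat two documents as translations of each other if the translated
    source key words overlap enough with the target key words.
    The threshold value is hypothetical."""
    translated = translate_keywords(src_keywords, dictionary)
    return keyword_overlap(translated, dst_keywords) >= threshold


# Toy Russian -> Ukrainian noun dictionary (illustrative only).
RU_UK = {"выборы": "вибори", "город": "місто", "закон": "закон"}

ru_keywords = ["выборы", "город", "неизвестно"]
uk_keywords = ["вибори", "місто", "рада"]
print(is_parallel_pair(ru_keywords, uk_keywords, RU_UK))
```

Comparing key word sets instead of full texts keeps the pairwise check cheap, which matters when each document in one language must be matched against many candidates in the other.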


Similar resources

Automatic Parallel Corpora and Bilingual Terminology extraction from Parallel WebSites

In our days, the notion, the importance and the significance of parallel corpora is so big that needs no special introduction. Unfortunately, public available parallel corpora is somewhat limited in range. There are big corpora about politics or legislation, about medicine and other specific areas, but we miss corpora for other different areas. Currently there is a huge investment on using the ...


Arabic News Articles Classification Using Vectorized-Cosine Based on Seed Documents

Besides for its own merits, text classification (TC) has become a cornerstone in many applications. Work presented here is part of and a pre-requisite for a project we have overtaken to create a corpus for the Arabic text process. It is an attempt to create modules automatically that would help speed up the process of classification for any text categorization task. It also serves as a tool for...


Sentence Alignment in Parallel, Comparable, and Quasi-comparable Corpora

We explore the usability of different bilingual corpora for the purpose of multilingual and cross-lingual natural language processing. The usability of bilingual corpus is evaluated by the lexical alignment score calculated for the bi-lexicon pair distributed in the aligned bilingual sentence pairs. We compare and contrast a number of bilingual corpora, ranging from parallel, to comparable, and...


Terminology-driven Augmentation of Bilingual Terminologies

This paper proposes a way of augmenting bilingual terminologies by using a “generate and validate” method. Using existing bilingual terminologies, the method generates “potential” bilingual multi-word term pairs and validates their status by searching web documents to check whether such terms actually exist in each language. Unlike most existing bilingual term extraction methods, which use para...


The Role of Parallel Corpora in Bilingual Lexicography

This paper describes an approach based on word alignment on parallel corpora, which aims at facilitating the lexicographic work of dictionary building. Although this method has been widely used in the MT community for at least 16 years, as far as we know, it has not been applied to facilitate the creation of bilingual dictionaries for human use. The proposed corpus-driven technique, in particul...



Journal:
  • CoRR

Volume: abs/0807.0311

Publication date: 2008