MultiUN: A Multilingual Corpus from United Nation Documents
Abstract
This paper describes the acquisition, preparation and properties of a corpus extracted from the official documents of the United Nations (UN). This corpus is available in all 6 official languages of the UN, consisting of around 300 million words per language. We describe the methods we used for crawling, document formatting, and sentence alignment. This corpus also includes a common test set for machine translation. We present the results of a French-Chinese machine translation experiment performed on this corpus.
Similar resources
MultiUN v2: UN Documents with Multilingual Alignments
MultiUN is a multilingual parallel corpus extracted from the official documents of the United Nations. It is available in the six official languages of the UN and a small portion of it is also available in German. This paper presents a major update on the first public version of the corpus released in 2010. This version 2 consists of over 513,091 documents, including around 9% of new documents...
Discovering Parallel Text from the World Wide Web
A parallel corpus is a rich linguistic resource for various multilingual text management tasks, including crosslingual text retrieval, multilingual computational linguistics and multilingual text mining. Constructing a parallel corpus requires effective alignment of parallel documents. In this paper, we develop a parallel page identification system for identifying and aligning parallel documents ...
A Corpus for Evaluating Semantic Multilingual Web Retrieval Systems: The Sense Folder Corpus
In this paper, we present the multilingual Sense Folder Corpus. After the analysis of different corpora, we describe the requirements that have to be satisfied for evaluating semantic multilingual retrieval approaches. Justified by the unfulfilled requirements explained, we start creating a small bilingual hand-tagged corpus of 502 documents retrieved from Web searches. The documents contained ...
Towards Multilingual Information Discovery through a SOM based Text Mining approach
Text mining has been gaining popularity in the knowledge discovery field, particularly with the increasing availability of digital documents in various languages from all around the world. However, most current text mining tools focus on processing monolingual documents (particularly English documents); little attention has been paid to applying the techniques to handle the document...
Pseudo-Aligned Multilingual Corpora
In machine translation, document alignment refers to finding correspondences between documents which are exact translations of each other. We define pseudo-alignment as the task of finding topical, as opposed to exact, correspondences between documents in different languages. We apply semi-supervised methods to pseudo-align multilingual corpora. Specifically, we construct a topic-based graph for ea...