Exploiting a parallel TEXT-DATA corpus
Authors
Abstract
In this paper, we describe SUMTIME-METEO, a parallel corpus of naturally occurring weather forecast texts and their corresponding forecast data, that is, the data that the human authors inspected while writing the forecast texts. We have analysed the corpus to acquire the knowledge needed to build a text generator that automatically produces textual weather forecasts from numerical weather prediction data. Although parallel corpora are commonly used for the development and evaluation of machine translation technology, they are fairly novel in the text generation community. In some cases, our analyses of the corpus produced ambiguous results that were not useful and that reflected inconsistencies in the underlying corpus. Despite these internal inconsistencies, the text-data parallel corpus was helpful in generating initial hypotheses, which were then tested with knowledge from other sources. We also describe how we have used the corpus to evaluate our prototype forecast text generator.
Similar resources
Extracting a parallel corpus from comparable documents to improve translation quality in machine translation systems
Data used for training statistical machine translation methods are usually prepared from three resources: parallel, non-parallel, and comparable text corpora. Parallel corpora are an ideal resource for translation, but because such texts are scarce, non-parallel and comparable corpora are also used for parallel text extraction. Most existing methods for exploiting comparable corpora loo...
Comparing k-means clusters on parallel Persian-English corpus
This paper compares clusters of aligned Persian and English texts obtained with the k-means method. Text clustering has many applications in various fields of natural language processing. So far, much research has been carried out on clustering English documents. This raises the question of whether those results extend to other languages. Since the goal of document clustering is grouping of docum...
Improving Machine Translation Performance by Exploiting Non-Parallel Corpora
We present a novel method for discovering parallel sentences in comparable, non-parallel corpora. We train a maximum entropy classifier that, given a pair of sentences, can reliably determine whether or not they are translations of each other. Using this approach, we extract parallel data from large Chinese, Arabic, and English non-parallel newspaper corpora. We evaluate the quality of the extr...
Improving Statistical Machine Translation Performance by Training Data Selection and Optimization
A parallel corpus is an indispensable resource for translation model training in statistical machine translation (SMT). Instead of collecting more and more parallel training corpora, this paper aims to improve SMT performance by exploiting the full potential of the existing parallel corpora. Two kinds of methods are proposed: offline data optimization and online model optimization. The offline method...
Exploiting Variant Corpora for Machine Translation
This paper proposes the use of variant corpora, i.e., parallel text corpora that are equal in meaning but express the content in different ways, in order to improve corpus-based machine translation. Using multiple training corpora of the same content from different sources results in variant models that focus on specific linguistic phenomena covered by the respective corpus. The propos...