Creating a Bi-lingual Entailment Corpus through Translations with Mechanical Turk: $100 for a 10-day Rush
Abstract
This paper reports on experiments in the creation of a bi-lingual Textual Entailment corpus, using a non-expert workforce under strict cost and time limitations ($100, 10 days). To this end, workers were hired for translation and validation tasks through the CrowdFlower channel to Amazon Mechanical Turk. As a result, an accurate and reliable corpus of 426 English/Spanish entailment pairs was produced more cost-effectively than with other crowdsourcing-based methods for acquiring translations. Focusing on two orthogonal dimensions (i.e., the reliability of annotations made by non-experts and the overall corpus creation costs), we summarize the methodology we adopted, the results achieved, the main problems encountered, and the lessons learned.
Similar resources
Divide and Conquer: Crowdsourcing the Creation of Cross-Lingual Textual Entailment Corpora
We address the creation of cross-lingual textual entailment corpora by means of crowdsourcing. Our goal is to define a cheap and replicable data collection methodology that minimizes the manual work done by expert annotators, without resorting to preprocessing tools or already annotated monolingual datasets. In line with recent works emphasizing the need for large-scale annotation efforts for te...
Detecting Cross-Lingual Semantic Divergence for Neural Machine Translation
Parallel corpora are often not as parallel as one might assume: non-literal translations and noisy translations abound, even in curated corpora routinely used for training and evaluation. We use a cross-lingual textual entailment system to distinguish sentence pairs that are parallel in meaning from those that are not, and show that filtering out divergent examples from training improves transl...
Improved Speech-to-Text Translation with the Fisher and Callhome Spanish–English Speech Translation Corpus
Research into the translation of the output of automatic speech recognition (ASR) systems is hindered by the dearth of datasets developed for that explicit purpose. For Spanish–English translation, in particular, most parallel data available exists only in vastly different domains and registers. In order to support research on cross-lingual speech applications, we introduce the Fisher and Callho...
Supplementary Material: Multi-Task Video Captioning with Video and Entailment Generation
1.1.1 Video Captioning Datasets. YouTube2Text or MSVD: The Microsoft Research Video Description Corpus (MSVD) or YouTube2Text (Chen and Dolan, 2011) is used for our primary video captioning experiments. It has 1970 YouTube videos in the wild with many diverse captions in multiple languages for each video. Caption annotations to these videos are collected using Amazon Mechanical Turk (AMT). All ou...
Collection of a Large Database of French-English SMT Output Corrections
Corpus-based approaches to machine translation (MT) rely on the availability of parallel corpora. To produce user-acceptable translation outputs, such systems need high-quality data to be efficiently trained, optimized and evaluated. However, building a high-quality dataset is a relatively expensive task. In this paper, we describe the data collection and analysis of a large database of 10.881 SM...