corpora creation

Creating Research Corpora for the Computational Study of Music: the case of the CompMusic Project

2014

Xavier Serra

A fundamental concern in music information research is the use of appropriate data sets, or research corpora, on which to perform the needed data processing tasks. These corpora have to be suited to the specific research problems to be addressed, and the design criteria with which to create them constitute a research topic to which not much attention has been paid. In the CompMusic project we are studyin...

Building bilingual lexicon to create Dialect Tunisian corpora and adapt language model

2013

Rahma Boujelbane Mariem Ellouze Siwar BenAyed Lamia Hadrich Belguith

Since the Tunisian revolution, Tunisian Dialect (TD), used in daily life, has become progressively used and represented in interviews, news and debate programs instead of Modern Standard Arabic (MSA). This situation has important negative consequences for natural language processing (NLP): since the spoken dialects are not officially written and do not have standard orthography, it is very costl...

Design of a hybrid high quality machine translation system

2012

Bogdan Babych Kurt Eberle Johanna Geiß Mireia Ginestí-Rosell Anthony Hartley Reinhard Rapp Serge Sharoff Martin Thomas

This paper gives an overview of the ongoing FP7 project HyghTra (2010 – 2014). The HyghTra project is conducted in a partnership between academia and industry involving the University of Leeds and Lingenio GmbH (company). It adopts a hybrid and bootstrapping approach to the enhancement of MT quality by applying rule-based analysis and statistical evaluation techniques to both parallel and compa...

Accurate phrase alignment in a bilingual corpus for EBMT systems

2012

George Tambouratzis Michalis Troullinos Sokratis Sofianopoulos Marina Vassiliou

An ongoing trend in the creation of Machine Translation (MT) systems concerns the automatic extraction of information from large bilingual parallel corpora. As these corpora are expensive to create, the largest possible amount of information needs to be extracted in a consistent manner. The present article introduces a phrase alignment methodology for transferring structural information between...

Cross-Corpus Evaluation of Word Alignment

2008

Sylwia Ozdowska

We present the procedures we implemented to carry out system-oriented evaluation of a syntax-based word aligner, ALIBI. We take the approach of regarding cross-corpus evaluation as part of system-oriented evaluation, assuming that corpus type may impact alignment performance. We test our system on three English–French parallel corpora. The evaluation procedures include the creation of a referenc...

Identifying Similar Words and Contexts in Natural Language with SenseClusters

2005

Ted Pedersen Anagha Kulkarni

SenseClusters is a freely available intelligent system that clusters together similar contexts in natural language text. Thereafter it assigns identifying labels to these clusters based on their content. It is a purely unsupervised approach that is language independent, and uses no knowledge other than what is available in raw un-annotated corpora. In addition to clustering similar contexts, it...

Gesture recognition corpora and tools: A scripted ground truthing method

Journal: Computer Vision and Image Understanding, 2015

Simon Ruffieux Denis Lalanne Elena Mugellini Omar Abou Khaled

This article presents a framework supporting rapid prototyping of multimodal applications, the creation and management of datasets and the quantitative evaluation of classification algorithms for the specific context of gesture recognition. A review of the available corpora for gesture recognition highlights their main features and characteristics. The central part of the article describes a no...

Exploiting Parallel Texts to Produce a Multilingual Sense Tagged Corpus for Word Sense Disambiguation

2005

Lucia Specia Maria das Graças Volpe Nunes Mark Stevenson

We describe an approach to the automatic creation of a sense tagged corpus intended to train a word sense disambiguation (WSD) system for English-Portuguese machine translation. The approach uses parallel corpora, translation dictionaries and a set of straightforward heuristics. In an evaluation with nine corpora containing 10 ambiguous verbs, the approach achieved an average precision of 94%, ...

Interlingual annotation of parallel text corpora: a new framework for annotation and evaluation

Journal: Natural Language Engineering, 2010

Bonnie J. Dorr Rebecca J. Passonneau David Farwell Rebecca Green Nizar Habash Stephen Helmreich Eduard H. Hovy Lori S. Levin Keith J. Miller Teruko Mitamura Owen Rambow Advaith Siddharthan

This paper focuses on an important step in the creation of a system of meaning representation and the development of semantically-annotated parallel corpora, for use in applications such as machine translation, question answering, text summarization, and information retrieval. The work described below constitutes the first effort of any kind to annotate multiple translations of foreign-language...

Decorrelation and Shallow Semantic Patterns for Distributional Clustering of Nouns and Verbs

2008

Yannick Versley

Distributional approximations to lexical semantics are very useful not only in helping the creation of lexical semantic resources (Kilgarriff et al., 2004; Snow et al., 2006), but also when directly applied in tasks that can benefit from large-coverage semantic knowledge such as coreference resolution (Poesio et al., 1998; Gasperin and Vieira, 2004; Versley, 2007), word sense disambiguation (McC...
