corpora creation

A Genre Analysis of Reprint Request E-mails Written by EFL and Physics Professionals

Journal: Journal of Teaching Language Skills, 2012

Majid Hayati Hossein Shokouhi Fahimeh Hadadi

The present study analyzed reprint request e-mail messages written by postgraduates (MA students) in two fields of study, physics and EFL, to identify the differences and similarities between the two e-mail types. To this end, a sample of 100 e-mail messages, 50 physics and 50 EFL, was analyzed according to Swales’ (1990) model for reprint requests and ...

WikiBABEL: A Wiki-style Platform for Creation of Parallel Data

2009

A. Kumaran K. Saravanan Naren Datha B. Ashok Vikram Dendi

In this demo, we present a wiki-style platform – WikiBABEL – that enables easy collaborative creation of multilingual content in many non-English Wikipedias, by leveraging the relatively larger and more stable content in the English Wikipedia. The platform provides an intuitive user interface that maintains the user focus on the multilingual Wikipedia content creation, by engaging search tools f...

Creating Multilingual Parallel Corpora in Indian Languages

2011

Narayan Choudhary Girish Nath Jha

This paper describes the parallel corpora being created simultaneously in 12 major Indian languages, including English, under a nationally funded project named the Indian Languages Corpora Initiative (ILCI), run through a consortium of institutions across India. The project runs in two phases. The first phase of the project has two distinct goals: creating parallel sentence-aligned corp...

Development of a Multilingual Parallel Corpus and a Part-of-Speech Tagger for Afrikaans

2006

Julia S. Trushkina

This paper describes the design and creation of a multilingual parallel corpus for South African languages. It also presents one application of the corpus: the induction of a part-of-speech tagger for Afrikaans from the data. Development of the Afrikaans part-of-speech tagger is based on a modified method for induction of linguistic tools from parallel corpora originally p...

A Language-Resources Approach to Emotion: Corpora for the Analysis of Expressive Speech

2006

Nick Campbell

This paper presents a summary of some expressive speech data collected over a period of several years and suggests that its variation is not best described by the term “emotion”. It further suggests that the term may be misleading when used as a descriptor for the creation of expressive speech corpora. The paper proposes that we might benefit from first considering what other dimensions of speech variatio...

Extending Community Ontology Using Automatically Generated Suggestions

2007

Vít Novácek Maciej Dabrowski Sebastian Ryszard Kruk Siegfried Handschuh

In this paper we propose an ontology (formal knowledge base) creation methodology based on integrating external ontologies into the one developed by a community of domain experts. We present the MarcOntX agent, a service that automates the process of generating suggested changes to the ontology. The suggestions are inferred from external sources, such as large corpora of...

A Computational Platform for Development of Morphologic and Phonetic Lexica

2000

Matej Rojc Zdravko Kacic

Statistical approaches in speech technology, whether based on statistical language models, trees, hidden Markov models or neural networks, are the driving forces behind the creation of language resources (LR), e.g. text corpora, pronunciation lexica and speech databases. This paper presents the system architecture for rapid construction of morphologic and phonetic lexica for the Slovenian language....

In Search of a Gold Standard in Studies of Deception

2012

Stephanie Gokhman Jeff Hancock Poornima Prabhu Myle Ott Claire Cardie

In this study, we explore several popular techniques for obtaining corpora for deception research. Through a survey of traditional as well as non-gold standard creation approaches, we identify advantages and limitations of these techniques for web-based deception detection and offer crowdsourcing as a novel avenue toward achieving a gold standard corpus. Through an in-depth case study of online h...

Corpora of Latin American Spanish for research in prosody and synthesis

2004

Alejandro Renato José A. Alvarez

The present article describes the creation, labelling and main characteristics of a corpus of spoken Latin American Spanish. The corpus was collected with several objectives in mind: a) to fulfill our own research needs in the study of Latin American Spanish prosodic phenomena, where the absence of available corpora has already been noticed [1][6], b) to be able to experiment with prosodic mode...

Towards an environment for the production and the validation of lexical semantic resources

2014

Mikaël Morardo Éric Villemonte de la Clergerie

We present the components of a processing chain for the creation, visualization, and validation of lexical resources (formed of terms and relations between terms). The core of the chain is a component for building lexical networks, relying on Harris’ distributional hypothesis applied to the syntactic dependencies produced by the French parser FRMG on large corpora. Another important aspect conce...
