corpora creation

Creating Digital Language Resources

2004

Goran Nenadić

We discuss building digital language resources (such as annotated corpora, lexicons, ontologies, terminologies, and tools), which are the main prerequisite for successful communication and information management in the e-society of the 21st century. We give an overview of the main requirements and best practices, and point to necessary steps for creation and maintenance of standards-based and reusable...

Developing a Multilingual Telephone Based Information System in African Languages

2000

Justus C. Roux, Elizabeth C. Botha, Johan A. du Preez

This paper introduces the first project of its kind within the Southern African language engineering context. It focuses on the role of idiosyncratic linguistic and pragmatic features of the different languages concerned and how these features are to be accommodated within (a) the creation of applicable speech corpora and (b) the design of the system at large. An introduction to the multilingua...

TEP: Tehran English-Persian Parallel Corpus

2011

Mohammad Taher Pilehvar, Heshaam Faili, Abdol Hamid Pilehvar

Parallel corpora are one of the key resources in natural language processing. In spite of their importance in many multilingual applications, no large-scale English-Persian corpus has been made available so far, given the difficulties in its creation and the intensive labor required. In this paper, the construction process of the Tehran English-Persian parallel corpus (TEP) using movie subtitles,...

A Multilingual Text Normalization Approach

2011

Brigitte Bigi

The creation of text corpora requires a sequence of processing steps to constitute and normalize the text so that a given application can exploit it directly. This paper presents a generic approach for text normalization and concentrates on the aspects of methodology and linguistic engineering, which serve to develop a multipurpose multilingual text corpus. This approach was applied to French,...

Multilingual linked data

Journal: Semantic Web 2015

John P. McCrae, Steven Moran, Sebastian Hellmann, Martin Brümmer

The interaction of natural language processing and the Semantic Web has led to the creation of a new paradigm known as Linguistic Linked Open Data (LLOD), whereby traditional language resources are made available as linked data. Conversely, the publication of corpora and machine-readable dictionaries as linked data has opened new resources to Semantic Web researchers and allowed new tools to be ...

Diphone collection and synthesis

2000

Kevin A. Lenzo, Alan W. Black

In this paper, we describe the design and collection of corpora for diphone synthesis, the voice building process, and our experience in the creation of a new, publicly available database of ten diphone sets of one American English speaker for the Festival Speech Synthesis System [3], using the FestVox document and tools [1]. In support of our goal to make the tools and techniques available f...

The challenge of domain-independent speech understanding

1998

Robert C. Moore

To achieve widespread acceptance, speech understanding technology needs to be domain independent. Deep understanding, however, appears to require knowledge that is domain specific. Speech understanding technology, therefore, must be partitioned into domain-independent and domain-specific components. Development of domain-independent components could be promoted by creation of semantically annot...

Designing Annotation Tools based on Properties of Annotation Problems

2004

Dennis Reidsma, Nataša Jovanović, Dennis Hofs

The creation of richly annotated, extendable and reusable corpora of multimodal interactions is an expensive and time-consuming task. Support from tools to create annotations is indispensable. This paper argues that annotation tools should be focused on specific classes of annotation problems to make the annotation process more efficient. The central part of the paper discusses how the properti...

Creating a Large-Scale Arabic to French Statistical Machine Translation System

2006

Sasa Hasan, Anas El Isbihani, Hermann Ney

In this work, the creation of a large-scale Arabic to French statistical machine translation system is presented. We introduce all necessary steps, from corpus acquisition and data preprocessing to training and optimizing the system and its eventual evaluation. Since no corpora existed previously, we collected large amounts of data from the web. Arabic word segmentation was crucial to reduce the ove...

Designing a Multimodal Spoken Component of the Australian National Corpus

2009

Michael Haugh

Spoken language and interaction lie at the core of human experience. The primary medium of communication is speech, with some estimating the ratio of spoken-written language to be as high as 90%-10% (Cermák, 2009, p. 115). Yet they have remained poor cousins in the building of corpora to date. Not only are spoken corpora much smaller than written corpora (Xiao, 2008), the overwhelming focus in ...
