corpora creation

Extrinsic Corpus Evaluation with a Collocation Dictionary Task

2014

Adam Kilgarriff, Pavel Rychlý, Miloš Jakubíček, Vojtěch Kovář, Vít Baisa, Lucia Kocincová

The NLP researcher or application-builder often wonders “what corpus should I use, or should I build one of my own? If I build one of my own, how will I know if I have done a good job?” Currently there is very little help available for them. They are in need of a framework for evaluating corpora. We develop such a framework, in relation to corpora which aim for good coverage of ‘general languag...

Large lexica for speech-to-speech translation: from specification to creation

2003

Elviira Hartikainen, Giulio Maltese, Asunción Moreno, Shaunie Shammass, Ute Ziegenhain

This paper presents the corpora collection and lexica creation for the purposes of Automatic Speech Recognition (ASR) and Text-to-speech (TTS) that are needed in speech-to-speech translation (SST). These lexica will be specified, built and validated within the scope of the EU-project LC-STAR (Lexica and Corpora for Speech-to-Speech Translation Components) during the years 2002-2005. Large lexic...

Web-based Collaborative Corpus Annotation: Requirements and a Framework Implementation

2010

Kalina Bontcheva, Hamish Cunningham, Ian Roberts, Valentin Tablan

In this paper we present Teamware, a novel web-based collaborative annotation environment which enables users to carry out complex corpus annotation projects, involving less skilled, cheaper annotators working remotely. It has been evaluated by us through the creation of several gold standard corpora, as well as through external evaluation in commercial annotation projects.

Challenges of Cheap Resource Creation

2010

Jirka Hana, Anna Feldman

We describe the challenges of resource creation for a resource-light system for morphological tagging of fusional languages (Feldman and Hana, 2010). The constraints on resources (time, expertise, and money) introduce challenges that are not present in the development of morphological tools and corpora in the usual, resource-intensive way.

Robust Tagging System for Lexicon Creation

2006

Anna Pappa

This paper presents a robust rule-based system of shallow parsing for part-of-speech (PoS) recognition and tagging. Unlike previous work, the system applies parsing to tagging using unsupervised learning methods, with no prior knowledge, training, or pre-tagged corpora. START (System of Textual Analysis Recognition and Tagging) has been evaluated on both French and Greek non-annotated corpora,...

Annotating Web pages for the needs of Web Information Extraction Applications

2003

Georgios Sigletos, Dimitra Farmakiotou, Konstantinos Stamatakis, Georgios Paliouras, Vangelis Karkaletsis

This paper outlines our approach to the creation of annotated corpora for the purposes of Web Information Extraction, and presents the Web Annotation tool. This tool enables the annotation of Web pages from different domains and for different information extraction tasks providing a user-friendly interface to human annotators. Annotated information is stored in a representation format that can ...

Collection, Annotation and Analysis of Gold Standard Corpora for Knowledge-Rich Context Extraction in Russian and German

2013

Anne-Kathrin Schumann

This paper describes the collection, annotation and linguistic analysis of a gold standard for knowledge-rich context extraction on the basis of Russian and German web corpora as part of ongoing PhD thesis work. In the following sections, the concept of knowledge-rich contexts is refined and gold standard creation is described. Linguistic analyses of the gold standard data and their results are...

Large, Multilingual, Broadcast News Corpora for Cooperative Research in Topic Detection and Tracking: The TDT-2 and TDT-3 Corpus Efforts

2000

Christopher Cieri, David Graff, Mark Liberman, Nii Martey, Stephanie Strassel

This paper describes the creation and content of two corpora, TDT-2 and TDT-3, created for the DARPA-sponsored Topic Detection and Tracking project. The research goal in the TDT program is to create the core technology of a news understanding system that can process multilingual news content, categorizing individual stories according to the topic(s) they describe. The research tasks include segment...

The Hamburg Metaphor Database project: issues in resource creation

Journal: Language Resources and Evaluation, 2008

Birte Lönneker-Rodman

This paper concerns metaphor resource creation. It provides an account of methods used, problems discovered, and insights gained at the Hamburg Metaphor Database project, intended to inform similar resource creation initiatives, as well as future metaphor processing algorithms. After introducing the project, the theoretical underpinnings that motivate the subdivision of represented information ...

SemSim: Resources for Normalized Semantic Similarity Computation Using Lexical Networks

2012

Elias Iosif, Alexandros Potamianos

We investigate the creation of corpora from web-harvested data following a scalable approach that has linear query complexity. Individual web queries are posed for a lexicon that includes thousands of nouns and the retrieved data are aggregated. A lexical network is constructed, in which the lexicon nouns are linked according to their context-based similarity. We introduce the notion of semanti...
