Creation of speech corpora for the multilingual Bonn Open Synthesis System

2001

Esther Klabbers Karlheinz Stöber

In this paper we present the procedure for creating a new speech corpus for the Bonn Open Synthesis System (BOSS). BOSS has several advantages which make this procedure particularly straightforward and fast. BOSS is open source, allowing flexible use of components and corpora. It shows a clear separation between data and architecture, which means that a change in corpus does not require a chang...


Using an underspecified ASR system as an indicator for phonetic similarity

2009

Mark Kane Julie Mauclair Julie Carson-Berndsen

This paper presents a novel approach to the identification of phonetic similarity using properties observed during the speech recognition process. An experiment is presented whereby specific phones are removed during the training phase of a statistical speech recognition system so that the behaviour of the system can be analysed to see which alternative phone is selected. The domain of the anal...


A Finite-state Morphological Analyser for Tuvan

2016

Francis M. Tyers Aziyana Bayyr-ool Aelita Salchak Jonathan Washington

This paper describes the development of free/open-source finite-state morphological transducers for Tuvan, a Turkic language spoken in and around the Tuvan Republic in Russia. The finite-state toolkit used for the work is the Helsinki Finite-State Toolkit (HFST); we use the lexc formalism for modelling the morphotactics and the twol formalism for modelling morphophonological alternations. We presen...


The Development of the "Index Thomisticus" Treebank Valency Lexicon

2009

Barbara McGillivray Marco Passarotti

We present a valency lexicon for Latin verbs extracted from the Index Thomisticus Treebank, a syntactically annotated corpus of Medieval Latin texts by Thomas Aquinas. In our corpus-based approach, the lexicon reflects the empirical evidence of the source data. Verbal arguments are induced directly from annotated data. The lexicon contains 432 Latin verbs with 270 valency frames. The lexicon is...


Speeding Up Exact Motif Discovery by Bounding the Expected Clump Size

2010

Tobias Marschall Sven Rahmann

The overlapping structure of complex patterns, such as IUPAC motifs, significantly affects their statistical properties and should be taken into account in motif discovery algorithms. The contribution of this paper is twofold. On the one hand, we give surprisingly simple formulas for the expected size and weight of motif clumps (maximal overlapping sets of motif matches in a text). In contrast ...


Data Refining for Text Mining Process in Aviation Safety Data

2009

Olli Sjöblom

Successful data mining is an iterative process during which data is refined and adjusted to achieve more accurate mining results. The most important tools in the text mining context are a list of stop words and a list of synonyms. The size and richness of these lists depend on the structure of the language used in the text to be mined. English, for example, is an “easy” language for search...


Bilbo-Val: Automatic Identification of Bibliographical Zone in Papers

2016

Amal Htait Sébastien Fournier Patrice Bellot

In this paper, we present the automatic annotation of the bibliographical reference zone in papers and articles in XML/TEI format. Our work proceeds in two phases: first, we use machine learning to classify bibliographical and non-bibliographical paragraphs in papers, by means of a model that was initially created to differentiate between the footnotes containing or not containi...


ELCO3: Entity Linking with Corpus Coherence Combining Open Source Annotators

2015

Pablo Ruiz Thierry Poibeau Frédérique Mélanie

Entity Linking (EL) systems’ performance is uneven across corpora or depending on entity types. To help overcome this issue, we propose an EL workflow that combines the outputs of several open source EL systems, and selects annotations via weighted voting. The results are displayed on a UI that allows the users to navigate the corpus and to evaluate annotation quality based on several metrics.


The MATE Markup Framework

2000

Laila Dybkjær Niels Ole Bernsen

Since early 1998, the European Telematics project MATE has worked towards facilitating re-use of annotated spoken language data, addressing theoretical issues and implementing practical solutions which could serve as standards in the field. The resulting MATE Workbench for corpus annotation is now available as licensed open source software. This paper describes the MATE markup framework which b...


Mutability and Becoming: Materializing of Public Sector Adoption of Open Source Software

2012

Maha Shaikh

Juxtaposing two local council cases of open source software adoption in the UK, we highlight their differences and similarities in open source adoption and implementation. Our narratives indicate that in both cases there was strong goodwill towards open source, yet the trajectories of implementation differed widely. We draw on Deleuze and Guattari’s ideas of becoming, tracing versus mapping and ...
