Large Linguistic Corpus Reduction with SCP Algorithms

Authors

Nelly Barbot

Olivier Boëffard

Jonathan Chevelu

Arnaud Delhay

Abstract

Linguistic corpus design is a critical concern for building rich annotated corpora useful in different application domains. For example, speech technologies such as ASR (Automatic Speech Recognition) or TTS (Text-to-Speech) need a huge amount of speech data to train data-driven models or to produce synthetic speech. Collecting data always incurs costs (recording speech, verifying annotations, etc.), and as a rule of thumb, the more data you gather, the more costly your application will be. Within this context, we present in this article solutions to reduce the amount of linguistic text content while maintaining a sufficient level of linguistic richness required by a model or an application. This problem can be formalized as a Set Covering Problem (SCP), and we evaluate two algorithmic heuristics applied to design large text corpora in English and French for covering phonological information or POS labels. The first algorithm considered is a standard greedy solution with an agglomerative/splitting strategy, and we propose a second algorithm based on Lagrangian relaxation. The latter approach provides a lower bound on the cost of each covering solution. This lower bound can be used as a metric to evaluate the quality of a reduced corpus, whatever the algorithm applied. Experiments show that a suboptimal algorithm like a greedy algorithm achieves good results; the cost of its solutions is not far from the lower bound (about 4.35% for 3-phoneme coverings). Usually, constraints in SCP are binary; we propose here a generalization where the constraints on each covering feature can be multi-valued.
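To make the formulation in the abstract concrete, here is a minimal Python sketch of a multi-valued covering problem: each sentence has a cost and a count per feature, and each feature must be covered at least a required number of times. It is an illustration under stated assumptions, not the authors' implementation. The function names (`greedy_cover`, `lagrangian_bound`), the toy corpus, and the fixed multipliers `u` are all hypothetical. The greedy routine performs only a simple agglomerative pass with no splitting step, and the bound is evaluated for one hand-picked multiplier vector rather than optimized, for example by subgradient ascent.

```python
from collections import Counter


def greedy_cover(sentences, costs, required):
    """Pick sentences until every feature f occurs at least required[f] times.

    sentences: list of Counter mapping feature -> occurrence count
    costs:     list of positive sentence costs (e.g. length)
    required:  dict mapping feature -> minimum number of occurrences
    Returns the indices of the selected sentences, in selection order.
    """
    remaining = {f: n for f, n in required.items() if n > 0}
    unused = set(range(len(sentences)))
    chosen = []
    while remaining:
        best, best_score = None, 0.0
        for j in sorted(unused):
            # Useful gain: occurrences that still count toward an unmet requirement.
            gain = sum(min(sentences[j][f], need) for f, need in remaining.items())
            score = gain / costs[j]
            if score > best_score:
                best, best_score = j, score
        if best is None:
            raise ValueError("requirements cannot be met by the corpus")
        chosen.append(best)
        unused.remove(best)
        for f, n in sentences[best].items():
            if f in remaining:
                remaining[f] -= n
                if remaining[f] <= 0:
                    del remaining[f]
    return chosen


def lagrangian_bound(sentences, costs, required, u):
    """Lower bound on the optimal cover cost for non-negative multipliers u.

    L(u) = sum_f u[f] * required[f] + sum_j min(0, c_j - sum_f u[f] * a_fj)
    """
    bound = sum(u.get(f, 0.0) * b for f, b in required.items())
    for feats, c in zip(sentences, costs):
        reduced_cost = c - sum(u.get(f, 0.0) * n for f, n in feats.items())
        bound += min(0.0, reduced_cost)
    return bound


# Toy corpus: features are single characters, costs are arbitrary.
sentences = [Counter("ab"), Counter("bc"), Counter("abc"), Counter("a")]
costs = [2.0, 2.0, 3.0, 1.0]
required = {"a": 2, "b": 1, "c": 1}  # multi-valued: "a" must appear twice

chosen = greedy_cover(sentences, costs, required)
total = sum(costs[j] for j in chosen)
bound = lagrangian_bound(sentences, costs, required, {"a": 1.0, "b": 1.0, "c": 1.0})
print(chosen, total, bound, f"gap = {100 * (total - bound) / bound:.1f}%")
```

Any non-negative choice of `u` yields a valid lower bound, so the gap between `total` and `bound` gives a quality measure for a reduced corpus regardless of which algorithm produced it, in the spirit of the metric described above.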

Related resources

Comparing Set-Covering Strategies for Optimal Corpus Design

This article addresses the problem of the linguistic content of a speech corpus. Depending on the target task, the phonological and linguistic content of the corpus is controlled by collecting a set of sentences that covers a preset description of phonological attributes, under the constraint that the overall duration be as small as possible. This goal is classically achieved by greedy algorit...

A 3-flip neighborhood local search for the set covering problem

The set covering problem (SCP) asks to choose a minimum cost family of subsets from n given subsets, which together cover the entire ground set. In this paper, we propose a local search algorithm for SCP, which has the following three features. (1) The use of a 3-flip neighborhood, which is the set of solutions obtainable from the current solution by exchanging at most three subsets. As the size o...

Comparing performance of different set-covering strategies for linguistic content optimization in speech corpora

Set covering algorithms are efficient tools for optimal linguistic corpus reduction. The optimality of such a process is directly related to the descriptive features of the sentences of a reference corpus. This article experimentally examines the behaviour of three algorithms: a greedy approach, a Lagrangian relaxation based one giving importance to rare events, and a thi...

Finite State Models for the Generation of Large Corpora of Natural Language Texts

Natural languages are probably one of the most common types of input for text processing algorithms. Therefore, it is often desirable to have a large training/testing set of input of this kind, especially when dealing with algorithms tuned for natural language texts. The problem in creating good corpora is that natural language texts are often too short with respect to the dimension required to ...

Optimizing of SCP Production from Sugar Beet Stillage Using Isolated Yeast

In this study, fungi isolated from the effluent of ethanol factories were identified. Optimal conditions for single cell protein (SCP) production and COD reduction of sugar beet stillage are specified for a species of Hansenula in a continuous culture. Under these conditions, 5.7 g dm⁻³ of biomass was produced and COD was reduced by 31% without addition of further nutrients to the beet molasses ...

Journal:

Computational Linguistics

Volume 41, Issue

Pages -

Publication date: 2015
