Large Linguistic Corpus Reduction with SCP Algorithms
Abstract
Linguistic corpus design is a critical concern for building richly annotated corpora that are useful across application domains. For example, speech technologies such as ASR (Automatic Speech Recognition) or TTS (Text-to-Speech) need a huge amount of speech data to train data-driven models or to produce synthetic speech. Collecting data always incurs costs (recording speech, verifying annotations, etc.), and as a rule of thumb, the more data you gather, the more costly your application will be. Within this context, we present in this article solutions to reduce the amount of linguistic text content while maintaining the level of linguistic richness required by a model or an application. This problem can be formalized as a Set Covering Problem (SCP), and we evaluate two algorithmic heuristics applied to the design of large text corpora in English and French for covering phonological information or POS labels. The first algorithm considered is a standard greedy solution with an agglomerative/splitting strategy, and we propose a second algorithm based on Lagrangian relaxation. The latter approach provides a lower bound on the cost of each covering solution. This lower bound can be used as a metric to evaluate the quality of a reduced corpus, whatever algorithm is applied. Experiments show that a suboptimal algorithm such as a greedy algorithm achieves good results; the cost of its solutions is close to the lower bound (within about 4.35% for 3-phoneme coverings). Usually, constraints in SCP are binary; we propose here a generalization where the constraints on each covering feature can be multi-valued.
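The two ideas named in the abstract, a greedy covering pass and a Lagrangian lower bound under multi-valued constraints, can be illustrated with a minimal Python sketch. Everything below is an assumption made for illustration and not the authors' implementation: the function names (`greedy_cover`, `lagrangian_bound`, `subgradient_lower_bound`), the gain-per-cost selection rule, the diminishing step size, and the toy corpus are all hypothetical, and the splitting phase of the greedy strategy is omitted.

```python
from collections import Counter

def greedy_cover(sentences, costs, required):
    """Pick sentences until each feature f occurs at least required[f] times.

    sentences: list of Counter(feature -> count); costs: list of positive numbers.
    Selection rule (an assumption): best useful coverage per unit cost.
    """
    remaining = {f: n for f, n in required.items() if n > 0}
    unused = set(range(len(sentences)))
    chosen = []
    while remaining:
        best, best_score = None, 0.0
        for i in sorted(unused):
            # Only count occurrences that are still needed.
            gain = sum(min(sentences[i].get(f, 0), need)
                       for f, need in remaining.items())
            score = gain / costs[i]
            if score > best_score:
                best, best_score = i, score
        if best is None:
            raise ValueError("requirements cannot be met by this corpus")
        chosen.append(best)
        unused.discard(best)
        for f, n in sentences[best].items():
            if f in remaining:
                remaining[f] -= n
                if remaining[f] <= 0:
                    del remaining[f]
    return chosen

def lagrangian_bound(sentences, costs, required, u):
    """Dual function at multipliers u >= 0: a lower bound on any cover's cost."""
    bound = sum(u[f] * required[f] for f in required)
    picked = []
    for i, s in enumerate(sentences):
        reduced = costs[i] - sum(u.get(f, 0.0) * n for f, n in s.items())
        if reduced < 0:  # the relaxed problem selects sentence i
            bound += reduced
            picked.append(i)
    return bound, picked

def subgradient_lower_bound(sentences, costs, required, iterations=200, step=1.0):
    """Improve the bound with projected subgradient steps of size step / k."""
    u = {f: 0.0 for f in required}
    best = 0.0  # valid because all costs are positive
    for k in range(1, iterations + 1):
        bound, picked = lagrangian_bound(sentences, costs, required, u)
        best = max(best, bound)
        for f in required:
            covered = sum(sentences[i].get(f, 0) for i in picked)
            u[f] = max(0.0, u[f] + (step / k) * (required[f] - covered))
    return best

# Toy corpus: features stand in for phonemes, costs for sentence lengths.
corpus = [Counter(a=1, b=1), Counter(b=1, c=1), Counter(a=1, c=1),
          Counter(a=2, b=2, c=2)]
lengths = [2, 2, 2, 5]
needs = {"a": 2, "b": 2, "c": 2}

cover = greedy_cover(corpus, lengths, needs)
cover_cost = sum(lengths[i] for i in cover)
lower = subgradient_lower_bound(corpus, lengths, needs)
print(cover, cover_cost, lower)
```

The gap between `cover_cost` and `lower` plays the role of the quality metric described in the abstract: the bound holds for any covering, regardless of which algorithm produced it.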
Similar resources
Comparing Set-Covering Strategies for Optimal Corpus Design
This article addresses the problem of the linguistic content of a speech corpus. Depending on the target task, the phonological and linguistic content of the corpus is controlled by collecting a set of sentences which covers a preset description of phonological attributes under the constraint of an overall duration as small as possible. This goal is classically achieved by greedy algorit...
A 3-flip neighborhood local search for the set covering problem
The set covering problem (SCP) asks to choose a minimum cost family of subsets from given n subsets, which together covers the entire ground set. In this paper, we propose a local search algorithm for SCP, which has the following three features. (1) The use of 3-flip neighborhood, which is the set of solutions obtainable from the current solution by exchanging at most three subsets. As the size o...
Comparing performance of different set-covering strategies for linguistic content optimization in speech corpora
Set covering algorithms are efficient tools for solving an optimal linguistic corpus reduction. The optimality of such a process is directly related to the descriptive features of the sentences of a reference corpus. This article proposes to verify experimentally the behaviour of three algorithms, a greedy approach and a Lagrangian relaxation based one giving importance to rare events and a thi...
Finite State Models for the Generation of Large Corpora of Natural Language Texts
Natural languages are probably one of the most common types of input for text processing algorithms. Therefore, it is often desirable to have a large training/testing set of input of this kind, especially when dealing with algorithms tuned for natural language texts. The problem in creating good corpora is that often natural language texts are too short with respect to the dimension required to ...
Optimizing of SCP Production from Sugar Beet Stillage Using Isolated Yeast
In this study fungi isolated from the effluent of ethanol factories were identified. Optimal conditions for single cell protein (SCP) production and COD reduction of sugar beet stillage are specified for a species of Hansenula in a continuous culture. Under these conditions 5.7 g dm⁻³ biomass was produced and 31% of COD was reduced without addition of further nutrients to the beet molasses ...
Journal title:
- Computational Linguistics
Volume 41
Publication date: 2015