Phonotactic Modeling of Extremely Low Resource Languages
Abstract
This paper presents a novel approach to low-resource language modeling. We propose a word-prediction model based on multi-variant n-gram abstraction with weighted confidence levels. We demonstrate a significant improvement in word recall over the "traditional" Kneser-Ney back-off model for most of the examined low-resource languages.
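For orientation, the Kneser-Ney baseline named in the abstract can be sketched as follows. This is a minimal illustration only, not the authors' proposed model and not their experimental setup: it assumes a bigram order, a fixed discount of 0.75, the interpolated form of Kneser-Ney smoothing, and it reads "word recall" as the fraction of positions where the true next word appears among the top-k predictions (the abstract does not define the metric). The names `KneserNeyBigram` and `recall_at_k` are hypothetical.

```python
from collections import Counter, defaultdict


class KneserNeyBigram:
    """Interpolated Kneser-Ney bigram model with a fixed discount (assumed 0.75)."""

    def __init__(self, sentences, discount=0.75):
        self.discount = discount
        self.bigrams = Counter()
        for sent in sentences:
            tokens = ["<s>"] + list(sent)  # "<s>" marks the sentence start
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.context_total = Counter()      # c(prev): tokens seen after prev
        self.followers = defaultdict(set)   # distinct words seen after prev
        self.preceders = defaultdict(set)   # distinct contexts seen before word
        for (prev, word), count in self.bigrams.items():
            self.context_total[prev] += count
            self.followers[prev].add(word)
            self.preceders[word].add(prev)
        self.vocab = sorted(self.preceders)

    def prob(self, prev, word):
        # Continuation probability: share of bigram types that end in `word`.
        p_cont = len(self.preceders.get(word, ())) / len(self.bigrams)
        total = self.context_total.get(prev, 0)
        if total == 0:
            return p_cont  # unseen context: fall back to the lower-order estimate
        d = self.discount
        discounted = max(self.bigrams.get((prev, word), 0) - d, 0.0) / total
        backoff_weight = d * len(self.followers[prev]) / total
        return discounted + backoff_weight * p_cont

    def predict(self, prev, k=3):
        # Top-k candidates by probability; ties are broken alphabetically.
        return sorted(self.vocab, key=lambda w: (-self.prob(prev, w), w))[:k]


def recall_at_k(model, sentences, k=3):
    """Fraction of positions whose true next word is among the top-k predictions."""
    hits = total = 0
    for sent in sentences:
        tokens = ["<s>"] + list(sent)
        for prev, word in zip(tokens, tokens[1:]):
            hits += word in model.predict(prev, k)
            total += 1
    return hits / total if total else 0.0


if __name__ == "__main__":
    train = [["the", "cat", "sat"], ["the", "cat", "ran"], ["a", "dog", "sat"]]
    model = KneserNeyBigram(train)
    print(model.predict("the", k=2))
    print(recall_at_k(model, train, k=1))
```

In a real evaluation the recall would be measured on held-out text rather than on the training sentences used in this toy run.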
Similar resources
Combining Weak Tokenisers for Phonotactic Language Recognition in a Resource-Constrained Setting
In the phonotactic approach for language recognition, a phone tokeniser is normally used to transform the audio signal into acoustic tokens. The language identity of the speech is modelled by the occurrence statistics of the decoded tokens. The performance of this approach depends heavily on the quality of the audio tokeniser. A high-quality tokeniser in matched condition is not always availabl...
A Language Independent Approach To Acquiring Phonotactic Resources for Speech Recognition
Building and developing linguistic resources for languages is of prime importance with many areas of application. This paper focusses on a fully automatic approach to the acquisition of syllable phonotactics for a particular language. In this approach the phonotactic constraints for a language are encoded in a finite-state phonotactic automaton, the structure of which can be automatically deriv...
Grammars leak: Modeling how phonotactic generalizations interact within the grammar
I present evidence from Navajo and English that weaker, gradient versions of morpheme-internal phonotactic constraints, such as the ban on geminate consonants in English, hold even across prosodic word boundaries. I argue that these lexical biases are the result of a MAXIMUM ENTROPY phonotactic learning algorithm that maximizes the probability of the learning data, but that also contains a smo...
The BLZ Submission to the NIST 2011 LRE: Data Collection, System Development and Performance
This paper describes the most relevant features of a collaborative multi-site submission to the NIST 2011 Language Recognition Evaluation (LRE), consisting of one primary and three contrastive systems, each fusing different combinations of 13 state-of-the-art (acoustic and phonotactic) language recognition subsystems. The collaboration focused on collecting and sharing training data for those t...
Modeling code-Switching speech on under-resourced languages for language identification
This paper presents an integration of phonotactic information to perform language identification (LID) in mixed-language speech. A single-pass front-end recognition system is employed to convert the spoken utterances into a statistical occurrence of phone sequences. To process such phone sequences, a hidden Markov model (HMM) is utilized to build robust acoustic models that can handle multipl...