Phonotactic Modeling of Extremely Low Resource Languages

نویسندگان

  • Andrei Shcherbakov
  • Ekaterina Vylomova
  • Nick Thieberger
چکیده

This paper presents a novel approach to low resource language modeling. Here we propose a model for word prediction which is based on multi-variant ngram abstraction with weighted confidence level. We demonstrate a significant improvement in word recall over ”traditional” KneserNey back-off model for most of the examined low resource languages.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Combining Weak Tokenisers for Phonotactic Language Recognition in a Resource-Constrained Setting

In the phonotactic approach for language recognition, a phone tokeniser is normally used to transform the audio signal into acoustic tokens. The language identity of the speech is modelled by the occurrence statistics of the decoded tokens. The performance of this approach depends heavily on the quality of the audio tokeniser. A high-quality tokeniser in matched condition is not always availabl...

متن کامل

A Language Independent Approach To Acquiring Phonotactic Resources for Speech Recognition

Building and developing linguistic resources for languages is of prime importance with many areas of application. This paper focusses on a fully automatic approach to the aquisition of a syllable phonotactics for a particular language. In this approach the phonotactic constraints for a language are encoded in a finite-state phonotactic automaton the structure of which can be automatically deriv...

متن کامل

Grammars leak: Modeling how phonotactic generalizations interact within the grammar

I present evidence from Navajo and English that weaker, gradient versions of morpheme-internal phonotactic constraints, such as the ban on geminate consonants in English, hold even across prosodic word boundaries. I argue that these lexical biases are the result of a MAXIMUM EN-TROPY phonotactic learning algorithm that maximizes the probability of the learning data, but that also contains a smo...

متن کامل

The BLZ Submission to the NIST 2011 LRE: Data Collection, System Development and Performance

This paper describes the most relevant features of a collaborative multi-site submission to the NIST 2011 Language Recognition Evaluation (LRE), consisting of one primary and three contrastive systems, each fusing different combinations of 13 state-of-the-art (acoustic and phonotactic) language recognition subsystems. The collaboration focused on collecting and sharing training data for those t...

متن کامل

Modeling code-Switching speech on under-resourced languages for language identification

This paper presents an integration of phonotactic information to perform language identification (LID) in a mixed-language speech. A single-pass front-end recognition system is employed to convert the spoken utterances into a statistical occurrence of phone sequences. To process such phone sequences, a hidden Markov model (HMM) is utilized to build robust acoustic models that can handle multipl...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2016