Modeling Documents for Structure Recognition Using Generalized N-Grams
نویسندگان
چکیده
منابع مشابه
Improved Degraded Document Recognition with Hybrid Modeling Techniques and Character N-Grams
In this paper a robust multifont character recognition system for degraded documents such as photocopy or fax is described. The system is based on Hidden Markov Models (HMMs) using discrete and hybrid modeling techniques, where the latter makes use of an information theory-based neural network. The presented recognition results refer to the SEDAL-database of English documents using no dictionar...
متن کاملN-Gram Language Modeling for Robust Multi-Lingual Document Classification
Statistical n-gram language modeling is used in many domains like speech recognition, language identification, machine translation, character recognition and topic classification. Most language modeling approaches work on n-grams of terms. This paper reports about ongoing research in the MEMPHIS project which employs models based on character-level n-grams instead of term n-grams. The models ar...
متن کاملInterpolated Dirichlet Class Language Model for Speech Recognition Incorporating Long-distance N-grams
We propose a language modeling (LM) approach incorporating interpolated distanced n-grams in a Dirichlet class language model (DCLM) (Chien and Chueh, 2011) for speech recognition. The DCLM relaxes the bag-of-words assumption and documents topic extraction of latent Dirichlet allocation (LDA). The latent variable of DCLM reflects the class information of an n-gram event rather than the topic in...
متن کاملOF THE IEEE , AUGUST 2000 1 Exploiting Latent Semantic Information in Statistical Language Modeling Jerome
| Statistical language models used in large vocabulary speech recognition must properly encapsulate the various constraints, both local and global, present in the language. While local constraints are readily captured through n-gram modeling, global constraints, such as long-term semantic dependencies, have been more diÆcult to handle within a data-driven formalism. This paper focuses on the us...
متن کاملLanguage Modeling for limited-data domains
With the increasing focus of speech recognition and natural language processing applications on domains with limited amount of in-domain training data, enhanced system performance often relies on approaches involving model adaptation and combination. In such domains, language models are often constructed by interpolating component models trained from partially matched corpora. Instead of simple...
متن کامل