Script Identification in Printed Bilingual Documents
Authors
Abstract
Identifying the script of the text in multi-script documents is an important step in designing an OCR system for page analysis and recognition. Much work has already been reported in this area for Roman, Arabic, Chinese, Korean and Japanese scripts. In the Indian context, although some results have been reported, the task is still in its infancy. This paper presents a successful attempt to identify the script, at the word level, in a bilingual document containing Roman and Tamil scripts. Two different approaches are proposed and thoroughly tested. In the first method, each word is divided into three distinct spatial zones, and the spatial spread of the word in the upper and lower zones, together with the character density, is used to identify the script. The second technique analyses the directional energy distribution of a word using Gabor filters with suitable frequencies and orientations. Words in various font styles and sizes were used to test the proposed algorithms, and the results are quite encouraging.
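The second technique can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the filter parameters, so the frequencies, the number of orientations, the Gaussian width `sigma`, the kernel `size`, and the function names below are all assumptions chosen for illustration. The sketch filters a grayscale word image with a small bank of even-symmetric Gabor kernels and returns the share of the total energy captured by each frequency and orientation pair.

```python
import numpy as np

def gabor_kernel(frequency, theta, sigma=3.0, size=15):
    """Even-symmetric (cosine) Gabor kernel. sigma and size are assumed values."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    # Rotate coordinates so the sinusoid varies along direction theta.
    xr = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma ** 2))
    kernel = envelope * np.cos(2.0 * np.pi * frequency * xr)
    # Subtract the mean so uniform regions produce no response.
    return kernel - kernel.mean()

def filter_fft(image, kernel):
    """Full linear 2-D convolution of image and kernel, computed with the FFT."""
    shape = (image.shape[0] + kernel.shape[0] - 1,
             image.shape[1] + kernel.shape[1] - 1)
    spectrum = np.fft.rfft2(image, shape) * np.fft.rfft2(kernel, shape)
    return np.fft.irfft2(spectrum, shape)

def directional_energy(word_image, frequencies=(0.1, 0.25), n_orientations=4):
    """Normalised energy per (frequency, orientation) pair.

    The frequencies (cycles per pixel) and orientation count are
    hypothetical defaults, not values taken from the paper.
    """
    image = np.asarray(word_image, dtype=float)
    energies = []
    for frequency in frequencies:
        for k in range(n_orientations):
            theta = k * np.pi / n_orientations
            response = filter_fft(image, gabor_kernel(frequency, theta))
            energies.append(np.sum(response ** 2))
    energies = np.array(energies)
    total = energies.sum()
    # Guard against an all-zero (blank) image.
    return energies / total if total > 0 else energies

if __name__ == "__main__":
    # Synthetic stand-in for a word image: vertical stripes, 0.25 cycles/pixel.
    columns = np.arange(64)
    stripes = np.tile(np.cos(2.0 * np.pi * 0.25 * columns), (32, 1))
    print(directional_energy(stripes))
```

The resulting feature vector could then be passed to any classifier to separate the two scripts. The choice of classifier is left open here because the abstract does not name one.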
Similar resources
Script Identification from Bilingual Gujarati-English Documents
In a multilingual country like India, English words are often interspersed with the regional Indian language in most official papers, school textbooks and magazines. A bilingual Optical Character Recognition (OCR) system is therefore needed that can recognize these bilingual documents and store them for future use. In this paper, the authors present an OCR system developed for the script ...
A Comparative Analysis of Classifiers Accuracies for Bilingual Printed Documents (Oriya-English)
Bilingual document recognition has been the subject of intensive research, and our focus is on the recognition of Oriya-English bilingual documents. In most of our official papers and school textbooks, English words are interspersed with the Indian languages. So there is a need for an Optical Character Recognition (OCR) system which can recognize these bilingual documents and st...
Identification of Printed Punjabi Words and English Numerals Using Gabor Features
Script identification is one of the challenging steps in the development of an optical character recognition system for bilingual or multilingual documents. In this paper, an attempt is made to identify English numerals at the word level in Punjabi documents using Gabor features. A support vector machine (SVM) classifier with five-fold cross-validation is used to classify the word imag...
Character Level Separation and Identification of English and Gujarati Digits from Bilingual (English-Gujarati) Printed Documents
Nowadays, English script is often interspersed with the Indian languages. So there is a need for an optical character recognition (OCR) system which can recognize these bilingual documents and store them for future use. Hence, in this paper an OCR system is proposed that can read documents containing Gujarati and English scripts (digits only). These scripts have many features in ...
Global Approach for Script Identification using Wavelet Packet Based Features
In a multi-script environment, archives of documents with text regions printed in different scripts are common in practice. For automatic processing of such documents through Optical Character Recognition (OCR), it is necessary to identify the different script regions of the document. In this paper, a novel texture-based approach is presented to identify the script type of the collection of documen...