Convolution Based Technique for Indic Script Identification from Handwritten Document Images
نویسندگان
چکیده
Determination of script type of document image is a complex real life problem for a multi-script country like India, where 23 official languages (including English) are present and 13 different scripts are used to write them. Including English and Roman those count become 23 and 13 respectively. The problem becomes more challenging when handwritten documents are considered. In this paper an approach for identifying the script type of handwritten document images written by any one of the Bangla, Devnagari, Roman and Urdu script is proposed. Two convolution based techniques, namely Gabor filter and Morphological reconstruction are combined and a feature vector of 20 dimensions is constructed. Due to unavailability of a standard data set, a corpus of 157 document images with an almost equal ratio of four types of script is prepared. During classification the dataset is divided into 2:1 ratio. An average identification accuracy rate of 94.4% is obtained on the test set. The average Bi-script and Tri-script identification accuracy rate was found to be 98.2% and 97.5% respectively. Statistical performance analysis is done using different well known classifiers.
منابع مشابه
Word level Script Identification from Bangla and Devanagri Handwritten Texts mixed with Roman Script
India is a multi-lingual country where Roman script is often used alongside different Indic scripts in a text document. To develop a script specific handwritten Optical Character Recognition (OCR) system, it is therefore necessary to identify the scripts of handwritten text correctly. In this paper, we present a system, which automatically separates the scripts of handwritten words from a docum...
متن کاملA new dataset of word-level offline handwritten numeral images from four official Indic scripts and its benchmarking using image transform fusion
Handwritten document image dataset development is one of the most tedious and time consuming tasks in optical character recogniser (OCR) related experimental work. Special attention need to be given in terms of feasibility, realness, clarity etc. while collecting real life data from different writers. Few efforts can be found in the literature for development of handwritten NIdb (numeral image ...
متن کاملHandwritten Script Identification from a Bi-Script Document at Line Level using Gabor Filters
In a country like India where more number of scripts are in use, automatic identification of printed and handwritten script facilitates many important applications including sorting of document images and searching online archives of document images. In this paper, a Gabor feature based approach is presented to identify different Indian scripts from handwritten document images. Eight popular In...
متن کاملHandwritten Text Image Compression for Indic Script Document
In this paper, compression scheme is presented for Indian Language handwritten text document images. Document image compression is an active area of research. Current OCR technology is not effective for handling the handwritten text images. The proposed compression scheme deals with the handwritten gray level document in Devnagri script. The method is based on the separation of foreground and b...
متن کاملIndic Handwritten Script Identification using Offline-Online Multimodal Deep Network
In this paper, we propose a novel approach of word-level Indic script identification using only character-level data in training stage. The advantages of using character level data for training have been outlined in section I. Our method uses a multimodal deep network which takes both offline and online modality of the data as input in order to explore the information from both the modalities j...
متن کامل