Dataset and Ground Truth for Handwritten Text in Four Different Scripts
Abstract
In document image analysis (DIA), especially in handwritten document recognition, standard databases play a significant role in evaluating the performance of algorithms and comparing results obtained by different groups of researchers. DIA research on Indo-Persian documents is still in its infancy compared with work on Latin-script documents, and standard datasets are not yet available in the literature. This paper is an effort towards closing this gap. It introduces an unconstrained handwritten dataset containing documents in Persian, Bangla, Oriya and Kannada (PBOK). The PBOK dataset contains 707 text-pages written in these four languages by 436 individuals. The total numbers of text-lines, words/subwords and characters are 12 565, 104 541 and 423 980, respectively. Most documents in the PBOK dataset contain overlapping or touching text-lines. The average number of text-lines per text-page is 18. Two types of ground truth, one based on pixel information and one on content information, are generated for the dataset. Because of these ground truths, the PBOK dataset can be used in many areas of document image processing, for instance text-line segmentation, word segmentation and word recognition. To provide insight for other researchers, recent text-line segmentation results on this dataset are also reported.
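The totals quoted in the abstract can be cross-checked with simple arithmetic. The sketch below is only an illustration: the variable names and the `ratio` helper are hypothetical and are not part of the PBOK distribution. It uses nothing beyond the figures stated above, and the text-lines-per-page ratio it computes rounds to the reported average of 18.

```python
# Figures quoted in the abstract (no other data is assumed).
PAGES = 707
WRITERS = 436
TEXT_LINES = 12_565
WORDS_OR_SUBWORDS = 104_541
CHARACTERS = 423_980


def ratio(total: int, count: int) -> float:
    """Return the mean number of items per unit (hypothetical helper)."""
    return total / count


lines_per_page = ratio(TEXT_LINES, PAGES)
words_per_line = ratio(WORDS_OR_SUBWORDS, TEXT_LINES)
chars_per_word = ratio(CHARACTERS, WORDS_OR_SUBWORDS)
pages_per_writer = ratio(PAGES, WRITERS)

print(f"text-lines per page:        {lines_per_page:.2f}")
print(f"words/subwords per line:    {words_per_line:.2f}")
print(f"characters per word/subword: {chars_per_word:.2f}")
print(f"pages per writer:           {pages_per_writer:.2f}")
```

These are dataset-wide means derived from the totals; the abstract does not give per-script breakdowns, so none are computed here.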
Similar resources
UCOM offline dataset-an urdu handwritten dataset generation
A benchmark database for character recognition is an essential part for efficient and robust development. Unfortunately, there is no comprehensive handwritten dataset for Urdu language that would be used to compare the state of the art techniques in the field of optical character recognition. In this paper, we present a new and publicly available dataset comprising 600 pages of handwritten Ur...
Baseline Estimation in Arabic Handwritten Text-Line - Evaluation on AHTID/MW Database
Baseline extraction is one of the most important phases for handwriting recognition. Due to the complexity of the Arabic scripts, baseline detection of Arabic handwritten text-lines is a difficult task compared to other languages. In this work, a method which combines some baseline extraction techniques used in literature was presented to provide a fine estimation of baseline in Arabic handwrit...
A New Multipurpose Comprehensive Database for Handwritten Dari Recognition
In this paper, we present the creation of the first comprehensive database for research and development on handwritten recognition of Dari language. This new handwritten database consists of many aspects of Dari scripts such as: handwritten isolated characters, isolated digits, numeral strings of various lengths, many words/terms, dates, and some special symbols. For each handwritten image in t...
TexT - Text Extractor Tool for Handwritten Document Transcription and Annotation
This paper presents a framework for semi-automatic transcription of large-scale historical handwritten documents and proposes a simple user-friendly text extractor tool, TexT for transcription. The proposed approach provides a quick and easy transcription of text using computer assisted interactive technique. The algorithm finds multiple occurrences of the marked text on-the-fly using a word sp...
Mapping Transcripts to Handwritten Text
In the analysis and recognition of handwriting, a useful first task is to assign ground truth for words in the writing. Such an assignment is useful for various subsequent machine learning tasks for performing automatic recognition, writer verification, etc. Since automatic word segmentation and recognition can be error prone, an intermediate approach is to use a text file that is a transcripti...
Journal title:
- IJPRAI
Volume 26, Issue -
Pages -
Publication date: 2012