Optimizing OCR accuracy for bi-tonal, noisy scans of degraded Arabic documents
Abstract
Acquiring foreign language from degraded hardcopy documents is of interest to military and border control applications. Bi-tonal image scans are desirable because file size is small. However, the nature of hardcopy degradations and the scanner or image enhancement software capabilities used directly affect the quality of the captured image and the extent of language acquisition. We applied a collection of manual treatments to hardcopy Arabic documents to develop a corpus of bi-tonal images. We then used this corpus in an exploratory study to derive conclusions about how bi-tonal images could be enhanced. This paper discusses the manually degraded Arabic document corpus, the image enhancement study, and the significant optical character recognition (OCR) improvements obtained with simple scanner driver adjustments.
Similar resources
Performance Evaluation of Two Arabic OCR Products
Numerous Optical Character Recognition (OCR) companies claim that their products have near-perfect recognition accuracy (close to 99.9%). In practice, however, these accuracy rates are rarely achieved. Most systems break down when the input document images are highly degraded, such as scanned images of carbon-copy documents, documents printed on low-quality paper, and documents that are n-th ge...
English/Arabic Cross Language Information Retrieval (CLIR) for Arabic OCR-Degraded Text
In this paper, a novel model for query translation and expansion, enabling English/Arabic CLIR for both normal and OCR-degraded Arabic text, has been proposed, implemented, and tested. First, an English/Arabic word collocations dictionary has been established, and three English/Arabic single-word dictionaries have been reproduced. Second, a modern Arabic corpus has been built. Third, a model for simu...
Processing of Degraded Documents for Long-Term Archival Using Waferfiche™ Technology
Adaptive binarization techniques are proposed for the restoration of degraded documents for Waferfiche™ archival. Waferfiche™ is a compact archival solution for long-term preservation of documents in a human-accessible, image-based format. Bi-level images are preferable for the lithographic fabrication utilized in Waferfiche™ production. Binarization of degraded documents poses challenges due to...
Query Translation and Expansion for Searching Normal and OCR-Degraded Arabic Text
This paper provides a novel model for English/Arabic Query Translation to search Arabic text, and then expands the Arabic query to handle Arabic OCR-Degraded Text. This includes detection and translation of word collocations, translating single words, transliterating names, and disambiguating translation and transliteration through different approaches. It also expands the query with the expect...
Automatic Arabic Spelling Errors Detection and Correction Based on Confusion Matrix-Noisy Channel Hybrid System
Arabic spelling errors occur in different types of documents, such as documents handwritten by inexperienced users, optical character recognition (OCR) output, and machine-translated documents. Many researchers have tried to solve this problem, but so far there is no definitive solution. This paper proposes a hybrid system based on the confusion matrix and the noisy channel spelling correction model t...
Publication date: 2005