Reranking with Linguistic and Semantic Features for Arabic Optical Character Recognition

Authors

  • Nadi Tomeh
  • Nizar Habash
  • Ryan Roth
  • Noura Farra
  • Pradeep Dasigi
  • Mona T. Diab
Abstract

Optical Character Recognition (OCR) systems for Arabic rely on information contained in the scanned images to recognize sequences of characters and on language models to emphasize fluency. In this paper we incorporate linguistically and semantically motivated features into an existing OCR system. To do so, we follow an n-best list reranking approach that exploits recent advances in learning-to-rank techniques. We achieve 10.1% and 11.4% reductions in recognition word error rate (WER) relative to a standard baseline system on typewritten and handwritten Arabic, respectively.
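The n-best reranking idea mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' system: the feature names (ocr_score, lm_score, morph_ok), the weights, and the placeholder token strings are all hypothetical, and a plain linear scorer stands in for whatever learning-to-rank model the paper uses. The sketch only shows how extra features can reorder an OCR system's candidate list and how WER is computed for a candidate.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    text: str                    # one candidate transcription from the n-best list
    features: dict[str, float]   # hypothetical feature values for this candidate


def score(hyp: Hypothesis, weights: dict[str, float]) -> float:
    """Linear score: weighted sum of the hypothesis features."""
    return sum(weights.get(name, 0.0) * value for name, value in hyp.features.items())


def rerank(nbest: list[Hypothesis], weights: dict[str, float]) -> list[Hypothesis]:
    """Return the n-best list sorted from highest to lowest score."""
    return sorted(nbest, key=lambda h: score(h, weights), reverse=True)


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length (reference must be non-empty)."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical 2-best list: the baseline prefers the first candidate (higher ocr_score).
nbest = [
    Hypothesis("a b x d", {"ocr_score": -1.0, "lm_score": -2.0, "morph_ok": 0.0}),
    Hypothesis("a b c d", {"ocr_score": -1.2, "lm_score": -1.5, "morph_ok": 1.0}),
]
# Hypothetical weights; in a learning-to-rank setup these would be learned from data.
weights = {"ocr_score": 1.0, "lm_score": 0.5, "morph_ok": 0.8}

best = rerank(nbest, weights)[0]
print(best.text, word_error_rate("a b c d", best.text))
```

With these made-up numbers the second candidate moves to the top once the additional features are weighted in, which is the effect reranking aims for.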

Similar resources

Features Extraction Method for Arabic Characters Based on Pixel Orientation Technique

This paper presents a feature extraction module for isolated handwritten Arabic characters. The collected core features are based on pixel orientations according to the Freeman chain code. The input to this module is an Arabic character in its basic shape (i.e. without diacritics). The feature extractor module, fed with a skeleton of an isolated character's basic shape, yields global and local featu...

A Persian-English Cross-Linguistic Dataset for Research on the Visual Processing of Cognates and Noncognates

Finding out which lexico-semantic features of cognates are critical in cross-language studies and comparing these features with noncognates helps researchers to decide which features to control in studies with cognates. Normative databases provide necessary information for this purpose. Such resources are lacking in the Persian language. We created a dataset and determined norms for the essenti...

On the Optical Character Recognition and Machine Translation Technology in Arabic: Problems and Solutions

The report addresses the basic problems of formalizing the Arabic language, based on an analysis of linguistic errors in software products. Reviewing the principles by which modern information systems operate, the authors conclude that the existing methods of Arabic formalization show a shift towards the technological aspects of the linguistic processing of facts, however, t...

Optimizing Feature Selection for Recognizing Handwritten Arabic Characters

Recognition of characters greatly depends upon the features used. Several features of the handwritten Arabic characters are selected and discussed. An off-line recognition system based on the selected features was built. The system was trained and tested with realistic samples of handwritten Arabic characters. Evaluation of the importance and accuracy of the selected features is made. The recog...

A Proposed Hybrid Technique for Recognizing Arabic Characters

Optical character recognition systems improve human-machine interaction and are urgently required by many governmental and commercial departments. Considerable progress has been achieved in recognition techniques for Latin and Chinese characters. By contrast, Arabic Optical Character Recognition (AOCR) is still lagging, although the interest and research in this area is becoming more inten...

Publication date: 2013