Transformation-based Learning in Document Format Processing

Authors

James R. Curran

Raymond K. Wong

Abstract

Recent work in computational linguistics (Brill 1992; 1994) has described a transformation-based learner with impressive accuracy, speed and a lucid, concise representation. This work presents a set-based formal model of ambiguity, tagging and the transformation-based learning paradigm. We apply the model to the automatic learning of document format generation and recognition on multiple levels of structural semantics. This supports the general applicability of the model and results in a novel linear-time document format processor.


Similar resources

A new text-mining method for extracting user context information to improve the ranking of search engine results

Today, the importance of text processing and its uses is well known among researchers and students. The amount of textual material increases day by day, so we need useful ways to store it and retrieve information from it. For example, search engines such as Google, Yahoo and Bing need to read a great many web documents and retrieve the ones most similar to the user ...


Removing geometric distortion from text documents using the geometric information of text lines

Document images produced by scanners or digital cameras usually have photometric and geometric distortions. If either of these effects distorts a document, recognition of words from the document image using OCR is subject to errors. In this paper we propose a novel approach to substantially remove geometric distortion from document images. In this method, we first extract document lines from do...


Document Image Dewarping Based on Text Line Detection and Surface Modeling (RESEARCH NOTE)

Document images produced by a scanner or digital camera usually suffer from geometric and photometric distortions, both of which degrade the performance of OCR systems. In this paper, we present a novel method to compensate for undesirable geometric distortions, aiming to improve OCR results. Our methodology is based on finding text lines by a dynamic local connectivity map and then applying a l...


XML and Knowledge Technologies for Semantic-Based Indexing of Paper Documents

Effective daily processing of large amounts of paper documents in office environments requires the application of semantic-based indexing techniques during the transformation of paper documents to electronic format. For this purpose a combination of both XML and knowledge technologies can be used. XML distinguishes between data, its structure and semantics, allowing the exchange of data element...


Provide a model for the establishment of the school in accordance with the indicators and requirements of the Education Transformation Document

Purpose: The aim of this study was to provide a model for school establishment in accordance with the indicators and requirements of the Education Transformation Document. Methodology: The research method was basic-applied in terms of purpose, descriptive-survey in terms of data collection method and combined in terms of data type. The statistical population of the study in the qualitative sect...

Publication date: 2000
