Transformation-based Learning in Document Format Processing
Abstract
Recent work in computational linguistics (Brill 1992; 1994) has described a transformation-based learner with impressive accuracy, speed and a lucid, concise representation. This work presents a set-based formal model of ambiguity, tagging and the transformation-based learning paradigm. We apply the model to the automatic learning of document format generation and recognition on multiple levels of structural semantics. This supports the general applicability of the model and results in a novel linear-time document format processor.
Similar resources
A new text mining method for extracting user context information to improve the ranking of search engine results
Today, the importance of text processing and its uses is well known among researchers and students. The amount of textual material increases day by day, so we need effective ways to store it and retrieve information from it. For example, search engines such as Google, Yahoo and Bing need to read a great many web documents and retrieve the ones most similar to the user ...
Removing geometric distortion from texts using the geometric information of text lines
Document images produced by scanners or digital cameras usually have photometric and geometric distortions. If either of these effects distorts a document, recognition of words from the document image using OCR is subject to errors. In this paper we propose a novel approach that significantly reduces geometric distortion in document images. In this method we first extract document lines from do...
Document Image Dewarping Based on Text Line Detection and Surface Modeling (RESEARCH NOTE)
Document images produced by a scanner or digital camera usually suffer from geometric and photometric distortions, both of which degrade the performance of OCR systems. In this paper, we present a novel method to compensate for undesirable geometric distortions, aiming to improve OCR results. Our methodology is based on finding text lines with a dynamic local connectivity map and then applying a l...
XML and Knowledge Technologies for Semantic-Based Indexing of Paper Documents
Effective daily processing of large amounts of paper documents in office environments requires the application of semantic-based indexing techniques during the transformation of paper documents to electronic format. For this purpose a combination of both XML and knowledge technologies can be used. XML distinguishes between data, its structure and semantics, allowing the exchange of data element...
Provide a model for the establishment of the school in accordance with the indicators and requirements of the Education Transformation Document
Purpose: The aim of this study was to provide a model for school establishment in accordance with the indicators and requirements of the Education Transformation Document. Methodology: The research method was basic-applied in terms of purpose, descriptive-survey in terms of data collection method and combined in terms of data type. The statistical population of the study in the qualitative sect...