text documents

A New Algorithm for Detecting Text Line in Handwritten Documents

2006

Yi Li, Yefeng Zheng, David Doermann, Stefan Jaeger

Curvilinear text line detection and segmentation in handwritten documents is a significant challenge for handwriting recognition. Given no prior knowledge of script, we model text line detection as an image segmentation problem by enhancing text line structure using a Gaussian window, and adopting the level set method to evolve text line boundaries. Experiments show that the proposed method ach...

A Genetic Semi-supervised Fuzzy Clustering Approach to Text Classification

2003

Hong Liu, Shang-Teng Huang

A genetic semi-supervised fuzzy clustering algorithm is proposed that can learn a text classifier from labeled and unlabeled documents. Labeled documents are used to guide the evolution of each chromosome, which is a fuzzy partition of the unlabeled documents. The fitness of each chromosome is evaluated with a combination of the fuzzy within-cluster variance of unlabeled documents and misclassifi...

Multi-lingual Text Leveling

2014

Salim Roukos, Jerome Quin, Todd Ward

Determining the language proficiency level required to understand a given text is a key requirement in vetting documents for use in second language learning. In this work, we describe our approach for developing an automatic text analytic to estimate the text difficulty level using the Interagency Language Roundtable (ILR) proficiency scale. The approach we take is to use machine translation to...

Automatic Distinction of Fernando Pessoas' Heteronyms

2015

João Teixeira, Marco Couto

Text Mining has opened a vast array of possibilities for automatic information retrieval from large collections of text documents. A variety of themes and types of documents can be easily analyzed. More complex features, such as those used in Forensic Linguistics, can yield a deeper understanding of the documents, making it possible to perform difficult tasks such as author identification. In th...

What You Saw Is What You Want: Using Cases to Seed Information Retrieval

1997

Jody J. Daniels, Edwina L. Rissland

This paper presents a hybrid case-based reasoning (CBR) and information retrieval (IR) system, called SPIRE, that both retrieves documents from a full-text document corpus and from within individual documents, and locates passages likely to contain information about important problem-solving features of cases. SPIRE uses two case-bases, one containing past precedents, and one containing excerpt...

Probabilistic Methods for Structured Document Classification at INEX'07

2007

Luis M. de Campos, Juan M. Fernández-Luna, Juan F. Huete, Alfonso E. Romero

This paper presents the results of our participation in the Document Mining track at INEX’07. We have focused on the task of classifying XML documents. Our approach to dealing with structured document representations uses classification methods for plain text, applied to flattened versions of the documents, where some of their structural properties have been translated to plain text. We have ...

Recognizing Documents versus Meta-Documents by Tree Kernel Learning

2015

Boris A. Galitsky, Nina Lebedeva

The problem of classifying text with respect to metalanguage and language object patterns is formulated and its application areas are proposed. Examples of metalanguage patterns in text are foreign language grammar lessons and tutorials on how to write engineering documents. The method targets text classification tasks where keyword statistics are insufficient to distinguish between such abs...

Compherensive Review Of Text Classification Using Machine Learning

2015

Nisha Gautam, Abhishek Bhardwaj

Text Classification, also known as text categorization, is the task of automatically assigning unlabeled documents to one or more predefined categories or classes. The ability to perform a classification task accurately depends on the representations of the documents to be classified. Text representations transform the textual docum...

An Approach for Concept-based Automatic Multi-Document Summarization using Machine Learning

2012

G. PadmaPriya

Text Summarization compresses the source text into a shorter version while preserving its information content and overall meaning. Manually summarizing large text documents is very difficult for human beings. Text summarization plays an important role in natural language processing and text mining. Many approaches use statistics and machine learning techniques to extract sent...

Test Model for Text Categorization and Text Summarization

Journal: CoRR, 2011

Khushboo Thakkar, Urmila Shrawankar

Text Categorization is the task of automatically sorting a set of documents into categories from a predefined set, and Text Summarization is the production of a brief and accurate representation of the input text such that the output covers the most important concepts of the source in a condensed manner. Document Summarization is an emerging technique for understanding the main purpose of any kind of documen...
