text document classification

Unsupervised Classification of Text-Centric XML Document Collections

2006

Antoine Doucet Miro Lehtonen

This paper addresses the problem of the unsupervised classification of text-centric XML documents. In the context of the INEX mining track 2006, we present methods to exploit the inherent structural information of XML documents in the document clustering process. Using the k-means algorithm, we have experimented with a couple of feature sets, to discover that a promising direction is to use str...

Early text classification: a Naïve solution

2016

Hugo Jair Escalante Manuel Montes-y-Gómez Luis Villaseñor Pineda Marcelo Luis Errecalde

Text classification is a widely studied problem, and it can be considered solved for some domains and under certain circumstances. There are scenarios, however, that have received little or no attention at all, despite their relevance and applicability. One such scenario is early text classification, where one needs to know the category of a document by using partial information only. A docum...

Effect of Document Representation on the Performance of Medical Document Classification

2006

Fathi H. Saad Beatriz de la Iglesia Duncan G. Bell

Text classification in the medical domain is a real-world problem with wide applicability. This paper extensively investigates the effect of text representation approaches on the performance of medical document classification. To accomplish this objective, we evaluated seven different approaches to represent real-world medical documents. The text representation approaches investigated in this pa...

Is Naïve Bayes a Good Classifier for Document Classification?

2011

S. L. Ting W. H. Ip Albert H.C. Tsang

Document classification is of growing interest in text mining research. Correctly assigning documents to a particular category remains a challenge because of the large number of features in the dataset. Among existing classification approaches, Naïve Bayes is potentially a good document classification model due to its simplicity. The aim of this...

Unsupervised Multi-label Text Classification Using a World Knowledge Ontology

2012

Xiaohui Tao Yuefeng Li Raymond Y. K. Lau Hua Wang

The development of text classification techniques has been largely promoted in the past decade due to the increasing availability and widespread use of digital documents. Usually, the performance of text classification relies on the quality of categories and the accuracy of classifiers learned from samples. When training samples are unavailable or categories are unqualified, text classification...

Efficiency investigation of manifold matching for text document classification

Journal: Pattern Recognition Letters, 2013

Ming Sun Carey E. Priebe

Manifold matching works to identify embeddings of multiple disparate data spaces into the same low-dimensional space, where joint inference can be pursued. It is an enabling methodology for fusion and inference from multiple and massive disparate data sources. In this paper three methods of manifold matching are considered: PoM, which stands for Multidimensional Scaling (MDS) composed with Proc...

MeSH Up: effective MeSH text classification for improved document retrieval

Journal: Bioinformatics, 2009

Dolf Trieschnigg Piotr Pezik Vivian Lee Franciska de Jong Wessel Kraaij Dietrich Rebholz-Schuhmann

MOTIVATION: Controlled vocabularies such as the Medical Subject Headings (MeSH) thesaurus and the Gene Ontology (GO) provide an efficient way of accessing and organizing biomedical information by reducing the ambiguity inherent to free-text data. Different methods of automating the assignment of MeSH concepts have been proposed to replace manual annotation, but they are either limited to a small...

Temporal Classification of Text and Automatic Document Dating

2006

Angelo Dalli

Temporal information is presently underutilised for document and text processing purposes. This work presents an unsupervised method of extracting periodicity information from text, enabling time series creation and filtering to be used in the creation of sophisticated language models that can discern between repetitive trends and non-repetitive writing patterns. The algorithm performs in O(n ...

Text Document Classification: an Approach Based on Indexing

2012

B S Harish S Manjunath

In this paper we propose a new method of classifying text documents. Unlike conventional vector space models, the proposed method preserves the sequence of term occurrence in a document. The term sequence is preserved with the help of a novel data structure called the ‘Status Matrix’. Further, a corresponding classification technique is proposed for efficient classification of tex...

A semantic partition based text mining model for document classification

2018

Catherine Inibhunu

Feature extraction is a mechanism used to extract key phrases from a given text document. This extraction can be weighted, ranked, or semantic based. Weighted and ranking-based feature extraction normally assigns scores to extracted words based on various heuristics. The highest-scoring words are seen as important. Semantic-based extraction normally tries to understand word meanings, and words wit...