Search results for: text document classification
Number of results: 765658
Side information is available along with text documents in several text mining applications. It takes different forms, such as document provenance information, links in the document, other non-textual attributes contained in the document, or user access behavior from web logs. Some attributes may contain an extremely large amount of information for clustering purp...
We present the Word Mover’s Distance (WMD), a novel distance function between text documents. Our work is based on recent results in word embeddings that learn semantically meaningful representations for words from local co-occurrences in sentences. The WMD distance measures the dissimilarity between two text documents as the minimum amount of distance that the embedded words of one document nee...
Classification of text documents presents a unique challenge to conventional classification algorithms. Because the datasets contain a large number of features, finding a suitable representation for text documents is a problem in its own right. In this paper, a simple but effective representation model for text documents is discussed as a way to tackle the classification problem. Two different...
India is a multilingual, multi-script country. There are 18 official languages and 12 scripts in India in total. For Optical Character Recognition (OCR) of such a multilingual document, it is necessary to identify the script before feeding the text lines to the OCRs of the individual scripts. In this paper, a simple and efficient technique of script identification for Kannada, Malayalam, Telugu, Tam...
In this paper, we investigate the use of a multivariate Poisson model and feature weighting to learn a naive Bayes text classifier. Our new naive Bayes text classification model assumes that a document is generated by a multivariate Poisson model, whereas previous work considers a document as a vector of binary term features based on the presence or absence of each term. We also explore the use of...
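To make the contrast with binary term features concrete, here is a minimal sketch of a count-based naive Bayes classifier. It is not the authors' model: it assumes each term count follows an independent Poisson distribution per class, uses an additive smoothing constant chosen for illustration, and leaves out feature weighting and document-length normalization. The function names `train_poisson_nb` and `predict_poisson_nb` are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_poisson_nb(docs, labels, smoothing=0.1):
    """Estimate a log prior and a per-term Poisson rate for each class.

    docs is a list of token lists; labels is a parallel list of class names.
    The smoothing constant is an illustrative assumption that keeps rates > 0.
    """
    vocab = sorted({tok for doc in docs for tok in doc})
    docs_per_class = Counter(labels)
    term_totals = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        term_totals[label].update(doc)

    model = {}
    for label, n_docs in docs_per_class.items():
        log_prior = math.log(n_docs / len(docs))
        # Rate = smoothed average count of the term per document in the class.
        rates = {t: (term_totals[label][t] + smoothing) / n_docs for t in vocab}
        model[label] = (log_prior, rates)
    return vocab, model

def predict_poisson_nb(vocab, model, doc):
    """Return the class with the highest Poisson log-posterior for doc."""
    counts = Counter(doc)
    best_label, best_score = None, -math.inf
    for label, (log_prior, rates) in model.items():
        score = log_prior
        for term in vocab:
            x = counts[term]          # observed count, 0 if the term is absent
            lam = rates[term]
            # Poisson log-probability: x*log(lam) - lam - log(x!)
            score += x * math.log(lam) - lam - math.lgamma(x + 1)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Unlike a presence/absence model, this score changes with how often a term occurs, so a document that repeats a class-typical term is pulled toward that class more strongly than one that mentions it once.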
A content-addressable network is a scalable and robust distributed hash table that lets distributed applications store and retrieve information efficiently. We consider design and implementation issues of a document sharing system over a content-addressable overlay network. Improvements and their applicability to a document sharing system are discussed. We describe our system protot...
Text classification refers to determining the class of an unknown text according to its content within a given classification system. In this paper, enhanced features are used to capture the distribution of a word within a single document or across multiple documents. These features can be exploited by a TF-IDF style equation, and different features are combined using ensemble learning techniques. Features are not ...
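For readers unfamiliar with the TF-IDF style of weighting mentioned in this abstract, the following is a minimal sketch of the textbook formulation only. It does not reproduce the paper's enhanced distributional features or its ensemble step, and the function name `tf_idf` is hypothetical. It assumes term frequency is the count divided by document length and inverse document frequency is the natural log of (number of documents / documents containing the term).

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    docs is a list of token lists. Weight = (count / doc length) *
    ln(number of docs / number of docs containing the term).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights
```

Under this formulation, a term that appears in every document receives weight zero, while a term concentrated in few documents is weighted up.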
Clustering is a widely studied data mining problem for text documents. The problem finds numerous applications in customer segmentation, classification, collaborative filtering, visualization, document organization, and indexing. In this paper, we provide a detailed survey of the problem of text clustering. We study the key challenges of the clustering problem, as it applies to the...
This paper presents an investigation into the summarisation of the free text element of questionnaire data using hierarchical text classification. The process makes the assumption that text summarisation can be achieved using a classification approach whereby several class labels can be associated with documents which then constitute the summarisation. A hierarchical classification approach is ...