Text Mining with Support Vector Machines and Non-negative Matrix Factorization Algorithms, by Neelima Guduru. A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science, University of Rhode Island
Authors
Abstract
The objective of this thesis is to develop efficient models for classifying text documents. In typical text mining algorithms, a document is represented as a vector whose dimension is the number of distinct keywords in it, which can be very large. Consequently, traditional text classification can be computationally expensive. In this work, feature extraction through the non-negative matrix factorization (NMF) algorithm is used to reduce the dimensionality of the documents. This was accomplished in the Oracle data mining software, which has the NMF algorithm built in. Following feature extraction, a support vector machine (SVM) algorithm was used to classify the text documents. The performance of models with SVM alone and models with NMF and SVM is compared by applying them to classify biomedical documents from a subset of the MEDLINE database into user-defined categories. Since models based on SVM alone use documents with full dimensionality, their classification performance is very good; however, they are computationally expensive. On this data set, the dimensionality is 1617 and the SVM models achieve an accuracy of approximately 98%. With NMF feature extraction, the dimensionality is reduced to between 4 and 100, dramatically reducing the complexity of the classification model, while the model accuracy ranges from 70% to 92%. Thus, it is concluded that NMF feature extraction results in a large decrease in computational time, with only a small reduction in accuracy.
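The feature-extraction step described above can be illustrated with a minimal sketch. The thesis used the NMF implementation built into the Oracle data mining software; the code below is not that implementation. It is a plain-Python stand-in that assumes the standard multiplicative-update rules for the Frobenius-norm objective, and the toy document-term matrix `docs`, the function names, and the rank `k=2` are all hypothetical choices made for illustration. The SVM classification step is omitted.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Frobenius norm of V - W H, used here as the reconstruction error."""
    wh = matmul(w, h)
    return sum(
        (v[i][j] - wh[i][j]) ** 2 for i in range(len(v)) for j in range(len(v[0]))
    ) ** 0.5


def nmf(v, k, iterations=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix V (documents x terms) as V ~ W H.

    W (documents x k) holds the reduced k-dimensional document features;
    H (k x terms) describes each feature as a weighting over terms.
    Assumption: multiplicative updates, not the Oracle algorithm.
    """
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    # Strictly positive random start so the first updates are well defined.
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)] for a in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)] for i in range(n)]
    return w, h


# Hypothetical toy data: 6 documents over 8 keywords, two obvious groups.
docs = [
    [3, 2, 4, 1, 0, 0, 0, 0],
    [2, 3, 3, 2, 0, 0, 0, 0],
    [4, 1, 2, 3, 0, 0, 0, 0],
    [0, 0, 0, 0, 2, 3, 1, 4],
    [0, 0, 0, 0, 3, 2, 4, 1],
    [0, 0, 0, 0, 1, 4, 2, 3],
]

# Each document is reduced from 8 dimensions to 2.
reduced, features = nmf(docs, k=2)
for row in reduced:
    print([round(x, 3) for x in row])
```

In a pipeline like the one the abstract describes, the rows of `reduced` would replace the full keyword vectors as input to the classifier, which is where the reduction in model complexity comes from.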
Similar resources
A Comparison of Hensel Lifting Techniques for Univariate Polynomials, by Matthew A. Kayala. A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science, University of Rhode Island
Hensel lifting algorithms are often used in exact polynomial factorization. While the quadratic algorithm due to Zassenhaus [1, 2] is asymptotically faster, it has been shown that the original linear construction often performs better in practice because it involves less computation per step [3]. A hybrid algorithm combining the two has also been proposed [4]. There are no clear cut results on ...
A QUADRATIC MARGIN-BASED MODEL FOR WEIGHTING FUZZY CLASSIFICATION RULES INSPIRED BY SUPPORT VECTOR MACHINES
Recently, tuning the weights of the rules in Fuzzy Rule-Base Classification Systems has been studied as a way to improve classification accuracy. In this paper, a margin-based optimization model, inspired by Support Vector Machine classifiers, is proposed to compute these fuzzy rule weights. This approach not only considers both accuracy and generalization criteria in a single objective fu...
Thesis Submitted in Partial Fulfillment of the requirement for the Degree of M.A/M. Sc In School consultant
Goal: The aim of this study is to assess and compare the emotional ability of deaf, semi-deaf, and hearing students (14-20) in Mashhad. Method: A total of 105 students were selected randomly, 35 from each group: hearing boys and girls, deaf boys and girls, and semi-deaf boys and girls. This article is useful and explanatory. In this stud...
A Projected Alternating Least square Approach for Computation of Nonnegative Matrix Factorization
Nonnegative matrix factorization (NMF) is a common method in data mining that has been used in different applications as a dimension reduction, classification, or clustering method. Methods based on the alternating least squares (ALS) approach are usually used to solve this non-convex minimization problem. At each step of ALS algorithms, two convex least squares problems must be solved, which causes high com...
Exact Algorithms for the Reversal Median Problem
OF THESIS Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Science Computer Science The University of New Mexico Albuquerque, New Mexico