Classification for high-dimension low-sample size data
Authors
Abstract
High-dimension, low-sample-size (HDLSS) data sets pose great challenges to many machine learning methods, so the development of new classification techniques for practical HDLSS problems is highly desired. After the cause of the over-fitting phenomenon is identified, a criterion for HDLSS data sets, termed tolerance similarity, is proposed to emphasize maximization of the within-class variance on the premise of class separability. Leveraging this criterion, a novel linear binary classifier, the No-separated Data Maximum Dispersion classifier (NPDMD), is designed. The main idea of NPDMD is to spread the samples of the two classes over a large interval in the respective positive or negative half-space along the projecting direction, provided the distance between the projection means is large enough. Its salient features are: (1) it operates well on HDLSS data sets; (2) it solves the objective function in the entire feature space to avoid the data-piling phenomenon; (3) it leverages the low-rank property of the covariance matrix to accelerate computation; (4) it is suitable for different real-world applications; (5) it can be implemented readily using quadratic programming. Not only have theoretical properties been derived, but a series of evaluations has also been conducted on one simulated and six real-world benchmark data sets, including face and mRNA classification. Experimental results and comprehensive comparative studies demonstrate its superiority in terms of correct classification rate, mean within-group correct rate and area under the ROC curve.
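Since the abstract states that the classifier reduces to a quadratic program, here is a minimal, hedged sketch of how the "spread both classes widely on their own side of the projection, while keeping them separated" idea can be encoded as a convex QP. This is not the authors' exact NPDMD formulation: the objective below (a squared-norm penalty minus a reward for signed margins) and the trade-off parameter `lam` are illustrative assumptions.

```python
# Illustrative sketch only: NOT the published NPDMD objective. The hard
# margin constraints enforce class separability; the linear reward term
# pushes each projection deep into its own half-space (large dispersion).
import numpy as np
import cvxpy as cp

def fit_dispersion_classifier(X, y, lam=1.0):
    """X: (n, d) samples; y: (n,) labels in {-1, +1}. `lam` is a hypothetical
    trade-off weight, an assumption of this sketch."""
    n, d = X.shape
    w = cp.Variable(d)
    b = cp.Variable()
    margins = cp.multiply(y, X @ w + b)        # y_i * (w^T x_i + b)
    objective = cp.Minimize(cp.sum_squares(w) - lam * cp.sum(margins))
    cp.Problem(objective, [margins >= 1]).solve()
    return w.value, b.value

# Toy HDLSS usage: many more features than samples (d >> n)
rng = np.random.default_rng(0)
n, d = 20, 200
X = np.vstack([rng.normal(+0.5, 1.0, (n // 2, d)),
               rng.normal(-0.5, 1.0, (n // 2, d))])
y = np.array([1] * (n // 2) + [-1] * (n // 2))
w, b = fit_dispersion_classifier(X, y)
print("training accuracy:", np.mean(np.sign(X @ w + b) == y))
```

The quadratic penalty keeps the problem bounded, so the solver balances a short projection vector against projections that land far from the decision threshold, which mimics the dispersion-maximization idea in spirit.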
Similar resources
Deep Neural Networks for High Dimension, Low Sample Size Data
Deep neural networks (DNNs) have achieved breakthroughs in applications with large sample sizes. However, when facing high dimension, low sample size (HDLSS) data, such as the phenotype prediction problem using genetic data in bioinformatics, DNNs suffer from overfitting and high-variance gradients. In this paper, we propose a DNN model tailored for HDLSS data, named Deep Neural Pursuit (DNP)...
Geometric representation of high dimension, low sample size data
High dimension, low sample size data are emerging in various areas of science. We find a common structure underlying many such data sets by using a non-standard type of asymptotics: the dimension tends to infinity while the sample size is fixed. Our analysis shows a tendency for the data to lie deterministically at the vertices of a regular simplex. Essentially all the randomness in the data appears o...
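The simplex tendency described in that abstract is easy to verify numerically: for i.i.d. standard Gaussian samples with the sample size fixed, all pairwise distances concentrate around sqrt(2d) as the dimension d grows, so the scaled point cloud approaches the vertices of a regular simplex. The following sanity check is my own simulation, not code from the cited work.

```python
# Sanity-check simulation of the HDLSS simplex phenomenon (own sketch).
import numpy as np

rng = np.random.default_rng(0)
n = 5                                          # fixed, small sample size
for d in (10, 1_000, 100_000):
    X = rng.standard_normal((n, d))
    diffs = X[:, None, :] - X[None, :, :]      # all pairwise differences
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Each squared coordinate difference has mean 2, so dist ~ sqrt(2d)
    scaled = dists[np.triu_indices(n, k=1)] / np.sqrt(2 * d)
    print(f"d={d:>7}: scaled pairwise distances ~ {scaled.round(3)}")
```

As d grows the scaled distances all tend to 1, i.e., the points become nearly equidistant, which is exactly the regular-simplex geometry the abstract describes.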
Asymptotics for High Dimension, Low Sample Size data and Analysis of Data on Manifolds
SUNGKYU JUNG: Asymptotics for High Dimension, Low Sample Size data and Analysis of Data on Manifolds. (Under the direction of Dr. J. S. Marron.) The dissertation consists of two research topics regarding modern non-standard data analytic situations. In particular, data under the High Dimension, Low Sample Size (HDLSS) situation and data lying on manifolds are analyzed. These situations are rela...
PCA Consistency in High Dimension, Low Sample Size Context
Principal Component Analysis (PCA) is an important tool of dimension reduction especially when the dimension (or the number of variables) is very high. Asymptotic studies where the sample size is fixed, and the dimension grows (i.e. High Dimension, Low Sample Size (HDLSS)) are becoming increasingly relevant. We investigate the asymptotic behavior of the Principal Component (PC) directions. HDLS...
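In the spiked-covariance setting commonly used for such HDLSS PCA asymptotics, whether the leading sample PC direction converges to the true direction depends on how the spike eigenvalue scales with the dimension. The simulation below is an illustrative sketch under that assumed model; the specific scalings compared (a fixed spike versus one growing like d^1.5) are my choices for contrast, not the paper's exact regimes.

```python
# Own sketch: angle between sample PC1 and the true spike direction as the
# dimension grows with the sample size fixed, under an assumed spiked model.
import numpy as np

rng = np.random.default_rng(0)
n = 20                                         # fixed sample size
for d in (100, 1_000, 10_000):
    v = np.zeros(d)
    v[0] = 1.0                                 # true leading PC direction
    for spike in (10.0, float(d) ** 1.5):      # fixed vs. dimension-scaled spike
        # Draw X ~ N(0, I + spike * v v^T), factor-model style
        X = rng.standard_normal((n, d)) + np.sqrt(spike) * rng.standard_normal((n, 1)) * v
        Xc = X - X.mean(axis=0)                # center the data
        pc1 = np.linalg.svd(Xc, full_matrices=False)[2][0]  # sample PC1
        angle = np.degrees(np.arccos(min(1.0, abs(pc1 @ v))))
        print(f"d={d:>6}, spike={spike:>12.0f}: angle(PC1, v) = {angle:5.1f} deg")
```

With the fixed spike the estimated direction drifts toward 90 degrees from the truth as d grows (strong inconsistency), while the dimension-scaled spike keeps the angle small, illustrating the regime dependence the abstract investigates.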
Clustering High Dimension, Low Sample Size Data Using the Maximal Data Piling Distance
We propose a new hierarchical clustering method for high dimension, low sample size (HDLSS) data. The method utilizes the fact that each individual data vector accounts for exactly one dimension in the subspace generated by HDLSS data. The linkage that is used for measuring the distance between clusters is the orthogonal distance between affine subspaces generated by each cluster. The ideal imp...
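One way to read the linkage described above: the distance between two clusters is the minimum (orthogonal) distance between the affine subspaces their points span, obtained by projecting the mean difference onto the orthogonal complement of the combined direction space. The helper below is a hedged sketch of that computation under this reading, not the authors' implementation.

```python
# Own sketch of an orthogonal distance between affine flats spanned by
# two clusters; in HDLSS data the flats generically do not intersect.
import numpy as np

def affine_subspace_distance(A, B):
    """Orthogonal distance between the affine flats spanned by the rows of
    A (n1, d) and B (n2, d)."""
    mu_a, mu_b = A.mean(axis=0), B.mean(axis=0)
    dirs = np.vstack([A - mu_a, B - mu_b])            # combined direction space
    U, s, _ = np.linalg.svd(dirs.T, full_matrices=False)
    Q = U[:, s > 1e-10 * s.max()]                     # orthonormal basis of it
    delta = mu_b - mu_a
    residual = delta - Q @ (Q.T @ delta)              # strip in-flat components
    return float(np.linalg.norm(residual))

# Toy HDLSS clusters: each cluster's points span their own low-dim flat
rng = np.random.default_rng(0)
A = rng.normal(0.0, 1.0, (5, 100))
B = rng.normal(3.0, 1.0, (5, 100))
print(affine_subspace_distance(A, B))
```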
Journal
Journal title: Pattern Recognition
سال: 2022
ISSN: 1873-5142, 0031-3203
DOI: https://doi.org/10.1016/j.patcog.2022.108828