Stepwise regression for unsupervised learning
Abstract
I consider unsupervised extensions of the fast stepwise linear regression algorithm [5]. These extensions allow one to efficiently identify highly representative feature variable subsets within a given set of jointly distributed variables. This in turn allows for the efficient dimensionality reduction of large data sets via the removal of redundant features. Fast search is effected here by avoiding repeated computations across trial fits, allowing a full representative-importance ranking of a set of feature variables to be carried out in O(n^2 m) time, where n is the number of variables and m is the number of data samples available. This runtime complexity matches that needed to carry out a single regression and is a factor of O(n^2) faster than that of naive implementations. I present pseudocode suitable for efficient forward, reverse, and forward-reverse unsupervised feature selection. To illustrate the algorithm's application, I apply it to the problem of identifying representative stocks within a given financial market index – a challenge relevant to the design of Exchange Traded Funds (ETFs). I also characterize the growth of numerical error with iteration step in these algorithms, and finally demonstrate and rationalize the observation that the forward and reverse algorithms return exactly inverted feature orderings in the weakly correlated feature set regime.
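The forward variant described in the abstract can be illustrated with a minimal sketch. This is not the paper's pseudocode: it assumes a greedy criterion (at each step, pick the feature whose inclusion most reduces the total residual variance of all features) and reuses one residual scatter matrix across trial fits instead of refitting from scratch. The function name `forward_unsupervised_ranking` and the `tol` parameter are hypothetical.

```python
import numpy as np


def forward_unsupervised_ranking(X, tol=1e-12):
    """Greedy forward ranking of the columns of X (m samples x n features).

    Illustrative sketch only (assumed criterion, not the paper's algorithm):
    each step selects the feature that explains the most remaining variance
    across all features, then removes its contribution from the shared
    residual scatter matrix.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)           # center each feature
    C = Xc.T @ Xc                     # residual scatter matrix, n x n
    thresh = tol * np.trace(C)        # below this, a feature is treated as fully explained
    remaining = list(range(C.shape[0]))
    order = []

    while remaining:
        idx = np.array(remaining)
        d = np.diag(C)[idx]           # residual variance of each candidate
        gains = np.zeros(len(idx))
        ok = d > thresh
        # Variance explained across all features if candidate j is added:
        # sum_i C[i, j]^2 / C[j, j]
        gains[ok] = (C[:, idx[ok]] ** 2).sum(axis=0) / d[ok]

        j = remaining.pop(int(np.argmax(gains)))
        order.append(j)

        # Deflate: project feature j out of every residual in one rank-1 update.
        if C[j, j] > thresh:
            C = C - np.outer(C[:, j], C[j, :]) / C[j, j]

    return order
```

Features that appear early in the returned list are the most representative under this criterion; truncating the list gives a reduced feature subset. The rank-1 update is what avoids a separate regression per candidate, at the cost of accumulating floating-point error in `C` over many steps.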
Similar resources
Local Filter Selection Boosts Performance of Automatic Speechreading
We examine general purpose unsupervised techniques for visual preprocessing in machine vision tasks. In particular we analyze a wide variety of principal component and independent component techniques in combination with stepwise regression methods for variable selection. The task at hand is recognition of the first four digits spoken in English using hidden Markov models (HMM) for the recogniti...
Building an asynchronous web-based tool for machine learning classification
Various unsupervised and supervised learning methods including support vector machines, classification trees, linear discriminant analysis and nearest neighbor classifiers have been used to classify high-throughput gene expression data. Simpler and more widely accepted statistical tools have not yet been used for this purpose, hence proper comparisons between classification methods have not bee...
Knowledge-based Clustering as a Conceptual and Algorithmic Environment of Biomedical Data Analysis
While the genuine abundance of biomedical data available nowadays is a genuine blessing, it also poses a lot of challenges. The two fundamental and commonly occurring directions in data analysis deal with its supervised or unsupervised pursuits. Our conjecture is that in the area of biomedical data processing and understanding where we encounter a genuine diversity of patterns, problem desc...
Neural networks based EEG-Speech Models
In this paper, we describe three neural network (NN) based EEG-Speech (NES) models that map the unspoken EEG signals to the corresponding phonemes. Instead of using conventional feature extraction techniques, the proposed NES models rely on graphic learning to project both EEG and speech signals into deep representation feature spaces. This NN based linear projection helps to realize multimodal...
Information Theoretic Clustering
Clustering is one of the important topics in pattern recognition. Since only the structure of the data dictates the grouping (unsupervised learning), information theory is an obvious criterion to establish the clustering rule. This paper describes a novel valley seeking clustering algorithm using an information theoretic measure to estimate the cost of partitioning the data set. The information ...
Journal title:
- CoRR
Volume abs/1706.03265, issue -
Pages -
Publication date: 2017