Single-Pass PCA of Large High-Dimensional Data
Authors
Abstract
Principal component analysis (PCA) is a fundamental dimension reduction tool in statistics and machine learning. For large, high-dimensional data, computing the PCA (i.e., the top singular vectors of the data matrix) becomes challenging. This work proposes a single-pass randomized algorithm that computes PCA with only one pass over the data. It is suited to extremely large, high-dimensional data stored in slow memory (hard disk) or generated in a streaming fashion. Experiments with synthetic and real data validate the algorithm's accuracy: its error is orders of magnitude smaller than that of an existing single-pass algorithm. For high-dimensional data stored as a 150 GB file, the algorithm computes the first 50 principal components in just 24 minutes on a typical 24-core computer, using less than 1 GB of memory.
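The abstract does not spell out the algorithm, so the following is only a minimal sketch of one generic way to do randomized PCA in a single pass, not the authors' method. It assumes the data arrive as row blocks that are already mean-centered, and it keeps just two small accumulators in memory. The function name `single_pass_pca` and the parameters `row_blocks`, `oversample`, and `seed` are hypothetical names chosen for illustration.

```python
import numpy as np

def single_pass_pca(row_blocks, n_features, k, oversample=10, seed=0):
    """Approximate the top-k singular values and right singular vectors
    of a tall matrix A, reading each row block exactly once.

    Assumption: the rows are already mean-centered.
    """
    rng = np.random.default_rng(seed)
    l = k + oversample                      # sketch width
    omega = rng.standard_normal((n_features, l))

    G = np.zeros((n_features, l))           # accumulates A^T (A @ omega)
    H = np.zeros((l, l))                    # accumulates (A @ omega)^T (A @ omega)
    for block in row_blocks:                # single pass over the data
        Y = block @ omega
        G += block.T @ Y
        H += Y.T @ Y

    # Implicit orthonormalization of Y = A @ omega:
    # H = W diag(d) W^T, so Q = Y @ T with T = W diag(d)^(-1/2) has Q^T Q = I.
    d, W = np.linalg.eigh(H)
    keep = d > d.max() * 1e-12              # drop numerically null directions
    T = W[:, keep] / np.sqrt(d[keep])

    B = T.T @ G.T                           # equals Q^T A, shape (r, n_features)
    _, s, Vt = np.linalg.svd(B, full_matrices=False)
    return s[:k], Vt[:k]                    # singular values, principal directions
```

Memory use in this sketch is proportional to `n_features * (k + oversample)`, independent of the number of rows, so `row_blocks` can be any iterator that yields chunks read from disk or from a stream. Forming `H` squares the condition number of the sketch, so accuracy on the smallest retained components may be worse than with a multi-pass method.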
Similar resources
Identification of mineralization features and deep geochemical anomalies using a new FT-PCA approach
The analysis of geochemical data in the frequency domain, as indicated in this research study, can provide new exploratory information that may not be exposed in the spatial domain. To identify deep geochemical anomalies, the sulfide zone, and geochemical noise in the Dalli Cu–Au porphyry deposit, a new approach based on coupling the Fourier transform (FT) and principal component analysis (PCA) has been used. The re...
Streaming, Memory-Limited PCA
In this paper, we consider a streaming one-pass-over-the-data model for Principal Component Analysis (PCA). The input, in this case, is a stream of p-dimensional vectors, and the output is a collection of k, p-dimensional principal components that span the best approximating subspace. Consequently, the minimum memory requirement for such problems is O(kp). Yet the standard PCA algorithm require...
Face Detection at the Low Light Environments
Today, with the advancement of technology, the use of tools for extracting information from video is much wider in terms of both visual power and processing power. High-speed car, perfect detection accuracy, business diversity in the fields of medicine, home appliances, smart cars, humanoid robots, and military systems, and commercialization make these systems cost-effective. Among the most...
Linear Modelling for Spectral Images based on Truncated Fourier Series
Reflectance spectra of hyperspectral images of the natural scenes are supposed to represent the real world better than any certain classes of natural and man-made spectral reflectance. But spectral images contain a large volume of data and place considerable demands on computer hardware and software compared with standard trichromatic images. Although principal component analysis (PCA) based lo...
Feature Selection for Small Sample Sets with High Dimensional Data Using Heuristic Hybrid Approach
Feature selection can be decisive when analyzing high-dimensional data, especially with a small number of samples. Feature extraction methods do not perform well under these conditions. With small sample sets and high-dimensional data, exploring a large search space and learning from insufficient samples becomes extremely hard. As a result, neural networks and clustering a...