Scalable High Performance Dimension Reduction
Abstract
Dimension reduction is a useful tool for visualizing high-dimensional data, making analysis feasible for scientific data that is both vast in volume and high in dimension. Among the known dimension reduction algorithms, the multidimensional scaling (MDS) algorithm is investigated in this proposal because of its theoretical robustness and broad applicability. MDS is a non-linear optimization problem, so an EM-like hill-climbing approach is easily trapped in local optima. To avoid local optima, the author has applied a deterministic annealing (DA) approach to SMACOF, a well-known EM-like MDS algorithm. Furthermore, MDS must be parallelized to handle large amounts of data in distributed-memory environments, such as multicore cluster systems, since it requires O(N^2) physical memory as well as O(N^2) computation. Although parallelization enables the SMACOF algorithm to handle tens of thousands or even hundreds of thousands of points, it is still difficult to run parallel SMACOF on millions of points, since that requires too much memory and computation. Thus, the author proposes an interpolation approach that utilizes the known mapping of only a subset of the given data, called the in-sample data. This approach effectively reduces computational complexity. With a minor trade-off in approximation accuracy, the interpolation method makes it possible to process millions of data points with modest computation and memory requirements. Since a huge amount of data is involved, the author also presents how to parallelize the proposed interpolation algorithms. As expected, applying the DA approach to the SMACOF algorithm keeps the proposed algorithm from getting stuck in local optima and consistently finds better results on the tested biological sequence data.
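To make the SMACOF iteration and the STRESS criterion mentioned above concrete, the following is a minimal pure-Python sketch, not the author's implementation. It assumes the standard raw STRESS definition, the sum over i < j of w_ij (d_ij(X) - delta_ij)^2, and the uniform-weight Guttman transform X <- (1/N) B(X) X. The function names (`pairwise_distances`, `stress`, `smacof_step`) are hypothetical, and neither deterministic annealing nor parallelism is included.

```python
import math
import random


def pairwise_distances(points):
    """Return the full N x N Euclidean distance matrix (O(N^2) memory)."""
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]


def stress(delta, points, weights=None):
    """Raw STRESS: sum over i < j of w_ij * (d_ij(X) - delta_ij)^2.

    weights=None is treated as the uniform weight function (w_ij = 1).
    """
    n = len(points)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            w = 1.0 if weights is None else weights[i][j]
            d = math.dist(points[i], points[j])
            total += w * (d - delta[i][j]) ** 2
    return total


def smacof_step(delta, points):
    """One Guttman-transform update for uniform weights (assumed here).

    Row i of (1/N) B(X) X equals
    (1/N) * sum over j != i of (delta_ij / d_ij(X)) * (x_i - x_j).
    """
    n, dim = len(points), len(points[0])
    new_points = []
    for i in range(n):
        row = [0.0] * dim
        for j in range(n):
            if i == j:
                continue
            d = math.dist(points[i], points[j])
            ratio = delta[i][j] / d if d > 0 else 0.0
            for k in range(dim):
                row[k] += ratio * (points[i][k] - points[j][k])
        new_points.append([v / n for v in row])
    return new_points


if __name__ == "__main__":
    random.seed(0)
    # Hypothetical toy data: 8 random points in 3-D, mapped down to 2-D.
    high_dim = [[random.random() for _ in range(3)] for _ in range(8)]
    delta = pairwise_distances(high_dim)
    mapping = [[random.random() for _ in range(2)] for _ in range(8)]
    print("initial STRESS:", stress(delta, mapping))
    for _ in range(50):
        mapping = smacof_step(delta, mapping)
    print("final STRESS:  ", stress(delta, mapping))
```

Each call to `smacof_step` touches all N^2 pairs, which is where the quadratic cost comes from. Assuming 8-byte floating-point values, a full dissimilarity matrix for N = 100,000 points alone occupies about 8 x 10^10 bytes (80 GB), which illustrates why a single compute node runs out of memory.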
Also, applying distributed parallelism to the SMACOF algorithm makes it possible to run on data sizes that do not fit on a single compute node. The author will compare a pure MPI parallel model with a hybrid (MPI-Threading) parallel model to find the better parallel model for the SMACOF (and DA-SMACOF) algorithm. The experimental results also illustrate that the quality of the interpolated mapping is comparable to that of a mapping produced by the original algorithm alone. In terms of parallel performance, the interpolation method is parallelized with high efficiency. With the proposed interpolation method, it is possible to construct a configuration of two million out-of-sample data points in the target dimension, and the number of out-of-sample points can be increased further. The effect of the weight function on the STRESS value will also be investigated with several non-uniform weight functions as well as a uniform weight function.
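The abstract does not spell out the interpolation update, so the following is only a sketch of one way such an out-of-sample step could work, not the author's algorithm. It assumes that each out-of-sample point is placed by minimizing STRESS against k in-sample points whose low-dimensional positions are already known and held fixed, using a majorization update of the form x <- centroid + (1/k) * sum of (delta_i / d_i) * (x - p_i). The names `interpolate_point` and `local_stress` are hypothetical, and the selection of the k in-sample neighbors is left to the caller.

```python
import math


def local_stress(x, neighbor_coords, neighbor_dissims):
    """STRESS of one point x against its fixed in-sample neighbors."""
    return sum(
        (math.dist(x, p) - delta) ** 2
        for p, delta in zip(neighbor_coords, neighbor_dissims)
    )


def interpolate_point(neighbor_coords, neighbor_dissims, iterations=50):
    """Place one out-of-sample point given k fixed in-sample neighbors.

    neighbor_coords: low-dimensional positions of the k in-sample points.
    neighbor_dissims: original-space dissimilarities to those k points.
    """
    k = len(neighbor_coords)
    dim = len(neighbor_coords[0])
    centroid = [sum(p[c] for p in neighbor_coords) / k for c in range(dim)]
    # Start near the centroid; the small offset keeps the start from
    # coinciding exactly with a neighbor when k == 1.
    x = [centroid[c] + 1e-3 * (c + 1) for c in range(dim)]
    for _ in range(iterations):
        step = [0.0] * dim
        for p, delta in zip(neighbor_coords, neighbor_dissims):
            d = math.dist(x, p)
            ratio = delta / d if d > 0 else 0.0
            for c in range(dim):
                step[c] += ratio * (x[c] - p[c])
        # Majorization update (assumed form, see the lead-in).
        x = [centroid[c] + step[c] / k for c in range(dim)]
    return x


if __name__ == "__main__":
    # Hypothetical in-sample mapping: three already-placed 2-D points.
    anchors = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    # Dissimilarities of a new point to the anchors (toy values).
    dissims = [0.5, 0.8, 0.7]
    placed = interpolate_point(anchors, dissims)
    print("placed at:", placed)
    print("local STRESS:", local_stress(placed, anchors, dissims))
```

In this sketch each out-of-sample point depends only on the fixed in-sample mapping, so the work per point is O(k) per iteration rather than O(N), and different points can be processed independently of one another.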
Similar resources
Cost Effective and Scalable Synthesis of MnO2 Doped Graphene in a Carbon Fiber/PVA: Superior Nanocomposite for High Performance Flexible Supercapacitors
In the current study, we report new flexible, free standing and high performance electrodes for electrochemical supercapacitors developed through a scalable but simple and efficient approach. Highly porous structures based on carbon fiber and poly (vinyl alcohol) (PVA) were used as a pattern. The electrochemical performances of Carbon fiber/GO-MnO2/CNT supercapacitors were characteriz...
UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction
UMAP (Uniform Manifold Approximation and Projection) is a novel manifold learning technique for dimension reduction. UMAP is constructed from a theoretical framework based in Riemannian geometry and algebraic topology. The result is a practical scalable algorithm that applies to real world data. The UMAP algorithm is competitive with t-SNE for visualization quality, and arguably preserves more of...
Scalable Mobile Visual Classification by Kernel Preserving Projection Over High-Dimensional Features
Scalable mobile visual classification – classifying images/videos in a large semantic space on mobile devices in real time – is an emerging problem as observing the paradigm shift towards mobile platforms and the explosive growth of visual data. Though seeing the advances in detecting thousands of concepts in the servers, the scalability is handicapped in mobile devices due to the severe resour...
Adaptive Randomized Dimension Reduction on Massive Data
The scalability of statistical estimators is of increasing importance in modern applications. One approach to implementing scalable algorithms is to compress data into a low dimensional latent space using dimension reduction methods. In this paper we develop an approach for dimension reduction that exploits the assumption of low rank structure in high dimensional data to gain both computational...
Scalable, Automated Performance Analysis with TAU and PerfExplorer
Scalable performance analysis is a challenge for parallel development tools. The potential size of data sets and the need to compare results from multiple experiments presents a challenge to manage and process the information, and to characterize the performance of parallel applications running on potentially hundreds of thousands of processor cores. In addition, many exploratory analysis proce...
Knowledge support and automation for performance analysis with PerfExplorer 2.0
The integration of scalable performance analysis in parallel development tools is difficult. The potential size of data sets and the need to compare results from multiple experiments presents a challenge to manage and process the information. Simply to characterize the performance of parallel applications running on potentially hundreds of thousands of processor cores requires new scalable anal...