Three Algorithms for Cholesky Factorization on Distributed Memory Using Packed Storage
Abstract
We present three algorithms for Cholesky factorization using minimum block storage for a distributed memory (DM) environment. One of the distributed square blocked packed (SBP) format algorithms performs similarly to ScaLAPACK PDPOTRF, and with iteration overlapping outperforms it by as much as 67%. By storing the blocks in a standard contiguous way, we get better-performing BLAS operations. Our DM algorithms are almost insensitive to memory hierarchy effects and thus give smooth and predictable performance. We investigate the intricacies of using RFP format in a DM ScaLAPACK environment and point out some advantages and drawbacks.

1 Near Minimal Storage in a Serial Environment

Rectangular full packed (RFP) format is a standard full-storage two-dimensional array for triangular or symmetric matrices requiring minimum storage [3]. For the lower triangular case, blocks A11, A21, A22 are stored as submatrices in a rectangular full-storage array. This allows the use of level 3 BLAS and makes it easy to write LAPACK-style code for this format [3]. SBP format is a generalization of standard full storage. The matrix is partitioned into square blocks of order NB; when storing symmetric or triangular matrices, only the blocks of the triangle are stored. Each square block is contiguous in memory, and the blocks are stored either row- or column-wise. Each square diagonal block wastes NB(NB-1)/2 elements, for a total of N(NB-1)/2 elements summed over all N/NB diagonal blocks. Each square block maps into L1 cache in an optimal way, resulting in efficient BLAS operations.

2 Minimum Block Storage in a Distributed Environment

The current industry standard for distributed memory computing views the processors as a PxQ mesh and uses a 2D Block Cyclic Layout (BCL) of full-format arrays. This has proven to be a good choice for achieving effective load balancing. However, it wastes about half the storage for triangular and symmetric matrices.
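The 2D block cyclic layout described above can be sketched with a simple owner-computes mapping. This is an illustrative sketch only, not ScaLAPACK's actual API: the function names and the assumption that block (i, j) is owned by process (i mod P, j mod Q) on a P x Q mesh are ours.

```python
def block_owner(i, j, P, Q):
    """Coordinates (p, q) of the process that owns global block (i, j)
    under a 2D block cyclic layout over a P x Q process mesh."""
    return (i % P, j % Q)

def local_block_index(i, j, P, Q):
    """Position of global block (i, j) inside its owner's local block array:
    successive owned blocks along a dimension differ by P (resp. Q) globally."""
    return (i // P, j // Q)
```

Because consecutive block rows and columns cycle over the mesh, every process receives a near-equal share of the active submatrix at every stage of the factorization, which is why BCL load-balances well.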
There is currently no industry standard for packed storage. The SBP storage of Section 1 is a possibility.
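To make the SBP storage of Section 1 concrete, the following sketch packs the lower triangle of a matrix into contiguous NB x NB blocks and checks the waste formula stated above. The function names are ours, and we assume for simplicity that N is a multiple of NB.

```python
import numpy as np

def sbp_pack_lower(A, NB):
    """Pack the lower triangle of A into SBP format: one contiguous
    NB x NB tile per block on or below the diagonal, block-column-wise."""
    N = A.shape[0]
    nblk = N // NB
    blocks = []
    for j in range(nblk):            # block column
        for i in range(j, nblk):     # blocks on/below the diagonal
            blocks.append(A[i*NB:(i+1)*NB, j*NB:(j+1)*NB].copy())
    return blocks

def sbp_storage(N, NB):
    """Elements used by lower-triangular SBP vs. the minimum N(N+1)/2.
    Each of the N/NB diagonal blocks wastes NB(NB-1)/2 elements, so the
    total waste is N(NB-1)/2."""
    nblk = N // NB
    used = (nblk * (nblk + 1) // 2) * NB * NB
    minimum = N * (N + 1) // 2
    return used, minimum, used - minimum
```

For N = 8 and NB = 2, the ten stored 2 x 2 tiles occupy 40 elements against a minimum of 36, a waste of 4 = N(NB-1)/2, matching the formula in Section 1.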
Similar Resources
Optimizing Locality of Reference in Cholesky Algorithms
This paper presents the principal ideas involved in hierarchical blocking, introduces the block packed storage scheme, and gives the implementation details and the performance rates of the hierarchically blocked Cholesky factorization. In some cases the newly developed routines are faster by an order of magnitude than the corresponding LAPACK routines. Introduction Most current computers based ...
A distributed packed storage for large dense parallel in-core calculations
We propose in this paper a distributed packed storage format that exploits the symmetry or the triangular structure of a dense matrix. This format stores only half of the matrix while maintaining most of the efficiency compared to a full storage for a wide range of operations. This work has been motivated by the fact that, contrary to sequential linear algebra libraries (e.g. LAPACK [4]), there...
High Performance Cholesky Factorization via Blocking and Recursion That Uses Minimal Storage
We present a high performance Cholesky factorization algorithm, called BPC for Blocked Packed Cholesky, which performs better than or equivalent to the LAPACK DPOTRF subroutine, but with about the same memory requirements as the LAPACK DPPTRF subroutine, which runs at level 2 BLAS speed. Algorithm BPC only calls DGEMM and level 3 kernel routines. It combines a recursive algorithm with blocking and ...
LAPACK Working Note ? LAPACK Block Factorization Algorithms on the Intel iPSC/860
The aim of this project is to implement the basic factorization routines for solving linear systems of equations and least squares problems from LAPACK—namely, the blocked versions of LU with partial pivoting, QR, and Cholesky on a distributed-memory machine. We discuss our implementation of each of the algorithms and the results we obtained using varying orders of matrices and blocksizes.
Efficient Methods for Out-of-Core Sparse Cholesky Factorization
We consider the problem of sparse Cholesky factorization with limited main memory. The goal is to efficiently factor matrices whose Cholesky factors essentially fill the available disk storage, using very little memory (as little as 16 Mbytes). This would enable very large industrial problems to be solved with workstations of very modest cost. We consider three candidate algorithms. Each is based o...