"Quick" Implementation of Block LU Algorithms on the CM-200

Author

Claus Bendtsen

Abstract

The CMSSL library includes only a limited number of mathematical algorithms. Hence, when writing code for the Connection Machine, one often has to implement LAPACK-style routines already developed for other architectures, in the hope that acceptable performance can be obtained relatively quickly. Due to the massively parallel structure of the CM, algorithms with a serial or global structure, such as LU factorization, tend to yield poor performance (global here meaning that elements do not interact only with elements in their neighborhood). The purpose of this note is partly to show what performance one can typically obtain when implementing a global algorithm on the CM-200 in a relatively limited period of time, and partly to examine the pros and cons of using a block algorithm.

The testing has been performed on the LU factorization in a normal as well as a blocked version. The implementation uses BLAS level 3 equivalent routines, and the results have been compared with the LU factorization present in the CMSSL library. The implementation and optimization of both the normal and the blocked version were carried out within 14 days altogether. The obtained performance is very disappointing: only 4% of the performance of the CMSSL routine for large matrices.

The LU factorization is implemented by walking down the diagonal and subtracting outer products along the way. The outer products are calculated by means of spreads in the unoptimized version and by means of a mask and scans in the optimized version. The bottleneck of the operation is undoubtedly the spread/scan operations, since these demand a tremendous amount of communication and are thus very time consuming, even though the communication is along a single axis. Pivoting and equilibration are not implemented. Since the functions need global communication, the best results have been obtained by choosing a :NEWS layout. The solver is implemented in a way similar to the factorization.
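The factorization scheme described above (walk down the diagonal, subtract an outer product at each step) and its blocked counterpart can be sketched in plain Python. This is a minimal illustration, not the CM Fortran code of the note: the function names `lu_outer_product` and `lu_blocked` and the `block` parameter are hypothetical, the nested loops stand in for the spread/scan communication, and, as in the note, no pivoting or equilibration is done, so nonzero pivots are assumed.

```python
def lu_outer_product(a):
    """Unblocked LU without pivoting (assumes nonzero pivots).

    Returns the packed factors: U on and above the diagonal,
    the multipliers of L (unit diagonal implied) below it.
    """
    n = len(a)
    w = [row[:] for row in a]              # work on a copy
    for k in range(n):                     # walk down the diagonal
        pivot = w[k][k]
        if pivot == 0.0:
            raise ZeroDivisionError("zero pivot; this sketch does not pivot")
        for i in range(k + 1, n):
            w[i][k] /= pivot               # column k of L
            for j in range(k + 1, n):
                # subtract the outer product (column k of L) x (row k of U)
                w[i][j] -= w[i][k] * w[k][j]
    return w


def lu_blocked(a, block=2):
    """Right-looking blocked LU without pivoting; same packed output."""
    n = len(a)
    w = [row[:] for row in a]
    for k in range(0, n, block):
        e = min(k + block, n)              # panel covers columns k..e-1
        # 1. Unblocked factorization of the panel w[k:n][k:e].
        for p in range(k, e):
            for i in range(p + 1, n):
                w[i][p] /= w[p][p]
                for j in range(p + 1, e):
                    w[i][j] -= w[i][p] * w[p][j]
        # 2. Block row of U: solve L11 * U12 = A12 (unit lower triangular).
        for p in range(k, e):
            for i in range(p + 1, e):
                for j in range(e, n):
                    w[i][j] -= w[i][p] * w[p][j]
        # 3. Trailing update A22 -= L21 * U12 (the matrix-matrix,
        #    BLAS level 3 style part of the algorithm).
        for i in range(e, n):
            for j in range(e, n):
                w[i][j] -= sum(w[i][p] * w[p][j] for p in range(k, e))
    return w
```

For the small test matrix `[[4.0, 3.0, 2.0], [8.0, 8.0, 5.0], [4.0, 7.0, 9.0]]` both functions return the packed factors `[[4.0, 3.0, 2.0], [2.0, 2.0, 1.0], [1.0, 2.0, 5.0]]`. The blocked variant moves most of the arithmetic into step 3, which is the reason block algorithms are attractive on machines where a matrix product runs much faster than a sequence of rank-1 updates.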
The backward as well as the forward substitution is performed by means of spreads (unoptimized) and scans (optimized). In the optimized version care is taken not to create unnecessary temporaries. Random test systems lead to the performance shown in Fig. 1. The timings shown are elapsed times measured on an 8K CM-200 using double precision. For large $N$, the timings show a complexity of $N^{2.6}$. For the factorization the optimized version is a little faster than the unoptimized (typically a factor of two)
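The substitution phase can be sketched in the same column-oriented style: once a component of the solution is known, it is broadcast to the remaining rows and its contribution is subtracted. This is an assumed serial analogue of the spread-based solver, not the original CM Fortran code; the function names are hypothetical, and the input is taken to be the factors packed in one matrix (U on and above the diagonal, unit lower triangular L below it).

```python
def forward_substitution(lu, b):
    """Solve L y = b, where L is unit lower triangular (below the diagonal of lu)."""
    n = len(b)
    y = list(b)
    for k in range(n):
        # y[k] is final here; "spread" it down column k and subtract.
        for i in range(k + 1, n):
            y[i] -= lu[i][k] * y[k]
    return y


def back_substitution(lu, y):
    """Solve U x = y, where U is the upper triangle of lu (nonzero diagonal assumed)."""
    n = len(y)
    x = list(y)
    for k in range(n - 1, -1, -1):
        x[k] /= lu[k][k]
        # x[k] is final here; "spread" it up column k and subtract.
        for i in range(k):
            x[i] -= lu[i][k] * x[k]
    return x


# Usage with the packed factors of [[4, 3, 2], [8, 8, 5], [4, 7, 9]].
packed = [[4.0, 3.0, 2.0], [2.0, 2.0, 1.0], [1.0, 2.0, 5.0]]
rhs = [16.0, 39.0, 45.0]
solution = back_substitution(packed, forward_substitution(packed, rhs))
```

With this right-hand side the computed `solution` is `[1.0, 2.0, 3.0]`. Both routines update the vector in a copy and allocate nothing else, which mirrors the note's remark about avoiding unnecessary temporaries.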

Similar resources

Parallel Implementation of Particle Swarm Optimization Variants Using Graphics Processing Unit Platform

There are different variants of Particle Swarm Optimization (PSO) algorithm such as Adaptive Particle Swarm Optimization (APSO) and Particle Swarm Optimization with an Aging Leader and Challengers (ALC-PSO). These algorithms improve the performance of PSO in terms of finding the best solution and accelerating the convergence speed. However, these algorithms are computationally intensive. The go...

Block-Cyclic Dense Linear Algebra

Block-cyclic order elimination algorithms for LU and QR factorization and solve routines are described for distributed memory architectures with processing nodes configured as two-dimensional arrays of arbitrary shape. The cyclic order elimination together with a consecutive data allocation yields good load balance for both the factorization and solution phases for the solution of dense systems ...

Computing a block incomplete LU preconditioner as the by-product of block left-looking A-biconjugation process

In this paper, we present a block version of incomplete LU preconditioner which is computed as the by-product of block A-biconjugation process. The pivot entries of this block preconditioner are one by one or two by two blocks. The L and U factors of this block preconditioner are computed separately. The block pivot selection of this preconditioner is inherited from one of the block versions of...

Numerical Investigation on Compressible Flow Characteristics in Axial Compressors Using a Multi Block Finite Volume Scheme

An unsteady two-dimensional numerical investigation was performed on the viscous flow passing through a multi-blade cascade. A Cartesian finite-volume approach was employed and it was linked to Van-Leer's and Roe's flux splitting schemes to evaluate inviscid flux terms. To prevent the oscillatory behavior of numerical results and to increase the accuracy, Monotonic Upstream Scheme for Conservat...

Fistulipora Microparallela (Yang and Lu, 1962) from Lower Permian Bryozoans of Lut Block, Central Iran

The Fistulipora microparallela (Yang and Lu, 1962) species is described for the first time from the Sakmarian deposits of the Sarab section in Lut Block, Central Iran. This species has been reported only from the Permian (Cisuralian-Guadalupian) of the Qilianshan and Kankerin formations, and the Baliqliq Group (Upper Carboniferous to Lower Permian) of Western Xinjiang, China.

Publication date: 2007
