Performance Measurements of the 3D FFT on the Blue Gene/L Supercomputer
Abstract
This paper presents performance characteristics of a communications-intensive kernel, the complex data 3D FFT, running on the Blue Gene/L architecture. Two implementations of the volumetric FFT algorithm were characterized: one built on the MPI library using an optimized collective all-to-all operation [2], and another built on a low-level System Programming Interface (SPI) of the Blue Gene/L Advanced Diagnostics Environment (BG/L ADE) [17]. We compare the current results to those obtained using a reference MPI implementation (MPICH2 ported to BG/L with unoptimized collectives) and to a port of version 2.1.5 of the FFTW library [14]. Performance experiments on the Blue Gene/L prototype indicate that both of our implementations scale well, and the current MPI-based implementation shows a speedup of 730 on 2048 nodes for 3D FFTs of size 128×128×128. Moreover, the volumetric FFT outperforms the FFTW port by a factor of 8 for a 128×128×128 complex FFT on 2048 nodes.
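To put the quoted scaling figure in context, the short sketch below converts the reported speedup of 730 on 2048 nodes into a parallel efficiency, and computes how many grid points each node would hold. The helper name `parallel_efficiency` is hypothetical, and the even division of the 128×128×128 grid across nodes is an assumption made here for illustration; the abstract does not state the data layout.

```python
def parallel_efficiency(speedup: float, nodes: int) -> float:
    """Return speedup divided by node count (1.0 would be ideal linear scaling)."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return speedup / nodes


# Figures quoted in the abstract: speedup of 730 on 2048 nodes, 128x128x128 FFT.
efficiency = parallel_efficiency(730, 2048)

# Assumption (not stated in the abstract): the grid is divided evenly across nodes.
points_per_node = 128 ** 3 // 2048

print(f"parallel efficiency: {efficiency:.3f}")    # prints 0.356
print(f"grid points per node: {points_per_node}")  # prints 1024
```

Under that assumption each node holds only about a thousand complex values, which suggests why a transform of this size on 2048 nodes is dominated by communication rather than computation.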
Similar resources
Scalable framework for 3D FFTs on the Blue Gene/L supercomputer: Implementation and early performance measurements
This paper presents results on a communications-intensive kernel, the three-dimensional fast Fourier transform (3D FFT), running on the 2,048-node Blue Gene/L (BG/L) prototype. Two implementations of the volumetric FFT algorithm were characterized, one built on the Message Passing Interface library and another built on an active packet Application Program Interface supported by the hardware br...
Vectorization techniques for the Blue Gene/L double FPU
This paper presents vectorization techniques tailored to meet the specifics of the two-way single-instruction multiple-data (SIMD) double-precision floating-point unit (FPU), which is a core element of the node application-specific integrated circuit (ASIC) chips of the IBM 360-teraflops Blue Gene/L supercomputer. This paper focuses on the general-purpose basic-block vectorization and optimiza...
Performance of the 3D FFT on the 6D network torus QCDOC parallel supercomputer
QCDOC is a massively parallel supercomputer with tens of thousands of nodes distributed on a six-dimensional torus network. The 6D structure of the network provides the needed communication resources for many communication-intensive applications. In this paper, we present a parallel algorithm for three-dimensional Fast Fourier Transform and its implementation for a 4096-node QCDOC prototype. Tw...
Task placement of parallel multi-dimensional FFTs on a mesh communication network
For many scientific applications, the Fast Fourier Transformation (FFT) of multi-dimensional data is the kernel which limits scalability to large numbers of processors. This paper investigates an extension of a traditional parallel three-dimensional FFT (3D-FFT) implementation. The extension within a parallel 3D-FFT consists of customized MPI task mappings between the virtual processor grid of t...
FFT specific compilation on IBM Blue Gene
For many numerical codes, available compilers fail to exploit the performance potential of modern processors satisfactorily. As an alternative to hand-coding and hand-tuning of basic numerical routines, the MAP special-purpose compiler was developed and adapted specifically to the requirements of codes from the signal-processing domain. The new, IBM Blue Gene supercom...