Memory transfer optimization for a lattice Boltzmann solver on Kepler architecture nVidia GPUs
Authors
Abstract
The lattice Boltzmann method (LBM) for solving fluid flow is naturally well suited to efficient implementation on massively parallel hardware, owing to the prevalence of local operations in the algorithm. This paper presents and analyses the performance of a 3D lattice Boltzmann solver optimized for third-generation nVidia GPU hardware, also known as ‘Kepler’. We provide a review of previous optimisation strategies and analyse data read/write times for different memory types. In LBM, the time propagation step (known as streaming) involves shifting data to adjacent locations and is central to parallel performance; here we examine three approaches which make use of different hardware options. Two of these use ‘performance enhancing’ features of the GPU: shared memory and the new shuffle instruction found in Kepler-based GPUs. These are compared to a standard transfer of data which relies instead on optimised storage to increase coalesced access. It is shown that the simpler approach is the most efficient: the need for large numbers of registers per thread in LBM limits the block size, which in turn reduces the efficiency of these special features. Detailed results are obtained for a D3Q19 LBM solver, which is benchmarked on nVidia K5000M and K20C GPUs. In the latter case the use of a read-only data cache is explored, and a peak performance of over 1036 Million Lattice Updates Per Second (MLUPS) is achieved. The appearance of a periodic bottleneck in the solver performance is also reported, believed to be hardware related; spikes in iteration time occur with a frequency of around 11 Hz for both GPUs, independent of the size of the problem.
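As a rough illustration of the streaming step and the MLUPS metric mentioned in the abstract, the following is a minimal CPU-side sketch in plain Python. It is not the paper's GPU implementation: the function names `stream` and `mlups`, the nested-list storage layout, the periodic boundaries, and the pull-style update are all assumptions chosen for readability.

```python
from itertools import product

# D3Q19 velocity set: the rest vector, 6 axis-aligned and 12 face-diagonal
# directions (the 8 cube-corner directions of the full 27 are excluded).
VELOCITIES = [c for c in product((-1, 0, 1), repeat=3)
              if sum(abs(x) for x in c) <= 2]


def stream(f, nx, ny, nz):
    """Pull-style streaming with periodic wrap-around (an assumption).

    f[q][x][y][z] holds the distribution for direction q at node (x, y, z).
    Each node reads the value that its upstream neighbour held.
    """
    out = [[[[0.0] * nz for _ in range(ny)] for _ in range(nx)]
           for _ in VELOCITIES]
    for q, (cx, cy, cz) in enumerate(VELOCITIES):
        for x in range(nx):
            for y in range(ny):
                for z in range(nz):
                    out[q][x][y][z] = f[q][(x - cx) % nx][(y - cy) % ny][(z - cz) % nz]
    return out


def mlups(nx, ny, nz, iterations, seconds):
    """Million Lattice Updates Per Second for a run on an nx*ny*nz grid."""
    return nx * ny * nz * iterations / seconds / 1e6
```

On a GPU, the same shift is what the shared-memory, shuffle, and plain coalesced-access variants discussed in the abstract implement in different ways; this sketch only shows the data movement itself.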
Similar resources
Performance engineering for the Lattice Boltzmann method on GPGPUs: Architectural requirements and performance results
GPUs offer several times the floating point performance and memory bandwidth of current standard two socket CPU servers, e.g. NVIDIA C2070 vs. Intel Xeon Westmere X5650. The lattice Boltzmann method has been established as a flow solver in recent years and was one of the first flow solvers to be successfully ported and that performs well on GPUs. We demonstrate advanced optimization strategies ...
Optimizing Stencil Computations for NVIDIA Kepler GPUs
We present a series of optimization techniques for stencil computations on NVIDIA Kepler GPUs. Stencil computations with regular grids had been ported to the older generations of NVIDIA GPUs with significant performance improvements thanks to the higher memory bandwidth than conventional CPU-only systems. However, because of the architectural changes introduced with the latest generation of the...
Conjugate gradient solvers on Intel Xeon Phi and NVIDIA GPUs
Lattice Quantum Chromodynamics simulations typically spend most of the runtime in inversions of the Fermion Matrix. This part is therefore frequently optimized for various HPC architectures. Here we compare the performance of the Intel® Xeon Phi™ to current Kepler-based NVIDIA® Tesla™ GPUs running a conjugate gradient solver. By exposing more parallelism to the accelerator through inverti...
Performance Evaluation of Numeric Compute Kernels on nVIDIA GPUs
Graphics processing units provide an astonishing number of floating point operations per second and deliver memory bandwidths of one magnitude greater than common general purpose central processing units. With the introduction of the Compute Unified Device Architecture, a first step was taken by nVIDIA to ease access to the vast computational resources of graphics processing units. The aim of t...
Sailfish: A flexible multi-GPU implementation of the lattice Boltzmann method
We present Sailfish, an open source fluid simulation package implementing the lattice Boltzmann method (LBM) on modern Graphics Processing Units (GPUs) using CUDA/OpenCL. We take a novel approach to GPU code implementation and use run-time code generation techniques and a high level programming language (Python) to achieve state of the art performance, while allowing easy experimentation with d...
Journal title:
- Computer Physics Communications
Volume 185, Issue
Pages -
Publication date: 2014