Cache efficient bidiagonalization using BLAS 2.5 operators
Authors
Abstract
Similar resources
Reproducible, Accurately Rounded and Efficient BLAS
Numerical reproducibility failures rise in parallel computation because floating-point summation is non-associative. Massively parallel and optimized executions dynamically modify the floating-point operation order. Hence, numerical results may change from one run to another. We propose to ensure reproducibility by extending as far as possible the IEEE-754 correct rounding property to larger op...
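As a minimal illustration of the non-associativity this abstract refers to (not taken from the paper itself), the following C snippet shows how two evaluation orders of the same three summands round to different results:

/* Illustrative only: floating-point addition is not associative, so a
 * different evaluation order can change the rounded result. */
#include <stdio.h>

int main(void) {
    double a = 1e16, b = -1e16, c = 0.5;

    double left  = (a + b) + c;  /* = 0 + 0.5 = 0.5 */
    double right = a + (b + c);  /* b + c rounds back to -1e16, so = 0 */

    printf("(a + b) + c = %g\n", left);   /* prints 0.5 */
    printf("a + (b + c) = %g\n", right);  /* prints 0   */
    return 0;
}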
Efficient Reproducible Floating Point Summation and BLAS
We define reproducibility to mean getting bitwise identical results from multiple runs of the same program, perhaps with different hardware resources or other changes that should ideally not change the answer. Many users depend on reproducibility for debugging or correctness [1]. However, dynamic scheduling of parallel computing resources, combined with nonassociativity of floating point additi...
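One simple way to obtain run-to-run reproducibility is to fix the reduction order, sketched below; this is not the algorithm of the cited paper, and the block size BLOCK and the function name reproducible_block_sum are illustrative assumptions.

/* Hedged sketch: each fixed-size block is summed left to right, and the
 * per-block partial sums are combined in block order, so the result does not
 * depend on how many threads computed the blocks. */
#include <stdio.h>
#include <stddef.h>

#define BLOCK 256

double reproducible_block_sum(const double *x, size_t n) {
    size_t nblocks = (n + BLOCK - 1) / BLOCK;
    double total = 0.0;
    for (size_t b = 0; b < nblocks; ++b) {      /* fixed block order */
        double partial = 0.0;
        size_t end = (b + 1) * BLOCK < n ? (b + 1) * BLOCK : n;
        for (size_t i = b * BLOCK; i < end; ++i)
            partial += x[i];                    /* fixed order inside a block */
        total += partial;                       /* fixed combination order */
    }
    return total;
}

int main(void) {
    double x[1000];
    for (int i = 0; i < 1000; ++i) x[i] = 1.0 / (i + 1);
    printf("sum = %.17g\n", reproducible_block_sum(x, 1000));
    return 0;
}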
Cache-efficient numerical algorithms using graphics hardware
We present cache-efficient algorithms for scientific computations using graphics processing units (GPUs). Our approach is based on mapping the nested loops in the numerical algorithms to the texture mapping hardware and efficiently utilizing GPU caches. This mapping exploits the inherent parallelism, pipelining and high memory bandwidth on GPUs. We further improve the performance of numerical a...
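The GPU texture-mapping machinery the paper exploits is not reproduced here, but the underlying cache-blocking idea can be sketched on the CPU side; the tile size TILE, the row-major layout, and the function name matmul_tiled are illustrative assumptions.

/* Hedged sketch of loop tiling: accumulate C += A*B one TILE x TILE block at
 * a time so the working set of each block fits in cache.  The caller is
 * expected to zero C first when computing C = A*B. */
#include <stddef.h>

#define TILE 64

void matmul_tiled(size_t n, const double *A, const double *B, double *C) {
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t kk = 0; kk < n; kk += TILE)
            for (size_t jj = 0; jj < n; jj += TILE)
                /* multiply one block; the bounds handle n % TILE != 0 */
                for (size_t i = ii; i < ii + TILE && i < n; ++i)
                    for (size_t k = kk; k < kk + TILE && k < n; ++k) {
                        double a = A[i * n + k];
                        for (size_t j = jj; j < jj + TILE && j < n; ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}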
Efficient local transformation estimation using Lie operators
Conventional translation-only motion estimation algorithms cannot cope with transformations of objects such as scaling, rotations and deformations. Motion models characterizing non-translation motions are thus beneficial as they offer more accurate motion estimation and compensation. In this paper, we introduce low-complexity transformation estimation methods with four motion models based on Li...
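As a generic illustration of Lie-algebra-based motion models (not the paper's four models or its estimator), a small motion can be linearized around the identity as x' ~= (I + sum_i theta_i * G_i) x; the generators chosen below (translation, rotation, isotropic scaling in homogeneous 2D coordinates) and the function name apply_small_motion are assumptions for the sketch.

/* Hedged sketch: first-order application of Lie algebra generators. */
#include <stdio.h>

/* generators: x-translation, y-translation, rotation, isotropic scaling */
static const double G[4][3][3] = {
    {{0, 0, 1}, {0, 0, 0}, {0, 0, 0}},
    {{0, 0, 0}, {0, 0, 1}, {0, 0, 0}},
    {{0, -1, 0}, {1, 0, 0}, {0, 0, 0}},
    {{1, 0, 0}, {0, 1, 0}, {0, 0, 0}},
};

/* apply the linearized motion model with parameters theta[4] to point (x, y) */
void apply_small_motion(const double theta[4], double x, double y,
                        double *xo, double *yo) {
    double p[3] = {x, y, 1.0}, q[3] = {x, y, 1.0};
    for (int g = 0; g < 4; ++g)
        for (int r = 0; r < 3; ++r)
            for (int c = 0; c < 3; ++c)
                q[r] += theta[g] * G[g][r][c] * p[c];
    *xo = q[0];
    *yo = q[1];
}

int main(void) {
    double theta[4] = {0.5, -0.25, 0.01, 0.02};  /* small illustrative motion */
    double xo, yo;
    apply_small_motion(theta, 10.0, 5.0, &xo, &yo);
    printf("(10, 5) -> (%.4f, %.4f)\n", xo, yo);
    return 0;
}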
Cache-Efficient Matrix Transposition
We investigate the memory system performance of several algorithms for transposing an N N matrix in-place, where N is large. Specifically, we investigate the relative contributions of the data cache, the translation lookaside buffer, register tiling, and the array layout function to the overall running time of the algorithms. We use various memory models to capture and analyze the effect of var...
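One generic blocked variant of in-place transposition (an illustrative sketch, not necessarily one of the algorithms the paper benchmarks) is shown below; the tile size TILE, the row-major layout, and the function name transpose_inplace_blocked are assumptions.

/* Hedged sketch: in-place transpose of a square n x n row-major matrix,
 * processed in TILE x TILE blocks so each pair of swapped blocks stays
 * cache resident. */
#include <stddef.h>

#define TILE 32

void transpose_inplace_blocked(size_t n, double *a) {
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t jj = ii; jj < n; jj += TILE)
            for (size_t i = ii; i < ii + TILE && i < n; ++i) {
                /* stay above the diagonal so each pair is swapped exactly once */
                size_t jstart = (ii == jj) ? i + 1 : jj;
                for (size_t j = jstart; j < jj + TILE && j < n; ++j) {
                    double t = a[i * n + j];
                    a[i * n + j] = a[j * n + i];
                    a[j * n + i] = t;
                }
            }
}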
Journal
Journal title: ACM Transactions on Mathematical Software
Year: 2008
ISSN: 0098-3500,1557-7295
DOI: 10.1145/1356052.1356055