Towards resilient parallel linear Krylov solvers: recover-restart strategies

Authors

  • Emmanuel Agullo
  • Luc Giraud
  • Abdou Guermouche
  • Jean Roman
  • Mawussi Zounon
Abstract

The advent of extreme-scale machines will require the use of parallel resources at an unprecedented scale, probably leading to a high rate of hardware faults. High Performance Computing (HPC) applications that aim to exploit all these resources will thus need to be resilient, i.e., able to compute a correct solution in the presence of faults. In this work, we investigate possible remedies in the framework of the solution of large sparse linear systems, which is the innermost numerical kernel in many scientific and engineering applications and also one of the most time-consuming parts. More precisely, we present recovery-followed-by-restart strategies in the framework of Krylov subspace solvers, where lost entries of the iterate are interpolated to define a new initial guess before restarting the Krylov method. In particular, we consider two interpolation policies that preserve key numerical properties of well-known solvers, namely the monotonic decrease of the A-norm of the error of the conjugate gradient (CG) method or the decrease of the residual norm of GMRES. We assess the impact of the recovery method, the fault rate and the number of processors on the robustness of the resulting linear solvers. We consider experiments with CG, GMRES and Bi-CGStab.

Key-words: Resilience, linear Krylov solvers, linear and least-squares interpolation, monotonic convergence.

∗ Inria Bordeaux-Sud Ouest, France
† Université de Bordeaux 1, France

hal-00843992, version 1, 12 Jul 2013

French title: Vers des solveurs linéaires de Krylov parallèles résilients

Résumé (translated from French): The exaflop machines announced for the end of the decade will very probably be subject to very high failure rates. In this report we present interpolation techniques for recovering from hardware errors in the context of Krylov-type linear solvers. For each of the proposed techniques we show that it guarantees the monotonic decrease of the residual norm or of the A-norm of the error for methods such as the conjugate gradient or GMRES. Through numerous numerical experiments we qualitatively study the behavior of the different variants as the number of computing cores and the fault rate vary.

Keywords (translated from French): Resilience, linear Krylov solvers, linear or least-squares interpolation, monotonic convergence.
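The recover-restart idea in the abstract (interpolate the lost entries of the iterate, then restart the Krylov method from the repaired vector) can be illustrated with a small serial sketch. It is not the authors' implementation. The abstract gives no formula, so the sketch assumes one natural reading of the linear interpolation policy: the lost block x_I is rebuilt by solving the local system A_II x_I = b_I - A_IK x_K, where K indexes the surviving entries. The fault is simulated at a fixed iteration, and all function names (`linear_interpolation`, `cg_recover_restart`, and so on) are hypothetical.

```python
def matvec(A, x):
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def dot(u, v):
    return sum(a * c for a, c in zip(u, v))

def solve_dense(M, rhs):
    """Gaussian elimination with partial pivoting for a small dense block."""
    n = len(rhs)
    aug = [list(row) + [r] for row, r in zip(M, rhs)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(aug[i][k]))
        aug[k], aug[p] = aug[p], aug[k]
        for i in range(k + 1, n):
            f = aug[i][k] / aug[k][k]
            for j in range(k, n + 1):
                aug[i][j] -= f * aug[k][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = aug[i][n] - sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / aug[i][i]
    return x

def linear_interpolation(A, b, x, lost):
    """Assumed recovery rule: solve A_II x_I = b_I - A_IK x_K for the lost block."""
    kept = [j for j in range(len(b)) if j not in lost]
    A_II = [[A[i][j] for j in lost] for i in lost]
    rhs = [b[i] - sum(A[i][j] * x[j] for j in kept) for i in lost]
    x_new = list(x)
    for i, v in zip(lost, solve_dense(A_II, rhs)):
        x_new[i] = v
    return x_new

def cg_recover_restart(A, b, fault_at=None, lost=(), tol=1e-10, max_iter=200):
    """CG from x = 0; at iteration `fault_at` the entries in `lost` are
    treated as lost, interpolated, and CG restarts from the repaired iterate."""
    x = [0.0] * len(b)
    r = list(b)  # residual b - A x for x = 0
    p = list(r)
    rs = dot(r, r)
    for k in range(max_iter):
        if k == fault_at:
            x = linear_interpolation(A, b, x, list(lost))
            # Restart: the old search direction is discarded.
            r = [bi - yi for bi, yi in zip(b, matvec(A, x))]
            p = list(r)
            rs = dot(r, r)
        if rs ** 0.5 <= tol:
            return x, k
        Ap = matvec(A, p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x, max_iter

# Toy problem: 1D Laplacian (symmetric positive definite), exact solution all ones.
n = 8
A = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0 for j in range(n)]
     for i in range(n)]
b = matvec(A, [1.0] * n)
x, iters = cg_recover_restart(A, b, fault_at=2, lost=(2, 3, 4))
```

For a symmetric positive definite matrix, this local solve minimizes the A-norm of the error over the lost entries while the surviving entries stay fixed, so the repaired iterate is never worse in that norm than the iterate held just before the fault. That matches the monotonic decrease property the abstract highlights for CG. A least-squares variant aimed at the residual norm of GMRES would replace the local solve with a least-squares problem and is not shown here.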

Related resources

Non-stationary iterative solvers on a PC cluster

In this study, we introduce cost-effective strategies and algorithms for parallelizing Krylov-subspace-based non-stationary iterative solvers such as Bi-CGM and Bi-CGSTAB for distributed computing on a cluster of PCs using ANULIB message passing libraries. We investigate the effectiveness of the parallel solvers on the linear systems arising in the numerical solution of some 2D and 3D nonline...

Development of Krylov and AMG Linear Solvers for Large-Scale Sparse Matrices on GPUs

This research introduces our work on developing Krylov subspace and AMG solvers on NVIDIA GPUs. As SpMV is a crucial part of these iterative methods, SpMV algorithms for a single GPU and for multiple GPUs are implemented. A HEC matrix format and a communication mechanism are established. In addition, a set of specific algorithms for solving preconditioned systems in parallel environments are designed, i...

PSPIKE: A Parallel Hybrid Sparse Linear System Solver

The availability of large-scale computing platforms comprised of tens of thousands of multicore processors motivates the need for the next generation of highly scalable sparse linear system solvers. These solvers must optimize parallel performance, processor (serial) performance, as well as memory requirements, while being robust across broad classes of applications and systems. In this paper, ...

A Brief Introduction to Krylov Space Methods for Solving Linear Systems

With respect to the "influence on the development and practice of science and engineering in the 20th century", Krylov space methods are considered as one of the ten most important classes of numerical methods [1]. Large sparse linear systems of equations or large sparse matrix eigenvalue problems appear in most applications of scientific computing. Sparsity means that most elements of the m...

Towards Ultra Rapid Restarts

We observe a trend regarding restart strategies used in SAT solvers. A few years ago, most state-of-the-art solvers restarted on average after a few thousand backtracks. Currently, restarting after a dozen backtracks results in much better performance. The main reason for this trend is that heuristics and data structures have become more restart-friendly. We expect further continuation of t...

Publication date: 2013