Addressing failures in exascale computing

Authors

  • Marc Snir
  • Robert W. Wisniewski
  • Jacob A. Abraham
  • Sarita V. Adve
  • Saurabh Bagchi
  • Pavan Balaji
  • Jim Belak
  • Pradip Bose
  • Franck Cappello
  • Bill Carlson
  • Andrew A. Chien
  • Paul Coteus
  • Nathan DeBardeleben
  • Pedro C. Diniz
  • Christian Engelmann
  • Mattan Erez
  • Saverio Fazzari
  • Al Geist
  • Rinku Gupta
  • Fred Johnson
  • Sriram Krishnamoorthy
  • Sven Leyffer
  • Dean Liberty
  • Subhasish Mitra
  • Todd S. Munson
  • Rob Schreiber
  • Jon Stearley
  • Eric Van Hensbergen
Abstract

We present here a report produced by a workshop on ‘Addressing failures in exascale computing’ held in Park City, Utah, 4–11 August 2012. The charter of this workshop was to establish a common taxonomy about resilience across all the levels in a computing system, discuss existing knowledge on resilience across the various hardware and software layers of an exascale system, and build on those results, examining potential solutions from both a hardware and software perspective and focusing on a combined approach. The workshop brought together participants with expertise in applications, system software, and hardware; they came from industry, government, and academia, and their interests ranged from theory to implementation. The combination allowed broad and comprehensive discussions and led to this document, which summarizes and builds on those discussions.

Similar resources

Scalable and Highly Available Fault Resilient Programming Middleware for Exascale Computing

A hierarchical master-worker model is believed to be a promising programming paradigm that can achieve weak scaling on exascale-level high performance computers [1]. However, “fault resiliency” is one of the most important issues for exascale computing because the Mean Time Between Failure (MTBF) of such computers will be short [2]. We propose a fault resilient programming middleware called Fal...

Towards an Exascale Enabled Sparse Solver Repository

As we approach the Exascale computing era, disruptive changes in the software landscape are required to tackle the challenges posed by manycore CPUs and accelerators. We discuss the development of a new ‘Exascale enabled’ sparse solver repository (the ESSR) that addresses these challenges—from fundamental design considerations and development processes to actual implementations of some prototyp...

Algorithms and Scheduling Techniques for Exascale Systems

Exascale systems to be deployed in the near future will come with deep hierarchical parallelism, will exhibit various levels of heterogeneity, will be prone to frequent component failures, and will face tight power consumption constraints. The notion of application performance in these systems becomes multi-criteria, with fault-tolerance and power consumption metrics to be considered in additio...

Paving the Road to Exascale with Many-Task Computing

Exascale systems will bring significant challenges. This work attempts to address them through the Many-Task Computing (MTC) paradigm, by delivering data-aware job scheduling systems and fully asynchronous distributed architectures. MTC applications are structured as directed acyclic graphs (DAGs) of tasks, with dependencies forming the edges. The asynchronous nature of MTC makes it more resilient than tradition...

Toward Exascale Resilience

Over the past few years resilience has become a major issue for HPC systems, in particular for large Petascale systems and future Exascale ones. These systems will typically gather from half a million to several million CPU cores running up to a billion threads. From the current knowledge and observations of existing large systems, it is anticipated that Exascale system...

Journal:
  • IJHPCA

Volume 28

Publication date: 2014