Performance-Portable High-Level Accelerator Programming
Author
Abstract
The OpenMP API provides a portable model for efficient, high-level thread-parallel programming across platforms, vendors, and operating systems. We are developing a model with the same advantages to address compute accelerators. In this talk, we explore today's accelerator landscape, along with the perils of current programming methods. We demonstrate why OpenCL, while impressive and important, doesn't already solve the programming problem. We close with a summary of the PGI Accelerator Model and the closely related model being developed by the OpenMP Accelerator subcommittee.

Presented at SMC-IT 2011, August 3, 2011, Palo Alto, CA, USA.

1. WHY ACCELERATORS?

An accelerator is additional hardware added to a computer system in order to do some task faster than the computer could do without it. At one point, hardware floating-point units were designed as accelerators for single-chip microprocessors. We use accelerators to stay within a power×cost×performance envelope: you can often get two out of those three, but getting all three often requires a hardware accelerator.

A more iconic accelerator would be the attached processors of the 1970s and 1980s, such as the array processors from Floating Point Systems. These were designed to attach to a minicomputer, like a Digital VAX, and to deliver the performance of a mainframe at the cost of a mini. More recently, ClearSpeed designed a single-chip parallel processor as a compute accelerator for the high-performance and embedded markets. IBM used a variant of the Cell processor as an accelerator, most notably for the first petaflop system, Roadrunner. Convey Computer has designed a computer system using Intel microprocessors and a tightly integrated reconfigurable accelerator; an interesting feature of the Convey is that the accelerator is implemented with FPGAs, though most of the FPGA programming is hidden from all but the most aggressive programmers.

The accelerators most on people's minds today are high-performance GPUs from NVIDIA or AMD. GPUs themselves are really graphics accelerators, relative to doing graphics on the central processor. GPUs as compute accelerators have the distinct advantage that the cost of developing the chip itself is amortized over its whole market, including all the graphics customers; since the cost of designing a large computer chip can be on the order of a billion dollars, the design must have a large market to make it affordable. Some upcoming processors from Intel and AMD will incorporate GPU capability on the chip itself; we will discuss how that affects the compute accelerator world. In the next year or so, Intel has promised to deliver a manycore chip, Knights Corner, to address the same highly parallel, scalable applications that are now being ported to GPUs. The big potential advantage of this chip is that it shares most of its instruction set with the Intel x86 processors, so initial code porting may be quite a bit easier.

A key point with accelerators is that we choose them to get performance. If performance were not a goal, we could find another solution that didn't involve the complexity of designing and programming the accelerator.
Since performance is the goal, we have to pay attention to performance when we design our applications and write our programs, and we are going to have to be willing to spend the time to tune these programs.
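As a concrete illustration of the directive-based approach summarized in the abstract, the sketch below annotates a simple loop first with PGI Accelerator Model directives and then with the target construct that the OpenMP accelerator work later standardized in OpenMP 4.0. The loop, names, and clauses are illustrative assumptions rather than code from the talk; the syntax under discussion in the OpenMP subcommittee in 2011 was still in flux.

```c
/* Illustrative only: a saxpy-style loop offloaded with directives.
 * The sequential code is unchanged; the pragmas tell the compiler
 * to generate accelerator code and manage data movement. */

void saxpy_pgi(int n, float a, const float *restrict x, float *restrict y)
{
    /* PGI Accelerator Model (C syntax): mark a compute region. */
    #pragma acc region
    {
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }
}

void saxpy_omp(int n, float a, const float *restrict x, float *restrict y)
{
    /* OpenMP accelerator directives as later standardized in OpenMP 4.0;
     * treat this as a representative form, not the exact 2011 proposal. */
    #pragma omp target teams distribute parallel for \
                map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

In both cases the annotated loop remains legal sequential C, which is the portability argument: the same source can be retargeted to different accelerators, or fall back to the host, without maintaining a separate low-level kernel as OpenCL or CUDA would require. (The PGI Accelerator directives later evolved into OpenACC.)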
Similar resources
Experiences with High-Level Programming Directives for Porting Applications to GPUs
HPC systems now exploit GPUs within their compute nodes to accelerate program performance. As a result, high-end application development has become extremely complex at the node level. In addition to restructuring the node code to exploit the cores and specialized devices, the programmer may need to choose a programming model such as OpenMP or CPU threads in conjunction with an accelerator prog...
Design and Implementation of a Portable Web Server Accelerator
In this paper, we describe the design, implementation and performance evaluation of a portable web server accelerator, called Tornader. Tornader resides in front of a web server and improves performance by efficiently delivering cached response. Tornader boosts the throughput of the most widely used Apache web server up to 150% under heavy load. Furthermore, Tornader is easily portable since it...
A Portable Accelerator Control Toolkit
In recent years, the expense of creating good control software has led to a number of collaborative efforts among laboratories to share this cost. The EPICS collaboration is a particularly successful example of this trend. More recently another collaborative effort has addressed the need for sophisticated high level software, including model driven accelerator controls. This work builds upon th...
Performance-Portable Many-Core Plasma Simulations: Porting PIConGPU to OpenPower and Beyond
With the appearance of the heterogeneous platform OpenPower, many-core accelerator devices have been coupled with Power host processors for the first time. Towards utilizing their full potential, it is worth investigating performance portable algorithms that allow to choose the best-fitting hardware for each domain-specific compute task. Suiting even the high level of parallelism on modern GPGP...
Early Experiences Writing Performance Portable OpenMP 4 Codes
In this paper, we evaluate the recently available directives in OpenMP 4 to parallelize a computational kernel using both the traditional shared memory approach and the newer accelerator targeting capabilities. In addition, we explore various transformations that attempt to increase application performance portability, and examine the expressiveness and performance implications of using these a...
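As a rough sketch of the two directive styles that paper contrasts, the hypothetical fragment below writes an array update once for host threads with the traditional worksharing construct and once for an attached device, wrapping consecutive offloaded loops in a target data region so the array stays resident on the accelerator; the function and variable names are invented for illustration, not drawn from the paper.

```c
/* Illustrative sketch (not from the cited paper): the same update written
 * for host threads and for an accelerator with OpenMP 4 directives. */

/* Traditional shared-memory version: worksharing across host threads. */
void scale_host(int n, double s, double *restrict v)
{
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        v[i] *= s;
}

/* Accelerator-targeting version: the target data region keeps the array on
 * the device across both offloaded loops, avoiding repeated transfers. */
void scale_and_shift_device(int n, double s, double d, double *restrict v)
{
    #pragma omp target data map(tofrom: v[0:n])
    {
        #pragma omp target teams distribute parallel for
        for (int i = 0; i < n; ++i)
            v[i] *= s;

        #pragma omp target teams distribute parallel for
        for (int i = 0; i < n; ++i)
            v[i] += d;
    }
}
```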