Extending OmpSs to support CUDA and OpenCL in C, C++ and Fortran Applications
نویسندگان
چکیده
CUDA and OpenCL are the most widely used programming models to exploit hardware accelerators. Both programming models provide a C-based programming language to write accelerator kernels and a host API used to glue the host and kernel parts. Although this model is a clear improvement over a low-level and ad-hoc programming model for each hardware accelerator, it is still too complex and cumbersome for general adoption. For large and complex applications using several accelerators, the main problem becomes the explicit coordination and management of resources required between the host and the hardware accelerators that introduce a new family of issues (scheduling, data transfers, synchronization, ...) that the programmer must take into account. In this paper, we propose a simple extension to OmpSs –a data-flow programming model– that dramatically simplifies the integration of accelerated code, in the form of CUDA or OpenCL kernels, into any C, C++ or Fortran application. Our proposal fully replaces the CUDA and OpenCL host APIs with a few pragmas, so we can leverage any kernel written in CUDA C or OpenCL C without any performance impact. Our compiler generates all the boilerplat code while our runtime system takes care of kernels scheduling, data transfers between host and accelerators and synchronizations between host and kernels parts. To evaluate our approach, we have ported several native CUDA and OpenCL applications to OmpSs by replacing all the CUDA or OpenCL API calls by a few number of pragmas. The OmpSs versions of these applications have competitive performance and scalability but with a significantly lower complexity than the original ones.
منابع مشابه
IPMACC: Open Source OpenACC to CUDA/OpenCL Translator
In this paper we introduce IPMACC, a framework for translating OpenACC applications to CUDA or OpenCL. IPMACC is composed of set of translators translating OpenACC for C applications to CUDA or OpenCL. The framework uses the system compiler (e.g. nvcc) for generating final accelerator’s binary. The framework can be used for extending the OpenACC API, executing OpenACC applications, or obtaining...
متن کاملExtending the Capabilities of the Cray Programming Environment with Clang-LLVM Framework Integration
Recent developments in programming for multicore processors and accelerators using C++11, OpenCL and Domain Specific Languages (DSL) have prompted us to look into tools that offer compilers and both static and runtime analysis toolchains to complement the Cray Programming Environment capabilities. In this paper we report our preliminary experiences from using the CLang-LLVM framework on a hybri...
متن کاملScalability and Parallel Execution of OmpSs-OpenCL Tasks on Heterogeneous CPU-GPU Environment
With heterogeneous computing becoming mainstream, researchers and software vendors have been trying to exploit the best of the underlying architectures like GPUs or CPUs to enhance performance. Parallel programming models play a crucial role in achieving this enhancement. One such model is OpenCL, a parallel computing API for cross platform computations targeting heterogeneous architectures. Ho...
متن کاملAn investigation of the performance portability of OpenCL
This paper reports on the development of an MPI/OpenCL implementation of LU, an application-level benchmark from the NAS Parallel Benchmark Suite. An account of the design decisions addressed during the development of this code is presented, demonstrating the importance of memory arrangement and work-item/work-group distribution strategies when applications are deployed on different device type...
متن کاملEvaluating Performance and Portability of OpenCL Programs
Recently, OpenCL, a new open programming standard for GPGPU programming, has become available in addition to CUDA. OpenCL can support various compute devices due to its higher abstraction programming framework. Since there is a semantic gap between OpenCL and compute devices, the OpenCL C compiler plays important roles to exploit the potential of compute devices and therefore its capability sho...
متن کامل