Evaluation of DGEMM Implementation on Intel Xeon Phi Coprocessor
Abstract
This paper presents a detailed study of implementing double-precision matrix-matrix multiplication (DGEMM) on the Intel Xeon Phi coprocessor. We discuss a DGEMM implementation that runs "natively" on the coprocessor, minimizing communication with the host CPU. We run DGEMM across a range of matrix sizes natively, as well as using the Intel Math Kernel Library (MKL). Our optimizations are designed to maximize reuse of the on-die cache, which significantly reduces data transfer from GDDR memory. Finally, we analyze the improvement of a classic matrix multiplication implementation based on the Cauchy algorithm compared to the latest results achieved using the Intel MKL DGEMM subroutine.
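The abstract does not reproduce the authors' kernel, so the following is only a minimal sketch of the general idea behind cache reuse in DGEMM: a classic triple loop next to a tiled (blocked) variant that works on small sub-blocks so that operands can stay in cache while they are reused. The function names `naive_dgemm` and `blocked_dgemm` and the `tile` parameter are hypothetical, and pure Python is used only to show the loop structure, not to indicate performance on the coprocessor.

```python
def naive_dgemm(alpha, a, b, beta, c):
    """Classic triple loop: returns alpha * A * B + beta * C as new rows."""
    m, k, n = len(a), len(b), len(b[0])
    out = [[beta * c[i][j] for j in range(n)] for i in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]
            out[i][j] += alpha * acc
    return out


def blocked_dgemm(alpha, a, b, beta, c, tile=2):
    """Tiled variant: same result, computed one tile-sized block at a time.

    The tile size is an assumption for illustration; a real kernel would
    choose it to fit the target cache level.
    """
    m, k, n = len(a), len(b), len(b[0])
    out = [[beta * c[i][j] for j in range(n)] for i in range(m)]
    for i0 in range(0, m, tile):            # block of rows of A and C
        for p0 in range(0, k, tile):        # block of the shared dimension
            for j0 in range(0, n, tile):    # block of columns of B and C
                for i in range(i0, min(i0 + tile, m)):
                    for p in range(p0, min(p0 + tile, k)):
                        a_ip = alpha * a[i][p]  # reused across the inner loop
                        for j in range(j0, min(j0 + tile, n)):
                            out[i][j] += a_ip * b[p][j]
    return out
```

Both functions take matrices as lists of rows and return a new matrix; for example, `blocked_dgemm(1.0, A, B, 0.0, C, tile=64)` computes the plain product of `A` and `B`. The two variants perform the same arithmetic and differ only in the order in which elements are visited, which is what determines how often data must be fetched from main memory.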
Related articles
Implementing Strassen's Algorithm with BLIS
We dispel with “street wisdom” regarding the practical implementation of Strassen’s algorithm for matrix-matrix multiplication (DGEMM). Conventional wisdom: it is only practical for very large matrices. Our implementation is practical for small matrices. Conventional wisdom: the matrices being multiplied should be relatively square. Our implementation is practical for rank-k updates, where k is...
Performance Evaluation of Breadth-First Search on Intel Xeon Phi
Breadth-First Search (BFS) is one of the most important kernels in graph computing. It is the main kernel of the Graph500 rating that evaluates performance of large supercomputers and multiprocessor nodes in terms of traversed edges per second (TEPS). In this paper we present the results of BFS performance evaluation on a recently released high-performance Intel Xeon Phi coprocessor. We examine...
SIMD Implementation of a Multiplicative Schwarz Smoother for a Multigrid Poisson Solver on an Intel Xeon Phi Coprocessor
In this paper, we discuss an efficient implementation of the three-dimensional multigrid Poisson solver on a many-core coprocessor, Intel Xeon Phi. We have used the modified block red-black (mBRB) Gauss-Seidel (GS) smoother to achieve sufficient degree of parallelism and high cache hit ratio. We have vectorized (SIMDized) the GS steps in the smoother by introducing a partially SIMDizing techniq...
PhiTM for DNA Sequence Analysis
Genetic information is increasing exponentially, doubling every 18 months. Analyzing this information within a reasonable amount of time requires parallel computing resources. While considerable research has addressed DNA analysis using GPUs, so far not much attention has been paid to the Intel Xeon Phi coprocessor. In this paper we present an algorithm for large-scale DNA analysis that exploit...
Porting FEASTFLOW to the Intel Xeon Phi: Lessons Learned
In this paper we report our experiences in porting the FEASTFLOW software infrastructure to the Intel Xeon Phi coprocessor. Our efforts involved both the evaluation of programming models including OpenCL, POSIX threads and OpenMP and typical optimization strategies like parallelization and vectorization. Since the straightforward porting process of the already existing OpenCL version of the cod...
Journal title:
- JCP
Volume 9
Publication date: 2014